Preparing an expanded transcriptome reference for quantification with alevin-fry#
The USA mode in alevin-fry requires an expanded index reference, in which sequences represent spliced and unspliced transcripts. Pyroe provides CLI programs and python functions to build the pre-defined expanded references, the spliced + intronic (splici) reference, which includes the spliced transcripts plus the (merged and collapsed) intronic sequences of each gene and the spliced + unspliced (spliceu) reference, which consists of the spliced transcripts plus the unspliced transcript (genes’ entire genomic interval) of each gene. The make_splici_txome()
and make_spliceu_txome()
python functions are designed to make the splici and spliceu reference by taking a genome FASTA file and a gene annotation GTF file as the input. Furthermore, the
Preparing a spliced+intronic transcriptome reference#
The splici index reference of a given species consists of the transcriptome of the species, i.e., the spliced transcripts and the intronic sequences of the species. Within a gene, if the flanked intronic sequences overlap with each other, the overlapped intronic sequences will be collapsed as a single intronic sequence to make sure each base will appear only once in the intronic sequences of each gene.
To prepare a splici reference using pyroe, in addition to a genome FASTA file and a gene annotation GTF file, you also need to specify the read length argument read_length
of the experiment you are working on. Besides, you can provide a flank trimming length flank_trim_length
. Then, a final flank length will be computed as the difference between the read_length
and flank trimming length and will be attached to the ends of each intron to absorb the intron-exon junctional reads. Details about splici can be found in the supplementary section S2 of the alevin-fry paper.
Following is an example of calling pyroe
in the command line to make the splici index reference. The final flank length is calculated as the difference between the read length and the flank_trim_length, i.e., 5-2=3. This program allows you to add extra spliced and unspliced sequences to the splici index reference, which will be useful when some unannotated sequences, such as mitochondrial genes, are important for your experiment. Note : To make pyroe work more quickly, it is recommended to have the latest version of bedtools (Aaron R. Quinlan and Ira M. Hall, 2010) installed.
pyroe make-spliced+intronic \
extdata/small_example_genome.fa \
extdata/small_example.gtf 5 splici_txome \
--flank-trim-length 2 \
--filename-prefix splici \
--dedup-seqs
The pyroe make-spliced+intronic program writes three files to your specified output directory splici_txome. They are
A FASTA file that stores the extracted splici sequences.
A three-column transcript-name-to-gene-name file that stores the name of each reference sequence in the splici index reference, their corresponding gene name, and the splicing status (S for spliced and U for unspliced) of those transcripts.
A two-column TSV file that maps gene ids (used as the keys in eventual alevin-fry output) to gene names. This can later be used with the
pyroe convert
command line program to convert gene ids to gene names in the count matrix.
Full usage#
usage: pyroe make-spliced+intronic [-h] [--filename-prefix FILENAME_PREFIX]
[--flank-trim-length FLANK_TRIM_LENGTH]
[--extra-spliced EXTRA_SPLICED]
[--extra-unspliced EXTRA_UNSPLICED]
[--bt-path BT_PATH] [--no-bt]
[--dedup-seqs] [--no-flanking-merge]
[--write-clean-gtf]
genome-path gtf-path read-length output-dir
positional arguments:
genome-path The path to a genome fasta file.
gtf-path The path to a gtf file.
read-length The read length of the single-cell experiment being
processed (determines flank size).
output-dir The output directory where splici reference files will
be written.
options:
-h, --help show this help message and exit
--filename-prefix FILENAME_PREFIX
The file name prefix of the generated output files.
--flank-trim-length FLANK_TRIM_LENGTH
Determines the amount subtracted from the read length
to get the flank length.
--extra-spliced EXTRA_SPLICED
The path to an extra spliced sequence fasta file.
--extra-unspliced EXTRA_UNSPLICED
The path to an extra unspliced sequence fasta file.
--bt-path BT_PATH The path to beadtools v2.30.0 or greater.
--no-bt A flag indicates whether bedtools will be used for
generating splici reference files.
--dedup-seqs A flag indicates whether identical sequences will be
deduplicated.
--no-flanking-merge A flag indicates whether flank lengths will be
considered when merging introns.
--write-clean-gtf A flag indicates whether a clean gtf will be written
if encountered invalid records.
The pyroe make-spliced+intronic
command line program calls the make_splici_txome()
python function under the hood. One can also directly call this function from a python instance to build a splici index. Here we provide helping messages of the make_splici_txome()
python function.
Construct the splici (spliced + introns) transcriptome for alevin-fry.
Required Parameters
genome_path : str
The path to a genome fasta file.
gtf_path : str
The path to a gtf file.
read_length : int
The read length of the single-cell experiment being processed.
output_dir : str
The output directory, where the splici reference files will be written.
Optional Parameters
flank_trim_length : int (default: 5)
The flank trimming length. The final flank length is obtained by subtracting the flank_trim_length from the read_length.
filename_prefix : str (default: splici)
The file name prefix of the generated output files. The derived flank length will be automatically appended to the provided prefix.
extra_spliced : str
A path to a fasta file. The records in this fasta file will be regarded as spliced transcripts.
extra_unspliced : str
The path to a fasta file. The records in this fasta file will be regarded as introns.
dedup_seqs : bool (default: False)
If True, the repeated sequences in the splici reference will be deduplicated.
no_bt : bool (default: False)
If true, biopython, instead of bedtools, will be used for generating splici reference files.
bt_path : str
The path to bedtools v2.30.0 or greater if it is not in the environment PATH.
no_flanking_merge : bool (default: False)
If true, overlapping introns caused by the added flanking length will not be merged.
Returns
Nothing will be returned. The splici reference files will be written to disk.
Preparing a spliced+unspliced transcriptome reference#
Recently, He et al., 2023 introduced the spliced + unspliced (spliceu) index in alevin-fry. This requires the spliced + unspliced transcriptome reference, where the unspliced transcripts of each gene represent the entire genomic interval of that gene. Details about the spliceu can be found in the preprint. To make the spliceu reference using pyroe, one can call the make_spliceu_txome()
python function or pyroe make-spliced+unspliced
or its alias pyroe make-spliceu
from the command line. The following example shows the shell command of building a spliceu reference from a given reference set in the directory spliceu_txome
.
pyroe make-spliced+unspliced \
extdata/small_example_genome.fa \
extdata/small_example.gtf \
spliceu_txome \
--filename-prefix spliceu
Full usage#
usage: pyroe make-spliced+unspliced [-h] [--filename-prefix FILENAME_PREFIX]
[--extra-spliced EXTRA_SPLICED] [--extra-unspliced EXTRA_UNSPLICED]
[--bt-path BT_PATH] [--no-bt] [--dedup-seqs]
genome-path gtf-path output-dir
positional arguments:
genome-path The path to a genome fasta file.
gtf-path The path to a gtf file.
output-dir The output directory where Spliceu reference files will be written.
options:
-h, --help show this help message and exit
--filename-prefix FILENAME_PREFIX
The file name prefix of the generated output files.
--extra-spliced EXTRA_SPLICED
The path to an extra spliced sequence fasta file.
--extra-unspliced EXTRA_UNSPLICED
The path to an extra unspliced sequence fasta file.
--bt-path BT_PATH The path to bedtools v2.30.0 or greater.
--no-bt A flag indicates whether bedtools will be used for generating Spliceu reference
files.
--dedup-seqs A flag indicates whether identical sequences will be deduplicated.
The pyroe make-spliced+unspliced
command line program calls the make_spliceu_txome()
python function under the hood. One can also directly call this function from a python instance to build a spliceu index. Here we provide helping messages of the make_spliceu_txome()
python function.
Construct the spliceu (spliced + unspliced) transcriptome for alevin-fry.
Required Parameters
genome_path : str
The path to a genome fasta file.
gtf_path : str
The path to a gtf file.
output_dir : str
The output directory, where the spliceu reference files will be written.
Optional Parameters
filename_prefix : str (default: spliceu)
The file name prefix of the generated output files. The derived flank length will be automatically appended to the provided prefix.
extra_spliced : str
A path to a fasta file. The records in this fasta file will be regarded as spliced transcripts.
extra_unspliced : str
The path to a fasta file. The records in this fasta file will be regarded as introns.
dedup_seqs : bool (default: False)
If True, the repeated sequences in the spliceu reference will be deduplicated.
no_bt : bool (default: False)
If true, biopython, instead of bedtools, will be used for generating spliceu reference files.
bt_path : str
The path to bedtools v2.30.0 or greater if it is not in the environment PATH.
Returns
Nothing will be returned. The spliceu reference files will be written to disk.
Notes
The input GTF file will be processed before extracting unspliced sequences. If pyroe finds invalid records, a clean_gtf.gtf file will be generated in the specified output directory. **Note** : The features extracted in the spliced + unspliced transcriptome will not necessarily be those present in the clean_gtf.gtf file — as this command will prefer the input in the user-provided file wherever possible. More specifically:
If the required metadata fields contain missing values, pyroe will impute them if possible, or return an error if not.
**Pyroe will always extract unspliced sequences according to the boundaries defined in the transcript/gene feature records unless there is no transcript/gene feature record in the GTF file.** In this case, pyroe imputes all transcripts/genes boundaries as the bounds of the corresponding exons to extract unspliced sequences.
If the transcript/gene feature records do not match their exon feature records, pyroe will still use transcript/gene feature records, but correct those transcript/gene feature records in the celan_grf.gtf according to exon feature records.
If using bedtools, a temp.bed and a temp.fa will be created and then deleted. These two files encode the introns of each gene and the exons of each transcript of each gene.
Notes on the input gene annotation GTF files for building an expanded reference#
Pyroe builds expanded transcriptome references, the spliced + intronic (splici) and the spliced + unspliced (spliceu) transcriptome reference, based on a genome build FASTA file and a gene annotation GTF file.
The input GTF file will be processed before extracting unspliced sequences. If pyroe finds invalid records, a clean_gtf.gtf
file will be generated in the specified output directory. Note : The features extracted in the spliced + unspliced transcriptome will not necessarily be those present in the clean_gtf.gtf
file — as this command will prefer the input in the user-provided file wherever possible. One can rerun pyroe using the clean_gtf.gtf
file if needed. More specifically:
The non-gene level records, those whose
feature
field value is not “gene, ” must have a validtranscript_id
. If this is not satisfied, pyroe returns an error and writes only the records with a validtranscript_id
to theclean_gtf.gtf
file. One can rerun pyroe using the clean_gtf.gtf file to ignore those invalid records if needed.For
gene_id
andgene_name
metadata field,If these two fields are entirely missing in the GTF file, An error will be returned. At the same time, in the
clean_gtf.gtf
, the two fields will be imputed using thetranscript_id
field.If one of
gene_id
andgene_name
is completely missing, pyroe will print a warning, impute the missing field using the other one, and move to the next step with the imputed data.if some records have missing
gene_id
,gene_name
, or both, pyroe will print a warning and move to the next step after imputing the missing values by the following rules: For records missinggene_id
orgene_name
, pyroe imputes the missing one using the other one; If both are missing, pyroe imputes both of them using itstranscript_id
, which cannot be missing.
If the GTF file does not contain transcript or gene level records, those whose
feature
field value is “transcript” or “gene”, pyroe will print a warning and impute those missing records using the exon level records of transcripts and genes, in which theStart
andEnd
fields will be imputed as the bounds of the corresponding exons.If the boundaries of transcripts/genes defined in the “transcript” or “gene” level records – those whose
feature
field value is either “transcript” or “gene” – do not match those implied by their exons’ feature records, or the transcript/gene level records of some transcripts/genes’ are missing, pyroe will report a warning, fix all those gene/transcript level records using their exon level records and write them to theclean_gtf.gtf
file, but still extract unspliced sequences based on the existing transcript/gene level records.