Running Sort

Basic Configuration

At minimum, sort requires a set of trimmed reads and a genome to align them to, from the species the small RNAs came from. If trim is in the same config file, the set of trimmed reads will be taken from the output of trim. Examples are as follows.

Without running trim.

smallRNA_fastq: smallrna_trimmed.fastq
sort:
  genome: genome.fasta

With running trim.

smallRNA_fastq: smallrna_untrimmed.fastq
trim:
  ...
sort:
  genome: genome.fasta

Once this config is written, it can be run as usual using:

$ hlsmallrna config.yml

Tip

Remember that if you installed via apptainer or docker, you will need to prepend the command from the install page. For apptainer this command becomes:

$ apptainer run -B $HOME:$HOME hlsmallrna.sif hlsmallrna config.yml

Optional Parameters

align_to_cds

If true, reads that don't align to the genome are aligned to the specified CDS file, and any that align are kept. This retains small RNAs that are fragments of genes and align across introns, which may not align to the genome.

cds: cds_sequences.fasta
...
sort:
  ...
  align_to_cds: True

min_length

Sets the minimum sequence length to keep (default is 18 nt). Anything shorter is filtered out during sorting. For example, to filter out anything below 10 nt, set the following:

sort:
  min_length: 10

max_length

Sets the maximum sequence length to keep (default is 30 nt). Anything longer is filtered out during sorting. For example, to filter out anything above 50 nt, set the following:

sort:
  max_length: 50
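
Taken together, min_length and max_length describe an inclusive length window: reads shorter than the minimum or longer than the maximum are dropped. The Python sketch below only illustrates that rule; the function name within_length_bounds is hypothetical and is not part of the pipeline.

```python
def within_length_bounds(sequence, min_length=18, max_length=30):
    """Return True if the read length lies inside the inclusive window.

    Defaults mirror the documented defaults (18 nt and 30 nt). This is an
    illustrative helper, not the pipeline's own code.
    """
    return min_length <= len(sequence) <= max_length


# Hypothetical reads of length 17, 18, 30 and 31 nt.
reads = ["A" * 17, "A" * 18, "G" * 30, "T" * 31]
kept = [read for read in reads if within_length_bounds(read)]
print([len(read) for read in kept])
```

With the default settings only the 18 nt and 30 nt reads survive, since reads exactly at either bound are neither shorter than the minimum nor longer than the maximum.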

mismatches

Number of mismatches to allow in bowtie2 when aligning to the genome and CDS, defaults to 0. See the Considerations page for information about why you'd want to change this. The example below sets this to allow 1 mismatch:

sort:
  mismatches: 1

Warning

The small RNA pipeline suppresses gaps by making them equal to 100 mismatches. If you set this above 100, some gaps may be included in the alignment.

Output Files

alignment_report.tsv

This is a table showing an overview of how many small RNA reads successfully aligned to the genome and optionally the CDS. It contains the following metrics:

Total Reads
Total Mapped
Percentage Mapped
Total Unmapped
Percentage Unmapped
Mapped to Genome
Percentage Mapped to Genome
Unmapped to Genome
Percentage Unmapped to Genome
Mapped to CDS
Percentage Mapped to CDS (but not to genome)
Unmapped to CDS or genome
Percentage Unmapped to CDS or genome
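
As a rough guide to how these metrics could relate to one another, the Python sketch below derives them from three counts. It rests on assumptions that the report itself does not state: that every percentage is relative to Total Reads, and that "Mapped to CDS" counts only reads that failed to align to the genome. The function name alignment_summary and the numbers used are hypothetical.

```python
def alignment_summary(total_reads, mapped_genome, mapped_cds=0):
    """Derive report-style metrics from three raw counts.

    Assumes mapped_cds counts reads that missed the genome but hit the CDS,
    and that all percentages are taken over total_reads.
    """
    def pct(count):
        # Guard against an empty input file.
        return 100.0 * count / total_reads if total_reads else 0.0

    total_mapped = mapped_genome + mapped_cds
    total_unmapped = total_reads - total_mapped
    unmapped_genome = total_reads - mapped_genome
    return {
        "Total Reads": total_reads,
        "Total Mapped": total_mapped,
        "Percentage Mapped": pct(total_mapped),
        "Total Unmapped": total_unmapped,
        "Percentage Unmapped": pct(total_unmapped),
        "Mapped to Genome": mapped_genome,
        "Percentage Mapped to Genome": pct(mapped_genome),
        "Unmapped to Genome": unmapped_genome,
        "Percentage Unmapped to Genome": pct(unmapped_genome),
        "Mapped to CDS": mapped_cds,
        "Percentage Mapped to CDS": pct(mapped_cds),
    }


# Hypothetical counts, for illustration only.
summary = alignment_summary(total_reads=1000, mapped_genome=800, mapped_cds=50)
for name, value in summary.items():
    print(f"{name}\t{value}")
```

If your own report does not add up this way, trust the report: the relationships here are an interpretation of the metric names, not a description of the pipeline's implementation.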

counts.tab

Read counts in TSV format for each small RNA sequence, designed as input to a differential expression analysis, if that is wanted.

rna_length_report.csv

Table of counts of small RNA for each length and first base, in an easily human-readable format.

baseplot.png

Plot of frequency of length and first base in the dataset.

baseplot_data.csv

The length and first base data used in baseplot.png, normalised to percentages. Use this file if you want to replot the graph or feed the data into downstream analysis.
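
To make "normalised to percentages" concrete, the Python sketch below counts reads by length and first base and divides each count by the total. It assumes normalisation over all reads in the dataset, which the pipeline may or may not do, and it does not reproduce the actual column layout of baseplot_data.csv. The function name first_base_length_percentages and the example sequences are hypothetical.

```python
from collections import Counter


def first_base_length_percentages(sequences):
    """Map (length, first_base) to its share of all reads, in percent.

    Assumes each percentage is taken over the total number of non-empty
    sequences supplied.
    """
    counts = Counter((len(seq), seq[0]) for seq in sequences if seq)
    total = sum(counts.values())
    return {group: 100.0 * n / total for group, n in counts.items()}


# Four short hypothetical sequences, kept short for readability.
example = ["GAAA", "GTTT", "TAAAA", "AAAA"]
percentages = first_base_length_percentages(example)
for (length, base), share in sorted(percentages.items()):
    print(length, base, share)
```

Here two of the four sequences are 4 nt long and start with G, so that group accounts for half of the dataset and the other two groups for a quarter each.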

binned_length_rna

Directory containing FASTQ files for each RNA length - named lengthX.fastq where X is the length of small RNA in the file.

binned_group_rna

Directory containing FASTQ files for each RNA group - defined by length and first base - named {length}{first_base}.fastq (e.g. 22G.fastq) where {length} is the length of small RNA in the file and {first_base} is the letter corresponding to the first base.
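
The two naming schemes above can be sketched in Python as follows. This only illustrates how a sequence maps to a file name in binned_length_rna and binned_group_rna; the helper names are hypothetical, and the sketch does not read or write FASTQ files.

```python
from collections import defaultdict


def length_file_name(sequence):
    """File name in binned_length_rna, e.g. length22.fastq."""
    return f"length{len(sequence)}.fastq"


def group_file_name(sequence):
    """File name in binned_group_rna, e.g. 22G.fastq."""
    return f"{len(sequence)}{sequence[0]}.fastq"


def bin_by_group(sequences):
    """Collect sequences under the group file they would belong to."""
    bins = defaultdict(list)
    for sequence in sequences:
        bins[group_file_name(sequence)].append(sequence)
    return dict(bins)


# Hypothetical reads: two 22 nt reads starting with G, one 21 nt read starting with T.
reads = ["G" + "A" * 21, "G" + "C" * 21, "T" + "A" * 20]
bins = bin_by_group(reads)
print(sorted(bins))
print(length_file_name(reads[0]))
```

Both 22 nt reads starting with G fall under 22G.fastq, while the same reads would share length22.fastq with every other 22 nt read regardless of first base.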