The Config File
HuntLab-smallRNA 2 uses an overhauled config system that is different to HuntLab-smallRNA 1. This is based on a YAML file and the idea that anything that affects the analysis should be in the config file, while anything that doesn't is a command line argument. Alongside this it has been made easier to run multiple stages in one command. An annotated version of a minimal config file to run all 4 parts of the analysis is as follows:
# The small RNA FASTQ file to use as input to the pipeline
smallRNA_fastq: smallrna.fastq
# The trim section, as it is present trim will be run
trim:
# Use the bulit-in adapaters for the Qiagen kit
kit: qiagen
# The sort section, as it is present sort will be run
sort:
# The genome to align to filter small RNA
genome: genome.fasta
# The unitas section, as it is present unitas will be run
unitas:
# files to pass to unitas refseq and the labels to give them
refseq:
- miRNA: test/miRNA.fasta
- piRNA: test/piRNA.fasta
- tRNA: test/tRNA.fasta
- TE: test/transposable_elements.fasta
# The targetid section, as it is present targetid will be run
targetid:
# Files containing the interesting targets to align to
target_files:
- test/file1.fasta
- test/file2.fasta
Absolute and Reletive Paths
When parsing the config file, the small RNA pipeline assumes that all paths in the config file are reletive to the directory the config file is in. So if you move the config file, things will break. If you need to move the config file, use absolute paths (that start with /
) instead.
To run this, you could save the file as hlsmallrna_config.yml
and run:
$ hlsmallrna hlsmallrna_config.yml
Tip
Remember if you install via apptainer or docker you will need to prepend the command from the install page. For apptainer this command becomes:
$ apptainer run -B $HOME:$HOME hlsmallrna.sif hlsmallrna hlsmallrna_config.yml
The results would then appear in the directory hlsmallrna_output/
, presuming all the sequence data referenced is real. A more comprehensive file to do a similar analysis to the first one is:
# The small RNA FASTQ file to use as input to the pipeline
smallRNA_fastq: smallrna.fastq
# CDS FASTA of the species of interest, can be used in sort, unitas and targetid
cds: cds.fasta
# unspliced transcriptome FASTA of the species of interest, can be used in unitas and targetid
unspliced_transcriptome: unspliced.fasta
# The trim section, as it is present trim will be run
trim:
# The 5’ adapter sequence to trim
5_prime: ACGTTTAG
# The 3’ adapter sequence to trim
3_prime: CGTAGGAT
# The quality filter cutoff to use
min_quality: 20
# The sort section, as it is present sort will be run
sort:
# The genome to align to filter small RNA
genome: genome.fasta
# If True, also align to the CDS file above
align_to_cds: True
# Minimum length of small RNA sequence to keep
min_length: 15
# Maximum length of small RNA sequence to keep
max_length: 50
# Max number to allow when aligning to the genome and CDS
mismatches: 0
# The unitas section, as it is present unitas will be run
unitas:
# files to pass to unitas refseq and the labels to give them
refseq:
# add genes to unitas, special keyword to add using cds and unspliced_transcriptome specified earlier
- gene
- miRNA: test/miRNA.fasta
- piRNA: test/piRNA.fasta
- tRNA: test/tRNA.fasta
- TE: test/transposable_elements.fasta
# species name to pass to unitas
species: x
# The targetid section, as it is present targetid will be run
targetid:
# Minimum length of small RNA, higher number will speed up runtime, used to calculate bowtie2 seed length
min_seq_length: 5
# Number of mismatches to allow when aligning to the targets
mismatches: 0
# Files containing the interesting targets to align to
target_files:
- test/file1.fasta
- test/file2.fasta
# If present also enrich GO terms, KEGG pathways and Pfams of targets
enrich:
# Path to the eggnog-mapper data dir
eggnog_data_dir: /home/user/eggnog-mapper-data
# Target files to ignore during enrichment
exclude_files:
- test/file2.fasta
If you want to skip a step, just leave it out of the file. e.g. if you only want to run sort and unitas, you could use the file:
smallRNA_fastq: smallrna.fastq
sort:
genome: genome.fasta
unitas:
refseq:
- miRNA: test/miRNA.fasta
- piRNA: test/piRNA.fasta
- tRNA: test/tRNA.fasta
- TE: test/transposable_elements.fasta
Command Line Arguments
While most things are now specified in the config file, there are still a few CLI arguments that are as follows:
-o
, --ouput
- directory to write pipeline output to (default: hlsmallrna_output)
-t
, --threads
- number of threads to use when running analysis tools (default: 4)
-k
, --keep-files
- If set, don't delete intermediate files, useful for debugging or if you need an intermediate file later
-v
, --verbose
- makes the pipeline print out the output of the intermediate commands
The following sections will explain each parameter in the config file in detail.