3) Running
There is only one mandatory parameter for running LOMA, an input file (format detailed below).
./run_loma --input input.tsv
The input file (e.g. 'input.tsv') is a five column tab-separated file with the following structure:
RUN_ID BARCODE_ID SAMPLE_ID SAMPLE_TYPE /FULL/PATH/TO/FASTQ_FILE
- RUN_ID: Run identifier, will determine the highest level directory name in the results directory
- BARCODE_ID: Sample barcode
- SAMPLE_ID: Sample identifier, will determine the subdirectory where results are stored per-sample
- SAMPLE_TYPE: Sample description, will be added to the reports, but doesn't change how the sample is processed.
- /FULL/PATH/TO/FASTQ_FILE: Full path to the input FASTQ file for that sample.
Any number of samples can be included, provided no two rows share both the same RUN_ID and SAMPLE_ID (sharing either one alone is fine). If any column contains a period ('.'), it will be automatically replaced with an underscore ('_') in the output.
Example input file:
RUN01 RB01 SAMPLE_1 BLOOD /data/projects/metagenome_ont/SAMPLE_1.BLOOD.fq.gz
RUN01 RB02 SAMPLE_2 BLOOD /data/projects/metagenome_ont/SAMPLE_2.BLOOD.fq.gz
RUN02 RB01 SAMPLE_3 SALIVA /data/projects/metagenome_ont/SAMPLE_3.NASOPHARYNGEAL.fq.gz
RUN03 XBD SAMPLE_1 SKIN /data/projects/metagenome_ont/SAMPLE_3.SKIN.fq.gz
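The column-count and uniqueness constraints above can be checked before launching a run. Below is a minimal pre-flight sketch (assuming plain POSIX awk; the file contents are illustrative): every row must have five tab-separated columns, and no two rows may share both RUN_ID and SAMPLE_ID.

```shell
# Write a small illustrative input file (fields must be tab-separated).
printf 'RUN01\tRB01\tSAMPLE_1\tBLOOD\t/data/SAMPLE_1.fq.gz\nRUN01\tRB02\tSAMPLE_2\tBLOOD\t/data/SAMPLE_2.fq.gz\n' > input.tsv

# Validate: 5 columns per row, and RUN_ID + SAMPLE_ID pairs must be unique.
awk -F'\t' '
  NF != 5            { print "line " NR ": expected 5 columns, got " NF; bad = 1 }
  seen[$1 "\t" $3]++ { print "line " NR ": duplicate RUN_ID + SAMPLE_ID"; bad = 1 }
  END                { exit bad }
' input.tsv && echo "input.tsv OK"
```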
Further examples can be found here.
Commonly used optional parameters include:
Help options:
--help Display help text.
--validationShowHiddenParams Display all parameters.
--print_modules Print the full list of available modules.
Run options:
-profile Executor to use. (accepted: singularity, apptainer, docker) [default: singularity]
-resume If possible resume the pipeline at the last completed process.
--outdir The output directory where the results will be saved (use absolute paths on cloud infrastructure). [default: results]
Resource usage:
--max_cpus Maximum number of CPUs that can be requested for any single job. [default: 36]
--max_memory Maximum amount of memory that can be requested for any single job. [default: 90.GB]
--max_time Maximum amount of time that can be requested for any single job. [default: 240.h]
Execution options:
--skip_assembly Skip read assembly.
--skip_taxonomic_profiling Skip read-based taxonomic profiling.
--skip_bacterial_typing Skip metagenome assembled genome analyses.
Less commonly used parameters include:
READ_QC
--FILTLONG.min_length Filtlong: Discard any read shorter than length (bp). [default: 650]
--FILTLONG.keep_percent Filtlong: Percentage of dataset retained. By default removes the worst 5% of read bases. [default: 95]
--SEQTK_FQCHK.endseq_len Number of base pairs at either end of reads to plot for quality and adapter-content quality control. [default: 300]
READ_DECONTAMINATION
--READ_DECONTAMINATION.host_assembly Path to host genome assembly.
--READ_DECONTAMINATION.host_krakendb Path to host specific Kraken2 database.
TAXONOMIC_PROFILING
--TAXONOMIC_PROFILING.krakendb Path to Kraken2 database.
--TAXONOMIC_PROFILING.centrifugerdb Path to Centrifuger database.
--TAXONOMIC_PROFILING.sylphdb Path to Sylph database.
--TAXONOMIC_PROFILING.dbdir Path to directory containing: nodes.dmp, names.dmp and merged.dmp files.
--TAXONOMIC_PROFILING.target_species List of target pathogens.
--new_param_3 GTDB metadata file (typically gtdb_r*_metadata.tsv.gz).
--PARSE_KRAKEN2HITS.min_target_reads Minimum number of reads required for a Kraken2 hit (processed with Bracken) to be reported. [default: 1]
--PARSE_KRAKEN2HITS.min_target_fraction Minimum proportion of reads required for a Kraken2 hit (processed with Bracken) to be reported. [default: 0]
--PARSE_CENTRIFUGERHITS.min_target_reads Minimum number of reads required for a Centrifuger hit (processed with Bracken) to be reported. [default: 1]
--PARSE_CENTRIFUGERHITS.min_target_fraction Minimum proportion of reads required for a Centrifuger hit (processed with Bracken) to be reported. [default: 0]
--PARSE_SYLPHHITS.min_target_reads Minimum number of reads required for a Sylph hit to be reported. [default: 1]
--PARSE_SYLPHHITS.min_target_fraction Minimum proportion of reads required for a Sylph hit to be reported. [default: 0]
ASSEMBLY
--FLYE.read_type Read type for Flye to process. (accepted: --nano-raw, --nano-corr, --nano-hq) [default: --nano-raw]
--ASSEMBLY.racon_rounds Number of rounds of metagenome assembly polishing to perform with Racon. [default: 4]
--ASSEMBLY.medaka Perform metagenome assembly polishing with Medaka.
ASSIGN_BINS
--SEMIBIN_SINGLEEASYBIN.environment Prebuilt SemiBin2 model. (accepted: human_gut, dog_gut, ocean, soil, cat_gut, human_oral, mouse_gut, pig_gut, built_environment, ...) [default: human_gut]
--DASTOOL_DASTOOL.score_threshold Score threshold below which the DAS Tool selection algorithm stops selecting bins. [default: 0.5]
CONTIG_QC
--SKANI_SEARCH.db Skani database directory.
--GENOMAD_ENDTOEND.db geNomad database directory.
BIN_QC
--GTDBTK_CLASSIFYWF.mash_db GTDB mash reference sketch database.
--GTDBTK_CLASSIFYWF.gtdb_db GTDB-Tk reference database.
--CHECKM_LINEAGEWF.db CheckM reference database.
--GTDBTK_CLASSIFYWF.min_perc_aa Minimum percentage of amino acids that must be shared in the multiple sequence alignments. [default: 10]
--GTDBTK_CLASSIFYWF.min_af Minimum alignment fraction to assign genome to a species cluster. [default: 0.65]
--GTDBTK_CLASSIFYWF.pplacer_scratch Reduce pplacer memory usage by writing to disk (slower). [default: true]
BIN_TAXONOMY
--BIN_TAXONOMY.medaka_mag Polish individual metagenome assembled genomes with Medaka.
--ASSIGN_TAXONOMY.ani_cutoff Minimum average nucleotide identity required to retain a match. [default: 0.75]
--ASSIGN_TAXONOMY.aln_frac Minimum alignment fraction required to retain a match. [default: 0.75]
--ASSIGN_TAXONOMY.definitiontable Custom definition table to define targets and downstream parameters.
PROKARYA_TYPING:AMR_TYPING
--AMRFINDERPLUS_RUN.db AMRFinderPlus database.
--RESFINDER.db ResFinder database.
--POINTFINDER.db PointFinder database.
PROKARYA_TYPING:SEQUENCE_TYPING
--SEQUENCE_TYPING.cc_definitions Clonal complex definition file.
--MLST.yersinia_blastdb Custom BLASTN database for Yersinia MLST.
--MLST.yersinia_datadir Custom database for Yersinia MLST.
PROKARYA_TYPING:TARGETED_TYPING
--VIRULENCEFINDER.db VirulenceFinder database.
--TARGETED_TYPING.genedbdir Directory of custom gene databases for Genefinder and/or BLASTN.
CREATE_REPORT
--CREATE_REPORT.amr_tool Which AMR tool's results to display in the output summary. (accepted: abricate, abritamr, amrfinderplus, resfinder, rgi, all)
--CREATE_REPORT.template Template HTML file for the summary report.
When specified, the following parameters will skip substantial sections of the pipeline, saving resources if the results are not of interest:
--skip_assembly Skip read assembly.
--skip_taxonomic_profiling Skip read-based taxonomic profiling.
--skip_prokarya_typing Skip metagenome assembled genome analyses.
Excluding taxonomic databases will skip the associated step, reducing overall runtime.
--TAXONOMIC_PROFILING.krakendb="" Skip Kraken2 taxonomic profiling
--TAXONOMIC_PROFILING.centrifugerdb="" Skip Centrifuger taxonomic profiling
--TAXONOMIC_PROFILING.sylphdb="" Skip Sylph taxonomic profiling
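One way to manage this is to keep the database locations in variables and pass an empty string for any profiler you want skipped. A sketch, assuming hypothetical environment variables (KRAKENDB, CENTRIFUGERDB, SYLPHDB are not LOMA parameters, just placeholders):

```shell
# Database paths: set to "" to skip the corresponding profiler.
KRAKENDB="/dbs/k2_standard"   # hypothetical path; Kraken2 will run
CENTRIFUGERDB=""              # empty: Centrifuger is skipped
SYLPHDB=""                    # empty: Sylph is skipped

cmd="./run_loma --input input.tsv"
cmd="$cmd --TAXONOMIC_PROFILING.krakendb=\"$KRAKENDB\""
cmd="$cmd --TAXONOMIC_PROFILING.centrifugerdb=\"$CENTRIFUGERDB\""
cmd="$cmd --TAXONOMIC_PROFILING.sylphdb=\"$SYLPHDB\""
echo "$cmd"
# eval "$cmd"   # uncomment to actually launch the pipeline
```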
Assembly error correction is a very time-consuming step. To save time, you can reduce the number of rounds of Racon polishing:
--ASSEMBLY.racon_rounds 1 Run a single round of Racon polishing. (default: 4, range: 0-4)
If you find that the per-base accuracy of your MAGs is low even after polishing with Racon, you can enable Medaka polishing (very slow, so disabled by default):
--ASSEMBLY.medaka Perform metagenome assembly polishing with Medaka.
However, a quicker approach is to only polish the MAGs of interest. This can be done by specifying:
--BIN_TAXONOMY.medaka_mag Polish individual metagenome assembled genomes with Medaka.
Depending on your available computing resources, it may be necessary to change the preset resource usage defaults. The maximum RAM and CPU usage can be changed with command-line arguments as follows:
--max_cpus 24 Maximum number of CPUs that can be requested for any single job. [default: 36]
--max_memory "80.GB" Maximum amount of memory that can be requested for any single job. [default: 90.GB]
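Sensible values can be derived from the host itself. A sketch (assuming a Linux host with nproc and /proc/meminfo available; the headroom amounts are arbitrary choices, not LOMA recommendations):

```shell
# Derive conservative --max_cpus / --max_memory values from the host,
# leaving a little headroom for the operating system.
cpus=$(nproc)
mem_gb=$(awk '/MemTotal/ { printf "%d", $2 / 1024 / 1024 }' /proc/meminfo)
max_cpus=$(( cpus > 2 ? cpus - 2 : 1 ))
max_mem=$(( mem_gb > 8 ? mem_gb - 8 : 1 ))
echo "--max_cpus $max_cpus --max_memory \"${max_mem}.GB\""
```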
If you are running out of memory, or have limited swap space, it's possible to alter the preset resource usage for individual processes in conf/base.config. By raising the RAM and CPU requirements of intensive processes to more than 50% of the total CPU/RAM allocation, you can stop LOMA from running multiple intensive jobs simultaneously.
Specifically, in the section:
withLabel:process_high {
cpus = { check_max( 22 * task.attempt, 'cpus' ) }
memory = { check_max( 86.GB * task.attempt, 'memory' ) }
time = { check_max( 16.h * task.attempt, 'time' ) }
}
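For example, on a machine capped at 24 CPUs and 90 GB of RAM, the block could be rewritten so each process_high job requests more than half of both resources, ensuring only one runs at a time (the values below are illustrative, not recommendations):

```groovy
withLabel:process_high {
    // 14 of 24 CPUs and 48 of 90 GB are each >50% of the allocation,
    // so two process_high jobs can never be scheduled concurrently.
    cpus = { check_max( 14 * task.attempt, 'cpus' ) }
    memory = { check_max( 48.GB * task.attempt, 'memory' ) }
    time = { check_max( 16.h * task.attempt, 'time' ) }
}
```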
Skip geNomad neural network-based classification, this will reduce runtime at the cost of accuracy:
--GENOMAD_ENDTOEND.args="--disable-nn-classification"