
3). Running

Duncan Berger edited this page Sep 10, 2024 · 3 revisions

Usage

There is only one mandatory parameter for running LOMA: an input file (format detailed below).

./run_loma --input input.tsv

Input file structure

The input file (e.g. 'input.tsv') is a five-column tab-separated file with the following structure:

RUN_ID  BARCODE_ID  SAMPLE_ID   SAMPLE_TYPE /FULL/PATH/TO/FASTQ_FILE
  • RUN_ID: Run identifier; determines the top-level directory name in the results directory.
  • BARCODE_ID: Sample barcode.
  • SAMPLE_ID: Sample identifier; determines the per-sample subdirectory where results are stored.
  • SAMPLE_TYPE: Sample description; added to the reports but does not change how the sample is processed.
  • /FULL/PATH/TO/FASTQ_FILE: Full path to the input FASTQ file.

Any number of samples can be included, provided no two rows share both the same RUN_ID and SAMPLE_ID (sharing either one alone is fine). If any column value contains a period ('.'), it will be automatically replaced with an underscore ('_') in the output.

Example input file:

RUN01	RB01	SAMPLE_1	BLOOD	/data/projects/metagenome_ont/SAMPLE_1.BLOOD.fq.gz
RUN01	RB02	SAMPLE_2	BLOOD	/data/projects/metagenome_ont/SAMPLE_2.BLOOD.fq.gz
RUN02	RB01	SAMPLE_3	SALIVA	/data/projects/metagenome_ont/SAMPLE_3.NASOPHARYNGEAL.fq.gz
RUN03	XBD	SAMPLE_1	SKIN	/data/projects/metagenome_ont/SAMPLE_3.SKIN.fq.gz
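Before launching, the structure above can be sanity-checked with a short shell function (a sketch; `check_loma_input` is not part of LOMA):

```shell
# Sketch: pre-flight check for a LOMA input file (not part of LOMA itself).
# Reports any row that does not have exactly five tab-separated columns,
# and any duplicated RUN_ID + SAMPLE_ID combination (columns 1 and 3).
check_loma_input() {
    awk -F'\t' 'NF != 5 { printf "line %d: expected 5 columns, got %d\n", NR, NF }' "$1"
    cut -f1,3 "$1" | sort | uniq -d | sed 's/^/duplicate RUN_ID+SAMPLE_ID pair: /'
}
```

Run `check_loma_input input.tsv`; no output means the file passed both checks.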

Further examples can be found here.

Optional parameters

Commonly used optional parameters include:

Help options:
  --help                                       Display help text.
  --validationShowHiddenParams                 Display all parameters.
  --print_modules                              Print the full list of available modules.

Run options:
  -profile                                     Executor to use. (accepted: singularity, apptainer, docker) [default: singularity] 
  -resume                                      If possible resume the pipeline at the last completed process. 
  --outdir                                     The output directory where the results will be saved (use absolute paths on cloud infrastructure). [default: results] 
   
Resource usage:
  --max_cpus                                   Maximum number of CPUs that can be requested for any single job. [default: 36]
  --max_memory                                 Maximum amount of memory that can be requested for any single job. [default: 90.GB]
  --max_time                                   Maximum amount of time that can be requested for any single job. [default: 240.h]

Execution options
  --skip_assembly                              Skip read assembly.
  --skip_taxonomic_profiling                   Skip read-based taxonomic profiling.
  --skip_bacterial_typing                      Skip metagenome assembled genome analyses.
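Combining several of the options above, a full invocation might look like this (the profile and paths are illustrative; the leading `echo` prints the command for inspection rather than launching it, and can be removed for a real run):

```shell
# Illustrative invocation: Docker executor, resume an interrupted run,
# write results to an absolute output path. All values are examples.
echo ./run_loma --input input.tsv \
    -profile docker \
    -resume \
    --outdir /path/to/results
```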

Less commonly used parameters include:

READ_QC
  --FILTLONG.min_length                        Filtlong: Discard any read shorter than length (bp). [default: 650]
  --FILTLONG.keep_percent                      Filtlong: Percentage of dataset retained. By default removes the worst 5% of read bases. [default: 95] 
  --SEQTK_FQCHK.endseq_len                     Number of base pairs at either end of reads to plot for quality and adapter-content quality control. [default: 300] 

READ_DECONTAMINATION
  --READ_DECONTAMINATION.host_assembly         Path to host genome assembly.
  --READ_DECONTAMINATION.host_krakendb         Path to host specific Kraken2 database.

TAXONOMIC_PROFILING
  --TAXONOMIC_PROFILING.krakendb               Path to Kraken2 database.
  --TAXONOMIC_PROFILING.centrifugerdb          Path to Centrifuger database.
  --TAXONOMIC_PROFILING.sylphdb                Path to Sylph database.
  --TAXONOMIC_PROFILING.dbdir                  Path to directory containing: nodes.dmp, names.dmp and merged.dmp files.
  --TAXONOMIC_PROFILING.target_species         List of target pathogens.
  --new_param_3                                GTDB metadata file (typically gtdb_r*_metadata.tsv.gz).
  --PARSE_KRAKEN2HITS.min_target_reads         Minimum number of reads required for a Kraken2 hit (processed with Bracken) to be reported. [default: 1] 
  --PARSE_KRAKEN2HITS.min_target_fraction      Minimum proportion of reads required for a Kraken2 hit (processed with Bracken) to be reported. [default: 0] 
  --PARSE_CENTRIFUGERHITS.min_target_reads     Minimum number of reads required for a Centrifuger hit (processed with Bracken) to be reported. [default: 1] 
  --PARSE_CENTRIFUGERHITS.min_target_fraction  Minimum proportion of reads required for a Centrifuger hit (processed with Bracken) to be reported. [default: 0] 
  --PARSE_SYLPHHITS.min_target_reads           Minimum number of reads required for a Sylph hit to be reported. [default: 1]
  --PARSE_SYLPHHITS.min_target_fraction        Minimum proportion of reads required for a Sylph hit to be reported. [default: 0]

ASSEMBLY
  --FLYE.read_type                             Read type for Flye to process. (accepted: --nano-raw, --nano-corr, --nano-hq) [default: --nano-raw] 
  --ASSEMBLY.racon_rounds                      Number of rounds of metagenome assembly polishing to perform with Racon. [default: 4]
  --ASSEMBLY.medaka                            Perform metagenome assembly polishing with Medaka.

ASSIGN_BINS
  --SEMIBIN_SINGLEEASYBIN.environment          Prebuilt SemiBin2 model. (accepted: human_gut, dog_gut, ocean, soil, cat_gut, human_oral, mouse_gut, pig_gut, built_environment, ...) [default: human_gut] 
  --DASTOOL_DASTOOL.score_threshold            Score threshold down to which the DAS Tool bin-selection algorithm will keep selecting bins. [default: 0.5]

CONTIG_QC
  --SKANI_SEARCH.db                            Skani database directory.
  --GENOMAD_ENDTOEND.db                        geNomad database directory.

BIN_QC
  --GTDBTK_CLASSIFYWF.mash_db                  GTDB mash reference sketch database.
  --GTDBTK_CLASSIFYWF.gtdb_db                  GTDB-Tk reference database.
  --CHECKM_LINEAGEWF.db                        CheckM reference database.
  --GTDBTK_CLASSIFYWF.min_perc_aa              Minimum percentage of amino acids that must be shared in the multiple sequence alignments. [default: 10] 
  --GTDBTK_CLASSIFYWF.min_af                   Minimum alignment fraction to assign genome to a species cluster. [default: 0.65]
  --GTDBTK_CLASSIFYWF.pplacer_scratch          Reduce pplacer memory usage by writing to disk (slower). [default: true]

BIN_TAXONOMY
  --BIN_TAXONOMY.medaka_mag                    Polish individual metagenome assembled genomes with Medaka.
  --ASSIGN_TAXONOMY.ani_cutoff                 Minimum average nucleotide identity required to retain a match. [default: 0.75]
  --ASSIGN_TAXONOMY.aln_frac                   Minimum alignment fraction required to retain a match. [default: 0.75]
  --ASSIGN_TAXONOMY.definitiontable            Custom definition table to define targets and downstream parameters.

PROKARYA_TYPING:AMR_TYPING
  --AMRFINDERPLUS_RUN.db                       AMRFinderPlus database.
  --RESFINDER.db                               ResFinder database.
  --POINTFINDER.db                             PointFinder database.

PROKARYA_TYPING:SEQUENCE_TYPING
  --SEQUENCE_TYPING.cc_definitions             Clonal complex definition file.
  --MLST.yersinia_blastdb                      Custom BLASTN database for Yersinia MLST.
  --MLST.yersinia_datadir                      Custom database for Yersinia MLST.

PROKARYA_TYPING:TARGETED_TYPING
  --VIRULENCEFINDER.db                         VirulenceFinder database.
  --TARGETED_TYPING.genedbdir                  Directory of custom gene databases for Genefinder and/or BLASTN.

CREATE_REPORT
  --CREATE_REPORT.amr_tool                     Display results for which AMR tool in the output summary. (accepted: abricate, abritamr, amrfinderplus, resfinder, rgi, all) 
  --CREATE_REPORT.template                     Template HTML file for the summary report.

Tips for improving speed and efficiency

Skipping major analysis steps

When specified, the following parameters will skip substantial sections of the pipeline, saving resources if the results are not of interest:

  --skip_assembly                                 Skip read assembly.
  --skip_taxonomic_profiling                      Skip read-based taxonomic profiling.
  --skip_prokarya_typing                          Skip metagenome assembled genome analyses.
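For example, a profiling-only run (no assembly or downstream typing) could be launched as follows (a sketch; the leading `echo` prints the command rather than running it):

```shell
# Illustrative: read QC + taxonomic profiling only; assembly and
# MAG analyses are skipped entirely.
echo ./run_loma --input input.tsv \
    --skip_assembly \
    --skip_prokarya_typing
```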

Skipping read-based taxonomic annotation

Excluding taxonomic databases will skip the associated step, reducing overall runtime.

  --TAXONOMIC_PROFILING.krakendb=""               Skip Kraken2 taxonomic profiling
  --TAXONOMIC_PROFILING.centrifugerdb=""          Skip Centrifuger taxonomic profiling
  --TAXONOMIC_PROFILING.sylphdb=""                Skip Sylph taxonomic profiling
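For instance, to run only Sylph profiling, the other two databases can be passed as empty strings (the database path is a placeholder; the leading `echo` prints the command rather than running it):

```shell
# Illustrative: disable Kraken2 and Centrifuger by supplying empty
# database paths, leaving only Sylph profiling enabled.
echo ./run_loma --input input.tsv \
    --TAXONOMIC_PROFILING.krakendb="" \
    --TAXONOMIC_PROFILING.centrifugerdb="" \
    --TAXONOMIC_PROFILING.sylphdb=/path/to/sylph_db
```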

Skipping/reducing polishing

Assembly error correction is a very time-consuming step. To save time, you can reduce the number of rounds of Racon polishing:

  --ASSEMBLY.racon_rounds 1                       Runs 1 round of Racon polishing. (default: 4, range: 0-4)

If you find the per-base accuracy of your MAGs is low even after polishing with Racon, you can enable Medaka polishing (very slow, so disabled by default):

  --ASSEMBLY.medaka                               Perform metagenome assembly polishing with Medaka.

However, a quicker approach is to polish only the MAGs of interest. This can be done by specifying:

  --BIN_TAXONOMY.medaka_mag                       Polish individual metagenome assembled genomes with Medaka.
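A resource-saving combination of the polishing options above might look like this (a sketch; the leading `echo` prints the command rather than running it):

```shell
# Illustrative: one round of Racon on the full assembly, then Medaka
# only on the individual binned MAGs rather than the whole metagenome.
echo ./run_loma --input input.tsv \
    --ASSEMBLY.racon_rounds 1 \
    --BIN_TAXONOMY.medaka_mag
```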

Adjust RAM/CPU usage

Depending on your available computing resources, it may be necessary to change the preset resource usage defaults. The maximum RAM and CPU usage can be changed with command-line arguments as follows:

  --max_cpus 24                                   Maximum number of CPUs that can be requested for any single job. [default: 36]
  --max_memory "80.GB"                            Maximum amount of memory that can be requested for any single job. [default: 90.GB]

If you are running out of memory, or have limited swap space, you can alter the preset resource usage for individual processes in conf/base.config. Raising the CPU/RAM requirements of intensive processes above 50% of the total CPU/RAM allocation stops LOMA from running multiple intensive jobs simultaneously.

Specifically, in the section:

withLabel:process_high {
   cpus   = { check_max( 22    * task.attempt, 'cpus'    ) }
   memory = { check_max( 86.GB * task.attempt, 'memory'  ) }
   time   = { check_max( 16.h  * task.attempt, 'time'    ) }
}
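For instance, on a hypothetical 24-CPU / 96 GB node you might lower the requests but keep each above half of the node's totals, so that no two process_high jobs can be scheduled at once (the values here are illustrative, not recommendations):

```groovy
// Illustrative conf/base.config override for a 24-CPU / 96 GB node:
// 16 CPUs and 64.GB are each >50% of the node's totals, so two
// process_high jobs can no longer run concurrently.
withLabel:process_high {
   cpus   = { check_max( 16    * task.attempt, 'cpus'    ) }
   memory = { check_max( 64.GB * task.attempt, 'memory'  ) }
   time   = { check_max( 16.h  * task.attempt, 'time'    ) }
}
```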

Other parameters

Skip geNomad's neural network-based classification; this will reduce runtime at the cost of accuracy:

  --GENOMAD_ENDTOEND.args="--disable-nn-classification"