6). Running LOMA with example data

Duncan Berger edited this page Jan 3, 2025 · 18 revisions

Downloading the example data

  • Download the test FASTQ data, stored on Zenodo (5.4 GB).
  • Note: This is a non-random subset (~10% of the original dataset size) of Oxford Nanopore sequencing reads of the ZymoBIOMICS HMW DNA Standard, derived from a publicly available dataset. The original reads are available here and the relevant publication is available here.
wget https://zenodo.org/records/13731176/files/ERR7287988.subset.fastq.gz

You'll then need to edit examples/ERR7287988_input.tsv so that it contains the full path to ERR7287988.subset.fastq.gz.
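
Before editing the TSV, it can help to confirm that the download is intact and to print the absolute path to paste in. The helper below is a minimal sketch: `fastq_abs_path` is a hypothetical name, not part of LOMA, and it makes no assumption about the column layout of the input TSV.

```shell
# Hypothetical helper (not part of LOMA): print the absolute path of a
# gzipped FASTQ file after confirming it exists and decompresses cleanly.
fastq_abs_path() {
    [ -f "$1" ] || { echo "missing file: $1" >&2; return 1; }
    # gzip -t reads the whole archive and fails on truncated downloads.
    gzip -t "$1" || { echo "corrupt gzip: $1" >&2; return 1; }
    # Join the resolved directory with the file name.
    echo "$(cd "$(dirname "$1")" && pwd -P)/$(basename "$1")"
}

# Usage (run in the directory holding the download):
#   fastq_abs_path ERR7287988.subset.fastq.gz
```

The printed path can then be copied into examples/ERR7287988_input.tsv.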

Running LOMA

You can then run LOMA with the following command:

./run_loma --input examples/ERR7287988_input.tsv --outdir ERR7287988_results

If you need to restart the run, you can add '-resume', as below:

./run_loma --input examples/ERR7287988_input.tsv --outdir ERR7287988_results -resume

Summary report

We will start with the summary report, found in: ERR7287988_results/SAMEA10644977/summary/SAMEA10644977.ERR7287988.summary_report.html (example here)
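
All of the reports covered in this walkthrough sit in the per-sample summary directory, so one way to find them after a run is to search the output directory. This is a sketch that assumes the `<outdir>/<sample>/summary/*_report.html` layout seen in the paths on this page; `list_reports` is a hypothetical helper, not a LOMA command.

```shell
# Hypothetical helper (not part of LOMA): list every HTML report under an
# output directory, assuming the <outdir>/<sample>/summary/ layout.
list_reports() {
    find "$1" -type f -path '*/summary/*_report.html' | LC_ALL=C sort
}

# Usage:
#   list_reports ERR7287988_results
```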

a). Sample background

These sections provide basic information on pipeline execution and sample metadata.

docs_images_loma_tutorial_P1


b). Read quality control

This section shows the pre- and post-filtering read quality control information. Here, for example, we can see that we started with 5.68 Gb of reads and retained ~80% of all bases post-QC.

docs_images_loma_tutorial_P1


c). Taxonomic summary

We see broadly the expected species represented in this summary. As we move to the lower-abundance results, we see representatives of the right genera but the wrong species, though only in very small proportions (expected with Kraken2).

docs_images_loma_tutorial_P1


d). Binning summary

Here we see the assembly and polishing stages have produced a 40 Mb assembly in 87 contigs, with a fairly large contig N50. Looking at the binning summary, we can see the assembly has been placed into 7 bins, which represent the majority of bases and are all high-quality. The small proportion of contigs assigned to bins can be explained by two factors: the unbinned contigs will predominantly be shorter contigs (<5 kb), and/or they will represent the eukaryotic species (Saccharomyces cerevisiae) found in the sample, which does not get binned by the tools used by LOMA. More detail on binning can be found below.

docs_images_loma_tutorial_P1
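
For readers unfamiliar with the contig N50 mentioned above: it is the length of the contig at which, counting from the longest contig down, the running total first reaches half of the assembly size. The sketch below computes it from a list of contig lengths, one per line; `n50` is a hypothetical helper and not necessarily how LOMA derives the reported value.

```shell
# Hypothetical helper: compute N50 from contig lengths on stdin (one per line).
n50() {
    # Sort longest first, then walk down until half the total is covered.
    sort -nr | awk '
        { len[NR] = $1; total += $1 }
        END {
            running = 0
            for (i = 1; i <= NR; i++) {
                running += len[i]
                if (running * 2 >= total) { print len[i]; exit }
            }
        }'
}

# Usage: six contigs totalling 290 bp; 80 + 70 = 150 covers half, so N50 is 70.
#   printf '%s\n' 20 80 30 70 50 40 | n50
```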


e). Bin quality control

We can see one bin for each of 7/8 species in this sample (all except Saccharomyces cerevisiae). All are high-quality (based on CheckM completeness and contamination scores) and have the expected assembly size and GC content.

docs_images_loma_tutorial_P1
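
To illustrate how completeness and contamination scores translate into a quality label, the sketch below applies the commonly used MIMAG-style cut-offs (high: >90% completeness and <5% contamination; medium: >=50% and <10%). These thresholds are an assumption here, since this page does not state which cut-offs LOMA applies, and the three-column input (bin, completeness, contamination) is a hypothetical layout rather than CheckM's own output format.

```shell
# Hypothetical helper: label bins from tab-separated lines of
# <bin name> <completeness %> <contamination %> read on stdin.
# Thresholds are assumed MIMAG-style cut-offs, not confirmed LOMA settings.
classify_bins() {
    awk -F '\t' -v OFS='\t' '
        {
            quality = "low"
            if ($2 > 90 && $3 < 5) quality = "high"
            else if ($2 >= 50 && $3 < 10) quality = "medium"
            print $1, quality
        }'
}

# Usage:
#   printf 'bin1\t99.5\t0.3\n' | classify_bins
```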


f). In silico phenotyping

Depending on the samples identified, multiple metrics will automatically be evaluated to provide further details on individual bins.

  • In this section we can see the results of sequence typing (using MLST and Krocus) and clonal complex assignment (where available). When re-running, you may not consistently get sequence type assignments for all species. This is due to variation in the assembly process and the per-base accuracy of Nanopore reads, although you should expect most of the samples to be assigned a sequence type.

  • The next section summarizes the genes identified within specific bins. In this case we only provided targets for two species.

  • We also have the results of antimicrobial resistance typing (default: ResFinder & PointFinder) for all species of interest. A more detailed report across 4 different profilers can be found below.

docs_images_loma_tutorial_P1


g). Species-specific typing

Finally, three species-specific subworkflows report relevant metrics for the three species given.

  • We can see the results of E. coli / Shigella spp. typing (summarized across multiple tools), which suggest this is not Shiga toxin-producing Escherichia coli (STEC), enteroinvasive E. coli (EIEC) or another pathogenic E. coli. The classification is listed as 'Unknown' because few tools report a positive identification of non-pathogenic E. coli, although that is likely what this is.

  • We also have the results of consensus typing of Salmonella samples. In this case, the sample appears to be Salmonella enterica subspecies enterica, with antigenic profiles also reported.

  • Similarly, the results of Listeria monocytogenes serotyping using LisSero are also reported.

docs_images_loma_tutorial_P1


Read QC report

Report found in: ERR7287988_results/SAMEA10644977/summary/SAMEA10644977.ERR7287988.readqc_report.html (example here)

a). Quality control summary

The first section contains a summary of the impact of quality control (adapter removal, read quality filtering and human-read removal) on the example dataset.

docs_images_loma_tutorial_P1


b). Read-length vs read-quality

The second section shows a summary plot of read-length vs read-quality pre- (left) and post-treatment (right). In this case most of the dataset has been retained, so there is only a minor increase in read quality.

docs_images_loma_tutorial_P2


c). Read nucleotide composition

The final plot shows the impact of read quality control on the ends of reads. In this case, we can see that the pre-quality-control reads (top row) show lower Phred quality scores at both ends and deviate in GC content, suggesting there are adapters at the 5' ends of the reads. Post quality control (bottom row), these are much less apparent.

docs_images_loma_tutorial_P3
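
As a reminder of what the Phred scores in these plots mean: a score Q corresponds to an estimated per-base error probability P = 10^(-Q/10), so Q10 is a 1-in-10 chance of error and Q20 is 1-in-100. The small sketch below performs that conversion; `phred_to_error` is a hypothetical helper name.

```shell
# Hypothetical helper: convert a Phred quality score to the estimated
# per-base error probability, P = 10^(-Q/10).
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Usage:
#   phred_to_error 20
```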


Taxonomic abundance report

Report found in: ERR7287988_results/SAMEA10644977/summary/SAMEA10644977.ERR7287988.taxonomy_report.html (example here)

a). Kraken2 results

Assuming LOMA was run with a Kraken2 database, this report will include figures showing sequence (per-read) classification results with Kraken2.

  • The first figure shows the proportion of reads assigned at various taxonomic ranks. In this case we can see ~40% of reads could be assigned to a species and ~36% could be assigned at the subspecies level. ~22% were missing a rank, which typically means the taxonomy ID hasn't been assigned a rank in the relevant database.

  • The second figure shows the 50 most abundant species. In this case we can see Enterococcus faecalis is the best-represented species by read count.

docs_images_loma_tutorial_P1
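
A rough version of the per-rank proportions can be computed from a standard Kraken2 report, whose tab-separated columns are: percentage of reads in the clade, reads in the clade, reads assigned directly to the taxon, rank code, taxonomy ID and name. The sketch below sums the directly assigned reads per rank code; it is an assumption that this approximates the figure, as LOMA may derive its plot differently (for example, from a different taxonomy source), and `reads_per_rank` is a hypothetical helper.

```shell
# Hypothetical helper: read a standard six-column Kraken2 report on stdin and
# print the percentage of reads assigned directly at each rank code.
reads_per_rank() {
    awk -F '\t' '
        { direct[$4] += $3; total += $3 }
        END {
            if (total == 0) exit
            for (rank in direct)
                printf "%s\t%.1f\n", rank, 100 * direct[rank] / total
        }' | LC_ALL=C sort
}

# Usage (the report file name is a placeholder):
#   reads_per_rank < sample.kraken2.report.txt
```

Rank codes follow Kraken2's convention (U unclassified, D domain, G genus, S species, with codes such as S1 for ranks below species).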


b). Bracken (Kraken2) results

  • This figure shows the results of processing the Kraken2 results with Bracken (Bayesian Reestimation of Abundance with KrakEN), which gives more accurate abundance estimates.

docs_images_loma_tutorial_P1


c). Sylph results

  • This figure shows the results of Sylph taxonomic profiling, which uses the Genome Taxonomy Database (GTDB) and so will only report prokaryotic (bacterial and archaeal) hits.

docs_images_loma_tutorial_P1


Binning report

Report found in: ERR7287988_results/SAMEA10644977/summary/SAMEA10644977.ERR7287988.summary_binning_report.html (example here)

a). Summary figure

This report shows a summary of the results of metagenomic binning and per-contig statistics. In the first figure, we can see 7 different bins (represented by the non-grey coloured circles), representing 7 species. In all cases these are single-contig metagenome-assembled genomes (i.e. 1 contig per bin).

docs_images_loma_tutorial_P1


Hovering the cursor over any of the bins will show detailed information on various metrics for the selected contig (top) and metrics across the bin (bottom). In this case we can see, for example, the contig length and the contig's nearest taxonomic hit (using Skani against GTDB). Across the entire bin, we can also see that the bin is high-quality (low/no contamination and high completeness).

docs_images_loma_tutorial_P1

We can also see a cluster of short contigs at the bottom of the plot (~38% GC content), which most likely represents Saccharomyces cerevisiae, but this species isn't in GTDB and so is not reported. The other major reason for contigs not being binned is that they are bacterial plasmids, which are harder to bin as they don't necessarily have the same coverage and GC content as the chromosomal DNA. We use geNomad to calculate the plasmid score. As you can see in the figure below, the nearest hit to the highlighted contig is Salmonella enterica and the plasmid score is 0.9946, strongly suggesting this is a plasmid linked to bin 000004.


docs_images_loma_tutorial_P1
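
GC content, used above to pick out the ~38% GC cluster, is the percentage of G and C bases in a sequence. As a sketch of how to check it for individual contigs outside the report, the helper below reads a FASTA file on stdin and prints the GC percentage per record; `gc_percent` is a hypothetical name, and ambiguous bases such as N are counted in the length but not as G or C.

```shell
# Hypothetical helper: print "<contig name> <GC %>" for each FASTA record
# read on stdin.
gc_percent() {
    awk '
        function report() { printf "%s\t%.1f\n", name, (n ? 100 * gc / n : 0) }
        /^>/ {
            if (seen) report()
            seen = 1; name = substr($1, 2); gc = 0; n = 0
            next
        }
        {
            seq = toupper($0)
            n += length(seq)              # count bases before gsub edits seq
            gc += gsub(/[GC]/, "", seq)   # gsub returns the number of G/C bases
        }
        END { if (seen) report() }
    '
}

# Usage (the FASTA file name is a placeholder):
#   gc_percent < contigs.fasta
```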


b). Quality summaries

The table and figures below provide summaries of bin quality, either as summary subplots - to make it easier to identify data quality issues - or as per-bin metrics.

docs_images_loma_tutorial_P1


Antimicrobial resistance report

Report found in: ERR7287988_results/SAMEA10644977/summary/SAMEA10644977.ERR7287988.amr_report.html (example here)

LOMA can report the results of up to 4 tools for detecting antimicrobial resistance (AMR) associated genes/mutations: ABRicate, AMRFinderPlus, ResFinder and RGI.

a). Summary report

Example results of running AMRFinderPlus on the test data are shown in the figure below. In the first section, the results for each metagenome-assembled genome are summarized, including the resistance genes identified and the drug class to which each is thought to confer resistance.

docs_images_loma_tutorial_P1


b). Detailed report

In the second section, results are reported per gene and include the amino acid mutation where applicable.

docs_images_loma_tutorial_P2