failed to get modbase info for record 6bcfdea5-e9eb-4fc9-a72e-81744a77df80, Skipped: AUX data not found #335

Tang-pro · 2025-01-08T01:02:42Z

@ArtRand @rmp
Hey,
I received a file named pass.fq.gz, which the company basecalled with Dorado. I then mapped the reads from the .fq.gz file to the transcriptome. When I run this command:

modkit pileup ../reftrans/Y1_5_1.bam --ref /public/home/DRS/data_241224/Isoquant/Y1/Y1quant/Ref_trans/Y1_transcripts.fa Y1_5_1/Y1_5_1.bed --with-header -t 20 --motif DRACH 2

the following error occurs:
calculated chunk size: 30, interval size 100000, processing 3000000 positions concurrently

attempting to sample 10042 reads
failed to get modbase info for record 1e347aaf-677a-48db-98d9-99b45a8f148d, Skipped: AUX data not found
failed to get modbase info for record 741035d2-194d-493b-801b-712218eb6635, Skipped: AUX data not found
failed to get modbase info for record 3ca5a253-eaea-4ec8-99c4-5a70674c55a1, Skipped: AUX data not found


ArtRand · 2025-01-08T14:37:22Z

Hello @Tang-pro,

The modified base information is carried in the MM/ML auxiliary tags, usually in SAM/BAM format; the "AUX data not found" message means those tags are missing from the record. There is a specification here, see page 7. Although it is possible to preserve this information in FASTQ files, it tends to be brittle and I don't recommend it. I would try to get an unaligned BAM file from Dorado and align that with dorado aligner (or perform the alignment during basecalling). Please find the documentation here.
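To see how many alignments in a file actually carry the MM/ML tags, one option is to scan the SAM text (for example the output of samtools view -h) and count records without them. The following is a minimal sketch, not part of modkit; the function names has_modbase_tags and count_missing are hypothetical, and it only checks that the tags are present, not that their contents are valid.

```python
from typing import Iterable, Tuple


def has_modbase_tags(sam_line: str) -> bool:
    """Return True if a SAM alignment line carries both an MM and an ML tag."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 SAM columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    tag_names = {field.split(":", 1)[0] for field in fields[11:]}
    return "MM" in tag_names and "ML" in tag_names


def count_missing(sam_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (records without MM/ML, total records), skipping @ header lines."""
    missing = 0
    total = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        total += 1
        if not has_modbase_tags(line):
            missing += 1
    return missing, total


# Illustrative records (made up for this example, not taken from the issue).
example = [
    "@HD\tVN:1.6",
    "read1\t0\ttx1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\tMM:Z:A+a?,0;\tML:B:C,200",
    "read2\t0\ttx1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0",
]
missing, total = count_missing(example)
print(f"{missing} of {total} records lack MM/ML tags")
```

If every record is reported as missing the tags, the modification information was most likely lost before or during alignment rather than inside modkit.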

Tang-pro · 2025-01-09T02:02:45Z

@ArtRand

Thank you! The Dorado step was performed by a partner company. However, I still have a question. I have two species, which are allotetraploids belonging to the same genus. If I want to compare the two species later, should I use their respective reference genomes or a unified reference genome?

ArtRand · 2025-01-09T15:15:56Z

Hello @Tang-pro,

That's an interesting question!

For most of the functions in modkit you'll want to use the "unified reference genome". For example, the differential methylation commands all require bedMethyl (pileup) tables as input, and these tables must all use the same reference. However, if you think the unified reference loses some information, you could use modkit stats (docs) on the pileups to the species' respective genomes and compare homologous regions. Of course, you have to decide what those regions are. Let me know how it goes and if you have any more questions.
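As a rough illustration of comparing homologous regions across two separate pileups, the sketch below aggregates a bedMethyl table over one region by summing modified and valid calls. It is not what modkit stats does internally. It assumes the modkit bedMethyl layout in which column 4 starts with the modification code, column 10 is the valid coverage and column 12 is the modified count; the function name region_fraction and the example rows are hypothetical.

```python
from typing import Iterable, Optional


def region_fraction(
    bedmethyl_lines: Iterable[str],
    chrom: str,
    start: int,
    end: int,
    mod_code: str = "a",
) -> Optional[float]:
    """Fraction of modified calls over [start, end) on chrom, or None if no coverage."""
    n_mod = 0
    n_valid = 0
    for line in bedmethyl_lines:
        fields = line.split()
        # Skip header lines and short or malformed rows.
        if len(fields) < 12 or not fields[1].isdigit():
            continue
        # Assumption: column 4 may also carry motif details after a comma.
        code = fields[3].split(",")[0]
        if fields[0] != chrom or code != mod_code:
            continue
        position = int(fields[1])
        if start <= position < end:
            n_valid += int(fields[9])   # column 10: valid coverage (assumed layout)
            n_mod += int(fields[11])    # column 12: modified calls (assumed layout)
    return n_mod / n_valid if n_valid else None


# Made-up rows for two species, each in its own reference coordinates.
species_a = [
    "tx1\t10\t11\ta\t20\t+\t10\t11\t255,0,0\t20\t50.00\t10\t10\t0\t0\t0\t0\t0",
    "tx1\t15\t16\ta\t10\t+\t15\t16\t255,0,0\t10\t20.00\t2\t8\t0\t0\t0\t0\t0",
]
species_b = [
    "txB\t40\t41\ta\t10\t+\t40\t41\t255,0,0\t10\t10.00\t1\t9\t0\t0\t0\t0\t0",
]
frac_a = region_fraction(species_a, "tx1", 0, 100)
frac_b = region_fraction(species_b, "txB", 0, 100)
print(frac_a, frac_b)
```

The pairing of regions between the two references (here tx1 and txB) still has to come from your own homology mapping.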

Tang-pro · 2025-01-10T10:32:46Z

@ArtRand

Thank you for your patient reply!

I have another question: The pass.fq.gz file provided by the sequencing company already contains methylation modification information and polyA tail length. Can I follow this workflow: first, align the fq file with these modifications to the reference genome to extract a reference transcriptome, and then align the fq file to the reference transcriptome?

However, during this second alignment step, the resulting bam file no longer retains the polyA tail length and methylation information. Some people have suggested that this might be due to not using the -y parameter in minimap2. Could the errors mentioned earlier be related to the use of the -y parameter?

As for why I need to take such a roundabout approach: unfortunately, the company did not provide a bam file, and I do not have sufficient GPU resources to rerun Dorado.

ArtRand · 2025-01-10T18:42:41Z

Hello @Tang-pro,

"The pass.fq.gz file provided by the sequencing company already contains methylation modification information and polyA tail length."

It's hard for me to recommend a method to use data in this form. Can you request the unaligned BAM from the sequencing provider? If so, you can pass this file directly to dorado aligner, which will retain metadata such as the poly-A tail length. You could use modkit repair (docs) to restore the base modification information on the reads, but this command won't add the poly-A tail length information. I imagine you could write a script that adds the metadata back. In my experience, other methods that pass these data around with FASTQ comments are difficult to troubleshoot (and really outside of the scope of Modkit).
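Before attempting anything with the FASTQ, it may help to confirm what its header lines actually contain. minimap2's -y option copies FASTA/Q comments to the output, so it can only carry over what is in those comments. The sketch below counts reads whose header holds MM, ML and a poly-A tag. It assumes the tags are stored as tab-separated SAM-style TAG:TYPE:VALUE fields after the read name, that pt is the poly-A tail tag, and that records are exactly four lines each; the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Tuple

# SAM optional field prefix: two-character tag, a type letter, then the value.
TAG_RE = re.compile(r"^[A-Za-z][A-Za-z0-9]:[AifZHB]:")


def header_tags(header: str) -> Dict[str, Tuple[str, str]]:
    """Parse tab-separated SAM-style tags from a FASTQ header line."""
    parts = header.rstrip("\n").split("\t")
    tags = {}
    for part in parts[1:]:  # parts[0] is the '@read_id' token
        if TAG_RE.match(part):
            name, tag_type, value = part.split(":", 2)
            tags[name] = (tag_type, value)
    return tags


def tag_counts(
    fastq_lines: Iterable[str], wanted: Tuple[str, ...] = ("MM", "ML", "pt")
) -> Tuple[int, Dict[str, int]]:
    """Return (number of reads, how many reads carry each wanted tag)."""
    counts: Counter = Counter()
    n_reads = 0
    for index, line in enumerate(fastq_lines):
        if index % 4 != 0:  # assumes strict 4-line FASTQ records
            continue
        n_reads += 1
        tags = header_tags(line)
        for name in wanted:
            if name in tags:
                counts[name] += 1
    return n_reads, dict(counts)


# Made-up records: one with tags in the header, one with a plain comment.
fastq_example = [
    "@r1\tMM:Z:A+a?,0;\tML:B:C,200\tpt:i:85", "ACGT", "+", "IIII",
    "@r2 some comment", "ACGT", "+", "IIII",
]
n_reads, counts = tag_counts(fastq_example)
print(n_reads, counts)
```

For a gzipped file, the same function can be fed from gzip.open(path, "rt"). If the counts are far below the number of reads, the tags were not written for every read and no aligner option will recover them.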

ArtRand added the "troubleshooting" (workflow and data preparation questions) label on Jan 8, 2025

ArtRand added the "question" (Looking for clarification on inputs and/or outputs) label and removed the "troubleshooting" (workflow and data preparation questions) label on Jan 9, 2025
