failed to get modbase info for record 6bcfdea5-e9eb-4fc9-a72e-81744a77df80, Skipped: AUX data not found #335

Open
Tang-pro opened this issue Jan 8, 2025 · 5 comments
Labels
question Looking for clarification on inputs and/or outputs

Comments

@Tang-pro

Tang-pro commented Jan 8, 2025

@ArtRand @rmp
Hey,
I received a file named pass.fq.gz, which the sequencing company basecalled with Dorado. I then mapped the reads from the .fq.gz file to the transcriptome. When I run this command:
modkit pileup ../reftrans/Y1_5_1.bam --ref /public/home/DRS/data_241224/Isoquant/Y1/Y1quant/Ref_trans/Y1_transcripts.fa Y1_5_1/Y1_5_1.bed --with-header -t 20 --motif DRACH 2
the following error occurs:
calculated chunk size: 30, interval size 100000, processing 3000000 positions concurrently

attempting to sample 10042 reads
failed to get modbase info for record 1e347aaf-677a-48db-98d9-99b45a8f148d, Skipped: AUX data not found
failed to get modbase info for record 741035d2-194d-493b-801b-712218eb6635, Skipped: AUX data not found
failed to get modbase info for record 3ca5a253-eaea-4ec8-99c4-5a70674c55a1, Skipped: AUX data not found

@ArtRand
Contributor

ArtRand commented Jan 8, 2025

Hello @Tang-pro,

The modified base information is contained in the MM/ML auxiliary tags, usually in SAM format. There is a specification here, see page 7. Although it is possible to preserve this information in FASTQ files, that approach tends to be brittle and I don't recommend it. I would try to get an unaligned BAM file from Dorado and align it with dorado aligner (or perform the alignment during basecalling). Please find the documentation here.
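One quick way to confirm this diagnosis is to look at the optional fields of a few alignment records and see whether the MM/ML tags are there. The following is a minimal sketch, assuming the records are available as SAM text (for example, printed by `samtools view`). The function name `has_modbase_tags` is hypothetical and not part of modkit.

```python
def has_modbase_tags(sam_line: str) -> bool:
    """Return True if a SAM alignment record carries both MM and ML tags.

    Older basecallers wrote the tags in mixed case (Mm/Ml), so both
    spellings are accepted here.
    """
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    tags = {field[:2] for field in fields[11:]}
    has_mm = "MM" in tags or "Mm" in tags
    has_ml = "ML" in tags or "Ml" in tags
    return has_mm and has_ml


# Two synthetic records: one with the tags, one without them
# (as can happen after a round trip through FASTQ).
with_tags = "read1\t0\ttx1\t1\t60\t4M\t*\t0\t0\tACGT\t!!!!\tMM:Z:A+a,0;\tML:B:C,200"
without_tags = "read2\t0\ttx1\t1\t60\t4M\t*\t0\t0\tACGT\t!!!!\tNM:i:0"
print(has_modbase_tags(with_tags), has_modbase_tags(without_tags))
```

If the check fails for every record in the BAM, the tags were most likely lost before or during alignment rather than inside modkit.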

@ArtRand ArtRand added the troubleshooting workflow and data preparation questions label Jan 8, 2025
@Tang-pro
Author

Tang-pro commented Jan 9, 2025

@ArtRand

Thank you! The Dorado step was performed by a partner company. However, I still have a question. I have two species, which are allotetraploids belonging to the same genus. If I want to compare the two species later, should I use their respective reference genomes or a unified reference genome?

@ArtRand
Contributor

ArtRand commented Jan 9, 2025

Hello @Tang-pro,

That's an interesting question!

For most of the functions in modkit you'll want to use the "unified reference genome". For example, the differential methylation commands all require bedMethyl (pileup) tables as input, and these tables must all use the same reference. However, if you think the unified reference loses some information, you could use modkit stats (docs) on the pileups to the species' respective genomes and compare homologous regions. Of course, you have to decide what those regions are. Let me know how it goes and if you have any more questions.
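The homologous-region comparison could be sketched roughly as follows. This is an illustration only, not a modkit feature. It assumes the bedMethyl column layout described in the modkit pileup documentation (column 10 is the valid coverage, column 12 the modified-call count), whitespace-separated rows, and region pairs that you have chosen yourself. The function name `region_fraction` is hypothetical.

```python
from typing import Iterable, Tuple


def region_fraction(
    rows: Iterable[str], chrom: str, start: int, end: int
) -> Tuple[int, int, float]:
    """Pool modified and valid counts over [start, end) on one contig.

    Returns (n_mod, n_valid, n_mod / n_valid); the fraction is NaN when
    the region has no valid coverage.
    """
    n_mod = 0
    n_valid = 0
    for row in rows:
        cols = row.split()
        # Skip blank lines, header lines, and anything that is not a data row.
        if len(cols) < 12 or not cols[1].isdigit():
            continue
        if cols[0] != chrom or not (start <= int(cols[1]) < end):
            continue
        n_valid += int(cols[9])   # column 10: valid coverage (assumed layout)
        n_mod += int(cols[11])    # column 12: modified calls (assumed layout)
    fraction = n_mod / n_valid if n_valid else float("nan")
    return n_mod, n_valid, fraction


# Synthetic rows standing in for the two species' pileups.
species_a = [
    "tx1\t10\t11\ta\t20\t+\t10\t11\t255,0,0\t20\t25.00\t5\t15\t0\t0\t0\t0\t0",
    "tx1\t15\t16\ta\t20\t+\t15\t16\t255,0,0\t20\t50.00\t10\t10\t0\t0\t0\t0\t0",
]
species_b = [
    "txB\t110\t111\ta\t10\t+\t110\t111\t255,0,0\t10\t20.00\t2\t8\t0\t0\t0\t0\t0",
]
print(region_fraction(species_a, "tx1", 0, 100))
print(region_fraction(species_b, "txB", 100, 200))
```

Pooling counts over a region, rather than averaging per-site percentages, keeps low-coverage sites from dominating the comparison. Whether that is the right summary for a given analysis is a separate judgment call.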

@ArtRand ArtRand added question Looking for clarification on inputs and/or outputs and removed troubleshooting workflow and data preparation questions labels Jan 9, 2025
@Tang-pro
Author

@ArtRand

Thank you for your patient reply!

I have another question: The pass.fq.gz file provided by the sequencing company already contains methylation modification information and polyA tail length. Can I follow this workflow: first, align the fq file with these modifications to the reference genome to extract a reference transcriptome, and then align the fq file to the reference transcriptome?

However, during this second alignment step, the resulting bam file no longer retains the polyA tail length and methylation information. Some people have suggested that this might be due to not using the -y parameter in minimap2. Could the errors mentioned earlier be related to the use of the -y parameter?

As for why I need to take such a roundabout approach: unfortunately, the company did not provide a bam file, and I do not have sufficient GPU resources to rerun Dorado.

@ArtRand
Contributor

ArtRand commented Jan 10, 2025

Hello @Tang-pro,

> The pass.fq.gz file provided by the sequencing company already contains methylation modification information and polyA tail length.

It's hard for me to recommend a method to use data in this form. Can you request the unaligned BAM from the sequencing provider? If so, you can pass this file directly to dorado aligner, which will retain metadata such as the poly-A tail length. You could use modkit repair (docs) to restore the base modification information on the reads, but this command won't add the poly-A tail length information. I imagine you could write a script that adds the metadata back. In my experience, other methods that pass these data around with FASTQ comments are difficult to troubleshoot (and really outside of the scope of Modkit).
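A script of that kind might look like the following minimal sketch. It rests on two assumptions: the FASTQ header comments are tab-separated SAM-style `TAG:TYPE:VALUE` fields (the form that `samtools fastq -T` writes and that `minimap2 -y` copies through), and the poly-A estimate is stored under a `pt:i` tag. Check a few headers in your own file before relying on either assumption. The sketch works on SAM text lines, and all function names are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Sequence

# Matches the start of a SAM-style optional field, e.g. "pt:i:87".
TAG_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9]:[AifZHB]:")


def fastq_header_tags(header: str) -> Dict[str, str]:
    """Map tag name -> full TAG:TYPE:VALUE field found in a FASTQ header."""
    parts = header.rstrip("\n").split("\t")
    return {p[:2]: p for p in parts[1:] if TAG_PATTERN.match(p)}


def collect_tags(
    fastq_lines: Iterable[str], wanted: Sequence[str] = ("pt",)
) -> Dict[str, List[str]]:
    """Collect the wanted tags per read ID, assuming 4-line FASTQ records."""
    tags_by_read: Dict[str, List[str]] = {}
    for i, line in enumerate(fastq_lines):
        if i % 4 != 0 or not line.startswith("@"):
            continue  # only header lines carry comments
        read_id = line[1:].split()[0]
        found = fastq_header_tags(line)
        tags_by_read[read_id] = [found[t] for t in wanted if t in found]
    return tags_by_read


def restore_tags(sam_line: str, tags_by_read: Dict[str, List[str]]) -> str:
    """Append tags the record lacks; returns the line without a trailing newline."""
    if sam_line.startswith("@"):
        return sam_line.rstrip("\n")  # SAM header lines pass through unchanged
    fields = sam_line.rstrip("\n").split("\t")
    present = {f[:2] for f in fields[11:]}
    for tag in tags_by_read.get(fields[0], []):
        if tag[:2] not in present:
            fields.append(tag)
    return "\t".join(fields)


# Synthetic example: one FASTQ record and its alignment without the pt tag.
fastq = ["@read1\tpt:i:87\tMM:Z:A+a,0;\tML:B:C,200\n", "ACGT\n", "+\n", "!!!!\n"]
sam = "read1\t0\ttx1\t1\t60\t4M\t*\t0\t0\tACGT\t!!!!\tNM:i:0"
print(restore_tags(sam, collect_tags(fastq)))
```

Only the poly-A tag is copied by default. The base modification tags are deliberately left to modkit repair, as suggested above.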
