Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fetching sequence failed error message when running pileup #308

Open
niradsp opened this issue Nov 27, 2024 · 7 comments
Open

fetching sequence failed error message when running pileup #308

niradsp opened this issue Nov 27, 2024 · 7 comments
Labels
question Looking for clarification on inputs and/or outputs

Comments

@niradsp
Copy link

niradsp commented Nov 27, 2024

Here is my procedure. First I ran dorado basecaller, where I required all modifications to be called:
dorado basecaller -v sup,inosine_m6A,pseU,m5C --min-qscore 9 --emit-moves -b 64 --mm2-opts {params} --estimate-poly-a --reference {input.reference} -x cuda:all {input.input_files} > {output.output}

Next, I used samtools to sort this file:
samtools sort -o {output} -@50 {input}
Followed by indexing:
samtools index {input}

Next I run pileup by setting the motif parameter as DRACH 2.

modkit pileup --ref {input.genome } --log-filepath test.log --motif DRACH 2 {input.bam} test.bed

Note that the reference used in the basecaller and in the pileup is the same one. I am however getting an error message:

fetching sequence failed, Unknown sequence name: chr1.
fetching sequence failed, Unknown sequence name: chr1.
fetching sequence failed, Unknown sequence name: chr1.
fetching sequence failed, Unknown sequence name: chr1.

And on and on.
If I run the program without using --motif and --ref, it will run.

Thanks,
Nirad

@ArtRand
Copy link
Contributor

ArtRand commented Nov 28, 2024

Hello @niradsp,

These steps look fine. This problem can occur when the reference that you aligned to and the one you're using for pileup don't match. Are you sure that {input.reference} is the same FASTA as {input.genome} and that {input.genome} has an appropriate FASTA index?

@banskotan2
Copy link

Hello @ArtRand ,
Yes, the problem was because the .fai index file was 0 byte for some reason. After recreating the fai file, everything worked fine.
Also, I had a quick question about DMR. Is there a way to run sort of a regression model? I have 3 timepoints, 0, 24 and 48. I want to see if there are patterns that can be identified using regression? If not, a likeihood ratio type model (provided by DESeq2 for example) would work well too. If this type of analysis cannot be done using DMR, do you suggest any package?
Thanks in advance.

@ArtRand
Copy link
Contributor

ArtRand commented Dec 11, 2024

Hello @banskotan2,

Modkit has a dmr module for pairwise comparisons. You can perform all pairwise comparisons with modkit dmr multi and look for trends or changes at specified regions, there are more details in the documentation. If you are interested in changes in methylation (i.e. methylated vs canonical) there is an effect-size model described in the docs that works on individual positions and will (optionally) segment the genome to de novo different regions.

@ArtRand ArtRand added the question Looking for clarification on inputs and/or outputs label Dec 11, 2024
@banskotan2
Copy link

@ArtRand
Thank you for your help. We are now facing a couple of dilemmas here. For m6A analysis, what do you think works the best? We notice that for most bases with m6A, there also appears to be a peak for Inosine.

  1. Pileup without filtering, but keeping only DRACH. Here I notice m6A along with Inosine. DMR produced large number of significant hits, but the pvalue seems to be joint Inosine and m6A?
  2. Pileup keeping DRACH. Then I use awk to keep m6A only. I run DMR. It appears to produce lower significant hits.
  3. Pileup, but using --ignore to ignore Inosine. Will this work better than number 2?
    4)Rerun Dorado by ignoring Inosine. This is probably not required, right? I would like to keep all modifications if possible.

What if I want non-canonical m6A and Inosine is interfering?

Thanks,
Nirad

@ArtRand
Copy link
Contributor

ArtRand commented Jan 9, 2025

Hello @banskotan2, @niradsp,

We notice that for most bases with m6A, there also appears to be a peak for Inosine.

I apologize, but could you give me a few more details about the "peak for Inosine" is in this context? Do you mean positions with high percentage of m6A also have high percentages of Inosine? Do you expect that there is Inosine in your samples?

  1. Pileup without filtering, but keeping only DRACH. Here I notice m6A along with Inosine. DMR produced large number of significant hits, but the pvalue seems to be joint Inosine and m6A?

Pileup without filtering is probably going to generate a lot of false positive calls, so I generally would not recommend using --no-filtering unless you have very shallow coverage.

DMR produced large number of significant hits, but the pvalue seems to be joint Inosine and m6A?

That's correct, the effect-size model that produces MAP-based p-values is a "modified vs canonical" model, so it will treat Inosine and m6A as alternate to canonical (explained in point 4 here).

  1. Pileup keeping DRACH. Then I use awk to keep m6A only. I run DMR. It appears to produce lower significant hits.

I'm pretty sure that using awk to filter out the Inosine records from the bedMethyl then feeding the reduced bedMethyl to modkit dmr should produce an error at a lot of the genomic positions. Maybe check the logs? I would not recommend this approach in any case.

  1. Pileup, but using --ignore to ignore Inosine. Will this work better than number 2?

This approach will work better than (2). You could also use --combine-mods. Leading to the final point:

  1. Rerun Dorado by ignoring Inosine. This is probably not required, right? I would like to keep all modifications if possible.

I would not re-run Dorado if you already have used the latest models.

I'm interested in understanding your use case so I can help you most effectively. What coverage do you have at the DRACH positions? Are you trying to find differentially modified positions, regions, or whole transcripts? I'm trying to figure out how researchers like yourself are using dmr for epitranscriptomics.

@banskotan2
Copy link

banskotan2 commented Jan 9, 2025

Hello @ArtRand

I am having discussions internally to figure out the best method to run the pipeline.

Hello @banskotan2, @niradsp,

We notice that for most bases with m6A, there also appears to be a peak for Inosine.

I apologize, but could you give me a few more details about the "peak for Inosine" is in this context? Do you mean positions with high percentage of m6A also have high percentages of Inosine? Do you expect that there is Inosine in your samples?

Yes, from what I remember. When I ran modkit without any filtering, each high-probability m6A also had high probability Inosine. Because I saw both Inosine and m6A in the results for just the DRACH motif, I decided to then try to remove inosine. For the DRACH, I feel we should just see m6A.

  1. Pileup without filtering, but keeping only DRACH. Here I notice m6A along with Inosine. DMR produced large number of significant hits, but the pvalue seems to be joint Inosine and m6A?

Pileup without filtering is probably going to generate a lot of false positive calls, so I generally would not recommend using --no-filtering unless you have very shallow coverage.

DMR produced large number of significant hits, but the pvalue seems to be joint Inosine and m6A?

That's correct, the effect-size model that produces MAP-based p-values is a "modified vs canonical" model, so it will treat Inosine and m6A as alternate to canonical (explained in point 4 here).

  1. Pileup keeping DRACH. Then I use awk to keep m6A only. I run DMR. It appears to produce lower significant hits.

I'm pretty sure that using awk to filter out the Inosine records from the bedMethyl then feeding the reduced bedMethyl to modkit dmr should produce an error at a lot of the genomic positions. Maybe check the logs? I would not recommend this approach in any case.

I see this type of log output. Are these the errors you are referring to?

[src/dmr/single_site.rs::211][2024-12-11 09:56:15][DEBUG] batch failed, failed to aggregate counts, invalid data, valid coverage (227) is not equal to the sum of canonical and modified counts (226), chrom: chr1 starting at 6635134
[src/dmr/single_site.rs::211][2024-12-11 09:56:15][DEBUG] batch failed, failed to aggregate counts, invalid data, valid coverage (84) is not equal to the sum of canonical and modified counts (83), chrom: chr1 starting at 15668868
[src/dmr/single_site.rs::211][2024-12-11 09:56:15][DEBUG] batch failed, failed to aggregate counts, invalid data, valid coverage (1) is not equal to the sum of canonical and modified counts (0), chrom: chr1 starting at 12455253

Here is how I am filtering the pileup:

I provide 2 snakemake rules:

        input:
                bam="Sorted_Bam/{condition}_sorted.bam",

                ref="/data/banskotan2/Annotation/HG38/Homo_sapiens_HG38_GRCH38_104.fa",

                bai="Sorted_Bam/{condition}_sorted.bam.bai",

        output:
                bed="pileup_bed/{condition}_pileup.bed",

                log="pileup_log/{condition}_pileup.log",

        params:
                motif="DRACH 2"

        threads:50
        shell:
                "modkit pileup {input.bam} {output.bed} -t 50 --log-filepath {output.log} --motif {params.motif} --ref {input.ref}"`


rule m6A_only:

        input:
                "pileup_bed/{condition}_pileup.bed"

        output:
                "pileup_bed/{condition}_pileup_m6A.bed"

        shell:
                "awk -v FS=\"\\t\" '$4==\"a\"' {input} >{output}"

First, I generated pileup data using DRACH. Next, I used "awk -v FS="\t" '$4=="a"' {input} >{output}" to keep just the "a".
This m6A_only output is what I am feeding to DMR. This is incorrect then?

  1. Pileup, but using --ignore to ignore Inosine. Will this work better than number 2?

This approach will work better than (2). You could also use --combine-mods. Leading to the final point:

Which method do you recommend? Combine-mods or --ignore? Specifically, for DRACH, do you think we get more significant hits (in terms of pvalue) with combine?

  1. Rerun Dorado by ignoring Inosine. This is probably not required, right? I would like to keep all modifications if possible.

I would not re-run Dorado if you already have used the latest models.

Yes, I agree. I want all modifications.

I'm interested in understanding your use case so I can help you most effectively. What coverage do you have at the DRACH positions? Are you trying to find differentially modified positions, regions, or whole transcripts? I'm trying to figure out how researchers like yourself are using dmr for epitranscriptomics.

Regarding DRACH, some regions are showing values(must be count) as high as 100. Honestly, at this point, we would like to explore everything.
We have multiple projects of Nanopore epitranscriptomic data and I am looking to create the most effective pipeline.
There is also a big interest in pseudouridine. Do I need to do any filtering here?

Actually, we are trying to figure out what is the most effective way to analyze the data. Any suggestions will be helpful at this stage. I am primarily looking at single positions, but also ran DMR using segmentation.

@ArtRand
Copy link
Contributor

ArtRand commented Jan 10, 2025

Hello @banskotan2,

Yes, from what I remember. When I ran modkit without any filtering, each high-probability m6A also had high probability Inosine. Because I saw both Inosine and m6A in the results for just the DRACH motif, I decided to then try to remove inosine. For the DRACH, I feel we should just see m6A.

Something about this statement doesn't make sense to me. The percent modification at given position can only sum to 100%, if this isn't the case that may be a bug. Do you have an example? Seeing DRACH motifs with high %-inosine is potentially a systematic error in the modification caller, do you have an example of one of these positions we could look at?

As we have established, to remove the inosine calls, the recommended method is to use --ignore 17596 when generating the bedMethyl. Filtering the bedMethyl may remove records where a DRACH motif has high %-inosine and low %-m6A and you probably don't want to filter out these records.

I see this type of log output. Are these the errors you are referring to?

Yes exactly. There is a check in the DMR program that validates that it has all of the records for a genomic position. When you use awk to filter the bedMethyl it won't find these records. See the previous point for how to do prepare the bedMethyl for 6mA-only analysis.

First, I generated pileup data using DRACH. Next, I used "awk -v FS="\t" '$4=="a"' {input} >{output}" to keep just the "a". This m6A_only output is what I am feeding to DMR. This is incorrect then?

No, you can't do this. I'll update the documentation to mention that this won't work in case people don't see the errors in the log.

Which method do you recommend? Combine-mods or --ignore? Specifically, for DRACH, do you think we get more significant hits (in terms of pvalue) with combine?

To get the highest per-molecule per-site accuracy, we recommend using --ignore. You may get more significant hits with --combine-mods, however, since lower confidence (but still passing) inosine calls will be used.

Regarding DRACH, some regions are showing values(must be count) as high as 100.

You must be looking at percent modified (schema), just to be sure.

We have multiple projects of Nanopore epitranscriptomic data and I am looking to create the most effective pipeline.

Sounds exciting! You're off to a good start. I would fix the filtering before DMR. If you can, run a parallel step where you add --ignore 17596 to the pileup command. You could also add DMR over transcripts as the regions. I've got some ideas for how to add a p-value to the regions that I need to test out. I've also seen some interesting results with entropy over transcripts. The localize command may also give you an interesting, qualitative, look at changes in 6mA levels. Someone as already requested that localize outputs something more like a metagene plot. I've added this to the todo list. I'm also thinking about adding an epi-allele command so you can cluster transcripts based on their methylation patterns... lots of things to do.

There is also a big interest in pseudouridine. Do I need to do any filtering here?

Take a look at modkit sample-probs --hist and look at the output probability histograms. The official recommendation is that the default filtering will give the expected accuracy results, but increasing the threshold for pseU will decrease the FPR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Looking for clarification on inputs and/or outputs
Projects
None yet
Development

No branches or pull requests

3 participants