Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

I have some questions regarding the bed methyl file, would you help? #340

Open
ralanany opened this issue Jan 14, 2025 · 1 comment
Open
Labels
question Looking for clarification on inputs and/or outputs

Comments

@ralanany
Copy link

ralanany commented Jan 14, 2025

Q1: Would you explain why there is difference?

I have bed methyl file for cytosine modification, what I have understood for each modified site, it measures the frequency of hydroxy methyl or methyl cytosine modification.

bed methyl file

1- chr1 10468 10469 h 6 . 10468 10469 255,0,0 6 0.00 0 1 5 0 1 0 0
2- chr1 10468 10469 m 6 . 10468 10469 255,0,0 6 83.33 5 1 0 0 1 0 0
3- chr1 10470 10471 h 6 . 10470 10471 255,0,0 6 0.00 0 1 5 0 1 0 0
4- chr1 10470 10471 m 6 . 10470 10471 255,0,0 6 83.33 5 1 0 0 1 0 0
5- chr1 10637 10638 h 7 . 10637 10638 255,0,0 7 28.57 2 0 5 0 0 0 0
6- chr1 10637 10638 m 7 . 10637 10638 255,0,0 7 71.43 5 0 2 0 0 0 0
7- chr1 10640 10641 h 6 . 10640 10641 255,0,0 6 16.67 1 0 5 0 1 0 0
8- chr1 10640 10641 m 6 . 10640 10641 255,0,0 6 83.33 5 0 1 0 1 0 0
9- chr1 10562 10563 h 6 . 10562 10563 255,0,0 6 0.00 0 5 1 0 0 1 0
10- chr1 10562 10563 m 6 . 10562 10563 255,0,0 6 16.67 1 5 0 0 0 1 0

when I used wc -l, The total no was 41503950

  • Then I separated the file into 2 other files based on the modification code (m/h). I found that the file with m modifications has 20751975, while file with h modifications has 19175142.

Would you explain why there is difference, although each site should mention both mods.

Q2: In some sites, the percent for hydroxy and m modifications is not equal to 0 and 100, as mentioned in the above lines (5-10). How can I decide whether this position have hydroxy or methyl modification?, How can I find the most significant modified positions as from arrays?

Q3: Regarding the summary modbam: Does the modification calls table represent the whole sample (whole modbam file) or only certain region (region chr 20) as mentioned in the totals table (the first part of the summary)?

summary output

parsing region chr20 # only present if --region option is provided
sampling 10042 reads from BAM # modulated with --num-reads
calculating threshold at 10% percentile # modulated with --filter-percentile
calculated thresholds: C: 0.7167969 # calculated per-canonical base, on the fly

bases C
total_reads_used 9989
count_reads_C 9989
pass_threshold_C 0.7167969
region chr20:0-64444167
base code pass_count pass_frac all_count all_frac
C m 1192533 0.58716166 1305956 0.5790408
C h 119937 0.0590528 195335 0.086608544
C - 718543 0.3537855 754087 0.33435062

Thanks inadvance

@marcus1487
Copy link
Contributor

Thanks for your questions! Here are some clarifications:

  1. Discrepancy in row counts: Can you confirm how you counted the number of h rows? The total number of rows in your file doesn’t match the combined counts for m and h, which suggests something might be off in the filtering or counting step. Each site should indeed have both m and h lines, so the counts should be equal.
  2. Percentages for m and h: Unlike canonical bases, modified bases like 5mC and 5hmC typically exist as percentages across reads. It’s not unusual to see sites with non-zero values for both m and h. The significance of a site being modified depends on your system and the biological question.
    • For example, if you’re looking for highly modified sites, you might filter by a specific threshold (e.g., sites where the percentage for m or h exceeds a given value).
    • Modkit offers tools for advanced analysis, such as modkit dmr for comparing samples or modkit motif for motif discovery. If you can share more about your downstream goals, I’d be happy to guide you further.
  3. Summary Modbam: If you used the --region argument, the results in the summary file are restricted to that region (e.g., chr20). Without --region, the results would represent the entire sample.

Let me know if you need additional help!

@ArtRand ArtRand added the question Looking for clarification on inputs and/or outputs label Jan 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Looking for clarification on inputs and/or outputs
Projects
None yet
Development

No branches or pull requests

3 participants