Assertion right_sib[last_root] == NULL_NODE failed #983

Open
nspope opened this issue Dec 8, 2024 · 16 comments
@nspope

nspope commented Dec 8, 2024

I wanted to reinfer an inferred tree sequence using site times, but am hitting this error (with current development version):

INFO:tsinfer.formats:Number of sites after applying mask: 1654836
INFO:tsinfer.formats:Sites chunks used: 190 - of 318
INFO:tsinfer.formats:Number of individuals after applying mask: 2548
INFO:root:Max encoded genotype matrix size=1005.3 MiB
INFO:tsinfer.inference:Starting addition of 1654836 sites
INFO:tsinfer.inference:Finished adding sites
INFO:tsinfer.inference:Ancestor builder peak RAM: 473.5 MiB
INFO:tsinfer.inference:Starting build for 772118 ancestors
INFO:tsinfer.inference:Finished building ancestors
INFO:tsinfer.inference:Mismatch prevented by setting constant high recombination and low mismatch probabilities
INFO:tsinfer.inference:Summary of recombination probabilities between sites: min=0.01; max=0.01; median=0.01; mean=0.01
INFO:tsinfer.inference:Summary of mismatch probabilities over sites: min=1e-20; max=1e-20; median=1e-20; mean=1e-20
INFO:tsinfer.inference:Matching using 13 digits of precision in likelihood calcs
INFO:tsinfer.inference:715162 epochs with 1.0 median size.
INFO:tsinfer.inference:First large (>500.0) epoch is 715162
INFO:tsinfer.inference:Grouping 772120 ancestors by linesweep
INFO:tsinfer.ancestors:Merged to 715172 ancestors in 1.94s
INFO:tsinfer.ancestors:Built 1430344 events in 0.16s
INFO:tsinfer.ancestors:Linesweep generated 4096554905 dependencies in 171.11s
INFO:tsinfer.ancestors:Found groups in 0.00s
INFO:tsinfer.ancestors:Un-merged in 0.30s
INFO:tsinfer.ancestors:2 groups with median size 386060.0
INFO:tsinfer.inference:Finished grouping into 2 groups in 173.62 seconds
INFO:tsinfer.inference:Starting ancestor matching for 2 groups
INFO:tsinfer.inference:Starting group -1 of 2 with 772119 ancestors
INFO:tsinfer.inference:Finished group -1 of 2 in 28542.50 seconds
INFO:tsinfer.inference:Starting group 0 of 2 with 0 ancestors
INFO:tsinfer.inference:Finished group 0 of 2 in 1.94 seconds
INFO:tsinfer.inference:Built ancestors tree sequence: 772120 nodes (0 pc ancestors); 772119 edges; 363268065 sites; 957070 mutations
INFO:tsinfer.inference:Finished ancestor matching
INFO:tsinfer.inference:Mismatch prevented by setting constant high recombination and low mismatch probabilities
INFO:tsinfer.inference:Summary of recombination probabilities between sites: min=0.01; max=0.01; median=0.01; mean=0.01
INFO:tsinfer.inference:Summary of mismatch probabilities over sites: min=1e-20; max=1e-20; median=1e-20; mean=1e-20
INFO:tsinfer.inference:Matching using 13 digits of precision in likelihood calcs
INFO:tsinfer.inference:Loaded 5096 samples 772120 nodes; 772119 edges; 957070 sites; 363268065 mutations
INFO:tsinfer.inference:Started matching for 5096 samples
INFO:tsinfer.inference:1733613055.1650207Thread 139710324618816 starting haplotype 0
python: lib/ancestor_matcher.c:838: ancestor_matcher_run_forwards_match: Assertion `right_sib[last_root] == NULL_NODE' failed.

Here's code to reproduce, and here is the tree sequence. It takes 8-9 hours to get through ancestor matching with 24 threads, but only 30 min or so to generate ancestors:

```python
import tszip
import tsinfer

test_0 = tszip.load("test_0.dated.trees.tsz")
data = tsinfer.SampleData.from_tree_sequence(test_0, use_sites_time=True)
test_1 = tsinfer.infer(data, num_threads=24)
```

This ran fine when trimming to a smaller interval (5 Mb) and produced reasonable-looking results. A few odd things stand out here relative to the well-behaved test run: all ancestors get put into a single group by the linesweep (as opposed to the very many small groups in the test run); and, after running ancestor matching separately to investigate, it turns out there is a single edge per node and a huge number of mutations (~300 million).

@jeromekelleher
Member

Hmmm, "interesting"!

@nspope
Author

nspope commented Dec 9, 2024

The reason for all ancestors ending up in a single group seems to be that all ancestors have >0 incoming edges, so this loop

```python
no_incoming = np.where(incoming_edge_count == 0)[0]
if len(no_incoming) == 0:
    break
```

exits immediately.
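For context, this is the classic failure mode of a Kahn-style topological peel: if every remaining node has at least one incoming edge (a cycle, or corrupted in-degree counts), the first pass finds nothing to peel and everything is left in a single group. A minimal self-contained sketch, not tsinfer's actual code (`peel_groups` and its arguments are hypothetical):

```python
import numpy as np

def peel_groups(incoming_edge_count, children):
    """Kahn-style grouping: repeatedly peel nodes with no incoming edges."""
    incoming = np.array(incoming_edge_count)
    remaining = set(range(len(incoming)))
    groups = []
    while remaining:
        no_incoming = [n for n in np.where(incoming == 0)[0] if n in remaining]
        if len(no_incoming) == 0:
            break  # every remaining node has >0 incoming edges
        groups.append(list(no_incoming))
        for n in no_incoming:
            remaining.discard(n)
            incoming[n] = -1  # mark as processed
            for child in children.get(n, []):
                incoming[child] -= 1
    return groups, remaining

# A 2-cycle: both nodes have one incoming edge, so the loop exits
# immediately and both nodes are left ungrouped.
groups, leftover = peel_groups([1, 1], {0: [1], 1: [0]})
print(groups, leftover)
```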

@hyanwong
Member

hyanwong commented Dec 9, 2024

Ping @benjeffery, as this is a linesweep thing.

@nspope
Author

nspope commented Dec 9, 2024

Here's what the ancestor age/length distribution looks like, for those ancestors that actually get passed to run_linesweep:

[figure: ancestor age vs. length distribution]

So I should probably be truncating lengths. Here's a heatmap of where ancestors are located spatially (in terms of site index) and temporally:

[figure: heatmap of ancestor positions (site index) vs. time]

This also looks reasonable, aside from the really long old ancestors. So maybe this is a bug rather than an issue with the ancestors themselves.

@hyanwong
Member

hyanwong commented Dec 9, 2024

There aren't any NaN or negative times that you are passing as sites_time, or anything?
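For anyone wanting to run the same check, something along these lines would catch the bad values being asked about (a hypothetical helper, not part of tsinfer):

```python
import numpy as np

def check_site_times(sites_time):
    """Reject values that would break time-based ancestor ordering."""
    sites_time = np.asarray(sites_time, dtype=float)
    assert not np.any(np.isnan(sites_time)), "NaN site times"
    assert np.all(sites_time >= 0), "negative site times"
    return len(np.unique(sites_time))  # number of distinct time points

print(check_site_times([0.0, 1.5, 1.5, 3.2]))  # 3 distinct times
```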

@nspope
Author

nspope commented Dec 9, 2024

> There aren't any NaN or negative times that you are passing as sites_time, or anything?

Nope, everything looks good up to

```python
children_data, children_indices, incoming_edge_count = run_linesweep(
```

after which things start to look off.

@benjeffery
Member

Thanks @nspope, I'll try to recreate it.

@benjeffery
Member

@nspope I think the root of the issue here is that there are so many unique times. Could you try discretising the time array as a quick workaround? The code here should deal better with the original array, but that is more extensive work.

@nspope
Author

nspope commented Dec 13, 2024

Thanks Ben-- it does work when I round times to the nearest integer (which collapses a lot of the recent stuff, resulting in maybe 65K unique time points). Shall I leave this issue open as a reminder about the underlying issue with linesweep?

@hyanwong
Member

I think we should leave this open (and maybe change the title to reflect what needs to be done). What is the exact issue with having a huge number of unique times? Is the linesweep running out of space to store separate execution paths or something? I'm not sure I quite get what the underlying problem is, but it's very likely that reinference will involve every ancestor having a unique time.

@benjeffery
Member

> Thanks Ben-- it does work when I round times to the nearest integer (which collapses a lot of the recent stuff, resulting in maybe 65K unique time points). Shall I leave this issue open as a reminder about the underlying issue with linesweep?

You don't have to go as far as integers, 0.1 should do the trick too.
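The suggested workaround amounts to snapping times to a fixed grid before building the sample data, which collapses near-duplicate times into shared epochs. A sketch (function name and data are illustrative, not tsinfer API):

```python
import numpy as np

def discretise_times(times, step=0.1):
    """Snap each time to the nearest multiple of `step`."""
    times = np.asarray(times, dtype=float)
    return np.round(times / step) * step

times = np.array([0.03, 0.07, 0.12, 5.004, 5.049])
binned = discretise_times(times)
print(len(np.unique(times)), "->", len(np.unique(binned)))  # 5 -> 3
```

With `use_sites_time=True` the times come from the input tree sequence, so presumably the rounding would be applied to the site times before calling `SampleData.from_tree_sequence`.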

@benjeffery
Member

benjeffery commented Dec 14, 2024

> I'm not sure I quite get what the underlying problem is

The grouping algorithm proceeds by building the DAG of dependencies and then topologically sorting it. If all ancestors are considered at the scale of Nate's data, you get a DAG with 4,096,554,905 edges, which is too many to process in a reasonable time. Most of the edges are at the bottom of the DAG, so we just use the top. We look at the ancestors binned by time, and when the groups by time are large enough, we use the time groupings instead of the DAG.
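A rough sketch of that hybrid strategy (the threshold mirrors the log's ">500.0" epoch message; this is a simplification, not tsinfer's implementation):

```python
import numpy as np

def split_epochs(times, large_group=500):
    """Partition unique ancestor times into small epochs (need DAG-based
    ordering) and large epochs (can be grouped purely by time)."""
    unique_times, counts = np.unique(np.asarray(times), return_counts=True)
    small = unique_times[counts <= large_group]  # resolve via the DAG
    large = unique_times[counts > large_group]   # group by time directly
    return small, large

# 1000 ancestors with unique times, plus 600 sharing a single time.
times = np.concatenate([np.arange(1000) * 1e-3, np.full(600, 2.0)])
small, large = split_epochs(times)
print(len(small), "small epochs,", len(large), "large epoch(s)")
```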

@benjeffery
Member

I think in this case the number of edges overflowed; the code clearly needs to detect that and error out.
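For what it's worth, 4,096,554,905 is indeed just past the signed 32-bit maximum (2,147,483,647), so if any counter on this path is a 32-bit int it would wrap negative. A small illustration of the wrap and the kind of guard that would turn it into a clean error (that the counter is actually 32-bit is my assumption):

```python
import numpy as np

n_dependencies = 4_096_554_905  # from the log above
int32_max = np.iinfo(np.int32).max  # 2,147,483,647

# A C-style cast to a signed 32-bit int silently wraps around.
wrapped = np.array([n_dependencies], dtype=np.int64).astype(np.int32)[0]
print(n_dependencies > int32_max, int(wrapped))  # True -198412391

# The guard the code should have: detect the overflow and error out.
if n_dependencies > int32_max:
    print("dependency count overflows int32; need int64 or a clear error")
```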

@benjeffery
Member

I can't recreate this locally as I don't have enough RAM. I'll see if I can get on a bigger box and do it there.

@nspope
Author

nspope commented Dec 16, 2024

If it'd help, I can put the ancestors somewhere you can get them (the zarr store is 10 GB or so).

@benjeffery
Member

Thanks, but I've got the ancestors - it's the linesweep that is the issue. I think I might be able to do it if I truncate the samples enough to fit in RAM, but not so much that it suppresses the issue.

@benjeffery benjeffery added this to the Release 0.4.0 milestone Jan 16, 2025