Assertion right_sib[last_root] == NULL_NODE failed #983

Open
nspope opened this issue Dec 8, 2024 · 16 comments
@nspope

nspope commented Dec 8, 2024

I wanted to reinfer an inferred tree sequence using site times, but am hitting this error (with current development version):

INFO:tsinfer.formats:Number of sites after applying mask: 1654836
INFO:tsinfer.formats:Sites chunks used: 190 - of 318
INFO:tsinfer.formats:Number of individuals after applying mask: 2548
INFO:root:Max encoded genotype matrix size=1005.3 MiB
INFO:tsinfer.inference:Starting addition of 1654836 sites
INFO:tsinfer.inference:Finished adding sites
INFO:tsinfer.inference:Ancestor builder peak RAM: 473.5 MiB
INFO:tsinfer.inference:Starting build for 772118 ancestors
INFO:tsinfer.inference:Finished building ancestors
INFO:tsinfer.inference:Mismatch prevented by setting constant high recombination and low mismatch probabilities
INFO:tsinfer.inference:Summary of recombination probabilities between sites: min=0.01; max=0.01; median=0.01; mean=0.01
INFO:tsinfer.inference:Summary of mismatch probabilities over sites: min=1e-20; max=1e-20; median=1e-20; mean=1e-20
INFO:tsinfer.inference:Matching using 13 digits of precision in likelihood calcs
INFO:tsinfer.inference:715162 epochs with 1.0 median size.
INFO:tsinfer.inference:First large (>500.0) epoch is 715162
INFO:tsinfer.inference:Grouping 772120 ancestors by linesweep
INFO:tsinfer.ancestors:Merged to 715172 ancestors in 1.94s
INFO:tsinfer.ancestors:Built 1430344 events in 0.16s
INFO:tsinfer.ancestors:Linesweep generated 4096554905 dependencies in 171.11s
INFO:tsinfer.ancestors:Found groups in 0.00s
INFO:tsinfer.ancestors:Un-merged in 0.30s
INFO:tsinfer.ancestors:2 groups with median size 386060.0
INFO:tsinfer.inference:Finished grouping into 2 groups in 173.62 seconds
INFO:tsinfer.inference:Starting ancestor matching for 2 groups
INFO:tsinfer.inference:Starting group -1 of 2 with 772119 ancestors
INFO:tsinfer.inference:Finished group -1 of 2 in 28542.50 seconds
INFO:tsinfer.inference:Starting group 0 of 2 with 0 ancestors
INFO:tsinfer.inference:Finished group 0 of 2 in 1.94 seconds
INFO:tsinfer.inference:Built ancestors tree sequence: 772120 nodes (0 pc ancestors); 772119 edges; 363268065 sites; 957070 mutations
INFO:tsinfer.inference:Finished ancestor matching
INFO:tsinfer.inference:Mismatch prevented by setting constant high recombination and low mismatch probabilities
INFO:tsinfer.inference:Summary of recombination probabilities between sites: min=0.01; max=0.01; median=0.01; mean=0.01
INFO:tsinfer.inference:Summary of mismatch probabilities over sites: min=1e-20; max=1e-20; median=1e-20; mean=1e-20
INFO:tsinfer.inference:Matching using 13 digits of precision in likelihood calcs
INFO:tsinfer.inference:Loaded 5096 samples 772120 nodes; 772119 edges; 957070 sites; 363268065 mutations
INFO:tsinfer.inference:Started matching for 5096 samples
INFO:tsinfer.inference:1733613055.1650207Thread 139710324618816 starting haplotype 0
python: lib/ancestor_matcher.c:838: ancestor_matcher_run_forwards_match: Assertion `right_sib[last_root] == NULL_NODE' failed.

Here's code to reproduce, and here is the tree sequence. It takes 8-9 hours to get through ancestor matching with 24 threads, but only 30 min or so to generate ancestors:

```python
import tszip
import tsinfer

test_0 = tszip.load("test_0.dated.trees.tsz")
data = tsinfer.SampleData.from_tree_sequence(test_0, use_sites_time=True)
test_1 = tsinfer.infer(data, num_threads=24)
```

This ran fine when trimming to a smaller interval (5 Mb) and produced reasonable-looking results. A few odd things stand out here relative to the well-behaved test run: all ancestors get put into a single group by the linesweep (as opposed to the very many small groups in the test run); and, after running ancestor matching separately to investigate, it turns out there is a single edge per node and a huge number of mutations (~300 million).

@jeromekelleher
Member

Hmmm, "interesting"!

@nspope
Author

nspope commented Dec 9, 2024

The reason for all ancestors ending up in a single group seems to be that all ancestors have >0 incoming edges, so this loop

```python
no_incoming = np.where(incoming_edge_count == 0)[0]
if len(no_incoming) == 0:
    break
```

exits immediately.
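For context, this is the classic failure mode of a Kahn-style topological peel: if every remaining node has at least one incoming edge (a cycle, or corrupted in-degree counts), the first pass finds nothing to peel and everything is left in a single group. A minimal self-contained sketch, not tsinfer's actual code (`peel_groups` and its arguments are hypothetical):

```python
import numpy as np

def peel_groups(incoming_edge_count, children):
    """Kahn-style grouping: repeatedly peel nodes with no incoming edges."""
    incoming = np.array(incoming_edge_count)
    remaining = set(range(len(incoming)))
    groups = []
    while remaining:
        no_incoming = [n for n in np.where(incoming == 0)[0] if n in remaining]
        if len(no_incoming) == 0:
            break  # every remaining node has >0 incoming edges
        groups.append(list(no_incoming))
        for n in no_incoming:
            remaining.discard(n)
            incoming[n] = -1  # mark as processed
            for child in children.get(n, []):
                incoming[child] -= 1
    return groups, remaining

# A 2-cycle: both nodes have one incoming edge, so the loop exits
# immediately and both nodes are left ungrouped.
groups, leftover = peel_groups([1, 1], {0: [1], 1: [0]})
print(groups, leftover)
```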

@hyanwong
Member

hyanwong commented Dec 9, 2024

Ping @benjeffery, as this is a linesweep thing.

@nspope
Author

nspope commented Dec 9, 2024

Here's what the ancestor age/length distribution looks like, for those ancestors that actually get passed to run_linesweep:

[figure: ancestor age vs. length distribution]

So I should probably be truncating lengths. Here's a heatmap of where ancestors are located spatially (in terms of site index) and temporally:

[figure: heatmap of ancestor positions (site index) vs. time]

This also looks reasonable, aside from the really long old ancestors. So maybe this is a bug rather than an issue with the ancestors themselves.

@hyanwong
Member

hyanwong commented Dec 9, 2024

There aren't any NaN or negative times that you are passing as sites_time, or anything?
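For anyone wanting to run the same check, something along these lines would catch the bad values being asked about (a hypothetical helper, not part of tsinfer):

```python
import numpy as np

def check_site_times(sites_time):
    """Reject values that would break time-based ancestor ordering."""
    sites_time = np.asarray(sites_time, dtype=float)
    assert not np.any(np.isnan(sites_time)), "NaN site times"
    assert np.all(sites_time >= 0), "negative site times"
    return len(np.unique(sites_time))  # number of distinct time points

print(check_site_times([0.0, 1.5, 1.5, 3.2]))  # 3 distinct times
```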

@nspope
Author

nspope commented Dec 9, 2024

> There aren't any NaN or negative times that you are passing as sites_time, or anything?

Nope, everything looks good up to

```python
children_data, children_indices, incoming_edge_count = run_linesweep(
```

after which things start to look off.

@benjeffery
Member

Thanks @nspope, I'll try to recreate it.

@benjeffery
Member

@nspope I think the root of the issue here is that there are so many unique times. Could you try discretising the time array as a quick workaround? The code here should deal better with the original array, but that is more extensive work.

@nspope
Author

nspope commented Dec 13, 2024

Thanks Ben-- it does work when I round times to the nearest integer (which collapses a lot of the recent stuff, resulting in maybe 65K unique time points). Shall I leave this issue open as a reminder about the underlying issue with linesweep?

@hyanwong
Member

I think we should leave this open (and maybe change the title to reflect what needs to be done). What is the exact issue with having a huge number of unique times? Is the linesweep running out of space to store separate execution paths or something? I'm not sure I quite get what the underlying problem is, but it's very likely that reinference will involve every ancestor having a unique time.

@benjeffery
Member

> Thanks Ben-- it does work when I round times to the nearest integer (which collapses a lot of the recent stuff, resulting in maybe 65K unique time points). Shall I leave this issue open as a reminder about the underlying issue with linesweep?

You don't have to go as far as integers, 0.1 should do the trick too.
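The suggested workaround amounts to snapping times to a fixed grid before building the sample data, which collapses near-duplicate times into shared epochs. A sketch (function name and data are illustrative, not tsinfer API):

```python
import numpy as np

def discretise_times(times, step=0.1):
    """Snap each time to the nearest multiple of `step`."""
    times = np.asarray(times, dtype=float)
    return np.round(times / step) * step

times = np.array([0.03, 0.07, 0.12, 5.004, 5.049])
binned = discretise_times(times)
print(len(np.unique(times)), "->", len(np.unique(binned)))  # 5 -> 3
```

With `use_sites_time=True` the times come from the input tree sequence, so presumably the rounding would be applied to the site times before calling `SampleData.from_tree_sequence`.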

@benjeffery
Member

benjeffery commented Dec 14, 2024

> I'm not sure I quite get what the underlying problem is

The grouping algorithm proceeds by building the DAG of dependencies and then topologically sorting it. If all ancestors are considered at the scale of Nate's data, you get a DAG with 4,096,554,905 edges, which is too many to process in a reasonable time. Most of the edges are at the bottom of the DAG, so we just use the top. We look at the ancestors binned by time, and when the groups by time are large enough, we use the time groupings instead of the DAG.
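A rough sketch of that hybrid strategy (the threshold mirrors the log's ">500.0" epoch message; this is a simplification, not tsinfer's implementation):

```python
import numpy as np

def split_epochs(times, large_group=500):
    """Partition unique ancestor times into small epochs (need DAG-based
    ordering) and large epochs (can be grouped purely by time)."""
    unique_times, counts = np.unique(np.asarray(times), return_counts=True)
    small = unique_times[counts <= large_group]  # resolve via the DAG
    large = unique_times[counts > large_group]   # group by time directly
    return small, large

# 1000 ancestors with unique times, plus 600 sharing a single time.
times = np.concatenate([np.arange(1000) * 1e-3, np.full(600, 2.0)])
small, large = split_epochs(times)
print(len(small), "small epochs,", len(large), "large epoch(s)")
```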

@benjeffery
Member

I think in this case the number of edges overflowed; the code clearly needs to detect that and error out.
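For what it's worth, 4,096,554,905 is indeed just past the signed 32-bit maximum (2,147,483,647), so if any counter on this path is a 32-bit int it would wrap negative. A small illustration of the wrap and the kind of guard that would turn it into a clean error (that the counter is actually 32-bit is my assumption):

```python
import numpy as np

n_dependencies = 4_096_554_905  # from the log above
int32_max = np.iinfo(np.int32).max  # 2,147,483,647

# A C-style cast to a signed 32-bit int silently wraps around.
wrapped = np.array([n_dependencies], dtype=np.int64).astype(np.int32)[0]
print(n_dependencies > int32_max, int(wrapped))  # True -198412391

# The guard the code should have: detect the overflow and error out.
if n_dependencies > int32_max:
    print("dependency count overflows int32; need int64 or a clear error")
```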

@benjeffery
Member

I can't recreate this locally as I don't have enough RAM. I'll see if I can get on a bigger box and do it there.

@nspope
Author

nspope commented Dec 16, 2024

If it'd help, I can put the ancestors somewhere you can get them (the zarr store is 10 GB or so).

@benjeffery
Member

Thanks, but I've got the ancestors - it's the linesweep that is the issue. I think I might be able to do it if I truncate the samples enough to fit in RAM, but not so much that it suppresses the issue.

@benjeffery benjeffery added this to the Release 0.4.0 milestone Jan 16, 2025