Identifying PC nodes / edges added during match_samples #916

Open
hyanwong opened this issue May 9, 2024 · 3 comments

Comments

@hyanwong
Member

hyanwong commented May 9, 2024

I think we might want to release both the ancestors tree sequences and the fully simplified tree sequences from any real inference that we do, so that people can match their own samples against the ancestors.

However, it's likely that as well as matching their own samples against the ancestors, they will also want to place the original samples back on. For this, we need to be able to identify which edges were added during the sample matching process, and simply re-add them.

However, I'm not sure how we can identify the sample-matched edges, given an ancestors_ts and the final ts. It's reasonably obvious to do when there aren't PC nodes (the added edges are simply the ones above the sample nodes), but once you have PC nodes, it's more difficult. In particular, we currently can't identify which PC nodes in the final tree sequence correspond to PC nodes that were added during ancestor matching, and which correspond to PC nodes added during sample matching.
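For the simple case without PC nodes, a minimal sketch of "the added edges are simply the ones above the sample nodes" might look like the following. It is a toy model using plain Python structures rather than tskit tables; the `Edge` tuple and the `sample_matched_edges` helper are hypothetical names for illustration, and it assumes the final (unsimplified) tree sequence only flags the real samples as samples.

```python
from typing import NamedTuple

NODE_IS_SAMPLE = 1  # same bit value as tskit.NODE_IS_SAMPLE


class Edge(NamedTuple):
    left: float
    right: float
    parent: int
    child: int


def sample_matched_edges(node_flags, edges):
    """Return the edges whose child is a sample node (hypothetical helper)."""
    return [e for e in edges if node_flags[e.child] & NODE_IS_SAMPLE]


# Toy data: nodes 0 and 1 are ancestors, nodes 2 and 3 are samples.
node_flags = [0, 0, NODE_IS_SAMPLE, NODE_IS_SAMPLE]
edges = [
    Edge(0, 10, 0, 1),  # ancestor-to-ancestor edge, from ancestor matching
    Edge(0, 10, 1, 2),  # edges directly above a sample, from sample matching
    Edge(0, 5, 1, 3),
    Edge(5, 10, 0, 3),
]
added = sample_matched_edges(node_flags, edges)
```

Once sample matching also inserts PC nodes, this filter misses the edges above those PC nodes, which is exactly the problem described above.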

I've thought about it for a bit, and perhaps the easiest would be to add metadata to PC nodes added during sample matching, specifying the node ID they represent in the ancestors_ts. At the moment we set ancestor_data_id in the metadata of non-PC nodes in the ancestors TS. I wonder if we should set ancestor_ts_id for the PC nodes?

@jeromekelleher
Member

We should be adding richer metadata about PC nodes all right, we want to communicate back information about how and why nodes were added.

@hyanwong
Member Author

I'm thinking that it would be useful to add a flag either to all the nodes that have been placed in the match-ancestors phase (both ancestors and PC nodes) OR to all the nodes placed in the match-samples phase. Flipping a flag is easy and cheap, so I can't see any objection to this: do you have any preference for whether such a flag would be on the match-ancestors or the match-samples nodes, @jeromekelleher?

That way it's trivial to identify all the match-samples nodes, remove them from the (unsimplified) final TS, remove the non-inference sites, and you are left with the ancestors tree sequence, which can be used again for matching (e.g. against a different set of samples).
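As a rough sketch of that idea, assuming the flag is put on the match-samples nodes: the flag name `NODE_ADDED_IN_MATCH_SAMPLES` and its bit position are made up here (no such flag exists yet), and `strip_match_samples_nodes` is a hypothetical helper working on plain lists and tuples rather than tskit tables. Removing the non-inference sites is not modelled.

```python
NODE_IS_SAMPLE = 1  # same bit value as tskit.NODE_IS_SAMPLE
# Hypothetical flag bit, chosen arbitrarily for illustration.
NODE_ADDED_IN_MATCH_SAMPLES = 1 << 20


def strip_match_samples_nodes(node_flags, edges):
    """Drop flagged nodes and every edge touching them, renumbering the rest."""
    id_map = {}  # old node ID -> new node ID, for the nodes that are kept
    for old_id, flags in enumerate(node_flags):
        if not flags & NODE_ADDED_IN_MATCH_SAMPLES:
            id_map[old_id] = len(id_map)
    new_flags = [node_flags[old_id] for old_id in id_map]
    new_edges = [
        (left, right, id_map[parent], id_map[child])
        for (left, right, parent, child) in edges
        if parent in id_map and child in id_map
    ]
    return new_flags, new_edges, id_map


# Toy data: nodes 0 and 1 are ancestors; node 2 is a PC node created during
# sample matching; nodes 3 and 4 are the samples themselves.
node_flags = [
    0,
    0,
    NODE_ADDED_IN_MATCH_SAMPLES,
    NODE_IS_SAMPLE | NODE_ADDED_IN_MATCH_SAMPLES,
    NODE_IS_SAMPLE | NODE_ADDED_IN_MATCH_SAMPLES,
]
edges = [
    (0, 10, 0, 1),  # (left, right, parent, child) from ancestor matching
    (0, 10, 1, 2),  # the remaining edges all come from sample matching
    (0, 10, 2, 3),
    (0, 10, 2, 4),
]
anc_flags, anc_edges, id_map = strip_match_samples_nodes(node_flags, edges)
```

Because the PC node carries the same flag as the samples, it is removed along with them without needing to know which kind of node it is. The complementary set (the flagged nodes and the edges touching them) is what would be re-added to put the original samples back.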

@jeromekelleher
Member

Sure, SGTM. I think we have plenty room in flag space, so whatever works best. More info is better I think.


2 participants