I think we might want to release both the ancestors tree sequences and the fully simplified tree sequences from any real inference that we do, so that people can match their own samples against the ancestors.
However, it's likely that as well as matching their own samples against the ancestors, they will also want to place the original samples back on. For this, we need to be able to identify which edges were added during the sample matching process, and simply re-add them.
However, I'm not sure how we can identify the sample-matched edges, given an ancestors_ts and the final ts. It's reasonably obvious to do when there aren't PC nodes (the added edges are simply the ones above the sample nodes), but once you have PC nodes, it's more difficult. In particular, we currently can't identify which PC nodes in the final tree sequence correspond to PC nodes that were added during ancestor matching, and which correspond to PC nodes added during sample matching.
I've thought about it for a bit, and perhaps the easiest would be to add metadata to PC nodes added during sample matching, specifying the node ID they represent in the `ancestors_ts`. At the moment we set `ancestor_data_id` in the metadata of non-PC nodes in the ancestors TS. I wonder if we should set `ancestor_ts_id` for the PC nodes?
I'm thinking that it would be useful to add a flag either to all the nodes that have been placed in the match-ancestors phase (both ancestors and PC nodes) OR to all the nodes placed in the match-samples phase. Flipping a flag is easy and cheap, so I can't see any objection to this. Do you have any preference for whether such a flag would be on the match-ancestors or the match-samples nodes, @jeromekelleher?
That way it's trivial to identify all the match-samples nodes, remove them from the (unsimplified) final TS, remove the non-inference sites, and be left with the ancestors tree sequence, which can be used again for matching (e.g. against a different set of samples).
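To make the flag idea concrete, here is a rough sketch in plain Python rather than against the real tables. Everything in it is an assumption for illustration: `NODE_IS_MATCH_SAMPLES` is a made-up flag bit, not an existing tsinfer constant, and `Edge` and `split_by_phase` are hypothetical helpers. It only covers the node and edge split, not the removal of non-inference sites.

```python
from dataclasses import dataclass

# Hypothetical flag bit, chosen only for illustration; not a tsinfer constant.
NODE_IS_MATCH_SAMPLES = 1 << 20


@dataclass(frozen=True)
class Edge:
    left: float
    right: float
    parent: int
    child: int


def split_by_phase(node_flags, edges):
    """Separate match-ancestors nodes and edges from match-samples ones."""
    sample_phase = {
        u for u, flags in enumerate(node_flags) if flags & NODE_IS_MATCH_SAMPLES
    }
    # Renumber the surviving nodes so that their IDs stay contiguous.
    node_map = {}
    for u in range(len(node_flags)):
        if u not in sample_phase:
            node_map[u] = len(node_map)
    ancestor_edges, sample_edges = [], []
    for e in edges:
        if e.parent in sample_phase or e.child in sample_phase:
            # Kept with the original IDs so they can be re-added later.
            sample_edges.append(e)
        else:
            ancestor_edges.append(
                Edge(e.left, e.right, node_map[e.parent], node_map[e.child])
            )
    return node_map, ancestor_edges, sample_edges


# Toy example: nodes 0-2 come from match-ancestors (node 2 is a PC node),
# node 3 is a PC node added during match-samples, node 4 is a sample.
node_flags = [0, 0, 0, NODE_IS_MATCH_SAMPLES, NODE_IS_MATCH_SAMPLES | 1]
edges = [
    Edge(0, 10, 0, 1),
    Edge(0, 10, 1, 2),
    Edge(0, 10, 2, 3),
    Edge(0, 10, 3, 4),
]
node_map, ancestor_edges, sample_edges = split_by_phase(node_flags, edges)
```

With the flag on the match-samples nodes, the PC node added during sample matching (node 3) is dropped along with the sample, while the PC node from ancestor matching (node 2) is kept. The edges in `sample_edges` are exactly the ones that would need re-adding to place the original samples back on.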