Use divergence_matrix for downstream statistics #2783

Closed
jeromekelleher opened this issue Jul 7, 2023 · 4 comments

@jeromekelleher
Member

I think we can rephrase at least genetic_relatedness (aka eGRM) in terms of divergence_matrix, which should substantially improve performance (although this depends on #2779, which is needed for decent site-mode performance).

Can we transform the divergence matrix into genetic_relatedness efficiently in Python (i.e. using numpy) or do we need C code for this @petrelharp?

Are there other stats we can do this for?

@jeromekelleher
Member Author

We'd need to consider the compatibility issues this raises, of course. For one, we'll be computing something slightly different in site mode after this, I guess?

@petrelharp
Contributor

Let's see - we talked through how to do this somewhere; the missing piece is you need the function that computes, for each node, the total area from the node to the root (that's in branch mode; for site it's the number of mutations). Call this derived; then relatedness[i,j] = derived[i] + derived[j] - divergence[i,j].
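The combination step in that formula can be sketched with numpy broadcasting. This is only an illustration of the identity as stated above, not the tskit implementation: the function name `relatedness_from_divergence` and the toy inputs are assumptions, `derived` is taken as already computed, and any centring or normalisation that genetic_relatedness applies is not covered.

```python
import numpy as np


def relatedness_from_divergence(derived, divergence):
    """Apply relatedness[i, j] = derived[i] + derived[j] - divergence[i, j].

    Hypothetical helper for illustration only; both inputs are assumed to
    have been computed elsewhere.
    """
    derived = np.asarray(derived, dtype=float)        # shape (n,)
    divergence = np.asarray(divergence, dtype=float)  # shape (n, n)
    if divergence.shape != (derived.size, derived.size):
        raise ValueError("divergence must be n x n, where n = len(derived)")
    # Broadcasting builds the n x n matrix of derived[i] + derived[j].
    return derived[:, None] + derived[None, :] - divergence


# Toy branch-mode values (made up for this sketch): samples 0 and 1 coalesce
# at time 1, and that ancestor joins sample 2 at the root at time 3.
derived = np.array([3.0, 3.0, 3.0])       # branch length from each sample to the root
divergence = np.array([[0.0, 2.0, 6.0],   # branch length separating each pair
                       [2.0, 0.0, 6.0],
                       [6.0, 6.0, 0.0]])
relatedness = relatedness_from_divergence(derived, divergence)
print(relatedness)
```

In this toy tree samples 0 and 2 share nothing above their common ancestor, so that entry comes out as zero, while samples 0 and 1 share the branch between time 1 and the root.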

HOWEVER, your point about back mutations is an important one. I think that we argued that if divergence matrix and divergence gave slightly different answers that was OK; if that is true then relatedness_matrix and relatedness could also give slightly different answers?

@jeromekelleher
Member Author

Ah yes, that makes sense. Given we need to compute derived per window, it's probably simpler to do this in C than to try to come up with numpy tricks.

So, we create a C function genetic_relatedness_matrix, following the pattern of divergence_matrix, and expose this to Python in the standard way?

I think having the *_matrix functions have slightly different semantics is fine, we just need to document it clearly.
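For reference, a rough sketch of what the per-window combination would look like if `derived` were already available for each window. The array shapes, the name `windowed_relatedness`, and the toy values are assumptions made for this example; computing `derived` per window, which is the part discussed here as better suited to C, is not shown.

```python
import numpy as np


def windowed_relatedness(derived, divergence):
    """Per-window version of derived[i] + derived[j] - divergence[i, j].

    Hypothetical helper: derived has shape (num_windows, n) and divergence
    has shape (num_windows, n, n); both are assumed to be precomputed.
    """
    derived = np.asarray(derived, dtype=float)
    divergence = np.asarray(divergence, dtype=float)
    if derived.ndim != 2 or divergence.shape != derived.shape + derived.shape[-1:]:
        raise ValueError("expected shapes (W, n) and (W, n, n)")
    # Broadcast over the window axis so each window is combined independently.
    return derived[:, :, None] + derived[:, None, :] - divergence


# Two made-up windows over three samples.
derived = np.array([[3.0, 3.0, 3.0],
                    [2.0, 2.0, 2.0]])
divergence = np.array([
    [[0.0, 2.0, 6.0], [2.0, 0.0, 6.0], [6.0, 6.0, 0.0]],
    [[0.0, 4.0, 4.0], [4.0, 0.0, 2.0], [4.0, 2.0, 0.0]],
])
per_window = windowed_relatedness(derived, divergence)
print(per_window.shape)
```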

@petrelharp
Contributor

This was done in #2823; see #1623 for documentation.
