Divmat sample_sets #2823
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2823 +/- ##
==========================================
+ Coverage 89.69% 89.71% +0.02%
==========================================
Files 30 30
Lines 30159 30243 +84
Branches 5860 5883 +23
==========================================
+ Hits 27052 27134 +82
- Misses 1778 1780 +2
Partials 1329 1329
Good call! I've not looked in detail at the C, but I think I've answered the question?
Great, thanks @petrelharp. I'll ping when it's ready for a proper look.
I think this is working now with arbitrary sample_sets in C, @petrelharp (although it has taken an abominably long time to get right!). There's still some work to be done in hardening the Python-C layer, and before I do that I was considering making the interface a bit nicer (it will take a bit more work, of course):

def divergence_matrix(self, ids=None, *, windows=None, ...):

So the main change is that, rather than having `sample_sets` (a list of lists of node IDs), we take a flat list of `ids`. Usage would be something like the sketch below.
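A sketch of the intended usage (the exact example was lost from the thread, so this is a hypothetical reconstruction; `divergence_matrix`, `ids`, and `mode` as proposed above or assumed from the stats API):

```python
# Hypothetical usage sketch of the proposed ids-based interface.
import msprime

ts = msprime.sim_ancestry(5, sequence_length=1e4, random_seed=42)

# ids=None: full matrix over all sample nodes.
D_all = ts.divergence_matrix(mode="branch")

# A flat list of node IDs, rather than a list of sample sets.
D_sub = ts.divergence_matrix([0, 1, 2], mode="branch")
print(D_sub.shape)  # (3, 3)
```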
(force-pushed from 66d8081 to 7c5d1d5)
I went ahead and implemented the `ids` version. It would be good if you could take a look and see what you think of the proposed interface. I guess we should have a go at the GRM next.
I'm a bit confused here - isn't this already how `sample_sets` works?
You're right, it probably is very similar. I guess the difference is that there's no dimension-dropping in the output.
Oh, good point, that's different. Okay, I like the proposal, then - although, instead of `ids`, it might be more natural to call it something else.
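To make the dimension-dropping point concrete, a minimal sketch (assuming the existing stats-API behaviour of `ts.divergence`, and with `divergence_matrix` taking sample sets as proposed in this PR):

```python
import msprime

ts = msprime.sim_ancestry(4, sequence_length=1e4, random_seed=1)
A = ts.samples()[:2]
B = ts.samples()[2:4]

# Stats API: a single pair of indexes drops the output down to a scalar.
d = ts.divergence([A, B], indexes=(0, 1), mode="branch")

# Proposed matrix method: always the full (n, n) array, no dropping.
D = ts.divergence_matrix([A, B], mode="branch")  # shape (2, 2), hypothetical
```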
88f26de
to
24954d9
Compare
Well, the idea was that we then leave the door open for interpreting the IDs as individuals at some point, like `ts.genetic_relatedness_matrix(np.arange(ts.num_individuals), individuals=True)`, but it would be easy to get this wrong, and really the explicit version isn't so bad:
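(A guess at the elided example - hypothetical names, with `genetic_relatedness_matrix` as sketched in this PR and per-individual sets built from `Individual.nodes`:)

```python
# Explicit alternative: build one sample set per individual yourself,
# rather than adding an individuals=True flag to the method.
# Assumes a tree sequence ts whose samples belong to individuals.
sample_sets = [ind.nodes for ind in ts.individuals()]
grm = ts.genetic_relatedness_matrix(sample_sets)
```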
so let's not bother. The simplest thing is to just go ahead with calling it `ids`.
Shall I clean up and merge this much, @petrelharp? There's a partial implementation of the GRM, but we can fill that out in the next updates.
(force-pushed from 9a7da33 to ea8b6ca)
@brieuclehmann - there's a partial implementation of the GRM here that may interest you. @petrelharp - can you have a scan of this when you get a chance, please?
I think CI failures are due to a stale package cache. Simplest thing is to make a new PR from this one - I'll do that after @petrelharp signs off.
I'll have a look!
X = y[:, np.newaxis] + y[np.newaxis, :]
K -= X
# FIXME I don't know what this factor of -2 is about
return K / -2
yeah what the heck - I'm having trouble tracking down where the explanation of this is; where'd you get it from?
Okay, let's see - this must be because relatedness between `I` and `J` is defined in terms of shared alleles as:

E[m(I,J) - m(I,S) - m(J,T) + m(S,T)]

where `m(,)` is the number of shared alleles and `S` and `T` are independently chosen sample sets. So, naively, if `d(I,J)` is the number of differing alleles, and `Z` is the total number of alleles, then

m(I,J) = Z - d(I,J)

and so

m(I,J) - m(I,S) - m(J,T) + m(S,T) = -d(I,J) + d(I,S) + d(J,T) - d(S,T)

However: apparently for relatedness we want `m(I,J)` for sample sets `I` and `J` to be the sum over samples `i in I` and `j in J` of `m(i,j)`, while for genetic divergence we have that `diversity(I,J)` is the average over `diversity(i,j)`.
Hm, now I'm a bit turned around, since in this implementation we're taking the average over samples, as opposed to the sum over samples. This contradicts what the docs for `genetic_relatedness` say, but then again, I think it would be what someone would actually expect to get from using sample sets.
Before I try to figure out what we want to do here further, do you have a reference or previous discussion on this?
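A quick numeric check of the algebra above (a self-contained sketch; the toy genotype matrix is made up):

```python
import numpy as np

# Toy 0/1 genotype matrix: rows are samples, columns are sites.
rng = np.random.default_rng(42)
G = rng.integers(0, 2, size=(8, 100))
Z = G.shape[1]  # total number of alleles

def d(i, j):
    # number of differing alleles between samples i and j
    return np.sum(G[i] != G[j])

def m(i, j):
    # number of shared alleles: m(i, j) = Z - d(i, j)
    return Z - d(i, j)

I, J, S, T = 0, 1, 2, 3
lhs = m(I, J) - m(I, S) - m(J, T) + m(S, T)
rhs = -d(I, J) + d(I, S) + d(J, T) - d(S, T)
assert lhs == rhs  # the Z terms cancel, as claimed
```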
Oh right - so, the argument above explains the `-` in the denominator; I'm not sure about the 2, though.
Well, I just got the 2 by looking at the results and guessing. The implementation is based on this gist, where I went back to first principles on the GRM. I can't remember where I got the normalisation expression from, though, I'm afraid.
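For what it's worth, here is a hedged sketch of one place a factor of -2 like this can come from - this is an assumption, not something established in the thread: if the divergence matrix behaves like a squared Euclidean distance, then `(D - y_i - y_j) / -2` is exactly the double-centering step that recovers an inner-product (GRM-like) matrix.

```python
import numpy as np

# If D_ij = |x_i - x_j|^2 = |x_i|^2 + |x_j|^2 - 2 x_i . x_j, then
# -(D_ij - |x_i|^2 - |x_j|^2) / 2 = x_i . x_j, hence the "/ -2".
rng = np.random.default_rng(1)
x = rng.normal(size=(5, 3))
D = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
y = np.sum(x**2, axis=1)  # |x_i|^2, playing the role of y in the snippet
X = y[:, np.newaxis] + y[np.newaxis, :]
K = (D - X) / -2
assert np.allclose(K, x @ x.T)  # recovers the inner-product matrix
```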
We made a decision to merge this version, logging an issue to track checking that the implementation is correct. I'll update when I get a chance and push forward.
(force-pushed from 54f1fd4 to 64a9528)
(force-pushed from a0aa763 to 71db20d)
So, there is an issue here with non-simple sample sets, @petrelharp, but hopefully it's a fairly easy one to resolve. I've opened #2888 to track it. I suggest we merge this PR, as there's a bunch of stuff in here and it would be good to get it in.
As discussed, not totally right yet but we're merging anyhow, to fix things up after.
I started adding support for `individuals` to the divmat code, but realised that it would actually be a lot easier if we just used a `sample_sets` argument, like the rest of the stats API methods. Certainly for the low-level code, it would be simplest to just implement it in terms of sample_sets, and let the higher-level code do the parametrisation. This is the first step towards that goal.
@petrelharp would you mind taking a quick look through to see if I've got the right end of the stick here in terms of definitions?
I think what I have for the main code (plus some post-processing in numpy) computes the same value as the stats API. Am I right about the division by the `count` matrix afterwards? (I.e., it's not the sum of the divergences between nodes in those sample sets, but their mean?) See the sketch below.
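A minimal sketch of that post-processing as described (hypothetical helper; `node_divmat` stands in for the low-level node-by-node output):

```python
import numpy as np

def sample_set_divergence(node_divmat, sample_sets):
    # Block-sum the node-by-node divergence matrix over sample sets,
    # then divide by the count matrix so each entry is a *mean*,
    # not a sum, of the pairwise node divergences.
    n = len(sample_sets)
    total = np.zeros((n, n))
    count = np.zeros((n, n))
    for a, A in enumerate(sample_sets):
        for b, B in enumerate(sample_sets):
            block = node_divmat[np.ix_(A, B)]
            total[a, b] = block.sum()
            count[a, b] = block.size
    return total / count
```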