Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Divmat sample_sets #2823

Merged
merged 2 commits into from
Jan 16, 2024
Merged

Conversation

jeromekelleher
Copy link
Member

I started adding support for individuals to the divmat code, but realised that it would actually be a lot easier if we just used a sample_sets argument, like the rest of the stats API methods. Certainly for the low-level code, it would be simplest to just implement it in terms of sample_sets, and let the higher level code do the parametrisation.

This is the first steps towards that goal.

@petrelharp would you mind taking a quick look through to see if I've got the right end of the stick here in terms of definitions?

I think what I have for the main code (some post-processing by numpy) computes the same value as the stats API. Am I right in the division by the count matrix afterwards? (I.e., it's not the sum of the divergences between nodes in those sample sets, but their mean?)

@codecov
Copy link

codecov bot commented Aug 16, 2023

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (77faade) 89.69% compared to head (71db20d) 89.71%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2823      +/-   ##
==========================================
+ Coverage   89.69%   89.71%   +0.02%     
==========================================
  Files          30       30              
  Lines       30159    30243      +84     
  Branches     5860     5883      +23     
==========================================
+ Hits        27052    27134      +82     
- Misses       1778     1780       +2     
  Partials     1329     1329              
Flag Coverage Δ
c-tests 86.11% <94.89%> (+0.01%) ⬆️
lwt-tests 80.78% <ø> (ø)
python-c-tests 67.73% <40.84%> (-0.17%) ⬇️
python-tests 98.94% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
python/_tskitmodule.c 88.65% <100.00%> (+<0.01%) ⬆️
python/tskit/trees.py 98.70% <100.00%> (+0.07%) ⬆️
c/tskit/trees.c 90.18% <94.89%> (+<0.01%) ⬆️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 77faade...71db20d. Read the comment docs.

python/tskit/trees.py Outdated Show resolved Hide resolved
@petrelharp
Copy link
Contributor

Good call! I've not looked in detail at the C, but I think answered the question?

@jeromekelleher
Copy link
Member Author

Great, thanks @petrelharp. I'll ping when it's ready for a proper look.

@jeromekelleher
Copy link
Member Author

I think this is working now with arbitrary sample_sets in C, @petrelharp (although it has taken an abominably long time to get right!). There's still some work to be done in hardening the Python-C layer, and before I do I was considering making the interface a bit nicer (but will take a bit more work, of course)

def divergance_matrix(self, ids=None, *, windows=None, ...):
    

usage:

ts.divergance_matrix() -> n x n divmat of all sample nodes
ts.divergance_matrix([ind.nodes for ind in ts.individuals()]) -> k x k divmat of all individuals
ts.divergance_matrix([ts.nodes(population=pop) for pop in range(ts.num_populations]) -> pop x pop

So, the ids argument is either a 1D list, or a list of lists. 1D list is interpreted as [[u] for u in ids].

I was considering also adding a individuals=False argument so that you could ask the method to interpret the IDs as individual instead of node IDs. I wonder if this is really worth the bother, though.

So the main change is, rather than having samples and sample_sets, we have this positional argument ids which checks the dimensions of the input array, and acts accordingly. What do you think?

@jeromekelleher
Copy link
Member Author

I went a ahead and implemented the ids= argument @petrelharp, which seems to work reasonably well. I also did a quick version of genetic_relatedness_matrix just to see if this all really works, and it seems to 🎉

It would be good if you could take a look and see what you think of the proposed interface. I guess we should have a go at the individuals switch to see if that's works out, before finalising?

@petrelharp
Copy link
Contributor

So the main change is, rather than having samples and sample_sets, we have this positional argument ids which checks the dimensions of the input array, and acts accordingly. What do you think?

I'm a bit confused here - isn't this already how sample_sets works?

@jeromekelleher
Copy link
Member Author

You're right, it probably is very similar. I guess the difference is that there's no dimension-dropping in the output.

@petrelharp
Copy link
Contributor

petrelharp commented Aug 25, 2023

the difference is that there's no dimension-dropping in the output.

Oh, good point, that's different. Okay, I like the proposal, then, although how about instead of ids it should be node_ids? Or even nodes? Otherwise ids is very similar to ind(ividual)s, which is a serious source of error for the unwary.

It might be more natural to call it sample_ids? Or just use sample_sets? (but then it wouldn't be the first argument)

@jeromekelleher
Copy link
Member Author

Oh, good point, that's different. Okay, I like the proposal, then, although how about instead of ids it should be node_ids? Or even nodes? Otherwise ids is very similar to ind(ividual)s, which is a serious source of error for the unwary.

Well the idea was that we then leave the door open for interpreting the IDs as individuals at some point, like

ts.genetic_related_matrix(np.arange(ts.num_individuals), individuals=True)

but, it would be easy to get this wrong and really this isn't so bad:

ts.genetic_related_matrix([ind.nodes for ind in ts.individuals])

so let's not bother.

The simplest thing is to just go head with calling it sample_sets, so lets change back to that.

@jeromekelleher jeromekelleher marked this pull request as ready for review September 4, 2023 14:58
@jeromekelleher
Copy link
Member Author

Shall I clean up and merge this much @petrelharp? there's a partial implementation of GRM, but we can fill that out in the next updates. I

@jeromekelleher
Copy link
Member Author

@brieuclehmann the implementation of genetic_relatedness_matrix should be ready for action now. I think we can probably clean up this PR and merge soon, and track an issue for docutation to the next release.

@petrelharp - can you have a scan of this when you get a chance please?

@jeromekelleher
Copy link
Member Author

I think CI failures are due to a stale package cache. Simplest thing is to make a new PR from this one - I'll do that after @petrelharp signs off

@petrelharp
Copy link
Contributor

I'll have a look!

X = y[:, np.newaxis] + y[np.newaxis, :]
K -= X
# FIXME I don't know what this factor of -2 is about
return K / -2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah what the heck - I'm having trouble tracking down where the explanation of this is; where'd you get it from?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, let's see - this must be because relatedness between I and J is defined in terms of shared alleles as:

 E[m(I,J) - m(I,S) - m(J,T) + m(S,T)]

where m(,) is the number of shared alleles and S and T are independently chosen sample sets.

So, naively, if d(I,J) is the number of differing alleles, and Z is the total number of alleles, then

 m(I,J) = Z - d(I,J)

and so

m(I,J) - m(I,S) - m(J,T) + m(S,T) = - d(I,J) + d(I,S) + d(J,T) - d(S,T)

However: apparently for relatedness we want m(I,J) for sample sets I and J to be the sum over samples i in I and j in J of m(i,j), while for genetic divergence we have that diversity(I,J) is the average over diversity(i,j).

Hm, now I'm a bit turned around, since in this implementation, we're taking the average over samples as opposed the sum over samples. This contradicts what the docs for genetic_relatedness say , but then again, I think would be what someone would actually expect to get from using sample sets.

Before I try to figure out what we want to do here more, do you have a reference of previous discussion on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right - so, the argument above explains the - in the denominator; I'm not sure about the 2, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I just got the 2 by looking at the results and guessing. The implementation is based on this gist where I went back to first principles on the GRM. I can't remember where I got the normalisation expression from though I'm afraid.

@jeromekelleher
Copy link
Member Author

We made a decision to merge this version, logging an issue to track making sure that the implementation is correct. I'll update when I get a chance and push forward.

@jeromekelleher
Copy link
Member Author

So there is an issue here with non-simple sample sets @petrelharp, but hopefully it's a fairly easy one to resolve. I've opened #2888 to track. I suggest we merge this PR as there's a bunch of stuff in here and would be good to get it in.

Copy link
Contributor

@petrelharp petrelharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, not totally right yet but we're merging anyhow, to fix things up after.

@jeromekelleher jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jan 16, 2024
@mergify mergify bot merged commit 2dae133 into tskit-dev:main Jan 16, 2024
23 checks passed
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jan 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants