Divmat sample_sets #2823

jeromekelleher · 2023-08-16T14:48:04Z

I started adding support for individuals to the divmat code, but realised that it would actually be a lot easier if we just used a sample_sets argument, like the rest of the stats API methods. Certainly for the low-level code, it would be simplest to just implement it in terms of sample_sets, and let the higher level code do the parametrisation.

This is the first steps towards that goal.

@petrelharp would you mind taking a quick look through to see if I've got the right end of the stick here in terms of definitions?

I think what I have for the main code (some post-processing by numpy) computes the same value as the stats API. Am I right in the division by the count matrix afterwards? (I.e., it's not the sum of the divergences between nodes in those sample sets, but their mean?)

codecov · 2023-08-16T15:16:43Z

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (77faade) 89.69% compared to head (71db20d) 89.71%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2823      +/-   ##
==========================================
+ Coverage   89.69%   89.71%   +0.02%     
==========================================
  Files          30       30              
  Lines       30159    30243      +84     
  Branches     5860     5883      +23     
==========================================
+ Hits        27052    27134      +82     
- Misses       1778     1780       +2     
  Partials     1329     1329

Flag	Coverage Δ
c-tests	`86.11% <94.89%> (+0.01%)`	⬆️
lwt-tests	`80.78% <ø> (ø)`
python-c-tests	`67.73% <40.84%> (-0.17%)`	⬇️
python-tests	`98.94% <100.00%> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
python/_tskitmodule.c	`88.65% <100.00%> (+<0.01%)`	⬆️
python/tskit/trees.py	`98.70% <100.00%> (+0.07%)`	⬆️
c/tskit/trees.c	`90.18% <94.89%> (+<0.01%)`	⬆️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 77faade...71db20d. Read the comment docs.

python/tskit/trees.py

petrelharp · 2023-08-17T05:12:42Z

Good call! I've not looked in detail at the C, but I think answered the question?

jeromekelleher · 2023-08-17T06:54:43Z

Great, thanks @petrelharp. I'll ping when it's ready for a proper look.

jeromekelleher · 2023-08-22T11:57:12Z

I think this is working now with arbitrary sample_sets in C, @petrelharp (although it has taken an abominably long time to get right!). There's still some work to be done in hardening the Python-C layer, and before I do I was considering making the interface a bit nicer (but will take a bit more work, of course)

def divergance_matrix(self, ids=None, *, windows=None, ...):

usage:

ts.divergance_matrix() -> n x n divmat of all sample nodes
ts.divergance_matrix([ind.nodes for ind in ts.individuals()]) -> k x k divmat of all individuals
ts.divergance_matrix([ts.nodes(population=pop) for pop in range(ts.num_populations]) -> pop x pop

So, the ids argument is either a 1D list, or a list of lists. 1D list is interpreted as [[u] for u in ids].

I was considering also adding a individuals=False argument so that you could ask the method to interpret the IDs as individual instead of node IDs. I wonder if this is really worth the bother, though.

So the main change is, rather than having samples and sample_sets, we have this positional argument ids which checks the dimensions of the input array, and acts accordingly. What do you think?

jeromekelleher · 2023-08-23T18:22:44Z

I went a ahead and implemented the ids= argument @petrelharp, which seems to work reasonably well. I also did a quick version of genetic_relatedness_matrix just to see if this all really works, and it seems to 🎉

It would be good if you could take a look and see what you think of the proposed interface. I guess we should have a go at the individuals switch to see if that's works out, before finalising?

petrelharp · 2023-08-24T16:52:44Z

So the main change is, rather than having samples and sample_sets, we have this positional argument ids which checks the dimensions of the input array, and acts accordingly. What do you think?

I'm a bit confused here - isn't this already how sample_sets works?

jeromekelleher · 2023-08-25T08:47:28Z

You're right, it probably is very similar. I guess the difference is that there's no dimension-dropping in the output.

petrelharp · 2023-08-25T15:20:03Z

the difference is that there's no dimension-dropping in the output.

Oh, good point, that's different. Okay, I like the proposal, then, although how about instead of ids it should be node_ids? Or even nodes? Otherwise ids is very similar to ind(ividual)s, which is a serious source of error for the unwary.

It might be more natural to call it sample_ids? Or just use sample_sets? (but then it wouldn't be the first argument)

jeromekelleher · 2023-09-04T14:49:16Z

Oh, good point, that's different. Okay, I like the proposal, then, although how about instead of ids it should be node_ids? Or even nodes? Otherwise ids is very similar to ind(ividual)s, which is a serious source of error for the unwary.

Well the idea was that we then leave the door open for interpreting the IDs as individuals at some point, like

ts.genetic_related_matrix(np.arange(ts.num_individuals), individuals=True)

but, it would be easy to get this wrong and really this isn't so bad:

ts.genetic_related_matrix([ind.nodes for ind in ts.individuals])

so let's not bother.

The simplest thing is to just go head with calling it sample_sets, so lets change back to that.

jeromekelleher · 2023-09-04T14:59:02Z

Shall I clean up and merge this much @petrelharp? there's a partial implementation of GRM, but we can fill that out in the next updates. I

jeromekelleher · 2023-12-01T10:28:01Z

@brieuclehmann the implementation of genetic_relatedness_matrix should be ready for action now. I think we can probably clean up this PR and merge soon, and track an issue for docutation to the next release.

@petrelharp - can you have a scan of this when you get a chance please?

jeromekelleher · 2023-12-01T12:55:29Z

I think CI failures are due to a stale package cache. Simplest thing is to make a new PR from this one - I'll do that after @petrelharp signs off

petrelharp · 2023-12-04T01:49:19Z

I'll have a look!

petrelharp · 2023-12-04T03:57:39Z

python/tskit/trees.py

+            X = y[:, np.newaxis] + y[np.newaxis, :]
+            K -= X
+            # FIXME I don't know what this factor of -2 is about
+            return K / -2


yeah what the heck - I'm having trouble tracking down where the explanation of this is; where'd you get it from?

Okay, let's see - this must be because relatedness between I and J is defined in terms of shared alleles as:

E[m(I,J) - m(I,S) - m(J,T) + m(S,T)]

where m(,) is the number of shared alleles and S and T are independently chosen sample sets.

So, naively, if d(I,J) is the number of differing alleles, and Z is the total number of alleles, then

m(I,J) = Z - d(I,J)

and so

m(I,J) - m(I,S) - m(J,T) + m(S,T) = - d(I,J) + d(I,S) + d(J,T) - d(S,T)

However: apparently for relatedness we want m(I,J) for sample sets I and J to be the sum over samples i in I and j in J of m(i,j), while for genetic divergence we have that diversity(I,J) is the average over diversity(i,j).

Hm, now I'm a bit turned around, since in this implementation, we're taking the average over samples as opposed the sum over samples. This contradicts what the docs for genetic_relatedness say , but then again, I think would be what someone would actually expect to get from using sample sets.

Before I try to figure out what we want to do here more, do you have a reference of previous discussion on this?

Oh right - so, the argument above explains the - in the denominator; I'm not sure about the 2, though.

Well I just got the 2 by looking at the results and guessing. The implementation is based on this gist where I went back to first principles on the GRM. I can't remember where I got the normalisation expression from though I'm afraid.

jeromekelleher · 2024-01-11T17:08:16Z

We made a decision to merge this version, logging an issue to track making sure that the implementation is correct. I'll update when I get a chance and push forward.

jeromekelleher · 2024-01-16T16:12:06Z

So there is an issue here with non-simple sample sets @petrelharp, but hopefully it's a fairly easy one to resolve. I've opened #2888 to track. I suggest we merge this PR as there's a bunch of stuff in here and would be good to get it in.

petrelharp

As discussed, not totally right yet but we're merging anyhow, to fix things up after.

petrelharp reviewed Aug 17, 2023

View reviewed changes

python/tskit/trees.py Outdated Show resolved Hide resolved

jeromekelleher force-pushed the divmat-individuals branch from 66d8081 to 7c5d1d5 Compare August 23, 2023 14:53

jeromekelleher force-pushed the divmat-individuals branch from 88f26de to 24954d9 Compare September 4, 2023 14:42

jeromekelleher marked this pull request as ready for review September 4, 2023 14:58

jeromekelleher force-pushed the divmat-individuals branch from 9a7da33 to ea8b6ca Compare November 30, 2023 16:42

petrelharp reviewed Dec 4, 2023

View reviewed changes

jeromekelleher force-pushed the divmat-individuals branch from 54f1fd4 to 64a9528 Compare January 15, 2024 11:51

jeromekelleher mentioned this pull request Jan 16, 2024

Sample_sets incorrect in genetic_relatedness_matrix #2888

Closed

jeromekelleher added 2 commits January 16, 2024 16:08

Support sample sets in divmat

163b86f

Add check for non-single sample sets and disable tests

71db20d

jeromekelleher force-pushed the divmat-individuals branch from a0aa763 to 71db20d Compare January 16, 2024 16:08

petrelharp approved these changes Jan 16, 2024

View reviewed changes

jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jan 16, 2024

mergify bot merged commit 2dae133 into tskit-dev:main Jan 16, 2024
23 checks passed

mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jan 16, 2024

petrelharp mentioned this pull request Sep 25, 2024

Use divergence_matrix for downstream statistics #2783

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Divmat sample_sets #2823

Divmat sample_sets #2823

jeromekelleher commented Aug 16, 2023

codecov bot commented Aug 16, 2023 •

edited

Loading

petrelharp commented Aug 17, 2023

jeromekelleher commented Aug 17, 2023

jeromekelleher commented Aug 22, 2023

jeromekelleher commented Aug 23, 2023

petrelharp commented Aug 24, 2023

jeromekelleher commented Aug 25, 2023

petrelharp commented Aug 25, 2023 •

edited

Loading

jeromekelleher commented Sep 4, 2023

jeromekelleher commented Sep 4, 2023

jeromekelleher commented Dec 1, 2023

jeromekelleher commented Dec 1, 2023

petrelharp commented Dec 4, 2023

petrelharp Dec 4, 2023

petrelharp Dec 4, 2023

petrelharp Dec 4, 2023

jeromekelleher Dec 4, 2023

jeromekelleher commented Jan 11, 2024

jeromekelleher commented Jan 16, 2024

petrelharp left a comment

Divmat sample_sets #2823

Divmat sample_sets #2823

Conversation

jeromekelleher commented Aug 16, 2023

codecov bot commented Aug 16, 2023 • edited Loading

Codecov Report

petrelharp commented Aug 17, 2023

jeromekelleher commented Aug 17, 2023

jeromekelleher commented Aug 22, 2023

jeromekelleher commented Aug 23, 2023

petrelharp commented Aug 24, 2023

jeromekelleher commented Aug 25, 2023

petrelharp commented Aug 25, 2023 • edited Loading

jeromekelleher commented Sep 4, 2023

jeromekelleher commented Sep 4, 2023

jeromekelleher commented Dec 1, 2023

jeromekelleher commented Dec 1, 2023

petrelharp commented Dec 4, 2023

petrelharp Dec 4, 2023

Choose a reason for hiding this comment

petrelharp Dec 4, 2023

Choose a reason for hiding this comment

petrelharp Dec 4, 2023

Choose a reason for hiding this comment

jeromekelleher Dec 4, 2023

Choose a reason for hiding this comment

jeromekelleher commented Jan 11, 2024

jeromekelleher commented Jan 16, 2024

petrelharp left a comment

Choose a reason for hiding this comment

codecov bot commented Aug 16, 2023 •

edited

Loading

petrelharp commented Aug 25, 2023 •

edited

Loading