Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate AFS bias due to polytomies #933

Open
hyanwong opened this issue Jun 11, 2024 · 3 comments
Open

Investigate AFS bias due to polytomies #933

hyanwong opened this issue Jun 11, 2024 · 3 comments

Comments

@hyanwong
Copy link
Member

hyanwong commented Jun 11, 2024

@YunDeng98 pointed out an interesting feature of the branch length AFS in tsinfer tree sequences: there seems to be a bit of a bias towards shorter branches at medium frequencies (and maybe slightly longer ones at low frequencies):

import msprime
import tsinfer
import tsdate
from matplotlib import pyplot as plt

ts = msprime.sim_ancestry(50, population_size=2e4, sequence_length=5e6, recombination_rate=1e-8, random_seed=1)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)
print(f"Simulated {ts.num_sites} sites")
sts = tsinfer.infer(tsinfer.SampleData.from_tree_sequence(ts))
sts = tsdate.preprocess_ts(sts)
sts = tsdate.date(sts, mutation_rate=1e-8)
plt.plot(np.arange(ts.num_samples), ts.allele_frequency_spectrum(mode="branch", polarised=True)[1:], label="true branch AFS")
plt.plot(np.arange(sts.num_samples), sts.allele_frequency_spectrum(mode="branch", polarised=True)[1:], label="tsinfer + tsdate branch AFS")
plt.yscale("log")
plt.legend()
plt.show();

download

This is almost certainly due to a combination of the polytomies created by tsinfer with the effect of the tsdate rescaling algorithm (see follow-up below). But it's unclear as to what that would affect different frequency classes differently. It could, perhaps be due to how we build ancestors for medium-frequency mutations. This might be an interesting (longer term) thing to investigate when revisiting the ancestor building algorithm, if we think that's the source of this bias.

It's a bit less biased when using the true site times, but it's still the same basic pattern.

@hyanwong
Copy link
Member Author

hyanwong commented Jun 11, 2024

This is almost certainly due to the polytomies in tsinfer.

Some of it may be due to that, but there is still a bias if we use the original (true) tree sequence and simply make polytomies wherever an edge has no mutational support (using tskit-dev/tskit#2926):

del_by_interval = unsupported_edges(ts, per_interval=True)
poly_ts_by_interval = remove_edges(ts, del_by_interval).simplify()
dts = tsdate.date(poly_ts_by_interval, mutation_rate=1e-8)
plt.plot(np.arange(ts.num_samples), ts.allele_frequency_spectrum(mode="branch", polarised=True)[1:], label="true branch AFS")
plt.plot(np.arange(sts.num_samples), dts.allele_frequency_spectrum(mode="branch", polarised=True)[1:], label="dated true branch AFS (polytomies)")
plt.yscale("log")
plt.legend()
plt.show();

screenshot_2024-06-11_at_18 50 51_720

Note the weirdly long low freq branches in the polytomy-induced tree sequence, which are also there if you simply plot the AFS from the poly_ts_by_interval tree sequence.

@hyanwong
Copy link
Member Author

Note the weirdly long low freq branches in the polytomy-induced tree sequence

If we remove the rescaling from tsdate, we get the same low freq pattern on tsinferred tree sequences:

sts = tsdate.preprocess_ts(sts)
sts = tsdate.date(sts, mutation_rate=1e-8, rescaling_intervals=0)
plt.plot(np.arange(ts.num_samples), ts.allele_frequency_spectrum(mode="branch", polarised=True)[1:], label="true branch AFS")
plt.plot(np.arange(sts.num_samples), sts.allele_frequency_spectrum(mode="branch", polarised=True)[1:], label="tsinfer + tsdate (no rescaling) branch AFS")
plt.yscale("log")
plt.legend()
plt.show();

download

@nspope
Copy link

nspope commented Jun 11, 2024

When you've introduced polytomies to a binary treeseq, you're going to either get biased dates (if you rescale to match total segregating sites in time intervals) or you're going to get a biased SFS (if you rescale to match leaf-to-root area). Some bias in the SFS might be unavoidable no matter how you scale, because edges with different numbers of descendants overlap within the same time interval (and the distribution of this will change in a not-straightforward way depending on polytomies).

I'm not sure that you can ever expect tsinfer to give a perfect match to the SFS, given the polytomies. It seems like the thing to do is look at summary statistics that are invariant to the presence of polytomies; like path length.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants