Investigate AFS bias due to polytomies #933

hyanwong · 2024-06-11T17:17:26Z

@YunDeng98 pointed out an interesting feature of the branch length AFS in tsinfer tree sequences: there seems to be a bit of a bias towards shorter branches at medium frequencies (and maybe slightly longer ones at low frequencies):

import msprime
import tsinfer
import tsdate
from matplotlib import pyplot as plt

ts = msprime.sim_ancestry(50, population_size=2e4, sequence_length=5e6, recombination_rate=1e-8, random_seed=1)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=1)
print(f"Simulated {ts.num_sites} sites")
sts = tsinfer.infer(tsinfer.SampleData.from_tree_sequence(ts))
sts = tsdate.preprocess_ts(sts)
sts = tsdate.date(sts, mutation_rate=1e-8)
plt.plot(np.arange(ts.num_samples), ts.allele_frequency_spectrum(mode="branch", polarised=True)[1:], label="true branch AFS")
plt.plot(np.arange(sts.num_samples), sts.allele_frequency_spectrum(mode="branch", polarised=True)[1:], label="tsinfer + tsdate branch AFS")
plt.yscale("log")
plt.legend()
plt.show();

This is almost certainly due to a combination of the polytomies created by tsinfer with the effect of the tsdate rescaling algorithm (see follow-up below). But it's unclear as to what that would affect different frequency classes differently. It could, perhaps be due to how we build ancestors for medium-frequency mutations. This might be an interesting (longer term) thing to investigate when revisiting the ancestor building algorithm, if we think that's the source of this bias.

It's a bit less biased when using the true site times, but it's still the same basic pattern.

hyanwong · 2024-06-11T17:52:04Z

This is almost certainly due to the polytomies in tsinfer.

Some of it may be due to that, but there is still a bias if we use the original (true) tree sequence and simply make polytomies wherever an edge has no mutational support (using tskit-dev/tskit#2926):

del_by_interval = unsupported_edges(ts, per_interval=True)
poly_ts_by_interval = remove_edges(ts, del_by_interval).simplify()
dts = tsdate.date(poly_ts_by_interval, mutation_rate=1e-8)
plt.plot(np.arange(ts.num_samples), ts.allele_frequency_spectrum(mode="branch", polarised=True)[1:], label="true branch AFS")
plt.plot(np.arange(sts.num_samples), dts.allele_frequency_spectrum(mode="branch", polarised=True)[1:], label="dated true branch AFS (polytomies)")
plt.yscale("log")
plt.legend()
plt.show();

Note the weirdly long low freq branches in the polytomy-induced tree sequence, which are also there if you simply plot the AFS from the poly_ts_by_interval tree sequence.

hyanwong · 2024-06-11T18:18:37Z

Note the weirdly long low freq branches in the polytomy-induced tree sequence

If we remove the rescaling from tsdate, we get the same low freq pattern on tsinferred tree sequences:

sts = tsdate.preprocess_ts(sts)
sts = tsdate.date(sts, mutation_rate=1e-8, rescaling_intervals=0)
plt.plot(np.arange(ts.num_samples), ts.allele_frequency_spectrum(mode="branch", polarised=True)[1:], label="true branch AFS")
plt.plot(np.arange(sts.num_samples), sts.allele_frequency_spectrum(mode="branch", polarised=True)[1:], label="tsinfer + tsdate (no rescaling) branch AFS")
plt.yscale("log")
plt.legend()
plt.show();

nspope · 2024-06-11T18:58:41Z

When you've introduced polytomies to a binary treeseq, you're going to either get biased dates (if you rescale to match total segregating sites in time intervals) or you're going to get a biased SFS (if you rescale to match leaf-to-root area). Some bias in the SFS might be unavoidable no matter how you scale, because edges with different numbers of descendants overlap within the same time interval (and the distribution of this will change in a not-straightforward way depending on polytomies).

I'm not sure that you can ever expect tsinfer to give a perfect match to the SFS, given the polytomies. It seems like the thing to do is look at summary statistics that are invariant to the presence of polytomies; like path length.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate AFS bias due to polytomies #933

Investigate AFS bias due to polytomies #933

hyanwong commented Jun 11, 2024 •

edited

Loading

hyanwong commented Jun 11, 2024 •

edited

Loading

hyanwong commented Jun 11, 2024

nspope commented Jun 11, 2024 •

edited

Loading

Investigate AFS bias due to polytomies #933

Investigate AFS bias due to polytomies #933

Comments

hyanwong commented Jun 11, 2024 • edited Loading

hyanwong commented Jun 11, 2024 • edited Loading

hyanwong commented Jun 11, 2024

nspope commented Jun 11, 2024 • edited Loading

hyanwong commented Jun 11, 2024 •

edited

Loading

hyanwong commented Jun 11, 2024 •

edited

Loading

nspope commented Jun 11, 2024 •

edited

Loading