-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Investigate AFS bias due to polytomies #933
Comments
Some of it may be due to that, but there is still a bias if we use the original (true) tree sequence and simply make polytomies wherever an edge has no mutational support (using tskit-dev/tskit#2926): del_by_interval = unsupported_edges(ts, per_interval=True)
poly_ts_by_interval = remove_edges(ts, del_by_interval).simplify()
dts = tsdate.date(poly_ts_by_interval, mutation_rate=1e-8)
plt.plot(np.arange(ts.num_samples), ts.allele_frequency_spectrum(mode="branch", polarised=True)[1:], label="true branch AFS")
plt.plot(np.arange(sts.num_samples), dts.allele_frequency_spectrum(mode="branch", polarised=True)[1:], label="dated true branch AFS (polytomies)")
plt.yscale("log")
plt.legend()
plt.show(); Note the weirdly long low freq branches in the polytomy-induced tree sequence, which are also there if you simply plot the AFS from the poly_ts_by_interval tree sequence. |
If we remove the rescaling from tsdate, we get the same low freq pattern on tsinferred tree sequences: sts = tsdate.preprocess_ts(sts)
sts = tsdate.date(sts, mutation_rate=1e-8, rescaling_intervals=0)
plt.plot(np.arange(ts.num_samples), ts.allele_frequency_spectrum(mode="branch", polarised=True)[1:], label="true branch AFS")
plt.plot(np.arange(sts.num_samples), sts.allele_frequency_spectrum(mode="branch", polarised=True)[1:], label="tsinfer + tsdate (no rescaling) branch AFS")
plt.yscale("log")
plt.legend()
plt.show(); |
When you've introduced polytomies to a binary treeseq, you're going to either get biased dates (if you rescale to match total segregating sites in time intervals) or you're going to get a biased SFS (if you rescale to match leaf-to-root area). Some bias in the SFS might be unavoidable no matter how you scale, because edges with different numbers of descendants overlap within the same time interval (and the distribution of this will change in a not-straightforward way depending on polytomies). I'm not sure that you can ever expect tsinfer to give a perfect match to the SFS, given the polytomies. It seems like the thing to do is look at summary statistics that are invariant to the presence of polytomies; like path length. |
@YunDeng98 pointed out an interesting feature of the branch length AFS in tsinfer tree sequences: there seems to be a bit of a bias towards shorter branches at medium frequencies (and maybe slightly longer ones at low frequencies):
This is almost certainly due to a combination of the polytomies created by tsinfer with the effect of the tsdate rescaling algorithm (see follow-up below). But it's unclear as to what that would affect different frequency classes differently. It could, perhaps be due to how we build ancestors for medium-frequency mutations. This might be an interesting (longer term) thing to investigate when revisiting the ancestor building algorithm, if we think that's the source of this bias.
It's a bit less biased when using the true site times, but it's still the same basic pattern.
The text was updated successfully, but these errors were encountered: