Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drift trajectory implementation #1803

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
118 changes: 118 additions & 0 deletions docs/ancestry.md
Original file line number Diff line number Diff line change
Expand Up @@ -90,6 +90,8 @@ this.
{class}`.SweepGenicSelection`
: Selective sweep at a linked locus

{class}`.NeutralFixation`
: a neutral fixation at a linked locus
---


Expand Down Expand Up @@ -2274,6 +2276,122 @@ multiple models. See the {ref}`sec_logging` section for more
details.
:::

(sec_ancestry_models_neutral_fixations)=

### Neutral Fixations

The same sort of structured coalescent that we used above to model
selective sweeps can also be used to model the coalescent conditional
on the fixation of a selectively neutral mutation, what we call in
msprime the {class}`.NeutralFixation` model.

As the {class}`.SweepGenicSelection` model,
the population is split between two classes of selective backgrounds,
those linked to the derived neutral allele, call it {math}`B`, and those not,
{math}`b`. As with the {class}`.SweepGenicSelection` model,
the user supplies a final allele frequency and a starting
allele frequency, between which msprime simulates a stochastic
drift trajectory according to a conditional diffusion model that
conditions on eventual loss of the allele (looking backwards in time).

Beyond the start and end frequencies of the sweep trajectory, the user
must specify the position along
the chromosome where the focal allele occurs.
All other parameters can be set as usual.

:::{warning}
The implementation of the structured coalescent is quite general, but there are
some limitations in the current implementation of the sweeps model (e.g., no
change of population size during a sweep). Please let us know if there are
related features you would like to see (or if you would be interested in
helping to create this functionality!)
:::
#### An example of a neutral fixation simulation

First we set up some basic parameters, and define the sweep model:

```{code-cell}
Ne = 1e3
L = 1e6 # Length of simulated region
num_reps = 100

# define hard sweep model
nf_model = msprime.NeutralFixation(
position=L / 2, # middle of chrom
start_frequency=1.0 / (2 * Ne),
end_frequency=1.0 - (1.0 / (2 * Ne)),
dt=1e-6,
)
```
The start and end frequencies here specify that we will look at
the complete sojourn of the neutral fixation, from frequency
1/2N to frequency 1 looking forward in time.

Next we set up the replicate simulations. As described
in the {ref}`sec_ancestry_models_specifying_multiple` section,
the ``model`` parameter is a list when we want to run multiple
ancestry models. As in the hard sweep example above,
we here have a neutral fixation that occurs
in the immediate past, and is then followed by the
standard coalescent for the rest of time:

```{code-cell}
reps = msprime.sim_ancestry(
5,
model=[nf_model, msprime.StandardCoalescent()],
population_size=Ne,
recombination_rate=1e-7,
sequence_length=L,
num_replicates=num_reps,
)
```

:::{note}
Because the {class}`.NeutralFixation` model has a random
{ref}`duration<sec_ancestry_models_specifying_duration>` we do
not set this value in the model. Please see the
{ref}`sec_ancestry_models_specifying_completion` section for
more discussion on this important point.
:::

Once we've set up the replicate simulations we can compute the
windows for plotting, run the actual simulations
(see the {ref}`sec_randomness_replication` section for more details)
and compute the
{meth}`pairwise diversity<tskit.TreeSequence.diversity>`.

```{code-cell}
wins = np.linspace(0, L, 21)
mids = (wins[1:] + wins[:-1]) / 2
diversity = np.zeros((num_reps, mids.shape[0]))
for j, ts in enumerate(reps):
diversity[j] = ts.diversity(windows=wins, mode="branch")
```

Finally, we can plot the observed mean diversity across the replicates
and compare it to the neutral expectation:

```{code-cell}
from matplotlib import pyplot as plt

plt.plot(mids, diversity.mean(axis=0), label="Simulations")
plt.axhline(4 * Ne, linestyle=":", label=r'Neutral expectation')
plt.ylabel(r'Branch $\pi$');
plt.xlabel('Position (bp)')
plt.legend();
```

:::{note}
We use the "branch" measure of pairwise diversity which is
defined in terms of the trees rather than nucleotide sequences. See
[Ralph et al. 2020](https://doi.org/10.1534/genetics.120.303253)
for more information.
:::

As we can see, the neutral fixation has reduced variation in the region
most very closely linked to the focal allele but not in nearly as
dramatic a fashion as the hard sweeps we simulated above.

(sec_ancestry_missing_data)=

## Missing data
Expand Down
5 changes: 5 additions & 0 deletions docs/api.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ for discussion and examples of individual features.
BetaCoalescent
DiracCoalescent
SweepGenicSelection
NeutralFixation
```

### Mutations
Expand Down Expand Up @@ -115,6 +116,10 @@ for discussion and examples of individual features.
.. autoclass:: msprime.SweepGenicSelection
```

```{eval-rst}
.. autoclass:: msprime.NeutralFixation
```

### Mutations


Expand Down
24 changes: 19 additions & 5 deletions lib/msprime.c
Original file line number Diff line number Diff line change
Expand Up @@ -7004,7 +7004,7 @@ msp_beta_common_ancestor_event(msp_t *self, population_id_t pop_id, label_id_t l
}

/**************************************************************
* Allele frequency trajectory simulation for genic selection
* Allele frequency trajectory simulation for sweep routines
*
**************************************************************/

Expand All @@ -7018,6 +7018,16 @@ genic_selection_stochastic_forwards(double dt, double freq, double alpha, double
return freq + (ux * dt) + sign * sqrt(freq * (1.0 - freq) * dt);
}

static double
neutral_stochastic_backwards(double dt, double freq, double u)
{
/* returns allele freq of neutral allele drifting conditional
on loss (looking backwards in time) following Ewens */
double ux = -1.0 * freq;
int sign = u < 0.5 ? 1 : -1;
return freq + (ux * dt) + sign * sqrt(freq * (1.0 - freq) * dt);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we think of reasonable ways to handle the possibility of numeric errors here? It seems that there are no restrictions placed on the input values, so one can get negative frequencies out:

In [1]: dt = 1e-8

In [2]: freq = 1e-8

In [3]: ux = -freq

In [4]: sign = -1

In [5]: import math

In [7]: freq + (ux * dt) + sign * math.sqrt(freq * (1.0 - freq) * dt)
Out[7]: -4.999999918875795e-17

Perhaps this should return max(value, 0.0) ?

}

static int
genic_selection_generate_trajectory(sweep_t *self, msp_t *simulator,
size_t *ret_num_steps, double **ret_time, double **ret_allele_frequency)
Expand Down Expand Up @@ -7067,9 +7077,13 @@ genic_selection_generate_trajectory(sweep_t *self, msp_t *simulator,
}
pop_size = get_population_size(&simulator->populations[0], sim_time);
alpha = 2 * pop_size * trajectory.s;
x = 1.0
- genic_selection_stochastic_forwards(
trajectory.dt, 1.0 - x, alpha, gsl_rng_uniform(rng));
if (alpha > 0) {
x = 1.0
- genic_selection_stochastic_forwards(
trajectory.dt, 1.0 - x, alpha, gsl_rng_uniform(rng));
} else {
x = neutral_stochastic_backwards(trajectory.dt, x, gsl_rng_uniform(rng));
}
/* need our recored traj to stay in bounds */
t += trajectory.dt;
sim_time += trajectory.dt * pop_size * simulator->ploidy;
Expand Down Expand Up @@ -7325,7 +7339,7 @@ msp_set_simulation_model_sweep_genic_selection(msp_t *self, double position,
ret = MSP_ERR_BAD_TIME_DELTA;
goto out;
}
if (s <= 0) {
if (s < 0) {
ret = MSP_ERR_BAD_SWEEP_GENIC_SELECTION_S;
goto out;
}
Expand Down
11 changes: 11 additions & 0 deletions lib/tests/test_sweeps.c
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,16 @@ test_genic_selection_trajectory(void)
verify_simple_genic_selection_trajectory(0.1, 0.4, 50, 0.0001);
}

static void
test_neutral_fixation_trajectory(void)
{
verify_simple_genic_selection_trajectory(0.1, 0.9, 0, 0.0001);
verify_simple_genic_selection_trajectory(0.1, 0.9, 0, 0.000001);
verify_simple_genic_selection_trajectory(0.8, 0.9, 0, 0.000002);
verify_simple_genic_selection_trajectory(0.1, 0.7, 0, 0.000001);
verify_simple_genic_selection_trajectory(0.1, 0.4, 0, 0.0001);
}

static void
test_sweep_genic_selection_bad_parameters(void)
{
Expand Down Expand Up @@ -383,6 +393,7 @@ main(int argc, char **argv)
{
CU_TestInfo tests[] = {
{ "test_genic_selection_trajectory", test_genic_selection_trajectory },
{ "test_neutral_fixation_trajectory", test_neutral_fixation_trajectory },
{ "test_sweep_genic_selection_bad_parameters",
test_sweep_genic_selection_bad_parameters },
{ "test_sweep_genic_selection_events", test_sweep_genic_selection_events },
Expand Down
1 change: 1 addition & 0 deletions msprime/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,7 @@
SmcPrimeApproxCoalescent,
StandardCoalescent,
SweepGenicSelection,
NeutralFixation,
WrightFisherPedigree,
)

Expand Down
24 changes: 22 additions & 2 deletions msprime/_msprimemodule.c
Original file line number Diff line number Diff line change
Expand Up @@ -1077,9 +1077,10 @@ Simulator_parse_simulation_model(Simulator *self, PyObject *py_model)
PyObject *dirac_s = NULL;
PyObject *beta_s = NULL;
PyObject *sweep_genic_selection_s = NULL;
PyObject *neutral_fixation_s = NULL;
PyObject *value;
int is_hudson, is_dtwf, is_smc, is_smc_prime, is_dirac, is_beta, is_sweep_genic_selection;
int is_wf_ped;
int is_wf_ped, is_neutral_fixation;
double psi, c, alpha, truncation_point;

hudson_s = Py_BuildValue("s", "hudson");
Expand Down Expand Up @@ -1114,6 +1115,10 @@ Simulator_parse_simulation_model(Simulator *self, PyObject *py_model)
if (sweep_genic_selection_s == NULL) {
goto out;
}
neutral_fixation_s = Py_BuildValue("s", "neutral_fixation");
if (neutral_fixation_s == NULL) {
goto out;
}

py_name = get_dict_value(py_model, "name");
if (py_name == NULL) {
Expand Down Expand Up @@ -1219,9 +1224,23 @@ Simulator_parse_simulation_model(Simulator *self, PyObject *py_model)
goto out;
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't comment on the CPython bits -- need @jeromekelleher for that.

is_neutral_fixation = PyObject_RichCompareBool(py_name,
neutral_fixation_s, Py_EQ);
if (is_neutral_fixation == -1) {
goto out;
}
if (is_neutral_fixation) {
/* currently using all the machinery from the genic model */
ret = Simulator_parse_sweep_genic_selection_model(self, py_model);
if (ret != 0) {
goto out;
}
}

if (! (is_hudson || is_dtwf || is_smc || is_smc_prime || is_dirac
|| is_beta || is_sweep_genic_selection || is_wf_ped)) {
|| is_beta || is_sweep_genic_selection || is_neutral_fixation
|| is_wf_ped)) {
PyErr_SetString(PyExc_ValueError, "Unknown simulation model");
goto out;
}
Expand All @@ -1239,6 +1258,7 @@ Simulator_parse_simulation_model(Simulator *self, PyObject *py_model)
Py_XDECREF(beta_s);
Py_XDECREF(dirac_s);
Py_XDECREF(sweep_genic_selection_s);
Py_XDECREF(neutral_fixation_s);
return ret;
}

Expand Down
64 changes: 63 additions & 1 deletion msprime/ancestry.py
Original file line number Diff line number Diff line change
Expand Up @@ -1314,7 +1314,9 @@ def _choose_num_labels(self, models):
"""
num_labels = 1
for model in models:
if isinstance(model, SweepGenicSelection):
if isinstance(model, SweepGenicSelection) or isinstance(
model, NeutralFixation
):
num_labels = 2
return num_labels

Expand Down Expand Up @@ -1922,3 +1924,63 @@ def __init__(
self.end_frequency = end_frequency
self.s = s
self.dt = dt


@dataclasses.dataclass
class NeutralFixation(ParametricAncestryModel):
"""
A fixation of a neutral mutation has occured in the history of the sample.
This will lead to a weak reduction in polymorphism near the fixed site.

The model is one of a structured coalescent where selective backgrounds are
defined as in
`Braverman et al. (1995) <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1206652/>`_
The implementation details here follow closely those in discoal
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the scaling identical to Schrider and Kern? The sweep model wasn't, so any differences here also need to be documented.

`(Kern and Schrider, 2016)
<https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5167068/>`_

See :ref:`sec_ancestry_models_neutral_fixations` for examples and
details on how to specify different types of sweeps.

.. warning::
Currently models with more than one population and a sweep
are not implemented. Population size changes during the sweep
are not yet possible in msprime.

:param float position: the location of the fixation along the
chromosome.
:param float start_frequency: population frequency of the neutral
allele at the start of the sojourn that we will follow. E.g., for a *de novo*
allele in a diploid population of size N, start frequency would be
:math:`1/2N`.
:param float end_frequency: population frequency of neutral allele
allele at the end of its sojourn.
:param float dt: dt is the small increment of time for stepping through
the sweep phase of the model. a good rule of thumb is for this to be
approximately :math:`1/40N` or smaller.
"""

name = "neutral_fixation"

position: float | None
start_frequency: float | None
end_frequency: float | None
dt: float | None
s: float | 0

# We have to define an __init__ to enfore keyword-only behaviour
def __init__(
self,
*,
duration=None,
position=None,
start_frequency=None,
end_frequency=None,
dt=None,
):
self.duration = duration
self.position = position
self.start_frequency = start_frequency
self.end_frequency = end_frequency
self.s = 0
self.dt = dt
Loading