popSimManu.tex

\documentclass[12pt,halfline,a4paper]{ouparticle}
\usepackage{amsmath,amssymb}
\usepackage[round]{natbib}
\usepackage{graphicx,rotating}
\usepackage{dsfont}
\usepackage{xspace}

\newcommand{\Est}[1]{\hat{#1}}

\newcommand{\beginsupplement}{%
        \setcounter{table}{0}
        \renewcommand{\thetable}{S\arabic{table}}%
        \setcounter{figure}{0}
        \renewcommand{\thefigure}{S\arabic{figure}}%
     }
\newcommand{\stopsupplement}{%
        \setcounter{table}{0}
        \renewcommand{\thetable}{\arabic{table}}%
        \setcounter{figure}{0}
        \renewcommand{\thefigure}{\arabic{figure}}%
     }

% method names

\newcommand{\stdpopsim}{\texttt{stdpopsim}\xspace}
\newcommand{\dadi}{$\partial a \partial i$\xspace}
\newcommand{\MSMC}{\texttt{MSMC}\xspace}
\newcommand{\smcpp}{\texttt{smc++}\xspace}
\newcommand{\stairwayplot}{\texttt{stairwayplot}\xspace}
\newcommand{\fastsimcoal}{\texttt{fastsimcoal}\xspace}

\newcommand{\adk}[1]{\textcolor{red}{ADK: #1}}

\usepackage[hidelinks]{hyperref}

\begin{document}

\title{A community-maintained standard library of population genetic simulations}

\author{%
\name{The PopSim Consortium}
\address{}
\email{}}

% JK: didn't bother with the \texttt stuff for stdpopsim in the abstract as it's nearly
% always dropped in production anyway.

\abstract{The explosion in population genomic data demands ever more complex
modes of analysis, and increasingly these analyses depend on sophisticated simulations.
Recent advances in population genetic simulation have made it possible to
simulate large and complex models, but specifying such models for a particular
simulation engine remains a difficult and error-prone task.
Computational genetics researchers currently re-implement simulation models
independently, leading to duplication of effort and the possibility for error.
Population genetics, as a field, also lacks standard benchmarks by which new tools for inference might
be measured. Here we describe a new resource, \stdpopsim, that attempts to rectify this situation.
\stdpopsim is a community-driven, open source project, which provides a standard
catalog of published simulation models from a wide range of organisms,
and supports multiple simulation engine backends.
% PopSim assures correctness in our implementations through a mature, community-based quality control pipeline.
We share some examples demonstrating how \stdpopsim
can be used to systematically compare demographic inference methods,
and encourange an even broader
community of developers to join in the effort to standardize
population genetic methods.}
\date{\today}

\keywords{Population genetics, Simulation, Inference}

\maketitle


\section*{Introduction}
While population genetics has always used statistical methods to make inferences from data,
the degree of sophistication of the questions, models, data, and computational approaches
used have all increased over the past decade. Currently there exist myriad computational methods
that can infer the histories of populations, the distribution of fitness effects,
recombination rates, the extent of positive selection in genome sequence data
\citep{li2011inference,schiffels2014inferring,terhorst2017robust,eyre2009estimating,kim2017inference,
chan2012genome,lin2013fast,Adrion662247,alachiotis2012omegaplus,degiorgio2016sweepfinder2,
kern2018diplos,sugden2018localization}.
While en masse these methods have increased our understanding of the
impacts of genetic and evolutionary processes, very little has been done to systematically
benchmark the quality of inferences gleaned from computational population genetics.
As large databases of population genetic variation begin to be used to inform public health procedures,
the accuracy and quality of these inferences is becoming even more important.
This is in large part because ``ground-truth'' in population genetics comes from simulation,
not from empirical observations, and as a field there has been no agreement as to what should
constitute community standards or best practices. The general modus operandi to date has been to
introduce methods which are validated by a set of bespoke simulations produced by individual
groups. Beyond a lack of efficiency from duplicated effort, such a situation leads
to decreases in reproducibility and transparency across the entire field.
Here we describe \stdpopsim, a community-driven catalog
of standardized population genetics models spanning a variety of common study organisms.
This resource will empower the field to rigorously evaluate population genetic software
and select the best tool for a given task, thereby helping to usher in the next era of population genetic data analysis.
%DRS: still not in love with this last sentence here, but tried to make it a little less sappy by adding some explanation.
%ADK: made some edits here

The degree to which modeling assumptions and different choices of summaries of the
data can affect the ultimate results has been challenging to assess. Yet, there
have been several clear examples of different methods yielding fundamentally
different conclusions. For example, \MSMC methods applied to human genomes have
% JK is this right? > 100 years doesn't sound very ancient to me.
suggested large ancient ($>100$ years ago) ancestral population sizes and
bottlenecks that have not been detected by SFS-based methods
\citep[see][]{beichman2017comparison}.
These distinct methods differ in how they model, summarize, and optimize fit to
genetic variation data, suggesting that such design choices can greatly affect the
performance of the inference. Furthermore, it is reasonable
to assume generally that specific methods will perform better than others under certain scenarios.
However, because these conditions are typically unknown, empirical researchers
lack principled guidelines for deciding which statistical method is best suited
to accurately answer their particular question. The need for empirical
guidance will only increase as researchers continue to generate more genomic
variation data from non-model taxa and wish to make inferences regarding
demography and selection.

One way to benchmark different methods for statistical inference in population
genetics is to apply the methods to simulated datasets where the true model
parameters are known. Methods that perform well will accurately infer these
parameters and will also provide reasonable estimates of uncertainty of the
estimated parameters. Indeed, simulations studies are a standard ingredient in
most papers proposing new methods in population genetics. However, these
studies often use simulations of the particular models for which the method was designed,
which showcases the novel method rather than rigorously comparing performance to competing methods.
Of course, testing a new method in the situation for which it was designed is an important step,
but more rigorous testing is necessary for tools that are to be widely used in the field.
% Moreover researchers are generally forced to reimplement previously
% inferred scenarios independently, which makes good benchmarking
% a major additional time investment, and can lead to errors.
% Lastly, the degree to which different methods are compared can be quite variable.
Many more researchers in the broader field
use simulations to perform power and other exploratory analyses.
The many steps of implementing a moderately complicated population genetics simulation,
including translating published demographic models into input for a simulator
and obtaining genetic maps,
present a major obstacle, especially to new researchers.

% One reason why simulation studies do not compare the performance of many methods
% across a wide range of situations
% is that simulation studies are challenging to carry out, for several reasons.
% First, while there are many published population genetic models available, their
% parameters are often not easily accessible, usually presented in supplemental
% tables. Translating these parameters into input for a given simulation engine can
% be a time-consuming and error prone process. In addition, papers often present many
% different models, and different authors may choose to select distinct models,
% making comparison difficult. While improvements in simulation software and ever
% faster computers have partially ameliorated this concern, generating the
% simulated datasets can still be a time and computing resource bottleneck.

For these reasons, we have generated a standardized, community-driven resource
for simulating published demographic models from a number of popular study systems.
This resource, which we call \texttt{stdpopsim}, makes running
realistic simulations for population genetic analysis a simple matter of
choosing pre-implemented models from a community maintained catalog.
The \stdpopsim catalog currently contains four organisms: humans, 
\emph{D.~melanogaster}, \emph{A.~thaliana}, and \emph{E.~coli}. For each
organism the catalog contains details on the physical organization (e.g., chromosome structure)
of their genomes, one or more genetic maps, default population-level parameters (mutation rate,
generation time) and one or more published demographic histories. Through
either a command line interface or a simple Python API, the user can specify which
organism, genetic map, chromosome, and demographic history they are interested in simulating, and are returned
simulation output from their chosen model.

The software has been developed by the PopSim Consortium using a
distributed open source model, with strong procedures in place
to continue its growth and maintain its accuracy.
Importantly, we have rigorous quality control methods to ensure implemented models are accurate,
and have documented methods for others to contribute new modules.
We welcome new collaborators.
Below we describe the resource and give an
example of how it can be used to benchmark demographic inference methods.

\begin{figure}[t]
\begin{center}
\includegraphics[width=0.5\linewidth]{display_items/Figure1.pdf}
\caption{\textbf{Structure of \stdpopsim}. \textbf{A.} The
hierarchical organization of the PopSim catalog contains all information
within individual species. Information within the species level is shown here
for \emph{Homo sapiens} only. We see that \emph{H. sapiens} is associated
with a default generation time and mutation rate, that it contains within
it a representation of the physical genome, genetic maps, and as shown a
subset of the demographic models currently implemented. \textbf{B.} Caching
and automated download of genetic maps. \textbf{C.} Example code to specify
and simulate models using \stdpopsim via the API (top) or
command-line interface (bottom). \textbf{D.} Simulation output from
\stdpopsim is generated as tree sequence files that can
be manipulated and analysed with \texttt{tskit}.}
\label{fig:cartoon}
\end{center}
\end{figure}

\section*{Results}
\paragraph{The \stdpopsim library}
The central contribution of the PopSim consortium currently is \stdpopsim, a community
maintained library of empirical genome data and population genetics simulation
models, providing easy access to realistic simulations. Figure \ref{fig:cartoon} shows a graphical
representation of the structure of \stdpopsim. The package centers
on a catalog of species (Fig. \ref{fig:cartoon}A), initially consisting of humans, \emph{Drosophila},
\emph{Arabidopsis}, \emph{E. coli}, and \emph{Pongo}. A species definition consists
of two key elements.  Firstly, the library defines
some basic information about its genome, including information about chromosome
lengths, average mutation rates and so on. We also provide access to detailed
empirical information such as recombination maps, which model empirically
observed heterogeneity along chromosomes. As such maps are often large and hard
to find, we provide a mechanism to automatically download (and cache locally) a
set of previously published maps (Fig. \ref{fig:cartoon}B). For example, in humans we support the
HapMapII~\citep{international2007second} and
deCODE~\citep{kong2010fine} genetic maps. The second key element of a species description
within \stdpopsim is a set of carefully curated population genetic model
descriptions from the literature, which are the basis for accurate simulations
of specific historical scenarios that have influenced present-day patterns of
genetic polymorphism.

Given the genome data and simulation model descriptions defined within the
library, it is then straightforward to run accurate, standardized simulations
across a range of organisms. \stdpopsim has a Python API and a user-friendly
command line interface (Fig. \ref{fig:cartoon}C), allowing users with minimal experience direct access to
state-of-the-art simulations. Simulations are output in the “tree sequence”
format \citep{kelleher2016efficient,kelleher2018efficient,kelleher2019inferring}, which
contains complete genealogical information about the simulated samples, is
extremely compact, and can be processed very efficiently using the tskit library
\citep{kelleher2016efficient,kelleher2018efficient}. Currently,
\stdpopsim uses the  \texttt{msprime} coalescent simulator \citep{kelleher2016efficient}
as the default simulation engine. We plan to include other simulation
engines to support scenarios that cannot be modelled under the coalescent framework,
and have already implemented \texttt{SLiM} \citep{haller2019slim} as
an alternative back-end.

The \stdpopsim command line interface, by default, outputs citation information
for the models, genetic maps and simulation engines used in any particular run.
We hope that this will encourage users to appropriately acknowledge the
resources used in published work, and authors
of simulation models to contribute to our ongoing community-driven development process.

\subsection*{The \stdpopsim Catalog}
The current contents of the \stdpopsim catalog are shown in \autoref{tab:catalog}. At the time of
writing the PopSim Consortium has implemented previously published demographic models for
humans, \emph{Drosophila}, and \emph{Arabidopsis}. These range from
simple, single population histories \cite[e.g.,][]{sheehan2016deep},
to complex models which include population splitting, migration, and archaic
admixture \cite[e.g.,][]{ragsdale2019models}.

\begin{table}[t]
\begin{footnotesize}
\input{model_table.tex}
\end{footnotesize}
\caption{\label{tab:catalog}
Details of the initial set of population models across four species.
\textbf{NOTES: Column names are ID, name (or description maybe?),
reference, CPU time, RAM}
}
\end{table}

At the time of writing, more models of human history have been implemented than for
any other organism. These models include: a simplified version of the \cite{tennessen2012evolution}
model with only the African population modeled (expansion from the ancestral
population and recent growth), the three population model of \cite{gutenkunst2009inferring}
which modeled the out-of-Africa bottleneck as well as the subsequent divergence of
the European and Asian populations, the \cite{tennessen2012evolution} two population variant of the
Gutenkunst model which did not include Asian populations, but more explicitly modeled
recent rapid human population growth, the \cite{browning2018ancestry} admixture model
for American populations that models ancestral African, European, and Asian population
components, and the three-population out-of-Africa model from \cite{ragsdale2019models} that includes archaic admixture
. Together these models
contain features believed to have widespread impacts in real data (e.g. bottlenecks, population growth,
admixture) which are therefore highly pertinent in the context of method development.

For non-human genomes we have have implemented two demographic histories for
\emph{Drosophila melanogaster} and one each from \emph{A. thaliana} and \emph{E. coli}.
For \emph{D. melanogaster} we have implemented the three-epoch model estimated by \cite{sheehan2016deep} from
an African sample, as well as the out-of-Africa divergence
and associated bottleneck model of \cite{li2006inferring} which jointly models African
and European populations. For \emph{Arabidopsis thaliana}, we implemented the
model from in \cite{durvasula2017african} inferred using \MSMC. This model includes
a continuous change in population size over time, rather than pre-specified epochs of different
population sizes.


\subsection*{Use case: comparing methods of demographic inference}
As an example of the utility of \stdpopsim, we demonstrate how it can be used 
to simply and fairly compare popular demographic inference methods.
Although we present comparison of results from several
competing methods, our aim at this stage is not to provide an exhaustive
evaluation or ranking of these methods. Our hope is instead that future work built upon this resource
will enable more detailed exploration of the strengths and weaknesses of the numerous
inference methods that are available to the population genetics community.

We start by comparing popular methods for estimating
population size histories ($N(t)$) of single populations and subsequently
show simple examples of multi-population inference.
To evaluate and compare the performance of inference methods we developed
\texttt{snakemake} \citep{koster2012snakemake} workflows that allow reproducibility
in comparing inference methods which are available from \url{https://github.com/popgensims/analysis}
. The workflows allow efficient computing in multicore environments or on
cluster environments.

For single-population population size histories, we compared
\MSMC~\citep{schiffels2014inferring}, \smcpp~\citep{terhorst2017robust}, and
\stairwayplot~\citep{liu2015exploring}
 on simulated genomes sampled from a single population,
in a number of the \stdpopsim models described above. The pipeline was generalized to
run $R$ replicates with $C$ chromosomes for $n$ samples, for a total of $R \times C$
simulations for each demographic model. After simulation, the pipeline prepares
input files for each of the respective inference methods by grouping all
chromosomes, for each sample, into a single file. For each of the $R$ simulation replicates, this step results in an
input file for each of the
respective inference methods and derived from the same simulated tree sequences.
Each of the inference programs are then run in parallel, followed by plotting of
$N(t)$ estimates from each program.

\begin{figure}
\begin{center}
\includegraphics[width=0.8\linewidth]{display_items/homo_sapiens_mask_Ragsdale.pdf}
\caption{\textbf{Comparing estimates of $N(t)$ in humans}. Here we show estimates of population
size over time ($N(t)$) inferred using 4 different methods: \smcpp, \texttt{stairwayplot},
\MSMC with $n=2$ and $n=8$. Data were generated by simulating
replicate human genomes under the \cite{ragsdale2019models} model and using the
HapMapII genetic map~\citep{international2007second}. From top to bottom we show estimates for each
of the three populations in the model (YRI, CEU, and CHB). In shades of blue we show the estimated
$N(t)$ trajectories for each replicate. In black we show the true population size history as inferred
for the rate of coalescence in the demographic model.}
\label{fig:n_t_ragsdale}
\end{center}
\end{figure}


% JK here --- I think it would really help if we described what the models
% were, briefly, rather than saying who inferred them.
In Figure \ref{fig:n_t_ragsdale} we present results from our $N(t)$ analysis pipeline
run on whole genome simulations under a sophisticated demographic model
and empirical genetic map. In each column of this figure
we show inferred $N(t)$ for the three extant populations in the model.
In each row we show comparisons among the methods (note that we compare two differing
sample sizes for \MSMC). The solid black lines represent the true effective population
size as inferred from the inverse coalescent rate defined by the model.
Blue lines represent estimates from each of three replicate whole genome simulations.
While there is variation in accuracy among methods, populations, and individual replicates,
generally spoken the methods are accurate for this model of human history.

\stdpopsim allows us to readily compare these estimates to those drawn from a different
model of human history. In Figure \ref{fig:n_t_gutenkunst} we show estimates of
$N(t)$ from simulations using the same physical and genetic maps, but a different demographic
history -- that inferred by \cite{gutenkunst2009inferring}. Again we see that each
of the methods is capturing relevant parts of the population history, although the
discrepancy between inferred and simulated values is changes over time. In comparing inferences between the
models it is interesting to note that $N(t)$ estimates for the CHB and CEU
simulated populations are generally better across methods than estimates from the YRI
simulated population.

We can even do away with the demographic model entirely, and simulate under a human
genome architecture and genetic map to see how well methods might to at recovering
the population history of a constant sized population. We show results of such an
experiment in Figure \ref{fig:n_t_HomSap_constant}. Again we see that all methods
are recovering a modern population size within a factor of two of the truth, however
SMC-based methods, perhaps due to their regularization, tend to learn sinusoidal
patterns of population size even though no change is present.


\begin{figure}
\begin{center}
\includegraphics[width=0.8\linewidth]{display_items/d_mel_Sheehan_mask2.pdf}
\caption{\textbf{Comparing estimates of $N(t)$ in \emph{Drosophila}}. Population
size over time ($N(t)$) estimated from an African population sample. Data were generated by simulating
replicate \emph{D.melanogaster} genomes under the three-epoch \cite{sheehan2016deep} model
with the genetic map inferred in \cite{comeron2012many}. In shades of blue we show the estimated
$N(t)$ trajectories for each replicate. In black we show the true population size history as inferred
for the rate of coalescence in the demographic model.}
\label{fig:n_t_sheehan}
\end{center}
\end{figure}

As most population genetic methods development has been focused on human-scale
data, it is of consequence to ask how such methods might perform in non-human
genomes. Figure \ref{fig:n_t_sheehan} shows parameter estimates from a demographic
model estimated from an African sample of \emph{Drosophila melanogaster} \citep{sheehan2016deep}.
As with humans, we use \stdpopsim to simulate replicate \emph{Drosophila} genomes under
an empirically derived genetic map, and then try to infer back the parameters of model.
Accuracy is mixed among methods in this setting, and generally worse than what we
observe under simulations of the human genome.


\begin{figure}
\begin{center}
\includegraphics[width=0.7\linewidth]{display_items/homo_sapiens_two_popn_comp.png}
\caption{\textbf{A comparison two-population parameter estimates in humans}. Here we show estimates of $N(t)$ inferred using \dadi, \texttt{fastsimcoal2}, or \smcpp.
A) Data were generated by simulating
replicate human genomes under the \cite{gutenkunst2009inferring} model and using the genetic map
inferred in \cite{international2007second}. B) For \dadi and \texttt{fastsimcoal2} we show inferred
parameters of the depicted IM model, which include population sizes, migration rates, and a split
time between CEU and YRI samples. C) Population size estimates for each population (rows)
from \dadi, \texttt{fastsimcoal2}, and \smcpp (columns). In shades of blue we show the estimated
$N(t)$ trajectories for each replicate. In black we show the true population size history as inferred
for the rate of coalescence in the demographic model. The population split date, $T_{div}$, is shown at
the bottom, with a common X-axis to the background figure.}
\label{fig:IM_popn_human}
\end{center}
\end{figure}

\paragraph*{Multi Population demographic models}
As \texttt{stdpopsim} implements multi-population demographic models, we also
explored parameter estimation of population divergence parameters. In particular
we performed simulations under multi-population models for humans and \emph{Drosophila}
and inferred simulated parameters using \dadi, \fastsimcoal, and \smcpp.
For simplicity, we conducted inference in \dadi and \fastsimcoal using an IM model
with constant population sizes and bi-directional migration. For human
models with more than two-populations, e.g. \cite{gutenkunst2009inferring},
this means that we are inferring parameters from a model which itself does
not match the model from which the data were generated (Figure
\ref{fig:IM_popn_human}A and B), however this is common practice in the field
when applying IM-type models.

In Figure \ref{fig:IM_popn_human}C we show estimates of population sizes and divergence
time from each of the methods for samples simulated for the African and European populations
under the \cite{gutenkunst2009inferring} model. Each of the methods compared has stronger
and weaker points to their accuracy across parameters. For instance the simple, IM-style
analyses we have performed do not capture recent population size change in CEU, however
they do a good job of estimating the simpler simulated YRI population size history from this model.

Again we can compare between species and look at the performance of these methods in
the context of a two-population model from \emph{Drosophila melanogaster}. Figure
\ref{fig:two_popn_fly} shows parameter estimates from simulations drawn from
the model from \cite{li2006inferring} which includes
an ancestral population in Africa, then a population expansion with a population
split and bottleneck into a European population. These simulations are generated
using the recombination map of \cite{comeron2012many}. Here methods differ dramatically
in the estimates, with \smcpp yielding quite noisy estimates of population size
towards the more ancient past and all methods failing to capture the very brief
population bottleneck in Europe, although the IM-model of course only allows for
two epochs in the population size history.

Again the goal of these analyses are not to perform thorough benchmarking of
the methods at hand under a wide range of simulations, but instead to show how
the tools that we have built could be used to facilitate such comparisons.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Discussion}

Here we have described what we view as the first product from the PopSim Consortium:
the \stdpopsim library. We have founded the Consortium with a number of specific aims
in mind: standardization of simulation within the computational genetics community,
increasing reproducibility of simulation based results, community-based development and
decision making in guiding best practices in population genetics, and eventual benchmarking
of inferential methods.

The \stdpopsim library is the first piece of this effort, in that it allows for rigorous
standardization of complex population genetic simulations. Population genetics, as a field,
has yet to coalesce around a set of gold-standards for the crucial task of method
evaluation, which in our field hinges on simulation. In contrast, other fields such as
structural biology (\citep{moult1995large}) and machine learning (\citep{russakovsky2015imagenet}) have a long track record
of standardized method testing. We hope that our efforts represent the beginning of what
will prove to be an equally longstanding and invaluable tradition in population genetics.

Moreover individual research groups are often in the position of having to independently
reimplement complex, previously published models. \stdpopsim offers solutions to both of
these issues by creating a quality controlled, easy-to-use, gateway to models and organisms
that are central to modern research in genetics.

We have built \stdpopsim, and the PopSim Consortium, to be an open, community-developed
project, meaning that it has excellent potential for future growth.
Thus far we have implemented genome representations and genetic maps for the three most
common study systems in computational genetics: humans, \emph{Drosophila}, and \emph{Arabidopsis}.
The standard operating procedures that we have outlined in our detailed documentation
means that new researchers, who are interested in adding their favorite organism/genome
are free to with very few barriers to entry (see Methods). Further new demographic models can be added
to each of the organisms in the catalog through similarly well documented steps.
A long-term goal then is that through a mechanism such as \stdpopsim, the community
can determine (in an open way) what is the correct implementation for a given model,
and in turn, what are the models that the community is most interested in.

We have shown in the results above how \stdpopsim can be used for direct comparisons
of inferential methods on a common set of simulations. While we have not made a concerted
effort at benchmarking methods yet, we believe that this would be an important goal for
the PopSim Consortium and population genetics more broadly. The analysis workflows that
we have released as part of \stdpopsim should provide a cookbook for the community to
dive in to more rigorous benchmarking than we have done thus far.

In the use case comparisons we have performed,
we found that the methods tested \emph{do} generally perform well on human genomes,
up to, for instance, the usual factor-of-two differences in inferred population sizes.
This will not be surprising to people familiar with the landscape of
demographic inference methods in population genetics,
but the nature and scale of uncertainty may be not clear to people new to the field.
The ability for \stdpopsim to generate data with and without significant features, such
as the genetic map or population size change (e.g., Figure \ref{fig:n_t_HomSap_constant}), allows
us to ask directly about the various failure modes of popular methods.
Moreover the comparison of methods across the various genome organizations, genetic maps,
and demographic histories of different organisms provides valuable information
about how methods might perform on non-human systems.
In our initial comparison we find that inference of population histories from \emph{Drosophila}
or \emph{Arabidopsis} simulated genomes is a fair bit more difficult from SMC based methods
than in the human genome. Likely this is at least partly a consequence of the reduced physical and genetic
map lengths of these genomes, although we will leave further investigation of this to the future.
Finally, comparison of results across methods or simulation runs
gives a strong estimate of inference uncertainty, analogously to parametric bootstrapping --
although different methods have different strengths and weaknesses,
strong conclusions should probably not rest entirely on features inferred by only one method.

The broad brush comparisons we have done, while far from exhaustive, are informative,
and highlight the utility of easy comparison of methods on the same footing.
The tests here were under best-case scenarios --
simulations of discrete, randomly mating populations that fit the assumptions of the methods nearly exactly --
and yet different methods, as well as applications of the same method
to different simulations of the same model,
produced somewhat different results. Future efforts will extend the resource described here to
encompass additional population genetic tasks, such as detecting natural selection, inferring
recombination rates, etc, in hopes of producing a general-purpose resource for fair and
standardized evaluation of the ever-increasing arsenal of computational population genetic
inference tools.

%\adk{should we mention benchmarking of tree sequence inference here? what else?}

% \paragraph{The problem of comparison} % Moved from the intro
% Benchmarking the performance of inference methods in population genetics is not straightforward.
% Different methods often fit different types of model
% or use different summaries of the genetic variation data, even when
% attempting to address the same biological question. For example, there are
% numerous methods to infer population size changes.
% The \MSMC method \citep{schiffels2014inferring} allows for the
% population to continuously change in size over time. Other methods, such as
% \dadi \citep{gutenkunst2009inferring}, allow for a smaller number of size changes to occur at certain
% points throughout the population's history. The types of data that the
% methods use also differs. While \MSMC uses one to four whole genome sequences and leverages the
% spatial patterns of genetic variation along the sequence, \dadi  instead uses the
% site frequency spectrum (SFS), which summarizes the frequencies of SNPs in the
% sample, ignoring any information about the correlation structure (linkage
% disequilibrium) among SNPs. Thus from the start these methods may not be
% expected to perform equally well in all scenarios,
% and indeed, it is not always clear how to compare to the truth --
% for instance, how well can a method of piecewise constant population sizes
% describe a period of exponential growth?
% Or, how should discrete population models be mapped onto real-world (continuous) geography?
% The answer depends on determining what is most important for the question at hand.

\section*{Methods}
\subsection*{Generating the PopSim resource}

\subsection*{Model QC procedures}
As a consortium we have agreed to a standardized proceedure for model inclusion
into \stdpopsim that allows for rigorous quality control. Imagine Developer A
wants to introduce a new model into \stdpopsim. Developer A implements the
demographic model for the relevant organism along with clear documentation
of the model parameters and populations. This model is submitted as a Pull Request,
where it is evaluated by a reviewer and then included as `preliminary',
but is not linked to the online documentation nor the command line interface.
Inclusion of a preliminary model triggers a QC issue, and a second developer,
Developer B, then independently reimplements the model from the relevant
primary sources and adds an automatic unit test for equality between the
QC implementation and the preliminary production model. If the two
implementations are equal the original model is included in \stdpopsim.
If not, we move to an arbitration process whereby the A and B first try
to work out the details of what went wrong. If that fails, the original
authors of the published model must be contacted
to resolve ambiguities. This process can lead to cycles of revision of
the implementation from Developers A and B but allows the project to assure
quality. Indeed we have already had a number of interesting cases of arbitration
to this point. Further details of our QC process can be found in our
\href{https://stdpopsim.readthedocs.io/en/latest/development.html#}{developer documentation}.

\subsection*{Workflow for analysis of simulated data}
To demonstrate the utility of \stdpopsim we created \texttt{Snakemake}
workflows \citep{koster2012snakemake} that perform demographic inference on
tree sequence output from our package using a few common software packages.
Our choice of \texttt{Snakemake} allows complete reproducibility of the
analyses shown and the entire Analysis repository is available from
\url{https://github.com/popgensims/analysis}.

We performed demographic analysis for one-and two-population tasks.
Our one population task was to infer historical population size over
time ($N(t)$). This was done using three software packages \stairwayplot
which uses site frequency spectrum information only \citep{liu2015exploring},
the SMC based \MSMC~\citep{schiffels2014inferring} which we ran with two different
sample sizes $n\in (2 , 8)$, and \smcpp~\citep{terhorst2017robust}
which combines information from the site frequency spectrum with
recombination information as in SMC-based methods. No attempt
was made at trying to optimize the analysis from
any particular software package in present case,
as our goal was not to benchmark performance of methods but
instead show how such benchmarking could be easily done using
the \stdpopsim resource. In this spirit we ran each software package as near
to default parameters as possibile in each case. For \stairwayplot we
set the parameters numRuns=1 and dimFactor=5000. For \smcpp we used the
``estimate" run mode to infer $N(t)$ with all other parameters set
to their default values. For \MSMC we used the ``--fixedRecombination"
option and used the default number of iterations.

For the single population task we ran human (HomSap) simulations
using a variety of models: OutOfAfricaArchaicAdmixture\_5R19, OutOfAfrica\_3G09,
a constant-sized generic model, and a two-epoch generic model where the
population size instantaneously decreases from $N=10^4$ to $N=10^3$ 500
generations before the present. Each HomSap simulation was run
using the HapmapII\_GRCh37 genetic maps. For \emph{D. melanogaster}
we estimated $N(t)$ from an African sample simulated under the DroMel,
African3Epoch\_1S16 model using the Comeron2012\_dm6 map. Finally we ran
simulations of \emph{A. thaliana} genomes using the AraTha African2Epoch\_1H18
model under the Salome2012\_TAIR7 map. For each model, 3 replicate whole genomes were
simulated and the population size estimated from those data. In all cases we
set the sample size of the focal population to $N=50$ chromosomes.

Following simulation, low-recombination portions of chromosomes were masked
from the analysis in a manner that reflects the ``accessible" subset of sites
used in empirical population genomic studies \citep[e.g.,][]{danecek20111000,langley2012genomic}.
Specifically we masked regions of 1cM or greater in the lowest 5th percentile of the empirical
distribution of recombination, regions which are nearly uniformly absent for
empirical analysis.

Our goal for the two-population analysis pipeline was to explore two-population inference
on some of the multi-population demographic models implemented in \stdpopsim.
For HomSap we examined the OutOfAfrica\_3G09 model with the HapmapII\_GRCh37 genetic map,
and for DroMel we simulated under the OutOfAfrica\_2L06 model with the Comeron2012\_dm6 map.
The HomSap model is a three population model (Africa, Europe, and Asia) including post-divergence
migration and exponential growth (Figure \ref{fig:IM_popn_human}C), whereas the
DroMel model is a two population model (Africa and Europe) with no post-divergence
migration and constant population sizes (Figure \ref{fig:two_popn_fly}).

To conduct inference on these models, we used three commonly used methods:
\dadi~\citep{gutenkunst2009inferring}, \fastsimcoal2~\citep{excoffier2013robust},
and \smcpp~\citep{terhorst2017robust}. As above, these methods were used
generally with default settings and we did not attempt to optimize their performance or fit
parameter-rich demographic models. As with the single population inference pipeline,
we developed a \texttt{snakemake}~\citep{koster2012snakemake} workflow to manage
our inference pipeline and maintain reproducibility.

For both \dadi and \fastsimcoal2, we fit a two population
isolation-with-migration (IM) model with constant population sizes to the simulated
data from the HomSap and DroMel models. This IM model contains six parameters:
the ancestral population size, the size of population 1 after the split, the size of
population 2 after the split, the divergence time, and two migration rate parameters.
Importantly, this meant that for both species, the
fitted model did not match the simulated model (Figure \ref{fig:IM_popn_human}C and Sx).
In the HomSap case, we therefore performed inference solely on the Africa
and Europe populations, meaning that the Asia population functioned as a ghost
population that was ignored by our inference. Our motivation for fitting this simple
IM model was to mimic the typical approach of two population inference on empirical
data, where the user is not aware of the `true' underlying demography and the inference
model is often misspecified. To ground-truth our inference approach, we also conducted
inference on a generic IM model that was identical to the model used for inference.

From HomSap we ran replicate whole genome simulations under the model
and specified taking 20 samples each from the Europe and Africa populations.
For DroMel, runtimes under OutOfAfrica\_2L06 were prohibitively slow when simulating whole genomes.
For example, both chromosomes 3R and X failed to coalesce within 30 days of starting using $n = 50$.
Runtimes using the identical sample size were 19, 21, and 12 days for chromosomes 2L, 2R, and 3L,
respectively. For this reason, we chose to present only data from chromosome 2R for simulations
under OutOfAfrica\_2L06. For the generic IM simulations, we used the HomSap genome along with the
HapmapII\_GRCh37 genetic map, and sampled 20 individuals from each population.

Following simulation, we outputted tree sequences and masked low-recombination
regions using the same approach described for the single population pipeline above. We
converted tree sequences into a two-dimensional site frequency spectrum for all
chromosomes in the appropriate format for \dadi and \fastsimcoal2. For each simulation
replicate, we performed 10 runs of \dadi and \fastsimcoal2 and checked for convergence.
Detailed settings for \dadi and \fastsimcoal2 can be found in the Snakefile
on the popgensims/analysis GitHub repository (\url{https://github.com/popgensims/analysis}).
The parameter estimates from the top run from each simulation replicate
are shown in Figures \ref{fig:IM_popn_human}C and Sx.

For \smcpp , we converted the tree sequences into vcf format and performed inference
with default settings. Importantly, \smcpp assumes no migration post-divergence, so
here again the inference model does not exactly match the simulated model for the
human model that does include migration. However, because \smcpp allows
for continuous population size changes, it is better equipped to capture many of the
more complex aspects of the simulated demographic models (e.g., exponential growth).

To visualize our results, we plotted the inferred population size trajectories
for each simulation replicate along side the true (census) population sizes
(Figures \ref{fig:IM_popn_human}C Sx). Here, unlike the single population pipeline,
we focus on census size rather than the inverse coalescence rate as the `true' population size.


% \subsection*{Extra text}
% [This section describes the stdpopsim Python library. What it’s for, how it’s used and what it can do]
% [First para: running accurate simulations is hard]
% Running realistic population genetic simulations is a complex task today. It is
% straightforward to run simple simulations of population models such as the
% coalescent or Wright-Fisher using (e.g.) msprime or SLiM, but the process of
% instantiating these models to accurately reflect the properties of natural
% populations that are important for a particular study is fraught with
% difficulties. There are typically three important inputs that must be decided:
% the demographic model, recombination and mutation rates. Demographic models for
% a particular species are typically chosen by examining the literature for
% published model specifications, which can consist of dozens of parameters. These
% parameters must be transcribed from the original publication and translated into
% the input format required for the simulator in question. This is a difficult and
% error prone task, and mistakes are common. Recombination and mutation rates are
% simpler to describe and often a single genome-wide estimate is appropriate.
% However, such estimates are updated frequently and there is wide variation in
% the values used for the simulations of the same species. When more fine-grained
% simulations of the local chromosome structure is required, empirical
% recombination and mutation rate maps are used. However, there is no standard
% format for such maps, and finding the maps can be challenging.


\bibliographystyle{plainnat}
\bibliography{popSim.bib}

\pagebreak
\beginsupplement
\section*{Supplemental Figures}
\begin{figure}
\begin{center}
\includegraphics[width=0.8\linewidth]{display_items/homo_sapiens_mask_Gutenkunst.pdf}
\caption{\textbf{Comparing estimates of $N(t)$ in humans}. Estimates of population
size over time ($N(t)$) inferred using 4 different methods, \smcpp, \stairwayplot,
\MSMC with $n=2$ and $n=8$. Data were generated by simulating
replicate human genomes under the \cite{gutenkunst2009inferring} model and using the genetic map
inferred in \cite{international2007second}. From top to bottom we show estimates for each
of the three populations in the model YRI, CEU, and CHB. In shades of blue we show the estimated
$N(t)$ trajectories for each replicate. In black we show the true population size history as inferred
for the rate of coalescence in the demographic model.}
\label{fig:n_t_gutenkunst}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=0.8\linewidth]{display_items/homo_sapiens_constant.pdf}
\caption{\textbf{Comparing estimates of $N(t)$ in humans}. Here we show estimates of population
size over time ($N(t)$) inferred using 4 different methods, \smcpp, \texttt{stairwayplot},
\MSMC with $n=2$ and $n=8$. Data were generated by simulating
replicate human genomes under a constant sized population model with $N=10^4$ and using the
HapMapII genetic map~\citep{international2007second}. In black we show the true population size history
of the model.}
\label{fig:n_t_HomSap_constant}
\end{center}
\end{figure}


\begin{figure}
\begin{center}
\includegraphics[width=0.8\linewidth]{display_items/d_mel_two_popn_comp.png}
\caption{\textbf{A comparison two-population parameter estimates in \emph{Drosophila}}. Here we show estimates of $N(t)$ inferred using \dadi, \texttt{fastsimcoal2}, or \smcpp.
Data were generated by simulating
replicate human genomes under the \cite{li2006inferring} model and using the genetic map
inferred in \cite{comeron2012many}. See legend of Figure \ref{fig:IM_popn_human} for details.
In shades of blue we show the estimated
$N(t)$ trajectories for each replicate. In black we show the true population size history as inferred
for the rate of coalescence in the demographic model.}
\label{fig:two_popn_fly}
\end{center}
\end{figure}

\stopsupplement

\end{document}