Major feature release
This is a major update to the underlying data structures in msprime to generalise the information that can be modelled, and allow for data from external sources to be efficiently processed. The new Tables API enables efficient interchange of tree sequence data using numpy arrays. Many updates have also been made to the tree sequence API to make it more Pythonic and general. Most changes are backwards compatible, however.
Breaking changes:
- The SparseTree.mutations() and TreeSequence.mutations() iterators no longer support tuple-like access to values. For example, code like for x, u, j in ts.mutations(): print("mutation at position", x, "node = ", u) will no longer work. Code using the old Mutation.position and Mutation.index will still work through deprecated aliases, but new code should access these values through Site.position and Site.id, respectively.
- The TreeSequence.diffs() method no longer works. Please use the TreeSequence.edge_diffs() method instead.
- TreeSequence.get_num_records() no longer works. Any code using this or the records() iterator should be rewritten to work with the edges() iterator and num_edges instead.
- Files stored in the HDF5 format will need to be upgraded using the msp upgrade command.
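The first breaking change can be illustrated without msprime installed. The following is a minimal sketch, not msprime's implementation: Site and Mutation here are hypothetical stand-in classes, and the mutations list on Site is an assumption made for the example. It shows the attribute-based access pattern that replaces tuple unpacking.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Mutation:
    # Stand-in class, not msprime's: 'site' is a site ID, 'node' a node ID.
    site: int
    node: int


@dataclass
class Site:
    # Stand-in class, not msprime's. The 'mutations' list is an assumption.
    id: int
    position: float
    mutations: List[Mutation] = field(default_factory=list)


def describe_mutations(sites):
    """Report each mutation via attribute access rather than tuple unpacking."""
    lines = []
    for site in sites:
        for mutation in site.mutations:
            lines.append(
                "mutation at position {} node = {}".format(
                    site.position, mutation.node
                )
            )
    return lines


sites = [
    Site(id=0, position=0.25, mutations=[Mutation(site=0, node=3)]),
    Site(id=1, position=0.75, mutations=[Mutation(site=1, node=5)]),
]
for line in describe_mutations(sites):
    print(line)
```

Because the position belongs to the site rather than to the mutation, the loop reads it from the enclosing site object. Code that still unpacks each mutation as a tuple would raise a TypeError with stand-ins like these.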
New features:
- The API has been made more Pythonic by replacing (e.g.) tree.get_parent(u) with tree.parent(u), and tree.get_total_branch_length() with tree.total_branch_length. The old forms have been maintained as deprecated aliases. (#64)
- Efficient interchange of tree sequence data using the new Tables API. This consists of classes representing the various tables (e.g. NodeTable) and some utility functions (such as load_tables, sort_tables, etc).
- Support for a much more general class of tree sequence topologies. For example, trees with multiple roots are fully supported.
- Substantially generalised mutation model. Mutations now occur at specific sites, which can be associated with zero to many mutations. Each site has an ancestral state (any character string) and each mutation a derived state (any character string).
- Substantially updated documentation to rigorously define the underlying data model and requirements for imported data.
- The variants() method now returns a list of alleles for each site, and genotypes are indexes into this list. This is both consistent with existing usage and works with the newly generalised mutation model, which allows arbitrary strings of characters as mutational states.
- Added the formal concept of a sample, distinguished from 'leaves'. Changed tracked_leaves, etc. to tracked_samples (#225). Also renamed sample_size to num_samples for consistency (#227).
- The simplify() method returns subsets of a large tree sequence.
- TreeSequence.first() returns the first tree in the sequence.
- Windows support. Msprime is now routinely tested on Windows as part of the suite of continuous integration tests.
- Newick output is not supported for more general trees. (#117)
- The genotype_matrix method allows efficient access to the full genotype matrix. (#306)
- The variants iterator no longer uses a single buffer for genotype data, removing a common source of error (#253).
- Unicode and ASCII output formats for SparseTree.draw().
- SparseTree.draw() renders trees in the more conventional 'square shoulders' format.
- SparseTree.draw() by default returns an SVG string, so it can be easily displayed in a Jupyter notebook. (#204)
- Preliminary support for a broad class of site-based statistics, including Patterson's f-statistics, has been added through the SiteStatCalculator and its branch length analog, BranchLengthStatCalculator. The interface is still in development and may change.
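The allele and genotype encoding described for variants() can be sketched in plain Python. This is an illustration of the idea only, not msprime's code: encode_variant is a hypothetical helper, and placing the ancestral state at index 0 with derived states appended in order of first appearance is an assumption made for the example.

```python
def encode_variant(ancestral_state, sample_states):
    """Encode per-sample states as indexes into a list of alleles.

    Hypothetical helper for illustration. Assumption: the ancestral state
    is allele 0 and derived states follow in order of first appearance.
    """
    alleles = [ancestral_state]
    genotypes = []
    for state in sample_states:
        if state not in alleles:
            # First time this derived state is seen: give it the next index.
            alleles.append(state)
        genotypes.append(alleles.index(state))
    return alleles, genotypes


# States are arbitrary character strings, not just "0" and "1".
alleles, genotypes = encode_variant("A", ["A", "T", "A", "ACG"])
print(alleles)
print(genotypes)
```

Storing small integer indexes keeps the genotype data compact while still allowing arbitrary strings as mutational states, since each string is recorded once in the allele list.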
Bug fixes:
- Duplicate sites are no longer possible (#159).
- Fix for incorrect population sizes in DemographyDebugger (#66).
Deprecated:
- The records iterator has been deprecated, and the underlying data model has moved away from the concept of coalescence records. The structure of a tree sequence is now defined in terms of a set of nodes and edges, essentially a normalised version of coalescence records.
- Changed population_id to population in various DemographicEvent classes for consistency. The old population_id argument is kept as a deprecated alias.
- Changed destination to dest in MassMigrationEvent. The old destination argument is retained as a deprecated alias.
- Changed sample_size to num_samples in TreeSequence and SparseTree. The older versions are retained as deprecated aliases.
- Changed get_num_leaves to num_samples in SparseTree. The get_num_leaves method and other related methods, which have been retained for backwards compatibility, are semantically incorrect, in that they now return the number of samples. This should have no effect on existing code, since samples and leaves were synonymous. New code should use the documented num_samples form.
- Accessing the position attribute on a Mutation or Variant object is now deprecated, as this is a property of a Site.
- Accessing the index attribute on a Mutation or Variant object is now deprecated. Please use variant.site.id instead. In general, objects with IDs (i.e., derived from tables) now have an id field.
- Various get_ methods in TreeSequence and SparseTree have been replaced by more Pythonic alternatives.
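To make the phrase "a normalised version of coalescence records" more concrete, here is a rough sketch in plain Python. It is not msprime's code: Record, Edge and normalise_records are hypothetical names, the record layout (left, right, node, children, time) is an assumption, and so is the rule that nodes never appearing as a parent are samples at time 0.

```python
from collections import namedtuple

# Hypothetical stand-ins for illustration; these are not msprime classes.
Record = namedtuple("Record", ["left", "right", "node", "children", "time"])
Edge = namedtuple("Edge", ["left", "right", "parent", "child"])


def normalise_records(records):
    """Split records into per-node times and one edge per parent-child pair."""
    node_times = {}
    edges = []
    for rec in records:
        # The time is stored once per node, however many records mention it.
        node_times[rec.node] = rec.time
        for child in rec.children:
            # Assumption: nodes that never appear as a parent are samples
            # at time 0.
            node_times.setdefault(child, 0.0)
            edges.append(Edge(rec.left, rec.right, rec.node, child))
    return node_times, edges


records = [
    Record(0.0, 1.0, 4, (0, 1), 0.1),
    Record(0.0, 0.5, 5, (2, 3), 0.4),
    Record(0.5, 1.0, 5, (3, 4), 0.4),
    Record(0.0, 0.5, 6, (4, 5), 0.9),
    Record(0.5, 1.0, 6, (2, 5), 0.9),
]
node_times, edges = normalise_records(records)
print(len(node_times), "nodes and", len(edges), "edges")
```

In the record form, the time of node 5 is repeated in two records. In the node and edge form it is stored once, and each edge carries only the interval and the parent-child pair.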