Merge pull request #346 from jeromekelleher/tables-docs
Tables docs
jeromekelleher authored Jan 15, 2018
2 parents 8f63bc4 + 4360ca7 commit cffec4f
Showing 11 changed files with 1,129 additions and 335 deletions.
4 changes: 2 additions & 2 deletions _msprimemodule.c
@@ -2607,7 +2607,7 @@ MutationTable_add_row(MutationTable *self, PyObject *args, PyObject *kwds)
int parent = MSP_NULL_MUTATION;
char *derived_state;
Py_ssize_t derived_state_length;
PyObject *py_metadata = NULL;
PyObject *py_metadata = Py_None;
char *metadata = NULL;
Py_ssize_t metadata_length = 0;
static char *kwlist[] = {"site", "node", "derived_state", "parent", "metadata", NULL};
@@ -2620,7 +2620,7 @@ MutationTable_add_row(MutationTable *self, PyObject *args, PyObject *kwds)
if (MutationTable_check_state(self) != 0) {
goto out;
}
if (py_metadata != NULL) {
if (py_metadata != Py_None) {
if (PyBytes_AsStringAndSize(py_metadata, &metadata, &metadata_length) < 0) {
goto out;
}
1 change: 0 additions & 1 deletion docs/_build/.README

This file was deleted.

210 changes: 201 additions & 9 deletions docs/api.rst
@@ -1,11 +1,11 @@
.. _sec-api:
.. _sec_api:

=================
API Documentation
=================

This is the API documentation for ``msprime``, and provides detailed information
on the Python programming interface. See the :ref:`sec-tutorial` for an
on the Python programming interface. See the :ref:`sec_tutorial` for an
introduction to using this API to run simulations and analyse the results.

****************
@@ -181,11 +181,11 @@ Loading data

There are several methods for loading data into the msprime API. The simplest
and most convenient is to use the :func:`msprime.load` function to load
a :ref:`HDF ancestry file <sec-hdf5-file-format>`. For small scale data
a :ref:`HDF ancestry file <sec_hdf5_file_format>`. For small scale data
and debugging, it is often convenient to use the :func:`msprime.load_text`
to read data in the :ref:`text file format <sec-text-file-format>`.
to read data in the :ref:`text file format <sec_text_file_format>`.
The :func:`msprime.load_tables` function efficiently loads large volumes
of data using the :ref:`Tables API <sec-tables-api>`.
of data using the :ref:`Tables API <sec_tables_api>`.


.. autofunction:: msprime.load
@@ -219,14 +219,14 @@ population genetics statistics from a given :class:`.TreeSequence`.
:members:


.. _sec-tables-api:
.. _sec_tables_api:

***********
Tables API
***********

The :ref:`tables API <sec-binary-interchange>` provides an efficient way of working
with and interchanging :ref:`tree sequence data <sec-data-model>`. Each table
The :ref:`tables API <sec_binary_interchange>` provides an efficient way of working
with and interchanging :ref:`tree sequence data <sec_data_model>`. Each table
class (e.g., :class:`.NodeTable`, :class:`.EdgeTable`) has a specific set of
columns with fixed types, and a set of methods for setting and getting the data
in these columns. The number of rows in the table ``t`` is given by ``len(t)``.
@@ -271,7 +271,7 @@ computations using the :mod:`multiprocessing` module). ::
1 1.00000000 2.00000000 9 11

However, pickling will not be as efficient as storing tables
in the native :ref:`HDF5 format <sec-hdf5-file-format>`.
in the native :ref:`HDF5 format <sec_hdf5_file_format>`.

Tables support the equality operator ``==`` based on the data
held in the columns::
@@ -290,6 +290,186 @@ held in the columns::
>>> t == t2
False



.. _sec_tables_api_text_columns:

++++++++++++
Text columns
++++++++++++

As described in the :ref:`sec_encoding_ragged_columns` section, working with
variable length columns is somewhat more involved. Columns
encoding text data store the **encoded bytes** of the flattened
strings, and the offsets into this column in two separate
arrays.

Consider the following example::

    >>> t = msprime.SiteTable()
    >>> t.add_row(0, "A")
    >>> t.add_row(1, "BB")
    >>> t.add_row(2, "")
    >>> t.add_row(3, "CCC")
    >>> print(t)
    id position ancestral_state metadata
    0 0.00000000 A
    1 1.00000000 BB
    2 2.00000000
    3 3.00000000 CCC
    >>> t[0]
    SiteTableRow(position=0.0, ancestral_state='A', metadata=b'')
    >>> t[1]
    SiteTableRow(position=1.0, ancestral_state='BB', metadata=b'')
    >>> t[2]
    SiteTableRow(position=2.0, ancestral_state='', metadata=b'')
    >>> t[3]
    SiteTableRow(position=3.0, ancestral_state='CCC', metadata=b'')

Here we create a :class:`.SiteTable` and add four rows, each with a different
``ancestral_state``. We can then access this information from each
row in a straightforward manner. Working with the data in the columns
is a little trickier, however::

    >>> t.ancestral_state
    array([65, 66, 66, 67, 67, 67], dtype=int8)
    >>> t.ancestral_state_offset
    array([0, 1, 3, 3, 6], dtype=uint32)
    >>> msprime.unpack_strings(t.ancestral_state, t.ancestral_state_offset)
    ['A', 'BB', '', 'CCC']

Here, the ``ancestral_state`` array holds the UTF8-encoded bytes of the flattened
strings, and ``ancestral_state_offset`` holds the offset into this array
for each row. Row ``j`` occupies the bytes from ``ancestral_state_offset[j]``
up to (but not including) ``ancestral_state_offset[j + 1]``, so there is one
more offset than there are rows. The :func:`.unpack_strings` function is a
convenient way to recover the original strings from this encoding. We can also
use :func:`.pack_strings` to insert data using this approach::

    >>> a, off = msprime.pack_strings(["0", "12", ""])
    >>> t.set_columns(position=[0, 1, 2], ancestral_state=a, ancestral_state_offset=off)
    >>> print(t)
    id position ancestral_state metadata
    0 0.00000000 0
    1 1.00000000 12
    2 2.00000000
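
To make the flattened layout concrete, the following is a minimal pure-Python
sketch of the packing scheme described above. The helpers
``pack_strings_sketch`` and ``unpack_strings_sketch`` are hypothetical names
used only for illustration; they are not msprime's implementation, and they
return plain lists rather than the numpy ``int8`` and ``uint32`` arrays held
by the tables.

```python
def pack_strings_sketch(strings, encoding="utf8"):
    """Flatten a list of strings into (byte values, offsets).

    Illustrative only: not msprime's pack_strings implementation.
    """
    encoded = [s.encode(encoding) for s in strings]
    flat = b"".join(encoded)
    offsets = [0]
    for e in encoded:
        # Each offset marks where the next row's bytes begin.
        offsets.append(offsets[-1] + len(e))
    return list(flat), offsets


def unpack_strings_sketch(column, offsets, encoding="utf8"):
    """Recover the strings: row j spans offsets[j] to offsets[j + 1]."""
    data = bytes(column)
    return [
        data[start:end].decode(encoding)
        for start, end in zip(offsets[:-1], offsets[1:])
    ]


column, offsets = pack_strings_sketch(["A", "BB", "", "CCC"])
print(column)   # [65, 66, 66, 67, 67, 67]
print(offsets)  # [0, 1, 3, 3, 6]
print(unpack_strings_sketch(column, offsets))  # ['A', 'BB', '', 'CCC']
```

Note that the empty string contributes no bytes, which is why the offset 3
appears twice.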

When inserting many rows with standard infinite sites mutations (i.e.,
ancestral state is "0"), it is more efficient to construct the
numpy arrays directly than to create a list of strings and use
:func:`.pack_strings`. When doing this, it is important to note that
it is the **encoded** byte values that are stored; by default, we
use UTF8 (which corresponds to ASCII for simple printable characters)::

    >>> import numpy as np
    >>> t_s = msprime.SiteTable()
    >>> m = 10
    >>> a = ord("0") + np.zeros(m, dtype=np.int8)
    >>> off = np.arange(m + 1, dtype=np.uint32)
    >>> t_s.set_columns(position=np.arange(m), ancestral_state=a, ancestral_state_offset=off)
    >>> print(t_s)
    id position ancestral_state metadata
    0 0.00000000 0
    1 1.00000000 0
    2 2.00000000 0
    3 3.00000000 0
    4 4.00000000 0
    5 5.00000000 0
    6 6.00000000 0
    7 7.00000000 0
    8 8.00000000 0
    9 9.00000000 0
    >>> t_s.ancestral_state
    array([48, 48, 48, 48, 48, 48, 48, 48, 48, 48], dtype=int8)
    >>> t_s.ancestral_state_offset
    array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=uint32)

Here we create 10 sites at regular positions, each with ancestral state equal to
"0". Note that we use ``ord("0")`` to get the ASCII code for "0" (48), and create
10 copies of this by adding it to an array of zeros.
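
The offsets used here follow directly from every state being a single byte:
row ``j`` starts at byte ``j`` and ends at byte ``j + 1``. A small sketch in
plain Python (lists stand in for the numpy arrays, and the variable names are
illustrative only) checks that decoding this layout gives ten copies of "0":

```python
m = 10

# One byte per row: the ASCII/UTF8 code for "0" repeated m times.
state = [ord("0")] * m

# m + 1 offsets: row j occupies bytes offset[j] up to offset[j + 1].
offset = list(range(m + 1))

data = bytes(state)
decoded = [data[offset[j]:offset[j + 1]].decode("utf8") for j in range(m)]
print(decoded)
```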

Mutations can be handled similarly::

    >>> t_m = msprime.MutationTable()
    >>> site = np.arange(m, dtype=np.int32)
    >>> d = ord("1") + np.zeros(m, dtype=np.int8)
    >>> off = np.arange(m + 1, dtype=np.uint32)
    >>> node = np.zeros(m, dtype=np.int32)
    >>> t_m.set_columns(site=site, node=node, derived_state=d, derived_state_offset=off)
    >>> print(t_m)
    id site node derived_state parent metadata
    0 0 0 1 -1
    1 1 0 1 -1
    2 2 0 1 -1
    3 3 0 1 -1
    4 4 0 1 -1
    5 5 0 1 -1
    6 6 0 1 -1
    7 7 0 1 -1
    8 8 0 1 -1
    9 9 0 1 -1
    >>>


.. _sec_tables_api_binary_columns:

++++++++++++++
Binary columns
++++++++++++++

Columns storing binary data take the same approach as
:ref:`sec_tables_api_text_columns` to encoding
:ref:`variable length data <sec_encoding_ragged_columns>`.
The difference between the two is that
only raw :class:`bytes` values are accepted: no character encoding or
decoding is done on the data. Consider the following example::

    >>> import pickle
    >>> t = msprime.NodeTable()
    >>> t.add_row(metadata=b"raw bytes")
    >>> t.add_row(metadata=pickle.dumps({"x": 1.1}))
    >>> t[0].metadata
    b'raw bytes'
    >>> t[1].metadata
    b'\x80\x03}q\x00X\x01\x00\x00\x00xq\x01G?\xf1\x99\x99\x99\x99\x99\x9as.'
    >>> pickle.loads(t[1].metadata)
    {'x': 1.1}
    >>> print(t)
    id flags population time metadata
    0 0 -1 0.00000000000000 cmF3IGJ5dGVz
    1 0 -1 0.00000000000000 gAN9cQBYAQAAAHhxAUc/8ZmZmZmZmnMu
    >>> t.metadata
    array([ 114, 97, 119, 32, 98, 121, 116, 101, 115, -128, 3,
    125, 113, 0, 88, 1, 0, 0, 0, 120, 113, 1,
    71, 63, -15, -103, -103, -103, -103, -103, -102, 115, 46], dtype=int8)
    >>> t.metadata_offset
    array([ 0, 9, 33], dtype=uint32)


Here we add two rows to a :class:`.NodeTable`, with different
:ref:`metadata <sec_metadata_definition>`. The first row contains a simple
byte string, and the second contains a Python dictionary serialised using
:mod:`pickle`. We then show several different (and seemingly incompatible!)
views on the same data.

When we access the data in a row (e.g., ``t[0].metadata``) we are returned
a Python bytes object containing precisely the bytes that were inserted.
The pickled dictionary is encoded in 24 bytes containing unprintable
characters, and when we unpickle it using :func:`pickle.loads`, we obtain
the original dictionary.

When we print the table, however, we see some data which is seemingly
unrelated to the original contents. This is because the binary data is
`base64 encoded <https://en.wikipedia.org/wiki/Base64>`_ to ensure
that it is print-safe (and doesn't break your terminal). See the
:ref:`sec_metadata_definition` section for more information on the
use of base64 encoding.

Finally, when we print the ``metadata`` column, we see the raw byte values
encoded as signed integers. As for :ref:`sec_tables_api_text_columns`,
the ``metadata_offset`` column encodes the offsets into this array. So, we
see that the first metadata value is 9 bytes long and the second is 24.

The :func:`.pack_bytes` and :func:`.unpack_bytes` functions are also useful
for encoding data in these columns.
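
The three views of the metadata described in this section can be reproduced
with the standard library alone. The sketch below is illustrative and is not
msprime's implementation: ``pack_bytes_sketch`` is a hypothetical helper, plain
lists stand in for the numpy arrays, and the pickled value is written out as
the literal bytes from the example rather than regenerated with
:func:`pickle.dumps`, since the exact bytes depend on the pickle protocol in
use.

```python
import base64


def pack_bytes_sketch(values):
    """Flatten bytes objects into (signed byte values, offsets).

    Illustrative only: not msprime's pack_bytes implementation.
    """
    flat = b"".join(values)
    # Reinterpret each unsigned byte (0..255) as a signed 8-bit value,
    # matching the int8 view shown when the column is printed.
    column = [b - 256 if b > 127 else b for b in flat]
    offsets = [0]
    for v in values:
        offsets.append(offsets[-1] + len(v))
    return column, offsets


rows = [
    b"raw bytes",
    # The 24 pickled bytes shown in the example above.
    b"\x80\x03}q\x00X\x01\x00\x00\x00xq\x01G?\xf1\x99\x99\x99\x99\x99\x9as.",
]

column, offsets = pack_bytes_sketch(rows)
print(offsets)       # [0, 9, 33]
print(column[9:11])  # [-128, 3]

# The print-safe form shown when the table is printed.
printable = [base64.b64encode(r).decode("ascii") for r in rows]
print(printable[0])  # cmF3IGJ5dGVz
```

The signed view is only a reinterpretation of the same bytes: adding 256 to
each negative value recovers the original unsigned byte.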

+++++++++++++
Table classes
+++++++++++++
@@ -298,12 +478,16 @@ Table classes
:members:

.. autoclass:: msprime.EdgeTable
:members:

.. autoclass:: msprime.MigrationTable
:members:

.. autoclass:: msprime.SiteTable
:members:

.. autoclass:: msprime.MutationTable
:members:

.. autoclass:: msprime.ProvenanceTable

@@ -322,3 +506,11 @@ Table functions
.. autofunction:: msprime.parse_sites

.. autofunction:: msprime.parse_mutations

.. autofunction:: msprime.pack_strings

.. autofunction:: msprime.unpack_strings

.. autofunction:: msprime.pack_bytes

.. autofunction:: msprime.unpack_bytes
20 changes: 10 additions & 10 deletions docs/cli.rst
@@ -1,28 +1,28 @@
.. _sec-cli:
.. _sec_cli:

======================
Command line interface
======================

Two command-line applications are provided with ``msprime``: :ref:`sec-msp` and
:ref:`sec-mspms`. The :command:`msp` program is an experimental interface for
Two command-line applications are provided with ``msprime``: :ref:`sec_msp` and
:ref:`sec_mspms`. The :command:`msp` program is an experimental interface for
interacting with the library, and is a POSIX compliant command line
interface. The :command:`mspms` program is a fully-:command:`ms` compatible
interface. This is useful for those who wish to get started quickly with using
the library, and also as a means of plugging ``msprime`` into existing work
flows. However, there is a substantial overhead involved in translating data
from ``msprime``'s native history file into legacy formats, and so new code
should use the :ref:`Python API <sec-api>` where possible.
should use the :ref:`Python API <sec_api>` where possible.

.. _sec-msp:
.. _sec_msp:

***
msp
***

The ``msp`` program provides a convenient interface to the :ref:`msprime API
<sec-api>`. It is based on subcommands that either generate or consume a
:ref:`history file <sec-hdf5-file-format>`. The ``simulate`` subcommand runs a
<sec_api>`. It is based on subcommands that either generate or consume a
:ref:`history file <sec_hdf5_file_format>`. The ``simulate`` subcommand runs a
simulation storing the results in a file. The other commands are concerned with
converting this file into other formats.

@@ -54,7 +54,7 @@ to the file provided as an argument.



.. _sec-msp-upgrade:
.. _sec_msp_upgrade:

+++++++++++
msp upgrade
@@ -105,7 +105,7 @@ sequence in newick format.
.. todo::
Document the nodes, edges, sites and mutations commands.

.. _sec-mspms:
.. _sec_mspms:

*****
mspms
@@ -116,7 +116,7 @@ command line interface to the ``msprime`` library. This interface should
be useful for legacy applications, where it can be used as a drop-in
replacement for :command:`ms`. This interface is not recommended for new applications,
particularly if the simulated trees are required as part of the output
as Newick is very inefficient. The :ref:`Python API <sec-api>` is the recommended interface,
as Newick is very inefficient. The :ref:`Python API <sec_api>` is the recommended interface,
providing direct access to the structures used within ``msprime``.

