initial build workflows in snakemake (#307)
* tweaking snakemake workflow for row diff 2.0

* grouping options in workflow build command

* fixing failing test

* adding missing files

* fixing primary graph creation workflow

* tweaking how input sequence files are handled, adding --seqs-dir-path argument

* fixing primary graph creation work flow

* separating out workflow package

* renaming command line tool to metagraph-workflows

* adding --additional-snakemake-args parameter

* improving default handling

* moving some python code out to common.smk

* updating readme, still rough draft

* integrating metagraph-workflows test in CI

* Actions: tweak setup of metagraph binary

* moving directory with snakemake code into metagraph_workflows to simplify packaging

* workflows: improve implementation of memory config management

* adding --disk-swap option in more rules, some refactoring

* adding build.smk which was ignored

* incorporate mem config for every rule

* updating workflow graph

* add missing cfg_utils.py file

* first iteration on supporting workflow for building graphs on a per sample basis separately

* tweaking data staging mechanism and including it in the example workflow. moving example workflow related files into a separate directory

* cleaning up pypi packaging

* adding some more parametrization of metagraph commands

* using snakemake's log directive systematically

* change lookup for rule configs, as the current one doesn't seem to work reliably (i.e. configs from the wrong rule are looked up)

* changing some directory names

* by default, remove intermediary output files during the build phase (files can be kept using snakemake's --notemp option)

* adding disk-cap, mem-cap and swap-dir to more rules. also fixed some build rules to use canonical mode

* adding verbose flag to all build rules

* including KMC, improving resource management

* using buffer instead of cap, e.g. renamed mem_cap to mem_buffer

* renaming exec_cmd -> metagraph_cmd

* timing commands using GNU time

* when estimating memory buffer size, cap maximum at 50GB

* improving logging

* supporting samples consisting of several files

* making a parameter out of MAX_BUFFER_SIZE

* making it possible to set number of threads via config for primarize_canonical_graph_single_sample and build_canonical_graph_single_sample

* tweak memory heuristics for primarize_canonical_graph_single_sample

* fixing test in test_resource_management

* fixing unit of disk-cap

* moving all kinds of utility functions to utils.py

* merging common.py and constants.py and renaming it to workflow_configs

* moving 'build' subcommand related stuff to cli.py

* remove rule graph related files

* renaming example_workflow to test_workflow

* updating setup.py

* better error message in get_gnu_time_command

* adding jupyter notebook with end to end example from indexing to querying using the python api

* throwing exception instead of return status code in run_build_workflow

* use sequences from ncbi in workflow_end_to_end_example notebook

* removing template generated Makefile in workflows python package

* fixing _convert_type

* moving content of workflows README to sphinx documentation

* some dangling changes after renaming directory
Marc Zimmermann authored Oct 22, 2021
1 parent 5c9d0ea commit 1cf7a6e
Showing 37 changed files with 4,437 additions and 10 deletions.
36 changes: 36 additions & 0 deletions .github/workflows/main.yml
@@ -215,6 +215,42 @@ jobs:
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}


  Metagraph-Workflows:
    name: Test metagraph workflows
    runs-on: ubuntu-20.04
    needs: [Linux]

    steps:
      - uses: actions/checkout@v2

      - name: Set up Python 3.8
        uses: actions/setup-python@v1
        with:
          python-version: 3.8

      - name: fetch static binary
        uses: actions/download-artifact@v2
        with:
          path: artifacts

      - name: setup metagraph binary
        run: |
          sudo ln -s $(pwd)/artifacts/metagraph_DNA_linux_x86/metagraph_DNA /usr/local/bin/metagraph
          sudo chmod +rx /usr/local/bin/metagraph
          /usr/local/bin/metagraph --help
          metagraph --help

      - name: Install python dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pytest
          pip install -r metagraph/workflows/requirements.txt

      - name: Test metagraph-workflows pytest
        run: |
          cd metagraph/workflows
          pytest

  Release:
    name: Create Release
    if: contains(github.ref, 'tags/v')
1 change: 1 addition & 0 deletions metagraph/api/python/README.rst
@@ -31,3 +31,4 @@ Usage
For more examples, see `notebooks
<./notebooks>`_.

4 changes: 0 additions & 4 deletions metagraph/api/python/setup.py
@@ -31,10 +31,6 @@
         'Programming Language :: Python :: 3.6',
     ],
     description="Metagraph Toolkit",
-    entry_points={
-        'console_scripts': [
-        ],
-    },
     install_requires=requirements,
     license="MIT license",
     long_description=readme,
5 changes: 1 addition & 4 deletions metagraph/api/python/tests/test_helpers.py
@@ -14,7 +14,7 @@ def _load_json_data(filename):

 @pytest.mark.parametrize("file_name,align,expected_shape", [
     ('search_response.json', False, (4, 15)),
-    ('search_with_align_response.json', True, (354, 18))
+    ('search_with_align_response.json', True, (354, 15))
 ])
 def test_df_from_search_result(file_name, align, expected_shape):
     json_obj = _load_json_data(file_name)
@@ -27,9 +27,6 @@ def test_df_from_search_result(file_name, align, expected_shape):
                      'metasub_name', 'num_reads', 'sample_type', 'station',
                      'surface_material', 'seq_description']

-    if align:
-        expected_cols = expected_cols + ['sequence', 'score', 'cigar']
-
     assert list(df.columns) == expected_cols


3 changes: 1 addition & 2 deletions metagraph/docs/source/index.rst
@@ -12,9 +12,8 @@ framework, a software platform for indexing and analysis of very large sequence

installation.rst
quick_start.rst
workflows.rst
api.rst
sequence_search.rst
sequence_assembly.rst
resources.rst


105 changes: 105 additions & 0 deletions metagraph/docs/source/workflows.rst
@@ -0,0 +1,105 @@
=========
Workflows
=========

This package provides workflows for the `metagraph framework
<https://metagraph.ethz.ch>`_.


Workflows for Creating Graphs and Annotations
---------------------------------------------

Since creating graphs and indices involves several steps, this package provides
some support to simplify these tasks, in particular for standard cases.

Given some raw sequence data and a few options such as the k-mer size (``k``), graphs and annotations
are built automatically:

.. code-block:: bash

    metagraph-workflows build -k 5 transcript_paths.txt /tmp/mygraph

If you prefer invoking the workflow from within a Python script, the following is equivalent:

.. code-block:: python

    from metagraph_workflows import workflows
    workflows.run_build_workflow('/tmp/mygraph', seqs_file_list_path='transcript_paths.txt', k=5)

The workflow logic itself is expressed as a `Snakemake workflow
<https://snakemake.readthedocs.io/>`_. You can also invoke the workflows directly
using the ``snakemake`` command line tool (see below).
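
The build examples above pass ``transcript_paths.txt`` as the list of input sequence files. The following is a minimal sketch of how such a list could be generated from a directory. It assumes the list contains one path per line, which is not stated here and should be checked against ``metagraph-workflows build --help``. The helper ``write_seq_file_list`` and the default suffixes are hypothetical and not part of ``metagraph_workflows``.

```python
from pathlib import Path
from typing import Iterable


def write_seq_file_list(
    seqs_dir: Path,
    list_path: Path,
    suffixes: Iterable[str] = (".fa", ".fasta", ".fa.gz", ".fasta.gz"),
) -> int:
    """Write the paths of all sequence files below seqs_dir to list_path.

    Assumption: the workflow expects one file path per line.
    Hypothetical helper, not part of metagraph_workflows.
    Returns the number of paths written.
    """
    wanted = tuple(suffixes)
    # Collect matching files recursively and sort them for a reproducible list.
    paths = sorted(
        p.resolve()
        for p in Path(seqs_dir).rglob("*")
        if p.is_file() and p.name.endswith(wanted)
    )
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return len(paths)
```

The resulting file could then be passed as the positional file list argument of ``metagraph-workflows build`` or as ``seqs_file_list_path`` to ``run_build_workflow``.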


Installation and Setup
~~~~~~~~~~~~~~~~~~~~~~~


Set up a conda environment and install the necessary packages using:

.. code-block:: bash

    conda create -n metagraph-workflows python=3.8
    conda activate metagraph-workflows
    conda install -c bioconda -c conda-forge metagraph
    pip install -U "git+https://github.com/ratschlab/metagraph.git#subdirectory=metagraph/workflows"

Usage Example
~~~~~~~~~~~~~

Typically, the following steps would be performed:

1. Sequence file preparation: add your sequence files of interest to a directory.

2. Running the workflow: invoke the workflow using ``metagraph-workflows build``. Important parameters you may consider tuning are:

   * k
   * primary vs. non-primary graph creation
   * annotation label source: ``sequence_headers`` or ``sequence_file_names``

   An example invocation:

   .. code-block:: bash

       metagraph-workflows build -k 31 \
           --seqs-dir-path [PATH_TO_SEQUENCES] \
           --annotation-labels-source sequence_headers \
           --build-primary-graph \
           [OUTPUT_DIR]

   See ``metagraph-workflows build --help`` for more help.

3. Running queries: once you have created the indices, you can query them either with the command line
   query tool or by starting the metagraph server on your laptop or another suitable machine and
   running queries using e.g. the Python :ref:`API` client.


There is also a `Jupyter notebook <https://github.com/ratschlab/metagraph/blob/master/metagraph/workflows/notebooks/workflow_end_to_end_example.ipynb>`_ walking you through an example from indexing to API querying.



Workflow Management
~~~~~~~~~~~~~~~~~~~

The following snakemake options are exposed in the ``build`` subcommand:

* ``--dryrun``: see what workflow steps would be done
* ``--force`` (corresponds to ``--forceall`` in snakemake): force run all steps


Directly Invoking Snakemake Workflow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The above command is only a wrapper around a snakemake workflow. You can also
invoke the snakemake workflow directly (assuming you have checked out the `metagraph git repository <https://github.com/ratschlab/metagraph>`_):

.. code-block:: bash

    cd metagraph/workflows
    snakemake --forceall --configfile default.yml \
        --config k=5 seqs_file_list_path='transcript_paths.txt' output_directory=/tmp/mygraph \
        annotation_labels_source=sequence_headers --cores 2
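
When the snakemake invocation above is scripted, the ``--config`` overrides are easier to manage as a dictionary. The sketch below turns such a dictionary into the ``key=value`` arguments used in that command. The function ``snakemake_args`` is hypothetical. The config keys in the usage line are the ones shown above, and no other keys are assumed to exist.

```python
from typing import Dict, List, Union

ConfigValue = Union[str, int, float, bool]


def snakemake_args(
    config: Dict[str, ConfigValue],
    cores: int = 1,
    configfile: str = "default.yml",
    forceall: bool = False,
) -> List[str]:
    """Assemble (but do not run) a snakemake command with --config overrides.

    Hypothetical helper mirroring the command line shown above.
    """
    cmd = ["snakemake"]
    if forceall:
        cmd.append("--forceall")
    cmd += ["--configfile", configfile, "--config"]
    # Each override is passed as a single key=value token.
    cmd += [f"{key}={value}" for key, value in config.items()]
    cmd += ["--cores", str(cores)]
    return cmd


args = snakemake_args(
    {
        "k": 5,
        "seqs_file_list_path": "transcript_paths.txt",
        "output_directory": "/tmp/mygraph",
        "annotation_labels_source": "sequence_headers",
    },
    cores=2,
    forceall=True,
)
```

Passing the arguments as a list, for example to ``subprocess.run(args, check=True)`` from within ``metagraph/workflows``, avoids shell quoting issues with paths that contain spaces.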
21 changes: 21 additions & 0 deletions metagraph/workflows/.editorconfig
@@ -0,0 +1,21 @@
# http://editorconfig.org

root = true

[*]
indent_style = space
indent_size = 4
trim_trailing_whitespace = true
insert_final_newline = true
charset = utf-8
end_of_line = lf

[*.bat]
indent_style = tab
end_of_line = crlf

[LICENSE]
insert_final_newline = false

[Makefile]
indent_style = tab
108 changes: 108 additions & 0 deletions metagraph/workflows/.gitignore
@@ -0,0 +1,108 @@
.snakemake
metagraph_workflows/snakemake/output_dir_example

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# dotenv
.env

# virtualenv
.venv
venv/
ENV/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

# Pycharm
.idea
24 changes: 24 additions & 0 deletions metagraph/workflows/LICENSE
@@ -0,0 +1,24 @@


MIT License

Copyright (c) 2021, ETH Zurich, Biomedical Informatics Group; Marc Zimmermann

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

12 changes: 12 additions & 0 deletions metagraph/workflows/MANIFEST.in
@@ -0,0 +1,12 @@
include LICENSE
include requirements.txt

recursive-include tests *
recursive-exclude * __pycache__
recursive-exclude * *.py[co]

recursive-include docs *.rst conf.py Makefile make.bat *.jpg *.png *.gif

recursive-include metagraph_workflows/snakemake *.smk Snakefile default.yml
recursive-include metagraph_workflows/snakemake/test_data *.fa
recursive-exclude **/.snakemake *
8 changes: 8 additions & 0 deletions metagraph/workflows/README.rst
@@ -0,0 +1,8 @@
===================
metagraph_workflows
===================

This package provides workflows for the `metagraph framework
<https://metagraph.ethz.ch>`_.

See the `corresponding section <https://metagraph.ethz.ch/static/docs/workflows.html>`_ in the metagraph documentation.
7 changes: 7 additions & 0 deletions metagraph/workflows/metagraph_workflows/__init__.py
@@ -0,0 +1,7 @@
# -*- coding: utf-8 -*-

"""Top-level package for metagraph_workflows."""

__author__ = """Marc Zimmermann"""
__email__ = '[email protected]'
__version__ = '0.1.0'