Workflows Base Module #229

Merged
merged 79 commits into from
Apr 4, 2023
79 commits
ad59354
initialize workflows base PR
cadeduckworth Jan 15, 2023
792f6f1
initial transfer of workflows base module files and testing data from…
cadeduckworth Jan 15, 2023
093f98a
initialize workflows registry module
cadeduckworth Jan 15, 2023
2416a44
minor updates and reminders for when PR217 is merged
cadeduckworth Feb 7, 2023
7f5075c
add base functionality with workflows registry module, core functions…
cadeduckworth Feb 7, 2023
bd5607c
remove old testing data
cadeduckworth Feb 14, 2023
f9885f3
pre-merge prep review and changes
cadeduckworth Feb 14, 2023
9ba2a5d
Merge branch 'develop' into workflows-base
cadeduckworth Feb 14, 2023
30ba2c0
docs
cadeduckworth Feb 14, 2023
9a58cef
add option to provide a directory for csv file to be saved from outpu…
cadeduckworth Feb 14, 2023
7e0765d
docs
cadeduckworth Feb 14, 2023
28c72ae
fix import names
cadeduckworth Feb 14, 2023
e944833
add new testing data, .csv for workflows base module
cadeduckworth Feb 14, 2023
f6dbf55
initialize workflows tests
cadeduckworth Feb 14, 2023
c8c3a5b
update existing tests, issue with temp dirs and files still exists
cadeduckworth Feb 14, 2023
d2100f6
change file paths and naming conventions
cadeduckworth Feb 17, 2023
98e99a7
Update test_workflows_base.py
cadeduckworth Feb 17, 2023
3baa028
change base test assert value
cadeduckworth Feb 17, 2023
f5a6be7
assert df value
cadeduckworth Feb 17, 2023
698df96
tests, directory path dataframes
cadeduckworth Feb 18, 2023
e7c0b28
directory_paths csv input test, unsure if additional assertion needed
cadeduckworth Feb 21, 2023
bac0066
test errors, exceptions, logging, workflows base module
cadeduckworth Feb 21, 2023
554f31d
add workflows to STATES dictioonary
cadeduckworth Mar 4, 2023
41ccfb6
change and update testing resource paths and add fixtures for tests
cadeduckworth Mar 4, 2023
bbc3477
fix double and single quotes and string formatting in workflows base …
cadeduckworth Mar 21, 2023
68ebd60
edit docs
cadeduckworth Mar 21, 2023
489f397
cleanup tests
cadeduckworth Mar 21, 2023
03c64ba
Merge branch 'develop' into workflows-base
orbeckst Mar 21, 2023
aff964c
add documentation for workflows registry
cadeduckworth Mar 25, 2023
650d934
Merge branch 'workflows-base' of github.com:Becksteinlab/MDPOW into w…
cadeduckworth Mar 25, 2023
f6999e4
registry docs
cadeduckworth Mar 25, 2023
ac03192
docs, and new entry in CHANGES
cadeduckworth Mar 25, 2023
9c8334b
doc changes for registry
cadeduckworth Mar 28, 2023
eb75456
registry docs
cadeduckworth Mar 28, 2023
2fc3241
registry docs
cadeduckworth Mar 28, 2023
32316c3
registry docs
cadeduckworth Mar 28, 2023
cba5513
registry docs
cadeduckworth Mar 28, 2023
0e91d02
docs
cadeduckworth Mar 28, 2023
acdedc4
docs
cadeduckworth Mar 28, 2023
56f9fd0
docs
cadeduckworth Mar 28, 2023
7712e2c
docs
cadeduckworth Mar 28, 2023
f2a73c6
docs
cadeduckworth Mar 28, 2023
fa6bf25
docs
cadeduckworth Mar 28, 2023
e173888
docs
cadeduckworth Mar 28, 2023
d1746fd
docs
cadeduckworth Mar 28, 2023
0f2c50c
docs
cadeduckworth Mar 28, 2023
827083d
docs
cadeduckworth Mar 28, 2023
1317446
docs
cadeduckworth Mar 28, 2023
f26dfb4
docs
cadeduckworth Mar 28, 2023
19d4f97
docs
cadeduckworth Mar 28, 2023
e5af917
docs
cadeduckworth Mar 28, 2023
9f65ba1
docs and naming conventions
cadeduckworth Mar 28, 2023
63532d7
tests and docs, naming conventions
cadeduckworth Mar 28, 2023
936b2de
remove deprecated test
cadeduckworth Mar 28, 2023
94c0320
Merge branch 'develop' into workflows-base
orbeckst Mar 28, 2023
e1a2ea8
docs and formatting
cadeduckworth Mar 31, 2023
e232302
reduce and reorganize try/except method to remove ambiguity and incre…
cadeduckworth Mar 31, 2023
63e084e
docs
cadeduckworth Mar 31, 2023
98d3770
Merge branch 'workflows-base' of github.com:Becksteinlab/MDPOW into w…
cadeduckworth Mar 31, 2023
fc0e580
docs
cadeduckworth Mar 31, 2023
c3f3ab6
reST workflows registry table for docs
cadeduckworth Mar 31, 2023
26e02a0
registry doc table
cadeduckworth Mar 31, 2023
41972e2
registry doc table
cadeduckworth Mar 31, 2023
f2d1bc4
registry doc table
cadeduckworth Mar 31, 2023
3f085d1
registry docs
cadeduckworth Mar 31, 2023
7b49ba8
registry docs
cadeduckworth Mar 31, 2023
3e68f44
registry docs
cadeduckworth Mar 31, 2023
f978ba9
registry docs
cadeduckworth Mar 31, 2023
78223c5
registry docs
cadeduckworth Mar 31, 2023
4434454
registry docs
cadeduckworth Mar 31, 2023
d831df4
registry docs
cadeduckworth Mar 31, 2023
a231376
registry docs
cadeduckworth Apr 1, 2023
5ddff82
registry docs
cadeduckworth Apr 1, 2023
bad6f85
registry docs
cadeduckworth Apr 1, 2023
e1b06aa
registry docs
cadeduckworth Apr 1, 2023
679476b
registry docs
cadeduckworth Apr 1, 2023
5625720
registry docs
cadeduckworth Apr 1, 2023
e337d29
registry docs
cadeduckworth Apr 1, 2023
099d6b5
Apply suggestions from code review
orbeckst Apr 4, 2023
11 changes: 7 additions & 4 deletions CHANGES
@@ -23,10 +23,13 @@ Changes

 Enhancements

-* new workflows module (PR #217)
+* new workflows registry that contains each EnsembleAnalysis for which
+  a workflows module exists, for use with workflows base module (#229)
+* new workflows base module that provides iterative workflow use for
+  directories that contain multiple projects (#229)
+* new workflows module (#217)
 * new automated dihedral analysis workflow (detect dihedrals with SMARTS,
-  analyze with EnsembleAnalysis, and generate seaborn violinplots)
-  PR #217)
+  analyze with EnsembleAnalysis, and generate seaborn violinplots) (#217)

 Fixes

@@ -36,7 +39,7 @@ Fixes
 * fix ensemble.EnsembleAnalysis.check_groups_from_common_ensemble (#212)


-2021-01-03 0.8.0
+2022-01-03 0.8.0
 ALescoulie, orbeckst

 Changes
2 changes: 2 additions & 0 deletions doc/sphinx/source/workflows.txt
@@ -14,4 +14,6 @@ for use with :class:`~mdpow.analysis.dihedral.DihedralAnalysis`.
 .. toctree::
    :maxdepth: 1

+   workflows/base
+   workflows/registry
    workflows/dihedrals
7 changes: 7 additions & 0 deletions doc/sphinx/source/workflows/base.txt
@@ -0,0 +1,7 @@
==============
Workflows Base
==============

.. versionadded:: 0.9.0

.. automodule:: mdpow.workflows.base
7 changes: 7 additions & 0 deletions doc/sphinx/source/workflows/registry.txt
@@ -0,0 +1,7 @@
==================
Workflows Registry
==================

.. versionadded:: 0.9.0

.. automodule:: mdpow.workflows.registry
1 change: 1 addition & 0 deletions mdpow/tests/__init__.py
@@ -13,5 +13,6 @@
     "FEP": RESOURCES.join("states", "FEP"),
     "base": RESOURCES.join("states", "base"),
     "md_npt": RESOURCES.join("states", "FEP"),
+    "workflows": RESOURCES.join("states", "workflows"),
 }
 CONFIGURATIONS = RESOURCES.join("test_configurations")
101 changes: 101 additions & 0 deletions mdpow/tests/test_workflows_base.py
@@ -0,0 +1,101 @@
import re
import os
import sys
import yaml
import pybol
import pytest
import pathlib
import logging

import pandas as pd

from . import RESOURCES
from . import STATES

import py.path

from ..workflows import base

from pkg_resources import resource_filename

RESOURCES = pathlib.PurePath(resource_filename(__name__, 'testing_resources'))
MANIFEST = RESOURCES / 'manifest.yml'

@pytest.fixture(scope='function')
def molname_workflows_directory(tmp_path):
    m = pybol.Manifest(str(MANIFEST))
    m.assemble('workflows', tmp_path)
    return tmp_path

class TestWorkflowsBase(object):

    @pytest.fixture(scope='function')
    def SM_tmp_dir(self, molname_workflows_directory):
        dirname = molname_workflows_directory
        return dirname

    @pytest.fixture(scope='function')
    def csv_input_data(self):
        csv_path = STATES['workflows'] / 'project_paths.csv'
        csv_df = pd.read_csv(csv_path).reset_index(drop=True)
        return csv_path, csv_df

    @pytest.fixture(scope='function')
    def test_df_data(self):
        test_dict = {'molecule' : ['SM25', 'SM26'],
                     'resname' : ['SM25', 'SM26']}
        test_df = pd.DataFrame(test_dict).reset_index(drop=True)
        return test_df

    @pytest.fixture(scope='function')
    def project_paths_data(self, SM_tmp_dir):
        project_paths = base.project_paths(parent_directory=SM_tmp_dir)
        return project_paths

    def test_project_paths(self, test_df_data, project_paths_data):
        test_df = test_df_data
        project_paths = project_paths_data

        assert project_paths['molecule'][0] == test_df['molecule'][0]
        assert project_paths['molecule'][1] == test_df['molecule'][1]
        assert project_paths['resname'][0] == test_df['resname'][0]
        assert project_paths['resname'][1] == test_df['resname'][1]

    def test_project_paths_csv_input(self, csv_input_data):
        csv_path, csv_df = csv_input_data
        project_paths = base.project_paths(csv=csv_path)

        pd.testing.assert_frame_equal(project_paths, csv_df)

    def test_automated_project_analysis(self, project_paths_data, caplog):
        project_paths = project_paths_data
        # change resname to match topology (every SAMPL7 resname is 'UNK')
        # only necessary for this dataset, not necessary for normal use
        project_paths['resname'] = 'UNK'

        base.automated_project_analysis(project_paths, solvents=('water',),
                                        ensemble_analysis='DihedralAnalysis')

        assert 'all analyses completed' in caplog.text, ('automated_dihedral_analysis '
               'did not iteratively run to completion for the provided project')

    def test_automated_project_analysis_KeyError(self, project_paths_data, caplog):
        caplog.clear()
        caplog.set_level(logging.ERROR, logger='mdpow.workflows.base')

        project_paths = project_paths_data
        # change resname to match topology (every SAMPL7 resname is 'UNK')
        # only necessary for this dataset, not necessary for normal use
        project_paths['resname'] = 'UNK'

        # test error output when raised
        with pytest.raises(KeyError,
                           match="Invalid ensemble_analysis 'DarthVaderAnalysis'. "
                           "An EnsembleAnalysis type that corresponds to an existing "
                           "automated workflow module must be input as a kwarg. ex: "
                           "ensemble_analysis='DihedralAnalysis'"):
            base.automated_project_analysis(project_paths, ensemble_analysis='DarthVaderAnalysis', solvents=('water',))

        # test logger error recording
        assert "'DarthVaderAnalysis' is an invalid selection" in caplog.text, ('did not catch incorrect '
               'key specification for workflows.registry that results in KeyError')
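
A note on the `match=` argument used in the KeyError test above: `pytest.raises` applies the pattern with `re.search` to the string form of the exception, and for a `KeyError` that string form is the `repr` of its argument. The sketch below reproduces this with the standard library only; the message is a shortened, hypothetical version of the one raised by the workflow code, used purely for illustration.

```python
import re

# Hypothetical message, shortened from the one raised in the workflow code.
msg = "Invalid ensemble_analysis 'DarthVaderAnalysis'. ex: ensemble_analysis='DihedralAnalysis'"

# str() of a KeyError is the repr of its single argument, so the text
# that the regex sees is wrapped in an extra pair of quotes.
text = str(KeyError(msg))

# re.search scans for the pattern anywhere in the text, so the extra
# quotes do not prevent a match.
loose = re.search(msg, text) is not None

# In an unescaped pattern "." is a wildcard; re.escape makes it literal.
strict = re.search(re.escape(msg), text) is not None

# An unescaped "." also matches text where the period is replaced.
wildcard = re.search("analysis'.", "analysis'X") is not None
```

Because the unescaped pattern is slightly more permissive than the literal message, wrapping the expected message in `re.escape` is one way to make such an assertion stricter.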
@@ -0,0 +1,3 @@
molecule,resname,path
SM25,SM25,mdpow/tests/testing_resources/states/workflows/SM25
SM26,SM26,mdpow/tests/testing_resources/states/workflows/SM26
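
For orientation, the sketch below shows how a CSV with this header can be loaded and ordered in the way the `csv` branch of `project_paths` does in this PR (sort by all three columns, then reset the index). It reads from an in-memory string instead of a file, and the rows are deliberately given out of order; both choices are assumptions made for the illustration.

```python
import io

import pandas as pd

# Same header and rows as the CSV above, but deliberately out of order.
csv_text = """molecule,resname,path
SM26,SM26,mdpow/tests/testing_resources/states/workflows/SM26
SM25,SM25,mdpow/tests/testing_resources/states/workflows/SM25
"""

raw = pd.read_csv(io.StringIO(csv_text))

# Sort by all three columns and renumber the rows from 0.
sorted_df = raw.sort_values(by=["molecule", "resname", "path"]).reset_index(drop=True)

molecules = sorted_df["molecule"].tolist()
```

Sorting makes the resulting frame independent of the row order in the file, which is why the test file can compare it against a plain `read_csv` of an already sorted CSV.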
180 changes: 180 additions & 0 deletions mdpow/workflows/base.py
@@ -0,0 +1,180 @@
# MDPOW: base.py
# 2022 Cade Duckworth

"""
:mod:`mdpow.workflows.base` --- Automated workflow base functions
=================================================================

To analyze multiple MDPOW projects, provide :func:`project_paths`
with the top-level directory containing all MDPOW projects' simulation data
to obtain a :class:`pandas.DataFrame` containing the project information
and paths. Then, :func:`automated_project_analysis` takes as input the
aforementioned :class:`pandas.DataFrame` and runs the specified
:class:`~mdpow.analysis.ensemble.EnsembleAnalysis` for all MDPOW projects
under the top-level directory provided to :func:`project_paths`.

.. seealso:: :mod:`~mdpow.workflows.registry`

.. autofunction:: project_paths
.. autofunction:: automated_project_analysis

"""

import os
import re
import pandas as pd

from mdpow.workflows import registry

import logging

logger = logging.getLogger('mdpow.workflows.base')

def project_paths(parent_directory=None, csv=None, csv_save_dir=None):
    """Takes a top directory containing MDPOW projects and determines
    the molname, resname, and path of each MDPOW project within.

    Optionally takes a .csv file containing `molecule`, `resname`, and
    `path`, in that order.

    :keywords:

    *parent_directory*
        the path for the location of the top directory
        under which the subdirectories of MDPOW simulation
        data exist; additionally creates a 'project_paths.csv' file
        for user manipulation of metadata and for future reference

    *csv*
        .csv file containing the molecule names, resnames,
        and paths, in that order, for the MDPOW simulation
        data to be iterated over; must contain a header of the
        form: `molecule,resname,path`

    *csv_save_dir*
        optionally provided directory to save the .csv file; otherwise,
        data will be saved in the current working directory

    :returns:

    *project_paths*
        :class:`pandas.DataFrame` containing MDPOW project metadata

    .. rubric:: Example

    Typical Workflow::

        project_paths = project_paths(parent_directory='/foo/bar/MDPOW_projects')
        automated_project_analysis(project_paths, ensemble_analysis='DihedralAnalysis')

    or::

        project_paths = project_paths(csv='/foo/bar/MDPOW.csv')
        automated_project_analysis(project_paths, ensemble_analysis='DihedralAnalysis')

    """

    if parent_directory is not None:

        locations = []

        reg_compile = re.compile('FEP')
        for dirpath, dirnames, filenames in os.walk(parent_directory):
            result = [dirpath.strip() for dirname in dirnames if reg_compile.match(dirname)]
            if result:
                locations.append(result[0])

        resnames = []

        for loc in locations:
            res_temp = loc.strip().split('/')
            resnames.append(res_temp[-1])

        project_paths = pd.DataFrame(
            {
                'molecule': resnames,
                'resname': resnames,
                'path': locations
            }
        )
        if csv_save_dir is not None:
            project_paths.to_csv(f'{csv_save_dir}/project_paths.csv', index=False)
            logger.info(f'project_paths saved under {csv_save_dir}')
        else:
            current_directory = os.getcwd()
            project_paths.to_csv('project_paths.csv', index=False)
            logger.info(f'project_paths saved under {current_directory}')

    elif csv is not None:
        locations = pd.read_csv(csv)
        project_paths = locations.sort_values(by=['molecule', 'resname', 'path']).reset_index(drop=True)

    return project_paths
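
The directory discovery in `project_paths` can be illustrated in isolation. The sketch below is not the PR's code: `find_project_dirs` is a hypothetical helper that applies the same criterion (a directory counts as a project if it contains a subdirectory whose name starts with `FEP`) but derives the project name with `os.path.basename` instead of splitting on `'/'`, which is an assumption about how one might make the lookup independent of the path separator. It also sorts the results, which the PR's code does not do.

```python
import os
import tempfile


def find_project_dirs(parent_directory):
    """Return sorted (name, path) pairs for directories containing an 'FEP*' subdirectory.

    Hypothetical helper for illustration; not part of MDPOW.
    """
    found = []
    for dirpath, dirnames, _filenames in os.walk(parent_directory):
        # Same criterion as re.compile('FEP').match: the name starts with 'FEP'.
        if any(name.startswith("FEP") for name in dirnames):
            # basename/normpath avoids hard-coding '/' as the separator.
            found.append((os.path.basename(os.path.normpath(dirpath)), dirpath))
    return sorted(found)


# Build a throwaway tree: two projects with an FEP directory, one without.
with tempfile.TemporaryDirectory() as top:
    for mol in ("SM25", "SM26"):
        os.makedirs(os.path.join(top, mol, "FEP", "water"))
    os.makedirs(os.path.join(top, "notes"))
    names = [name for name, _path in find_project_dirs(top)]
```

Here `names` contains only the two directories that hold an `FEP` subdirectory; the `notes` directory is skipped.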

def automated_project_analysis(project_paths, ensemble_analysis, **kwargs):
    """Takes a :class:`pandas.DataFrame` created by :func:`~mdpow.workflows.base.project_paths`
    and iteratively runs the specified :class:`~mdpow.analysis.ensemble.EnsembleAnalysis`
    for each of the projects by running the associated automated workflow
    in each project directory returned by :func:`~mdpow.workflows.base.project_paths`.

    Compatibility with more automated analyses is in development.

    :keywords:

    *project_paths*
        :class:`pandas.DataFrame` that provides paths to MDPOW projects

    *ensemble_analysis*
        name of the :class:`~mdpow.analysis.ensemble.EnsembleAnalysis`
        that corresponds to the desired automated workflow module

    *kwargs*
        keyword arguments for the supported automated workflows,
        see the :mod:`~mdpow.workflows.registry` for all available
        workflows and their call signatures

    .. rubric:: Example

    A typical workflow is the automated dihedral analysis from
    :mod:`mdpow.workflows.dihedrals`, which applies the *ensemble analysis*
    :class:`~mdpow.analysis.dihedral.DihedralAnalysis` to each project.
    The :data:`~mdpow.workflows.registry.registry` contains this automated
    workflow under the key *"DihedralAnalysis"* and so the automated execution
    for all `project_paths` (obtained via :func:`project_paths`) is performed by
    passing the specific key to :func:`automated_project_analysis`::

        project_paths = project_paths(parent_directory='/foo/bar/MDPOW_projects')
        automated_project_analysis(project_paths, ensemble_analysis='DihedralAnalysis', **kwargs)

    """

    for row in project_paths.itertuples():
        molname = row.molecule
        resname = row.resname
        dirname = row.path

        logger.info(f'starting {molname}')

        try:
            registry.registry[ensemble_analysis](dirname=dirname, resname=resname, molname=molname, **kwargs)

            logger.info(f'{molname} completed')

        except KeyError as err:
            msg = (f"Invalid ensemble_analysis {err}. An EnsembleAnalysis type that corresponds "
                   "to an existing automated workflow module must be input as a kwarg. "
                   "ex: ensemble_analysis='DihedralAnalysis'")
            logger.error(f'{err} is an invalid selection')

            raise KeyError(msg)

        except TypeError as err:
            msg = (f"Invalid ensemble_analysis {ensemble_analysis}. An EnsembleAnalysis type that "
                   "corresponds to an existing automated workflow module must be input as a kwarg. "
                   "ex: ensemble_analysis='DihedralAnalysis'")
            logger.error(f'workflow module for {ensemble_analysis} does not exist yet')

            raise TypeError(msg)

    logger.info('all analyses completed')
    return
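
In `automated_project_analysis` the `try` block wraps both the registry lookup and the workflow call, so a `KeyError` or `TypeError` raised inside the workflow itself would also be reported as an invalid `ensemble_analysis`. The sketch below shows one possible alternative, under stated assumptions: `example_registry`, `fake_dihedral_workflow`, and `run_workflow` are hypothetical names for illustration and are not part of MDPOW. It separates the lookup from the call so that only an unknown key is reported as an invalid selection.

```python
def fake_dihedral_workflow(dirname, resname, molname, **kwargs):
    """Hypothetical stand-in for a top-level automated analysis function."""
    return f"{molname}:{resname}:{dirname}:{sorted(kwargs)}"


# Hypothetical registry with the same shape as mdpow.workflows.registry.registry:
# EnsembleAnalysis name -> top-level workflow function.
example_registry = {"DihedralAnalysis": fake_dihedral_workflow}


def run_workflow(name, **call_kwargs):
    # Look the workflow up first, so that only an unknown name is reported
    # as an invalid selection ...
    try:
        workflow = example_registry[name]
    except KeyError:
        raise KeyError(f"Invalid ensemble_analysis {name!r}") from None
    # ... and errors raised inside the workflow itself propagate unchanged.
    return workflow(**call_kwargs)


result = run_workflow("DihedralAnalysis", dirname="/foo/SM25", resname="UNK",
                      molname="SM25", solvents=("water",))
```

Registering a further workflow in such a scheme would amount to adding another key whose value is a callable that accepts `dirname`, `resname`, `molname`, and its own keyword arguments.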
53 changes: 53 additions & 0 deletions mdpow/workflows/registry.py
@@ -0,0 +1,53 @@
# MDPOW: registry.py
# 2023 Cade Duckworth

"""
:mod:`mdpow.workflows.registry` --- Registry of currently supported automated workflows
=======================================================================================

The :mod:`mdpow.workflows.registry` module hosts a dictionary with keys that correspond to an
:class:`~mdpow.analysis.ensemble.EnsembleAnalysis` for which a corresponding automated workflow exists.

.. table:: Currently supported automated workflows.
    :widths: auto
    :name: workflows_registry

    +-------------------------------+------------------------------------------------------------------------------------------------------+
    | key/keyword: EnsembleAnalysis | value: <workflow module>.<top-level automated analysis function>                                     |
    +===============================+======================================================================================================+
    | DihedralAnalysis              | :any:`dihedrals.automated_dihedral_analysis <mdpow.workflows.dihedrals.automated_dihedral_analysis>` |
    +-------------------------------+------------------------------------------------------------------------------------------------------+

.. autodata:: registry

.. seealso:: :mod:`~mdpow.workflows.base`

"""

# import analysis
from mdpow.workflows import dihedrals

registry = {

    'DihedralAnalysis' : dihedrals.automated_dihedral_analysis

}

"""
In the `registry`, each entry corresponds to an
:class:`~mdpow.analysis.ensemble.EnsembleAnalysis`
for which a corresponding automated workflow exists.

Intended for use with :mod:`mdpow.workflows.base` to specify which
:class:`~mdpow.analysis.ensemble.EnsembleAnalysis` should run iteratively over
the provided project data directory.

To include a new automated workflow for use with :mod:`mdpow.workflows.base`,
create a key that is the name of the corresponding
:class:`~mdpow.analysis.ensemble.EnsembleAnalysis`, with the value defined as
`<workflow module>.<top-level automated analysis function>`.

The available automated workflows (key-value pairs) are listed in the
table :any:`Currently supported automated workflows. <workflows_registry>`

"""