Replies: 11 comments 1 reply
-
According to
The targets are resampled, but this doesn't sound quite like the bag of conformations described in the paper.
-
Discussion Notes:
What we could do:
Clean Process:
For user friendliness: "You might want to provide one conformation, a number of conformations equal to the ensemble size, or any arbitrary number of conformations. You might want to explicitly map conformations to ensemble members, or you might want to randomly assign them." We provide:
Developer Notes:
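The assignment options described above (one conformation for everyone, one conformation per ensemble member, or an arbitrary pool with either explicit or random mapping) could be sketched as a small helper. This is an illustration only; `assign_conformations` and its signature are hypothetical, not part of run_brer.

```python
import random
import typing


def assign_conformations(conformations: typing.Sequence[str],
                         ensemble_size: int,
                         explicit_map: typing.Optional[typing.Sequence[int]] = None,
                         seed: typing.Optional[int] = None) -> typing.List[str]:
    """Assign one starting conformation to each ensemble member (hypothetical sketch).

    - explicit_map given: member i gets conformations[explicit_map[i]]
    - one conformation: every member gets it
    - pool size == ensemble size: one-to-one mapping
    - otherwise: random draws (with replacement) from the pool
    """
    if explicit_map is not None:
        return [conformations[i] for i in explicit_map]
    if len(conformations) == 1:
        return [conformations[0]] * ensemble_size
    if len(conformations) == ensemble_size:
        return list(conformations)
    rng = random.Random(seed)
    return [rng.choice(conformations) for _ in range(ensemble_size)]
```

A fixed seed keeps the random assignment reproducible across restarts, which matters if the mapping is recorded in persistent workflow state.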
-
Discussion Notes: Priorities:
How to:
Minimum requirements:
To do:
-
Note that there seems to be some (possibly incomplete) prior work on this. I just noticed
-
Looks to me that
Not sure if these classes are aiming to handle what we're interested in re: multiple initial conformations; I think we should still make a
Although perhaps we can alter those classes enough to impact any initial bootstrapping...
-
I think the idea was that these would be the tools for managing the data associated with multiple ensemble members in an ensemble-aware fashion. We should try to determine how complete the implementation is, and see whether there are any design lessons to be learned from what is there. Beyond that, there's no priority on re-using or extending dead code, but we should try to tidy up a bit.
It appears that was at least the original intention. I'll look more closely in the next few days.
Certainly. You on it? ;-)
I don't currently have an opinion on what would be easier or cleaner. I.e., no preference on recycling bin or compost pile.
-
Notes from initial pull request:
-
Discussion Notes: utilities.py
-
Tasks
A: Update RunData (#23): a new field
   We need to update the various member functions of RunData, and its documentation, to reflect this.
B: Establish state at the level of
C: Write some utility functions (#25) to collect the paths of prev_state files for completed iterations.
D: Decide whether to randomly select an input from the ensemble or to use user-provided bootstrapping information, and update RunConfig.run(). Requires all of the above.
E: (Optional) Provide utility functions for users to provide bootstrapping inputs more flexibly than an array of one TPR file per member.

Partial solution for A:

import typing

class SimulationInput(typing.TypedDict):
    # The original sketch annotated this as typing.ClassVar[int], but
    # ClassVar is not valid inside a TypedDict; the schema version is
    # simply a required key that is always written as 2.
    schema_version: int
    tpr_file: str
    checkpoint: typing.Optional[str]

def simulation_input(tpr_file: str, checkpoint: typing.Optional[str] = None) -> SimulationInput:
    return {'schema_version': 2, 'tpr_file': tpr_file, 'checkpoint': checkpoint}

Alternative:

import dataclasses
import typing

@dataclasses.dataclass
class SimulationInput:
    tpr_file: str
    checkpoint: typing.Optional[str] = None
    # Fixed at 2 and excluded from __init__.
    schema_version: int = dataclasses.field(default=2, init=False)

# ...and use dataclasses.asdict() in RunData.as_dictionary()
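Task C above could look something like the following sketch. The function name and the directory layout (`mem_<i>/<iteration>/state.cpt`) are assumptions for illustration; run_brer's actual layout may differ.

```python
from pathlib import Path
import typing


def collect_prev_state_files(work_dir: str,
                             ensemble_size: int,
                             iteration: int) -> typing.List[typing.Optional[Path]]:
    """Hypothetical task C helper: collect per-member state files from the
    previous iteration, returning None for members with no file yet.

    Assumes (for illustration) a layout of <work_dir>/mem_<i>/<iteration>/state.cpt.
    """
    paths: typing.List[typing.Optional[Path]] = []
    for member in range(ensemble_size):
        candidate = Path(work_dir) / f'mem_{member}' / str(iteration - 1) / 'state.cpt'
        paths.append(candidate if candidate.is_file() else None)
    return paths
```

Returning None for missing members (rather than raising) lets the caller decide whether an incomplete previous iteration is an error or just means falling back to the original bootstrapping inputs.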
-
Update: we will have one simulation input record for each phase (as available) in the state file for a given iteration + ensemble member.
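A record shaped like the following would match that description. The key names below are illustrative, not the actual run_brer state-file schema; the point is one simulation-input record per phase, added as each phase becomes available, keyed by iteration and ensemble member.

```python
# Hypothetical state-file entry (names are illustrative, not run_brer's schema):
# one simulation-input record per completed/available phase.
state_entry = {
    'iteration': 3,
    'member': 0,
    'phases': {
        'training': {'schema_version': 2, 'tpr_file': 'topol.tpr', 'checkpoint': None},
        'convergence': {'schema_version': 2, 'tpr_file': 'topol.tpr', 'checkpoint': 'training.cpt'},
        # a 'production' record would be added once that phase has run
    },
}
```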
-
I'm having a hard time following this issue in its current form. I'm going to move it to a Discussion, and extract smaller issues from there.
-
We have been using a single starting conformation for entire ensembles of BRER sampling, and relying on the convergence phase to get us close to the target conformation for each production run.
However, the published method describes sampling starting conformations from a pool of conformations. We should clarify the degree to which this is or isn't supported in the current code, and consider more complete support before a 2.0 release.
Related to kassonlab/run_brer#18: We should consider that the pool of conformations from which to sample is not necessarily coupled to the array of simulations for any one job submission. We should consider how, if at all, run_brer should track/manage persistent workflow data over multiple jobs.
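One way to decouple the conformation pool from any single job's simulation array is to persist the pool record in its own file that outlives each job submission. A minimal sketch, assuming a JSON file and hypothetical function names (none of this is existing run_brer API):

```python
import json
from pathlib import Path


def load_pool(pool_file: str) -> dict:
    """Load a persistent conformation-pool record, independent of job size.

    Hypothetical sketch: the file format and keys are assumptions.
    """
    path = Path(pool_file)
    if path.is_file():
        return json.loads(path.read_text())
    # Fresh pool: no conformations registered, nothing drawn yet.
    return {'conformations': [], 'draws': 0}


def save_pool(pool_file: str, pool: dict) -> None:
    """Write the pool record back so later job submissions see prior draws."""
    Path(pool_file).write_text(json.dumps(pool))
```

With something like this, a job of any ensemble size draws from (and updates) the same pool file, so pool membership and draw history persist across submissions.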