creation of nodes with snakemake functionality #188
I also made notes from this meeting. I only got to the pseudocode parts this morning, was actually not super happy with them, and wrote some more pseudocode right after. Notes: On 6 April, Joerg and I had a zoom call to discuss recent thoughts on the next iteration of pyiron_base/all this graph stuff. First on the docket was to quickly run through my (barely) pseudo-code example for a cyclic while loop over here. Next up was Joerg's node/task sketch here. First, some nomenclature: we broke the idea of a "node" down into an "atomic node" (in the Greek sense of the word, not the kind of science we're doing) and a "macro node".
Now we come back to Joerg's sketch and the idea of tasks. Pseudocode modifications to the node class's run routine to accommodate executors might look something like this:

class Node:
...
def run(self):
if self.executor is None:
function_output = self.node_function(**self.inputs.to_value_dict())
else:
if self.executor.is_queued(self):
function_output = Executing()
# Executing is some custom class the node will know how to process
# s.t. outputs don't get updated when we process the function_output
elif self.executor.is_finished(self):
function_output = self.executor.collect(self) # Executor has state! :(
# This might return the expected tuple of output, or might return some
# Failed() class instance or otherwise similar to Executing()
else:
self.executor.push(self)
# The executor is then responsible for executing
# node.node_function(**node.inputs.to_value_dict())
function_output = Executing()
self.process_output(function_output)
# Catch failure errors, update output channels with successful data, etc.

In this way, the node processes its input, produces one or more tasks, talks directly with an executor to submit them, waits for them to be executed, and retrieves the result; it then processes this into node output, which propagates through the node graph to trigger new node work cycles. Previously, we had talked about such executors/task queues belonging to the workflow itself, with all nodes submitting their tasks directly to their workflow, or to tasks of their workflow. Pseudocode using executors might look something like this:

from pyiron_contrib import Workflow
from time import sleep
import numpy as np
wf = Workflow("murn_pseudocode")
# First, let's define nodes that will exist outside all loops
wf.lattice_constants = wf.create.node.linspace(
start=3.8,
end=4.2,
n=10
)
wf.lammps_executor = wf.create.node.lammps_executor(
n_cores=10,
n_instances=5
)
# This node, unfortunately, has state!
# On first run (on instantiation?), it creates an executor and returns
# it as output. Subsequent runs do nothing (unless cores or instances change?)
# The executor object in turn holds n_instances copies of the Lammps interpreter
# and distributes tasks to them under the assumption n_cores are available.
wf.potential = wf.create.node.atomic_potential(
type="EAM",
species=("Al",),
choice=0,
)
wf.lammps_engine = wf.create.node.lammps_engine(
potential=wf.potential.outputs.potential,
executor=wf.lammps_executor.outputs.executor,
# If the executor is None, the engine should create
# its own single-use instance of the lammps interpreter
# on each run that just uses the main python process
# for computation -- i.e. it behaves the same as a pure-
# python node that hogs the python process when executed
)
# Next, we want an outer while loop that runs until
# all the jobs in our for-loop produce real data
# Decorator is class method for single import
@Workflow.node(
"required_or_none", "data",
input_that_must_update=("step",) # Won't re-execute until this has been updated
)
def while_executing(required, step, data, sleep_time=0):
"""
`step` is not used, but the node won't execute the function unless this channel
has received an update since the last execution.
"""
if any(isinstance(out, Workflow.Executing) for out in data):
# Executing is class attribute for single import
# Loop again
sleep(sleep_time)
return required, None
else:
# Stop and return the completed data
return None, data
wf.while_executing = while_executing(sleep_time=5)
# And an inner loop that will iterate over our lattice constants
wf.for_loop = wf.create.node.for_loop()
# inputs: iterable, i, step (must update), reset
# outputs: item, i (gets incremented and connects to input), done
# A connection is made between the input i and output i
# so that we have topologically-defined pseudo-state!
# The step channel must have received an update before
# a new execution will trigger
# Next, let's define the nodes needed inside the inner loop
wf.structure = wf.create.node.bulk_structure(element="Al")
wf.calc_static = wf.create.node.calc_static(
engine=wf.lammps_engine.outputs.engine,
)
wf.energies = wf.create.node.accumulator()
# inputs: item, items, reset
# outputs: items
# Starts as an empty list, gets pseudo-state
# by looping the output items back to the input items
# Appends item to items at each call, resetting to an
# empty list if reset==True
# Now let's wire up our loops!
# The flow of data inside the for-loop is very easy
wf.calc_static.inputs.structure = wf.structure.outputs.structure
wf.energies.inputs.item = wf.calc_static.outputs.energy_pot
# The for-loop should pass in lattice constants and iterate
# each time we append to the accumulator
wf.structure.inputs.a = wf.for_loop.outputs.item
wf.for_loop.inputs.step = wf.energies.outputs.items
# The while loop will kill the for-loop by destroying its
# iterable input
wf.while_executing.inputs.required = wf.lattice_constants.outputs.array
wf.for_loop.inputs.iterable = wf.while_executing.outputs.required
# Or reset the loop
wf.for_loop.inputs.reset = wf.while_executing.outputs.required
# The while loop should step with each completion of the for-loop,
wf.while_executing.inputs.step = wf.for_loop.outputs.done
# and once when we start the graph
wf.while_executing.inputs.step = wf.lattice_constants.outputs.array
# And the while loop needs to see the accumulated energies,
# to check whether they're all finished data or not
wf.while_executing.inputs.data = wf.energies.outputs.items
# Finally, we'll use the collected energies and lattice constants
# to calculate something
@Workflow.node("bulk_modulus")
def murnaghan(lattice_constants, energies: list | np.ndarray):
# Do the math
return 42
wf.bulk_modulus = murnaghan(
lattice_constants=wf.lattice_constants.outputs.array,
energies=wf.while_executing.outputs.data,
)
wf.run()
print(wf.bulk_modulus.outputs.bulk_modulus)
>>> 42

The idea is that we have an inner for-loop which generates Lammps static calculations and ships them off to the Lammps executor queue.

Finally, Joerg shared an idea he had from listening to some of the engineers' presentations about providing a pyiron wrapper for snakemake.

Technical note: the English notes were completed immediately following the meeting on 6 April, but I didn't have time to sit down and write the Murnaghan pseudo-code and accompanying paragraphs (i.e. the header stating what the pseudocode is for and the paragraph immediately after the code) until 11 April.
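To make the sentinel idea in the run() pseudocode above concrete, here is a self-contained toy. Everything in it is hypothetical scaffolding invented for illustration (the `Executing` sentinel, `FakeExecutor`, and this minimal `Node` are not pyiron API); it just shows how returning a sentinel instead of real data lets `process_output` skip updating outputs while the executor still holds the task:

```python
class Executing:
    """Sentinel: the executor has the task but no result yet."""


class Node:
    def __init__(self, node_function, executor=None, **inputs):
        self.node_function = node_function
        self.executor = executor
        self.inputs = inputs
        self.output = None

    def run(self):
        if self.executor is None:
            result = self.node_function(**self.inputs)
        elif self.executor.is_queued(self):
            result = Executing()
        elif self.executor.is_finished(self):
            result = self.executor.collect(self)
        else:
            self.executor.push(self)
            result = Executing()
        self.process_output(result)

    def process_output(self, result):
        # Only update the output channel with real data
        if not isinstance(result, Executing):
            self.output = result


class FakeExecutor:
    """Toy executor: computes eagerly on push, but reports 'queued'
    for one polling cycle to simulate asynchronous latency."""
    def __init__(self):
        self.results = {}
        self.polls = {}

    def push(self, node):
        self.results[id(node)] = node.node_function(**node.inputs)
        self.polls[id(node)] = 0

    def is_queued(self, node):
        if id(node) not in self.results:
            return False
        self.polls[id(node)] += 1
        return self.polls[id(node)] <= 1

    def is_finished(self, node):
        return id(node) in self.results and self.polls[id(node)] > 1

    def collect(self, node):
        return self.results.pop(id(node))


node = Node(lambda x: x + 1, executor=FakeExecutor(), x=41)
node.run()  # pushed to the executor; output stays None
node.run()  # still "queued"; output stays None
node.run()  # finished; result is collected and processed
```

Note the executor's statefulness (the `results`/`polls` dicts) is exactly the wart flagged in the pseudocode comment above.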
Better (but still imperfect) pseudocode for loops, relying on nodes having flow-control events:

from pyiron_contrib import Workflow
import numpy as np
wf = Workflow("murn_pseudocode")
wf.lattice_constants = wf.create.node.linspace(
start=3.8,
end=4.2,
n=10
)
wf.lammps_executor = wf.create.node.lammps_executor(
n_cores=10,
n_instances=5
)
# This node, unfortunately, has state!
# On first run (on instantiation?), it creates an executor and returns
# it as output. Subsequent runs do nothing (unless cores or instances change?)
# The executor object in turn holds n_instances copies of the Lammps interpreter
# and distributes tasks to them under the assumption n_cores are available.
wf.potential = wf.create.node.atomic_potential(
type="EAM",
species=("Al",),
choice=0,
)
wf.lammps_engine = wf.create.node.lammps_engine(
potential=wf.potential.outputs.potential,
executor=wf.lammps_executor.outputs.executor,
# If the executor is None, the engine should create
# its own single-use instance of the lammps interpreter
# on each run that just uses the main python process
# for computation -- i.e. it behaves the same as a pure-
# python node that hogs the python process when executed
)
inner = Workflow("inner_loop")
inner.structure = inner.create.node.bulk_structure(
element="Al"
)
inner.calc_static = inner.create.node.calc_static(
structure=inner.structure.outputs.structure
)
lattice_energy = Workflow.meta.for_loop(
body=inner,
step=inner.calc_static.outputs.control.ran,
iterable_inputs={
"lattice_constant": inner.structure.inputs.lattice_constant
},
iterable_outputs={
"energy_pot": inner.calc_static.outputs.energy_pot,
}
)
# a meta node takes a workflow instance and returns a node class
wf.for_loop = lattice_energy()
# This is a macro node. When called it will make sure that its children
# match with its input, creating or destroying nodes as needed
# In this case, it loops over its iterable input to create a body node for each
# element in the iterable inputs (can there be more than one??) and updates the
# corresponding input channel in each to trigger a run of the subgraph
# then collects all the subgraph outputs into iterable_outputs
# What happens if the number of iterables changes??
# Death? Error? Maybe it's OK on the first call and the first call only, since it's a macro
# Can we loop over multiple iterable inputs as long as they're the same length?
# Yes, zip and execute
# What about when they're different lengths?
# No, nest multiple meta nodes together
# Can we collect an arbitrary number of iterable outputs?
# Yes
# How to change/access non-iterable IO?
# Some sort of broadcasting magic? Then do we even need iterable_outputs?
wf.for_loop.iterable_inputs.lattice_constant = wf.lattice_constants.outputs.array
wf.all_finished = wf.create.node.none_running(
data=wf.for_loop.iterable_outputs.energy_pot
)
wf.while_loop = wf.create.node.while_loop(
condition=wf.all_finished.outputs.truth,
step=wf.for_loop.outputs.control.done,
)
wf.for_loop.inputs.control.reset = wf.while_loop.outputs.control.if_false
@Workflow.node("bulk_modulus")
def murnaghan(lattice_constants, energies: list | np.ndarray):
# Do the math
return 42
wf.bulk_modulus = murnaghan(
lattice_constants=wf.lattice_constants.outputs.array,
energies=wf.for_loop.iterable_outputs.energy_pot,
update_automatically=False
)
wf.bulk_modulus.inputs.control.run = wf.while_loop.outputs.control.if_true
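As a sketch of the "zip and execute" answer in the comments above, here is a hypothetical stand-alone meta-"node" (none of these names exist in pyiron_contrib; the body function stands in for the structure -> calc_static subgraph) that zips equal-length iterable inputs over a body and collects named outputs:

```python
def for_loop(body, iterable_inputs, iterable_outputs):
    """Hypothetical meta-node factory: returns a 'node' that zips
    equal-length iterable inputs, runs the body once per element,
    and collects the named outputs into lists."""
    def node(**kwargs):
        iters = {k: list(kwargs[k]) for k in iterable_inputs}
        if len({len(v) for v in iters.values()}) > 1:
            raise ValueError("iterable inputs must all have the same length")
        collected = {k: [] for k in iterable_outputs}
        for values in zip(*iters.values()):
            result = body(**dict(zip(iters.keys(), values)))
            for k in iterable_outputs:
                collected[k].append(result[k])
        return collected
    return node


# Toy body standing in for the structure -> calc_static subgraph
def body(lattice_constant):
    return {"energy_pot": -2.0 * lattice_constant}


loop = for_loop(body, iterable_inputs=["lattice_constant"],
                iterable_outputs=["energy_pot"])
out = loop(lattice_constant=[3.8, 4.0, 4.2])
```

Raising on mismatched lengths corresponds to the "different lengths? No, nest multiple meta nodes" answer above; a real macro node would additionally rebuild its child subgraphs instead of calling a plain function.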
@liamhuber, thanks for the great summary and all the suggestions and pseudocode. I like most of your pseudocode. I am, however, afraid that the loop part (with the inner workflow) will be hard to understand; it is rather far away from any standard python notation. For acceptance, however, this is a super important criterion. Below are some (very preliminary) thoughts on combining a python-like syntax with a workflow notation:
The main idea is to create an iterator object that replaces the normal integer index and provides all the functionality to run and store the inner workflow in the loop. This object not only allows calling break or continue, but can also be attached to node objects to log and store all commands performed on that node. This logging ability is also needed, e.g., for the structure object, to record when vacancies, substitutions, etc. are applied. Again, these are very preliminary ideas, but it would be good to discuss them.
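A very rough, hypothetical sketch of this iterator-object idea (the `LoopIterator` class and its attributes are invented for illustration): the object drives the inner workflow (here just a callable), logs every step for later storage, and still supports plain python break/continue because it is an ordinary iterator:

```python
class LoopIterator:
    """Hypothetical: an iterator that replaces the bare integer index.
    It runs an inner 'workflow' (here just a callable) once per element
    and logs every step, so the loop history could later be stored
    alongside the graph."""
    def __init__(self, iterable, body):
        self._items = list(iterable)
        self._body = body
        self.log = []       # (index, input, output) per executed step
        self.results = []

    def __iter__(self):
        for i, item in enumerate(self._items):
            out = self._body(item)
            self.log.append((i, item, out))
            self.results.append(out)
            yield i, item, out


# Usage: standard python loop syntax, but every step is recorded
it = LoopIterator([3.8, 4.0, 4.2], body=lambda a: a ** 3)
for i, a, volume in it:
    if volume > 70:
        break  # plain python flow control still works
```

The open question, discussed below, is whether such an object can be made to construct serializable graph nodes under the hood rather than executing live.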
@JNmpi so the good news is that the sort of thing you propose is actually already completely doable; the bad news is that I have concerns with the paradigm (but concerns, not objections! And I strongly agree that similarity with python is critical for adoption). So, first, the good news: if you're just doing things live in the notebook, we can actually really easily dynamically create new nodes, and even dynamically create node connections! This comes very naturally from our syntactic sugar:

from pyiron_contrib.workflow.workflow import Workflow
from pyiron_contrib.workflow.node import node
@node("y")
def add_one(x):
return x + 1
wf = Workflow("my_loop")
wf.n0 = add_one(x=0)
i = 0
while wf[f"n{i}"].outputs.y.value < 5:
wf.add(add_one(x=wf[f"n{i}"].outputs.y, label=f"n{i+1}"))
i += 1
print(wf.nodes.keys())
>>> dict_keys(['n0', 'n1', 'n2', 'n3', 'n4'])
print(wf.n4.outputs.y.value)
>>> 5

Now, what worries me about these examples is that mine (exclusively) and yours (naively) work only when live in a notebook, i.e. they once again promote the jupyter notebook rather than the underlying graph.

I keep saying "naively" for yours, because I really like your idea of allowing the workflow to construct a custom iterator. In this paradigm, I can envision that your entire example is strictly syntactic sugar on top of building something like my node-based loop example a couple comments above -- I suspect this may just be what you're driving at already!! In this way, we could satisfy code-based users at the same time that we maintain a consistent, node-based paradigm -- which in turn keeps a consistent universe for graphical and text users, and makes sure we are working with serializable (and thus shareable) objects.

How do we achieve this technically? Honestly, I'm not sure. I took a peek at some of the generator/iterator docs today to refresh my memory, and they are relatively powerful and flexible objects, but this is asking an awful lot of them.

For me, falling back to live notebook objects is a show stopper and must be avoided. But, if we agree that this sort of loop syntax is actually just sugar on top of constructing some sort of computation-free, serializable graph objects under the hood, then I would propose to simply continue development using the more verbose but rigorous paradigm of explicit flow-control nodes, and then figure out how we can get an iterator object to map onto those.

More discussions in real-time also sounds good. This coming Monday (the 17th) I need to leave the pyiron meeting a little early, and won't be free again until ~10:00, and on Friday I am unavailable starting ~11:00. Otherwise I should be more or less free to schedule something 06:00-16:00 PST, so it depends on the workshop schedule for you and @jan-janssen.
I guess @pmrv and @samwaseda might also be interested, in which case a morning time slot is important.
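For completeness, the dynamic-growth pattern in the add_one example above can be demonstrated without pyiron_contrib at all. `SimpleNode` and `SimpleWorkflow` below are hypothetical stand-ins invented here, just to show that the mechanics are plain python:

```python
class SimpleNode:
    """Hypothetical stand-in for a node: one function, one output 'y'.
    If the input is another node, its output is pulled on demand."""
    def __init__(self, fn, x):
        self.fn, self.x = fn, x

    @property
    def y(self):
        x = self.x.y if isinstance(self.x, SimpleNode) else self.x
        return self.fn(x)


class SimpleWorkflow:
    """Hypothetical: a dict of labelled nodes, grown dynamically."""
    def __init__(self):
        self.nodes = {}

    def add(self, label, node):
        self.nodes[label] = node

    def __getitem__(self, label):
        return self.nodes[label]


add_one = lambda x: x + 1
wf = SimpleWorkflow()
wf.add("n0", SimpleNode(add_one, 0))
i = 0
while wf[f"n{i}"].y < 5:
    wf.add(f"n{i + 1}", SimpleNode(add_one, wf[f"n{i}"]))
    i += 1
```

The loop itself lives only in the notebook, though; the resulting `nodes` dict and its connections are the part that would need to be serializable.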
Summary
Provide a new node type that allows execution of snakemake rules. pyiron/ironflow would provide a thin interface to make snakemake rules appear as native pyiron objects.
Detailed Description
@liamhuber, this is a brief summary of the discussion we had at our last Zoom-meeting.
Looking at several workflow solutions in the Plattform MaterialDigital (PMD), a recurring request is that workflows with codes running in various conda environments have to be supported. A simple approach, which should be rather easy to implement, would be to use snakemake. The idea would be to have a special run command that provides all the parameters needed to set up the snakemake input file and to run snakemake within pyiron. The corresponding jobs would be regular pyiron jobs, but could use the functionality of snakemake to install, load, and run codes living in different conda environments. Rather than using a full snakemake workflow, the idea would be to run each snakemake rule as a separate job. This should be straightforward to implement, since a rule defines input and output parameters (similarly to pyiron) plus some extra information regarding the conda environment etc.
The (pseudo-) code could look like this:
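As a purely illustrative sketch (the `rule_to_snakefile` helper and all names in it are hypothetical, not an existing pyiron API), such a wrapper might render a single rule to a Snakefile, which a job could then execute via snakemake's real --use-conda and --cores CLI flags:

```python
def rule_to_snakefile(name, inputs, outputs, shell_cmd, conda_env=None):
    """Hypothetical helper: render a single snakemake rule, so a job
    could write it to a Snakefile and then call snakemake on the
    rule's output files."""
    lines = [f"rule {name}:"]
    lines.append("    input: " + ", ".join(repr(i) for i in inputs))
    lines.append("    output: " + ", ".join(repr(o) for o in outputs))
    if conda_env is not None:
        # snakemake's --use-conda flag activates this per-rule environment
        lines.append(f"    conda: {conda_env!r}")
    lines.append(f"    shell: {shell_cmd!r}")
    return "\n".join(lines)


snakefile = rule_to_snakefile(
    "relax",
    inputs=["structure.xyz"],
    outputs=["relaxed.xyz"],
    shell_cmd="lammps -in relax.in",
    conda_env="envs/lammps.yaml",
)
# A runner could then invoke, e.g.:
#   snakemake --snakefile Snakefile --use-conda --cores 1 relaxed.xyz
```

Because the rule declares its inputs and outputs explicitly, mapping them onto pyiron's input/output channels should be mechanical.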
Further Information, Files, and Links
Example workflows (snakemake example):