Discussion - Estimating a ProcessGraph #286

Open

vermesr opened this issue Oct 22, 2024 · Discussed in #285 · 0 comments

vermesr commented Oct 22, 2024

Hello, I originally opened this as a discussion; I am not sure whether it should have been opened as an issue instead.

Discussed in #285

Originally posted by vermesr October 21, 2024
Hello,

I am from Thales and I work in collaboration with the French Space Agency (CNES) on an openEO backend.

One of our concerns is to estimate the resources required to compute a ProcessGraph. (We are using openeo-processes-dask on top of an HPC cluster with a SLURM resource manager.)

There are two parts to achieving this:

  • Estimate the resources required by each individual process, based on the size of the data it works on
  • Estimate a whole process graph from those per-process estimations

Processes Estimation

The goal is to execute each openEO process from openeo-processes-dask individually, and to time it for every combination of:

  • Dask SLURMCluster configuration (mostly the number of workers)
  • Size of the data to process

Current method

We implement a wrapper function for each process to call. For example, our wrapper for the 'round' process looks like this:

    def _round_loc(self):
        round_process = process_registry_cnes.get('round').implementation

        def _wrapper(x=None, p=3, positional_parameters=None, named_parameters=None):
            return round_process(x=x, p=p)

        process_apply = process_registry_cnes.get('apply').implementation
        round_process_wrapped = process_apply(data=self._get_data(), process=_wrapper)

        return round_process_wrapped

In the estimation run, the datacubes required by the process are persisted in the Dask workers' memory before we start the process execution and monitor it (currently we mostly measure wall-clock time).
The execution is done like this:

    import distributed

    result = process()
    persisted_result = result.persist(pure=True)
    futures_res_wait = distributed.wait(persisted_result)
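
For concreteness, a minimal sketch of the timing loop we could build around this pattern; the SLURMCluster options are omitted, and the make_process factory is a hypothetical placeholder, not our actual code:

    import time

    import distributed
    from dask_jobqueue import SLURMCluster

    def benchmark(make_process, data_sizes, n_workers):
        # Site-specific SLURMCluster options (queue, memory, ...) omitted.
        cluster = SLURMCluster()
        cluster.scale(n_workers)
        client = distributed.Client(cluster)

        records = []
        for size in data_sizes:
            # make_process is a hypothetical factory returning a zero-argument
            # callable such as _round_loc, with an input datacube of the given
            # size already persisted in worker memory.
            process = make_process(size)
            start = time.perf_counter()
            persisted_result = process().persist(pure=True)
            distributed.wait(persisted_result)
            records.append({"workers": n_workers,
                            "data_size": size,
                            "duration_s": time.perf_counter() - start})

        client.close()
        cluster.close()
        return records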

From these estimations, we build a JSON file containing, for each process, several resource profiles indexed by the size of the data to process.
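
For illustration, the profiles structure could look like the following (all field names and values here are hypothetical placeholders, not our actual schema):

    # Hypothetical shape of the profiles file, shown as the Python structure
    # we would json.dump(); all names and numbers are placeholders.
    profiles = {
        "round": [
            {"workers": 4, "data_size_mb": 512, "duration_s": 1.8, "memory_mb": 2048},
            {"workers": 8, "data_size_mb": 512, "duration_s": 1.1, "memory_mb": 2048},
        ],
    }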

Automatic method?

I was wondering if we could have an automatic implementation that looks into the process spec and determines whether the process can be called directly or needs to be wrapped into the 'apply' process, for example, and that also determines the input arguments to pass.

My feeling is that this is difficult for some processes. For example, the process 'round' takes a number and an integer: we would like to pass our RasterCube as the number and a number of decimals as the integer (and wrap the call into the 'apply' process), but it looks hard to determine automatic values for the arguments without 'automatically' knowing what they really stand for.
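
As a starting point, one could imagine inspecting the parameter schemas in the standard openEO process spec, along these lines (this assumes the registry entries expose the spec JSON, e.g. via a spec attribute, which would need checking):

    def needs_apply_wrapper(spec: dict) -> bool:
        # Guess from the openEO process spec whether the process works on
        # scalars and hence must be wrapped into 'apply' to run on a cube.
        # Assumes the standard spec layout: a 'parameters' list whose entries
        # carry a 'schema' with 'type' / 'subtype' fields.
        schemas = spec["parameters"][0]["schema"]
        if isinstance(schemas, dict):
            schemas = [schemas]
        # If the first parameter already accepts a cube, call it directly.
        return not any(s.get("subtype") in ("raster-cube", "datacube")
                       for s in schemas)

Even then, this only answers where to plug the cube in; picking meaningful values for the remaining scalar arguments (such as p for 'round') still seems to require per-process knowledge.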

ProcessGraph Estimation

We want to integrate a ProcessGraph estimation into our data-processing service (for the route /jobs/<job_id>/estimate and/or before starting a batch job).
Here we have taken OpenEOProcessGraph._map_node_to_callable from openeo-pg-parser-networkx to extract data from the processes that run inside the inner "node_callable" function.
The dask graph is built by calling pg_callable(), but we skip the result-saving part.
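
For reference, the lazy-build step is essentially something like this (the file name is a placeholder; from_file and to_callable are the documented openeo-pg-parser-networkx entry points):

    from openeo_pg_parser_networkx import OpenEOProcessGraph

    parsed = OpenEOProcessGraph.from_file("process_graph.json")  # placeholder path
    pg_callable = parsed.to_callable(process_registry=process_registry_cnes)
    lazy_result = pg_callable()  # builds the dask graph; nothing is computed yet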

This assumes everything is run lazily.
For each lazily run process, we record the size of the arguments that are RasterCubes.

Based on the processes that need to be executed and the size of their data arguments, we estimate the duration and the required resource profile for the whole ProcessGraph.
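
Concretely, the aggregation we have in mind is something like this sketch (executed_nodes and nearest_profile are hypothetical names; profiles is the structure shown earlier):

    def nearest_profile(candidates, data_size_mb):
        # Pick the measured profile whose data size is closest; a real
        # implementation might interpolate instead.
        return min(candidates, key=lambda p: abs(p["data_size_mb"] - data_size_mb))

    def estimate_graph(executed_nodes, profiles):
        # executed_nodes: list of (process_id, data_size_mb) pairs recorded
        # while the dask graph was being built lazily.
        total_duration = 0.0
        peak_memory = 0.0
        for process_id, data_size_mb in executed_nodes:
            profile = nearest_profile(profiles[process_id], data_size_mb)
            total_duration += profile["duration_s"]
            peak_memory = max(peak_memory, profile["memory_mb"])
        return {"duration_s": total_duration, "memory_mb": peak_memory}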

We are wondering whether anyone else is working on such an estimation process.

We would be glad to exchange on this topic!
