local processing: support UDF's #158

Open
jdries opened this issue Sep 4, 2023 · 6 comments · May be fixed by #307

@jdries

jdries commented Sep 4, 2023

Add UDF support, with the main goal of making it work for 'local' processing in the Python client.
The Python client already has the basics in place to run a UDF on a chunk of data:
https://github.com/Open-EO/openeo-python-client/blob/master/openeo/udf/run_code.py#L143

But it needs to be connected to the local processing implementation, which should be fairly similar to running one of the predefined functions.
It remains to be seen which parent processes we want to support: apply_neighborhood is certainly very popular, but apply_dimension is also relevant.

The motivation is that UDFs are used in quite a few user workflows, so while users keep asking for local debugging, we can only point them to this feature once it supports UDFs.
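
For illustration, here is a minimal sketch (the function name local_run_udf is purely hypothetical) of how the existing client-side helpers could be reused to run a UDF on an xarray chunk:

import xarray as xr
from openeo.udf import UdfData, XarrayDataCube
from openeo.udf.run_code import run_udf_code

def local_run_udf(chunk: xr.DataArray, udf_code: str) -> xr.DataArray:
    # Wrap the chunk so the UDF sees the same objects as on a backend,
    # run the user code, and unwrap the result again.
    udf_data = UdfData(datacube_list=[XarrayDataCube(chunk)])
    result = run_udf_code(udf_code, udf_data)
    return result.get_datacube_list()[0].get_array()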

@VincentVerelst
Contributor

A first implementation of the run_udf process can be found here: https://github.com/VincentVerelst/openeo-processes-dask/tree/local-udf
Using apply as the parent process, UDFs that operate on a numpy array work. UDFs that manipulate a whole xarray object do not, because that is not supported by apply_ufunc, which is used in the apply process.
map_blocks (https://docs.xarray.dev/en/stable/generated/xarray.map_blocks.html#xarray.map_blocks) might be a solution for that.
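
A rough sketch of how that could look (only a sketch, assuming the UDF keeps shape and dtype unchanged): with map_blocks, each dask block arrives as a labelled DataArray, so the UDF receives a whole xarray object per chunk instead of a bare numpy array.

import xarray as xr
from openeo.udf import UdfData, XarrayDataCube
from openeo.udf.run_code import run_udf_code

def _run_udf_on_block(block: xr.DataArray, udf_code: str) -> xr.DataArray:
    # The block is a labelled DataArray, so the UDF can use coordinates
    # and dimension names, unlike with apply_ufunc.
    result = run_udf_code(udf_code, UdfData(datacube_list=[XarrayDataCube(block)]))
    return result.get_datacube_list()[0].get_array()

def apply_udf(data: xr.DataArray, udf_code: str) -> xr.DataArray:
    # template=data assumes the UDF preserves shape and dtype; a
    # dimension-changing UDF would need an explicit template instead.
    return xr.map_blocks(_run_udf_on_block, data, kwargs={"udf_code": udf_code}, template=data)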

@clausmichele
Member

Hi @VincentVerelst, can you also provide example code with a sample UDF showing how to call this?

@clausmichele
Member

Here is some sample code to use it:

import logging
logging.basicConfig(level=logging.INFO)

import openeo
from openeo.local import LocalConnection
local_conn = LocalConnection("./")

url = "https://earth-search.aws.element84.com/v1/collections/sentinel-2-l2a"
spatial_extent = {"east": 11.40, "north": 46.52, "south": 46.46, "west": 11.25}
temporal_extent = ["2022-06-01", "2022-06-10"]
bands = ["red", "nir"]
properties = {"eo:cloud_cover": dict(lt=80)}
s2_datacube = local_conn.load_stac(
    url=url,
    spatial_extent=spatial_extent,
    temporal_extent=temporal_extent,
    bands=bands,
    properties=properties,
)

b04 = s2_datacube.band("red")
b08 = s2_datacube.band("nir")
ndvi = (b08 - b04) / (b08 + b04)
ndvi_median = ndvi.reduce_dimension(dimension="time", reducer="median")

# Build a UDF object from an inline string with Python source code.
udf = openeo.UDF("""
from openeo.udf import XarrayDataCube

def apply_datacube(cube: XarrayDataCube, context: dict) -> XarrayDataCube:
    array = cube.get_array()
    print(array.shape)
    array.values = 0.0001 * array.values
    return cube
""")

# Or load the UDF code from a separate file.
# udf = openeo.UDF.from_file("udf-code.py")

# Apply the UDF to a cube.
rescaled_cube = ndvi_median.apply(process=udf)

print(rescaled_cube.execute())

@clausmichele
Member

@VincentVerelst I noticed that you're basically calling the implementation available in the openeo-python-client.

A common misunderstanding occurs when someone debugs locally and then gets a different result in the cloud. This can be caused by different chunk sizes being used. Currently, if I check the shape of the array that the UDF code is manipulating, it loads everything without respecting the chunking. This needs to be addressed, so that chunks are used if present.
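
To make the difference concrete, here is a small self-contained illustration (pure xarray/dask, not tied to the openEO implementation): the shape seen by the wrapped function depends on whether it is applied to the whole array or per chunk.

import dask.array as da
import xarray as xr

data = xr.DataArray(
    da.zeros((4, 256, 256), chunks=(1, 128, 128)),
    dims=("time", "y", "x"),
)

def show_shape(arr):
    print(arr.shape)  # (4, 256, 256) when given the whole array,
    return arr        # (1, 128, 128) when applied per chunk

# Whole array at once, roughly what the current local implementation does:
show_shape(data.values)

# Per chunk, which is closer to what cloud backends typically do:
xr.apply_ufunc(show_shape, data, dask="parallelized", output_dtypes=[data.dtype]).compute()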

@jdries jdries linked a pull request Jan 6, 2025 that will close this issue
@jdries
Author

jdries commented Jan 6, 2025

Reviving this one, as we need it for the openEO Platform project.
@clausmichele you are right, the chunking is crucial. If we take the example code above, then I would expect the 'apply' process to do some sort of chunking, more specifically this apply_ufunc:
https://github.com/Open-EO/openeo-processes-dask/blob/8f6e05972016d258a13a8cb20200f619bf0d1fd6/openeo_processes_dask/process_implementations/cubes/apply.py#L29C14-L29C28

The only potential problem is that different implementations are allowed some freedom in how they chunk the data. For local debugging, it could be nice to have a way to enable different chunking modes, so users can check whether their UDF works in all cases. I would, however, not address that in this initial implementation.
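
If we ever do add that, one possible shape for it (purely hypothetical presets; real backends may chunk differently) would be to rechunk the dask-backed cube before the UDF runs and re-execute with each mode:

import xarray as xr

def rechunk_for_debugging(cube: xr.DataArray, mode: str = "spatial") -> xr.DataArray:
    # Hypothetical chunking presets to exercise the UDF with different
    # chunk shapes; -1 means the full extent of that dimension.
    presets = {
        "spatial": {"time": -1, "y": 256, "x": 256},   # full time series per spatial tile
        "temporal": {"time": 1, "y": -1, "x": -1},     # one timestep at a time
    }
    return cube.chunk(presets[mode])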

@clausmichele
Member

@jdries I agree that a fixed chunk size would be enough for the first implementation. What's the default at VITO?

On our side, @jzvolensky is working on it.
