
Vectorize coordinates._construct_face_centroids() #1117

Open · wants to merge 10 commits into base: main

Conversation

erogluorhan
Member

@erogluorhan erogluorhan commented Dec 21, 2024

Closes #1116.

Overview

Optimizes `_construct_face_centroids()` in `coordinates.py` by replacing the for-loop with vectorization, using a masking method based on @hongyuchen1030's older suggestion.

Further optimization of the code is possible, but that can be handled in future PRs.
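
For illustration, here is a minimal, self-contained sketch of the masking idea (the toy inputs and the exact mask construction are only illustrative, not the code in `coordinates.py`):

```python
import numpy as np

# Toy inputs (illustrative only): two quad faces, no fill values.
node_x = np.array([0.0, 1.0, 1.0, 0.0, 2.0, 2.0])
face_nodes = np.array([[0, 1, 2, 3],
                       [1, 4, 5, 2]])
n_nodes_per_face = np.array([4, 4])

# mask[i, j] is True for the first n_nodes_per_face[i] columns of face i,
# so padded columns contribute zero to the sums.
mask = np.arange(face_nodes.shape[1]) < n_nodes_per_face[:, None]

centroid_x = np.sum(node_x[face_nodes] * mask, axis=1) / n_nodes_per_face
print(centroid_x)  # mean x-coordinate per face: [0.5 1.5]
```

Note that, as the review discussion below points out, fill values in the connectivity array break the fancy indexing in this sketch; the toy example simply avoids them.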

FYI @philipc2: The masking method this PR uses might be useful for cases we previously dealt with via partitioning, e.g. the nodal averaging problem we looked into last year.

FYI @rajeeja: This is where the Cartesian-based centroid calculations are being fixed. Welzl is a bit different, but you may still want to look into it.

Results

#1116 shows the profiling results for the existing code on a 4GB SCREAM dataset, where the execution time is around 5 minutes.

The optimized code runs in around 6 seconds; see below:

[Screenshot (2024-12-21): profiling of the optimized code, showing an execution time of around 6 seconds]

PR Checklist

General

  • An issue is created and linked
  • Add appropriate labels
  • Filled out Overview and Expected Usage (if applicable) sections

Testing

  • Tests cover all possible logical paths in your function

@rajeeja
Contributor

rajeeja commented Dec 21, 2024

Good fix, will test Welzl and see what fix is appropriate there.

@philipc2 philipc2 added the run-benchmark Run ASV benchmark workflow label Dec 21, 2024
@philipc2 philipc2 added this to the Scalability & Performance milestone Dec 21, 2024

github-actions bot commented Dec 21, 2024

ASV Benchmarking

Benchmark Comparison Results

Benchmarks that have improved:

| Change | Before [62ca314] | After [41b915a] | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
| - | 448M | 397M | 0.89 | face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/geoflow-small/grid.nc')) |
| - | 637M | 394M | 0.62 | face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/quad-hexagon/grid.nc')) |
| - | 1.76±0.07ms | 600±9μs | 0.34 | mpas_ocean.ConnectivityConstruction.time_n_nodes_per_face('120km') |
| - | 596±10μs | 482±10μs | 0.81 | mpas_ocean.ConnectivityConstruction.time_n_nodes_per_face('480km') |
| - | 6.12±0.1ms | 4.78±0.07ms | 0.78 | mpas_ocean.ConstructFaceLatLon.time_cartesian_averaging('120km') |
| - | 482M | 380M | 0.79 | mpas_ocean.Integrate.peakmem_integrate('480km') |

Benchmarks that have stayed the same:

| Change | Before [62ca314] | After [41b915a] | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
|  | 400M | 394M | 0.98 | face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/mpas/QU/oQU480.231010.nc')) |
|  | 432M | 424M | 0.98 | face_bounds.FaceBounds.peakmem_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/scrip/outCSne8/outCSne8.nc')) |
|  | 18.8±0.2ms | 18.8±0.2ms | 1 | face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/mpas/QU/oQU480.231010.nc')) |
|  | 7.69±0.08ms | 7.62±0.1ms | 0.99 | face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/scrip/outCSne8/outCSne8.nc')) |
|  | 44.2±0.5ms | 43.2±0.4ms | 0.98 | face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/geoflow-small/grid.nc')) |
|  | 3.94±0.05ms | 3.89±0.1ms | 0.99 | face_bounds.FaceBounds.time_face_bounds(PosixPath('/home/runner/work/uxarray/uxarray/test/meshfiles/ugrid/quad-hexagon/grid.nc')) |
|  | 2.98±0.02s | 3.04±0.03s | 1.02 | import.Imports.timeraw_import_uxarray |
|  | 677±10μs | 664±10μs | 0.98 | mpas_ocean.CheckNorm.time_check_norm('120km') |
|  | 446±3μs | 445±10μs | 1 | mpas_ocean.CheckNorm.time_check_norm('480km') |
|  | 665±7ms | 648±8ms | 0.98 | mpas_ocean.ConnectivityConstruction.time_face_face_connectivity('120km') |
|  | 42.6±0.4ms | 41.5±0.4ms | 0.98 | mpas_ocean.ConnectivityConstruction.time_face_face_connectivity('480km') |
|  | 3.62±0.05ms | 3.66±0.04ms | 1.01 | mpas_ocean.ConstructFaceLatLon.time_cartesian_averaging('480km') |
|  | 3.52±0.01s | 3.49±0.01s | 0.99 | mpas_ocean.ConstructFaceLatLon.time_welzl('120km') |
|  | 222±2ms | 227±2ms | 1.02 | mpas_ocean.ConstructFaceLatLon.time_welzl('480km') |
|  | 302±1ns | 299±2ns | 0.99 | mpas_ocean.ConstructTreeStructures.time_ball_tree('480km') |
|  | 810±2ns | 883±2ns | 1.09 | mpas_ocean.ConstructTreeStructures.time_kd_tree('120km') |
|  | 290±3ns | 289±2ns | 1 | mpas_ocean.ConstructTreeStructures.time_kd_tree('480km') |
|  | 441±5ms | 436±4ms | 0.99 | mpas_ocean.CrossSection.time_const_lat('120km', 1) |
|  | 222±3ms | 219±1ms | 0.99 | mpas_ocean.CrossSection.time_const_lat('120km', 2) |
|  | 114±0.5ms | 115±2ms | 1.01 | mpas_ocean.CrossSection.time_const_lat('120km', 4) |
|  | 361±5ms | 356±2ms | 0.99 | mpas_ocean.CrossSection.time_const_lat('480km', 1) |
|  | 181±2ms | 179±1ms | 0.99 | mpas_ocean.CrossSection.time_const_lat('480km', 2) |
|  | 93.0±0.7ms | 93.4±1ms | 1 | mpas_ocean.CrossSection.time_const_lat('480km', 4) |
|  | 122±1ms | 122±0.7ms | 1 | mpas_ocean.DualMesh.time_dual_mesh_construction('120km') |
|  | 8.80±0.07ms | 8.61±0.3ms | 0.98 | mpas_ocean.DualMesh.time_dual_mesh_construction('480km') |
|  | 1.09±0.01s | 1.07±0s | 0.99 | mpas_ocean.GeoDataFrame.time_to_geodataframe('120km', False) |
|  | 55.0±0.6ms | 53.4±0.6ms | 0.97 | mpas_ocean.GeoDataFrame.time_to_geodataframe('120km', True) |
|  | 86.5±0.5ms | 86.4±1ms | 1 | mpas_ocean.GeoDataFrame.time_to_geodataframe('480km', False) |
|  | 5.63±0.04ms | 5.38±0.06ms | 0.96 | mpas_ocean.GeoDataFrame.time_to_geodataframe('480km', True) |
|  | 339M | 339M | 1 | mpas_ocean.Gradient.peakmem_gradient('120km') |
|  | 316M | 316M | 1 | mpas_ocean.Gradient.peakmem_gradient('480km') |
|  | 2.75±0.01ms | 2.77±0.01ms | 1.01 | mpas_ocean.Gradient.time_gradient('120km') |
|  | 314±2μs | 315±4μs | 1 | mpas_ocean.Gradient.time_gradient('480km') |
|  | 244±3μs | 231±6μs | 0.95 | mpas_ocean.HoleEdgeIndices.time_construct_hole_edge_indices('120km') |
|  | 119±0.7μs | 118±2μs | 0.99 | mpas_ocean.HoleEdgeIndices.time_construct_hole_edge_indices('480km') |
|  | 402M | 396M | 0.99 | mpas_ocean.Integrate.peakmem_integrate('120km') |
|  | 173±4ms | 172±3ms | 0.99 | mpas_ocean.Integrate.time_integrate('120km') |
|  | 12.2±0.2ms | 11.7±0.1ms | 0.96 | mpas_ocean.Integrate.time_integrate('480km') |
|  | 347±2ms | 350±2ms | 1.01 | mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('120km', 'exclude') |
|  | 347±3ms | 348±2ms | 1 | mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('120km', 'include') |
|  | 346±3ms | 347±2ms | 1 | mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('120km', 'split') |
|  | 23.2±0.2ms | 23.2±0.3ms | 1 | mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('480km', 'exclude') |
|  | 23.1±0.1ms | 23.3±0.3ms | 1.01 | mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('480km', 'include') |
|  | 22.9±0.1ms | 23.0±0.1ms | 1 | mpas_ocean.MatplotlibConversion.time_dataarray_to_polycollection('480km', 'split') |
|  | 55.2±0.1ms | 55.6±0.3ms | 1.01 | mpas_ocean.RemapDownsample.time_inverse_distance_weighted_remapping |
|  | 45.3±0.1ms | 45.1±0.04ms | 1 | mpas_ocean.RemapDownsample.time_nearest_neighbor_remapping |
|  | 355±0.4ms | 356±2ms | 1 | mpas_ocean.RemapUpsample.time_inverse_distance_weighted_remapping |
|  | 261±0.9ms | 261±0.8ms | 1 | mpas_ocean.RemapUpsample.time_nearest_neighbor_remapping |
|  | 312M | 314M | 1.01 | quad_hexagon.QuadHexagon.peakmem_open_dataset |
|  | 311M | 312M | 1 | quad_hexagon.QuadHexagon.peakmem_open_grid |
|  | 6.32±0.07ms | 6.42±0.08ms | 1.02 | quad_hexagon.QuadHexagon.time_open_dataset |
|  | 5.44±0.08ms | 5.43±0.04ms | 1 | quad_hexagon.QuadHexagon.time_open_grid |

Benchmarks that have got worse:

| Change | Before [62ca314] | After [41b915a] | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
| + | 1.18±0μs | 1.30±0μs | 1.1 | mpas_ocean.ConstructTreeStructures.time_ball_tree('120km') |

@philipc2 philipc2 removed the run-benchmark Run ASV benchmark workflow label Dec 21, 2024
@philipc2
Member

Nice work @erogluorhan

It appears that a few ASV benchmarks are failing (which may or may not be due to the changes in this PR). I can look deeper into it on Monday.

@erogluorhan
Member Author

All sounds good, thanks to you both @rajeeja @philipc2!

@erogluorhan
Member Author

I can't see any failing benchmarks though! @philipc2

@philipc2
Member

> I can't see any failing benchmarks though! @philipc2

If you take a look at the GitHub Bot above, it shows which ones failed. If they failed before and after, it means something was wrong before this PR.

Looks like this PR didn't break anything; it must have been something with how the benchmark was written before.

@erogluorhan
Member Author

Ok, I see what you mean now, @philipc2 !

Also, regarding benchmarking, I'd expect `mpas_ocean.ConstructFaceLatLon.time_cartesian_averaging('120km')` to show the difference this PR's optimization makes, but it does not. I think the dataset is pretty small for that, which makes me think big-data benchmarking would be important in the future.

@philipc2 philipc2 added the run-benchmark Run ASV benchmark workflow label Dec 23, 2024
Comment on lines 335 to 337
centroid_x = np.sum(node_x[face_nodes] * mask, axis=1) / n_nodes_per_face
centroid_y = np.sum(node_y[face_nodes] * mask, axis=1) / n_nodes_per_face
centroid_z = np.sum(node_z[face_nodes] * mask, axis=1) / n_nodes_per_face
Member

This is problematic for grids that have variable element sizes.

face_node_connectivity will contain fill values, which will lead to an index error here.

IndexError: index -9223372036854775808 is out of bounds for axis 0 with size 83886080

This wasn't an issue for the SCREAM dataset because it is completely composed of quads.

Contributor

How about fixing this and changing/standardizing the way we handle connectivity?

Member

> changing/standardizing the way we handle connectivity

Can you elaborate? I've explored storing connectivity as partitioned arrays in #978, although for something as simple as computing centroids, Numba seems to offer a significantly simpler implementation, performing close to compiled-language performance without needing to create an additional mask.

Contributor

My understanding is that with -1 etc. the mask approach fails, isn't that correct? So I was thinking that adopting a convention that allows the masking to work would simplify things.

Not sure about the performance; I think NumPy's vectorized operations might be highly optimized for numerical computations and would hopefully outperform the explicit loop here, even when those loops are parallelized with Numba.

@philipc2
Member

Here is a minimal example that fails. Feel free to create a test case from it:

import numpy as np
import uxarray as ux

node_lon = np.array([-20.0, 0.0, 20.0, -20.0, -40.0])
node_lat = np.array([-10.0, 10.0, -10.0, 10.0, -10.0])
# The first face is a triangle padded with the fill value -1; the second uses all four slots
face_node_connectivity = np.array([[0, 1, 2, -1], [0, 1, 3, 4]])

uxgrid = ux.Grid.from_topology(
    node_lon=node_lon,
    node_lat=node_lat,
    face_node_connectivity=face_node_connectivity,
    fill_value=-1,
)

uxgrid.construct_face_centers()

@philipc2
Member

Consider this implementation with Numba.

import numpy as np
from numba import njit, prange

# _normalize_xyz_scalar: uxarray's existing helper that normalizes a point onto the unit sphere


@njit(cache=True, parallel=True)
def _construct_face_centroids(node_x, node_y, node_z, face_nodes, n_nodes_per_face):
    centroid_x = np.zeros((face_nodes.shape[0]), dtype=np.float64)
    centroid_y = np.zeros((face_nodes.shape[0]), dtype=np.float64)
    centroid_z = np.zeros((face_nodes.shape[0]), dtype=np.float64)
    n_face = n_nodes_per_face.shape[0]

    for i_face in prange(n_face):
        # Only the first n_nodes_per_face[i_face] entries of this row are valid nodes
        n_max_nodes = n_nodes_per_face[i_face]

        x = np.mean(node_x[face_nodes[i_face, 0:n_max_nodes]])
        y = np.mean(node_y[face_nodes[i_face, 0:n_max_nodes]])
        z = np.mean(node_z[face_nodes[i_face, 0:n_max_nodes]])

        # Project the Cartesian average back onto the unit sphere
        x, y, z = _normalize_xyz_scalar(x, y, z)

        centroid_x[i_face] = x
        centroid_y[i_face] = y
        centroid_z[i_face] = z
    return centroid_x, centroid_y, centroid_z

Without partitioning our arrays, it's really difficult to find an elegant vectorized solution directly using NumPy.

@rajeeja
Contributor

rajeeja commented Dec 23, 2024

> Consider this implementation with Numba. [...]
>
> Without partitioning our arrays, it's really difficult to find an elegant vectorized solution directly using NumPy.

Yes, it was a sticky one; I left it for a later PR to optimize this.

@rajeeja
Contributor

rajeeja commented Jan 10, 2025

This also might use a bit more memory due to the creation of the mask array, but that should be fine for larger grids, and the speedup would be pretty good considering it uses vectorized operations and broadcasting. Seems to work, and I'm okay with approving this.

@philipc2
Member

@erogluorhan

Let me know if you have any thoughts about the Numba implementation that I shared.

@rajeeja
Contributor

rajeeja commented Jan 11, 2025

> @erogluorhan
>
> Let me know if you have any thoughts about the Numba implementation that I shared.

I'm fine with the Numba suggestion as well; we would have to modify this PR to use it. We use Numba all over the project.

@philipc2
Member

@erogluorhan @rajeeja

With the parallel Numba implementation, I was able to obtain the following times on my local machine for a 3.75km MPAS grid (84 million nodes, 42 million faces)

[Screenshot: timing results on the 3.75km MPAS grid]

@erogluorhan
Member Author

erogluorhan commented Jan 16, 2025

@philipc2 thanks very much for catching the original issue in this PR and suggesting the Numba solution!

Also thank you and @rajeeja for the convo you kept running here!

Please see some considerations of mine below and let me know your thoughts on each of them. They are mostly for the future; at least for the scope of this PR, I am fine with the current Numba version, so feel free to review it as is:

  1. Even the ~5 secs with the method I tried, and the current execution time of ~8 secs (on Philip's local machine), are sub-ideal for the long-term performance of this function. We need to bring it to a better spot (say around a second or two) for this setup (4GB SCREAM), because when a user works with time-series data with several time steps, perhaps on an even finer-resolution grid, the performance should still be okay.

  2. Speaking of time-series, the user will likely use chunking schemes, e.g. each time step being a separate chunk. In such scenarios, we need to be sure Numba and Dask will play well together.

  3. While the current Numba implementation seems to have decent performance for now, we saw that vectorization would give us even better performance with my original attempt (if only we didn't have the masking issue with the nonuniform connectivity array), and I thought further parts of the code could then be optimized in the future. That said, partitioning the connectivity as Philip attempted in "DRAFT: API for accessing partitioned connectivity, reworked topological aggregations" #978 still seems like the best bet to me for achieving vectorized performance. @philipc2, was there any difficulty in that code that made you stop and close that PR?

Thoughts?

@philipc2
Member

Hi @erogluorhan

> Even the ~5 secs with the method I tried, and the current execution time of ~8 secs (on Philip's local machine), are sub-ideal for the long-term performance of this function. We need to bring it to a better spot (say around a second or two) for this setup (4GB SCREAM), because when a user works with time-series data with several time steps, perhaps on an even finer-resolution grid, the performance should still be okay.

This is a one-time execution, so I personally think a couple of seconds is acceptable, especially for grids with millions of elements. The comparison is also not direct, as I ran my benchmark on a 3.75km MPAS grid (~84 million nodes). Do you recall the size of the SCREAM grid?

> Speaking of time-series, the user will likely use chunking schemes, e.g. each time step being a separate chunk. In such scenarios, we need to be sure Numba and Dask will play well together.

The centroid computation here is independent of any time-series data, since this is tied to the Grid and not a UxDataset.

> While the current Numba implementation seems to have decent performance for now, we saw that vectorization would give us even better performance with my original attempt (if only we didn't have the masking issue with the nonuniform connectivity array), and I thought further parts of the code could then be optimized in the future.

While I agree the speed looks significantly better in the masked approach, it only appears that way due to the fixed element dimension. While partitioning, as initially explored in #978, may improve our execution time, that PR has been stale for a while and I plan to revisit it in the near future. However, for our needs, especially with the face-bounds optimization, Numba has proven to be an excellent choice for obtaining significant performance gains.

There was some discussion about Numba in #1124

Pinging @Huite if he has anything he'd like to comment on.

@Huite

Huite commented Jan 16, 2025

Happy to comment!

You can easily circumvent the fill values for floating point data by utilizing NaNs and the relevant numpy methods.
E.g. index with the -1 value, then overwrite those values with NaN, then use the appropriate NaN-aware function to filter.

Vectorized

# Fancy indexing: the fill value -1 grabs the last node, but those slots are overwritten with NaN next
face_nodes_x = node_x[face_node_connectivity]
face_nodes_x[face_node_connectivity == -1] = np.nan
# The NaN-aware mean ignores the padded slots
centroid_x = np.nanmean(face_nodes_x, axis=1)

This allocates two intermediate arrays (the face_nodes_x and the boolean mask). That isn't too bad, although I'm fairly confident numpy doesn't parallelize these operations.

Numba

With regards to numba, you're probably leaving quite some performance on the table:

 x = np.mean(node_x[face_nodes[i_face, 0:n_max_nodes]])

Note that `node_x[face_nodes[i_face, 0:n_max_nodes]]` continuously allocates two small (heap-allocated) arrays. My guess is that this takes quite some memory, because it's right in the hottest part of the code. Ideally you pre-allocate and reuse the memory instead. The parallelization with prange makes this slightly trickier, though.

You can always write your own non-allocating mean, e.g. something like:

def mean(face, node_x):
    accum = 0.0
    count = 0
    for index in face:
        if index == -1:  # fill value marks the end of this face's nodes
            break
        accum += node_x[index]
        count += 1
    return accum / count

This might not have optimal performance either, because memory is accessed all over the place with `node_x[index]`, which probably causes cache misses (depending on the size of `node_x`).

You want to do indexing and computation of mean in parallel. The mean ideally operates over a contiguous piece of memory.
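
For what it's worth, here is a rough sketch of that combination: scalar accumulation inside `prange`, so the hot loop allocates no temporary arrays. It is hypothetical and single-coordinate, assumes a fill value of -1, and skips the normalization step from the earlier snippet.

```python
import numpy as np
from numba import njit, prange

INT_FILL = -1  # assumed fill value


@njit(cache=True, parallel=True)
def face_centroid_x(node_x, face_nodes):
    n_face, n_max = face_nodes.shape
    centroid_x = np.empty(n_face, dtype=np.float64)
    for i_face in prange(n_face):
        # Accumulate scalars; no per-face temporary arrays are created.
        accum = 0.0
        count = 0
        for j in range(n_max):
            index = face_nodes[i_face, j]
            if index == INT_FILL:
                break
            accum += node_x[index]
            count += 1
        centroid_x[i_face] = accum / count  # assumes every face has at least one valid node
    return centroid_x
```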

Centroid versus center-of-mass

But something else to consider is whether the mean is really what you're after.

I compute a mean for triangles, but use a center-of-mass approach otherwise. These are normally equivalent for convex polygons. The reason (which I should've commented in the code...!) is that there are quite a number of grids with things like hanging nodes (e.g. in quadtree meshes).

See this sketch in paint.

[Sketch: a quad cell with a hanging node (node 3) on one edge; the desired center is marked with an x]

Node 3 is hanging. My expectation is that the user wants to get the spot marked by the x (which is the center). But if you compute the mean of the vertices, node 3 is going to "pull" the centroid towards the right, which is undesirable.

You can avoid this by using a center of mass approach instead, as I do:

https://github.com/Deltares/xugrid/blob/fc0bdcb0eaeff62887f1e62fadd61957629dff61/xugrid/ugrid/connectivity.py#L584

To deal with the fill values, I "close" the faces with close_polygons (as shapely calls it): I add another column to the face nodes and replace the fill values with each face's first node. The center of mass can then be computed easily, because the fill values result in zero-area weights (thereby discounting their value).

My method generates relatively many temporary numpy arrays, but you could easily avoid those with fully scalar operations in numba.
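
To make that concrete, here is a rough NumPy sketch of the close-and-weight idea. It is not the xugrid implementation: it assumes planar x/y coordinates and simple polygons, and uses a fan triangulation from each face's first node in place of an explicit closing column.

```python
import numpy as np


def face_centers_of_mass(node_x, node_y, face_node_connectivity, fill_value=-1):
    # Replace fill values with each face's first node so the triangles they
    # form are degenerate and receive zero area weight.
    faces = face_node_connectivity.copy()
    faces = np.where(faces == fill_value, faces[:, :1], faces)

    x, y = node_x[faces], node_y[faces]
    x0, y0 = x[:, :1], y[:, :1]

    # Fan triangulation from the first node: triangles (v0, v_i, v_{i+1}).
    ax, ay = x[:, 1:-1] - x0, y[:, 1:-1] - y0
    bx, by = x[:, 2:] - x0, y[:, 2:] - y0
    area = 0.5 * (ax * by - ay * bx)            # signed triangle areas
    cx = (x0 + x[:, 1:-1] + x[:, 2:]) / 3.0     # triangle centroids
    cy = (y0 + y[:, 1:-1] + y[:, 2:]) / 3.0

    total = area.sum(axis=1)                    # assumes nonzero face areas
    return (area * cx).sum(axis=1) / total, (area * cy).sum(axis=1) / total
```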

My general philosophy is to rely on a vectorized solution first and use numba primarily when there's no other way, mostly to keep JIT latency down. Even though caching means the compilation cost is only paid on first use, CI runners, for example, will not be able to utilize the caches. But it's debatable at which point numba becomes superior; I have 16 cores on my laptop, so numba's prange will give huge speedups compared to the vectorized approach...

Labels
run-benchmark Run ASV benchmark workflow
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimize Face Centroid Calculations
5 participants