Vectorize coordinates._construct_face_centroids()
#1117
base: main
Conversation
Good fix; will test Welzl and look into an appropriate fix.
ASV Benchmarking: Benchmark Comparison Results

Benchmarks that have improved:
Benchmarks that have stayed the same:
Benchmarks that have got worse:
Nice work @erogluorhan! It appears that a few ASV benchmarks are failing (may or may not be due to the changes in this PR). I can look deeper into it on Monday.
I can't see any failing benchmarks though! @philipc2
If you take a look at the GitHub bot comment above, it shows which ones failed. If they failed both before and after, it means something was wrong before this PR. It looks like this PR didn't break anything; it must have been an issue with how the benchmark was written before.
Ok, I see what you mean now, @philipc2! Also, re benchmarking, I'd expect
uxarray/grid/coordinates.py
```python
centroid_x = np.sum(node_x[face_nodes] * mask, axis=1) / n_nodes_per_face
centroid_y = np.sum(node_y[face_nodes] * mask, axis=1) / n_nodes_per_face
centroid_z = np.sum(node_z[face_nodes] * mask, axis=1) / n_nodes_per_face
```
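(For context, here is a hedged sketch of how the `mask` and `n_nodes_per_face` used above could be built; the padded `face_nodes` layout comes from the thread, but this exact construction is an assumption, not necessarily the PR's code.)

```python
import numpy as np

# Padded, fixed-width connectivity: each row is a face, -1 fills unused slots
face_nodes = np.array([[0, 1, 2, -1], [0, 1, 3, 4]])

# 1.0 for valid slots, 0.0 for fill slots, so fill entries drop out of the sums
mask = (face_nodes != -1).astype(np.float64)
n_nodes_per_face = mask.sum(axis=1)
```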
This is problematic for grids that have variable element sizes: face_node_connectivity will contain fill values, which leads to an index error here.

```
IndexError: index -9223372036854775808 is out of bounds for axis 0 with size 83886080
```
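(That huge negative index is the minimum value of a 64-bit integer, which the error message suggests is being used as the fill value. A tiny sketch of the failure mode, assuming an int64-min fill:)

```python
import numpy as np

node_x = np.zeros(10)
INT_FILL = np.iinfo(np.int64).min  # -9223372036854775808
node_x[np.array([0, INT_FILL])]    # raises IndexError: out of bounds
```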
This wasn't an issue for the SCREAM dataset because it is completely composed of quads.
How about fixing this and changing/standardizing the way we handle connectivity?
> changing/standardizing the way we handle connectivity
Can you elaborate? I've explored storing connectivity as partitioned arrays in #978, although for something as simple as computing centroids, Numba seems to allow a significantly simpler implementation, performing closer to compiled-language speed without needing to create an additional mask.
My understanding is that with -1 etc. the mask approach fails, isn't that correct? So I was thinking that adopting a convention that allows masking to work would simplify things.

Not sure about the performance, as I think NumPy's vectorized operations might be highly optimized for numerical computations and hopefully outperform the explicit loop here, even when those loops are parallelized with Numba.
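(As a way to test that expectation, a hedged micro-benchmark sketch on synthetic, fixed-width data; the names and sizes here are illustrative, and only the masked expression comes from the diff above.)

```python
import time
import numpy as np

rng = np.random.default_rng(0)
node_x = rng.random(1_000_000)
face_nodes = rng.integers(0, 1_000_000, size=(500_000, 4))
mask = np.ones(face_nodes.shape, dtype=np.float64)  # all-quad case: no fills
n_nodes_per_face = mask.sum(axis=1)

t0 = time.perf_counter()
centroid_x = np.sum(node_x[face_nodes] * mask, axis=1) / n_nodes_per_face
print(f"masked NumPy: {time.perf_counter() - t0:.3f} s")
```

A Numba version would need a warm-up call first so that JIT compilation time is excluded from the measurement.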
Here is a minimal example that fails; feel free to create a test case from this:

```python
import numpy as np
import uxarray as ux

node_lon = np.array([-20.0, 0.0, 20.0, -20, -40])
node_lat = np.array([-10.0, 10.0, -10.0, 10, -10])
face_node_connectivity = np.array([[0, 1, 2, -1], [0, 1, 3, 4]])

uxgrid = ux.Grid.from_topology(
    node_lon=node_lon,
    node_lat=node_lat,
    face_node_connectivity=face_node_connectivity,
    fill_value=-1,
)
uxgrid.construct_face_centers()
```
Consider this implementation with Numba:

```python
import numpy as np
from numba import njit, prange

@njit(cache=True, parallel=True)
def _construct_face_centroids(node_x, node_y, node_z, face_nodes, n_nodes_per_face):
    centroid_x = np.zeros((face_nodes.shape[0]), dtype=np.float64)
    centroid_y = np.zeros((face_nodes.shape[0]), dtype=np.float64)
    centroid_z = np.zeros((face_nodes.shape[0]), dtype=np.float64)
    n_face = n_nodes_per_face.shape[0]

    for i_face in prange(n_face):
        # Only the first n_nodes_per_face[i_face] entries of this row
        # are valid node indices; the rest are fill values.
        n_max_nodes = n_nodes_per_face[i_face]
        x = np.mean(node_x[face_nodes[i_face, 0:n_max_nodes]])
        y = np.mean(node_y[face_nodes[i_face, 0:n_max_nodes]])
        z = np.mean(node_z[face_nodes[i_face, 0:n_max_nodes]])

        # Project the averaged point back onto the unit sphere
        x, y, z = _normalize_xyz_scalar(x, y, z)
        centroid_x[i_face] = x
        centroid_y[i_face] = y
        centroid_z[i_face] = z

    return centroid_x, centroid_y, centroid_z
```

Without partitioning our arrays, it's really difficult to find an elegant vectorized solution directly using NumPy.
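(For illustration, a hypothetical sketch of what a partitioned, CSR-style layout could make possible; `face_node_data` and `face_offsets` are illustrative names, not uxarray's API.)

```python
import numpy as np

# Node x-coordinates (values borrowed from the failing example above)
node_x = np.array([-20.0, 0.0, 20.0, -20.0, -40.0])

# A 3-node face and a 4-node face stored back to back; face_offsets marks
# where each face's node indices begin -- no fill values needed anywhere.
face_node_data = np.array([0, 1, 2, 0, 1, 3, 4])
face_offsets = np.array([0, 3, 7])

# Segmented sums per face, then divide by the face sizes to get means
centroid_x = np.add.reduceat(node_x[face_node_data], face_offsets[:-1])
centroid_x /= np.diff(face_offsets)
```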
Yes, it was a sticky one; I left it for a later PR to optimize this.
This also might use a bit more memory due to the creation of the mask array, but that should be fine; for larger grids the speedup would be pretty good, considering it uses vectorized operations and broadcasting. Seems to work, and I'm okay with approving this.
Let me know if you have any thoughts about the Numba implementation that I shared.
Fine with the Numba suggestion also; we can modify this PR to use that, since we use Numba all over the project.
With the parallel Numba implementation, I was able to obtain the following times on my local machine for a 3.75km MPAS grid (84 million nodes, 42 million faces):
@philipc2 thanks very much for catching the original issue in this PR and suggesting the Numba solution! Also, thank you and @rajeeja for the conversation you kept running here! Please see some considerations of mine below and let me know your thoughts about each of them, mostly for the future; at least for the scope of this PR, I am fine with the current Numba version, so feel free to review it as is:

Thoughts?
Hi @erogluorhan
This is a one-time execution, so I personally think a couple of seconds is acceptable, especially for grids with millions of elements. The comparison is also not direct, as I ran my benchmark on a 3.75km MPAS grid (~84 million nodes). Do you recall the size of the SCREAM grid?
The centroid computation here is independent of any time-series data, since it is tied to the grid itself.
While I agree the speed looks significantly better in the masked approach, it only appears that way because of the fixed element dimension. Partitioning, as initially explored in #978, may improve our execution time, but that PR has been stale for a while and I plan to revisit it in the near future. However, for our needs, especially with the face-bounds optimization, Numba has proven to be an excellent choice for obtaining significant performance gains. There was some discussion about Numba in #1124. Pinging @Huite in case he has anything he'd like to comment on.
Happy to comment! You can easily circumvent the fill values for floating-point data by utilizing NaNs and the relevant numpy methods.

Vectorized

```python
face_nodes_x = node_x[face_node_connectivity]
face_nodes_x[face_node_connectivity == -1] = np.nan
centroid_x = np.nanmean(face_nodes_x, axis=1)
```

This allocates two intermediate arrays (the face_nodes_x and the boolean mask). That isn't too bad, although I'm fairly confident numpy doesn't parallelize these operations.

Numba

With regards to numba, you're probably leaving quite some performance on the table:

```python
x = np.mean(node_x[face_nodes[i_face, 0:n_max_nodes]])
```

Note that the fancy indexing here allocates a temporary array for every face. You can always write your own non-allocating mean:

```python
def mean(face, node_x):
    # Accumulate until the first fill value; no temporary arrays
    accum = 0.0
    count = 0
    for index in face:
        if index == -1:
            break
        accum += node_x[index]
        count += 1
    return accum / count
```
return accum / index This might not have optimal performance either, because memory is accessed all over the place with You want to do indexing and computation of mean in parallel. The mean ideally operates over a contiguous piece of memory. Centroid versus center-of-massBut something else to consider is whether the mean is really what you're after. I compute a mean for triangles, but use a center of mass approach otherwise. These are normally equivalent for convex polygons. The reason (which I should've commented in the code...!) is that there are quite a number of grids where there's stuff like hanging nodes (in quadtree meshes e.g.). See this sketch in paint. Node 3 is hanging. My expectation is that the user wants to get the spot marked by the x (which is the center). But if you compute the mean of the vertices, node 3 is going to "pull" the centroid towards the right, which is undesirable. You can avoid this by using a center of mass approach instead, as I do: To deal with the fill values, I "close" the faces with My method generates relatively many temporary numpy arrays, but you could easily avoid those with fully scalar operations in numba. My general philosophy is to rely on vectorized solution first, and primarily use numba when there's no other way, primarily to keep JIT latency down. Even though caching will make it only on first use, e.g. CI runners will not be able to utilize the caches. But it's debatable at which point numba becomes superior; I have 16 cores on my laptop so numba's |
Closes #1116.
Overview
Optimizes coordinates.py/_construct_face_centroids() by replacing the for-loop with vectorization, with the help of a masking method from @hongyuchen1030's older suggestion. Further optimization of the code would be possible, but that can be handled in future PRs.
FYI @philipc2: The masking method this PR uses might be useful for cases we previously dealt with via partitioning, e.g. the nodal averaging problem we looked into last year.
FYI @rajeeja: This is where the Cartesian-based centroid calculations are being fixed. Welzl is a bit different, but you may still want to look into it.
Results
#1116 shows the profiling results for the existing code on a 4GB SCREAM dataset, where the execution time is around 5 minutes. The optimized code takes around 6 seconds; see below:
PR Checklist
General
Testing