Optimization of the graph network #48
Comments
Regarding the interface, it should look and work like this:
# Create or load a model in any way
model = TorchMD_GN()
# Optional: train or do whatever you want with the model
# Optimize the model
from torchmdnet.optimize import optimize
optimized_model = optimize(model, some_optimization_options)
# Do the inference with the model
results = optimized_model.forward(z, pos, batch)
# Optional: convert the model into TorchScript and save for external use (e.g. OpenMM-Torch)
torch.jit.script(optimized_model).save('model.pt')
This is similar to what is being implemented for the TorchANI optimization (https://github.com/raimis/NNPOps/blob/opt_ani/README.md#example). @PhilippThoelke @stefdoerr @giadefa any comments? |
At the moment, it seems all the PyTorch Geometric packages are broken (pyg-team/pytorch_geometric#3660). |
@peastman I have just finished integrating NNPOps (#50). The performance (https://github.com/torchmd/torchmd-net/blob/main/benchmarks/graph_network.ipynb) is only 2-3 times better for the small molecules (10-100 atoms), with no significant improvement for the larger ones. I'll try profiling to get better insight. At some point, we should discuss whether we can make any further improvements. cc: @giadefa |
It would be useful to separate out all the different optimizations in NNPOps. Can you identify the effect of each one separately? Back when we first started designing it, we discussed requirements and decided it would be optimized for molecules of about 100 atoms. The code is all designed around that assumption. If we want good performance on much larger molecules, it would need to be written differently. For example, it uses an O(n^2) algorithm to build the neighbor list, which is very fast for small molecules and very slow for large ones. |
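To make the scaling argument concrete, here is a minimal pure-Python sketch of an all-pairs neighbor search. It is not the NNPOps kernel: the function name `brute_force_neighbors`, the test geometry, and the cutoff value are assumptions made for illustration. Every atom is tested against every other atom, so the number of distance checks is n(n-1)/2.

```python
import itertools
import math


def brute_force_neighbors(positions, cutoff):
    """Return (pairs, n_checks) for an all-pairs neighbor search.

    Hypothetical helper for illustration only, not the NNPOps implementation.
    positions: list of (x, y, z) tuples; cutoff: distance threshold.
    """
    pairs = []
    n_checks = 0
    for i, j in itertools.combinations(range(len(positions)), 2):
        n_checks += 1
        if math.dist(positions[i], positions[j]) < cutoff:
            pairs.append((i, j))
    return pairs, n_checks


# Atoms on a line, 1.0 apart; with a cutoff of 1.5 only adjacent atoms pair up.
line_100 = [(float(i), 0.0, 0.0) for i in range(100)]
line_1000 = [(float(i), 0.0, 0.0) for i in range(1000)]

pairs_100, checks_100 = brute_force_neighbors(line_100, cutoff=1.5)
pairs_1000, checks_1000 = brute_force_neighbors(line_1000, cutoff=1.5)

print(len(pairs_100), checks_100)    # 99 neighbor pairs from 4950 checks
print(len(pairs_1000), checks_1000)  # 999 neighbor pairs from 499500 checks
```

Ten times more atoms costs roughly a hundred times more distance checks, while the number of actual neighbor pairs in this toy geometry grows only tenfold.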
100 particles is a good case. However, we have two ways of running multiple
simulations.
In the first, we batch them: the same molecule is run as a single NN batch
for the forces, and your kernel does not support batching.
In the second, we simply make multiple copies (e.g. 64) placed far enough apart
and run them as a single system, but in this case the system size is more like
6400 particles.
I think we need batching in the CUDA kernel for identical molecules,
and cell lists, which are very fast.
g
(In reply to Peter Eastman's comment of Tue, Feb 22, 2022 at 5:46 PM.)
|
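For readers unfamiliar with cell lists, the idea can be sketched in a few lines of pure Python. This is an illustrative CPU version with no periodic boundaries, not a CUDA kernel and not code from NNPOps; the name `cell_list_neighbors` is hypothetical. Atoms are binned into cubic cells with an edge equal to the cutoff, so an atom only needs to be compared with atoms in its own cell and the 26 surrounding ones.

```python
import itertools
import math
from collections import defaultdict


def cell_list_neighbors(positions, cutoff):
    """Return (pairs, n_checks) using a cell list (no periodic boundaries).

    Hypothetical helper for illustration. Atoms are binned into cubic cells
    of edge length `cutoff`, so two atoms closer than `cutoff` always sit in
    the same cell or in adjacent cells.
    """
    cells = defaultdict(list)
    for index, pos in enumerate(positions):
        key = tuple(math.floor(c / cutoff) for c in pos)
        cells[key].append(index)

    pairs = []
    n_checks = 0
    offsets = list(itertools.product((-1, 0, 1), repeat=3))
    for key, members in cells.items():
        for offset in offsets:
            other = tuple(k + o for k, o in zip(key, offset))
            if other not in cells:
                continue
            for i in members:
                for j in cells[other]:
                    if i < j:  # visit each unordered pair exactly once
                        n_checks += 1
                        if math.dist(positions[i], positions[j]) < cutoff:
                            pairs.append((i, j))
    return sorted(pairs), n_checks


# Same toy geometry as an all-pairs test: 1000 atoms on a line, 1.0 apart.
line_1000 = [(float(i), 0.0, 0.0) for i in range(1000)]
pairs, n_checks = cell_list_neighbors(line_1000, cutoff=1.5)

print(len(pairs))   # 999 neighbor pairs
print(n_checks)     # far fewer than the 499500 checks of an all-pairs search
```

The work per atom depends on the local density rather than on the total number of atoms, which is why this approach scales to systems of thousands of particles.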
That would definitely need code changes to be efficient. You want it to know it only needs to check each atom against the other atoms in its own copy, not all the other copies. Spreading the copies out through space is also inaccurate. The further an atom is from the origin, the less precisely its position can be specified. |
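The loss of precision away from the origin can be checked directly, assuming positions are stored in single precision (float32); the thread does not state the precision, so that is an assumption. The sketch below uses only the standard library to compute the gap between a value and the next representable float32 value. The helper name `float32_spacing` is made up for this example.

```python
import struct


def float32_spacing(x):
    """Gap between x (rounded to float32) and the next larger float32 value.

    Valid for positive finite x. Works by reinterpreting the float32 bit
    pattern as an unsigned integer and incrementing it by one.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    current = struct.unpack("<f", struct.pack("<I", bits))[0]
    following = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return following - current


for coordinate in (1.0, 100.0, 10000.0):
    print(coordinate, float32_spacing(coordinate))
# 1.0 1.1920928955078125e-07
# 100.0 7.62939453125e-06
# 10000.0 0.0009765625
```

In whatever length unit the positions use, a coordinate near 10000 can only be resolved to about one part in a thousand of a unit, roughly four orders of magnitude coarser than a coordinate near 1.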
In reply to Peter Eastman's comment of Tue, Feb 22, 2022 at 6:15 PM:
Yes, accuracy is a problem, but doing it this way is quite efficient.
Some tests on forces showed reasonable results, though I agree that it
is a problem. The other way is batching. Can you add batching of multiple
copies of the same molecule to your CUDA kernel?
g
|
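One way to picture the requested batching is a neighbor search that takes a per-atom batch index, like the `batch` argument in the proposed `forward(z, pos, batch)` interface, and only compares atoms that share it. The sketch below is a pure-Python illustration under that assumption; `batched_neighbors` is a hypothetical name and this is not the NNPOps API.

```python
import itertools
import math


def batched_neighbors(positions, batch, cutoff):
    """Neighbor pairs restricted to atoms that share a batch index.

    Hypothetical helper, not the NNPOps API. positions: list of (x, y, z);
    batch: list of ints, where batch[i] is the copy that atom i belongs to.
    """
    by_copy = {}
    for index, copy_id in enumerate(batch):
        by_copy.setdefault(copy_id, []).append(index)

    pairs = []
    n_checks = 0
    for members in by_copy.values():
        for i, j in itertools.combinations(members, 2):
            n_checks += 1
            if math.dist(positions[i], positions[j]) < cutoff:
                pairs.append((i, j))
    return pairs, n_checks


# 64 copies of a 3-atom molecule, all at the same coordinates (no spatial offsets).
molecule = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
n_copies = 64
positions = molecule * n_copies
batch = [copy_id for copy_id in range(n_copies) for _ in molecule]

pairs, n_checks = batched_neighbors(positions, batch, cutoff=1.5)
print(len(pairs), n_checks)  # 128 pairs from 192 checks
```

Because copies never interact, they do not have to be spread out in space, which avoids the precision concern, and the number of checks grows linearly with the number of copies instead of quadratically with the total atom count.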
It's possible. Can you open an issue on the NNPOps repository describing exactly how you would want it to work? |
@raimis can you open an issue there, as you probably know the details of what you need in NNPOps? |
Just before going into NNPOps, I checked how much CUDA Graphs (https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/) can help. CUDA Graphs don't work with TorchMD_GN due to rusty1s/pytorch_cluster#123. To circumvent this, I have implemented a fake neighbor search (#60), which assumes that all the atoms are neighbors, a.k.a. brute force. The results (https://github.com/raimis/torchmd-net/blob/poc_cuda_graph/benchmarks/graph_network.ipynb) are promising:
|
nice and could we batch that?
On Wed, Feb 23, 2022 at 4:24 PM Raimondas Galvelis wrote:
The results (https://github.com/raimis/torchmd-net/blob/poc_cuda_graph/benchmarks/graph_network.ipynb) are promising:
[image: image] <https://user-images.githubusercontent.com/2469715/155347806-22cc0fb6-29eb-4cea-b504-b3c75e9f91ef.png>
- For alanine dipeptide (ALA2, 22 atoms) and testosterone (TST, 49 atoms), the brute-force approach with CUDA Graphs beats everything else.
- For chignolin (CLN, 166 atoms), brute force is no longer the best, and for larger systems it runs out of memory.
Ping: @giadefa @peastman @claudi
|
The current implementation doesn't support batching, but it could be implemented. |
That's interesting. It tells us that for the smaller molecules, the computation time is just dominated by kernel launch overhead. |
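The reasoning can be illustrated with a toy cost model in which every kernel pays a fixed launch cost on top of its compute time. All numbers below are made-up placeholders chosen for illustration, not measurements from the benchmarks in this thread, and `step_time_us` is a hypothetical helper.

```python
def step_time_us(n_kernels, launch_overhead_us, compute_us):
    """Toy model: total time = fixed launch cost per kernel + compute time."""
    return n_kernels * launch_overhead_us + compute_us


# All numbers below are made-up placeholders, not measurements from the thread.
N_KERNELS = 100
LAUNCH_OVERHEAD_US = 5.0
overhead_us = N_KERNELS * LAUNCH_OVERHEAD_US

small = step_time_us(N_KERNELS, LAUNCH_OVERHEAD_US, compute_us=50.0)
large = step_time_us(N_KERNELS, LAUNCH_OVERHEAD_US, compute_us=5000.0)

print(small, overhead_us / small)  # total time and overhead fraction, small system
print(large, overhead_us / large)  # total time and overhead fraction, large system
```

In this model the launch term dominates when the compute term is small. Replaying a captured CUDA Graph submits many kernels with a single launch, which shrinks the launch term and therefore helps small systems the most.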
Optimization: round 2
I have written optimized kernels for the neighbor search (#61) and message passing (#69). The kernels are drop-in replacements for the generic kernels from PyTorch Geometric and include the following optimizations:
Speed:
Full details in the notebook: https://github.com/raimis/torchmd-net/blob/poc_cuda_graph_2/benchmarks/graph_network.ipynb |
Optimization of the graph network (TorchMD_GN) with NNPOps (https://github.com/openmm/NNPOps).
In a special case, TorchMD_GN is equivalent to SchNet (#45 (comment)), which is already supported by NNPOps:
- CFConvNeighbors and CFConv -- PyTorch wrapper for SchNet operations openmm/NNPOps#40
- TorchMD_GN with NNPOps -- Accelerate the limited TorchMD_GN with NNPOps #50
In general, TorchMD_GN needs these:
- CFConv (rbf_type="expnorm")
- CFConv (trainable_rbf=True)
- CFConv (activation="silu")
- CFConv to accelerate the neighbor embedding (neighbor_embedding=True)