Training equivariant transformer with OptimizedDistance #203
Comments
Hi! |
#204 is merged, please pull and try again. |
Also, be careful with the parameters: the current ET expects Distance to return vectors (return_vecs=True) and to include self loops (loop=True). You should replace the Distance line with:
self.distance = OptimizedDistance(
    cutoff_lower,
    cutoff_upper,
    max_num_pairs=-max_num_neighbors,
    return_vecs=True,
    loop=True,
    check_errors=False,  # Note that this line will silently leave neighbors out of the list if there are too many
    box=torch.diag(torch.tensor(pbc_box))
) |
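For readers unfamiliar with the options above, here is a minimal pure-Python sketch of what a neighbor search with a cutoff, self loops, returned vectors and an orthorhombic periodic box computes. It is only an illustration and not the OptimizedDistance implementation: the function name neighbor_pairs, the box_lengths argument, the sign convention of the returned vectors and the choice to keep self loops regardless of cutoff_lower are all assumptions made for this example.

```python
import math


def neighbor_pairs(positions, cutoff_lower, cutoff_upper, box_lengths=None, loop=False):
    """Brute-force reference neighbor search (hypothetical helper, O(N^2)).

    Returns (pairs, vecs, dists) for every ordered pair (i, j) whose distance
    lies in [cutoff_lower, cutoff_upper). If box_lengths is given, distances
    use the minimum image convention for an orthorhombic box.
    """
    pairs, vecs, dists = [], [], []
    n = len(positions)
    for i in range(n):
        for j in range(n):
            if i == j:
                # Self loops are only emitted when loop=True (assumed here to
                # bypass the lower cutoff, since their distance is zero).
                if loop:
                    pairs.append((i, j))
                    vecs.append((0.0, 0.0, 0.0))
                    dists.append(0.0)
                continue
            d = [positions[i][k] - positions[j][k] for k in range(3)]
            if box_lengths is not None:
                # Minimum image: wrap each component into [-L/2, L/2].
                d = [d[k] - box_lengths[k] * round(d[k] / box_lengths[k]) for k in range(3)]
            r = math.sqrt(sum(c * c for c in d))
            if cutoff_lower <= r < cutoff_upper:
                pairs.append((i, j))
                vecs.append(tuple(d))
                dists.append(r)
    return pairs, vecs, dists


# Two atoms near opposite faces of a 10 x 10 x 10 box are neighbors only
# when periodic boundary conditions are taken into account.
positions = [(0.5, 0.0, 0.0), (9.5, 0.0, 0.0)]
periodic = neighbor_pairs(positions, 0.0, 5.0, box_lengths=(10.0, 10.0, 10.0), loop=True)
open_boundary = neighbor_pairs(positions, 0.0, 5.0)
```

A brute-force reference like this can be handy for sanity-checking the pairs a faster implementation returns on a small test system.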
Perfect, thank you so much! I ultimately need this model to compile down to a TorchScript module for use with OpenMM, so are there any additional changes I should make to ensure that will work? Or is replacing Distance with OptimizedDistance sufficient? |
That should be it! But if you find any issues just let us know. |
I tried training the equivariant transformer with the OptimizedDistance replaced as suggested. I am now running into the following error whenever the model tries to run through the test set after training:
For context, I am training the model using a dataset organized in an HDF5 format. Looking at the
Note that this does not occur during the fitting stage, and the model can successfully train for any number of epochs. This only occurs after training is complete. In that sense, it does not seem like a critical issue. Thanks! |
I found a more serious issue with training the equivariant transformers using the
My goal with the equivariant transformer is to train a model using the implemented periodic boundary conditions such that it can be used with the
from openmmtorch import TorchForce
force = TorchForce("generated_mod.pt")  # My generated TorchScript module
I run into the following error when trying to generate a
Looking back at the documentation for
The slurm output file also has the following messages printed out:
For reference, my Distance line looks as follows:
self.distance = OptimizedDistance(
    cutoff_lower,
    cutoff_upper,
    max_num_pairs=-max_num_neighbors,
    return_vecs=True,
    loop=True,
    resize_to_fit=False,  # Setting this to True makes training work, but False causes crashing
    check_errors=False,
    box=torch.diag(torch.tensor(pbc_box))
)
Any guidance on how to debug this would be greatly appreciated, as I am not very familiar with programming in CUDA or interfacing with CUDA at a low level. Thank you! |
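As background on the resize_to_fit and check_errors flags discussed here, the following pure-Python sketch shows the general idea of a fixed-capacity pair list. It is an illustration under assumptions, not the library's code: the helper names pad_pairs and valid_pairs are hypothetical, and the (-1, -1) padding value reflects my understanding of the OptimizedDistance documentation rather than anything stated in this thread.

```python
def pad_pairs(pairs, max_num_pairs, resize_to_fit, check_errors=True):
    """Store pairs in a list of capacity max_num_pairs (hypothetical helper).

    - resize_to_fit=True: return only the pairs that were found.
    - resize_to_fit=False: pad with (-1, -1) up to max_num_pairs, so the
      output size does not depend on the input data.
    - On overflow, raise if check_errors is True; otherwise silently drop
      the pairs that do not fit.
    """
    pairs = list(pairs)
    if len(pairs) > max_num_pairs:
        if check_errors:
            raise ValueError(
                f"found {len(pairs)} pairs but max_num_pairs is {max_num_pairs}"
            )
        pairs = pairs[:max_num_pairs]  # neighbors are silently left out
    if resize_to_fit:
        return pairs
    return pairs + [(-1, -1)] * (max_num_pairs - len(pairs))


def valid_pairs(padded):
    """Drop the (-1, -1) padding entries before using the pairs."""
    return [p for p in padded if p[0] >= 0]


found = [(0, 1), (1, 0)]
padded = pad_pairs(found, max_num_pairs=4, resize_to_fit=False)
resized = pad_pairs(found, max_num_pairs=4, resize_to_fit=True)
truncated = pad_pairs(found + [(0, 2)], max_num_pairs=2, resize_to_fit=False, check_errors=False)
```

With a fixed-size output, any downstream code has to cope with the padding entries, which is one reason a model can behave differently when resize_to_fit is switched off.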
Your first error is probably the one described here: #205
CUDA graphs are a separate feature and are not required at all for OpenMM-Torch compatibility. Your error above is telling you that some operation in ET is not CUDA graph compatible (in particular mask.all()). We have not had success thus far in training with CUDA graphs.
self.distance = OptimizedDistance(
    cutoff_lower,
    cutoff_upper,
    max_num_pairs=-max_num_neighbors,
    return_vecs=True,
    loop=True,
    box=torch.diag(torch.tensor(pbc_box))
)
The worrying error you are getting is this: Could not find any similar ops to torchmdnet_neighbors::get_neighbor_pairs. This op may not exist or may not be currently supported in TorchScript.
I assumed that by TorchScripting the module, all operations would be placed inside the pt file. Apparently not!
from openmmtorch import TorchForce
import torchmdnet.models.utils  # This line will register the library to torch.ops
force = TorchForce("generated_mod.pt")  # My generated TorchScript module
I am sorry for this inconvenience, I will take a look to see if something can be done. |
Thanks for the clarifications, that was very helpful! I found that it was helpful to consolidate everything into one virtual environment because specific imports were needed to make the TorchForce module work, specifically:
from openmmtorch import TorchForce
import torchmdnet.neighbors  # Addresses the missing operations error
import torch_cluster, torch_geometric  # Required for some other operations within ET
force = TorchForce("generated_mod.pt")  # This now works
With this setup, I can now run dynamics within OpenMM. Thanks for all your help! |
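For anyone hitting the same missing-op error, one way to make the import requirement explicit is a small pre-flight check that imports the op-registering modules before the TorchScript file is loaded. This is only a sketch: the helper import_op_libraries and the REQUIRED_MODULES list are hypothetical names, and the module list is simply the set of imports reported to work in this thread.

```python
import importlib


def import_op_libraries(module_names):
    """Import each module for its side effects (hypothetical helper).

    Importing the modules is what registers their custom operators, so this
    should run before the TorchScript file is loaded. Returns the names of
    modules that could not be imported.
    """
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Modules reported in this thread as needed before loading the model.
REQUIRED_MODULES = ["torchmdnet.neighbors", "torch_cluster", "torch_geometric"]

# Example use (requires openmmtorch and the modules above to be installed):
#   missing = import_op_libraries(REQUIRED_MODULES)
#   if missing:
#       raise RuntimeError(f"Install these in the same environment: {missing}")
#   from openmmtorch import TorchForce
#   force = TorchForce("generated_mod.pt")
```

Failing early with the list of missing modules tends to be easier to diagnose than the TorchScript error about an unknown op.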
Hello,
I am currently trying to train the equivariant transformer model using the OptimizedDistance module by replacing the call to Distance() with OptimizedDistance() in torchmd-net/torchmdnet/models/torchmd_et.py. I want to train on a system with periodic boundary conditions. However, when I try running the training, I get the following traceback:
I saw in a previous commit that this file was removed, but it seems like the model cannot proceed with training without it. For reference, here is the change I made within torchmd_et.py:
I am running on one Nvidia H100 GPU. Any help/clarification would be greatly appreciated.
Thank you!