Questions about using RegioML #1

Open
ririjeong opened this issue Sep 6, 2022 · 5 comments

@ririjeong

ririjeong commented Sep 6, 2022

Hi, I have some questions about RegioML:

  1. When no probability is shown for a site in a molecule, does that mean the LGBM model cannot predict a probability for it?
    If so, why can't it predict the probability?

  2. What do the black circles mean?

Thanks

@NicolaiRee
Member

Hi ririjeong,

Thank you for your interest in RegioML!

To answer your questions:

  1. In the visual output, we chose to highlight only atoms with scores above 5%. However, if you visit, e.g., the Google Colab notebook for RegioML (https://t.co/49hfVKuklb?amp=1), it prints all the predicted probabilities for the EAS sites.

  2. The black circles are used in our paper to highlight the experimental reaction sites obtained from the Reaxys data.
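
The 5% display cut-off from point 1 can be pictured with a minimal sketch. The function name and the scores below are hypothetical and are not taken from the RegioML code; the point is only that a site without a highlight still has a predicted score.

```python
# Hypothetical illustration of a 5% display threshold; not RegioML code.
HIGHLIGHT_THRESHOLD = 0.05  # the 5% cut-off mentioned in point 1


def split_by_threshold(scores, threshold=HIGHLIGHT_THRESHOLD):
    """Return (highlighted, hidden) dicts of atom index -> score."""
    highlighted = {a: s for a, s in scores.items() if s > threshold}
    hidden = {a: s for a, s in scores.items() if s <= threshold}
    return highlighted, hidden


# Made-up scores for four candidate sites.
scores = {0: 0.62, 3: 0.04, 5: 0.21, 8: 0.001}
highlighted, hidden = split_by_threshold(scores)
print("highlighted in the picture:", highlighted)
print("predicted but not drawn:   ", hidden)
```

Sites in the second group are still scored by the model; they are simply left out of the drawing.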

Best wishes, Nicolai

@ririjeong
Author

ririjeong commented Sep 13, 2022

Thank you for answering,

I tried the Colab code, but there are still some sites that have no predicted probabilities.
The molecule in the attached image has 16 sites but only 12 probabilities.
I guess this is because identical atoms are removed. If so, is there a difference between removing identical atoms and not removing them?

mol

I also have another question regarding the model.
The shape of the descriptor seems to change from molecule to molecule. How does the LGBM model handle inputs of varying shape when making predictions?

Sincerely, ririjeong

@NicolaiRee
Member

Hi again,

The output lists show only unique EAS sites, so identical sites do not appear in them. In the depiction, however, all EAS sites are taken into account: an EAS site with a score above 5% is highlighted together with its identical atoms. This is done in the DescriptorCreator/molecule_svg.py file:
highlight_predicted, atom_scores = find_identical_atoms_with_scores(mol, highlight_predicted, atom_scores)
It makes no difference whether identical sites are removed or kept, because the model predicts each site from its own atomic descriptor.
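
As a rough sketch of how scores for unique sites can be copied onto their identical atoms, assume each atom carries a symmetry-class label. This is not the body of find_identical_atoms_with_scores; the function name and inputs below are hypothetical. In RDKit, `Chem.CanonicalRankAtoms(mol, breakTies=False)` is one possible source of such labels.

```python
# Hypothetical helper; not the RegioML implementation.
def expand_to_identical_atoms(symmetry_class, atom_indices, scores):
    """Copy each unique site's score to every atom in the same symmetry class.

    symmetry_class: list where symmetry_class[i] is the class label of atom i
    atom_indices:   indices of the unique sites that were actually scored
    scores:         scores aligned with atom_indices
    """
    # Map each scored class label to its score.
    class_score = {symmetry_class[a]: s for a, s in zip(atom_indices, scores)}
    all_indices, all_scores = [], []
    for atom, label in enumerate(symmetry_class):
        if label in class_score:  # atoms in unscored classes are skipped
            all_indices.append(atom)
            all_scores.append(class_score[label])
    return all_indices, all_scores


# Toy labels: atoms 1 and 5 are equivalent, as are atoms 2 and 4.
symmetry_class = [0, 1, 2, 3, 2, 1]
idx, sc = expand_to_identical_atoms(symmetry_class, [1, 2, 3], [0.70, 0.10, 0.02])
print(list(zip(idx, sc)))
```

Three scored unique sites become five entries here, which is the same kind of gap as 12 listed probabilities for 16 sites.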

The atomic descriptor always has the same size (485 dimensions), no matter what molecule you are exploring. This is because the atomic descriptor is built by sorting the atomic CM5 charges according to the Cahn–Ingold–Prelog (CIP) rules. You can think of this as a convolution of the atomic charges around the atom of interest. Please have a look at Fig. 1 in our paper and note that we stop the sorting at the 5th shell.
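
To see why a shell-based descriptor has the same length for every molecule, here is a simplified sketch. It is not RegioML's descriptor: the slot count per shell is a made-up value, and atoms within a shell are ordered by charge as a stand-in for the CIP ordering described above.

```python
from collections import deque

N_SHELLS = 5         # the sorting stops at the 5th shell
SLOTS_PER_SHELL = 4  # hypothetical width; RegioML's real layout differs


def shell_descriptor(adjacency, charges, center):
    """Fixed-length vector of charges around `center`, shell by shell."""
    # Breadth-first search gives each atom's bond distance from the center.
    distance = {center: 0}
    queue = deque([center])
    while queue:
        atom = queue.popleft()
        if distance[atom] == N_SHELLS:
            continue  # do not walk past the last shell
        for nbr in adjacency[atom]:
            if nbr not in distance:
                distance[nbr] = distance[atom] + 1
                queue.append(nbr)

    vector = [charges[center]]
    for shell in range(1, N_SHELLS + 1):
        # Stand-in ordering: sort by charge (RegioML sorts by CIP rules).
        members = sorted(
            (charges[a] for a, d in distance.items() if d == shell), reverse=True
        )
        members = members[:SLOTS_PER_SHELL]  # extra atoms are dropped in this toy
        members += [0.0] * (SLOTS_PER_SHELL - len(members))  # zero-pad empty slots
        vector.extend(members)
    return vector


# Two toy "molecules" of different size: a 3-atom chain and a 6-membered ring.
chain = {0: [1], 1: [0, 2], 2: [1]}
chain_charges = [-0.10, 0.05, -0.20]
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
ring_charges = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05]

d_chain = shell_descriptor(chain, chain_charges, 0)
d_ring = shell_descriptor(ring, ring_charges, 0)
print(len(d_chain), len(d_ring))
```

Both vectors have 1 + N_SHELLS * SLOTS_PER_SHELL entries, so a model with a fixed input width can score either atom.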

Best wishes, Nicolai

@ririjeong
Author

ririjeong commented Sep 26, 2022

Thank you,

I tried to use RegioML, but there are still some problems I could not solve.
I wanted to print out the probabilities of all EAS sites and found this code.

image

I removed it and displayed the result.

The picture on the left is the result before removing the code, and
the picture on the right is the result after removing it.
question picture

I tried to figure out the reason and extracted the descriptors after removing the code.
I found that the descriptors were different even though the atomic sites were the same.

What is the reason for the changed result? I cannot figure out what is wrong.

@NicolaiRee
Member

Hi again,

If you wish to output the probabilities of all possible EAS sites, I recommend importing the following in the regioML.py file:
from DescriptorCreator.find_atoms import find_identical_atoms_with_scores
and then add:
atom_indices, pred_proba = find_identical_atoms_with_scores(predictor.rdkit_mol, atom_indices, list(pred_proba))

Screenshot 2022-11-06 at 20 11 26

Screenshot 2022-11-06 at 20 09 08

RegioML is tested in this way and the performance you obtain should be identical to what we report in our paper.

However, I have investigated the issue a bit further and found the following explanation.
RegioML relies on a single conformer embedding followed by a fast SQM calculation to obtain the CM5 atomic charges.
The charges are then sorted into the input descriptors, which the machine learning model uses to produce a classification score. Here is a figure showing the calculated atomic charges for the particular molecule you are investigating:

Screenshot 2022-11-06 at 20 24 58

As you can see, the calculated atomic charges are not completely identical for atoms with otherwise identical ranking. These small deviations result in slightly different input descriptors, which in turn result in different classification scores.
In fact, we could use this finding in a future version by training not only on unique EAS sites but all EAS sites. This would make the machine learning model more robust to these small deviations.
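
One way to check for these deviations in your own runs is to group the computed charges by symmetry class and look at the spread within each group. The sketch below is only a diagnostic under stated assumptions: the class labels and charges are made up, and the helper name is hypothetical rather than part of RegioML.

```python
from collections import defaultdict


# Hypothetical diagnostic; not part of RegioML.
def charge_spread_by_class(symmetry_class, charges):
    """Return {class label: max - min charge} for classes with more than one atom."""
    groups = defaultdict(list)
    for label, q in zip(symmetry_class, charges):
        groups[label].append(q)
    return {label: max(qs) - min(qs) for label, qs in groups.items() if len(qs) > 1}


# Made-up charges: atoms 1/5 and 2/4 are equivalent by topology, but a
# single-conformer calculation gives them slightly different values.
symmetry_class = [0, 1, 2, 3, 2, 1]
charges = [-0.300, -0.112, -0.095, -0.101, -0.097, -0.109]

spread = charge_spread_by_class(symmetry_class, charges)
for label, delta in sorted(spread.items()):
    print(f"class {label}: charge spread {delta:.3f}")
```

A non-zero spread for a class means its atoms will get slightly different descriptors, and therefore can get different scores.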

Once again thank you for your interest in RegioML!

Best wishes,
Nicolai
