GETAWAY descriptors seem nondeterministic #7264
Comments
What do you mean by nondeterministic? It depends on 3D distances and angles, so it should be invariant to rotation only.
On 16 March 2024 at 22:37, Jakub Adamczyk wrote:
Describe the bug
When I run GETAWAY descriptors multiple times, I get different results.
To Reproduce
I use molecules from HIV dataset from MoleculeNet, but I suspect this behavior is the same everywhere.
First, I keep largest components and generate conformers:
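from rdkit.Chem import AddHs, RemoveHs
from rdkit.Chem.MolStandardize.rdMolStandardize import LargestFragmentChooser
from rdkit.Chem.rdDistGeom import ETKDGv3, EmbedMolecule

# X is a list of RDKit Mol objects
chooser = LargestFragmentChooser()
X = [chooser.choose(mol) for mol in X]
X = [AddHs(mol) for mol in X]

embed_params = ETKDGv3()
embed_params.randomSeed = 0
conf_ids = []
for mol in X:
    conf_id = EmbedMolecule(mol, embed_params)
    conf_ids.append(conf_id)

X = [RemoveHs(mol) for mol in X]
for i in range(len(X)):
    X[i].conf_id = conf_ids[i]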
I save conformer IDs as attributes of Mol objects, as a simple way to pass them around.
Then calculating the GETAWAY descriptors:
import numpy as np
from rdkit.Chem.rdMolDescriptors import CalcGETAWAY
X_1 = np.array([CalcGETAWAY(mol, confId=mol.conf_id) for mol in X])
X_2 = np.array([CalcGETAWAY(mol, confId=mol.conf_id) for mol in X])
Now, if I check equality, I get an error:
assert np.allclose(X_1, X_2)
This can be verified element-by-element with:
xs, ys = np.nonzero(~np.isclose(X_1, X_2))
for x, y in zip(xs, ys):
    print(x, y, X_1[x, y], X_2[x, y])
Those differences can sometimes be very large:
0 13 2.194 2.263
0 22 1.0 0.0
0 23 6.0 4.0
0 38 0.0 nan
0 40 3.2388025299883392e+184 0.0
0 41 1.0 0.0
0 43 1.2955210119953357e+185 nan
0 50 0.0 0.041
0 53 2.2 2.281
0 92 0.0 0.033
0 93 2.846 2.911
0 100 0.0 0.463
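One caveat with the element-wise check above: np.isclose treats NaN as unequal to NaN by default, so a position that is NaN in both runs would also be reported as a difference. A minimal sketch of a comparison that skips matching NaNs is shown here. The name `find_mismatches` and its tolerance defaults are hypothetical, chosen for illustration; they are not part of RDKit or NumPy.

```python
import math


def find_mismatches(run_a, run_b, rel_tol=1e-5, abs_tol=1e-8):
    """Return (row, col, a, b) for every position where two runs disagree.

    Works on nested lists or 2D NumPy arrays. A position that is NaN in
    both runs is treated as equal; NaN in only one run is a mismatch.
    """
    mismatches = []
    for i, (row_a, row_b) in enumerate(zip(run_a, run_b)):
        for j, (a, b) in enumerate(zip(row_a, row_b)):
            a, b = float(a), float(b)
            if math.isnan(a) and math.isnan(b):
                continue  # both NaN: count as the same result
            # math.isclose returns False whenever exactly one side is NaN
            if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
                mismatches.append((i, j, a, b))
    return mismatches
```

Calling `find_mismatches(X_1, X_2)` would then list only the positions that genuinely differ between the two runs. With NumPy, `np.isclose(X_1, X_2, equal_nan=True)` gives the same NaN handling.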
Expected behavior
Deterministic calculation, or at least being able to set a random seed.
Configuration (please complete the following information):
RDKit version: 2023.9.5
OS: Ubuntu 22.04
Python version (if relevant): 3.11
Are you using conda? no
If you are not using conda: how did you install the RDKit? pip
I mean that if you run exactly the same code twice, you will get two different results. This also applies when the conformers are already calculated, as you can see from my code. As a result, the results are not reproducible. In particular, any ML code running this would get different features on every run. I think that either:
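@thegodone the code is, indeed, nondeterministic, as commented here: https://github.com/rdkit/rdkit/blob/a8d4912f88ae2ea9ea7afa366ba5b9c0be09cb79/Code/GraphMol/Descriptors/GETAWAY.cpp#L216 (rdkit/Code/GraphMol/Descriptors/GETAWAY.cpp, line 216 in a8d4912).
This should be fixed, and this is what this issue concerns.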
I am happy to have help on this issue. I have not found a fast alternative so far.
@j-adamczyk can you share a specific molecule where you noticed non-deterministic behavior?
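@greglandrum sure. I used the HIV dataset from MoleculeNet and took the smallest molecules from it (shortest SMILES, to be precise). I first generate conformers (with 1000 attempts), and then perform 2 GETAWAY calculations. I get nondeterministic results on all 10 smallest molecules:
['NCCS', 'CNC=O', '[Li]F', '[Li]Cl', 'CS(C)=O', 'O=[As]O', 'O=C(O)O', 'CC(=O)O', 'C1SCSCS1', 'OCC(S)CS']
Among the 50 smallest SMILES, 48 give different results on two GETAWAY runs. Those 50 SMILES are:
['NCCS', 'CNC=O', '[Li]F', '[Li]Cl', 'CS(C)=O', 'O=[As]O', 'O=C(O)O', 'CC(=O)O', 'C1SCSCS1', 'OCC(S)CS', 'NC(=O)NO', 'OCCNNCCO', 'CN(C)CCS', 'Cl[Cu]Cl', 'CCOCNC=O', 'NCCCNCCS', 'S=C1NCCS1', 'N=C1NCCS1', 'O=S1OCCO1', 'S=C1SCCS1', 'CN1CSCSC1', 'Brc1csnn1', 'CNCCSSCCN', 'N=C1SCCS1', 'S=C1NCCO1', 'CONC(C)=O', 'CNC(C)C#N', 'O=S1CSCS1', 'CC(=O)C=O', 'COC=CCCCO', 'N#CNC(=N)N', 'CSSC(SC)SC', 'O=C1CCCCN1', 'NN1CCOC1=O', 'c1cn[nH]c1', 'O=S1CCCCS1', 'CSCCC(N)CO', 'O=C(O)CCCO', 'NCCSSSSCCN', 'NCC1=NCCN1', 'Nc1cccnc1O', 'OCc1cccnc1', 'N#Cc1cnsc1', 'OCC1CCCC1O', 'CC=C(I)CBr', 'O=C(O)C1CC1', 'S=C1CCCCCN1', 'C=Cc1ccccn1', 'C=Cc1ccncc1', 'NC1CCCCCCC1']
Only 'CNCCSSCCN' and 'NCCSSSSCCN' give the same result.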
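Picking the "smallest" molecules by SMILES length, as described in this comment, can be sketched in a few lines. The helper name `shortest_smiles` is hypothetical, and string length is only a rough proxy for molecule size; this is an illustration, not necessarily the exact selection code used for the lists in this thread.

```python
def shortest_smiles(smiles_list, n):
    """Return the n shortest SMILES strings.

    sorted() is stable, so strings of equal length keep their
    original relative order from the input list.
    """
    return sorted(smiles_list, key=len)[:n]
```

For example, `shortest_smiles(all_smiles, 50)` would select the 50 shortest entries of a hypothetical `all_smiles` list loaded from the dataset.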
There are two steps in the computation: the 3D conformer generation, which is nondeterministic, and the GETAWAY computation based on that 3D conformation.
@j-adamczyk: do you have cases where GETAWAY is nondeterministic for the same conformation, for example when running 100 calls on the same conformation?
Guillaume
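The test proposed here, many calls on one fixed conformation, can be sketched as follows. The helper names `count_distinct_outputs` and `make_getaway_call` are hypothetical. The RDKit part assumes a single ETKDGv3 conformer with a fixed seed and, unlike the code in the original report, keeps the hydrogens on the molecule; treat it as an illustration under those assumptions rather than the reporter's exact setup.

```python
from collections import Counter


def count_distinct_outputs(compute, n_calls=100):
    """Call compute() n_calls times and count how many distinct outputs appear."""
    seen = Counter()
    for _ in range(n_calls):
        # repr() of each float gives a hashable key in which NaN matches NaN.
        key = tuple(repr(float(v)) for v in compute())
        seen[key] += 1
    return len(seen)


def make_getaway_call(smiles, seed=0):
    """Embed one conformer for `smiles` and return a zero-argument GETAWAY call.

    RDKit is imported lazily so the helper above can be used without it.
    """
    from rdkit import Chem
    from rdkit.Chem import AllChem, rdMolDescriptors

    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    params = AllChem.ETKDGv3()
    params.randomSeed = seed
    conf_id = AllChem.EmbedMolecule(mol, params)
    if conf_id < 0:  # EmbedMolecule returns -1 when embedding fails
        raise ValueError(f"embedding failed for {smiles}")
    # The conformer is generated once; every call below reuses it.
    return lambda: rdMolDescriptors.CalcGETAWAY(mol, confId=conf_id)
```

With RDKit installed, `count_distinct_outputs(make_getaway_call("NCCS"))` returning 1 would mean all 100 calls agreed exactly; any larger number would point at the descriptor calculation itself rather than at conformer generation. The comparison is exact (no tolerance), which is the stricter test for reproducibility.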
On Wed, 10 Apr 2024 at 10:59, Jakub Adamczyk wrote:
… @greglandrum sure. I used HIV dataset from MoleculeNet, and smallest molecules from it (shortest SMILES, to be precise). I first generate conformers (with 1000 attempts), and then perform 2 GETAWAY calculations.
I get nondeterministic results on all 10 smallest molecules:
['NCCS', 'CNC=O', '[Li]F', '[Li]Cl', 'CS(C)=O', 'O=[As]O', 'O=C(O)O', 'CC(=O)O', 'C1SCSCS1', 'OCC(S)CS']
Among 50 smallest SMILES, 48 give different results on two GETAWAY runs.
Those 50 SMILES are:
['NCCS', 'CNC=O', '[Li]F', '[Li]Cl', 'CS(C)=O', 'O=[As]O', 'O=C(O)O', 'CC(=O)O', 'C1SCSCS1', 'OCC(S)CS', 'NC(=O)NO', 'OCCNNCCO', 'CN(C)CCS', 'Cl[Cu]Cl', 'CCOCNC=O', 'NCCCNCCS', 'S=C1NCCS1', 'N=C1NCCS1', 'O=S1OCCO1', 'S=C1SCCS1', 'CN1CSCSC1', 'Brc1csnn1', 'CNCCSSCCN', 'N=C1SCCS1', 'S=C1NCCO1', 'CONC(C)=O', 'CNC(C)C#N', 'O=S1CSCS1', 'CC(=O)C=O', 'COC=CCCCO', 'N#CNC(=N)N', 'CSSC(SC)SC', 'O=C1CCCCN1', 'NN1CCOC1=O', 'c1cn[nH]c1', 'O=S1CCCCS1', 'CSCCC(N)CO', 'O=C(O)CCCO', 'NCCSSSSCCN', 'NCC1=NCCN1', 'Nc1cccnc1O', 'OCc1cccnc1', 'N#Cc1cnsc1', 'OCC1CCCC1O', 'CC=C(I)CBr', 'O=C(O)C1CC1', 'S=C1CCCCCN1', 'C=Cc1ccccn1', 'C=Cc1ccncc1', 'NC1CCCCCCC1']
Only 'CNCCSSCCN' and 'NCCSSSSCCN' give the same result.
@thegodone to be precise, I first generate the conformations, and only then I calculate GETAWAY descriptors multiple times. So I get different descriptor values for the same conformation. You can see that in my code in the original question. For example, for SMILES