Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GETAWAY descriptors seem nondeterministic #7264

Open
j-adamczyk opened this issue Mar 16, 2024 · 8 comments
Open

GETAWAY descriptors seem nondeterministic #7264

j-adamczyk opened this issue Mar 16, 2024 · 8 comments
Labels

Comments

@j-adamczyk
Copy link

j-adamczyk commented Mar 16, 2024

Describe the bug

When I run GETAWAY descriptors multiple times, I get different results.

To Reproduce

I use molecules from HIV dataset from MoleculeNet, but I suspect this behavior is the same everywhere.

First, I keep largest components and generate conformers:

from rdkit.Chem.rdMolDescriptors import CalcGETAWAY
from rdkit.Chem import AddHs, MolFromSmiles, MolToSmiles, RemoveHs
from rdkit.Chem.MolStandardize.rdMolStandardize import LargestFragmentChooser
from rdkit.Chem.rdDistGeom import EmbedMolecule, ETKDGv3

chooser = LargestFragmentChooser()
X = [chooser.choose(mol) for mol in X]

X = [AddHs(mol) for mol in X]
embed_params = ETKDGv3()
embed_params.randomSeed = 0

conf_ids = []
for mol in X:
    conf_id = EmbedMolecule(mol, embed_params)
    conf_ids.append(conf_id)

X = [RemoveHs(mol) for mol in X]

for i in range(len(X)):
    X[i].conf_id = conf_ids[i]

I save conformer IDs as attributes of Mol objects, as a simple way to pass them around.

Then calculating the GETAWAY descriptors:

X_1 = np.array([CalcGETAWAY(mol, confId=mol.conf_id) for mol in X])
X_2 = np.array([CalcGETAWAY(mol, confId=mol.conf_id) for mol in X])

Now, if I check equality, I get an error:

assert np.allclose(X_1, X_2)

This can be verified element-by-element with:

xs, ys = np.nonzero(~np.isclose(X_1, X_2))
for x, y in zip(xs, ys):
    print(x, y, X_1[x, y], X_2[x, y])

Those differences can sometimes be very large:

0 13 2.194 2.263
0 22 1.0 0.0
0 23 6.0 4.0
0 38 0.0 nan
0 40 3.2388025299883392e+184 0.0
0 41 1.0 0.0
0 43 1.2955210119953357e+185 nan
0 50 0.0 0.041
0 53 2.2 2.281
0 92 0.0 0.033
0 93 2.846 2.911
0 100 0.0 0.463

Expected behavior
Deterministic calculation, or at least being able to set a random seed.

Configuration (please complete the following information):

  • RDKit version: 2023.9.5
  • OS: Ubuntu 22.04
  • Python version (if relevant): 3.11
  • Are you using conda? no
  • If you are not using conda: how did you install the RDKit? pip
@thegodone
Copy link
Contributor

thegodone commented Mar 17, 2024 via email

@j-adamczyk
Copy link
Author

I mean that if you run exactly the same code twice, you will get two different results. This applies also when conformers are already calculated, as you can see from my code. This way, results are not reproducible. In particular, any ML code running this would have different features at any run.

I think that either:

  1. Make the function deterministic, i.e. given conformers, it should return the same results for GETAWAY descriptors
  2. Add randomSeed argument to control this
  3. If options above are impossible, add a note to the docs that this is intended behavior

@j-adamczyk
Copy link
Author

j-adamczyk commented Mar 19, 2024

@thegodone the code is, indeed, nondeterministic, as commented here:

std::vector<double> clusterArray2(std::vector<double> data,

This should be fixed, and this is what this issue concerns.

@thegodone
Copy link
Contributor

thegodone commented Mar 19, 2024 via email

@greglandrum
Copy link
Member

@j-adamczyk can you share a specific molecule where you noticed non-deterministic behavior?
I have not been able to reproduce this.

@j-adamczyk
Copy link
Author

@greglandrum sure. I used HIV dataset from MoleculeNet, and smallest molecules from it (shortest SMILES, to be precise). I first generate conformers (with 1000 attempts), and then perform 2 GETAWAY calculations.

I get nondeterministic results on all 10 smallest molecules:

['NCCS', 'CNC=O', '[Li]F', '[Li]Cl', 'CS(C)=O', 'O=[As]O', 'O=C(O)O', 'CC(=O)O', 'C1SCSCS1', 'OCC(S)CS']

Among 50 smallest SMILES, 48 give different results on two GETAWAY runs. Those 50 SMILES are:

['NCCS', 'CNC=O', '[Li]F', '[Li]Cl', 'CS(C)=O', 'O=[As]O', 'O=C(O)O', 'CC(=O)O', 'C1SCSCS1', 'OCC(S)CS', 'NC(=O)NO', 'OCCNNCCO', 'CN(C)CCS', 'Cl[Cu]Cl', 'CCOCNC=O', 'NCCCNCCS', 'S=C1NCCS1', 'N=C1NCCS1', 'O=S1OCCO1', 'S=C1SCCS1', 'CN1CSCSC1', 'Brc1csnn1', 'CNCCSSCCN', 'N=C1SCCS1', 'S=C1NCCO1', 'CONC(C)=O', 'CNC(C)C#N', 'O=S1CSCS1', 'CC(=O)C=O', 'COC=CCCCO', 'N#CNC(=N)N', 'CSSC(SC)SC', 'O=C1CCCCN1', 'NN1CCOC1=O', 'c1cn[nH]c1', 'O=S1CCCCS1', 'CSCCC(N)CO', 'O=C(O)CCCO', 'NCCSSSSCCN', 'NCC1=NCCN1', 'Nc1cccnc1O', 'OCc1cccnc1', 'N#Cc1cnsc1', 'OCC1CCCC1O', 'CC=C(I)CBr', 'O=C(O)C1CC1', 'S=C1CCCCCN1', 'C=Cc1ccccn1', 'C=Cc1ccncc1', 'NC1CCCCCCC1']

Only 'CNCCSSCCN' and 'NCCSSSSCCN' give the same result.

@thegodone
Copy link
Contributor

thegodone commented Apr 10, 2024 via email

@j-adamczyk
Copy link
Author

@thegodone to be precise, I first generate the conformations, and only then I calculate GETAWAY descriptors multiple times. So I get different descriptor values for the same conformation. You can see that in my code in the original question.

For example, for SMILES "Brc1csnn1", I generate conformations, and I get conf_id = 0. When I calculate GETAWAY descriptors using that conformation two times, I get different values:

  • bit 38, first calculation gives 0.0, second nan
  • bit 41, first calculation gives 1.0, second 0.0
  • bit 42, first calculation gives -2.7464238958254607e+225 (basically +infinity), second nan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants