The substructures represented by certain bits in Morgan fingerprint binary vectors vary for different molecules #7094

Rainsmumu · 2024-01-26T00:56:12Z

In materials informatics, a common workflow for predicting properties of organic compounds starts by converting each SMILES string in a dataset into a Morgan fingerprint. The fingerprints are then fed into an interpretable model such as a random forest, and feature importance is used to identify the important bits (substructures) and gain chemical insight.

However, I recently ran into an issue: the same bit in a Morgan fingerprint can represent different substructures in different molecules.

For example, I converted the two molecules 'C=CC(=O)OCC(C)C(C)CC' and 'C=CC(=O)Oc1cccc(C(=O)OC)c1' into Morgan fingerprints with a length of 2048 and a radius of 3.

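from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# mfpgen was not defined in the original post; this assumes the settings
# described above (radius 3, 2048 bits)
mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=3, fpSize=2048)

mol1 = Chem.MolFromSmiles('C=CC(=O)OCC(C)C(C)CC')
mol2 = Chem.MolFromSmiles('C=CC(=O)Oc1cccc(C(=O)OC)c1')

fp1 = mfpgen.GetCountFingerprintAsNumPy(mol1)
fp2 = mfpgen.GetCountFingerprintAsNumPy(mol2)

print(fp1[1949])               # count stored at index 1949 for mol1
print(fp1[1949] == fp2[1949])  # True when both molecules store the same count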

You can see that both molecules have a value of 1 at bit 1949, but the substructures that bit represents differ:
for 'C=CC(=O)OCC(C)C(C)CC':

for 'C=CC(=O)Oc1cccc(C(=O)OC)c1':

They are completely different. I understand how the Morgan algorithm works: the two substructures have different identifiers, but when those identifiers are compressed into a 2048-bit vector they give the same value modulo 2048. My question is this: in a dataset where the substructures represented by a given feature column (i.e., bit) are not the same, which substructure should I choose for bit 1949 when I use feature importance to assess the impact of different substructures? And even without feature importance, is it really acceptable for an ML model if the same feature column has different meanings for different molecules?

bp-kelley · 2024-01-27T15:57:18Z

Collaborator

This is a common misconception about how Morgan Fingerprints work. There are two phases:

First, determine the invariants of the atoms within a certain radius. This is similar to canonicalization: it generates an invariant "hash" for a given arrangement of atoms and records it. The number of distinct hashes is not limited to a fixed fingerprint length. This example from the RDKit Book creates a sparse fingerprint:

>>> from rdkit import Chem
>>> from rdkit.Chem import AllChem
>>> fpgen = AllChem.GetMorganGenerator(radius=2)
>>> m1 = Chem.MolFromSmiles('Cc1ccccc1')
>>> fp1 = fpgen.GetSparseCountFingerprint(m1)

The fingerprint is "sparse" in the sense that it stores only the hashes actually found in the molecule (here together with their counts) instead of folding them into a fixed number of bits. This representation has the characteristics you are looking for.
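
One way to use the sparse representation as model input is to give every distinct identifier in the dataset its own column. The following is a minimal sketch, not a recommended pipeline. It assumes the `rdFingerprintGenerator` module, a radius of 3 as in the question, and the two SMILES from the question as a stand-in dataset; the names `counts`, `columns`, and `matrix` are illustrative only.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Stand-in dataset: the two molecules from the question
smiles = ['C=CC(=O)OCC(C)C(C)CC', 'C=CC(=O)Oc1cccc(C(=O)OC)c1']

# Assumption: radius 3, as described in the question
fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=3)

# One {identifier: count} dict per molecule, taken from the unfolded fingerprint
counts = []
for smi in smiles:
    mol = Chem.MolFromSmiles(smi)
    sparse_fp = fpgen.GetSparseCountFingerprint(mol)
    counts.append(sparse_fp.GetNonzeroElements())

# One column per identifier seen anywhere in the dataset
columns = sorted(set().union(*counts))

# Dense count matrix: rows are molecules, columns are identifiers
matrix = [[c.get(col, 0) for col in columns] for c in counts]

print(len(columns), "columns for", len(matrix), "molecules")
```

Because each column corresponds to exactly one identifier, a feature importance computed on such a matrix refers to a single atom environment. The trade-off is that the number of columns grows with the dataset.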

Second, when you make a folded "Fingerprint", the larger set of invariants has to be stuffed into a smaller space, e.g. 2048 bits:

>>> fp1 = fpgen.GetFingerprint(m1)
>>> fp1
<rdkit.DataStructs.cDataStructs.ExplicitBitVect object at 0x...>
>>> len(fp1)
2048

Taking a larger set and forcing it into a smaller one can cause collisions, which is what you are seeing here.
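
To check which bits of a folded fingerprint are shared by more than one distinct identifier across a dataset, something like the sketch below could be used. It is plain Python and assumes that folding is a simple modulo by the fingerprint size, as described in the question. The function `find_collisions` and the identifiers in the usage example are hypothetical; with RDKit, the identifier sets would be the keys returned by `GetSparseCountFingerprint(mol).GetNonzeroElements()`.

```python
from collections import defaultdict


def find_collisions(identifier_sets, fp_size=2048):
    """Return {bit: identifiers} for bits hit by more than one distinct identifier.

    identifier_sets: one iterable of unfolded identifiers per molecule.
    Assumes an identifier is folded to position `identifier % fp_size`.
    """
    ids_per_bit = defaultdict(set)
    for identifiers in identifier_sets:
        for identifier in identifiers:
            ids_per_bit[identifier % fp_size].add(identifier)
    # Keep only the bits that received at least two different identifiers
    return {bit: ids for bit, ids in ids_per_bit.items() if len(ids) > 1}


# Hypothetical identifiers, chosen so that two different ones fold to bit 1949.
# The identifier 12 occurs in both molecules; that is a shared feature,
# not a collision.
mol_a = {1949 + 2048 * 7, 12}
mol_b = {1949 + 2048 * 31, 12}

collisions = find_collisions([mol_a, mol_b])
print(sorted(collisions))  # bits where distinct identifiers collide
```

A bit that shows up in the result mixes several substructures, so any feature importance assigned to it cannot be attributed to a single one of them.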
