Why doesn't the rdkit MFP implementation set radius=2 bits for D1 atoms #7175

zackterray · 2024-02-21T03:45:55Z

I was looking at the bit info map for the RDKit Morgan fingerprints and noticed that D1 atoms (in the SMARTS sense: atoms with exactly one explicit connection) never set bits for radius = 2. Is this a feature of the original Morgan fingerprint algorithm? Maybe I'm misunderstanding how the algorithm works, but if it's intentional it seems like it would throw away useful information.

Relevant code:

from rdkit import Chem  # needed for Chem.MolFromSmiles below
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem.Draw import IPythonConsole
from rdkit import DataStructs
import rdkit

mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2,fpSize=2048)

def mol_with_atom_index(mol):
    for atom in mol.GetAtoms():
        atom.SetAtomMapNum(atom.GetIdx())
    return mol

mol = Chem.MolFromSmiles('*CCN1C=C(C)C2=CC=CC=C21')

ao = rdFingerprintGenerator.AdditionalOutput()
# we have to ask for the information we're interested in by allocating space for it:
ao.AllocateAtomCounts()
ao.AllocateAtomToBits()
ao.AllocateBitInfoMap()

fp = mfpgen.GetFingerprint(mol,additionalOutput=ao)

mol_with_atom_index(mol)

for v in ao.GetBitInfoMap().values():
    for i in v:
        # select shells where radius = 2
        if i[1] == 2:
            print(i)

gives the output:

(7, 2)
(5, 2)
(9, 2)
(10, 2)
(11, 2)
(12, 2)
(2, 2)
(8, 2)
(4, 2)
(3, 2)

If you look at the mol with atom indices, you can see that the two D1 atoms (0 and 6) do not set bits for radius = 2.

Alternatively, the atom counts show that those two atoms only set two bits each (radius 0 and radius 1):
ao.GetAtomCounts()
(2, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3)

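Another example, where the four D1 atoms (0, 8, 9 and 14) again set only two bits each:

mol = Chem.MolFromSmiles('SCC1=CC(CCC(O)=O)=CC2=CC(N)=CC=C21')
# recompute the fingerprint so the additional output refers to the new molecule
fp = mfpgen.GetFingerprint(mol,additionalOutput=ao)
ao.GetAtomCounts()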
(2, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3, 3, 3, 2, 3, 3, 3)

rdkit version: 2023.03.3

Answered by dehaenw

dehaenw · 2024-02-27T10:35:17Z

Hi,
This is because redundant environments are removed: an environment is dropped if it covers the same part of the molecule as one that was already recorded (either at a lower radius, or at the same radius with a lower invariant). Here the atom 0 radius 2 fragment is already covered by the atom 1 radius 1 Morgan environment. Likewise, the atom 6 radius 2 fragment is already covered by the atom 5 radius 1 fragment. This is also discussed in the Rogers and Hahn ECFP paper that the RDKit implementation is based on. There is also an "includeRedundantEnvironments" flag you can use if you need this information for your use case.
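
As a rough illustration of that flag, the sketch below builds two Morgan generators for the molecule from the question, one with the default settings and one with includeRedundantEnvironments=True, and compares which atoms centre a radius-2 environment. The helper name radius2_centers is made up for this example, and passing the flag as a keyword argument to GetMorganGenerator is an assumption based on recent RDKit releases; check the signature in your installed version.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def radius2_centers(mol, include_redundant):
    """Return the sorted atom indices that centre a radius-2 environment.

    Hypothetical helper for illustration only.
    """
    # Assumption: the generator accepts includeRedundantEnvironments as a keyword.
    gen = rdFingerprintGenerator.GetMorganGenerator(
        radius=2, fpSize=2048,
        includeRedundantEnvironments=include_redundant)
    ao = rdFingerprintGenerator.AdditionalOutput()
    ao.AllocateBitInfoMap()  # ask for the bit -> (atom, radius) mapping
    gen.GetFingerprint(mol, additionalOutput=ao)
    return sorted({atom
                   for hits in ao.GetBitInfoMap().values()
                   for atom, radius in hits
                   if radius == 2})


mol = Chem.MolFromSmiles('*CCN1C=C(C)C2=CC=CC=C21')
default_centers = radius2_centers(mol, include_redundant=False)
all_centers = radius2_centers(mol, include_redundant=True)

# Atoms that only get a radius-2 environment when redundant ones are kept
print(sorted(set(all_centers) - set(default_centers)))
```

Keeping redundant environments means the same substructure can be recorded more than once, so fingerprints made with and without the flag are not directly comparable.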

2 replies

dehaenw Feb 27, 2024

And this is true for D1 atoms in general. If a D1 atom is connected to another D1 atom, there will be two radius 1 fragments covering the same bond, so one of them is redundant. If it is connected to an atom with D > 1, the environment centered on that neighbor covers the same part of the molecule at a radius that is smaller by one. This is why every environment with D = 1 and r > 1 is redundant.
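
The argument above can be checked without RDKit. The sketch below is a simplified model, not the RDKit implementation: it describes each environment only by the set of bonds it covers, grows those sets one radius at a time, and flags an environment as redundant when its bond set has already been seen. The bond list for '*CCN1C=C(C)C2=CC=CC=C21' is written out by hand, the function names are invented for this example, and ties at the same radius are broken by atom index rather than by invariant.

```python
from collections import defaultdict

# Bonds of '*CCN1C=C(C)C2=CC=CC=C21' as atom-index pairs, written out by hand.
BONDS = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (5, 7), (7, 8),
         (8, 9), (9, 10), (10, 11), (11, 12), (12, 7), (12, 3)]


def bond_environments(bonds, max_radius):
    """Map (atom, radius) to the set of bonds that environment covers."""
    incident = defaultdict(list)
    for a, b in bonds:
        bond = frozenset((a, b))
        incident[a].append((bond, b))
        incident[b].append((bond, a))

    previous = {atom: frozenset() for atom in incident}  # radius 0: no bonds
    envs = {}
    for radius in range(1, max_radius + 1):
        current = {}
        for atom, links in incident.items():
            # Grow by one shell: own bonds plus the neighbors' previous sets.
            covered = set(previous[atom])
            for bond, neighbor in links:
                covered.add(bond)
                covered |= previous[neighbor]
            current[atom] = frozenset(covered)
            envs[(atom, radius)] = current[atom]
        previous = current
    return envs


def redundant_centers(bonds, radius):
    """Atoms whose environment at `radius` repeats an already-seen bond set."""
    envs = bond_environments(bonds, radius)
    atoms = sorted({atom for atom, _ in envs})
    seen = set()
    redundant = []
    for r in range(1, radius + 1):
        for atom in atoms:
            env = envs[(atom, r)]
            if env in seen:
                if r == radius:
                    redundant.append(atom)
            else:
                seen.add(env)
    return redundant


envs = bond_environments(BONDS, 2)
print(envs[(0, 2)] == envs[(1, 1)])  # atom 0 at r=2 vs. atom 1 at r=1
print(envs[(6, 2)] == envs[(5, 1)])  # atom 6 at r=2 vs. atom 5 at r=1
print(redundant_centers(BONDS, 2))
```

In this model the only radius-2 environments flagged as redundant are the ones centered on the two D1 atoms, which matches the atoms that set only two bits in the question.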

zackterray Feb 27, 2024
Author

Thanks for pointing me to the original paper. Looking at their diagram alongside your explanation makes a lot of sense!
