Enhanced stereochemistry leads to incorrect RegistrationHash for Meso compounds #7266

Open
mc-robinson opened this issue Mar 16, 2024 · 0 comments

Describe the bug

Consider the simple Meso compound: F[C@H](Br)[C@@H](F)Br

Because the compound is Meso, inverting both stereocenters gives F[C@@H](Br)[C@H](F)Br, which is the same molecule.

As expected, these two SMILES, when converted to mols, give the same RegistrationHash in RDKit:

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import RegistrationHash
import typing as T

def get_registration_hash(mol: Chem.Mol) -> str:
    return RegistrationHash.GetMolHash(
        RegistrationHash.GetMolLayers(mol),
        hash_scheme=RegistrationHash.HashScheme.TAUTOMER_INSENSITIVE_LAYERS,
    )

# by my eye, all should give the same hash
smis_to_reg_hash = [
    "F[C@H](Br)[C@@H](F)Br",
    "F[C@@H](Br)[C@H](F)Br",
]
for smi_to_reg_hash in smis_to_reg_hash:
    mol_to_reg_hash = Chem.MolFromSmiles(smi_to_reg_hash)
    reg_hash = get_registration_hash(mol_to_reg_hash)
    print(reg_hash)
    
>>>
de050130c16fc475c8376c8098b74f0962eacfce
de050130c16fc475c8376c8098b74f0962eacfce

However, when enhanced stereochemistry is added to these molecules, and the CXSMILES are used as input, the registration hashes begin to differ:

# by my eye, all should give the same hash
smis_to_reg_hash = [
    "F[C@H](Br)[C@@H](F)Br",
    "F[C@@H](Br)[C@H](F)Br",
    "F[C@H](Br)[C@@H](F)Br |a:1,3|",
    "F[C@@H](Br)[C@H](F)Br |a:1,3|",
    "F[C@H](Br)[C@@H](F)Br |&1:1,3|",  # whether this representation is even sensible is an open question
    "F[C@H](Br)[C@@H](F)Br |o1:1,3|",  # whether this representation is even sensible is an open question
]
for smi_to_reg_hash in smis_to_reg_hash:
    mol_to_reg_hash = Chem.MolFromSmiles(smi_to_reg_hash)
    reg_hash = get_registration_hash(mol_to_reg_hash)
    print(reg_hash)
>>>
de050130c16fc475c8376c8098b74f0962eacfce
de050130c16fc475c8376c8098b74f0962eacfce
b772ec095909a257989b664b5e89a72abf0cb39f
b72de1d4c264144ca5cea0fa0a716602e5a37d55
3c54e06ee032a115c4ea135ac585743a4bc57bd1
a22a99a14844dc864e050fa05fca8f99956112a7

Only the first two hashes are correct; by my reading, the remaining four should match them.

Additional context and discussion:

Examples that should not give the same hash
In case it is helpful, I have also collected variants of the first example whose enhanced stereo annotations should not give the same registration hash. In these cases, the annotation allows the non-Meso form of the molecule, so they cannot be the same species.

# these should not match the original
smis_to_reg_hash = [
    "F[C@H](Br)[C@@H](F)Br |&1:1,&2:3|", # non meso form could be chosen in mixture
    "F[C@H](Br)[C@@H](F)Br |o1:1,o2:3|",  # non meso form could be chosen as single species
]
for smi_to_reg_hash in smis_to_reg_hash:
    mol_to_reg_hash = Chem.MolFromSmiles(smi_to_reg_hash)
    reg_hash = get_registration_hash(mol_to_reg_hash)
    print(reg_hash)
>>>
5e855b2022c9df5d23f32c5fc935b1c25acd2b2d
f016097c3070c0e2102ef1a007ba572d8bda0121
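
The reasoning behind both sets of examples can be sketched with a toy model that does not use RDKit at all. It assumes a molecule with exactly two constitutionally equivalent stereocenters (as in F[C@H](Br)[C@@H](F)Br), labels each center "R" or "S" purely for illustration (the labels are not computed from the SMILES), and treats every AND/OR group as "all centers in the group invert together". The helper names `canonical` and `expand` are hypothetical.

```python
from itertools import product

FLIP = {"R": "S", "S": "R"}


def canonical(config):
    """Assumption: the two centers are equivalent, so label order is irrelevant."""
    return tuple(sorted(config))


def expand(config, groups):
    """Return the distinct species implied by a set of AND/OR stereo groups.

    config: tuple of 'R'/'S' labels, one per stereocenter, as drawn.
    groups: list of lists of center indices; centers in one group invert together.
    """
    species = set()
    # Each group is independently either kept as drawn or inverted.
    for flips in product([False, True], repeat=len(groups)):
        labels = list(config)
        for flip, group in zip(flips, groups):
            if flip:
                for idx in group:
                    labels[idx] = FLIP[labels[idx]]
        species.add(canonical(labels))
    return species


drawn = ("R", "S")  # the Meso form as drawn

one_group = expand(drawn, [[0, 1]])     # analogous to |&1:1,3| or |o1:1,3|
two_groups = expand(drawn, [[0], [1]])  # analogous to |&1:1,&2:3| or |o1:1,o2:3|

print(sorted(one_group))   # [('R', 'S')]
print(sorted(two_groups))  # [('R', 'R'), ('R', 'S'), ('S', 'S')]
```

In this model a single group spanning both centers expands to only the Meso species, which is why those inputs arguably describe the same compound as the unannotated SMILES. Two separate groups also reach the (R,R) and (S,S) forms, so they describe something different.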

How to handle AND, OR stereo with Meso compounds
This point perhaps requires a longer discussion, but the use of AND and OR groups when drawing Meso compounds raises an interesting question. In my experience, chemists may draw the following (CXSMILES N[C@@H]1CCC[C@H](N)C1 |&1:1,5|), which is a bit unusual: only one compound is present (a "mixture" of two identical species), not a mixture of two different compounds.

From my perspective, it would be ideal if RegistrationHash gave such a structure the same hash as the form without the AND group (CXSMILES N[C@@H]1CCC[C@H](N)C1).

However, I could also understand it if RegistrationHash rejected the AND-annotated form as an invalid representation.

Code relating to the above example:

smis_to_reg_hash = [
    "N[C@@H]1CCC[C@H](N)C1 |&1:1,5|",
    "N[C@@H]1CCC[C@H](N)C1",
    "N[C@H]1CCC[C@@H](N)C1",
]
for smi_to_reg_hash in smis_to_reg_hash:
    mol_to_reg_hash = Chem.MolFromSmiles(smi_to_reg_hash)
    reg_hash = get_registration_hash(mol_to_reg_hash)
    print(reg_hash)
>>>
0e107c287d119b0de8171e98d6e0cb4089cc4edb
697895487237eb0798005225715386ca29450eec
697895487237eb0798005225715386ca29450eec

InChI and Meso compound detection
A somewhat ancillary point: InChI provides a way to detect Meso compounds. Unfortunately, InChI does not handle enhanced stereochemistry, which severely limits its use here:

from rdkit.Chem.MolKey.InchiInfo import InchiInfo
smi = "F[C@H](Br)[C@@H](F)Br"
mol = Chem.MolFromSmiles(smi)
info = InchiInfo(Chem.MolToInchi(mol))

# returns True if compound is Meso despite having 
# stereocenters in get_sp3_stereo['main']['non-isotopic'][0]
# see https://www.rdkit.org/docs/source/rdkit.Chem.MolKey.InchiInfo.html
print(f"is Meso according to Inchi: {info.get_sp3_stereo()['main']['non-isotopic'][2]}")
>>>
is Meso according to Inchi: True
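
For readers who want to see what such a flag can be based on, here is a rough sketch that inspects the InChI string directly. It relies on my understanding that a standard InChI for a Meso compound carries a /t layer with two or more defined centers but no /m (and /s) layer. This is a heuristic and not necessarily how InchiInfo is implemented. The function name `inchi_looks_meso` is hypothetical, only single-component InChIs are considered, and the two tartaric acid strings are quoted as illustrative inputs that should be double-checked against a trusted source.

```python
def inchi_looks_meso(inchi: str) -> bool:
    """Heuristic: two or more defined sp3 centers in /t and no /m layer."""
    main_layers = []
    for layer in inchi.split("/")[1:]:
        # Stop at the isotopic, fixed-H, or reconnected layers;
        # only the main stereo layers are inspected here.
        if layer[:1] in ("i", "f", "r"):
            break
        main_layers.append(layer)

    t_layer = next((l[1:] for l in main_layers if l.startswith("t")), "")
    # Keep only centers with a defined parity, e.g. "1-" or "2+".
    defined = [c for c in t_layer.split(",") if c and c[-1] in "+-"]
    has_m = any(l.startswith("m") for l in main_layers)
    return len(defined) >= 2 and not has_m


# Illustrative inputs (assumed correct, please verify): meso- and L-tartaric acid.
meso_tartaric = (
    "InChI=1S/C4H6O6/c5-1(3(7)8)2(6)4(9)10"
    "/h1-2,5-6H,(H,7,8)(H,9,10)/t1-,2+"
)
l_tartaric = (
    "InChI=1S/C4H6O6/c5-1(3(7)8)2(6)4(9)10"
    "/h1-2,5-6H,(H,7,8)(H,9,10)/t1-,2-/m1/s1"
)

print(inchi_looks_meso(meso_tartaric))  # True
print(inchi_looks_meso(l_tartaric))     # False
```

Because the check works on the InChI string alone, it inherits the same limitation noted above: any enhanced stereo annotation is lost before this point.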

I know this is a rather annoying, esoteric topic, but it does seem to come up a bit when doing exact matches of compound collections. Please let me know if there is any other info I could help provide, or if I can help brainstorm solutions. Thanks!

Configuration (please complete the following information):

  • RDKit version: 2023.09.5
  • OS: MacOS
  • Python version (if relevant): 3.11.5
  • Are you using conda? yes, but pip within conda to install rdkit
  • If you are using conda, which channel did you install the rdkit from? NA
  • If you are not using conda: how did you install the RDKit? pip, https://pypi.org/project/rdkit-pypi/