Consider the simple Meso compound: F[C@H](Br)[C@@H](F)Br
Since the compound is Meso, the following molecule, in which each supposed stereocenter is flipped, is identical: F[C@@H](Br)[C@H](F)Br
As expected, these two SMILES, when converted to mols, give the same RegistrationHash in RDKit:
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import RegistrationHash
import typing as T
def get_registration_hash(mol: Chem.Mol) -> str:
    return RegistrationHash.GetMolHash(
        RegistrationHash.GetMolLayers(mol),
        hash_scheme=RegistrationHash.HashScheme.TAUTOMER_INSENSITIVE_LAYERS,
    )
# by my eye, all should give the same hash
smis_to_reg_hash = [
    "F[C@H](Br)[C@@H](F)Br",
    "F[C@@H](Br)[C@H](F)Br",
]
for smi_to_reg_hash in smis_to_reg_hash:
    mol_to_reg_hash = Chem.MolFromSmiles(smi_to_reg_hash)
    reg_hash = get_registration_hash(mol_to_reg_hash)
    print(reg_hash)
>>>
de050130c16fc475c8376c8098b74f0962eacfce
de050130c16fc475c8376c8098b74f0962eacfce
However, when enhanced stereochemistry is added to these molecules, and the CXSMILES are used as input, the registration hashes begin to differ:
# by my eye, all should give the same hash
smis_to_reg_hash = [
    "F[C@H](Br)[C@@H](F)Br",
    "F[C@@H](Br)[C@H](F)Br",
    "F[C@H](Br)[C@@H](F)Br |a:1,3|",
    "F[C@@H](Br)[C@H](F)Br |a:1,3|",
    "F[C@H](Br)[C@@H](F)Br |&1:1,3|",  # whether this representation is even meaningful is an open question
    "F[C@H](Br)[C@@H](F)Br |o1:1,3|",  # whether this representation is even meaningful is an open question
]
for smi_to_reg_hash in smis_to_reg_hash:
    mol_to_reg_hash = Chem.MolFromSmiles(smi_to_reg_hash)
    reg_hash = get_registration_hash(mol_to_reg_hash)
    print(reg_hash)
>>>
de050130c16fc475c8376c8098b74f0962eacfce
de050130c16fc475c8376c8098b74f0962eacfce
b772ec095909a257989b664b5e89a72abf0cb39f
b72de1d4c264144ca5cea0fa0a716602e5a37d55
3c54e06ee032a115c4ea135ac585743a4bc57bd1
a22a99a14844dc864e050fa05fca8f99956112a7
where only the first two are correct.
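To see what differs between these inputs before any hashing happens, it may help to list the enhanced stereo groups RDKit attaches to each parsed molecule. The following is a minimal diagnostic sketch, not part of RegistrationHash; the helper name `stereo_groups` is made up for this illustration, and it assumes `Chem.MolFromSmiles` is left at its default of parsing the CXSMILES extension.

```python
from rdkit import Chem


def stereo_groups(cxsmiles: str):
    """Return (group type, sorted atom indices) for each enhanced stereo group.

    Hypothetical helper for illustration; relies on MolFromSmiles parsing
    the CXSMILES extension, which is its default behaviour.
    """
    mol = Chem.MolFromSmiles(cxsmiles)
    return [
        (group.GetGroupType(), sorted(atom.GetIdx() for atom in group.GetAtoms()))
        for group in mol.GetStereoGroups()
    ]


for smi in [
    "F[C@H](Br)[C@@H](F)Br",
    "F[C@H](Br)[C@@H](F)Br |a:1,3|",
    "F[C@H](Br)[C@@H](F)Br |&1:1,3|",
    "F[C@H](Br)[C@@H](F)Br |o1:1,3|",
]:
    print(smi, stereo_groups(smi))
```

The plain SMILES carries no stereo group at all, while the annotated inputs each carry one group (absolute, AND, or OR) over atoms 1 and 3, so the inputs already differ at the molecule level even though they describe the same Meso species.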
Additional context and discussion:
Examples that should not give the same hash
For this first example above, if helpful, I have also collected examples of molecules with enhanced stereo annotations that should not give the same registration hash. In these cases, the possibility of giving the non-meso form of the molecules precludes them from being the same species.
# these should not match the original
smis_to_reg_hash = [
    "F[C@H](Br)[C@@H](F)Br |&1:1,&2:3|",  # non meso form could be chosen in mixture
    "F[C@H](Br)[C@@H](F)Br |o1:1,o2:3|",  # non meso form could be chosen as single species
]
for smi_to_reg_hash in smis_to_reg_hash:
    mol_to_reg_hash = Chem.MolFromSmiles(smi_to_reg_hash)
    reg_hash = get_registration_hash(mol_to_reg_hash)
    print(reg_hash)
>>>
5e855b2022c9df5d23f32c5fc935b1c25acd2b2d
f016097c3070c0e2102ef1a007ba572d8bda0121
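The reasoning behind "should match" versus "should not match" can be spelled out with a toy model that does not need RDKit. This is only a sketch under stated assumptions: the two centers are treated as constitutionally equivalent, the drawn Meso form is given the placeholder labels ("R", "S") (these are not CIP labels computed for the real molecule), and each AND/OR group is modeled as a set of centers that flip together. For an AND group the resulting set is the mixture; for an OR group it is the list of candidates for the single species.

```python
from itertools import product

FLIP = {"R": "S", "S": "R"}
DRAWN = ("R", "S")  # placeholder labels for the drawn Meso form (assumption)


def canonical(config):
    # The two centers are equivalent, so (R, S) and (S, R) are the same species.
    return tuple(sorted(config))


def species(groups):
    """Enumerate the distinct species implied by a list of stereo groups.

    Each group is a tuple of center positions (0 or 1) that invert together.
    An empty list models absolute stereo (nothing may flip).
    """
    found = set()
    for flips in product([False, True], repeat=len(groups)):
        config = list(DRAWN)
        for flip, group in zip(flips, groups):
            if flip:
                for idx in group:
                    config[idx] = FLIP[config[idx]]
        found.add(canonical(config))
    return found


print(sorted(species([])))            # [('R', 'S')]
print(sorted(species([(0, 1)])))      # [('R', 'S')]
print(sorted(species([(0,), (1,)])))  # [('R', 'R'), ('R', 'S'), ('S', 'S')]
```

With both centers in one group (the `&1:1,3` and `o1:1,3` cases), flipping the group maps the Meso form onto itself, so only one species is possible. With the centers in separate groups (`&1:1,&2:3` and `o1:1,o2:3`), the non-meso forms become reachable, which is why those inputs describe something different.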
How to handle AND, OR stereo with Meso compounds
This point perhaps requires a longer discussion, but the use of AND, OR groups in the drawing of Meso compounds is an interesting point. For example, in my experience, chemists may draw a structure corresponding to the CXSMILES N[C@@H]1CCC[C@H](N)C1 |&1:1,5|, which is a bit unusual, since there is only one compound present (a mixture of two identical species), not a mixture of two different compounds.
From my perspective, it would be ideal if RegistrationHash gave the same hash in these cases as it does for the unannotated structure (SMILES N[C@@H]1CCC[C@H](N)C1).
However, I could also understand the behavior if RegistrationHash rejected the AND species as a valid representation.
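As a possible starting point for either behavior, one could check whether a single AND/OR group that covers every stereocenter is redundant, by inverting all centers and comparing canonical SMILES. The sketch below is an assumption about how such a check might look, not something RegistrationHash does today. The helper names are hypothetical, the input must be a plain SMILES with the CXSMILES extension already removed, and only simple `@`/`@@` tetrahedral tags are handled.

```python
from rdkit import Chem


def invert_all_centers(smiles: str) -> str:
    """Swap every '@' and '@@' tag in a plain SMILES string (no CX extension)."""
    placeholder = "\0"  # temporary marker that cannot occur in a SMILES
    return (
        smiles.replace("@@", placeholder)
        .replace("@", "@@")
        .replace(placeholder, "@")
    )


def is_unchanged_by_full_inversion(smiles: str) -> bool:
    """True when inverting every tetrahedral center gives the same canonical SMILES.

    This holds for Meso compounds, but also for any other achiral molecule.
    """
    original = Chem.MolFromSmiles(smiles)
    mirrored = Chem.MolFromSmiles(invert_all_centers(smiles))
    return Chem.MolToSmiles(original) == Chem.MolToSmiles(mirrored)


print(is_unchanged_by_full_inversion("F[C@H](Br)[C@@H](F)Br"))
print(is_unchanged_by_full_inversion("N[C@@H]1CCC[C@H](N)C1"))
```

When the check returns True and the only stereo group spans all stereocenters, the AND/OR annotation adds no information, so a pre-processing step could drop it (or flag the input) before hashing. Molecules with several groups, or with a group covering only some of the centers, would need a more careful treatment.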
Inchi and Meso compound detection
A bit of an ancillary point, but I will note that InChI provides a method of detecting Meso compounds. Unfortunately, InChI does not handle enhanced stereochemistry, which severely limits its use here:
from rdkit.Chem.MolKey.InchiInfo import InchiInfo
smi = "F[C@H](Br)[C@@H](F)Br"
mol = Chem.MolFromSmiles(smi)
info = InchiInfo(Chem.MolToInchi(mol))
# returns True if compound is Meso despite having
# stereocenters in get_sp3_stereo['main']['non-isotopic'][0]
# see https://www.rdkit.org/docs/source/rdkit.Chem.MolKey.InchiInfo.html
print(f"is Meso according to Inchi: {info.get_sp3_stereo()['main']['non-isotopic'][2]}")
>>>
is Meso according to Inchi: True
I know this is a rather annoying, esoteric topic, but it does seem to come up a bit when doing exact matches of compound collections. Please let me know if there is any other info I could help provide, or if I can help brainstorm solutions. Thanks!
Configuration (please complete the following information):
RDKit version: 2023.09.5
OS: MacOS
Python version (if relevant): 3.11.5
Are you using conda? yes, but pip within conda to install rdkit
If you are using conda, which channel did you install the rdkit from? NA