Bemis-Murcko scaffolds and their variants #6844
dehaenw asked this question in Show and tell
As has occasionally been mentioned in discussions, on the mailing list, etc., the RDKit Murcko scaffolds do not fully agree with those defined by Bemis and Murcko in their seminal paper (https://pubs.acs.org/doi/10.1021/jm9602928). This looks to be a conscious choice; see the discussion here: #4093. That said, it is actually straightforward to use the functions provided by RDKit to obtain scaffolds that are true to the paper. This post explores how to do that and how different these variants really are.
First things first: there is some degree of confusion about what is what. In their paper, Bemis and Murcko talk about two types of "Molecular Frameworks". Molecules are deconstructed into sidechains, rings and linkers. The framework is then the part of the molecule that remains once the sidechains are removed (more on this later). They call this the "atomic framework". It can be made generic by converting all bond types to single bonds and all atom types to carbon, which gives what Bemis and Murcko call a "graph framework". Using the language of Bajorath and coworkers in this paper: https://pubs.acs.org/doi/10.1021/ci200179y (section Definitions), I will refer to the atomic frameworks defined by Bemis and Murcko as BM scaffolds, and to their generic (graph) frameworks as CSKs (Cyclic SKeletons).
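As a minimal sketch of the two framework types (this is not code from the original post), the RDKit functions discussed further down can be applied to a molecule without any exo-bonded atoms, where all approaches agree. The example molecule is an arbitrary choice for illustration.

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

# Arbitrary illustrative molecule: two rings joined by a linker, plus an ethyl sidechain.
mol = Chem.MolFromSmiles("CCc1ccc(CCN2CCOCC2)cc1")

# Atomic framework (BM scaffold): sidechains removed, rings and linker kept.
bm_scaffold = MurckoScaffold.GetScaffoldForMol(mol)

# Graph framework (CSK): every atom becomes carbon, every bond becomes single.
csk = MurckoScaffold.MakeScaffoldGeneric(bm_scaffold)

bm_smiles = Chem.MolToSmiles(bm_scaffold)
csk_smiles = Chem.MolToSmiles(csk)
print("BM scaffold:", bm_smiles)
print("CSK:        ", csk_smiles)
```

Because this molecule has no double-bonded substituents on its rings or linker, the RDKit result here coincides with the paper's definition.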
Now to the main point of disagreement between approaches: the fate of exo-bonded sidechains on the ring (compare, for example, C1CC1=O and C1CC1=N). There is an argument to be made for each of these, depending on how generic you want your scaffolds to be. The fact is that, depending on which BM scaffold definition you choose, you will end up with a different number of scaffolds.
A second complication occurs during the conversion to generic scaffolds or CSKs. In the RDKit case, because exo atoms are retained, simple flattening with MurckoScaffold.MakeScaffoldGeneric() turns these double-bonded atoms into single-bonded substituents. Fortunately, removing them only takes another round of MurckoScaffold.GetScaffoldForMol(). The question is: are people doing that? Probably not, because the RDKit documentation suggests just running MakeScaffoldGeneric() (https://www.rdkit.org/docs/GettingStartedInPython.html#murcko-decomposition). Again, there is an argument to be made for either case, but here the difference in the total number of unique scaffolds is very big. (The Bajorath CSKs and the Bemis and Murcko CSKs are the same, because the two-electron placeholder gets scrubbed.)
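The flattening behaviour can be sketched on the two small rings mentioned earlier. This is an illustrative sketch rather than the original post's code; the helper name rdkit_generic and its strip_exo flag are hypothetical.

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold


def rdkit_generic(smiles, strip_exo=False):
    """Return the generic scaffold SMILES (hypothetical helper).

    With strip_exo=True, a second GetScaffoldForMol() pass removes the
    exo atoms that flattening turned into single-bonded substituents.
    """
    mol = Chem.MolFromSmiles(smiles)
    scaffold = MurckoScaffold.GetScaffoldForMol(mol)
    generic = MurckoScaffold.MakeScaffoldGeneric(scaffold)
    if strip_exo:
        generic = MurckoScaffold.GetScaffoldForMol(generic)
    return Chem.MolToSmiles(generic)


for smi in ("C1CC1=O", "C1CC1=N"):
    print(smi, rdkit_generic(smi), rdkit_generic(smi, strip_exo=True))
```

Without the second pass, the flattened exo atom survives as a methyl-like substituent on the ring; with it, only the bare ring remains.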
Below, I have some example code that allows one to calculate these different scaffold variants to compare how different they really are in terms of occurrence.
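The original code block did not survive in this copy of the post, so what follows is a hedged reconstruction sketch and not necessarily the author's implementation. It assumes that exo atoms can be matched with the SMARTS [$([D1]=[*])], that the two-electron placeholder can be modelled as a dummy atom, and that Chem.ReplaceSubstructs keeps the double bond to the replaced atom. The function name scaffold_smiles and the variant labels are hypothetical.

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

# Assumption: an exo atom is a terminal atom attached through a double bond.
EXO_ATOM = Chem.MolFromSmarts("[$([D1]=[*])]")
# Assumption: a dummy atom stands in for the two-electron placeholder.
PLACEHOLDER = Chem.MolFromSmiles("*")


def scaffold_smiles(smiles, variant="rdkit", generic=False):
    """Scaffold SMILES for variant 'rdkit', 'true_bm' or 'bajorath' (hypothetical labels)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    scaffold = MurckoScaffold.GetScaffoldForMol(mol)
    if variant == "true_bm":
        # Swap every exo atom for the placeholder.
        scaffold = Chem.ReplaceSubstructs(scaffold, EXO_ATOM, PLACEHOLDER, replaceAll=True)[0]
    elif variant == "bajorath":
        # Drop exo atoms entirely.
        scaffold = Chem.DeleteSubstructs(scaffold, EXO_ATOM)
    elif variant != "rdkit":
        raise ValueError(f"unknown variant: {variant!r}")
    if generic:
        scaffold = MurckoScaffold.MakeScaffoldGeneric(scaffold)
        if variant != "rdkit":
            # Second pass scrubs the flattened placeholder from the CSK.
            scaffold = MurckoScaffold.GetScaffoldForMol(scaffold)
    return Chem.MolToSmiles(scaffold)


for smi in ("CC1CC1=O", "CC1CC1=N"):
    for variant in ("rdkit", "true_bm", "bajorath"):
        print(smi, variant, scaffold_smiles(smi, variant), scaffold_smiles(smi, variant, generic=True))
```

One caveat with this sketch: deleting an exo atom from an aromatic ring (a pyridone, for instance) may leave a structure that RDKit cannot kekulize, so the "bajorath" branch would probably need extra care on real datasets.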
You can see a visual representation of the difference between scaffold types:
Note that the two-electron placeholder in the "True BM scaffold" is represented as =*
This image was generated using the below code block:
I did the analysis on the ChEMBL set from Guacamol (downloadable at https://figshare.com/articles/dataset/GuacaMol_All_SMILES/7322252 , 1.59M molecules). As you can see, the differences are pretty big. A few molecules were skipped because they contain atoms with a degree > 4, which all-carbon CSKs can't deal with.
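A counting loop of this kind could look roughly like the sketch below. It is an illustration only, shown for the plain RDKit scaffold; the function name count_unique_scaffolds is hypothetical, and the short SMILES list stands in for the real dataset file.

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold


def count_unique_scaffolds(smiles_iter, generic=False):
    """Return (number of unique scaffold SMILES, number of skipped inputs)."""
    seen = set()
    skipped = 0
    for smi in smiles_iter:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # unparsable SMILES
            skipped += 1
            continue
        try:
            scaffold = MurckoScaffold.GetScaffoldForMol(mol)
            if generic:
                # May raise for atoms with degree > 4, which carbon cannot accommodate.
                scaffold = MurckoScaffold.MakeScaffoldGeneric(scaffold)
            # Acyclic molecules give an empty scaffold, recorded here as "".
            seen.add(Chem.MolToSmiles(scaffold))
        except Exception:
            skipped += 1
    return len(seen), skipped


# Toy stand-in for a dataset; the last entry is deliberately invalid.
toy_smiles = ["CC1CC1=O", "CC1CC1=N", "Cc1ccccc1", "C1CC("]
print(count_unique_scaffolds(toy_smiles))
print(count_unique_scaffolds(toy_smiles, generic=True))
```

Passing each scaffold variant through the same loop and comparing the resulting counts is one way to quantify how much the choice of definition matters for a given dataset.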
As splitting by Murcko type scaffolds is widely used in various cheminformatics tasks, it is highly recommended to think properly about what the best scaffold for your task is. I hope this exploration of the differences between them may be of some help, and feel free to add any comments/disagreements!