Bemis-Murcko scaffolds and their variants #6844

dehaenw · 2023-10-27T12:29:47Z


As has occasionally been mentioned in discussions, on the mailing list, etc., the RDKit Murcko scaffolds do not fully agree with those defined by Bemis and Murcko in their seminal paper (https://pubs.acs.org/doi/10.1021/jm9602928). This looks to be a conscious choice, see the discussion here: #4093. That said, it is straightforward to use the functions RDKit provides to obtain scaffolds that are true to the paper. This post explores how to do that and how much these variants really differ.

First things first: there is some confusion about terminology. In their paper, Bemis and Murcko describe two types of "molecular frameworks". Molecules are decomposed into side chains, rings and linkers. The framework is the part of the molecule that remains once the side chains are removed (more on this later); they call this the "atomic framework". It can be made generic by converting all bonds to single bonds and all atoms to carbon, which gives what Bemis and Murcko call a "graph framework". Following the terminology used by Bajorath and coworkers in this paper: https://pubs.acs.org/doi/10.1021/ci200179y (section Definitions), I will refer to the atomic framework as defined by Bemis and Murcko as the BM scaffold, and to the graph framework as defined by Bemis and Murcko as the CSK (Cyclic SKeleton).
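To make the two framework types concrete, here is a minimal sketch using the two RDKit functions discussed in this post. The example molecule is an arbitrary one chosen for illustration, and note that `GetScaffoldForMol()` gives the RDKit flavour of the BM scaffold, which only matches the paper's definition when no exo-bonded atoms are present (as is the case here).

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

# Arbitrary illustration molecule: an ethyl side chain on a
# phenyl-ethylene-pyridyl framework (no exo double bonds).
mol = Chem.MolFromSmiles("CCc1ccc(CCc2ccncc2)cc1")

# "Atomic framework" / BM scaffold: side chains removed, rings and linker kept.
bm = MurckoScaffold.GetScaffoldForMol(mol)

# "Graph framework" / CSK: every atom becomes carbon, every bond becomes single.
csk = MurckoScaffold.MakeScaffoldGeneric(bm)

print(Chem.MolToSmiles(bm))
print(Chem.MolToSmiles(csk))
```

The ethyl group is dropped in the first step, and the heteroatom and aromaticity information is dropped in the second.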

Now on to the main point of disagreement between the approaches: the fate of exo-bonded sidechains on the ring.

RDKit: retain the first atom of the exo-bonded substituent (e.g. it distinguishes between C1CC1=O and C1CC1=N).
Bemis and Murcko: remove these substituents, but leave a two-electron placeholder per exo bond (there can be several per atom, for example in sulfones). Chart 2 of Bemis and Murcko's paper shows that they really do distinguish between the presence and absence of a two-electron placeholder, as the benzylbenzene scaffold and the benzophenone-derived scaffold both appear there as distinct entries.
Bajorath: remove these substituents and do not leave a placeholder (for example, sulfonamide becomes *SN*).
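The RDKit behaviour in the first item can be checked directly. This is a small sketch with two illustrative molecules (methyl-substituted versions of the cyclopropane examples above, so that there is a side chain to strip); the variable names are my own.

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

# Same ring, same methyl side chain, different exo double-bonded atom.
ketone = Chem.MolFromSmiles("CC1CC1=O")
imine = Chem.MolFromSmiles("CC1CC1=N")

ketone_scaffold = MurckoScaffold.GetScaffoldForMol(ketone)
imine_scaffold = MurckoScaffold.GetScaffoldForMol(imine)

# The methyl group is removed, but the exo atom is kept,
# so the two RDKit scaffolds are not identical.
ketone_smiles = Chem.MolToSmiles(ketone_scaffold)
imine_smiles = Chem.MolToSmiles(imine_scaffold)
print(ketone_smiles, imine_smiles, ketone_smiles != imine_smiles)
```

Under the Bemis and Murcko or Bajorath definitions these two molecules would instead share a single scaffold.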

There is an argument to be made for each of these, depending on how generic you want your scaffolds to be. The fact is that, depending on which BM scaffold definition you choose, you will end up with a different number of scaffolds.

A second complication occurs during the conversion to generic scaffolds or CSKs. In the RDKit case, because exo atoms are retained, simple flattening with MurckoScaffold.MakeScaffoldGeneric() turns these double-bonded atoms into single-bonded substituents. Fortunately, one more round of MurckoScaffold.GetScaffoldForMol() is enough to remove them. The question is: are people doing that? Probably not, because the RDKit documentation suggests just running MakeScaffoldGeneric() (https://www.rdkit.org/docs/GettingStartedInPython.html#murcko-decomposition). Again, there is an argument to be made for either choice, but here the difference in the total number of unique scaffolds is very large. (The Bajorath and the Bemis and Murcko CSKs are identical, because the two-electron placeholder is scrubbed during the conversion.)
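The effect of that second round is easiest to see on a small case. The sketch below uses cyclohexanone as an illustrative input; the variable names are my own.

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

mol = Chem.MolFromSmiles("O=C1CCCCC1")  # cyclohexanone, illustrative input

# RDKit scaffold: the exo oxygen is retained.
rdkit_bm = MurckoScaffold.GetScaffoldForMol(mol)

# Flattening turns the exo oxygen into a single-bonded carbon substituent.
rdkit_generic = MurckoScaffold.MakeScaffoldGeneric(rdkit_bm)

# A second scaffold extraction strips that substituent again,
# leaving only the ring.
true_csk = MurckoScaffold.GetScaffoldForMol(rdkit_generic)

print(Chem.MolToSmiles(rdkit_generic), rdkit_generic.GetNumAtoms())
print(Chem.MolToSmiles(true_csk), true_csk.GetNumAtoms())
```

Without the second round, cyclohexanone and cyclohexane end up with different generic scaffolds; with it, they collapse onto the same six-membered ring.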

Below, I have some example code that allows one to calculate these different scaffold variants to compare how different they really are in terms of occurrence.

You can see a visual representation of the difference between scaffold types:

Note that the two-electron placeholder in the "True BM" scaffold is represented as =*

This image was generated using the code block below:

from rdkit import Chem
from rdkit.Chem import AllChem, Draw
from rdkit.Chem.Scaffolds import MurckoScaffold

# Terminal (degree 1) atom attached through a double bond, i.e. an exo atom.
PATT = Chem.MolFromSmarts("[$([D1]=[*])]")
# Dummy atom used as the two-electron placeholder.
REPL = Chem.MolFromSmarts("[*]")


def get_scaffold(mol, real_bm=True, use_csk=False, use_bajorath=False):
    mol = Chem.Mol(mol)  # work on a copy so the caller's molecule is not modified
    Chem.RemoveStereochemistry(mol)  # important for canonicalization of CSKs!
    scaff = MurckoScaffold.GetScaffoldForMol(mol)
    if use_bajorath:
        # Bajorath: drop exo atoms without leaving a placeholder.
        scaff = AllChem.DeleteSubstructs(scaff, PATT)
    if real_bm:
        # Bemis and Murcko: replace each exo atom by a placeholder (=*).
        scaff = AllChem.ReplaceSubstructs(scaff, PATT, REPL, replaceAll=True)[0]
    if use_csk:
        scaff = MurckoScaffold.MakeScaffoldGeneric(scaff)
        if real_bm:
            # Second round removes the former exo atoms, now single-bonded carbons.
            scaff = MurckoScaffold.GetScaffoldForMol(scaff)
    return scaff


m = Chem.MolFromSmiles("c1cc(F)ccc1OC(=O)C1CS(=NC)(=O)C1")

scaff_legends = ["parent molecule", "RDKit BM", "True BM", "Bajorath BM", "RDKit generic", "True CSK"]
rdkit_bm = get_scaffold(m, real_bm=False)
true_bm = get_scaffold(m, real_bm=True)
bajorath_bm = get_scaffold(m, use_bajorath=True)
rdkit_csk = get_scaffold(m, real_bm=False, use_csk=True)
true_csk = get_scaffold(m, real_bm=True, use_csk=True)

d = Draw.MolsToGridImage([m, rdkit_bm, true_bm, bajorath_bm, rdkit_csk, true_csk], legends=scaff_legends, molsPerRow=6)
display(d)  # display() is available by default in Jupyter notebooks

I ran the analysis on the ChEMBL set from GuacaMol (downloadable at https://figshare.com/articles/dataset/GuacaMol_All_SMILES/7322252 , 1.59M molecules). As you can see, the differences are pretty big. A few molecules were skipped because they have atoms with a degree > 4, which cannot be represented as carbons in a CSK.

Scaffold type	Unique scaffolds	Unique scaffolds occurring >10 times
RDKit BM	470961	23030
True BM	465873	23051
Bajorath BM	439888	23004
RDKit CSK	193970	19960
True CSK	109935	13785
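Counts like the ones in the table can be gathered along the following lines. This is a minimal sketch and not the exact script used for the numbers above: `scaffold_smiles`, `count_scaffolds`, and the five-molecule toy list are hypothetical stand-ins, only the two generic variants are covered, and the degree check is my assumption about how the problematic molecules could be skipped.

```python
from collections import Counter

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold


def scaffold_smiles(smiles, true_csk):
    """Return the canonical SMILES of the generic scaffold, or None if skipped."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    Chem.RemoveStereochemistry(mol)
    scaff = MurckoScaffold.GetScaffoldForMol(mol)
    # Assumption: skip scaffolds with atoms of degree > 4, since such
    # atoms cannot be turned into carbon.
    if any(atom.GetDegree() > 4 for atom in scaff.GetAtoms()):
        return None
    generic = MurckoScaffold.MakeScaffoldGeneric(scaff)
    if true_csk:
        # Second round removes former exo atoms (see the discussion above).
        generic = MurckoScaffold.GetScaffoldForMol(generic)
    return Chem.MolToSmiles(generic)


def count_scaffolds(smiles_list, true_csk):
    """Count how often each generic scaffold occurs in a list of SMILES."""
    keys = (scaffold_smiles(smi, true_csk) for smi in smiles_list)
    return Counter(key for key in keys if key is not None)


# Toy stand-in for the real data set.
toy_smiles = ["Cc1ccccc1", "CCc1ccccc1", "O=C1CCCCC1", "NC1CCCCC1", "c1ccncc1"]

rdkit_counts = count_scaffolds(toy_smiles, true_csk=False)
true_counts = count_scaffolds(toy_smiles, true_csk=True)

min_count = 1  # the table uses 10; lowered here for the toy list
print(len(rdkit_counts), sum(n > min_count for n in rdkit_counts.values()))
print(len(true_counts), sum(n > min_count for n in true_counts.values()))
```

Even on this toy list, the cyclohexanone entry gets its own scaffold in the RDKit generic variant but merges with the other six-membered rings in the true CSK variant.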

As splitting by Murcko-type scaffolds is widely used in various cheminformatics tasks, it is well worth thinking carefully about which scaffold definition is best for your task. I hope this exploration of the differences between them is of some help, and feel free to add any comments or disagreements!
