MCS Similarity via fast MCS estimation #6265

shuan4638 · 2023-04-01T07:13:19Z

I found that Tanimoto similarity does not reflect global substructure similarity, so I created a similarity metric for a pair of molecules (mol1 and mol2) based on their maximum common substructure (MCS), defined as

$$ McsSim = \frac{2\,n(mol_{MCS})}{n(mol_1) + n(mol_2)} $$

where n(mol) denotes the number of atoms in the molecule.
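
As a minimal sketch of the formula above, the score can be computed from three atom counts. The function name `mcs_sim` and the example counts are illustrative assumptions, not taken from the linked repository. The expression has the same form as a Dice coefficient on atom counts, so it is 1 when the MCS covers both molecules entirely and 0 when no atoms are shared.

```python
def mcs_sim(n_mcs: int, n_mol1: int, n_mol2: int) -> float:
    """McsSim = 2 * n(mol_MCS) / (n(mol_1) + n(mol_2))."""
    if n_mol1 <= 0 or n_mol2 <= 0:
        raise ValueError("both molecules need at least one atom")
    if not 0 <= n_mcs <= min(n_mol1, n_mol2):
        raise ValueError("the MCS cannot be larger than the smaller molecule")
    return 2 * n_mcs / (n_mol1 + n_mol2)


# Hypothetical counts: a 20-atom and a 24-atom molecule sharing an 18-atom MCS.
print(round(mcs_sim(18, 20, 24), 3))  # 0.818
# Identical molecules: the MCS is the whole molecule.
print(mcs_sim(20, 20, 20))  # 1.0
```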

I also wrote a script for faster estimation, because I found that rdkit.Chem.rdFMCS.FindMCS gets slow in some cases.

I tested this new metric by comparing it with Tanimoto similarity on very similar molecules (mutations generated by CReM), and by comparing the computation time of FastMCS with that of the FindMCS function.

Mutation example:

I tested with 200 molecules × 5 mutations, and here are the results:
higher similarity scores than Tanimoto similarity, and a much shorter computation time (30K shorter on average) than rdkit.Chem.rdFMCS.FindMCS.

I would like to ask for feedback on the code and whether it would be worth adding to RDKit:
https://github.com/shuan4638/mcs_sim

greglandrum · 2023-04-03T09:33:51Z

Maintainer

Hi @shuan4638, using a fingerprint to speed up MCS is an interesting idea, but you should be aware that what you're doing is a pretty coarse approximation of the actual MCS. Here's a demonstration of that:

In [5]: m1 = Chem.MolFromSmiles('Cc1ccccc1')

In [6]: m2 = Chem.MolFromSmiles('c1ccccc1')

In [7]: MCS_similarity.rdkit_MCS_Sim(m1,m2)
Out[7]: (0.006679599871858954, 0.9230769230769231)

In [8]: MCS_similarity.fast_MCS_Sim(m1,m2)
Out[8]: (0.011676199967041612, 0.7692307692307693)

In [9]: m3 = Chem.MolFromSmiles('Cc1c(C)cccc1')

In [10]: MCS_similarity.rdkit_MCS_Sim(m3,m2)
Out[10]: (0.0005805999971926212, 0.8571428571428571)

In [11]: MCS_similarity.fast_MCS_Sim(m3,m2)
Out[11]: (0.00030700000934302807, 0.5714285714285714)

The MCS calculation identifies the C6 ring as being constant between the molecules, while your circular-fingerprint-based approach only sees atoms that have exactly the same local environment.
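
The scores above can be checked by hand. The sketch below assumes that the second element of each returned tuple is the similarity (the first looks like an elapsed time) and that n(mol) counts heavy atoms, giving 7, 6 and 8 atoms for m1, m2 and m3. Under those assumptions it backs out the number of common atoms each method implies. The helper name `implied_mcs_atoms` is made up for illustration.

```python
def implied_mcs_atoms(sim: float, n_mol1: int, n_mol2: int) -> int:
    """Invert McsSim = 2 * n_mcs / (n_mol1 + n_mol2) to recover n_mcs."""
    n_mcs = sim * (n_mol1 + n_mol2) / 2
    if abs(n_mcs - round(n_mcs)) > 1e-6:
        raise ValueError("score does not correspond to a whole number of atoms")
    return round(n_mcs)


# Heavy-atom counts (assumption: hydrogens are not counted).
N_M1, N_M2, N_M3 = 7, 6, 8  # toluene, benzene, o-xylene

# Pair (m1, m2): exact MCS, then fast estimate.
print(implied_mcs_atoms(0.9230769230769231, N_M1, N_M2))  # 6
print(implied_mcs_atoms(0.7692307692307693, N_M1, N_M2))  # 5

# Pair (m3, m2): exact MCS, then fast estimate.
print(implied_mcs_atoms(0.8571428571428571, N_M3, N_M2))  # 6
print(implied_mcs_atoms(0.5714285714285714, N_M3, N_M2))  # 4
```

Read this way, the exact MCS keeps all six ring atoms in both pairs, while the fast estimate keeps five and four.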

I'm not sure what you mean by "global substructure similarity". Can you provide an example of a few molecules where you think the standard fingerprint similarity is lower than it should be? (I assume the problem is that you're seeing similarities which you believe are too low.)

shuan4638 · 2023-04-03T09:56:44Z

Author

@greglandrum Thanks for the feedback.
Yes, I am aware that this fast estimation differs from the real MCS because it compares local environments. I was trying to produce an approximation that takes little time, which is critical for large-scale screening.

As for the term "global substructure similarity", I am not sure which wording describes it best. I want a metric that gives a high similarity score when most of the two structures are identical. I know Tanimoto similarity is useful for functional-group similarity, but it gives a low similarity score even for very similar molecules.

Here are the similarities between the parent molecule and its mutations that I showed in the post:

Tanimoto : 0.47, 0.49, 0.49
rdkit_MCS: 0.96, 0.96, 0.83
fast_MCS: 0.88, 0.88, 0.75

Also, for the molecules in your reply (which I consider similar), the Tanimoto similarities are low:

MCS_similarity.Tanimoto_Sim(m1,m2) # 0.27
MCS_similarity.Tanimoto_Sim(m2,m3) # 0.18
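
Part of the gap between these numbers may come from the form of the two coefficients rather than from the molecules. McsSim has the Dice form 2c/(a+b), while Tanimoto is c/(a+b-c), and for the same counts Dice is never smaller than Tanimoto. The sketch below only illustrates that relationship: the two metrics in this thread are computed on different features (atoms in the MCS versus fingerprint bits), so converting one into the other is an assumption made for comparison, not something done in the linked code.

```python
def tanimoto(a: int, b: int, c: int) -> float:
    """Tanimoto for feature counts a and b with c features in common."""
    return c / (a + b - c)


def dice(a: int, b: int, c: int) -> float:
    """Dice for the same counts; McsSim uses this form with atom counts."""
    return 2 * c / (a + b)


def tanimoto_to_dice(t: float) -> float:
    """For identical counts, Dice = 2T / (1 + T)."""
    return 2 * t / (1 + t)


# Hypothetical counts: the identity holds for any valid a, b, c.
assert abs(dice(10, 12, 8) - tanimoto_to_dice(tanimoto(10, 12, 8))) < 1e-12

# Tanimoto values quoted in this thread, rescaled to the Dice form.
for t in (0.47, 0.49, 0.27, 0.18):
    print(t, round(tanimoto_to_dice(t), 2))  # 0.64, 0.66, 0.43, 0.31
```

Even on a Dice scale, the fingerprint similarities for the mutations (about 0.64 to 0.66) stay below the MCS-based scores listed above, so the choice of coefficient alone does not explain the difference.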

shuan4638 Apr 13, 2023
Author

@greglandrum any thoughts on this?
