MCS Similarity via fast MCS estimation #6265
-
Hi @shuan4638, using a fingerprint to speed up MCS is an interesting idea, but you should be aware that what you're doing is a pretty coarse approximation to the actual MCS. Here's a demonstration of that:
The MCS calculation identifies the C6 ring as being constant between the molecules, while your circular-fingerprint-based approach only sees atoms which have exactly the same local environment. I'm not sure what you mean by "global substructure similarity". Can you provide an example of a few molecules where you think the standard fingerprint similarity is lower than it should be? (I assume the problem is that you're seeing similarities which you believe are too low.)
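To make the distinction concrete, here is a toy sketch in plain Python. It is not RDKit's Morgan algorithm: the helper names (`ring_with_substituent`, `environments`, `tanimoto_sets`) are made up for illustration, bond orders, aromaticity and hashing are ignored, and the two molecules are just a six-membered carbon ring carrying either a carbon or an oxygen substituent. By construction the common substructure is the whole 6-atom ring, yet only the ring atoms far from the substituent keep an identical radius-2 environment.

```python
def ring_with_substituent(substituent):
    """Heavy-atom graph: carbon ring (atoms 0-5) plus one substituent (atom 6) on atom 0."""
    elements = ["C"] * 6 + [substituent]
    bonds = [(i, (i + 1) % 6) for i in range(6)] + [(0, 6)]
    neighbors = {i: [] for i in range(len(elements))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return elements, neighbors


def environments(elements, neighbors, radius=2):
    """Return per-atom environment labels for radius 0..radius (Morgan-like refinement)."""
    labels = list(elements)
    per_radius = [labels]
    for _ in range(radius):
        # New label = own previous label plus the sorted previous labels of the neighbors.
        labels = [
            (labels[i], tuple(sorted(labels[j] for j in neighbors[i])))
            for i in range(len(labels))
        ]
        per_radius.append(labels)
    return per_radius


def tanimoto_sets(a, b):
    """Tanimoto (Jaccard) coefficient of two feature sets."""
    return len(a & b) / len(a | b)


env_a = environments(*ring_with_substituent("C"))  # carbon substituent
env_b = environments(*ring_with_substituent("O"))  # oxygen substituent

# Fingerprint-like view: the set of all environments seen at any radius.
features_a = {label for layer in env_a for label in layer}
features_b = {label for layer in env_b for label in layer}
tanimoto = tanimoto_sets(features_a, features_b)

# Atoms of the first molecule whose full radius-2 environment also occurs in the second.
outer_b = set(env_b[-1])
shared_atoms = [i for i, label in enumerate(env_a[-1]) if label in outer_b]

print("Tanimoto on toy environments:", round(tanimoto, 3))
print("Atoms with identical radius-2 environment:", shared_atoms)
print("Atoms in the common ring (by construction):", list(range(6)))
```

In this toy case the environment-based view matches only half of the ring atoms, because every atom within two bonds of the substituent picks up a different label. Real fingerprints behave differently in detail, but the same effect is why an environment-matching shortcut can underestimate the true MCS.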
-
@greglandrum Thanks for the feedback. I'm not sure which term best describes what I mean by "global substructure similarity". I want a metric that gives a high similarity score when most of the two structures is identical. I know Tanimoto similarity is useful for functional-group similarity, but it gives a low score even for very similar molecules. Here are the Tanimoto similarities between the parent molecule and the mutations I showed in the post: 0.47, 0.49, 0.49. Also, for the example in your reply (which I consider to be similar molecules), the Tanimoto similarities are low:
-
I found that Tanimoto similarity does not reflect global substructure similarity, so I created a similarity metric for a molecular pair (mol1 and mol2) using their maximum common substructure (MCS), defined as (formula image not preserved), where n(mol) denotes the number of atoms in the molecule.
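Since the formula itself is missing here, the sketch below only shows what an MCS-based similarity of this kind could look like. The Tanimoto-style normalization n(MCS) / (n(mol1) + n(mol2) - n(MCS)) is an assumption, not necessarily the definition used in the post, and the function name `mcs_similarity` is hypothetical. It works on atom counts only, so the MCS size would come from whatever MCS routine is in use.

```python
def mcs_similarity(n_mcs, n_mol1, n_mol2):
    """Assumed Tanimoto-style MCS similarity from atom counts.

    n_mcs  : number of atoms in the maximum common substructure
    n_mol1 : number of atoms in the first molecule
    n_mol2 : number of atoms in the second molecule
    """
    if n_mol1 <= 0 or n_mol2 <= 0:
        raise ValueError("both molecules must contain at least one atom")
    if not 0 <= n_mcs <= min(n_mol1, n_mol2):
        raise ValueError("the MCS cannot be larger than either molecule")
    # Atoms covered by either molecule, counting the shared part once.
    return n_mcs / (n_mol1 + n_mol2 - n_mcs)


# Two 7-heavy-atom molecules that share a 6-atom ring.
example = mcs_similarity(6, 7, 7)
print(example)
```

With this normalization the score is 1.0 only when the MCS covers both molecules completely, and it drops as either molecule gains atoms outside the common part. Other normalizations (for example dividing by the size of the smaller or larger molecule) are equally plausible readings of the description.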
I also wrote a script for faster estimation, because I found that `rdkit.Chem.rdFMCS.FindMCS` gets slow in some cases. I tested the new metric on very similar molecules (mutations generated by CReM), comparing its scores with Tanimoto similarity and the computation time of FastMCS with that of the `FindMCS` function.
Mutation example:
I tested with 200 molecules * 5 mutations, and here are the results!
The new metric gives higher similarity scores than Tanimoto similarity and takes much less time (30K shorter on average) than `rdkit.Chem.rdFMCS.FindMCS`.
I wanted to ask if you could give me some feedback on the code and whether it is worth adding to RDKit:
https://github.com/shuan4638/mcs_sim