Replies: 1 comment
-
This is a common misconception about how Morgan fingerprints work. There are two phases.
First, the fingerprint is "sparse" in the sense that it acts like a set and contains the hashes found in the molecule. This representation has the characteristics you are looking for. Second, when you make a fixed-length "Fingerprint", you need to take the larger set of invariants and stuff them into a smaller space, here 2048 bits.
Taking a larger set and forcing it into a smaller one can cause collisions. This is what you are seeing here.
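To make the folding step concrete, here is a minimal pure-Python sketch. It assumes folding is a plain modulo on the identifier, which matches the "mod 2048" behaviour described in the question. The identifiers and the `fold` helper are made up for illustration and are not real Morgan hashes or RDKit API.

```python
# Hypothetical sparse identifiers (real ones are hashes of atom environments).
N_BITS = 2048


def fold(identifiers, n_bits=N_BITS):
    """Map each sparse identifier to a bit position via identifier % n_bits."""
    return {ident % n_bits for ident in identifiers}


# Two pretend molecules with no identifier in common.
sparse_a = {36765, 98513984}
sparse_b = {82579357, 2245384272}

# The sparse sets are disjoint: no shared substructure identifier.
print(sparse_a & sparse_b)

# After folding, both molecules set bit 1949, because
# 36765 % 2048 == 82579357 % 2048 == 1949.
shared_bits = fold(sparse_a) & fold(sparse_b)
print(shared_bits)
```

So the sparse representation keeps the two environments apart, and only the fixed-length bit vector merges them.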
In materials informatics, a common workflow for predicting properties of organic compounds starts by converting the SMILES strings in a dataset into Morgan fingerprints. These fingerprints are then fed into interpretable models such as random forests, and feature importance is used to identify important bits (substructures) and gain chemical insight.
However, I have recently encountered an issue where the substructures represented by the same bits in Morgan fingerprints are different for different molecules.
For example, for these two molecules 'C=CC(=O)OCC(C)C(C)CC' and 'C=CC(=O)Oc1cccc(C(=O)OC)c1', both of them are converted into Morgan fingerprints with a length of 2048 and a radius of 3.
You can see that both molecules have a value of 1 at bit 1949, but the substructures that bit represents are not the same:
for 'C=CC(=O)OCC(C)C(C)CC': [substructure image not preserved]
for 'C=CC(=O)Oc1cccc(C(=O)OC)c1': [substructure image not preserved]
They are completely different. I understand how the Morgan algorithm works: these two substructures have different identifiers, but when compressed into a 2048-bit vector they land on the same bit because their identifiers are equal modulo 2048. My question is this: in a dataset where a given feature column (i.e., bit) does not always represent the same substructure, which substructure should I choose for bit 1949 when I use feature importance to assess the impact of different substructures? And even setting feature importance aside, is it really acceptable for an ML model if the same feature column carries different meanings?
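One way to see how widespread this is in a dataset is to list, for every bit, which distinct identifiers fold onto it. The sketch below is pure Python and only illustrates the idea. The molecule names, the identifiers, and the helpers `bit_to_identifiers` and `ambiguous_bits` are all hypothetical, and it assumes the unfolded identifiers per molecule are already available.

```python
from collections import defaultdict

N_BITS = 2048


def bit_to_identifiers(sparse_by_molecule, n_bits=N_BITS):
    """Return {bit: {identifier: [molecule names]}} under modulo folding."""
    table = defaultdict(lambda: defaultdict(list))
    for name, identifiers in sparse_by_molecule.items():
        for ident in identifiers:
            table[ident % n_bits][ident].append(name)
    return table


def ambiguous_bits(table):
    """Bits whose column mixes more than one distinct identifier."""
    return sorted(bit for bit, idents in table.items() if len(idents) > 1)


# Made-up identifiers: 5000 is shared by both molecules, while
# 36765 and 82579357 differ but both fold onto bit 1949.
dataset = {
    "mol_A": {36765, 5000},
    "mol_B": {82579357, 5000},
}

table = bit_to_identifiers(dataset)
print(ambiguous_bits(table))  # prints [1949]
```

A bit that shows up in this list has no single substructure behind it, so its importance score is spread over every identifier that folds onto it.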