Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rascal exactConnectionsMatch bug #7359

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
26 changes: 19 additions & 7 deletions Code/GraphMol/RascalMCES/RascalMCES.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -255,14 +255,26 @@ void getBondLabels(const ROMol &mol, const RascalOptions &opts,
getAtomLabels(mol, opts, atomLabels);
bondLabels = std::vector<std::string>(mol.getNumBonds());
for (const auto &b : mol.bonds()) {
if (b->getBeginAtom()->getAtomicNum() < b->getEndAtom()->getAtomicNum()) {
bondLabels[b->getIdx()] = atomLabels[b->getBeginAtomIdx()] +
std::to_string(b->getBondType()) +
atomLabels[b->getEndAtomIdx()];
if (b->getBeginAtom()->getAtomicNum() == b->getEndAtom()->getAtomicNum()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any other canonicalization that would be required? charge? hcount? Should you use getTotalDegree?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather selfishly, perhaps, I went with just what I needed to get the job done. Element and bond type are already encoded, so I don't think getTotalDegree would be any different from getDegree. Charge might be useful, maybe, and you could make a case for options for things like treating all halogens the same. Given that enhancements now go in in the bug fixing releases, I think it would be ok to work on an on-demand basis.

if (b->getEndAtom()->getDegree() < b->getBeginAtom()->getDegree()) {
bondLabels[b->getIdx()] = atomLabels[b->getEndAtomIdx()] +
std::to_string(b->getBondType()) +
atomLabels[b->getBeginAtomIdx()];
} else {
bondLabels[b->getIdx()] = atomLabels[b->getBeginAtomIdx()] +
std::to_string(b->getBondType()) +
atomLabels[b->getEndAtomIdx()];
}
} else {
bondLabels[b->getIdx()] = atomLabels[b->getEndAtomIdx()] +
std::to_string(b->getBondType()) +
atomLabels[b->getBeginAtomIdx()];
if (b->getBeginAtom()->getAtomicNum() < b->getEndAtom()->getAtomicNum()) {
bondLabels[b->getIdx()] = atomLabels[b->getBeginAtomIdx()] +
std::to_string(b->getBondType()) +
atomLabels[b->getEndAtomIdx()];
} else {
bondLabels[b->getIdx()] = atomLabels[b->getEndAtomIdx()] +
std::to_string(b->getBondType()) +
atomLabels[b->getBeginAtomIdx()];
}
}
}
}
Expand Down
34 changes: 34 additions & 0 deletions Code/GraphMol/RascalMCES/mces_catch.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1243,4 +1243,38 @@ TEST_CASE("Exact Connection Matches", "[basics]") {
SubstructMatch(*m2, *q2, smarts_res);
CHECK(smarts_res.size() == 1);
}
{
// The original implementation had a crappy order-dependence bug such
// that m1, m2 gave 12 bonds in the MCES but m3, m2 gave 10.
// They are both the same molecules.
auto m1 = "c1cccc(Cc2cccnc2)c1"_smiles;
REQUIRE(m1);
auto m2 = "c1ncccc1Oc1ccccc1"_smiles;
REQUIRE(m2);
RascalOptions opts;
opts.similarityThreshold = 0.5;
opts.allBestMCESs = false;
opts.exactConnectionsMatch = true;

auto res1 = rascalMCES(*m1, *m2, opts);
CHECK(res1.front().getBondMatches().size() == 12);

auto m3 = "c1ncccc1Cc1ccccc1"_smiles;
REQUIRE(m3);

auto res2 = rascalMCES(*m3, *m2, opts);
CHECK(res2.front().getBondMatches().size() == 12);

// and try really hard to break it.
SmilesWriteParams params;
params.doRandom = true;
for (int i = 0; i < 100; ++i) {
auto m3 = v2::SmilesParse::MolFromSmiles(MolToSmiles(*m1, params));
REQUIRE(m3);
auto m4 = v2::SmilesParse::MolFromSmiles(MolToSmiles(*m2, params));
REQUIRE(m4);
auto res2 = rascalMCES(*m3, *m4, opts);
CHECK(res2.front().getBondMatches().size() == 12);
}
}
}
4 changes: 3 additions & 1 deletion Code/GraphMol/SmilesParse/SmilesWrite.h
Original file line number Diff line number Diff line change
Expand Up @@ -103,7 +103,7 @@ RDKIT_SMILESPARSE_EXPORT std::string MolToSmiles(
RDKIT_SMILESPARSE_EXPORT std::string MolToSmiles(
const ROMol &mol, const SmilesWriteParams &params);

//! \brief returns canonical SMILES for a molecule
//! \brief returns SMILES for a molecule, canonical by default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might have snuck in accidentally, not that more docs are bad!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I came across it whilst fixing this bug, and it didn't seem worth a separate PR.

/*!
\param mol : the molecule in question.
\param doIsomericSmiles : include stereochemistry and isotope information
Expand All @@ -119,6 +119,8 @@ RDKIT_SMILESPARSE_EXPORT std::string MolToSmiles(
\param allBondsExplicit : if true, symbols will be included for all bonds.
\param allHsExplicit : if true, hydrogen counts will be provided for every
atom.
\param doRandom : if true, the first atom in the SMILES string will be
selected at random and the SMILES string will not be canonical
*/
inline std::string MolToSmiles(const ROMol &mol, bool doIsomericSmiles = true,
bool doKekule = false, int rootedAtAtom = -1,
Expand Down