Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor smiles parsing code to use C++ lexer and parser (#7015) #7016

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

whosayn
Copy link
Contributor

@whosayn whosayn commented Jan 1, 2024

This refactors the smiles parsing procedure to separate the parsing and ROMol construction procedures. Given the intricate nature of the mol construction procedure, a 1D list of mol events made the most sense as the parsers output. This allowed me to ensure that the order in which atoms/bonds are created and properties are set follows that from the previous implementation.

I also added the FlexLexer.h header (source: https://github.com/westes/flex) to SmilesParse because C++
scanners generated by flex require it. We need this to be present whether users have flex installed or not, and this was the easiest way to achieve that.

Summary of changes:

    * Updated parsing-related error messages to point to the bad
      token and to include more informative messages. Eg.

        `
        [16:26:38] SMILES Parse Error: check for mistakes around position 13
        COc(c1)cccc1C#
        -------------^
        syntax error
        `

        `
        [16:27:32] SMILES Parse Error: check for mistakes around position 1
        [Bg]
        -^
        unsupported atom symbol
        `
    * Removed manual memory management from the ROMol construction
      procedure by using RAII classes (this doesn't apply to ing closures)
    * This also removes bad interactions between consecutive smiles
      parsing, which required conversion of bad inputs to follow a
      reset global state. See refactored SmilesParse::test.cpp::testFail
    * Fixes bugs like:
            * preventing hydrogens with defined chiralities
            * allowing branch atoms of the form `C1(.C1)`
            * allowing ring bonds like `C1.C%01`
            * restricting formal charges to -15 <= N <= 15

An important concern for this change is performance, so I ran all of the tests in SmilesParse::test.cpp x1000 and noticed that this change increases the runtime by about 5% i.e. from about 82ms to 86ms.

Reference Issue

Implements #7015

whosayn and others added 7 commits January 1, 2024 14:23
This refactors the smiles parsing procedure to separate the parsing and
ROMol construction procedures. Given the intricate nature of the mol
construction procedure, a 1D list of mol events made the most sense as
the parsers output. This allowed me to ensure that the order in which
atoms/bonds are created and properties are set follows that from the
previous implementation.

I also added a new project to External/flex (source:
https://github.com/westes/flex) because C++
scanners generated by flex require the FlexLexer.h header. We need this
to be present whether users have flex installed or not, and this was the
easiest way to achieve that.

Summary of changes:
        * Updated parsing-related error messages to point to the bad
          token and to include more informative messages. Eg.
            `
            [16:26:38] SMILES Parse Error: check for mistakes around position 13
            COc(c1)cccc1C#
            -------------^
            syntax error
            `

            `
            [16:27:32] SMILES Parse Error: check for mistakes around position 1
            [Bg]
            -^
            unsupported atom symbol
            `
        * Removed manual memory management from the ROMol construction
          procedure by using RAII classes (this doesn't apply to ing closures)
        * This also removes bad interactions between consecutive smiles
          parsing, which required conversion of bad inputs to follow a
          reset global state. See refactored SmilesParse::test.cpp::testFail
        * Fixes bugs like:
                * preventing hydrogens with defined chiralities
                * allowing branch atoms of the form `C1(.C1)`
                * allowing ring bonds like `C1.C%01`
                * restricting formal charges to -15 <= N <= 15

An important concern for this change is performance, so I ran all of the
tests in SmilesParse::test.cpp x1000 and noticed that this change
increases the runtime by about 5% i.e. from about 82ms to 86ms.
@whosayn whosayn marked this pull request as ready for review January 3, 2024 03:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant