Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor smiles parsing code to use C++ lexer and parser (#7015) #7016

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Commits on Jan 1, 2024

  1. Refactor smiles parsing code to use C++ lexer and parser (rdkit#7015)

    This refactors the smiles parsing procedure to separate the parsing and
    ROMol construction procedures. Given the intricate nature of the mol
    construction procedure, a 1D list of mol events made the most sense as
    the parsers output. This allowed me to ensure that the order in which
    atoms/bonds are created and properties are set follows that from the
    previous implementation.
    
    I also added a new project to External/flex (source:
    https://github.com/westes/flex) because C++
    scanners generated by flex require the FlexLexer.h header. We need this
    to be present whether users have flex installed or not, and this was the
    easiest way to achieve that.
    
    Summary of changes:
            * Updated parsing-related error messages to point to the bad
              token and to include more informative messages. Eg.
                `
                [16:26:38] SMILES Parse Error: check for mistakes around position 13
                COc(c1)cccc1C#
                -------------^
                syntax error
                `
    
                `
                [16:27:32] SMILES Parse Error: check for mistakes around position 1
                [Bg]
                -^
                unsupported atom symbol
                `
            * Removed manual memory management from the ROMol construction
              procedure by using RAII classes (this doesn't apply to ing closures)
            * This also removes bad interactions between consecutive smiles
              parsing, which required conversion of bad inputs to follow a
              reset global state. See refactored SmilesParse::test.cpp::testFail
            * Fixes bugs like:
                    * preventing hydrogens with defined chiralities
                    * allowing branch atoms of the form `C1(.C1)`
                    * allowing ring bonds like `C1.C%01`
                    * restricting formal charges to -15 <= N <= 15
    
    An important concern for this change is performance, so I ran all of the
    tests in SmilesParse::test.cpp x1000 and noticed that this change
    increases the runtime by about 5% i.e. from about 82ms to 86ms.
    whosayn committed Jan 1, 2024
    Configuration menu
    Copy the full SHA
    168148b View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    5c951a9 View commit details
    Browse the repository at this point in the history
  3. fix bad bison version

    whosayn committed Jan 1, 2024
    Configuration menu
    Copy the full SHA
    f2b48b0 View commit details
    Browse the repository at this point in the history

Commits on Jan 3, 2024

  1. Configuration menu
    Copy the full SHA
    4d623e4 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    efad133 View commit details
    Browse the repository at this point in the history
  3. fix build failure

    whosayn committed Jan 3, 2024
    Configuration menu
    Copy the full SHA
    a92384c View commit details
    Browse the repository at this point in the history
  4. include missing headers

    whosayn committed Jan 3, 2024
    Configuration menu
    Copy the full SHA
    4128316 View commit details
    Browse the repository at this point in the history
  5. Configuration menu
    Copy the full SHA
    1b0f143 View commit details
    Browse the repository at this point in the history