Contextual Lexer Leaking "Spam" Terminals #1331

Open
davidmcnabnz opened this issue Aug 30, 2023 · 11 comments
Labels
discussion Discuss new features or other changes

Comments

@davidmcnabnz
Sponsor

Describe the bug

In a contextual lexing setup, where numerous BasicLexer instances get spawned for the various contexts, many of these instances are being created with toxic "spam" terminals. In other words, their terminal lists include terminals that have no possibility of contributing to the fulfilment of any rule in their context.

This is causing text to be mis-classified in places, resulting in creation of tokens which are illegal in the context, which in turn crashes the parse.

To Reproduce

At this stage I would struggle to distil my large proprietary parsing codebase into a simple example. With this bug report, I'm asking whether there are any general troubleshooting tips for finding out why invalid terminals are leaking into the context, and how to prevent this, short of writing some very aggressive introspection into my BasicLexer subclass.

I've tried juggling terminal priorities, but this is just an 'arms race' that doesn't resolve. If a terminal priority change fixes one context, it breaks others, and vice versa.

@MegaIng
Member

MegaIng commented Aug 30, 2023

This is a limitation of the LALR parsing algorithm. The BasicLexers are created based on the states of the parser, which can't fully reflect the truth. The LALR algorithm can't parse all CFGs, so if this is limiting what you can parse, you can't use lalr and need to use earley (which can parse all CFGs).

@davidmcnabnz
Sponsor Author

Thanks for the clarification @MegaIng. I was admittedly clinging on to LALR(1) out of past habit and from wanting to run the transformer during the parse. Now seems a very good time to migrate properly to Earley. It will probably make my grammar a lot simpler too.

For context, the language I'm parsing is full of edge cases where white space is sometimes essential and needs to be kept, versus cases where it's irrelevant and needs to be discarded. LALR(1) parsing has been tough because it turns into a chaotic war of trying to stop text being eaten by the wrong terminals.

@davidmcnabnz
Sponsor Author

I've given Earley a good try today, but the lack of transparency makes debugging very difficult, verging on impossible. There's no equivalent to parse_interactive(), and no ability to stream the tokens, rules and parser states. As well as that, the lexer is failing to detect a specific tagged string that the LALR(1) parser's lexing picked up easily, even when I bump its token to top priority.

To get around the problem of spam terminal leakage in LALR(1), I've added logic to my BasicLexer subclass to introspect current parser state, and use this to test for lexing errors, and send back correct 'patched' replacement tokens. This has got me back on track.
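For anyone hitting the same wall, the kind of state-aware token patching described above can be sketched roughly like this. Everything here is invented for illustration (the names ALLOWED_TERMINALS, PATTERNS and patch_token are not Lark API, and a real implementation would derive the allow-lists from the parser state rather than hand-code them):

```python
import re

# Hypothetical per-state allow-lists: which terminals are legal in which
# lexer/parser state. A real version would introspect the parser state.
ALLOWED_TERMINALS = {
    "stmt_start": ["KEYWORD", "IDENT"],
    "in_string":  ["STRING_BODY", "STRING_END"],
}

PATTERNS = {
    "KEYWORD":     re.compile(r"(if|while|end)\b"),
    "IDENT":       re.compile(r"[A-Za-z_]\w*"),
    "STRING_BODY": re.compile(r'[^"]+'),
    "STRING_END":  re.compile(r'"'),
}

def patch_token(state, token_type, text):
    """If the lexer classified `text` as a terminal that is illegal in
    `state`, re-test it against the terminals that are legal there and
    return a 'patched' replacement token type."""
    if token_type in ALLOWED_TERMINALS[state]:
        return token_type                      # already legal, keep it
    for candidate in ALLOWED_TERMINALS[state]:
        if PATTERNS[candidate].fullmatch(text):
            return candidate                   # patched replacement token
    raise ValueError(f"{text!r} matches no terminal legal in state {state!r}")
```

So a spam terminal like STRING_BODY swallowing the keyword "if" at statement start would get re-classified as KEYWORD instead of crashing the parse.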

@erezsh
Member

erezsh commented Aug 31, 2023

@davidmcnabnz Yes, the Earley parser is harder to debug. I wouldn't say impossible. The "problem" with Earley (which is also why it's so powerful) is that it doesn't know what rules and tokens it will choose until the last token has been input.

It might be possible to add some interactivity to the Earley parser, but it will be a different API to parse_interactive().

Anyway, glad to hear you're getting back on track. I wonder, what do you get from subclassing BasicLexer, that you can't do with parse_interactive() ?

@davidmcnabnz
Sponsor Author

davidmcnabnz commented Sep 1, 2023

Thanks, @erezsh for the thoughts. You're right to question it - as it happens, subclassing BasicLexer dropped me into a worse rabbit hole.

Anyway, I've pivoted to a whole different approach which is simplifying the project, eliminating all terminal spam and "terminal priority arms races", and speeding up the task.

This approach is based on:

  1. Stripping down the grammar to coarser higher-level rules
  2. Migrating the analysis burden from the parser grammar to the transformer methods
  3. Ditching the built-in lexers
  4. Implementing my own custom lexer with an inbuilt push-down state engine
  5. In each of the lexer's state/stack patterns, maintaining an exact hand-coded linear allow-list of permissible terminals
  6. Implementing a special class for terminals, where instances contain not just the terminal's token name and regex pattern, but also fine-grained whitespace control options
  7. In each of the lexer's state/stack patterns' terminal allow-lists, placing unique instances of each terminal, with appropriate whitespace configs according to that state's context

This sounds like a lot, but it's turned out surprisingly easy. It's also helping me cover the legacy language's nuances more easily, where the significance of whitespace, and the semantics of tokens, vary widely according to context.
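To make steps 4, 5 and 7 concrete, here's a minimal toy sketch of a push-down lexer with per-state terminal allow-lists and per-terminal whitespace policies. All names (Opt, STATES, lex) and the example states are invented for illustration; the real StateBasedLexer isn't published:

```python
import re
from collections import namedtuple

# token name, regex, transition command, target state, whitespace policy
Opt = namedtuple("Opt", "token pattern action target keep_ws")

STATES = {
    "top": [
        Opt("LBRACE", re.compile(r"\{"),     "push", "block", False),
        Opt("WORD",   re.compile(r"\w+"),    "noop", None,    False),
    ],
    "block": [
        Opt("RBRACE", re.compile(r"\}"),     "pop",  None,    False),
        Opt("TEXT",   re.compile(r"[^{}]+"), "noop", None,    True),   # whitespace significant here
    ],
}

def lex(data):
    stack = ["top"]                          # push-down state stack
    pos = 0
    while pos < len(data):
        for opt in STATES[stack[-1]]:        # exact allow-list for current state
            m = opt.pattern.match(data, pos)
            if m:
                text = m.group() if opt.keep_ws else m.group().strip()
                if text:
                    yield (opt.token, text)
                if opt.action == "push":
                    stack.append(opt.target)
                elif opt.action == "pop":
                    stack.pop()
                elif opt.action == "goto":
                    stack[-1] = opt.target
                pos = m.end()
                break
        else:
            pos += 1                         # no terminal in this state accepts this char
```

With this toy table, lexing "a {b c}" keeps the whitespace inside the braces (one TEXT token "b c") while discarding it at top level, which is the kind of context-dependent whitespace control described above.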

Given that this newer methodology has resolved all the issues flagged in the ticket, I'd be happy for the ticket to close. Hopefully this discussion might help others finding themselves in similar struggles.

@erezsh erezsh added the discussion Discuss new features or other changes label Sep 3, 2023
@davidmcnabnz
Sponsor Author

Nice milestone reached today, using the StateBasedLexer class, with all 5k+ lines of legacy code now landing in the AST. It's been very easy to keep the lexer states harmonised with the grammar rules. No more mis-classification of text into incorrect Token types.

If anyone's interested, I'd be happy to refactor this class to remove the proprietary elements and make it available for general use, with examples.

@erezsh
Member

erezsh commented Sep 4, 2023

Congrats on the milestone!

Sounds interesting, what would the API look like, more or less?

@davidmcnabnz
Sponsor Author

The API for StateBasedLexer is quite simple, and follows the regular pattern: the Lark constructor's lexer parameter is passed as a class, which Lark then instantiates, and whose .lex() method is iterated over to yield tokens.

The added capability arises from the states table and stack: for each state, there's a hardwired list of eligible matching options, where each match option contains a token name and regex, plus zero or more state transition commands (noop, reset, goto, push, pop). An Opt class is used internally to formalise the representation of these matching options, but it can also be used directly by the user.

The legacy language's features made a simpler finite state machine (FSM) unworkable. It required a push-down automaton to properly track how the meanings of input text often change dramatically according to context. For instance, the same keyword in different states needs to yield different tokens to the parser.
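As a toy illustration of that last point (all names here are invented, not taken from the actual implementation), the same keyword can map to different token types depending on which state's allow-list is active:

```python
import re

# Hypothetical: the keyword "end" closes different constructs in
# different lexer states, so it must yield different token types.
STATES = {
    "in_proc":   [("PROC_END",   re.compile(r"end\b")),
                  ("PROC_BODY",  re.compile(r"\w+"))],
    "in_record": [("RECORD_END", re.compile(r"end\b")),
                  ("FIELD_NAME", re.compile(r"\w+"))],
}

def classify(state, text):
    """Return the token type for `text` under the given state's allow-list."""
    for token, pattern in STATES[state]:
        if pattern.fullmatch(text):
            return token
    return None
```

A plain FSM struggles here because the states that matter are nested (a record inside a procedure, say), which is what the push-down stack handles.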

@erezsh I'm thinking to publish this into a separate small repo of 'Lark examples'. But if you prefer, I could add the class into the examples section of my Lark fork, and send you a PR.

@erezsh
Member

erezsh commented Sep 4, 2023

@davidmcnabnz Hard to say without seeing it first! Maybe it's safer to put it in a separate repo, and we can always copy it to the main one afterwards.

@davidmcnabnz
Sponsor Author

@erezsh will do. Meanwhile, you could consider slightly relaxing the validation of Lark's lexer constructor argument. As it stands, it insists on custom lexers being a subclass of Lexer.

In my use case, this makes it harder to pass the state table into my own method which instantiates Lark. It hasn't been a problem so far, because I'm hardwiring the specific state table into my class. But for more general use, the only way I can see to do this is to declare a Lexer subclass inside a closure where the chosen state table is in scope, which is not an elegant mechanism. For example,

def create_parser(self, grammar, transformer, states):
    # create a closure with a class def to fulfil the validation requirements for lexer=
    class MyLexer(StateLexer):
        def __init__(self, lexer_conf):
            self.lexer_conf = lexer_conf
            self.statesTable = self.generateStateTable(states) # note that 'states' is in this closure scope
            ...
        def lex(self, data):
            while data:
                ...
                yield token
    # pass the locally defined subclass (not the StateLexer base) to Lark
    parser = Lark(... transformer=transformer, lexer=MyLexer ...)
    return parser

@erezsh
Member

erezsh commented Sep 13, 2023

@davidmcnabnz I don't think closures are the worst thing, but I agree it would be nicer if there were a more relaxed way to do it. At the same time, I don't want to lose the guidance and protection that an ABC provides. Maybe in the future we can change it to a Protocol.
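For reference, a structural Protocol would let Lark accept any object exposing a matching lex() method without requiring inheritance. This is a sketch of the idea only; LexerLike is a hypothetical name, not an existing Lark class:

```python
from typing import Iterator, Protocol, runtime_checkable

@runtime_checkable
class LexerLike(Protocol):
    """Hypothetical structural stand-in for the Lexer ABC."""
    def lex(self, data: str) -> Iterator: ...

class MyStateLexer:
    # Note: no Lark base class anywhere in sight.
    def lex(self, data: str):
        for word in data.split():
            yield ("WORD", word)

# isinstance() now checks for the method's presence, not the ancestry:
print(isinstance(MyStateLexer(), LexerLike))   # True
```

The trade-off is that @runtime_checkable only verifies that the methods exist, not their signatures, so some of the ABC's guidance is indeed lost.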
