[go] performance downgrade a lot after upgrade antlr from 4.9 to 4.13.1 #4613

longjiquan · 2024-05-14T12:23:52Z

Hello, guys, thanks for the awesome project, it saved a lot of time for us to build a parser. Recently, I tried to upgrade the antlr version from 4.9 to 4.13.1, and I followed the https://github.com/antlr/antlr4/blob/master/doc/go-target.md to do this. However, after both the antlr runtime and generated code was upgraded to 4.13, we didn't gain performance improvement. The flamegraph of pprof indicated that the newer version need more time to do the parsing, most of cpu time was consumed by the sync.Mutex inner ATN.NextTokensNoContext. I'm very confused about this because in fact a parser won't be used in multiple goroutines in our scenario. Can anyone give me some suggestions?

Version 4.9, the parsing cost 25.42% of cpu:

Version 4.13, the parsing cost 35.99% of cpu:

The text was updated successfully, but these errors were encountered:

jimidle · 2024-05-14T21:36:14Z

I would need to see your grammar. There are a couple of small performance improvement patches that I intend to release very shortly. But I had debated just not allowing multiple go routines , but people said they used them. I did some work on a build configuration for this, but did not finish it yet.

However, benchmarks are fickle. So:

Did you first run one parse without measuring to load the ATNs etc? You can also cause it to do that before parsing, but that is the easiest way. It looks like you have have done that given the elapsed times?

The version you were running was actually very broken. You are lucky it ran at all I think. I do not remember having to put that mutex in, but if I did, it is because it was needed in comparison to the Java target.
However, it looks like the one you point out is coming from the ErrorStrategy? Are you parsing things in error?

This version definitely performs much better, but if your grammar is very ambiguous, then all bets are off. The previous runtime probably did not do half the work it should have done.

If you supply your grammar, then I can take a look. But are you trying SLL parsing first? And using the DiagnosticErrorListener to work out where your grammar might be flakey? This runtime is definitely optimized for efficient grammars. Lexing is usually more CPU intensive than parsing. Hence I think that the issue might be that your grammar is being exposed with the new release.

I will also check to see what state my single threaded grammar was in.

FInally, you are using the latest Go compiler?

longjiquan · 2024-05-15T02:33:37Z

Thanks for the very quick reply, @jimidle . My grammar can be found at https://github.com/milvus-io/milvus/blob/master/internal/parser/planparserv2/Plan.g4. This version has not been launched in our production environment yet, we just compared it to see if any performance improved.
For other questions you mentioned:

We don't parse things in error, the syntax is correct.
Yes, we used an error listener, and it can be found at https://github.com/milvus-io/milvus/blob/master/internal/parser/planparserv2/error_listener.go.
Go version: go version go1.20.7 linux/amd64.

jimidle · 2024-05-15T16:06:23Z

Please note new release:

https://github.com/antlr4-go/antlr/releases/tag/v4.13.1

This allows you to turn off mutexes with the build tag -tags antlr.nomutex

I will check your grammar and listener in a few days. But in the meantime, upgrade your go to the latest and check your grammar for efficiency and ambiguities.

jimidle · 2024-05-15T20:08:07Z

I took a Quick Look at the grammar that’s your issue. It is very ambiguous and days not use left revision well for precedence. Binary ops are usually higher precedence than everything else other than patens. So the angle is likely calling your listener a ton of times my to try and tell it about about, which it ignores. The listener then looks at tokens and uses the locks. The new runtime will help with locks but don’t just stop there. Let’s redo your grammar so that you don’t have the issue in the first place. Jim

…

On Tue, May 14, 2024 at 20:33 Jiquan Long ***@***.***> wrote: Thanks for the very quick reply, @jimidle <https://github.com/jimidle> . My grammar can be found at https://github.com/milvus-io/milvus/blob/master/internal/parser/planparserv2/Plan.g4. This version has not been launched in our production environment yet, we just compared it to see if any performance improved. For other questions you mentioned: 1. We don't parse things in error, the syntax is correct. 2. Yes, we used an error listener, and it can be found at https://github.com/milvus-io/milvus/blob/master/internal/parser/planparserv2/error_listener.go . 3. Go version: go version go1.20.7 linux/amd64. — Reply to this email directly, view it on GitHub <#4613 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAJ7TMFRJEO6INDOYC26NS3ZCLCRPAVCNFSM6AAAAABHWCHYRKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMJRGQ3DSMZSGA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

longjiquan · 2024-05-17T09:00:41Z

Sorry for the late reply, I'm not feeling well these days and had some rest.
Thanks for your suggestions about how we avoid the mutex, and I'll benchmark the performance with this feature.
Could you give some suggestions about how I can reduce the ambiguity of the grammar?
Hope you will have a good weekend! @jimidle

jimidle · 2024-05-18T16:24:07Z

Sorry for the late reply, I'm not feeling well these days and had some rest. Thanks for your suggestions about how we avoid the mutex, and I'll benchmark the performance with this feature. Could you give some suggestions about how I can reduce the ambiguity of the grammar? Hope you will have a good weekend! @jimi

There are a number of improvements you can make here:

Simplify all those fragments they make it much harder to read - that does not affect performance though
Don't use '(' and other literals in the parser, make them real tokens
Do not include space in tokens: 'not in' should be two tokens
Your precedence is the wrong way round - higher precedence towards the top of the expression.
Use a case insensitive lexer

Your performance issue is caused by:

| expr op1 = (LT | LE) (Identifier | JSONIdentifier) op2 = (LT | LE) expr	 # Range
	| expr op1 = (GT | GE) (Identifier | JSONIdentifier) op2 = (GT | GE) expr    # ReverseRange
	| expr op = (LT | LE | GT | GE) expr					                     # Relational
	| expr op = (EQ | NE) expr

Don't separate the case for Identifier and JSONIdentifier etc. Don't duplicate teh constructs. The parser is likely trying that path, then backtracking, which is why you cannot use SLL parsing. I have not tested this, but I think that is the issue. Everything should be just expression
and you should use a semantic pass in a tree walker to check validity. Without actually testing this, something like:

grammar Plan;

expr:
	| LPAREN expr RPAREN											             # Parens
	| LBRACK expr (COMMA expr)* COMMA? RBRACK                                    # Array
	| expr op = NOT? IN expr                                                     # Term
    | expr BAND expr										                     # BitAnd
	| expr BXOR expr										                     # BitXor
	| expr BOR expr											                     # BitOr
	| expr POW expr											                     # Power
	| expr AND expr											                     # LogicalAnd
	| expr OR expr											                     # LogicalOr
	| op = (ADD | SUB | BNOT | NOT) expr					                     # Unary
	| expr LIKE StringLiteral                                                    # Like
//	| '(' typeName ')' expr									                     # Cast
	| expr op = (MUL | DIV | MOD) expr						                     # MulDivMod
	| expr op = (ADD | SUB) expr							                     # AddSub
	| expr op = (SHL | SHR) expr							                     # Shift
	| expr op = (IN | NIN)                                                       # NinIn
	| expr op1 = (LT | LE) expr op2 = (LT | LE) expr	                         # Range
	| expr op1 = (GT | GE) expr op2 = (GT | GE) expr                             # ReverseRange
	| expr op = (LT | LE | GT | GE) expr					                     # Relational
	| expr op = (EQ | NE) expr								                     # Equality
	| (JSONContains | ArrayContains) LPAREN expr COMMA expr RPAREN               # JSONContains
	| (JSONContainsAll | ArrayContainsAll) LPAREN expr COMMA expr RPAREN         # JSONContainsAll
	| (JSONContainsAny | ArrayContainsAny) LPAREN expr COMMA expr RPAREN        # JSONContainsAny
	| ArrayLength'('(Identifier | JSONIdentifier)')'                             # ArrayLength
	| EXISTS expr                                                                # Exists
    | IntegerConstant											                 # Integer
	| FloatingConstant										                     # Floating
	| BooleanConstant										                     # Boolean
	| StringLiteral											                     # String
	| Identifier											                     # Identifier
	| JSONIdentifier                                                             # JSONIdentifier
	| EmptyTerm                                                                  # EmptyTerm
	;

// typeName: ty = (BOOL | INT8 | INT16 | INT32 | INT64 | FLOAT | DOUBLE);

// BOOL: 'bool';
// INT8: 'int8';
// INT16: 'int16';
// INT32: 'int32';
// INT64: 'int64';
// FLOAT: 'float';
// DOUBLE: 'double';

LPAREN : '(' ;
RPAREN : ')' ;
COMMA : ',';
LBRACK : '[';
RBRACK : ']';
LT: '<';
LE: '<=';
GT: '>';
GE: '>=';
EQ: '==';
NE: '!=';

LIKE: 'like' | 'LIKE';
EXISTS: 'exists' | 'EXISTS';

ADD: '+';
SUB: '-';
MUL: '*';
DIV: '/';
MOD: '%';
POW: '**';
SHL: '<<';
SHR: '>>';
BAND: '&';
BOR: '|';
BXOR: '^';

AND: '&&' | 'and';
OR: '||' | 'or';

BNOT: '~';
NOT: '!' | 'not';

IN: 'in';
EmptyTerm: '[' (Whitespace | Newline)* ']';

// TODO: Don't do this, use case-insensitive lexer or [Jj][Ss][Oo][Nn] etc.
JSONContains: 'json_contains' | 'JSON_CONTAINS';
JSONContainsAll: 'json_contains_all' | 'JSON_CONTAINS_ALL';
JSONContainsAny: 'json_contains_any' | 'JSON_CONTAINS_ANY';

ArrayContains: 'array_contains' | 'ARRAY_CONTAINS';
ArrayContainsAll: 'array_contains_all' | 'ARRAY_CONTAINS_ALL';
ArrayContainsAny: 'array_contains_any' | 'ARRAY_CONTAINS_ANY';
ArrayLength: 'array_length' | 'ARRAY_LENGTH';

BooleanConstant: 'true' | 'True' | 'TRUE' | 'false' | 'False' | 'FALSE';

IntegerConstant:
	DecimalConstant
	| OctalConstant
	| HexadecimalConstant
	| BinaryConstant;

FloatingConstant:
	DecimalFloatingConstant
	| HexadecimalFloatingConstant;

Identifier: Nondigit (Nondigit | Digit)* | '$meta';

StringLiteral: EncodingPrefix? ('"' DoubleSCharSequence? '"' | '\'' SingleSCharSequence? '\'');
JSONIdentifier: Identifier('[' (StringLiteral | DecimalConstant) ']')+;

fragment EncodingPrefix: 'u8' | 'u' | 'U' | 'L';

fragment DoubleSCharSequence: DoubleSChar+;
fragment SingleSCharSequence: SingleSChar+;

fragment DoubleSChar: ~["\\\r\n] | EscapeSequence | '\\\n' | '\\\r\n';
fragment SingleSChar: ~['\\\r\n] | EscapeSequence | '\\\n' | '\\\r\n';
fragment Nondigit: [a-zA-Z_];
fragment Digit: [0-9];
fragment BinaryConstant: '0' [bB] [0-1]+;
fragment DecimalConstant: NonzeroDigit Digit* | '0';
fragment OctalConstant: '0' OctalDigit*;
fragment HexadecimalConstant: '0' [xX] HexadecimalDigitSequence;
fragment NonzeroDigit: [1-9];
fragment OctalDigit: [0-7];
fragment HexadecimalDigit: [0-9a-fA-F];
fragment HexQuad:
	HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit;
fragment UniversalCharacterName:
	'\\u' HexQuad
	| '\\U' HexQuad HexQuad;
fragment DecimalFloatingConstant:
	FractionalConstant ExponentPart?
	| DigitSequence ExponentPart;
fragment HexadecimalFloatingConstant:
	'0' [xX] (
		HexadecimalFractionalConstant
		| HexadecimalDigitSequence
	) BinaryExponentPart;
fragment FractionalConstant:
	DigitSequence? '.' DigitSequence
	| DigitSequence '.';
fragment ExponentPart: [eE] [+-]? DigitSequence;
fragment DigitSequence: Digit+;
fragment HexadecimalFractionalConstant:
	HexadecimalDigitSequence? '.' HexadecimalDigitSequence
	| HexadecimalDigitSequence '.';
fragment HexadecimalDigitSequence: HexadecimalDigit+;
fragment BinaryExponentPart: [pP] [+-]? DigitSequence;
fragment EscapeSequence:
	'\\' ['"?abfnrtv\\]
	| '\\' OctalDigit OctalDigit? OctalDigit?
	| '\\x' HexadecimalDigitSequence
	| UniversalCharacterName;

Whitespace: [ \t]+ -> skip;

Newline: ( '\r' '\n'? | '\n') -> skip;

jimidle · 2024-05-18T16:25:10Z

@parrt This can be closed. The grammar was the problem, not ANTLR or the Go runtime.

longjiquan · 2024-05-20T02:45:13Z

Thanks a lot for the very detailed suggestions, @jimidle . I'll have a try to fix it.

parrt closed this as completed May 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[go] performance downgrade a lot after upgrade antlr from 4.9 to 4.13.1 #4613

[go] performance downgrade a lot after upgrade antlr from 4.9 to 4.13.1 #4613

longjiquan commented May 14, 2024

jimidle commented May 14, 2024 •

edited

longjiquan commented May 15, 2024

jimidle commented May 15, 2024

jimidle commented May 15, 2024 via email

longjiquan commented May 17, 2024 •

edited

jimidle commented May 18, 2024

jimidle commented May 18, 2024

longjiquan commented May 20, 2024

[go] performance downgrade a lot after upgrade antlr from 4.9 to 4.13.1 #4613

[go] performance downgrade a lot after upgrade antlr from 4.9 to 4.13.1 #4613

Comments

longjiquan commented May 14, 2024

jimidle commented May 14, 2024 • edited

longjiquan commented May 15, 2024

jimidle commented May 15, 2024

jimidle commented May 15, 2024 via email

longjiquan commented May 17, 2024 • edited

jimidle commented May 18, 2024

jimidle commented May 18, 2024

longjiquan commented May 20, 2024

jimidle commented May 14, 2024 •

edited

longjiquan commented May 17, 2024 •

edited