Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[go] performance downgrade a lot after upgrade antlr from 4.9 to 4.13.1 #4613

Closed
longjiquan opened this issue May 14, 2024 · 8 comments
Closed

Comments

@longjiquan
Copy link

Hello, guys, thanks for the awesome project, it saved a lot of time for us to build a parser. Recently, I tried to upgrade the antlr version from 4.9 to 4.13.1, and I followed the https://github.com/antlr/antlr4/blob/master/doc/go-target.md to do this. However, after both the antlr runtime and generated code was upgraded to 4.13, we didn't gain performance improvement. The flamegraph of pprof indicated that the newer version need more time to do the parsing, most of cpu time was consumed by the sync.Mutex inner ATN.NextTokensNoContext. I'm very confused about this because in fact a parser won't be used in multiple goroutines in our scenario. Can anyone give me some suggestions?

Version 4.9, the parsing cost 25.42% of cpu:
image

Version 4.13, the parsing cost 35.99% of cpu:
image

@jimidle
Copy link
Collaborator

jimidle commented May 14, 2024

I would need to see your grammar. There are a couple of small performance improvement patches that I intend to release very shortly. But I had debated just not allowing multiple go routines , but people said they used them. I did some work on a build configuration for this, but did not finish it yet.

However, benchmarks are fickle. So:

Did you first run one parse without measuring to load the ATNs etc? You can also cause it to do that before parsing, but that is the easiest way. It looks like you have have done that given the elapsed times?

The version you were running was actually very broken. You are lucky it ran at all I think. I do not remember having to put that mutex in, but if I did, it is because it was needed in comparison to the Java target.
However, it looks like the one you point out is coming from the ErrorStrategy? Are you parsing things in error?

This version definitely performs much better, but if your grammar is very ambiguous, then all bets are off. The previous runtime probably did not do half the work it should have done.

If you supply your grammar, then I can take a look. But are you trying SLL parsing first? And using the DiagnosticErrorListener to work out where your grammar might be flakey? This runtime is definitely optimized for efficient grammars. Lexing is usually more CPU intensive than parsing. Hence I think that the issue might be that your grammar is being exposed with the new release.

I will also check to see what state my single threaded grammar was in.

FInally, you are using the latest Go compiler?

@longjiquan
Copy link
Author

Thanks for the very quick reply, @jimidle . My grammar can be found at https://github.com/milvus-io/milvus/blob/master/internal/parser/planparserv2/Plan.g4. This version has not been launched in our production environment yet, we just compared it to see if any performance improved.
For other questions you mentioned:

  1. We don't parse things in error, the syntax is correct.
  2. Yes, we used an error listener, and it can be found at https://github.com/milvus-io/milvus/blob/master/internal/parser/planparserv2/error_listener.go.
  3. Go version: go version go1.20.7 linux/amd64.

@jimidle
Copy link
Collaborator

jimidle commented May 15, 2024

Please note new release:

https://github.com/antlr4-go/antlr/releases/tag/v4.13.1

This allows you to turn off mutexes with the build tag -tags antlr.nomutex

I will check your grammar and listener in a few days. But in the meantime, upgrade your go to the latest and check your grammar for efficiency and ambiguities.

@jimidle
Copy link
Collaborator

jimidle commented May 15, 2024 via email

@longjiquan
Copy link
Author

longjiquan commented May 17, 2024

Sorry for the late reply, I'm not feeling well these days and had some rest.
Thanks for your suggestions about how we avoid the mutex, and I'll benchmark the performance with this feature.
Could you give some suggestions about how I can reduce the ambiguity of the grammar?
Hope you will have a good weekend! @jimidle

@jimidle
Copy link
Collaborator

jimidle commented May 18, 2024

Sorry for the late reply, I'm not feeling well these days and had some rest. Thanks for your suggestions about how we avoid the mutex, and I'll benchmark the performance with this feature. Could you give some suggestions about how I can reduce the ambiguity of the grammar? Hope you will have a good weekend! @jimi

There are a number of improvements you can make here:

  • Simplify all those fragments they make it much harder to read - that does not affect performance though
  • Don't use '(' and other literals in the parser, make them real tokens
  • Do not include space in tokens: 'not in' should be two tokens
  • Your precedence is the wrong way round - higher precedence towards the top of the expression.
  • Use a case insensitive lexer

Your performance issue is caused by:

| expr op1 = (LT | LE) (Identifier | JSONIdentifier) op2 = (LT | LE) expr	 # Range
	| expr op1 = (GT | GE) (Identifier | JSONIdentifier) op2 = (GT | GE) expr    # ReverseRange
	| expr op = (LT | LE | GT | GE) expr					                     # Relational
	| expr op = (EQ | NE) expr		

Don't separate the case for Identifier and JSONIdentifier etc. Don't duplicate teh constructs. The parser is likely trying that path, then backtracking, which is why you cannot use SLL parsing. I have not tested this, but I think that is the issue. Everything should be just expression
and you should use a semantic pass in a tree walker to check validity. Without actually testing this, something like:

grammar Plan;

expr:
	| LPAREN expr RPAREN											             # Parens
	| LBRACK expr (COMMA expr)* COMMA? RBRACK                                    # Array
	| expr op = NOT? IN expr                                                     # Term
    | expr BAND expr										                     # BitAnd
	| expr BXOR expr										                     # BitXor
	| expr BOR expr											                     # BitOr
	| expr POW expr											                     # Power
	| expr AND expr											                     # LogicalAnd
	| expr OR expr											                     # LogicalOr
	| op = (ADD | SUB | BNOT | NOT) expr					                     # Unary
	| expr LIKE StringLiteral                                                    # Like
//	| '(' typeName ')' expr									                     # Cast
	| expr op = (MUL | DIV | MOD) expr						                     # MulDivMod
	| expr op = (ADD | SUB) expr							                     # AddSub
	| expr op = (SHL | SHR) expr							                     # Shift
	| expr op = (IN | NIN)                                                       # NinIn
	| expr op1 = (LT | LE) expr op2 = (LT | LE) expr	                         # Range
	| expr op1 = (GT | GE) expr op2 = (GT | GE) expr                             # ReverseRange
	| expr op = (LT | LE | GT | GE) expr					                     # Relational
	| expr op = (EQ | NE) expr								                     # Equality
	| (JSONContains | ArrayContains) LPAREN expr COMMA expr RPAREN               # JSONContains
	| (JSONContainsAll | ArrayContainsAll) LPAREN expr COMMA expr RPAREN         # JSONContainsAll
	| (JSONContainsAny | ArrayContainsAny) LPAREN expr COMMA expr RPAREN        # JSONContainsAny
	| ArrayLength'('(Identifier | JSONIdentifier)')'                             # ArrayLength
	| EXISTS expr                                                                # Exists
    | IntegerConstant											                 # Integer
	| FloatingConstant										                     # Floating
	| BooleanConstant										                     # Boolean
	| StringLiteral											                     # String
	| Identifier											                     # Identifier
	| JSONIdentifier                                                             # JSONIdentifier
	| EmptyTerm                                                                  # EmptyTerm
	;

// typeName: ty = (BOOL | INT8 | INT16 | INT32 | INT64 | FLOAT | DOUBLE);

// BOOL: 'bool';
// INT8: 'int8';
// INT16: 'int16';
// INT32: 'int32';
// INT64: 'int64';
// FLOAT: 'float';
// DOUBLE: 'double';

LPAREN : '(' ;
RPAREN : ')' ;
COMMA : ',';
LBRACK : '[';
RBRACK : ']';
LT: '<';
LE: '<=';
GT: '>';
GE: '>=';
EQ: '==';
NE: '!=';

LIKE: 'like' | 'LIKE';
EXISTS: 'exists' | 'EXISTS';

ADD: '+';
SUB: '-';
MUL: '*';
DIV: '/';
MOD: '%';
POW: '**';
SHL: '<<';
SHR: '>>';
BAND: '&';
BOR: '|';
BXOR: '^';

AND: '&&' | 'and';
OR: '||' | 'or';

BNOT: '~';
NOT: '!' | 'not';

IN: 'in';
EmptyTerm: '[' (Whitespace | Newline)* ']';

// TODO: Don't do this, use case-insensitive lexer or [Jj][Ss][Oo][Nn] etc.
JSONContains: 'json_contains' | 'JSON_CONTAINS';
JSONContainsAll: 'json_contains_all' | 'JSON_CONTAINS_ALL';
JSONContainsAny: 'json_contains_any' | 'JSON_CONTAINS_ANY';

ArrayContains: 'array_contains' | 'ARRAY_CONTAINS';
ArrayContainsAll: 'array_contains_all' | 'ARRAY_CONTAINS_ALL';
ArrayContainsAny: 'array_contains_any' | 'ARRAY_CONTAINS_ANY';
ArrayLength: 'array_length' | 'ARRAY_LENGTH';

BooleanConstant: 'true' | 'True' | 'TRUE' | 'false' | 'False' | 'FALSE';

IntegerConstant:
	DecimalConstant
	| OctalConstant
	| HexadecimalConstant
	| BinaryConstant;

FloatingConstant:
	DecimalFloatingConstant
	| HexadecimalFloatingConstant;

Identifier: Nondigit (Nondigit | Digit)* | '$meta';

StringLiteral: EncodingPrefix? ('"' DoubleSCharSequence? '"' | '\'' SingleSCharSequence? '\'');
JSONIdentifier: Identifier('[' (StringLiteral | DecimalConstant) ']')+;

fragment EncodingPrefix: 'u8' | 'u' | 'U' | 'L';

fragment DoubleSCharSequence: DoubleSChar+;
fragment SingleSCharSequence: SingleSChar+;

fragment DoubleSChar: ~["\\\r\n] | EscapeSequence | '\\\n' | '\\\r\n';
fragment SingleSChar: ~['\\\r\n] | EscapeSequence | '\\\n' | '\\\r\n';
fragment Nondigit: [a-zA-Z_];
fragment Digit: [0-9];
fragment BinaryConstant: '0' [bB] [0-1]+;
fragment DecimalConstant: NonzeroDigit Digit* | '0';
fragment OctalConstant: '0' OctalDigit*;
fragment HexadecimalConstant: '0' [xX] HexadecimalDigitSequence;
fragment NonzeroDigit: [1-9];
fragment OctalDigit: [0-7];
fragment HexadecimalDigit: [0-9a-fA-F];
fragment HexQuad:
	HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit;
fragment UniversalCharacterName:
	'\\u' HexQuad
	| '\\U' HexQuad HexQuad;
fragment DecimalFloatingConstant:
	FractionalConstant ExponentPart?
	| DigitSequence ExponentPart;
fragment HexadecimalFloatingConstant:
	'0' [xX] (
		HexadecimalFractionalConstant
		| HexadecimalDigitSequence
	) BinaryExponentPart;
fragment FractionalConstant:
	DigitSequence? '.' DigitSequence
	| DigitSequence '.';
fragment ExponentPart: [eE] [+-]? DigitSequence;
fragment DigitSequence: Digit+;
fragment HexadecimalFractionalConstant:
	HexadecimalDigitSequence? '.' HexadecimalDigitSequence
	| HexadecimalDigitSequence '.';
fragment HexadecimalDigitSequence: HexadecimalDigit+;
fragment BinaryExponentPart: [pP] [+-]? DigitSequence;
fragment EscapeSequence:
	'\\' ['"?abfnrtv\\]
	| '\\' OctalDigit OctalDigit? OctalDigit?
	| '\\x' HexadecimalDigitSequence
	| UniversalCharacterName;

Whitespace: [ \t]+ -> skip;

Newline: ( '\r' '\n'? | '\n') -> skip;

@jimidle
Copy link
Collaborator

jimidle commented May 18, 2024

@parrt This can be closed. The grammar was the problem, not ANTLR or the Go runtime.

@parrt parrt closed this as completed May 18, 2024
@longjiquan
Copy link
Author

Thanks a lot for the very detailed suggestions, @jimidle . I'll have a try to fix it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants