Implement the 10k Test Suite #214

JCRPaquin · 2024-04-23T06:13:49Z

10k Test Suite - Catching query parsing regressions in development

What?

This PR implements a test suite of ~13k roughly deduplicated user queries to prevent parser regressions. It also includes some bug fixes that were necessary to get the suite passing, including a particularly pernicious Unicode normalization bug.
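As a rough illustration of the idea (not the test added in this PR), a regression check of this kind can be sketched as a loop that feeds every stored query to the parser and records the ones that fail. The names `find_regressions` and `toy_parse` are hypothetical, and `toy_parse` is only a stand-in for the real query parser.

```python
from typing import Callable, Iterable, List, Tuple


def find_regressions(
    queries: Iterable[str],
    parse: Callable[[str], object],
) -> List[Tuple[str, str]]:
    """Run every query through `parse` and collect the ones that raise."""
    failures = []
    for query in queries:
        try:
            parse(query)
        except Exception as exc:  # any parser error is recorded as a failure
            failures.append((query, f"{type(exc).__name__}: {exc}"))
    return failures


def toy_parse(query: str) -> str:
    """Hypothetical stand-in parser: rejects unbalanced parentheses."""
    if query.count("(") != query.count(")"):
        raise ValueError("unbalanced parentheses")
    return query.strip()


# Made-up sample queries; the real dataset ships with the PR.
sample_queries = ['author:"Smith, J"', "title:(dark matter", "year:2020"]
failures = find_regressions(sample_queries, toy_parse)
```

Collecting all failures instead of stopping at the first one makes it easier to see whether a parser change broke one query or a whole class of them.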

Why?

As we change Montysolr's query engine, it's important that users' queries continue to work as expected. This PR helps catch gaps in functionality between versions before they are released.

Completion tracking

Initial implementation
Sign-off on the release of the query dataset

What next?

We'd like to start tracking performance for different query types over time. This involves actually running queries against a fixed instance of Montysolr (which we currently don't do).
The 10k test suite may be used in production as a canary to catch broken services before our users do.

JCRPaquin · 2024-04-23T06:21:37Z

Added @ehenneken and @aaccomazzi to double-check the Unicode normalization fixes.
Added @shinyichen for code review.

cc @kelockhart

aaccomazzi

Looks good to me

JCRPaquin added 12 commits April 22, 2024 21:14

Don't publish 10k test suite data yet

518369c

Add Jython fork as a submodule

a1ed86d

I needed to patch an issue in PyUnicode so I made a new fork; will remove if/when it gets upstreamed.

Ignore .DS_Store files

1cf2fce

Add Jython submodule

6866407

Include Jython Gradle project in Montysolr

92c28be

Add test for failing name parser test case

029b42f

Handle fully unparsable inputs in the Python code

bc5306d

Prior to this, the code would error out because it expected at least 1 part. In cases where there are additional parentheses around the string (it happens), and for other miscellaneous inputs, there can be 0 parts instead.
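A minimal sketch of this kind of guard, assuming a parser that splits the name into parts and then indexes the first one. The functions `split_name_parts` and `parse_author` are hypothetical and are not the actual name parser code.

```python
import re
from typing import Dict, List, Optional


def split_name_parts(raw: str) -> List[str]:
    """Hypothetical tokenizer: drop parentheses, split on commas and whitespace."""
    cleaned = re.sub(r"[()]", " ", raw)
    return [part for part in re.split(r"[,\s]+", cleaned) if part]


def parse_author(raw: str) -> Optional[Dict[str, str]]:
    """Return the parsed name, or None when the input yields zero parts."""
    parts = split_name_parts(raw)
    if not parts:
        # Zero parts: signal "unparsable" instead of failing on parts[0].
        return None
    return {"last": parts[0], "rest": " ".join(parts[1:])}
```

With this guard, an input such as `"(())"` yields `None` instead of raising an `IndexError`, and the caller decides what to do with the unparsed name.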

Handle unparsable author names in the Java code

942db9f

This causes the unparsed author name to pass through the system. Previous versions (incorrectly) added null characters to the output of this pass if the author name couldn't be parsed.

Use NFKC normalization for author names

778d61a

This normalization pass helps to consolidate the Unicode code points in the string prior to other passes. Without this step some important parts of certain code points can be eliminated, resulting in mangled output.
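The effect can be illustrated with Python's standard `unicodedata` module. This is a sketch under assumptions: `drop_combining_marks` is a hypothetical stand-in for a later pass that discards combining characters (the commit does not say which pass is affected), and `normalize_author` is an illustrative name.

```python
import unicodedata

# "u" followed by COMBINING DIAERESIS (U+0308): a decomposed "ü"
decomposed = "Mu\u0308ller"
# Starts with LATIN SMALL LIGATURE FI (U+FB01)
ligature = "\ufb01scher"


def normalize_author(name: str) -> str:
    """Compose code points and fold compatibility forms (NFKC)."""
    return unicodedata.normalize("NFKC", name)


def drop_combining_marks(text: str) -> str:
    """Hypothetical later pass that discards combining characters."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


without_nfkc = drop_combining_marks(decomposed)                # diaeresis lost
with_nfkc = drop_combining_marks(normalize_author(decomposed)) # "ü" survives
```

Because NFKC composes `u` and U+0308 into the single code point U+00FC, the later pass no longer sees a separate combining mark to discard. NFKC also folds compatibility characters, so the ligature in the second example becomes the plain letters `fi`.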

Add test case for parenthesized author names

a8e0cd1

Allow start in position queries to be >= 0

ae00170

Add 10k test suite test case

7090d45

JCRPaquin assigned JCRPaquin and shinyichen Apr 23, 2024

JCRPaquin added testing query enhancement authors labels Apr 23, 2024

JCRPaquin assigned aaccomazzi, kelockhart and ehenneken Apr 23, 2024

aaccomazzi reviewed Apr 24, 2024

shinyichen approved these changes Apr 26, 2024

JCRPaquin added 2 commits April 29, 2024 11:36

Un-ignore the dataset

a931632

Add the query dataset

94d2461
