Search misses "Nordic Optical" telescope results - possibly as a result of core optimization #179

Open
romanchyla opened this issue Jan 11, 2021 · 2 comments

Comments

@romanchyla
Contributor

Because we don't store term vectors (due to size), this is terribly difficult to debug, but here is a review of what is known so far.

A search for "nordic optical" in body or abstract finds fewer documents than expected.

Now, hold your breath... tada: this happens only for the index built one week ago! An index that was built from scratch this Saturday is unaffected.

What is different? The index from last week has been compacted. The Solr release that built that index also had a bug (which should, however, only affect documents that had a synonym in the very first position of the indexed stream; it resulted in those docs being rejected, i.e. not indexed).

Everything else is the same, including the synonyms that are used for index-time tokenization.

The problem is with the search query. The query abstract:"nordic optical" becomes abstract:"nordic syn::optical" and returns:

collection1: 1203 results
collection2: 1170 results

Searching with =abstract:"nordic optical" returns:

collection1: 1203 results
collection2: 1203 results

Searching with abstract:"nordic syn::optical" (this one can only be done from inside Luke, with the whitespace analyzer) returns:

collection2: 1170 results

So for 33 documents (1203 - 1170), it looks like the position of the token syn::optical moved by 1. But I have no way to tell for sure, because we can't reconstruct the documents without term vectors.
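To illustrate the suspicion, here is a minimal Python sketch of how a positional phrase match behaves when a single token's position shifts by one. It is a toy model, not Lucene's implementation: the `phrase_match` helper and the example postings are hypothetical, and it assumes that index-time synonyms are stored at the same position as the token they expand.

```python
from typing import Dict, List

# Toy positional postings for ONE document: token -> positions in the field.
# Assumption: a synonym token is indexed at the same position as the
# original token it was derived from.
Postings = Dict[str, List[int]]


def phrase_match(postings: Postings, first: str, second: str) -> bool:
    """Return True if `second` occurs exactly one position after `first`."""
    second_positions = set(postings.get(second, []))
    return any(pos + 1 in second_positions for pos in postings.get(first, []))


# Healthy document: "optical" and "syn::optical" share position 5.
healthy = {"nordic": [4], "optical": [5], "syn::optical": [5]}

# Hypothetical damaged document: the synonym token has moved by one.
shifted = {"nordic": [4], "optical": [5], "syn::optical": [6]}

for name, doc in (("healthy", healthy), ("shifted", shifted)):
    print(
        name,
        phrase_match(doc, "nordic", "optical"),
        phrase_match(doc, "nordic", "syn::optical"),
    )
```

In this model the shifted document still matches the phrase on the plain token but no longer matches the phrase on the synonym token, which is the same pattern as 1203 versus 1170 on collection2.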

This query: abstract:nordic NEAR1 abstract:optical

collection1: 1205
collection2: 1205

Which is totally confusing! A proximity search only considers tokens that are next to each other, so it is (almost) the same thing as a phrase search. I also tried abstract:"syn::optical nordic" to verify that the tokens were not swapped; that produces 0 results.
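The "(almost)" can be made concrete with a small Python sketch comparing an ordered phrase match with a proximity match at distance 1. It is only a toy model: the helper names are hypothetical, and it assumes that NEAR1 is unordered, which is not confirmed anywhere in this thread.

```python
from typing import Dict, List

# Toy positional postings for ONE document: token -> positions in the field.
Postings = Dict[str, List[int]]


def phrase(postings: Postings, first: str, second: str) -> bool:
    """Ordered match: `second` sits exactly one position after `first`."""
    second_positions = set(postings.get(second, []))
    return any(pos + 1 in second_positions for pos in postings.get(first, []))


def near(postings: Postings, a: str, b: str, distance: int = 1) -> bool:
    """Unordered match (assumption): `a` and `b` at most `distance` apart."""
    return any(
        abs(pos_a - pos_b) <= distance
        for pos_a in postings.get(a, [])
        for pos_b in postings.get(b, [])
    )


# Hypothetical document whose text reads "... optical nordic ...".
reversed_doc = {"optical": [7], "nordic": [8]}

print(phrase(reversed_doc, "nordic", "optical"))  # ordered phrase
print(near(reversed_doc, "nordic", "optical"))    # unordered proximity
```

Under that assumption, a document containing the two words in reverse order matches the proximity query but not the phrase query, which might account for a small difference such as 1205 versus 1203.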

At this point, suspicion falls on core optimization. To verify this theory, we'll have to repeat the same action. But we need to wait until a new core is built, since we don't want to screw up production (which works and is producing correct results).

@romanchyla
Contributor Author

A bit of debugging info:

  1. ssh -Y adsqb
  2. download luke and extract
  3. cd /proj.adsqb/var/lib/docker/volumes/backoffice_prod_montysolr_engine_data/_data/luke....
  4. ./luke.sh and open the index

@romanchyla
Contributor Author

To verify (or to find the missing documents):

=body:"nordic optical" NOT body:"nordic optical"

If everything works as expected, the query must return 0 docs (collection2 returns 224 docs right now; collection1 returns 0).
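The same check can be scripted for other fields and phrases. Below is a minimal Python sketch; `missing_docs_query` and `missing_docs` are hypothetical helpers, the document ids are made up, and how the query string is sent to the search engine is deliberately left out.

```python
from typing import Set


def missing_docs_query(field: str, phrase: str) -> str:
    """Build the check query: hits of the '=' form minus hits of the plain form."""
    quoted = f'{field}:"{phrase}"'
    return f"={quoted} NOT {quoted}"


def missing_docs(exact_ids: Set[str], expanded_ids: Set[str]) -> Set[str]:
    """Ids found by the '=' query but not by the plain (rewritten) query."""
    return exact_ids - expanded_ids


print(missing_docs_query("body", "nordic optical"))

# Made-up ids, only to show the set difference; a healthy index yields an empty set.
exact = {"doc1", "doc2", "doc3"}
expanded = {"doc1", "doc3"}
print(sorted(missing_docs(exact, expanded)))
```

The second helper is useful when the two queries are run separately and only the id lists are available, for example when comparing collection1 against collection2.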
