similar() + score doesn't produce the same results as topn(x, similar(), "score desc") #185

Open
romanchyla opened this issue Dec 23, 2021 · 3 comments


@romanchyla
Contributor

(base) aaccomazzilap5:~ aaccomazzi$ alias curlads
alias curlads='curl -H '\''Authorization: Bearer:TOKEN'\'''
(base) aaccomazzilap5:~ aaccomazzi$ curlads 'https://ui.adsabs.harvard.edu/v1/search/query?fl=bibcode&p_=0&q=similar(%22solar%20wind%22%20SWEAP)&rows=500&sort=score%20desc%2C%20bibcode%20desc' > curl-similar.payload && curlads 'https://ui.adsabs.harvard.edu/v1/search/query?fl=bibcode&p_=0&q=topn(500%2C%20similar(%22solar%20wind%22%20SWEAP)%2C%20%22score%20desc%2C%20bibcode%20desc%22)&rows=500&sort=score%20desc%2C%20bibcode%20desc' > curl-topn.payload
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 25400  100 25400    0     0  59905      0 --:--:-- --:--:-- --:--:-- 59905
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 25439  100 25439    0     0   8756      0  0:00:02  0:00:02 --:--:--  8753
(base) aaccomazzilap5:~ aaccomazzi$ jq '{ bibcode: [ .response.docs[].bibcode ]}' < curl-similar.payload | sort > curl-similar.bib
(base) aaccomazzilap5:~ aaccomazzi$ jq '{ bibcode: [ .response.docs[].bibcode ]}' < curl-topn.payload | sort > curl-topn.bib
(base) aaccomazzilap5:~ aaccomazzi$ diff curl-similar.bib curl-topn.bib | wc -l
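The same comparison can be done on the bibcodes as sets rather than on sorted lines of pretty-printed JSON. This is only a minimal sketch: the payloads below are made up, and the `response.docs[].bibcode` layout is taken from the jq filter used above.

```python
import json


def bibcodes(payload_text):
    """Return the set of bibcodes found in a Solr-style JSON response."""
    data = json.loads(payload_text)
    return {doc["bibcode"] for doc in data["response"]["docs"]}


def compare(similar_text, topn_text):
    """Split the bibcodes into shared ones and ones unique to each query."""
    a, b = bibcodes(similar_text), bibcodes(topn_text)
    return {"common": a & b, "only_similar": a - b, "only_topn": b - a}


# Hypothetical payloads standing in for curl-similar.payload / curl-topn.payload.
similar_payload = json.dumps(
    {"response": {"docs": [{"bibcode": "A"}, {"bibcode": "B"}, {"bibcode": "C"}]}}
)
topn_payload = json.dumps(
    {"response": {"docs": [{"bibcode": "A"}, {"bibcode": "B"}, {"bibcode": "D"}]}}
)

result = compare(similar_payload, topn_payload)
print(len(result["common"]), sorted(result["only_similar"]), sorted(result["only_topn"]))
```

With the real files, the two strings would be read from disk instead of being built inline.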

1. Using a stable sort (e.g. bibcode or classic_score) produces the same results for both queries.
2. With millions of documents scored using floating-point calculations, we can expect out-of-order results (even when breaking ties on bibcode); in the first 500 hits I see 24 docs with the same score (score desc without bibcode).
3. Comparing topn(...., "score desc") with similar(....)&sort=score+desc: the first 19 docs have the same scores, but the 20th has a different one, by quite a large delta (>0.001), so that is not a floating-point issue.
4. Scores are monotonically decreasing (for topn(...) without exterior sort), so that rules out a potential bug in the collector (but that is a weak indication).
5. At this point, I'm suspicious of this: https://github.com/romanchyla/montysolr/blob/8a1871e21004d2a92744265e30740815ee20506e/contrib/adsabs/src/java/org/apache/lucene/search/AbstractSecondOrderCollector.java#L443. The only thing available to us inside the second-order collector is a score (which was produced from score+bibcode higher up), but if those docs produce the same score, we cannot break the tie again.
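To illustrate the tie-breaking concern in the last point, here is a small sketch with made-up documents. It assumes, purely for illustration, that a collector which sees only the score resolves ties by document id (first seen wins); that is not a claim about what AbstractSecondOrderCollector actually does.

```python
# (docid, bibcode, score) -- hypothetical values, three docs tie on score 1.0
docs = [
    (0, "2001A", 2.0),
    (1, "2002B", 1.0),
    (2, "2003C", 1.0),
    (3, "2004D", 1.0),
]


def top_by_score_then_bibcode(docs, n):
    """Outer sort: score desc, then bibcode desc as the tie-breaker."""
    ranked = sorted(docs, key=lambda d: (d[2], d[1]), reverse=True)
    return [d[1] for d in ranked[:n]]


def top_by_score_only(docs, n):
    """Collector that knows only the score; ties fall back to docid (assumed)."""
    ranked = sorted(docs, key=lambda d: (-d[2], d[0]))
    return [d[1] for d in ranked[:n]]


print(top_by_score_then_bibcode(docs, 2))
print(top_by_score_only(docs, 2))
```

When the tie straddles the cutoff, the two functions keep different documents even though every score is identical, which is the kind of divergence a later bibcode sort cannot undo.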

Unfortunately, I can't get debug output for topn: there is a bug resulting in a NullPointerException that I have to fix first, before I can really figure out why the scores are different.

@aaccomazzi
Member

One additional observation: our score is a function of the lucene score and the boost factor. Is it possible that topn uses just the lucene score in truncating the list? This would explain the different lists of papers. The topn list has papers with a cumulative count of 5,300 citations whereas the list selected from the 500 top similar papers has 52,000!

@romanchyla
Contributor Author

@aaccomazzi your intuition is correct. Because of how the custom scoring is implemented, lucene_score + (cite_read_boost + AA constant), we'll be getting different values even if both calls (i.e. topn and the top-level custom rescoring) use the same parameters; they are doing different things.
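A toy example of why the two calls select different papers: truncating on the lucene score alone keeps a different set than truncating on the combined score. All numbers, the paper names, and the value of the constant are made up for illustration; only the shape of the formula comes from the sentence above.

```python
CONSTANT = 0.5  # assumed value for the constant term, for illustration only

# bibcode -> (lucene_score, cite_read_boost), hypothetical values
docs = {
    "paperA": (3.0, 0.0),
    "paperB": (2.5, 0.1),
    "paperC": (2.0, 0.9),
    "paperD": (1.5, 1.8),
}


def custom_score(lucene_score, cite_read_boost):
    """Combined score: lucene_score + (cite_read_boost + constant)."""
    return lucene_score + (cite_read_boost + CONSTANT)


def topn_by_lucene(docs, n):
    """Truncate using the lucene score only."""
    return sorted(docs, key=lambda b: docs[b][0], reverse=True)[:n]


def topn_by_custom(docs, n):
    """Truncate using the combined score."""
    return sorted(docs, key=lambda b: custom_score(*docs[b]), reverse=True)[:n]


print(topn_by_lucene(docs, 2))
print(topn_by_custom(docs, 2))
```

Here the highly boosted paperD never makes the lucene-only cut, which matches the observation that the topn list had far fewer citations.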

It took me this long to fix the underlying bugs in the second-order collectors; these bugs only affected debug output, but they were quite difficult to identify. I've then figured out how to modify topn(); tailor is deploying it to DEV as I write this, but I'm not yet ready to commit to using this in prod.

Also to note: in DEV, topn() is wrapped by custom, so we'll be doing the computation twice; because of the rescoring, we won't get the same order as when it is done once.

But we'll be able to test...

@romanchyla
Contributor Author

placeholder: I'm going to include the customized topn() in the next release - but it is not yet the ideal solution

it now uses the custom ADS scoring formula, custom(SecondOrderQuery(title:foo, collector=SecondOrderCollectorTopN(2)), sum(float(cite_read_boost),const(0.5))), where previously it used only the lucene score: SecondOrderQuery(title:foo, collector=SecondOrderCollectorTopN(2))

the trouble is: docs get rescored twice (and change order) -- I have to solve this differently
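A rough sketch of the double-rescoring problem, under the assumption that each wrapping pass adds the boost term again (that is one reading of "rescored twice", not a description of the actual implementation). The documents and values are made up.

```python
CONSTANT = 0.5  # const(0.5) from the formula above

# name -> (lucene_score, cite_read_boost), hypothetical values
docs = {
    "docX": (3.0, 0.0),
    "docY": (2.0, 0.8),
}


def rescore(score, cite_read_boost):
    """One rescoring pass: add the boost term to whatever score comes in."""
    return score + (cite_read_boost + CONSTANT)


def ranking(docs, passes):
    """Order the docs after applying the rescoring `passes` times."""
    scores = {}
    for name, (lucene_score, boost) in docs.items():
        score = lucene_score
        for _ in range(passes):
            score = rescore(score, boost)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


print(ranking(docs, 1))
print(ranking(docs, 2))
```

After one pass docX leads; after two passes the boost has been counted twice and docY overtakes it, so the final order depends on how many times the formula is applied.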


3 participants