(base) aaccomazzilap5:~ aaccomazzi$ alias curlads
alias curlads='curl -H '\''Authorization: Bearer:TOKEN'\'''
(base) aaccomazzilap5:~ aaccomazzi$ curlads 'https://ui.adsabs.harvard.edu/v1/search/query?fl=bibcode&p_=0&q=similar(%22solar%20wind%22%20SWEAP)&rows=500&sort=score%20desc%2C%20bibcode%20desc' > curl-similar.payload && curlads 'https://ui.adsabs.harvard.edu/v1/search/query?fl=bibcode&p_=0&q=topn(500%2C%20similar(%22solar%20wind%22%20SWEAP)%2C%20%22score%20desc%2C%20bibcode%20desc%22)&rows=500&sort=score%20desc%2C%20bibcode%20desc' > curl-topn.payload
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 25400 100 25400 0 0 59905 0 --:--:-- --:--:-- --:--:-- 59905
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 25439 100 25439 0 0 8756 0 0:00:02 0:00:02 --:--:-- 8753
(base) aaccomazzilap5:~ aaccomazzi$ jq '{ bibcode: [ .response.docs[].bibcode ]}' < curl-similar.payload | sort > curl-similar.bib
(base) aaccomazzilap5:~ aaccomazzi$ jq '{ bibcode: [ .response.docs[].bibcode ]}' < curl-topn.payload | sort > curl-topn.bib
(base) aaccomazzilap5:~ aaccomazzi$ diff curl-similar.bib curl-topn.bib | wc -l
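The jq/sort/diff pipeline above compares sorted lines of pretty-printed JSON, so it shows whether the two outputs differ but not how the rankings differ (and the last array element has no trailing comma, which can add a spurious differing line). A minimal Python sketch of a more direct comparison follows. It assumes the payloads have the `response.docs[].bibcode` shape used in the jq filter; the function names and the sample payloads are made up for illustration.

```python
import json

def bibcodes(payload_text):
    """Return the bibcodes, in result order, from a Solr-style JSON response."""
    return [doc["bibcode"] for doc in json.loads(payload_text)["response"]["docs"]]

def compare(first, second):
    """Summarise how two ranked bibcode lists differ."""
    a, b = set(first), set(second)
    return {
        "common": len(a & b),            # bibcodes present in both lists
        "only_first": sorted(a - b),     # returned by the first query only
        "only_second": sorted(b - a),    # returned by the second query only
        "same_order": first == second,   # True only if the rankings are identical
    }

# Hypothetical miniature payloads standing in for the two curl outputs.
similar = json.dumps({"response": {"docs": [{"bibcode": "A"}, {"bibcode": "B"}, {"bibcode": "C"}]}})
topn = json.dumps({"response": {"docs": [{"bibcode": "B"}, {"bibcode": "A"}, {"bibcode": "D"}]}})

print(compare(bibcodes(similar), bibcodes(topn)))
```

With the real files, the two strings would instead be read from curl-similar.payload and curl-topn.payload.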
1. using a stable sort (e.g. bibcode, or classic_score) produces the same results for both queries
2. with millions of documents scored, and scores computed in floating point, we can expect out-of-order results (even when breaking ties on bibcode); in the first 500 hits I see 24 docs with the same score (score desc without bibcode)
3. comparing topn(...., "score desc") with similar(....)&sort=score+desc: the first 19 docs have the same scores, but the 20th has a different one, by quite a large delta (>0.001), so that is not a floating-point effect
4. scores are monotonically decreasing (for topn(...) without exterior sort); so that rules out a potential bug in the collector (but that is a weak indication)
5. at this point, I'm suspicious of this: https://github.com/romanchyla/montysolr/blob/8a1871e21004d2a92744265e30740815ee20506e/contrib/adsabs/src/java/org/apache/lucene/search/AbstractSecondOrderCollector.java#L443 ; the only thing we have available to us inside the second order collector is a score (which was produced from score+bibcode higher up), so if two docs produce the same score, we cannot break the tie again
unfortunately, for topn I can't get debug output; there is a bug resulting in a NullPointerException that I have to fix first, before I can really figure out why the scores are different
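The tie-breaking concern in the list above can be illustrated with a small Python sketch. This is not the collector's actual code; the documents, scores and the `top_n` helper are invented. It only shows that when a top-n cut falls inside a group of equal scores, sorting on the score alone leaves the outcome dependent on the order in which documents arrive, while a secondary key makes it deterministic.

```python
def top_n(docs, n, key):
    """Sort docs in descending order of `key` and keep the first n bibcodes."""
    return [d["bibcode"] for d in sorted(docs, key=key, reverse=True)[:n]]

# Made-up documents: three of them share one score.
DOCS = [
    {"bibcode": "2001A", "score": 0.9},
    {"bibcode": "2002B", "score": 0.5},
    {"bibcode": "2003C", "score": 0.5},
    {"bibcode": "2004D", "score": 0.5},
]

def score_only(d):
    return d["score"]

def score_then_bibcode(d):
    return (d["score"], d["bibcode"])

# Score alone: Python's sort is stable, so ties keep their arrival order
# and the cut depends on how the documents happened to be collected.
print(top_n(DOCS, 2, score_only))                       # ['2001A', '2002B']
print(top_n(list(reversed(DOCS)), 2, score_only))       # ['2001A', '2004D']

# Score plus bibcode: the same answer regardless of arrival order.
print(top_n(DOCS, 2, score_then_bibcode))                  # ['2001A', '2004D']
print(top_n(list(reversed(DOCS)), 2, score_then_bibcode))  # ['2001A', '2004D']
```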
One additional observation: our score is a function of the lucene score and the boost factor. Is it possible that topn uses just the lucene score in truncating the list? This would explain the different lists of papers. The topn list has papers with a cumulative count of 5,300 citations whereas the list selected from the 500 top similar papers has 52,000!
@aaccomazzi your intuition is correct. Because of how the custom scoring is implemented, lucene_score + (cite_read_boost + AA constant), we'll be getting different values even if both calls (i.e. topn and the top-level custom rescoring) use the same parameters; they are doing different things
It took me this long to fix the underlying bugs in the 2nd order collectors; these bugs were only affecting debug output, but were quite difficult to identify. I've then figured out how to modify topn(); tailor is deploying it to DEV as I write this, but I'm not yet ready to commit to using this in prod.
Also to note: in DEV, topn() is wrapped by custom, so we'll be doing the computation twice; because of the rescoring, we won't get the same order as when it is done once
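A minimal sketch of why truncating on the Lucene score and truncating on the combined score select different papers. It assumes the additive form quoted above, lucene_score + (cite_read_boost + constant), with a constant of 0.5; the documents, field names and numbers are invented for illustration and do not come from the index.

```python
def combined(doc, constant=0.5):
    """Assumed custom score: lucene score plus (boost + constant)."""
    return doc["lucene"] + (doc["boost"] + constant)

def truncate(docs, n, score):
    """Keep the ids of the n best docs under the given scoring function."""
    ranked = sorted(docs, key=lambda d: (score(d), d["id"]), reverse=True)
    return [d["id"] for d in ranked[:n]]

# Invented documents: text relevance and citation boost pull in opposite directions.
DOCS = [
    {"id": "low-cited-1", "lucene": 3.0, "boost": 0.0},
    {"id": "low-cited-2", "lucene": 2.9, "boost": 0.1},
    {"id": "high-cited-1", "lucene": 2.5, "boost": 0.9},
    {"id": "high-cited-2", "lucene": 2.4, "boost": 0.8},
]

# Cutting on the Lucene score alone keeps the weakly boosted docs...
print(truncate(DOCS, 2, lambda d: d["lucene"]))  # ['low-cited-1', 'low-cited-2']
# ...while cutting on the combined score keeps the highly boosted ones.
print(truncate(DOCS, 2, combined))               # ['high-cited-1', 'high-cited-2']
```

Any rescoring applied after the first kind of cut can only reorder the documents that survived it, which is consistent with the large gap in cumulative citations reported above.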
placeholder: I'm going to include the customized topn() in the next release - but it is not yet the ideal solution
it now uses the custom ADS scoring formula, custom(SecondOrderQuery(title:foo, collector=SecondOrderCollectorTopN(2)), sum(float(cite_read_boost),const(0.5))), where previously it would only use the Lucene score, SecondOrderQuery(title:foo, collector=SecondOrderCollectorTopN(2))
the trouble is: docs get rescored twice (and change order) -- I have to solve this differently
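The effect of rescoring twice can be sketched in a few lines of Python. This assumes an additive rescoring step, score + (cite_read_boost + 0.5), which is one reading of the formula discussed in this thread; the actual combination used by custom(...) may differ, and the two documents below are invented.

```python
def rescore(score, boost, constant=0.5):
    """One assumed rescoring pass: add (boost + constant) to the incoming score."""
    return score + (boost + constant)

def rank(docs, passes):
    """Apply `passes` rescoring passes to each doc and return ids, best first."""
    scored = []
    for doc_id, lucene, boost in docs:
        score = lucene
        for _ in range(passes):
            score = rescore(score, boost)
        scored.append((score, doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

# (id, lucene score, cite_read_boost), made-up values
DOCS = [("A", 2.0, 0.1), ("B", 1.7, 0.3)]

print(rank(DOCS, 1))  # ['A', 'B']  A scores 2.6, B scores 2.5
print(rank(DOCS, 2))  # ['B', 'A']  A scores 3.2, B scores 3.3
```

The second pass counts the boost twice, so a document with a larger boost can overtake one with a better text match.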