Implement heuristics to handle TeX escaping in input metadata #34
Shouldn't what arXiv delivers always be LaTeX? From https://arxiv.org/help/prep:
Just below that:
Hmmm... There is obviously some inconsistency in how the data gets encoded in their OAI XML interface. For instance, this is how the author "Verdière" is found in https://arxiv.org/abs/1706.09805:
(which leads to the correct unicode string). But this is how the same word appears in the title of https://arxiv.org/abs/1706.07451:
I'll have to ask our arXiv friends about this...
Hey folks; hopefully I can clear this up. Pinging @mhl10 in case I say false things (he has the most recent direct contact with this part of the code-base).

TeX-isms are stored in the original metadata as provided by the submitter. In many places, when we render metadata, we perform a TeX-to-UTF8 conversion. The translation tables and routines can be found here: https://github.com/arXiv/arxiv-base/blob/74f8019f4eb71bb210ee442f0f54aee84e881e35/arxiv/util/tex2utf.py. This is part of the arxiv-base package. So what you're seeing in the OAI XML is the translated and encoded form.

It looks like you're looking at the RDF embedded on the abs page; it appears that we are simply not performing any TeX-to-UTF8 translation when rendering that RDF. See https://github.com/arXiv/arxiv-browse/blob/02ac45d21ea86e762e13d025db7b15b80cb9874f/browse/templates/abs/trackback_rdf.html. This might actually be a bug.
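The table-driven conversion described above can be sketched roughly as follows. Everything here is an illustrative stand-in, assuming a simple lookup-table approach; the table entries, regex, and function name are not the actual tex2utf.py code.

```python
# Minimal sketch of table-driven TeX-to-UTF8 conversion, in the spirit of
# arxiv-base's tex2utf.py. The table and regex are illustrative only.
import re

# Map a few TeX accent sequences onto precomposed Unicode characters.
ACCENTED = {
    r"\`e": "è", r"\'e": "é", r'\"o': "ö", r"\^o": "ô", r"\~n": "ñ",
}

def tex2utf(text: str) -> str:
    """Replace simple TeX accent sequences (with optional braces) by UTF-8."""
    # Normalize brace variants like {\`e}, \`{e}, {\`{e}} to the bare \`e form.
    text = re.sub(
        r"\{\\([`'\"^~])\{?([A-Za-z])\}?\}|\\([`'\"^~])\{([A-Za-z])\}",
        lambda m: "\\" + (m.group(1) or m.group(3)) + (m.group(2) or m.group(4)),
        text,
    )
    for tex, uni in ACCENTED.items():
        text = text.replace(tex, uni)
    return text

print(tex2utf(r"Colin de Verdi\`ere"))    # -> Colin de Verdière
print(tex2utf(r"Colin de Verdi{\`e}re"))  # -> Colin de Verdière
```

The real translation tables are much larger (covering macros like `\ss`, `\ae`, math symbols, etc.), but the shape is the same: normalize brace variants, then substitute from a lookup table.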
Thanks for the explanation @erickpeirson. We don't use the embedded RDF data, just the OAI metadata, and I still see an inconsistency in the way title and abstract are converted to UTF8 on your web pages (https://arxiv.org/abs/1706.07451) vs. OAI records (http://export.arxiv.org/oai2?verb=GetRecord;identifier=oai%3AarXiv%2Eorg%3A1706%2E07451;metadataPrefix=oai%5Fdc). So it looks to me like your OAI records contain some UTF8-converted fields (authors) and some non-converted ones (title and abstract), but this is based on me looking at just a few records.
Ah, sorry, I misinterpreted the abs links. Yes, it appears that in the OAI2 interface (and I'm mucking through some old Perl here, so bear with me), we are performing TeX-to-UTF8 conversion on the authors, but only verifying that the title and abstract are valid UTF8 (not converting TeX-isms).

Edit: the Python translation tables/routines that I linked above were ported from the Perl functions that perform TeX-to-UTF8 conversion on the authors/affiliations in the legacy OAI2 interface. So that should be an accurate indication of what we are doing there.
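The asymmetry described here can be illustrated with a small sketch. Both helpers below are hypothetical stand-ins for the legacy Perl routines, assuming a one-entry translation table for the author path and a plain decode for the title path:

```python
# Illustrative contrast between the two OAI2 code paths described above:
# authors get TeX-to-UTF8 conversion; title/abstract only a UTF-8 validity
# check. These helpers are stand-ins, not the arXiv Perl code.

def tex_to_utf8(text: str) -> str:
    # Stand-in translation: the real routine uses full lookup tables.
    return text.replace(r"\`e", "è")

def ensure_valid_utf8(raw: bytes) -> str:
    # Only verifies the bytes decode cleanly; TeX-isms pass through untouched.
    return raw.decode("utf-8", errors="strict")

authors = tex_to_utf8(r"Y. Colin de Verdi\`ere")    # converted
title = ensure_valid_utf8(rb"Graphs (Verdi\`ere)")  # TeX left as-is

print(authors)  # -> Y. Colin de Verdière
print(title)    # -> Graphs (Verdi\`ere)
```

This matches what the ADS side observed: the same TeX sequence comes out converted in the author field but raw in the title.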
Thanks again, Erick, for the pointer. Last question: since we're still processing this content via our legacy (Perl) system at the moment, is the Perl code in question also available?
Good morning @aaccomazzi, the repo that includes that Perl code is private, but I'll email you the relevant packages directly.
The ADS Perl legacy code has now been updated to perform TeX-to-UTF8 conversion for titles and abstracts, and all previously ingested arXiv metadata has been reprocessed. Since Classic-to-SOLR is the primary pipeline for this content into the ADS system, TeX encoding errors should now be gone. Still to be completed is the Python implementation, but this should be trivial thanks to https://github.com/arXiv/arxiv-base/blob/master/arxiv/util/tex2utf.py
arXiv records delivered via their OAI interface often include TeX formatting which we don't handle properly (see for example 2006math.....12720C or 2017arXiv170607451M). It would be nice to have an option in the parser that applies 'deTeXing' to the input content.
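One possible shape for such an opt-in pass is sketched below. The option name `detex_fields`, the `detex` helper, and the tiny translation table are all assumptions for illustration, not existing ADS code:

```python
# Hypothetical sketch of an opt-in 'deTeXing' pass for the parser.
import re

# Tiny illustrative table; a real one would cover far more TeX sequences.
TEX_MAP = {r"\`e": "è", r"\'e": "é", r'\"u': "ü", r"\ss": "ß"}

def detex(text: str) -> str:
    """Strip braces around accent groups, then map TeX sequences to UTF-8."""
    text = re.sub(r"\{(\\[`'\"^~]?[A-Za-z]+)\}", r"\1", text)  # {\`e} -> \`e
    for tex, uni in TEX_MAP.items():
        text = text.replace(tex, uni)
    return text

def parse_record(record: dict, detex_fields: bool = False) -> dict:
    """Parse an OAI record; optionally deTeX the textual fields."""
    if detex_fields:
        for field in ("title", "abstract", "authors"):
            if field in record:
                record[field] = detex(record[field])
    return record

rec = parse_record({"title": r"Sur un th\'eor\`eme de Verdi{\`e}re"},
                   detex_fields=True)
print(rec["title"])  # -> Sur un théorème de Verdière
```

Making the pass opt-in keeps the default behavior byte-for-byte identical to today's ingest, so reprocessing can be rolled out field by field.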