Implement heuristics to handle TeX escaping in input metadata #34

Open
aaccomazzi opened this issue May 16, 2019 · 8 comments

@aaccomazzi

arXiv records delivered via their OAI interface often include TeX formatting which we don't properly manage (see for example 2006math.....12720C or 2017arXiv170607451M). It would be nice to have an option in the parser which applies a 'deTeXing' to the input content.
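
Something like the sketch below could be a starting point for such an option. The helper and accent table are illustrative, not an existing API, and a real implementation would need a much larger table (or a proper library) plus guards against math-mode commands:

```python
import re
import unicodedata

# Illustrative (not exhaustive) map from TeX accent commands
# to Unicode combining characters.
TEX_ACCENTS = {
    "`": "\u0300",   # grave:      \`{e} -> è
    "'": "\u0301",   # acute:      \'{a} -> á
    "^": "\u0302",   # circumflex: \^{o} -> ô
    "~": "\u0303",   # tilde:      \~{n} -> ñ
    '"': "\u0308",   # diaeresis:  \"{u} -> ü
    "c": "\u0327",   # cedilla:    \c{c} -> ç
}

# Matches forms like \`{e}, \'e, \c{c}. Note this would also catch
# math commands such as \cos, so a real heuristic needs more context.
ACCENT_RE = re.compile(r"\\([`'^~\"c])\{?([A-Za-z])\}?")

def detex(text: str) -> str:
    """Replace common TeX accent escapes with precomposed Unicode characters."""
    def repl(match: re.Match) -> str:
        accent, letter = match.groups()
        return unicodedata.normalize("NFC", letter + TEX_ACCENTS[accent])
    return ACCENT_RE.sub(repl, text)

print(detex(r"Colin de Verdi\`{e}re"))  # Colin de Verdière
```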

@mantepse

Shouldn't what arXiv delivers always be LaTeX?

From https://arxiv.org/help/prep:

> Bad character(s)
>
> Our metadata fields only accept ASCII input. Unicode characters should be converted to their TeX equivalents (either through MathJax entry or, for proper names, use the appropriate accents).
>
> A common problem experienced during submission, Bad character(s) in field..., is reporting UTF characters being entered into a field that expects ASCII character entry. This is usually caused by the UTF characters being copied from your pdf viewer and then pasted as such during this step, rather than being entered as ASCII text. The most common culprits are the curved quotation marks (for example “ and ” instead of keyboard entry), long hyphens (— or –), and fi/ff copied as a single character.
>
> If you can't figure it out, type it out.

Just below that:

> Certain TeX accent commands may be used in this field.

@aaccomazzi

Hmmm... There is obviously some inconsistency in how the data gets encoded in their OAI XML interface. For instance, this is how the author "Verdière" is found in https://arxiv.org/abs/1706.09805:

 <dc:creator>Verdi&#xe8;re, Nathalie</dc:creator>

(which leads to the correct unicode string). But this is how the same word appears in the title of https://arxiv.org/abs/1706.07451:

 <dc:title>The Extremal Function and Colin de Verdi\`{e}re Graph Parameter</dc:title>

I'll have to ask our arXiv friends about this...
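
For what it's worth, once the XML is parsed the two cases are easy to tell apart: numeric entities like &#xe8; come back as decoded Unicode, while the TeX form survives as a literal backslash sequence. A rough check (the pattern below is illustrative):

```python
import re

# Backslash followed by an accent character or a command name; illustrative only.
TEX_RESIDUE = re.compile(r"""\\(?:[`'^~"=.]|[A-Za-z]+)""")

def has_tex_residue(field: str) -> bool:
    """True if a metadata field still contains TeX escapes after XML decoding."""
    return bool(TEX_RESIDUE.search(field))

print(has_tex_residue("Verdière, Nathalie"))            # False (already Unicode)
print(has_tex_residue(r"Colin de Verdi\`{e}re Graph"))  # True  (raw TeX)
```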

@erickpeirson

Hey folks; hopefully I can clear this up. Pinging @mhl10 in case I say false things (he has the most recent direct contact with this part of the code-base).

TeX-isms are stored in the original metadata as provided by the submitter. In many places, when we render metadata we perform a TeX-to-UTF8 conversion. The translation tables and routines can be found here: https://github.com/arXiv/arxiv-base/blob/74f8019f4eb71bb210ee442f0f54aee84e881e35/arxiv/util/tex2utf.py . This is part of the arxiv-base package on PyPI; feel free to use this however you like.
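
Usage is roughly as follows (going from memory here, so treat the exact call as approximate; the import path follows the file linked above):

```python
# pip install arxiv-base
from arxiv.util.tex2utf import tex2utf

print(tex2utf(r"Colin de Verdi\`{e}re"))  # expected: Colin de Verdière
```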

So what you're seeing in the OAI XML is the translated and encoded form.

It looks like you're looking at the RDF embedded on the abs page; it appears that we are simply not performing any TeX-to-UTF8 translation when rendering that RDF. See https://github.com/arXiv/arxiv-browse/blob/02ac45d21ea86e762e13d025db7b15b80cb9874f/browse/templates/abs/trackback_rdf.html . This might actually be a bug.

@aaccomazzi

Thanks for the explanation @erickpeirson. We don't use the embedded RDF data, just the OAI metadata, and I still see an inconsistency in the way the title and abstract are converted to UTF8 in your web pages (https://arxiv.org/abs/1706.07451) vs. the OAI records (http://export.arxiv.org/oai2?verb=GetRecord;identifier=oai%3AarXiv%2Eorg%3A1706%2E07451;metadataPrefix=oai%5Fdc).

So it looks to me like your OAI records contain some UTF8-converted fields (authors) and some non-converted ones (title and abstract), but this is based on me looking at just a few records.

@erickpeirson

erickpeirson commented Aug 6, 2019

Ah, sorry, I misinterpreted the abs links. Yes, it appears that in the OAI2 interface (and I'm mucking through some old Perl here, so bear with me), we are performing TeX-to-UTF8 conversion on the authors, but only verifying that the title and abstract are valid UTF8 (not converting TeX-isms).

Edit: the Python translation tables/routines that I linked above were ported from the Perl functions that perform TeX-to-UTF8 conversion on the authors/affiliations in the legacy OAI2 interface. So that should be an accurate indication of what we are doing there.

@aaccomazzi

Thanks again, Erick, for the pointer. Last question: since we're still processing this content via our legacy (Perl) system at the moment, is the Perl code in question also available?

@mhl10

mhl10 commented Aug 7, 2019

Good morning @aaccomazzi. The repo that includes that Perl code is private, but I'll email you the relevant packages directly.

@aaccomazzi

The ADS Perl legacy code has now been updated to perform TeX-to-UTF8 conversion for titles and abstracts, and all previously ingested arXiv metadata has been reprocessed. Since Classic-to-SOLR is the primary pipeline for this content into the ADS system, TeX encoding errors should now be gone.

Still to be completed is the Python implementation, but this should be trivial thanks to https://github.com/arXiv/arxiv-base/blob/master/arxiv/util/tex2utf.py
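
A minimal sketch of that step (the record structure and field names here are hypothetical, and the import assumes the arxiv-base package):

```python
from arxiv.util.tex2utf import tex2utf  # from the arxiv-base package

def detex_record(record: dict) -> dict:
    """Apply TeX-to-UTF8 conversion to the fields arXiv leaves unconverted."""
    # Authors are already converted by arXiv's OAI2 interface;
    # only title and abstract need the treatment.
    for field in ("title", "abstract"):
        if field in record:
            record[field] = tex2utf(record[field])
    return record
```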

@aaccomazzi aaccomazzi self-assigned this Aug 12, 2019