Implement heuristics to handle TeX escaping in input metadata #34

Open
aaccomazzi opened this issue May 16, 2019 · 8 comments

@aaccomazzi

arXiv records delivered via their OAI interface often include TeX formatting which we don't properly manage (see for example 2006math.....12720C or 2017arXiv170607451M). It would be nice to have an option in the parser which applies a 'deTeXing' to the input content.
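
Something like the sketch below could be a starting point for such an option. The helper and accent table are illustrative, not an existing API, and a real implementation would need a much larger table (or a proper library) plus guards against math-mode commands:

```python
import re
import unicodedata

# Illustrative (not exhaustive) map from TeX accent commands
# to Unicode combining characters.
TEX_ACCENTS = {
    "`": "\u0300",   # grave:      \`{e} -> è
    "'": "\u0301",   # acute:      \'{a} -> á
    "^": "\u0302",   # circumflex: \^{o} -> ô
    "~": "\u0303",   # tilde:      \~{n} -> ñ
    '"': "\u0308",   # diaeresis:  \"{u} -> ü
    "c": "\u0327",   # cedilla:    \c{c} -> ç
}

# Matches forms like \`{e}, \'e, \c{c}. Note this would also catch
# math commands such as \cos, so a real heuristic needs more context.
ACCENT_RE = re.compile(r"\\([`'^~\"c])\{?([A-Za-z])\}?")

def detex(text: str) -> str:
    """Replace common TeX accent escapes with precomposed Unicode characters."""
    def repl(match: re.Match) -> str:
        accent, letter = match.groups()
        return unicodedata.normalize("NFC", letter + TEX_ACCENTS[accent])
    return ACCENT_RE.sub(repl, text)

print(detex(r"Colin de Verdi\`{e}re"))  # Colin de Verdière
```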

@mantepse

Shouldn't what arXiv delivers always be LaTeX?

From https://arxiv.org/help/prep:

> Bad character(s)
>
> Our metadata fields only accept ASCII input. Unicode characters should be converted to their TeX equivalents (either through MathJax entry or, for proper names, use the appropriate accents).
>
> A common problem experienced during submission, Bad character(s) in field..., is reporting UTF characters being entered into a field that expects ASCII character entry. This is usually caused by the UTF characters being copied from your pdf viewer and then pasted as such during this step, rather than being entered as ASCII text. The most common culprits are the curved quotation marks (for example “ and ” instead of keyboard entry), long hyphens (— or –), and fi/ff copied as a single character.
>
> If you can't figure it out, type it out.

Just below that:

> Certain TeX accent commands may be used in this field.

@aaccomazzi

Hmmm... There is obviously some inconsistency in how the data gets encoded in their OAI XML interface. For instance, this is how the author "Verdière" is found in https://arxiv.org/abs/1706.09805:

 <dc:creator>Verdi&#xe8;re, Nathalie</dc:creator>

(which leads to the correct unicode string). But this is how the same word appears in the title of https://arxiv.org/abs/1706.07451:

 <dc:title>The Extremal Function and Colin de Verdi\`{e}re Graph Parameter</dc:title>

I'll have to ask our arXiv friends about this...
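
For what it's worth, once the XML is parsed the two cases are easy to tell apart: numeric entities like &#xe8; come back as decoded Unicode, while the TeX form survives as a literal backslash sequence. A rough check (the pattern below is illustrative):

```python
import re

# Backslash followed by an accent character or a command name; illustrative only.
TEX_RESIDUE = re.compile(r"""\\(?:[`'^~"=.]|[A-Za-z]+)""")

def has_tex_residue(field: str) -> bool:
    """True if a metadata field still contains TeX escapes after XML decoding."""
    return bool(TEX_RESIDUE.search(field))

print(has_tex_residue("Verdière, Nathalie"))            # False (already Unicode)
print(has_tex_residue(r"Colin de Verdi\`{e}re Graph"))  # True  (raw TeX)
```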

@erickpeirson

Hey folks; hopefully I can clear this up. Pinging @mhl10 in case I say false things (he has the most recent direct contact with this part of the code-base).

TeX-isms are stored in the original metadata as provided by the submitter. In many places, when we render metadata we perform a TeX-to-UTF8 conversion. The translation tables and routines can be found here: https://github.com/arXiv/arxiv-base/blob/74f8019f4eb71bb210ee442f0f54aee84e881e35/arxiv/util/tex2utf.py . This is part of the arxiv-base package on PyPI; feel free to use this however you like.
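
Usage is roughly as follows (going from memory here, so treat the exact call as approximate; the import path follows the file linked above):

```python
# pip install arxiv-base
from arxiv.util.tex2utf import tex2utf

print(tex2utf(r"Colin de Verdi\`{e}re"))  # expected: Colin de Verdière
```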

So what you're seeing in the OAI XML is the translated and encoded form.

It looks like you're looking at the RDF embedded on the abs page; it appears that we are simply not performing any TeX-to-UTF8 translation when rendering that RDF. See https://github.com/arXiv/arxiv-browse/blob/02ac45d21ea86e762e13d025db7b15b80cb9874f/browse/templates/abs/trackback_rdf.html . This might actually be a bug.

@aaccomazzi

Thanks for the explanation @erickpeirson. We don't use the embedded RDF data, just the OAI metadata, and I still see an inconsistency in the way the title and abstract are converted to UTF8 in your web pages (https://arxiv.org/abs/1706.07451) vs. the OAI records (http://export.arxiv.org/oai2?verb=GetRecord;identifier=oai%3AarXiv%2Eorg%3A1706%2E07451;metadataPrefix=oai%5Fdc).

So it looks to me like your OAI records contain some UTF8-converted fields (authors) and some non-converted ones (title and abstract), but this is based on me looking at just a few records.

@erickpeirson

erickpeirson commented Aug 6, 2019

Ah, sorry, I misinterpreted the abs links. Yes, it appears that in the OAI2 interface (and I'm mucking through some old Perl here, so bear with me), we are performing TeX-to-UTF8 conversion on the authors, but only verifying that the title and abstract are valid UTF8 (not converting TeX-isms).

Edit: the Python translation tables/routines that I linked above were ported from the Perl functions that perform TeX-to-UTF8 conversion on the authors/affiliations in the legacy OAI2 interface. So that should be an accurate indication of what we are doing there.

@aaccomazzi

Thanks again, Erick, for the pointer. Last question: since we're still processing this content via our legacy (Perl) system at the moment, is the Perl code in question also available?

@mhl10

mhl10 commented Aug 7, 2019

Good morning @aaccomazzi. The repo that includes that Perl code is private, but I'll email you the relevant packages directly.

@aaccomazzi

The ADS Perl legacy code has now been updated to perform TeX-to-UTF8 conversion for titles and abstracts, and all previously ingested arXiv metadata has been reprocessed. Since Classic-to-SOLR is the primary pipeline for this content into the ADS system, TeX encoding errors should now be gone.

Still to be completed is the Python implementation, but this should be trivial thanks to https://github.com/arXiv/arxiv-base/blob/master/arxiv/util/tex2utf.py
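
A minimal sketch of that step (the record structure and field names here are hypothetical, and the import assumes the arxiv-base package):

```python
from arxiv.util.tex2utf import tex2utf  # from the arxiv-base package

def detex_record(record: dict) -> dict:
    """Apply TeX-to-UTF8 conversion to the fields arXiv leaves unconverted."""
    # Authors are already converted by arXiv's OAI2 interface;
    # only title and abstract need the treatment.
    for field in ("title", "abstract"):
        if field in record:
            record[field] = tex2utf(record[field])
    return record
```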

@aaccomazzi aaccomazzi self-assigned this Aug 12, 2019