LaTeX in titles is wrongly escaped #2135

Open
michamos opened this issue Mar 23, 2021 · 5 comments

Comments

@michamos

michamos commented Mar 23, 2021

While trying to improve the handling of LaTeX in arXiv titles in INSPIRE, I checked how you do things and noticed that you escape backslashes incorrectly, producing invalid LaTeX. (These titles are basically unstructured strings that may or may not contain LaTeX macros, which is a problem when you need to make sense of them in BibTeX/LaTeX citation snippets.)

Apologies if this is not the right place to report this bug; I know nothing about your architecture and this repo seemed the most active.

Expected Behavior

The BibTeX output would contain something like

title = "{Measurement of the $\Sigma$ beam asymmetry for the $\omega$ photo-production off the proton and the neutron at GRAAL}"

in order to get a compilable title (Greek letters are not allowed outside of math mode by default). That's very hard to achieve as you'd need to somehow interpret the title. A valid fix to your current approach would be

title = "{Measurement of the \textbackslash{}Sigma\textbackslash{} beam asymmetry for the \textbackslash{}omega\textbackslash{} photo-production off the proton and the neutron at GRAAL}"

Note the addition of {} after the inserted macros to make sure they're not glued to the next word and spaces after the macro don't get swallowed.

Actual Behavior

The BibTeX output contains

title = "{Measurement of the \textbackslashSigma\textbackslash beam asymmetry for the \textbackslashomega\textbackslash photo-production off the proton and the neutron at GRAAL}"

but \textbackslashSigma is not a valid macro, and \textbackslash beam eats the space, producing \beam in the output.
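The difference between the two outputs can be sketched in a few lines of Python. This is only an illustration of the string substitution, not the ADS export code; the function names `escape_naive` and `escape_with_braces` are hypothetical, and a real escaper would also have to handle the other LaTeX special characters.

```python
def escape_naive(text: str) -> str:
    # Reproduces the reported behavior: the macro name runs into the
    # letters that follow it, and a following space gets swallowed.
    return text.replace("\\", r"\textbackslash")


def escape_with_braces(text: str) -> str:
    # The empty group {} terminates the macro name and protects
    # any space that comes right after it.
    return text.replace("\\", r"\textbackslash{}")


title = r"the \Sigma\ beam asymmetry"
print(escape_naive(title))        # the \textbackslashSigma\textbackslash beam asymmetry
print(escape_with_braces(title))  # the \textbackslash{}Sigma\textbackslash{} beam asymmetry
```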

Steps to Reproduce

Go to https://ui.adsabs.harvard.edu/abs/2013arXiv1306.5943V/exportcitation and look at the title in the BibTeX snippet.

@marblestation
Contributor

I'm not sure whether this user-reported problem is actually an export problem or a data problem. @aaccomazzi @golnazads

@golnazads

@marblestation I am adding an author format to export today. I'll look into whether I can fix this as well. Good timing bringing this up.

@michamos
Author

michamos commented May 10, 2021

Hi @marblestation,

(I'm not really a user, I'm part of the team running INSPIRE, and was looking at how you're doing things because we had similar issues). I think you're asking a very good question, and I think it's actually both a data and an export problem.

The export problem is that your LaTeX escaping is incorrect: it generates undefined macro names because the inserted macro is not separated from the next LaTeX token.

The root data problem comes from arXiv, where there is no guarantee about whether the titles (and other fields such as abstracts or comments) contain LaTeX macros outside of math mode. They officially support a limited number of escape sequences to compensate for their lack of Unicode support (such as \"o to write ö as in Schrödinger) but they actually support more (such as Greek letters, so \mu gets rendered as μ on the arXiv splash page, even though out-of-the-box LaTeX will refuse to compile that, as Greek macros are allowed only in math mode). On top of that, many records use TeX macros outside of math mode to convey some information, even if they don't render nicely on the arXiv side. That's not an issue there, but it becomes an issue for downstream services such as INSPIRE or ADS, which offer LaTeX-based export formats and need to know whether a title is valid LaTeX to decide whether to escape it (as you're trying to do) or simply pass it through.

FYI, the strategy we've adopted is two-fold. During harvesting, we try to decode LaTeX macros outside of math mode when that's possible without too much loss (we're using a suitably configured pylatexenc for that). The untranslated bits might be valid macros, but there's no way to know, so we use pylatexenc again to encode the whole thing when generating the LaTeX export formats.
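To make the two phases concrete, here is a toy, standard-library-only sketch. It does not use pylatexenc and is not INSPIRE's code: the `GREEK` table, the function names, and the naive `$...$` detection are all assumptions made for illustration, and the export step escapes only backslashes and the few known Greek letters.

```python
import re

# Hypothetical, tiny translation table; a real library covers far more macros.
GREEK = {"Sigma": "Σ", "omega": "ω", "mu": "μ"}
REVERSE = {char: name for name, char in GREEK.items()}

_MATH = re.compile(r"(\$[^$]*\$)")       # naive $...$ detection (assumption)
_MACRO = re.compile(r"\\([A-Za-z]+| )")  # \name, or the control space "\ "


def _outside_math(text, transform):
    # re.split with a capturing group puts the math spans at odd indices.
    parts = _MATH.split(text)
    return "".join(p if i % 2 else transform(p) for i, p in enumerate(parts))


def decode_at_harvest(title):
    def replace(match):
        name = match.group(1)
        if name == " ":
            return " "  # control space becomes a plain space
        return GREEK.get(name, match.group(0))  # unknown macros stay as-is
    return _outside_math(title, lambda s: _MACRO.sub(replace, s))


def encode_at_export(title):
    def escape(segment):
        out = []
        for char in segment:
            if char == "\\":
                out.append(r"\textbackslash{}")  # leftover, untranslated backslash
            elif char in REVERSE:
                out.append("$\\" + REVERSE[char] + "$")  # Greek goes into math mode
            else:
                out.append(char)
        return "".join(out)
    return _outside_math(title, escape)


raw = r"the \Sigma\ beam asymmetry for the \omega\ photo-production \foo"
stored = decode_at_harvest(raw)
print(stored)                    # the Σ beam asymmetry for the ω photo-production \foo
print(encode_at_export(stored))  # the $\Sigma$ beam asymmetry for the $\omega$ photo-production \textbackslash{}foo
```

The unknown `\foo` survives harvesting untouched and is escaped only at export time, which mirrors the "might be valid macros but there's no way to know" case described above.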

Let me know if you have questions, we've thought about this issue quite a lot and have gone through several iterations before landing on this solution, see inspirehep/hepcrawl#299 for more info and test cases.

@golnazads

golnazads commented May 10, 2021 via email

@aaccomazzi
Member

The actual text in the title field is this: Measurement of the \Sigma\ beam asymmetry for the \omega\ photo-production off the proton and the neutron at GRAAL, so these are single backslashes (they appear doubled within JSON since the first backslash is an escape).

Note that the data problem comes directly from arXiv, as there are no math mode delimiters in the title surrounding the Greek letters.

But to start, we should do what Micha suggested: outside of math mode, replace a single backslash with \textbackslash{} rather than \textbackslash
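A quick way to confirm that the doubling is only a JSON serialization detail, using Python's standard `json` module (the shortened title and the `"title"` key here are illustrative, not the actual ADS record):

```python
import json

title = r"Measurement of the \Sigma\ beam asymmetry"

# Serializing escapes each backslash, so it shows up doubled in the JSON text.
serialized = json.dumps({"title": title})
print(serialized)  # {"title": "Measurement of the \\Sigma\\ beam asymmetry"}

# Parsing restores the original single backslashes.
restored = json.loads(serialized)["title"]
print(restored == title)  # True
```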
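A minimal sketch of that rule in Python, assuming math mode is delimited only by single `$...$` pairs (it does not handle `\$`, `\(...\)`, or unbalanced dollar signs); the name `escape_outside_math` is hypothetical and this is not the actual ADS export code:

```python
import re

# Capturing group so that re.split keeps the math spans in the result.
_MATH = re.compile(r"(\$[^$]*\$)")


def escape_outside_math(text: str) -> str:
    parts = _MATH.split(text)
    out = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            # Odd indices are the $...$ spans: pass them through untouched.
            out.append(part)
        else:
            # Everywhere else, a backslash becomes \textbackslash{}.
            out.append(part.replace("\\", r"\textbackslash{}"))
    return "".join(out)


print(escape_outside_math(r"the $\Sigma$ beam and \omega\ photo"))
# the $\Sigma$ beam and \textbackslash{}omega\textbackslash{} photo
```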
