Named character references in HTML5 kill parsing #9

Open
kaefer3000 opened this issue Jan 8, 2015 · 2 comments

@kaefer3000

Hi,

While parsing http://schema.org/Person with rapper, I stumbled upon the following problem, which I guess is caused by librdfa. The page contains named character references (see the HTML5 spec) such as &nbsp;. Because HTML5 does not come with a DTD, any reference beyond the predefined XML entities is considered undefined, and parsing stops with an error:
rapper: Error - - XML parser error: Entity 'nbsp' not defined
These references are allowed in HTML5 and are common in the wild, so it would be great if librdfa could support them.
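
For illustration, the same class of failure can be reproduced outside rapper. The sketch below uses Python's bundled expat-based parser, not raptor or librdfa, and the helper name `parses_as_xml` is made up for this example. It only shows that a DTD-less XML parse accepts a predefined entity but rejects `&nbsp;`.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup: str) -> bool:
    """Return True if the markup is accepted by a plain XML parser (no DTD)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Raised, among other things, for entities the parser does not know.
        return False


# &amp; is one of the entities XML predefines, so this parses.
print(parses_as_xml("<p>a &amp; b</p>"))  # True
# &nbsp; is only defined by HTML, so a DTD-less XML parse rejects it.
print(parses_as_xml("<p>a&nbsp;b</p>"))   # False
```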

Cheers,

Tobias

@dajobe
Member

dajobe commented Jan 8, 2015

There isn't really any work going on for librdfa. I mostly fix minor bugs and portability issues for raptor.

@kaefer3000
Author

Ah, OK, but maybe it is not that big a deal to fix. For example, there is a JSON file listing all the named character references; maybe this information could be supplied to the XML parser, or the parser could be told to ignore this kind of error. Also, just changing the doctype to HTML4 takes the parser from 11 to 64 triples (out of the 239 that pyrdfa extracts).
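
One way to picture the "supply the table to the parser" idea is a preprocessing pass that rewrites named references into numeric ones before the XML parser sees them. The following is only a sketch under stated assumptions: it is written in Python rather than against librdfa's C code, it uses the standard library's `html.entities.html5` table as a stand-in for the JSON file mentioned above, and the function name `replace_named_references` is hypothetical. The plain regex also does not skip comments or CDATA sections.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import html5  # maps names such as "nbsp;" to their characters

# Entities that XML itself predefines; these are left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches references of the form &name; (hypothetical, deliberately simple).
_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def replace_named_references(markup: str) -> str:
    """Rewrite HTML5 named character references as numeric references."""

    def to_numeric(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)
        expansion = html5.get(name + ";")
        if expansion is None:
            # Unknown name: leave it so the parser still reports the error.
            return match.group(0)
        # Some names expand to more than one code point.
        return "".join(f"&#x{ord(ch):X};" for ch in expansion)

    return _NAMED_REF.sub(to_numeric, markup)


cleaned = replace_named_references("<p>a&nbsp;b &amp; c</p>")
print(cleaned)  # <p>a&#xA0;b &amp; c</p>
# The rewritten markup is now accepted by a DTD-less XML parser.
print(repr(ET.fromstring(cleaned).text))  # 'a\xa0b & c'
```

Numeric references need no DTD, so this approach leaves the parser configuration alone; the trade-off is an extra pass over the input before parsing.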
