RDFa parser produces unexpected results with CDATA sections and entity references #4

Open
nxg opened this issue Feb 20, 2012 · 0 comments


nxg commented Feb 20, 2012

Consider the example below: an RDFa document which, when parsed, produces the predicates content1, ..., content5.

(This is using raptor rather than librdfa directly, and indeed is copied from raptor bug report 495 at the suggestion of the raptor maintainer; that is, this is a slightly indirect bug report. I hope that's OK.)

The results for tests content1, 2, 4 and 5 are, I think, wrong.

For content1, 2, 4 and 5, the CDATA marked section is simply omitted. Although http://www.w3.org/TR/rdfa-syntax/ doesn't mention CDATA marked sections, there's nothing there that seems to warrant ignoring them.
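For comparison, here is a minimal sketch of how a generic XML parser treats the same markup. It uses Python's standard `xml.etree.ElementTree`, not raptor or librdfa, and the fragment is the content1 paragraph with its RDFa attributes left out for brevity. It only illustrates that a CDATA marked section normally arrives as ordinary character data rather than being dropped; it says nothing about what the RDFa specification requires.

```python
import xml.etree.ElementTree as ET

# The content1 paragraph from the test document, RDFa attributes omitted.
fragment = "<p>content1: <![CDATA[cdata<>&]]> <span>not</span>&amp;&lt;&gt;</p>"

p = ET.fromstring(fragment)

# The CDATA content is delivered as plain character data, merged with
# the text around it.
print(repr(p.text))                # 'content1: cdata<>& '

# The entity references after </span> are resolved to the characters
# they denote.
print(repr(p.find("span").tail))   # '&<>'

# Re-serialising escapes those characters again, so the result is
# still well-formed XML.
print(ET.tostring(p, encoding="unicode"))
# <p>content1: cdata&lt;&gt;&amp; <span>not</span>&amp;&lt;&gt;</p>
```

On this reading, the text "cdata<>&" would be expected to survive into the literal for content1, 2, 4 and 5, escaped wherever the literal is serialised as XML.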

Tests content1, 2 and 5 produce XMLLiteral data which includes both elements and entities. However, in each of the three cases, the Turtle output has the characters denoted by entities (the &<>) appearing literally in the rdf:XMLLiteral, making it not valid XML; i.e., they're not escaped in any way. I can't find anything in either http://www.w3.org/TR/REC-rdf-syntax/ (which I suppose is the definition of rdf:XMLLiteral) or http://www.w3.org/TeamSubmission/turtle/ that spells out what the content of an rdf:XMLLiteral should be, but I would be surprised if invalid XML is allowed. I don't believe this is a (raptor) Turtle serialisation problem, since inspecting the post-parse result programmatically shows that the CDATA marked sections have already been removed, and the &<> are sitting unescaped in a string which should be an XMLLiteral.
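The well-formedness point can be checked mechanically. The sketch below again uses Python's `xml.etree.ElementTree` and a hypothetical helper, `is_well_formed`, that wraps a candidate literal in a dummy root element and tries to parse it. The two strings are approximations modelled on the behaviour described above (CDATA section dropped, &<> left bare) and on the escaped form one might expect; they are not copied from raptor's actual output, and namespace declarations are omitted.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_literal: str) -> bool:
    """Hypothetical helper: True if the string parses as XML content
    once wrapped in a dummy root element."""
    try:
        ET.fromstring("<root>" + xml_literal + "</root>")
        return True
    except ET.ParseError:
        return False

# Approximation of the reported value: CDATA section missing, and the
# characters denoted by the entities left unescaped.
unescaped = "content1:  <span>not</span>&<>"

# Approximation of an escaped value that keeps the CDATA text.
escaped = "content1: cdata&lt;&gt;&amp; <span>not</span>&amp;&lt;&gt;"

print(is_well_formed(unescaped))   # False
print(is_well_formed(escaped))     # True
```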

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML+RDFa 1.0//EN' 'http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd'>
<html xmlns='http://www.w3.org/1999/xhtml' xmlns:ns='urn:ns#' xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<head>
<title property='ns:title'>T</title>
<meta about='' property='ns:abstract' content='Abstract &lt;&gt;&amp;%' />
</head>
<body>
<!-- for cases below, see http://www.w3.org/TR/rdfa-syntax/ Sect. 6.3.1.3 -->
<!-- explicit XMLLiteral @datatype -->
<p property='ns:content1'
   datatype='rdf:XMLLiteral'
   >content1: <![CDATA[cdata<>&]]> <span>not</span>&amp;&lt;&gt;</p>
<!-- no @datatype, presence of elements implies it -->
<p property='ns:content2'
   >content2: <![CDATA[cdata<>&]]> <span>not</span>&#38;&#60;&#62;</p>
<!-- no @datatype, but no XML elements, so plain literal -->
<p property='ns:content3'
   >content3: plain content</p>
<!-- explicit empty @datatype, so interpreted as a plain literal -->
<p property='ns:content4'
   datatype=''
   >content4: <![CDATA[cdata<>&]]> <span>not</span>&amp;&#60;&#62;</p>
<!-- basically same as content2 above -->
<div property='ns:content5'
     ><p>content5: <![CDATA[cdata<>&]]> <span>not</span>&amp;&#60;&#62;</p></div>
</body></html>

And yes, I agree that the world would be a nicer place if CDATA marked sections did not exist.
