RDFa parser produces unexpected results with CDATA sections and entity references #4

nxg · 2012-02-20T21:41:38Z

Consider the example below: an RDFa document which, when parsed, produces the predicates content1, ..., content5.

(This is using raptor, rather than using librdfa directly, and indeed is copied from raptor bug report 495 at the suggestion of the raptor maintainer; that is, this is a slightly indirect bug report. I hope that's OK.)

The results for tests content1, 2, 4 and 5 are, I think, wrong.

For content1, 2, 4 and 5, the CDATA marked section is simply omitted. Although http://www.w3.org/TR/rdfa-syntax/ doesn't mention CDATA marked sections, there's nothing there that seems to warrant ignoring them.

Tests content1, 2 and 5 produce XMLLiteral data which includes both elements and entities. However, in each of the three cases, the Turtle output has the characters denoted by the entity references (the &<>) appearing literally in the rdf:XMLLiteral, without any escaping, so that the literal is not well-formed XML. I can't find anything, in either http://www.w3.org/TR/REC-rdf-syntax/ (which I suppose is the definition of rdf:XMLLiteral) or http://www.w3.org/TeamSubmission/turtle/, which spells out what the content of an rdf:XMLLiteral should be, but I would be surprised if malformed XML were allowed. I don't believe this is a (raptor) Turtle serialisation problem, since inspecting the post-parse result programmatically shows that the CDATA marked sections have already been removed, and that the &<> are sitting unescaped in a string which should be an XMLLiteral.

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML+RDFa 1.0//EN' 'http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd'>
<html xmlns='http://www.w3.org/1999/xhtml' xmlns:ns='urn:ns#' xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<head>
<title property='ns:title'>T</title>
<meta about='' property='ns:abstract' content='Abstract &lt;&gt;&amp;%' />
</head>
<body>
<!-- for cases below, see http://www.w3.org/TR/rdfa-syntax/ Sect. 6.3.1.3 -->
<!-- explicit XMLLiteral @datatype -->
<p property='ns:content1'
   datatype='rdf:XMLLiteral'
   >content1: <![CDATA[cdata<>&]]> <span>not</span>&amp;&lt;&gt;</p>
<!-- no @datatype, presence of elements implies it -->
<p property='ns:content2'
   >content2: <![CDATA[cdata<>&]]> <span>not</span>&#38;&#60;&#62;</p>
<!-- no @datatype, but no XML elements, so plain literal -->
<p property='ns:content3'
   >content3: plain content</p>
<!-- explicit empty @datatype, so interpreted as a plain literal -->
<p property='ns:content4'
   datatype=''
   >content4: <![CDATA[cdata<>&]]> <span>not</span>&amp;&#60;&#62;</p>
<!-- basically same as content2 above -->
<div property='ns:content5'
     ><p>content5: <![CDATA[cdata<>&]]> <span>not</span>&amp;&#60;&#62;</p></div>
</body></html>
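
For comparison, here is a minimal sketch of the behaviour I would expect, using only Python's standard xml.etree.ElementTree rather than raptor or librdfa. It takes the content1 paragraph from the document above (with the XHTML namespace and RDFa attributes dropped for brevity, which is a simplification) and computes two values: a plain-literal string in which the CDATA text is kept, and an XML serialisation in which &, < and > are escaped again. The names FRAGMENT, plain_value and xml_value are made up for this illustration, and the sketch says nothing about how either library is implemented.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The content1 paragraph from the test document, minus namespace and
# RDFa attributes (a simplification for this sketch).
FRAGMENT = (
    "<p>content1: <![CDATA[cdata<>&]]> "
    "<span>not</span>&amp;&lt;&gt;</p>"
)


def plain_value(xml_text):
    """Concatenate every text node, as one might for a plain literal.

    The XML parser merges CDATA sections into ordinary character data,
    so their text is kept rather than dropped.
    """
    return "".join(ET.fromstring(xml_text).itertext())


def xml_value(xml_text):
    """Serialise the element's content as well-formed XML.

    Leading text is escaped by hand; ET.tostring() escapes the text and
    tail of each child element.
    """
    elem = ET.fromstring(xml_text)
    parts = [escape(elem.text or "")]
    parts.extend(ET.tostring(child, encoding="unicode") for child in elem)
    return "".join(parts)


if __name__ == "__main__":
    print(plain_value(FRAGMENT))
    print(xml_value(FRAGMENT))
```

The point of the second function is only that its result can be parsed again as XML, which is not true of the unescaped string described above.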

And yes, I agree that the world would be a nicer place if CDATA marked sections did not exist.
