Formats that don't support named graphs serialize `Dataset`s and `ConjunctiveGraph`s with non-default graphs without raising any errors #2393

aucampia · 2023-05-20T16:04:41Z

If RDFLib is used to serialize a Dataset or ConjunctiveGraph that contains non-default graphs [ref] as a format that does not support named graphs (i.e. N-Triples or Turtle) no error is raised, and the output is wrong.

Given this data in test/data/variants/diverse_quads.nq (equivalent to this trig document):

rdflib/test/data/variants/diverse_quads.nq

Lines 1 to 10 in ddcc4eb

    
           <http://example.com/subject> <http://example.com/predicate> "日本語の表記体系"@jpx <example:graph> . 
        
           <urn:example:subject> <example:predicate> <example:subject> <example:graph> . 
        
           <example:subject> <example:predicate> <example:object> <example:graph> . 
        
           <example:subject> <example:predicate> "12"^^<http://www.w3.org/2001/XMLSchema#integer> <example:graph> . 
        
           <example:subject> <example:predicate> <example:object> <urn:example:graph> . 
        
           <example:subject> <http://example.com/predicate> <http://example.com/object> <urn:example:graph> . 
        
           <example:subject> <http://example.com/predicate> "XSD string" <urn:example:graph> . 
        
           <example:subject> <example:predicate> <example:object> . 
        
           <http://example.com/subject> <http://example.com/predicate> "typeless" . 
        
           <urn:example:subject> <urn:example:predicate> <urn:example:object> .

Using rdfpipe to convert it to turtle gives:

$ pipx run --spec git+https://github.com/RDFLib/rdflib.git@main rdfpipe -o turtle test/data/variants/diverse_quads.nq; echo status=$?
⚠️  rdfpipe is already on your PATH and installed at /home/iwana/.local/bin/rdfpipe. Downloading and running anyway.
@prefix ns1: <http://example.com/> .
@prefix ns2: <urn:example:> .
@prefix ns3: <example:> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ns1:subject ns1:predicate "typeless",
        "日本語の表記体系"@jpx .

ns2:subject ns3:predicate ns3:subject ;
    ns2:predicate ns2:object .

ns3:subject ns3:predicate ns3:object,
        12 ;
    ns1:predicate ns1:object,
        "XSD string" .

status=0

And using rdfpipe to convert it to ntriples gives:

$ pipx run --spec git+https://github.com/RDFLib/rdflib.git@main rdfpipe -o ntriples test/data/variants/diverse_quads.nq; echo status=$?
⚠️  rdfpipe is already on your PATH and installed at /home/iwana/.local/bin/rdfpipe. Downloading and running anyway.
/home/iwana/.local/pipx/.cache/bf79805433c3d94/lib64/python3.11/site-packages/rdflib/plugins/serializers/nt.py:40: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
  warnings.warn(
<http://example.com/subject> <http://example.com/predicate> "typeless" .
<example:subject> <http://example.com/predicate> <http://example.com/object> .
<urn:example:subject> <example:predicate> <example:subject> .
<example:subject> <example:predicate> "12"^^<http://www.w3.org/2001/XMLSchema#integer> .
<example:subject> <example:predicate> <example:object> .
<http://example.com/subject> <http://example.com/predicate> "日本語の表記体系"@jpx .
<urn:example:subject> <urn:example:predicate> <urn:example:object> .
<example:subject> <http://example.com/predicate> "XSD string" .
status=0

In both cases, the output is wrong, not just incomplete.

I would say the right behaviour is that an error is raised when a ConjunctiveGraph or Dataset with non-default graphs are serialized using a format that does not support named graphs.

I'm making this issue to get some feedback, but I will make a PR to fix it shortly.

There is a similar issue with riot from Jena [ref], but riots behaviour is not quite as bad

The text was updated successfully, but these errors were encountered:

aucampia · 2023-05-20T16:05:58Z

Please let me know if anyone disagrees with the proposed behaviour:

I would say the right behaviour is that an error is raised when a ConjunctiveGraph or Dataset with non-default graphs are serialized using a format that does not support named graphs.

CC: @RDFLib/core-reviewers

WhiteGobo · 2023-05-20T20:28:52Z

Before wrong output is generated by ConjunctiveGraph.serialize(format=turtle) i would expect an error. Even if Graph.serialize(format=turtle) work.

What would be the correct output? Only the data from the default graph or no data "overwritten" from the default graph? Or is there no (simple) correct turtle output for this conjunctive graph? I ask because i would expect exactly the given output, but i havent worked much with conjunctive graphs.

I lack the correct word for what i mean with overwritten. I mean something like this:

@prefix ns1: <http://example.com/> .
@prefix ns2: <urn:example:> .
@prefix ns3: <example:> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ns1:subject ns1:predicate "typeless".
#The next line would "overwrite" information so it is ignored
#ns1:subject ns1:predicate  "日本語の表記体系"@jpx .

ns2:subject ns2:predicate ns2:object .

ns3:subject ns3:predicate ns3:object;
    ns1:predicate ns1:object,
        "XSD string" .

aucampia · 2023-05-20T21:59:14Z

What would be the correct output? Only the data from the default graph or no data "overwritten" from the default graph? Or is there no (simple) correct turtle output for this conjunctive graph?

When trying to serialize named graphs using a format that doesn't support named graphs (e.g. turtle) I think the only options are:

Error: I think the right option is to raise an Exception, if you are streaming output, there would still be some output, but at least the caller can tell something went wrong.
Incomplete output: This is what Jena's riot is doing [ref] - it only outputs the triples from the default graph. This is not wrong output, but incomplete output, as the output is a subset of quads.
Incorrect output: This is what RDFLib is doing. It serializes triples from named graphs without a graph label/name. Thus creating entirely new triples in the output that do not exist in the input.

Only the data from the default graph or no data "overwritten" from the default graph?
...
ns1:subject ns1:predicate "typeless".
#The next line would "overwrite" information so it is ignored
#ns1:subject ns1:predicate  "日本語の表記体系"@jpx .

RDF does not work like this, there is no ability for new triples to affect previous triples. If the same subject and predicate appears twice in an input document with a different object, then you just get another triple. If you want to prevent this, you would need to use SHACL or something in your ingestion pipeline, but just RDF itself just treats it as another triple.

https://www.w3.org/TR/rdf11-concepts/#section-rdf-graph

An RDF graph is a set of RDF triples.

So, as long as every part of the triple is unique in the set, it is a unique triple, and does not invalidate another triple with the same subject and predicate.

niklasl · 2023-05-20T22:11:52Z

Throwing an error is certainly most explicit. And while I presume it is backwards incompatible, if it is wrong right now, it seems better to error out than to change its behaviour.

It might be good to note how the RDF 1.1 Concepts defines content negotiation of RDF datasets:

If an RDF dataset is returned and the consumer is expecting an RDF graph, the consumer is expected to use the RDF dataset's default graph.

It is understandable why e.g Jena has made that decision, and it warrants consideration. But invoking serialization programmatically is not the same as content negotiation, but rather how you'd implement it, so if the user of the API expects an RDF graph, the user is to use the dataset_or_cg.default_context.

(The error should probably be helpful and suggest that. It might also be helpful to add an alias property named default_graph to Dataset, since "context" is non-standard naming for graphs; but that is a separate issue.)

To reason further,, the "role" of named graphs in a dataset may vary (some use the dataset as a union of graphs, some perhaps as versions of descriptions where only an explicitly chosen subset of them are considered "valid" or "active"). So it makes more sense to force an active choice. It could be useful to make it easy to serialize the union as a stream of triples, but that would be an additional feature.

aucampia · 2023-05-20T22:18:31Z

I posted to the public-rdf-dev mailing list also, https://lists.w3.org/Archives/Public/public-rdf-dev/2023May/0000.html

ghost · 2023-05-21T14:10:01Z

The RDF spec asserts:

RDF datasets MAY be used to express RDF content. When used in this way, a dataset SHOULD be understood to have at least the same content as its default graph.

If the user has specified a context-unaware serialization format, it’s not unreasonable to treat this as intentional and to return the serialized triples of the default graph because the Dataset default graph has no name¹ and can therefore be considered as selected by the specification of a context-unaware serialization format.

Is it perhaps more usefully viewed as an implementation-independent means of specifying serialization of a Dataset’s unnamed default graph?

¹ “a single unnamed, default RDF graph”

aucampia · 2023-05-21T14:41:57Z

If the user has specified a context-unaware serialization format, it’s not unreasonable to treat this as intentional and to return the serialized triples of the default graph because the Dataset default graph has no name¹ and can therefore be considered as selected by the specification of a context-unaware serialization format.

On the other hand, someone could have made a mistake in their code, and did not realize that the format at a specific point is or can be context-unaware. In this case, trying to guess what the user meant would be masking a bug in the users code, where as if they really did mean to serialize only the default context, there would have been an easy way for them to do it explicitly.

Is it perhaps more usefully viewed as an implementation-independent means of specifying serialization of a Dataset’s unnamed default graph?

There are already ways to do this much more explicitly and without ambiguity, which users should use instead:

dataset = Dataset()
## add named graphs and triples
dataset.default_context.serialize(format="turtle")

ghost · 2023-05-21T15:20:51Z

In this case, trying to guess what the user meant

Neither a guess nor an assumption, simply following the direction of the spec.

... an implementation-independent means of specifying serialization of a Dataset’s unnamed default graph?

There are already ways to do this much more explicitly and without ambiguity, which users should use instead:
dataset = Dataset()
## add named graphs and triples
dataset.default_context.serialize(format="turtle")

That's going to break in v7, dataset.default_context will become dataset.default_graph, my point is that the implementation details intrude, so “explicit” is also “brittle”.

aucampia · 2023-05-21T17:35:29Z

Neither a guess nor an assumption, simply following the direction of the spec.

If you are referring to this "RDF datasets may be used to express RDF content. When used in this way, a dataset should be understood to have at least the same content as its default graph." [ref] I'm not sure this applies to the snippet I shared. The "may" in the first sentence normative [ref], meaning it may also not be used to express RDF content. So even if you take the should in the second sentence as normative, you still are assuming the first "may" to be operative, at the very least it should be documented as such.

Is that using an RDF dataset to express RDF content? If that is, what would it look like when it is not being used to express RDF content.

That's going to break in v7, dataset.default_context will become dataset.default_graph, my point is that the implementation details intrude, so “explicit” is also “brittle”.

Version 7 is not released so what will break is undefined. But even if the V7 API does introduce breaking changes (as it should), that does not make the current API brittle, that just means we are using semantic versioning as intended.

The way I see it is: I don't like fragile software. If I ask software to do something, and it can't do it, it should error out, not do it half way, because I did not ask it to do it half way. That is why software interface should be as explicit as possible, I should explicitly be able to ask RDFLib, serialize the whole Dataset. To me, this is what I do when I call Dataset.serialize() - I ask it to serialize the whole dataset, but if there is really a good case why Dataset.serialize() should not mean "serialize the whole dataset" - then I think we should just make antoher method, Dataset.serialize_everything() - which does mean "serialize the whole dataset". I think ambiguities in the spec should not translate to ambiguities in our API.

aucampia · 2023-05-21T17:48:19Z

~~I made a poll, not that it necesarily matters but just to get a sense for preferences:~~

~~https://github.com/RDFLib/rdflib/discussions/2394~~

Never mind, I deleted it, it is a bit weird to have it separate from this, it will just split the conversation more, best that people just respond here with their preference.

aucampia · 2023-05-22T18:04:02Z

I think the right solution here is:

Change ConjunctiveGraph.serialize() to raise an exception if the selected format does not support quads.
Clarify the documentation of serialize() to indicate it is a request to serialize the whole Dataset or ConjunctiveGraph, and not a subset. This anyway seems like the reasonable thing, I don't see why calling serialize on an object should mean something other than serialize the object. If someone wants to serialize a subset of the Dataset or ConjunctiveGraph they should do so explicitly by selecting the exact subset they want, and then serializing it.

To me, this is the right solution because I think there should be a way to request to serialize the whole Dataset explicitly, and I don't see why it should not be Dataset.serialize(). If however someone can make a good argument why this should not be the way to serialize the whole Dataset I'm willing to consider it, but then we also need to select another way to request that the whole Dataset or ConjunctiveGraph be serialized, and we need to then write down what exactly serialize does if not serialize the object it was called on.

- fixes <#2393>

- fixes <RDFLib#2393>

- fixes <RDFLib#2393> checkpoint checkpoint checkpoint

This was referenced May 20, 2023

fix: ConjunctiveGraph.serialize to format as TriG by default (#1674) #2373

Open

Default serialization format for conjunctive graph is turtle, but turtle cannot handle quads (AFAIK) #1674

Open

aucampia added bug Something isn't working serialization Related to serialization. concept: RDF dataset Relates to the RDF datasets concept. labels May 20, 2023

aucampia self-assigned this May 20, 2023

aucampia mentioned this issue May 21, 2023

rdfpipe adds a graph name to triples in the default graph, consequently breaking round tripping #1804

Closed

aucampia added backwards incompatible will affect backwards compatibility, this includes things like breaking our interface interface bug An issue with our public interface labels May 22, 2023

aucampia mentioned this issue May 22, 2023

Triples from the default graph in RDF documents do not go into the default_context of Dataset or ConjunctiveGraph #2404

Closed

aucampia added breaking change This involves or proposes breaking RDFLib's public API. and removed backwards incompatible will affect backwards compatibility, this includes things like breaking our interface labels May 22, 2023

This was referenced Jun 8, 2023

Dataset: NQuads serialization only makes sense for context-aware stores! #2335

Closed

fix: raise an exception when serialization format does not support quads #2428

Closed

aucampia added a commit that referenced this issue Jun 8, 2023

fix: raise an exception when serialization format does not support quads

1745417

- fixes <#2393>

aucampia linked a pull request Jun 8, 2023 that will close this issue

fix: raise an exception when serialization format does not support quads #2430

Draft

aucampia added a commit to aucampia/rdflib that referenced this issue Jul 5, 2023

fix: raise an exception when serialization format does not support quads

0f7d24e

- fixes <RDFLib#2393>

aucampia added a commit to aucampia/rdflib that referenced this issue Aug 17, 2023

fix: raise an exception when serialization format does not support quads

2d709f8

- fixes <RDFLib#2393>

niklasl mentioned this issue Sep 10, 2023

Cataloguing Use Cases From The National Library Of Sweden w3c/rdf-ucr#23

Open

aucampia added a commit to aucampia/rdflib that referenced this issue Sep 24, 2023

fix: raise an exception when serialization format does not support quads

d280f6e

- fixes <RDFLib#2393> checkpoint checkpoint checkpoint

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Formats that don't support named graphs serialize `Dataset`s and `ConjunctiveGraph`s with non-default graphs without raising any errors #2393

Formats that don't support named graphs serialize `Dataset`s and `ConjunctiveGraph`s with non-default graphs without raising any errors #2393

aucampia commented May 20, 2023 •

edited

aucampia commented May 20, 2023

WhiteGobo commented May 20, 2023 •

edited

aucampia commented May 20, 2023 •

edited

niklasl commented May 20, 2023

aucampia commented May 20, 2023

ghost commented May 21, 2023

aucampia commented May 21, 2023 •

edited

ghost commented May 21, 2023

aucampia commented May 21, 2023

aucampia commented May 21, 2023 •

edited

aucampia commented May 22, 2023 •

edited

Formats that don't support named graphs serialize Datasets and ConjunctiveGraphs with non-default graphs without raising any errors #2393

Formats that don't support named graphs serialize Datasets and ConjunctiveGraphs with non-default graphs without raising any errors #2393

Comments

aucampia commented May 20, 2023 • edited

aucampia commented May 20, 2023

WhiteGobo commented May 20, 2023 • edited

aucampia commented May 20, 2023 • edited

niklasl commented May 20, 2023

aucampia commented May 20, 2023

ghost commented May 21, 2023

aucampia commented May 21, 2023 • edited

ghost commented May 21, 2023

aucampia commented May 21, 2023

aucampia commented May 21, 2023 • edited

aucampia commented May 22, 2023 • edited

Formats that don't support named graphs serialize `Dataset`s and `ConjunctiveGraph`s with non-default graphs without raising any errors #2393

Formats that don't support named graphs serialize `Dataset`s and `ConjunctiveGraph`s with non-default graphs without raising any errors #2393

aucampia commented May 20, 2023 •

edited

WhiteGobo commented May 20, 2023 •

edited

aucampia commented May 20, 2023 •

edited

aucampia commented May 21, 2023 •

edited

aucampia commented May 21, 2023 •

edited

aucampia commented May 22, 2023 •

edited