Optional skolemize blank nodes on parse #2736

edmondchuc · 2024-03-18T02:24:28Z

I have a use case where I need to preserve the blank node identifiers when loading data into a Graph object. To do this, I'd like an option on the rdflib.Graph.parse method to either provide a custom format (like ntriples-skolem) or a flag on the parse method (skolemize=True) to skolemize blank nodes before adding the statements into the graph.

This is needed because RDF blank nodes are scoped to the local document. As soon as the data is read into a new system (like an RDFLib graph object), each blank node is assigned a new identifier, so there is no guarantee that the original identifiers are preserved.
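To make the scoping issue concrete, here is a toy sketch of the remapping. It only imitates the behaviour described above and is not RDFLib's parser; `load_labels` and the `N` + hex id scheme are assumptions made for illustration.

```python
import uuid


def load_labels(labels, context=None):
    """Map document-local blank node labels to fresh system-local ids.

    Hypothetical helper: each call without a shared context acts like
    loading the document into a new system.
    """
    context = {} if context is None else context
    return [context.setdefault(label, "N" + uuid.uuid4().hex) for label in labels]


labels = ["internal-bnode-id-1", "internal-bnode-id-1", "internal-bnode-id-2"]
first_load = load_labels(labels)
second_load = load_labels(labels)

# Within one load the same label always maps to the same id,
# but the original label is gone and a second load yields different ids.
print(first_load)
print(second_load)
```

Within a single load the graph structure survives, which is why the two graphs stay isomorphic, but nothing ties the new ids back to the labels in the source file.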

Some pseudocode usage:

from rdflib import Graph
from rdflib.compare import isomorphic

skolem_graph = Graph().parse("data.nt", format="ntriples", skolemize=True)
graph = Graph().parse("data.nt", format="ntriples")

assert isomorphic(skolem_graph.de_skolemize(), graph)

# I can use skolem_graph across systems with the blank node identifiers preserved from the original data.nt file.
skolem_graph.serialize(format="ntriples")
...


WhiteGobo · 2024-03-19T10:24:39Z

I'll look into this, but it seems to me that we would have to work on both the store and the parser for that.

I haven't tried this, and I'm sure there are some problems with it, but have you tried other means to skolemize your graph? For example, creating a skolemized version of your graph by hand and reusing the resulting bnode_context?

Something like this:

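from typing import MutableMapping

from rdflib import BNode, Graph
from rdflib.compare import isomorphic

# Filled by the parser: blank node label in data.nt -> BNode created for it.
bnode_context_A: MutableMapping[str, BNode] = {}
in_graph = Graph().parse("data.nt", format="ntriples", bnode_context=bnode_context_A)
# Parsed BNode -> its skolem IRI.
bnode_context_B = {}
skolem_graph = Graph()
for ax in in_graph:
    for x in ax:
        # Only blank nodes are skolemized; IRIs and literals pass through.
        if isinstance(x, BNode) and x not in bnode_context_B:
            bnode_context_B[x] = x.skolemize()
    skolem_graph.add(tuple(bnode_context_B.get(x, x) for x in ax))
# Original label -> skolem IRI.
bnode_context = {k: bnode_context_B[v] for k, v in bnode_context_A.items()}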

graph = Graph().parse("data.nt", format="ntriples")

assert isomorphic(in_graph, graph)

I haven't looked into how to get this part to work:

# I can use skolem_graph across systems with the blank node identifiers preserved from the original data.nt file.
skolem_graph.serialize(format="ntriples")

But you should be able to load now with persistent skolemization:

# This should be the same graph as skolem_graph:
new_graph = Graph().parse("data.nt", format="ntriples", bnode_context=bnode_context)

edmondchuc · 2024-03-20T14:09:24Z

Perhaps this runnable example will explain it more clearly.

from rdflib import Graph
from rdflib.compare import isomorphic

data = """
    <urn:object> <urn:hasPart> _:internal-bnode-id-1 .
    _:internal-bnode-id-1 <urn:value> "..." .
"""

skolem_graph = Graph().parse(data=data, format="ntriples").skolemize()
graph = Graph().parse(data=data, format="ntriples")

assert isomorphic(skolem_graph.de_skolemize(), graph)

# The output should contain the skolem IRI
# <https://rdflib.github.io/.well-known/genid/rdflib/internal-bnode-id-1>
# but instead, we get something like:
#
#     <https://rdflib.github.io/.well-known/genid/rdflib/N19d54f84f7e84ba8a270ddb627e92cdb> <urn:value> "..." .
#     <urn:object> <urn:hasPart> <https://rdflib.github.io/.well-known/genid/rdflib/N19d54f84f7e84ba8a270ddb627e92cdb> .
#
# where N19d54f84f7e84ba8a270ddb627e92cdb is the remapped blank node id by RDFLib.
skolem_graph.print(format="ntriples")

If we are able to skolemize blank nodes at parse time, we should expect an output like this:

<urn:object> <urn:hasPart> <https://rdflib.github.io/.well-known/genid/rdflib/internal-bnode-id-1> .
<https://rdflib.github.io/.well-known/genid/rdflib/internal-bnode-id-1> <urn:value> "..." .

Essentially, without a change to the logic at parse time, it's impossible to skolemize blank nodes and preserve the identifiers in the original data.
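One workaround that leaves the parser untouched is to rewrite the blank node labels in the N-Triples text before handing it to `Graph.parse`. The following is a minimal sketch, not part of RDFLib: `skolemize_ntriples` is a hypothetical helper, it assumes ASCII-only blank node labels, and it assumes the skolem base IRI shown in the expected output above.

```python
import re

# Assumption: the same base as in the expected output above.
SKOLEM_BASE = "https://rdflib.github.io/.well-known/genid/rdflib/"

# Literals, IRIs and comments are matched first so that a "_:" inside them
# is never mistaken for a blank node label.
_TOKEN = re.compile(
    r'"(?:[^"\\\n]|\\.)*"'  # string literal
    r"|<[^>\n]*>"  # IRI
    r"|#[^\n]*"  # comment
    r"|_:([A-Za-z0-9_](?:[A-Za-z0-9_.-]*[A-Za-z0-9_-])?)"  # blank node label
)


def skolemize_ntriples(text, base=SKOLEM_BASE):
    """Replace every blank node label with a skolem IRI built from that label."""

    def swap(match):
        label = match.group(1)
        if label is None:
            # Literal, IRI or comment: keep as-is.
            return match.group(0)
        return f"<{base}{label}>"

    return _TOKEN.sub(swap, text)


data = """
<urn:object> <urn:hasPart> _:internal-bnode-id-1 .
_:internal-bnode-id-1 <urn:value> "..." .
"""

skolemized = skolemize_ntriples(data)
print(skolemized)
```

The rewritten text contains no blank nodes, so it can be parsed as ordinary N-Triples and the IRIs carry the original labels. The regex covers only a simplified label grammar, so treat it as a starting point rather than a full N-Triples tokenizer.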

WhiteGobo · 2024-03-25T20:51:43Z

Would it be enough to use an identity mapping for bnode_context?

from typing import MutableMapping

from rdflib import Graph, BNode
from rdflib.compare import isomorphic

data = """
    <urn:object> <urn:hasPart> _:internal-bnode-id-1 .
    _:internal-bnode-id-1 <urn:value> "..." .
"""


class IdMap(MutableMapping[str, BNode]):
    """Maps each blank node label to a BNode that keeps that label."""

    def __init__(self, dct=None):
        self.dct = {} if dct is None else dct

    def __getitem__(self, key: str) -> BNode:
        # Unknown labels get a BNode with the same identifier instead of a KeyError.
        return self.dct.setdefault(key, BNode(key))

    def __setitem__(self, key: str, value: BNode):
        self.dct[key] = value

    def __delitem__(self, key: str):
        return self.dct.__delitem__(key)

    def __iter__(self):
        return iter(self.dct)

    def __len__(self) -> int:
        return len(self.dct)


skolem_graph = Graph().parse(data=data, format="ntriples", bnode_context=IdMap())
for x in skolem_graph:
    print(x)

I'm not sure how to make a transparent implementation of skolemization during parsing. I would rather invest time in the documentation of skolemization in rdflib and have a recipe for this somewhere.
