Optional skolemize blank nodes on parse #2736

Open
edmondchuc opened this issue Mar 18, 2024 · 3 comments

Comments

@edmondchuc
Contributor

I have a use case where I need to preserve the blank node identifiers when loading data into a Graph object. To do this, I'd like rdflib.Graph.parse to support either a custom format (like ntriples-skolem) or a flag (skolemize=True) that skolemizes blank nodes before the statements are added to the graph.

This is needed because RDF blank nodes are scoped to the local document. As soon as they are read into a new system (like an RDFLib graph object), the blank node identifiers are remapped to newly generated ones, so there's no guarantee that the original identifiers are preserved.

Some pseudocode usage:

from rdflib import Graph
from rdflib.compare import isomorphic

skolem_graph = Graph().parse("data.nt", format="ntriples", skolemize=True)
graph = Graph().parse("data.nt", format="ntriples")

assert isomorphic(skolem_graph.de_skolemize(), graph)

# I can use skolem_graph across systems with the blank node identifiers preserved from the original data.nt file.
skolem_graph.serialize(format="ntriples")
...
@WhiteGobo
Contributor

I'll look into this, but it seems to me that we would have to work on both the store and the parser for that.

I haven't tried this, and I'm sure there are some problems with it, but:
Have you tried other means of skolemizing your graph? For example, creating a skolemized version of your graph by hand and reusing the resulting bnode_context?

Something like this:

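from typing import MutableMapping

from rdflib import BNode, Graph
from rdflib.compare import isomorphic

# Filled by the parser: maps each blank node label in the file to the BNode created for it.
bnode_context_A: MutableMapping[str, BNode] = {}
in_graph = Graph().parse("data.nt", format="ntriples", bnode_context=bnode_context_A)

# Maps each BNode in the parsed graph to its skolem IRI.
bnode_context_B = {}
skolem_graph = Graph()
for ax in in_graph:
    for x in ax:
        if isinstance(x, BNode) and x not in bnode_context_B:
            bnode_context_B[x] = x.skolemize()
    # Graph.add expects a triple, so build a tuple rather than a generator.
    skolem_graph.add(tuple(bnode_context_B.get(x, x) for x in ax))
# Maps each original label to the skolem IRI of its blank node.
bnode_context = {k: bnode_context_B[v] for k, v in bnode_context_A.items()}

graph = Graph().parse("data.nt", format="ntriples")

assert isomorphic(in_graph, graph)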

I haven't looked into how to get this to work:

# I can use skolem_graph across systems with the blank node identifiers preserved from the original data.nt file.
skolem_graph.serialize(format="ntriples")

But you should be able to load now with persistent skolemization:

# This should be the same graph as skolem_graph:
new_graph = Graph().parse("data.nt", format="ntriples", bnode_context=bnode_context)

@edmondchuc
Contributor Author

Perhaps this runnable example will explain it more clearly.

from rdflib import Graph
from rdflib.compare import isomorphic

data = """
    <urn:object> <urn:hasPart> _:internal-bnode-id-1 .
    _:internal-bnode-id-1 <urn:value> "..." .
"""

skolem_graph = Graph().parse(data=data, format="ntriples").skolemize()
graph = Graph().parse(data=data, format="ntriples")

assert isomorphic(skolem_graph.de_skolemize(), graph)

# The output should contain the skolem IRI
# <https://rdflib.github.io/.well-known/genid/rdflib/internal-bnode-id-1>
# but instead, we get something like:
#
#     <https://rdflib.github.io/.well-known/genid/rdflib/N19d54f84f7e84ba8a270ddb627e92cdb> <urn:value> "..." .
#     <urn:object> <urn:hasPart> <https://rdflib.github.io/.well-known/genid/rdflib/N19d54f84f7e84ba8a270ddb627e92cdb> .
#
# where N19d54f84f7e84ba8a270ddb627e92cdb is the new blank node id assigned by RDFLib.
skolem_graph.print(format="ntriples")

If we are able to skolemize blank nodes at parse time, we should expect an output like this:

<urn:object> <urn:hasPart> <https://rdflib.github.io/.well-known/genid/rdflib/internal-bnode-id-1> .
<https://rdflib.github.io/.well-known/genid/rdflib/internal-bnode-id-1> <urn:value> "..." .

Essentially, without a change to the logic at parse time, it's impossible to skolemize blank nodes and preserve the identifiers in the original data.
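
Short of changing the parser, the closest workaround is probably to rewrite blank node labels into skolem IRIs in the N-Triples text itself, before RDFLib ever sees it. The following is a minimal sketch under stated assumptions, not an RDFLib feature: `skolemize_ntriples` and `GENID_BASE` are hypothetical names, the label pattern only covers ASCII labels (the N-Triples grammar allows more), and the tokenizing regex is a simplification rather than a full N-Triples lexer.

```python
import re

# Assumed skolem base, copied from the expected output above.
GENID_BASE = "https://rdflib.github.io/.well-known/genid/rdflib/"

# Match literals and IRIs first so that "_:" inside them is left alone;
# only a bare blank node label is captured in group 1.
_TOKEN = re.compile(
    r'"(?:[^"\\]|\\.)*"'   # quoted string literal
    r"|<[^>]*>"            # IRI
    r"|_:([A-Za-z0-9_](?:[A-Za-z0-9_.-]*[A-Za-z0-9_-])?)"  # ASCII-only label
)


def skolemize_ntriples(text: str, base: str = GENID_BASE) -> str:
    """Replace each blank node label in N-Triples text with a skolem IRI."""

    def repl(match: "re.Match[str]") -> str:
        label = match.group(1)
        if label is None:
            # A literal or an IRI: return it unchanged.
            return match.group(0)
        return f"<{base}{label}>"

    return _TOKEN.sub(repl, text)


data = (
    "<urn:object> <urn:hasPart> _:internal-bnode-id-1 .\n"
    '_:internal-bnode-id-1 <urn:value> "..." .\n'
)
print(skolemize_ntriples(data), end="")
```

The rewritten text contains no blank nodes, so it can be passed to Graph().parse(data=..., format="ntriples") like any other N-Triples document. The cost is that the file is processed twice, and the approach only suits line-based formats where blank node labels are easy to tokenize.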

@WhiteGobo
Contributor

WhiteGobo commented Mar 25, 2024

Would it be enough to use an identity mapping for bnode_context?

from typing import MutableMapping

from rdflib import Graph, BNode
from rdflib.compare import isomorphic

data = """
    <urn:object> <urn:hasPart> _:internal-bnode-id-1 .
    _:internal-bnode-id-1 <urn:value> "..." .
"""

class IdMap(MutableMapping[str, BNode]):
    def __init__(self, dct=None):
        self.dct = {} if dct is None else dct

    def __getitem__(self, key: str) -> BNode:
        return self.dct.setdefault(key, BNode(key))

    def __setitem__(self, key: str, value: BNode):
        self.dct[key] = value

    def __delitem__(self, key: str):
        return self.dct.__delitem__(key)

    def __iter__(self):
        return iter(self.dct)

    def __len__(self) -> int:
        return len(self.dct)


skolem_graph = Graph().parse(data=data, format="ntriples", bnode_context=IdMap())
for x in skolem_graph:
    print(x)

I'm not sure how to make a transparent implementation of skolemization during parsing. I would rather invest time in documenting skolemization in rdflib and have a recipe for this somewhere.
