Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

blank nodes reuse memory addresses, causing problems for persistent stores #418

Open
doriantaylor opened this issue Jul 11, 2020 · 3 comments

Comments

@doriantaylor
Copy link
Contributor

Here's another good one: I created rdf-lmdb a while ago but only recently noticed, incidentally when I had a structure containing a lot of RDF::Lists, that blank node IDs got derived from memory addresses that would reliably repeat across runs on Linux (although only occasionally on macOS which is why it took me so long to notice). The net effect is that I would get bnode structures in persistent storage that were all snarled up together.

Now, I should probably go and make some kind of bnode mapping table in rdf-lmdb, but I'm also wondering if it would make sense to generalize such a mapping table, since it's a common thing. That, or at least have some way to override RDF::Node's ID generating function in the singleton class (which is what I ended up doing to solve my short-term problem).

Anyway, I think what I am proposing is a module that one can include into, for example, a persistent RDF::Repository, that will automatically map whatever bnodes get thrown at it to guaranteed-unique (does that count as skolemized?) identifiers that are passed along to the persistent store, and translated back on the way back out. This mapping would be kept around for the duration of the process. As for the guaranteed-unique identifiers, the answer is of course UUIDs (which indeed the current RDF::Node implementation is already capable of), though because they are shorter, I propose the using the uuid-ncname representation I designed some time ago for just this purpose. See my monkey-patched implementation for what that looks like:

[8] pry(main)> n = RDF::Node.new
=> #<RDF::Node:0x3e8(_:Enc7Jm1BRnK9y9jKSMCzBJ)>
[9] pry(main)> n.id
=> "Enc7Jm1BRnK9y9jKSMCzBJ"
[10] pry(main)> UUID::NCName.from_ncname n.id, version: 1
=> "9dcec99b-5051-49ca-9f72-f63292302cc1"
@gkellogg
Copy link
Member

Here's another good one: I created rdf-lmdb a while ago but only recently noticed, incidentally when I had a structure containing a lot of RDF::Lists, that blank node IDs got derived from memory addresses that would reliably repeat across runs on Linux (although only occasionally on macOS which is why it took me so long to notice).

Yes, indeed, the offending code is here:

@id = (id || "g#{__id__.to_i.abs}").to_s.freeze

The net effect is that I would get bnode structures in persistent storage that were all snarled up together.

If you're simply serializing with the default IDs, then, yes, that could be a problem.

Now, I should probably go and make some kind of bnode mapping table in rdf-lmdb, but I'm also wondering if it would make sense to generalize such a mapping table, since it's a common thing. That, or at least have some way to override RDF::Node's ID generating function in the singleton class (which is what I ended up doing to solve my short-term problem).

It's common for parsers to use such a pattern to distinguish the blank nodes read from those created, even though they may use the same identifiers, they become different objects. For example:

https://github.com/ruby-rdf/rdf-turtle/blob/develop/lib/rdf/turtle/reader.rb#L229-L233

def bnode(value = nil)
  return RDF::Node.new unless value
  @bnode_cache ||= {}
  @bnode_cache[value.to_s] ||= RDF::Node.new(value)
end

Anyway, I think what I am proposing is a module that one can include into, for example, a persistent RDF::Repository, that will automatically map whatever bnodes get thrown at it to guaranteed-unique (does that count as skolemized?)

No, not really, skolemized BNodes are intended to be used in serializations, which are IRIs, that can be turned back into the original BNodes. It makes sense for persistent stores, where there is some row identifier that can be retrieved. They are also not generic, but specific to a given server instance. I don't think they're used too much, and screw up things like lists, which depend on their being BNodes and not URIs.

identifiers that are passed along to the persistent store, and translated back on the way back out.

... but yes, sort of, in the way you describe using them.

This mapping would be kept around for the duration of the process. As for the guaranteed-unique identifiers, the answer is of course UUIDs (which indeed the current RDF::Node implementation is already capable of), though because they are shorter, I propose the using the uuid-ncname representation I designed some time ago for just this purpose. See my monkey-patched implementation for what that looks like:

[8] pry(main)> n = RDF::Node.new
=> #<RDF::Node:0x3e8(_:Enc7Jm1BRnK9y9jKSMCzBJ)>
[9] pry(main)> n.id
=> "Enc7Jm1BRnK9y9jKSMCzBJ"
[10] pry(main)> UUID::NCName.from_ncname n.id, version: 1
=> "9dcec99b-5051-49ca-9f72-f63292302cc1"

Maybe it makes sense to just change the default naming scheme to do something like this, and get away from object identifiers altogether. Would that obviate the need for a separate module?

Otherwise, then sure, a module with defined behavior that could be included in a Repository/DataStore implementation might be useful.

@doriantaylor
Copy link
Contributor Author

I thought about that, but the user can always define the bnode identifier to be whatever they want. If that gets passed through as-is to a persistent store then that can cause problems (e.g. _:genid123).

I'm just thinking it's a common pattern just like you highlighted from the Turtle reader; it would merit an include-able utility module that just did the translation to something guaranteed-unique if the node ID wasn't already detectably that way.

@gkellogg
Copy link
Member

Please go ahead and do a PR for this, then. @no-reply may want to chime in on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants