support named graphs in RDF-Serialization and TriG format #329

bastiion · 2017-10-01T13:52:06Z

I stumbled upon the problem, not being able to parse TriG Files from an ontology source. Today
quads are commonly used to store extra information about a specific ontology graph. Even so, quad support would be nice for future banana-rdf versions, this branch adds the possibility to read a certain named graph or all named graphs without loosing its resource identifier.
The name of a named graph can either be an URI or a blank node.
The current approach uses a Map[Option[Rdf#Node], Rdf#Graph] where the key None represents the default graph.

For Jena and Sesame implementation all tests pass, feeding them with some examples listed on w3c-trig.

- added resolver to sbt

- without the need of real quads this adds the possibility to gather inividual named graphs from resources - resources could be rdf serializations, which support quads

see https://www.w3.org/TR/trig/

- trig format as first supported format

;-) for the sake of completenes

This reverts commit 5f500d8.

- to satisfy scala 2.12 build

bblfish

The aim of this patch is a good one, and it would be useful to integrate Quads. I note the following:

• the Quad type in RDF trait does not seem to be used apart from in a private method (I would still be ok with it, for future work)
• why not use the GraphStore trait ?

One thing I think is clearly wrong:
• the deserialiser has too many methods

Also there are a few odd changes in the build.sbt.

But see the detailed comments I left in the code. I may have forgotten something along the way.

bblfish · 2017-10-03T09:49:08Z

rdf/shared/src/main/scala/org/w3/banana/RDF.scala

@@ -4,6 +4,7 @@ trait RDF {
  // types related to the RDF datamodel
  type Graph
  type Triple
+  type Quad


I think it is a good idea to define the notion of Quad since it is used in Jena and Sesame.

I notice that you don't really use it elsewhere than in private methods.
Usually, when one adds a type in the RDF trait it has to come with methods to be able to deal with it RDFOps and one has to add a nice QuadW syntax trait to make it nice to use those methods eg - egq.subject, q.relation, q.object, q.where instead of getSubect(q).

I added the Quad definition to show how it would be defined in the concrete implementations. I know it doesn't belong to the patch, as it adds no value and would confuse others.

Originally I was thinking about adding Quads on this level, but I don't know exactly how to implement it, with preferable minimal changes in banana rdf.... Maybe I will put some afford in it later.

yes, I understand very well. Quads make a lot of sense if what the parser returns is a stream of quads. If what it returns is a DataSet, then you don't necessarily need them.

So looking at Jena I see they use a PipedRDFIterator[Quad](). And Sesame actually has a method like parser.setRDFHandler(collector) which is also something that processes the Quads one at a time.

So I suggest instead of trying to extend RDFReader, you could try to create a new io trait RDFStream[X] that would parse a document and return an Input or Output Stream of Xs (either Triples or Quads).

That is what both Jena and Sesame do.

bblfish · 2017-10-03T09:59:46Z

build.sbt

-              if (isSnapshot.value) "snapshots" else "releases"
-            }
-          )
+          "Artifactory Realm" at "http://wgserver2:8081/artifactory/nubivis;build.timestamp=" + new java.util.Date().getTime


I use the deleted code to publish to the web server bblfish.net.
Here you are making the code less general, and specific to your case.
What is the problem you were trying to solve here?

bblfish · 2017-10-03T10:05:39Z

build.sbt

-  resolvers += "apache-repo-releases" at "http://repository.apache.org/content/repositories/releases/",
+  resolvers ++= Seq(
+    "apache-repo-releases" at "http://repository.apache.org/content/repositories/releases/",
+    Resolver.mavenLocal


just out of curiosity what does this do that the previous code does not? (I hope it does what I'd like it to)

bblfish · 2017-10-03T10:06:47Z

project/Dependencies.scala

@@ -71,7 +71,7 @@ object Dependencies {
   * @see http://www.openrdf.org/
   * @see http://repo1.maven.org/maven2/org/openrdf/sesame/
   */
-  val sesameVersion = "2.9.0"
+  val sesameVersion = "2.8.9-SNAPSHOT"


Btw. perhaps you can review Pull request #327 ? Would you be able to build on that?

bblfish · 2017-10-03T10:08:28Z

build.sbt

      )
    case other => Seq()
  }
 }

 lazy val commonSettings = publicationSettings ++ defaultScalariformSettings ++ Seq(
  organization := "org.w3",
-  scalaVersion := "2.12.1",
+  scalaVersion := "2.11.8",


why are you downgrading here?

bblfish · 2017-10-03T10:30:36Z

rdf/shared/src/main/scala/org/w3/banana/io/RDFQuadReader.scala

+
+trait RDFQuadReader[Rdf <: RDF, M[_], +S] extends RDFReader[Rdf, M, S] {
+
+  def readAll(is: InputStream, base: String): M[Map[Option[Rdf#Node], Rdf#Graph]]


This raises the question as to whether one should have a Dataset type in the RDF trait, in which case
one could return an M[Dataset]. But that would require finding the methods that all datasets have in common and implementing those in RDFOps. So before doing that it is perhaps worth thinking about this carefully.

One way to think of this is to ask: how different are the Jena and Sesame implementations, and how strong is the definition of a Dataset.

If you hardwire things like that then this would raise the question as to why not just return a GraphStore, since we already have that abstraction.

I am just asking here with an open mind on this.

Exactly this is the reason why I have chosen the "simple", "pragmatic" way. To get a clean support for "named graphs" there are a lot of options to weight before deciding and I'm relatively new to banana-rdf and scala in general, so I might not be the perfect person to reason.

Forget about whether you need an Map[Option[Rdf#Node], Rdf#Graph], an RDFStore, a GraphStore or whatever. Let the user decide on that, and just return a stream. That is the lowest level of thing that you can correctly produce. All these more advanced data structures have their use and none of them are further than a stream.fold(...) away.

Btw. I just noticed this in com.github.jsonldjava.core.RDFDataset

** * Starting to migrate away from using plain java Maps as the internal RDF * dataset store. Currently each item just wraps a Map based on the old format * so everything doesn't break. Will phase this out once everything is using the * new format. * * @author Tristan * */

Which makes my point that there are many ways of doing Datasets. I'd like to know what he is moving to, but have not yet researched that.

When I say use a Stream, I should say a well known Seq from the Scala library like Seq, Iterator, ... Iterator would be good as its use is compatible with using very little memory (an Iterator[Quad] could have just the last quad in memory). List, Set, ... would be wrong as it means the whole sequence needs to be in memory, and that would lose the whole point of having such an abstraction.

This should actually be so simple that if we find a better abstraction - from scalaz or cats for example - it should be very easy to upgrade to those.

bblfish · 2017-10-03T10:49:07Z

rdf/shared/src/main/scala/org/w3/banana/io/RDFQuadReader.scala

+
+  def readDefaultGraph(is: InputStream, base: String): M[Rdf#Graph]
+
+  def readDefaultGraph(in: Reader, base: String): M[Rdf#Graph]


All of these methods seem to me to be ones that should be on a GraphStore or DataSet, not on a deserializer.

Or is there a danger that a DataSet is too big, and filtering should occur during parsing? From the implementations, this does not look to be the case, as the filtering happens after the parsing.

When I first implemented the interface, I was thinking of bigger graphs, where filtering during parse time could make sense. Unfortunately I couldn't find a solution in Jena and Sesame to do so. So I agree, these methods aren't necessary, except they could reduce some temporary memory overhead.

yes, I don't think those frameworks are very good for parsing huge graphs for that reason. The right way I think in the end to do this would be to have asynchronous parsers- that need to be written -- to produce (akka) streams of sets of quads. Then one could easily add a filter to the Stream as they are designed for that.

But in the meantime adding support for existing parsers provided by Jena and Sesame would be useful - to start off they are blocking parsers. But since they don't do it right, and to do it right one needs a completely different framework, I'd just remove those methods. Perhaps if they start offering filtering at some point then those methods could be added, but at present, this does not add anything and it makes one think one is getting some value where one is not.

well actually if you see that they do use streams, then they are not that bad. They are just not as good as if they were non-blocking.

bblfish · 2017-10-03T10:49:43Z

rdf/shared/src/main/scala/org/w3/banana/io/RDFQuadReader.scala

+
+import org.w3.banana.RDF
+
+trait RDFQuadReader[Rdf <: RDF, M[_], +S] extends RDFReader[Rdf, M, S] {


I am dubious about the need to extend RDFReader here for reasons given below.

bblfish · 2017-10-03T11:10:23Z

sesame/src/main/scala/org/w3/banana/sesame/Sesame.scala

@@ -11,6 +11,7 @@ trait Sesame extends RDF {
  // types related to the RDF datamodel
  type Graph = Model
  type Triple = Statement
+  type Quad = Statement


ok, one good thing about having Quad is that it makes clear that Sesame's Statement is actually a Quad.

bblfish · 2017-10-03T11:10:55Z

sesame/src/main/scala/org/w3/banana/sesame/io/SesameRDFReader.scala

+  }
+
+
+}


see comments on super trait.

bastiion · 2017-10-03T14:23:31Z

I apologise for the untidy patch, that contains some changes necessary for my own projects. I tried to clean up the majority, but branching from the beginning on would have been the better choice.
I will carefully think about the options you mentioned.
Implementations should take into account that graph names can be blank nodes, which make sense only in the context of the RDF resource.

bblfish · 2017-10-03T15:32:27Z

No problem, the reason there are code reviews is exactly to catch missing things.

So I think the way to go here is to keep Quad, and to not try to subclass RDFReader or RDFWriter, but instead to create a very simple class that returns an Iterator/Seq of Triple or Quads. This will then allow people to filter the stream by using the well-known tools to filter Iterators/Seq and that come by default with the Scala libraries. (It would still be worth deciding which Scala Stream variant would be good to use - without buying into the full akka stack at the moment).

Then it should be something nearly as easy as a fold on that stream to create whatever structure the user most likes. He can add filtering before if he wants to ignore quads, or he can just send them onto a database, without ever creating a Dataset.

If you do that it would be a big +1 from me.

Btw. you could keep your Dataset structure Map[Option[Node],Graph] and most of your tests as examples of how to transform a Iteratee/Seq into such a structure.

bblfish · 2020-09-04T08:23:28Z

I need to look at this as Quad store support is important.

bastiion · 2020-09-06T14:25:20Z

Unfortunately it's long ago that I've tried to contribute to this project, nevertheless I'd like to support it as good as possible. If there is anything I can do to improve this PR, tell me

bastiion added 9 commits October 1, 2017 14:47

small fix in order to treat an rdf-langString like an xsd.string

81e6e9f

update sesame library to snapshot version

5f500d8

- added resolver to sbt

RDFQuadReader for named graph resources

be73c81

- without the need of real quads this adds the possibility to gather inividual named graphs from resources - resources could be rdf serializations, which support quads

tests for TriG rdf serialization

8dd1052

see https://www.w3.org/TR/trig/

AbstractQuadSesameReader for quad rdf serialization in sesame

5964d90

- trig format as first supported format

JenaRDFQuadReader for quad rdf serialization in jena

a599fe0

- trig format as first supported format

sample Quad representation in Plantain

d465576

;-) for the sake of completenes

Revert "update sesame library to snapshot version"

6aa5873

This reverts commit 5f500d8.

changed JavaConversions to JavaConverters in reader implementation

c67430d

- to satisfy scala 2.12 build

bblfish requested changes Oct 3, 2017

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

support named graphs in RDF-Serialization and TriG format #329

support named graphs in RDF-Serialization and TriG format #329

bastiion commented Oct 1, 2017

bblfish left a comment

bblfish Oct 3, 2017

bastiion Oct 3, 2017

bblfish Oct 3, 2017

bblfish Oct 3, 2017

bblfish Oct 3, 2017

bblfish Oct 3, 2017

bblfish Oct 3, 2017

bblfish Oct 3, 2017

bastiion Oct 3, 2017

bblfish Oct 3, 2017

bblfish Oct 4, 2017

bblfish Oct 3, 2017

bastiion Oct 3, 2017

bblfish Oct 3, 2017 •

edited

bblfish Oct 3, 2017

bblfish Oct 3, 2017

bblfish Oct 3, 2017

bblfish Oct 3, 2017

bastiion commented Oct 3, 2017

bblfish commented Oct 3, 2017 •

edited

bblfish commented Sep 4, 2020

bastiion commented Sep 6, 2020


		trait RDFQuadReader[Rdf <: RDF, M[_], +S] extends RDFReader[Rdf, M, S] {

		def readAll(is: InputStream, base: String): M[Map[Option[Rdf#Node], Rdf#Graph]]


		def readDefaultGraph(is: InputStream, base: String): M[Rdf#Graph]

		def readDefaultGraph(in: Reader, base: String): M[Rdf#Graph]


		import org.w3.banana.RDF

		trait RDFQuadReader[Rdf <: RDF, M[_], +S] extends RDFReader[Rdf, M, S] {

support named graphs in RDF-Serialization and TriG format #329

Are you sure you want to change the base?

support named graphs in RDF-Serialization and TriG format #329

Conversation

bastiion commented Oct 1, 2017

bblfish left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bblfish Oct 3, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bastiion commented Oct 3, 2017

bblfish commented Oct 3, 2017 • edited

bblfish commented Sep 4, 2020

bastiion commented Sep 6, 2020

bblfish Oct 3, 2017 •

edited

bblfish commented Oct 3, 2017 •

edited