Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support named graphs in RDF-Serialization and TriG format #329

Open
wants to merge 9 commits into
base: series/0.8.x
Choose a base branch
from

Conversation

bastiion
Copy link

@bastiion bastiion commented Oct 1, 2017

I stumbled upon the problem, not being able to parse TriG Files from an ontology source. Today
quads are commonly used to store extra information about a specific ontology graph. Even so, quad support would be nice for future banana-rdf versions, this branch adds the possibility to read a certain named graph or all named graphs without loosing its resource identifier.
The name of a named graph can either be an URI or a blank node.
The current approach uses a Map[Option[Rdf#Node], Rdf#Graph] where the key None represents the default graph.

For Jena and Sesame implementation all tests pass, feeding them with some examples listed on w3c-trig.

 - without the need of real quads this
  adds the possibility to gather inividual
  named graphs from resources
 - resources could be rdf serializations, which
  support quads
 - trig format as first supported format
 - trig format as first supported format
;-) for the sake of completenes
Copy link
Member

@bblfish bblfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The aim of this patch is a good one, and it would be useful to integrate Quads. I note the following:

• the Quad type in RDF trait does not seem to be used apart from in a private method (I would still be ok with it, for future work)
• why not use the GraphStore trait ?

One thing I think is clearly wrong:
• the deserialiser has too many methods

Also there are a few odd changes in the build.sbt.

But see the detailed comments I left in the code. I may have forgotten something along the way.

@@ -4,6 +4,7 @@ trait RDF {
// types related to the RDF datamodel
type Graph
type Triple
type Quad
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a good idea to define the notion of Quad since it is used in Jena and Sesame.

I notice that you don't really use it elsewhere than in private methods.
Usually, when one adds a type in the RDF trait it has to come with methods to be able to deal with it RDFOps and one has to add a nice QuadW syntax trait to make it nice to use those methods eg - egq.subject, q.relation, q.object, q.where instead of getSubect(q).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the Quad definition to show how it would be defined in the concrete implementations. I know it doesn't belong to the patch, as it adds no value and would confuse others.

Originally I was thinking about adding Quads on this level, but I don't know exactly how to implement it, with preferable minimal changes in banana rdf.... Maybe I will put some afford in it later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I understand very well. Quads make a lot of sense if what the parser returns is a stream of quads. If what it returns is a DataSet, then you don't necessarily need them.

So looking at Jena I see they use a PipedRDFIterator[Quad](). And Sesame actually has a method like parser.setRDFHandler(collector) which is also something that processes the Quads one at a time.

So I suggest instead of trying to extend RDFReader, you could try to create a new io trait RDFStream[X] that would parse a document and return an Input or Output Stream of Xs (either Triples or Quads).

That is what both Jena and Sesame do.

build.sbt Outdated
if (isSnapshot.value) "snapshots" else "releases"
}
)
"Artifactory Realm" at "http://wgserver2:8081/artifactory/nubivis;build.timestamp=" + new java.util.Date().getTime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use the deleted code to publish to the web server bblfish.net.
Here you are making the code less general, and specific to your case.
What is the problem you were trying to solve here?

build.sbt Outdated
resolvers += "apache-repo-releases" at "http://repository.apache.org/content/repositories/releases/",
resolvers ++= Seq(
"apache-repo-releases" at "http://repository.apache.org/content/repositories/releases/",
Resolver.mavenLocal
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just out of curiosity what does this do that the previous code does not? (I hope it does what I'd like it to)

@@ -71,7 +71,7 @@ object Dependencies {
* @see http://www.openrdf.org/
* @see http://repo1.maven.org/maven2/org/openrdf/sesame/
*/
val sesameVersion = "2.9.0"
val sesameVersion = "2.8.9-SNAPSHOT"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw. perhaps you can review Pull request #327 ? Would you be able to build on that?

build.sbt Outdated
)
case other => Seq()
}
}

lazy val commonSettings = publicationSettings ++ defaultScalariformSettings ++ Seq(
organization := "org.w3",
scalaVersion := "2.12.1",
scalaVersion := "2.11.8",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you downgrading here?


trait RDFQuadReader[Rdf <: RDF, M[_], +S] extends RDFReader[Rdf, M, S] {

def readAll(is: InputStream, base: String): M[Map[Option[Rdf#Node], Rdf#Graph]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This raises the question as to whether one should have a Dataset type in the RDF trait, in which case
one could return an M[Dataset]. But that would require finding the methods that all datasets have in common and implementing those in RDFOps. So before doing that it is perhaps worth thinking about this carefully.

One way to think of this is to ask: how different are the Jena and Sesame implementations, and how strong is the definition of a Dataset.

If you hardwire things like that then this would raise the question as to why not just return a GraphStore, since we already have that abstraction.

I am just asking here with an open mind on this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly this is the reason why I have chosen the "simple", "pragmatic" way. To get a clean support for "named graphs" there are a lot of options to weight before deciding and I'm relatively new to banana-rdf and scala in general, so I might not be the perfect person to reason.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forget about whether you need an Map[Option[Rdf#Node], Rdf#Graph], an RDFStore, a GraphStore or whatever. Let the user decide on that, and just return a stream. That is the lowest level of thing that you can correctly produce. All these more advanced data structures have their use and none of them are further than a stream.fold(...) away.

Btw. I just noticed this in com.github.jsonldjava.core.RDFDataset

**
 * Starting to migrate away from using plain java Maps as the internal RDF
 * dataset store. Currently each item just wraps a Map based on the old format
 * so everything doesn't break. Will phase this out once everything is using the
 * new format.
 *
 * @author Tristan
 *
 */

Which makes my point that there are many ways of doing Datasets. I'd like to know what he is moving to, but have not yet researched that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I say use a Stream, I should say a well known Seq from the Scala library like Seq, Iterator, ... Iterator would be good as its use is compatible with using very little memory (an Iterator[Quad] could have just the last quad in memory). List, Set, ... would be wrong as it means the whole sequence needs to be in memory, and that would lose the whole point of having such an abstraction.

This should actually be so simple that if we find a better abstraction - from scalaz or cats for example - it should be very easy to upgrade to those.


def readDefaultGraph(is: InputStream, base: String): M[Rdf#Graph]

def readDefaultGraph(in: Reader, base: String): M[Rdf#Graph]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these methods seem to me to be ones that should be on a GraphStore or DataSet, not on a deserializer.

Or is there a danger that a DataSet is too big, and filtering should occur during parsing? From the implementations, this does not look to be the case, as the filtering happens after the parsing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I first implemented the interface, I was thinking of bigger graphs, where filtering during parse time could make sense. Unfortunately I couldn't find a solution in Jena and Sesame to do so. So I agree, these methods aren't necessary, except they could reduce some temporary memory overhead.

Copy link
Member

@bblfish bblfish Oct 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I don't think those frameworks are very good for parsing huge graphs for that reason. The right way I think in the end to do this would be to have asynchronous parsers- that need to be written -- to produce (akka) streams of sets of quads. Then one could easily add a filter to the Stream as they are designed for that.

But in the meantime adding support for existing parsers provided by Jena and Sesame would be useful - to start off they are blocking parsers. But since they don't do it right, and to do it right one needs a completely different framework, I'd just remove those methods. Perhaps if they start offering filtering at some point then those methods could be added, but at present, this does not add anything and it makes one think one is getting some value where one is not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well actually if you see that they do use streams, then they are not that bad. They are just not as good as if they were non-blocking.


import org.w3.banana.RDF

trait RDFQuadReader[Rdf <: RDF, M[_], +S] extends RDFReader[Rdf, M, S] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am dubious about the need to extend RDFReader here for reasons given below.

@@ -11,6 +11,7 @@ trait Sesame extends RDF {
// types related to the RDF datamodel
type Graph = Model
type Triple = Statement
type Quad = Statement
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, one good thing about having Quad is that it makes clear that Sesame's Statement is actually a Quad.

}


}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments on super trait.

@bastiion
Copy link
Author

bastiion commented Oct 3, 2017

I apologise for the untidy patch, that contains some changes necessary for my own projects. I tried to clean up the majority, but branching from the beginning on would have been the better choice.
I will carefully think about the options you mentioned.
Implementations should take into account that graph names can be blank nodes, which make sense only in the context of the RDF resource.

@bblfish
Copy link
Member

bblfish commented Oct 3, 2017

No problem, the reason there are code reviews is exactly to catch missing things.

So I think the way to go here is to keep Quad, and to not try to subclass RDFReader or RDFWriter, but instead to create a very simple class that returns an Iterator/Seq of Triple or Quads. This will then allow people to filter the stream by using the well-known tools to filter Iterators/Seq and that come by default with the Scala libraries. (It would still be worth deciding which Scala Stream variant would be good to use - without buying into the full akka stack at the moment).

Then it should be something nearly as easy as a fold on that stream to create whatever structure the user most likes. He can add filtering before if he wants to ignore quads, or he can just send them onto a database, without ever creating a Dataset.

If you do that it would be a big +1 from me.

Btw. you could keep your Dataset structure Map[Option[Node],Graph] and most of your tests as examples of how to transform a Iteratee/Seq into such a structure.

@bblfish
Copy link
Member

bblfish commented Sep 4, 2020

I need to look at this as Quad store support is important.

@bastiion
Copy link
Author

bastiion commented Sep 6, 2020

Unfortunately it's long ago that I've tried to contribute to this project, nevertheless I'd like to support it as good as possible. If there is anything I can do to improve this PR, tell me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants