-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support named graphs in RDF-Serialization and TriG format #329
base: series/0.8.x
Are you sure you want to change the base?
Conversation
- added resolver to sbt
- without the need of real quads this adds the possibility to gather inividual named graphs from resources - resources could be rdf serializations, which support quads
- trig format as first supported format
- trig format as first supported format
;-) for the sake of completenes
This reverts commit 5f500d8.
- to satisfy scala 2.12 build
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The aim of this patch is a good one, and it would be useful to integrate Quads. I note the following:
• the Quad
type in RDF
trait does not seem to be used apart from in a private method (I would still be ok with it, for future work)
• why not use the GraphStore trait ?
One thing I think is clearly wrong:
• the deserialiser has too many methods
Also there are a few odd changes in the build.sbt
.
But see the detailed comments I left in the code. I may have forgotten something along the way.
@@ -4,6 +4,7 @@ trait RDF { | |||
// types related to the RDF datamodel | |||
type Graph | |||
type Triple | |||
type Quad |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is a good idea to define the notion of Quad since it is used in Jena and Sesame.
I notice that you don't really use it elsewhere than in private methods.
Usually, when one adds a type in the RDF
trait it has to come with methods to be able to deal with it RDFOps
and one has to add a nice QuadW
syntax trait to make it nice to use those methods eg - egq.subject
, q.relation
, q.object
, q.where
instead of getSubect(q)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the Quad definition to show how it would be defined in the concrete implementations. I know it doesn't belong to the patch, as it adds no value and would confuse others.
Originally I was thinking about adding Quads on this level, but I don't know exactly how to implement it, with preferable minimal changes in banana rdf.... Maybe I will put some afford in it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I understand very well. Quads make a lot of sense if what the parser returns is a stream of quads. If what it returns is a DataSet, then you don't necessarily need them.
So looking at Jena I see they use a PipedRDFIterator[Quad]()
. And Sesame actually has a method like parser.setRDFHandler(collector)
which is also something that processes the Quads one at a time.
So I suggest instead of trying to extend RDFReader, you could try to create a new io trait RDFStream[X]
that would parse a document and return an Input or Output Stream of Xs (either Triples or Quads).
That is what both Jena and Sesame do.
build.sbt
Outdated
if (isSnapshot.value) "snapshots" else "releases" | ||
} | ||
) | ||
"Artifactory Realm" at "http://wgserver2:8081/artifactory/nubivis;build.timestamp=" + new java.util.Date().getTime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use the deleted code to publish to the web server bblfish.net.
Here you are making the code less general, and specific to your case.
What is the problem you were trying to solve here?
build.sbt
Outdated
resolvers += "apache-repo-releases" at "http://repository.apache.org/content/repositories/releases/", | ||
resolvers ++= Seq( | ||
"apache-repo-releases" at "http://repository.apache.org/content/repositories/releases/", | ||
Resolver.mavenLocal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just out of curiosity what does this do that the previous code does not? (I hope it does what I'd like it to)
project/Dependencies.scala
Outdated
@@ -71,7 +71,7 @@ object Dependencies { | |||
* @see http://www.openrdf.org/ | |||
* @see http://repo1.maven.org/maven2/org/openrdf/sesame/ | |||
*/ | |||
val sesameVersion = "2.9.0" | |||
val sesameVersion = "2.8.9-SNAPSHOT" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw. perhaps you can review Pull request #327 ? Would you be able to build on that?
build.sbt
Outdated
) | ||
case other => Seq() | ||
} | ||
} | ||
|
||
lazy val commonSettings = publicationSettings ++ defaultScalariformSettings ++ Seq( | ||
organization := "org.w3", | ||
scalaVersion := "2.12.1", | ||
scalaVersion := "2.11.8", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you downgrading here?
|
||
trait RDFQuadReader[Rdf <: RDF, M[_], +S] extends RDFReader[Rdf, M, S] { | ||
|
||
def readAll(is: InputStream, base: String): M[Map[Option[Rdf#Node], Rdf#Graph]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This raises the question as to whether one should have a Dataset
type in the RDF trait, in which case
one could return an M[Dataset]
. But that would require finding the methods that all datasets have in common and implementing those in RDFOps
. So before doing that it is perhaps worth thinking about this carefully.
One way to think of this is to ask: how different are the Jena and Sesame implementations, and how strong is the definition of a Dataset.
If you hardwire things like that then this would raise the question as to why not just return a GraphStore, since we already have that abstraction.
I am just asking here with an open mind on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly this is the reason why I have chosen the "simple", "pragmatic" way. To get a clean support for "named graphs" there are a lot of options to weight before deciding and I'm relatively new to banana-rdf and scala in general, so I might not be the perfect person to reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forget about whether you need an Map[Option[Rdf#Node], Rdf#Graph]
, an RDFStore
, a GraphStore
or whatever. Let the user decide on that, and just return a stream. That is the lowest level of thing that you can correctly produce. All these more advanced data structures have their use and none of them are further than a stream.fold(...)
away.
Btw. I just noticed this in com.github.jsonldjava.core.RDFDataset
**
* Starting to migrate away from using plain java Maps as the internal RDF
* dataset store. Currently each item just wraps a Map based on the old format
* so everything doesn't break. Will phase this out once everything is using the
* new format.
*
* @author Tristan
*
*/
Which makes my point that there are many ways of doing Datasets. I'd like to know what he is moving to, but have not yet researched that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I say use a Stream, I should say a well known Seq from the Scala library like Seq, Iterator, ... Iterator
would be good as its use is compatible with using very little memory (an Iterator[Quad]
could have just the last quad in memory). List
, Set
, ... would be wrong as it means the whole sequence needs to be in memory, and that would lose the whole point of having such an abstraction.
This should actually be so simple that if we find a better abstraction - from scalaz or cats for example - it should be very easy to upgrade to those.
|
||
def readDefaultGraph(is: InputStream, base: String): M[Rdf#Graph] | ||
|
||
def readDefaultGraph(in: Reader, base: String): M[Rdf#Graph] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of these methods seem to me to be ones that should be on a GraphStore
or DataSet
, not on a deserializer.
Or is there a danger that a DataSet is too big, and filtering should occur during parsing? From the implementations, this does not look to be the case, as the filtering happens after the parsing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I first implemented the interface, I was thinking of bigger graphs, where filtering during parse time could make sense. Unfortunately I couldn't find a solution in Jena and Sesame to do so. So I agree, these methods aren't necessary, except they could reduce some temporary memory overhead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I don't think those frameworks are very good for parsing huge graphs for that reason. The right way I think in the end to do this would be to have asynchronous parsers- that need to be written -- to produce (akka) streams of sets of quads. Then one could easily add a filter to the Stream as they are designed for that.
But in the meantime adding support for existing parsers provided by Jena and Sesame would be useful - to start off they are blocking parsers. But since they don't do it right, and to do it right one needs a completely different framework, I'd just remove those methods. Perhaps if they start offering filtering at some point then those methods could be added, but at present, this does not add anything and it makes one think one is getting some value where one is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well actually if you see that they do use streams, then they are not that bad. They are just not as good as if they were non-blocking.
|
||
import org.w3.banana.RDF | ||
|
||
trait RDFQuadReader[Rdf <: RDF, M[_], +S] extends RDFReader[Rdf, M, S] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am dubious about the need to extend RDFReader here for reasons given below.
@@ -11,6 +11,7 @@ trait Sesame extends RDF { | |||
// types related to the RDF datamodel | |||
type Graph = Model | |||
type Triple = Statement | |||
type Quad = Statement |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, one good thing about having Quad
is that it makes clear that Sesame's Statement is actually a Quad.
} | ||
|
||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comments on super trait.
I apologise for the untidy patch, that contains some changes necessary for my own projects. I tried to clean up the majority, but branching from the beginning on would have been the better choice. |
No problem, the reason there are code reviews is exactly to catch missing things. So I think the way to go here is to keep Quad, and to not try to subclass RDFReader or RDFWriter, but instead to create a very simple class that returns an Iterator/Seq of Triple or Quads. This will then allow people to filter the stream by using the well-known tools to filter Iterators/Seq and that come by default with the Scala libraries. (It would still be worth deciding which Scala Stream variant would be good to use - without buying into the full akka stack at the moment). Then it should be something nearly as easy as a fold on that stream to create whatever structure the user most likes. He can add filtering before if he wants to ignore quads, or he can just send them onto a database, without ever creating a Dataset. If you do that it would be a big +1 from me. Btw. you could keep your Dataset structure |
I need to look at this as Quad store support is important. |
Unfortunately it's long ago that I've tried to contribute to this project, nevertheless I'd like to support it as good as possible. If there is anything I can do to improve this PR, tell me |
I stumbled upon the problem, not being able to parse TriG Files from an ontology source. Today
quads are commonly used to store extra information about a specific ontology graph. Even so, quad support would be nice for future banana-rdf versions, this branch adds the possibility to read a certain named graph or all named graphs without loosing its resource identifier.
The name of a named graph can either be an URI or a blank node.
The current approach uses a
Map[Option[Rdf#Node], Rdf#Graph]
where the keyNone
represents the default graph.For Jena and Sesame implementation all tests pass, feeding them with some examples listed on w3c-trig.