assert(): cli-friendly validation + entailment api for data pipelines #198

usalu opened this issue Aug 31, 2023 · 2 comments
usalu commented Aug 31, 2023

Following the discussion and the related issues [1][2][3][...], many people (including me) would be interested in natively accessing entailment.

First, I also believe that a validate function shouldn't return an entailed input; conceptually, that would be very wrong.

The name Shapes Constraint Language doesn't suggest entailment in general, just as SPARQL Update makes no conceptual sense, yet it is of course very useful for solving practical problems.

Here is a practical example that I think depicts a very general problem:


Company A decided to create a digital twin and equip their buildings with sensors. Company B decided the same. Company A and B are not related. While company A used FIWARE for their digital twin, company B used BRICK and a custom broker solution.

Time has passed. They both need to retrofit their buildings, and for that they need another company for consultancy. Company C chooses company D to do the simulation for them. Company B heard about surrogate modelling and hires company E, which claims to predict energy consumption more accurately because they can also use the measured data in their model.

Company D uses a long-established standard SIM for simulation. Company E uses their custom machine learning models. Both produce the same SIMREP, for which a non-profit organization has already produced a standard visualization VISREP that is importable by data visualization platforms.

Now, company A needs to entail and validate their graph to be SIM compliant. For that, companies A and C write FIWARE2SIM.

Company E already knows about BRICK, and therefore they already have an in-house developed BRICK2SURROGATESIM.

The following data pipelines would be possible:

COMPANYAGRAPH | assert(FIWARE2SIM) | assert(SIM) | assert(SIMREP) | assert(VISREP)

COMPANYBGRAPH | assert(BRICK2SURROGATESIM) | assert(SURROGATESIM) | assert(SIMREP) | assert(VISREP)

An important difference from conventional pipelines is that the output of one stage doesn't necessarily need to be pruned; it can be the union of the previous stages (sh:closed being false), which is only possible due to the nature of RDF. This means that assert(SURROGATESIM) can access parts of COMPANYBGRAPH if wanted/needed.


Instead of a building it could be about any other digital twin. Instead of a digital twin it could be about any product. Instead of a product it could be about any other subject. Instead of simulation or visualization it could be about any other service.

In general, it would solve the problem that even if people use RDF as their data model, there is still a lot of duct taping required to make such pipelines work. Manual imperative labour again and again (unless you use our closed API, which comforts you!).

As a concept I would suggest assert because it can include both validation and entailment. The value of assert is exactly the combination of both: the validated entailment from SHACL shape A to SHACL shape B is what brings value. For validation only, SHACL Core is wonderful. For entailment only, rdflib, dotnetrdf or whatever is wonderful. Why go to so much trouble to create a trivial multiply function? Consistency, and being able to use it inside a declarative environment (SPARQL). What I don't like about assert is that it focuses more on the validation than the entailment, but I think the entailment is more important. I guess entail would also just work as a name.

The API would be as simple as this:

assert
stdin: data graph
stderr: validation graph
stdout: entailed graph
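
For illustration, here is a rough sketch of how such an assert wrapper could already be built on top of pySHACL's existing validate() function. The shapes file name is hypothetical, and the inplace keyword is an assumption that would need to be checked against the installed pySHACL version:

# Rough sketch only: an assert-style wrapper over pySHACL's existing validate().
# "pipeline-shapes.ttl" is a hypothetical shapes file; the inplace keyword is an
# assumption about the installed pySHACL version.
import sys
from rdflib import Graph
from pyshacl import validate

data_graph = Graph().parse(data=sys.stdin.read(), format="turtle")
shapes_graph = Graph().parse("pipeline-shapes.ttl", format="turtle")

conforms, report_graph, _ = validate(
    data_graph,
    shacl_graph=shapes_graph,
    advanced=True,   # enable SHACL-AF rules and functions
    inplace=True,    # let rule entailment expand data_graph itself
)

sys.stderr.write(report_graph.serialize(format="turtle"))  # validation graph -> stderr
sys.stdout.write(data_graph.serialize(format="turtle"))    # entailed graph -> stdout
sys.exit(0 if conforms else 1)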

Now, for this to work properly, SHACL Functions and SHACL JavaScript are vital.
For SHACL Functions, I have already proposed a solution to dynamically register extension functions in an issue for rdflib.
Due to the imperative nature of SHACL-JS, debugging should be possible. The easiest solution I can imagine is to create a wrapper webservice which the JS functions fetch into, but that would be a rather hacky solution. Instead, there could be a SPARQL endpoint directly and a SPARQL function which calls SERVICE. Better would be to integrate that into pyduktape2; following the issues there, there seems to be nothing like that. But I guess that, due to the complexity of such a pipeline, the JS code will generally be small and just glue together what is not possible in SPARQL, so that doesn't have a high priority.

As you might have guessed, I am more of an application guy. Therefore the open-world OWL reasoning stuff is not really my interest; I am rather for closed-world SHACL validation. While it is technically possible to run entailment with both engines, from a practical standpoint I can't see the use of both (or do you have an example which uses OWL for something that SHACL can't provide?). Therefore I would set the SHACL-AF and JS flags to true and the ont flag to false by default. After all, it is pySHACL, right?

Long story short: everything is already there to make this happen because you have done amazing work implementing it! Just one more function and that's it. I can create a pull request if the feature is wanted. :)


ashleysommer commented Sep 1, 2023

Hi @usalu
Thanks for the very detailed issue write-up. It is clear that you have put a great deal of research into this and have a good understanding of the issue at hand, the requirements of PySHACL, and the current limitations of the software.

First, I also believe that a validate function shouldn't return an entailed input; conceptually, that would be very wrong.

Thank you for validating my opinion on that. Most people who raise this issue seem to think that validating should modify their input graph by default and return the modified graph. That not only violates the W3C SHACL spec, but also does not make sense conceptually for a validator.

Here is a practical example that I think depicts a very general problem:

I admit, despite reading it several times, I am having a lot of trouble following and understanding your example. It seems very specific to a particular application case, and is not general at all. It had too much specific detail to be a general example case, and not enough detail for me to understand the problem you are attempting to explain.

As a concept I would suggest assert because it can include both validation and entailment.

From the rest of your writeup, I gather that you are asking about two different things:

  1. The ability to run an assert procedure that combines validation of the input dataset with entailment, returning the entailed graph as the output. (Same as requested in Access to inferred triples #20, Does pyShacl support entailment outputs using advanced features? #78, Returning modified data graph instead of validation report? #189, and discussed in [Discussion] PySHACL Alternate Modes #60).
  2. Something about the implementation of SHACL Functions, in the form of SPARQLFunctions and SHACL-JS.

In response to issue 1.:
After reading back through all of the issues related to this request, and re-reading the discussion in #60, I have come to the understanding that there are two different features that need to be implemented here. The first is something like the assert you described, where validation is performed as normal with graph expansion and SHACL Rules, and if the input graph does not fail validation it returns the expanded graph to stdout. The second is an inflate operation: similar to assert, it runs graph expansion and SHACL Rules by default, but skips Shape Constraint checking and returns the expanded graph to stdout. The main challenge with that is deciding exactly what needs to be added to the output graph. The consensus seems to be that it should include RDFS/OWL inferencing (if enabled) as well as SHACL Rule entailment, but not the triples from the mix-in ontology file. This will require the use of a second in-memory data graph, specifically for the purpose of delivering the output, but it can be done.
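
For illustration only, the second in-memory graph idea could be sketched with plain RDFLib set operations like this (this is not how the PySHACL validator is actually structured internally):

# Illustrative sketch, not PySHACL internals: work on a copy so the caller's
# graph is never mutated, and strip the mix-in ontology triples before output.
from rdflib import Graph

def expanded_output(data_graph: Graph, ont_graph: Graph) -> Graph:
    working = Graph()
    working += data_graph
    working += ont_graph      # mix-in ontology, only needed while reasoning
    # ... run RDFS/OWL inferencing and SHACL Rules against `working` here ...
    output = Graph()
    output += working
    output -= ont_graph       # exclude the mix-in triples from the result
    return output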

In response to issue 2.:
I read through your linked RDFLib issue, and unless I am misunderstanding your request, I think you have missed that RDFLib already has the ability to register a custom SPARQL function into the SPARQL engine: register_custom_function().

Now, for this to work properly, SHACL Functions and SHACL JavaScript are vital.

PySHACL has had full support for SHACL Functions from the SHACL-AF spec for more than two years. Specifically, it implements SPARQLFunction using RDFLib's register_custom_function(), and it implements SHACL-JS JSFunctions using pyduktape2. So what you are describing is already possible (aside from the debugging ability).
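
For reference, a minimal sketch of that hook in use, registering a multiply function under an example namespace:

# Minimal sketch of RDFLib's register_custom_function(); the ex: namespace and
# the multiply function are purely illustrative.
from rdflib import Graph, Literal, URIRef
from rdflib.plugins.sparql.operators import register_custom_function

EX_MULTIPLY = URIRef("http://example.com/ns#multiply")

def multiply(op1: Literal, op2: Literal) -> Literal:
    # arguments arrive as already-evaluated RDF terms
    return Literal(op1.toPython() * op2.toPython())

register_custom_function(EX_MULTIPLY, multiply)

query = """
PREFIX ex: <http://example.com/ns#>
SELECT ?result WHERE { BIND(ex:multiply(6, 7) AS ?result) }
"""
for row in Graph().query(query):
    print(row.result)  # -> 42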


usalu commented Sep 1, 2023

@ashleysommer Thank you for the quick and detailed answer!

The first is something like the assert you described, where validation is performed as normal with graph expansion and SHACL Rules, and if the input graph does not fail validation it returns the expanded graph to stdout. The second is an inflate operation: similar to assert, it runs graph expansion and SHACL Rules by default, but skips Shape Constraint checking and returns the expanded graph to stdout.

The general idea behind data pipelines was to share mappings between two different SHACL shapes in a reusable, entirely declarative way. To be more precise, it would be something like a qualified data pipeline, because it does not only pipe one graph and return a modified graph; the output graph would actually be SHACL validated. Only this makes the pipeline reusable. It acts like a statically typed function, but instead of static types as the schema, you have a SHACL shape. Only the combination of the two together in one API like assert makes it powerful.

Think of SHACL as being an Interface Definition Language like protobuf and, at the same time, a transpiler from one IDL definition into another.
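
To sketch the analogy in code: a qualified pipeline would be a chain of assert-style stages, where each stage entails via SHACL Rules and only passes its output on if validation succeeds. This is only a sketch under the same assumptions as above (the inplace keyword and the shape file names are hypothetical):

# Sketch of a qualified data pipeline: each stage entails and validates, and the
# output only flows to the next stage if the graph conforms to that stage's shape.
from rdflib import Graph
from pyshacl import validate

def run_pipeline(graph: Graph, shape_files: list) -> Graph:
    for shapes_file in shape_files:
        shapes = Graph().parse(shapes_file, format="turtle")
        conforms, _report, _text = validate(
            graph, shacl_graph=shapes, advanced=True, inplace=True
        )
        if not conforms:
            raise ValueError(f"stage {shapes_file} failed validation")
    return graph

# e.g. run_pipeline(company_a_graph,
#                   ["FIWARE2SIM.ttl", "SIM.ttl", "SIMREP.ttl", "VISREP.ttl"])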

The main challenge with that is deciding exactly what needs to be added to the output graph.

Yes, exactly.

The consensus seems to be that it should include RDFS/OWL inferencing (if enabled) as well as SHACL Rule entailment, but not the triples from the mix-in ontology file. This will require the use of a second in-memory data graph, specifically for the purpose of delivering the output, but it can be done.

I would leave all the OWL-RL and OWL-related inferencing out because, in my understanding, OWL and SHACL have completely different purposes despite technically doing the same things (checking a schema, inferring triples and reasoning about whether the input graph is valid).

For OWL, I see the main value in searching for knowledge inside an arbitrarily large graph which holds more knowledge than I can ever understand. The idea is: here is a complex ontology of which I (think I) understand the rules, and here is an arbitrarily complex graph; please give me back everything you know, so that I can find out something new. Aka the open world.

For SHACL, I see the main value in limiting what a graph looks like. Not the entire WWW; only something that I can process. This limit is what creates the freedom for interoperable behaviour, something like OpenAPI and JSON Schema for microservices. A qualified data pipeline would be like the source code for a microservice, which itself is a graph.

I read through your linked RDFLib issue, and unless I am misunderstanding your request, I think you have missed that RDFLib already has the ability to register a custom SPARQL function into the SPARQL engine: register_custom_function().

Currently, in my understanding, this is only possible at "compile time". What I was proposing is a way to use a graph like that to create the function and register it at runtime, using the definition from the graph itself. If you look at the example:

@prefix ex: <http://example.com/ns#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:multiply
	a sh:SPARQLFunction ;
	rdfs:comment "Multiplies its two arguments $op1 and $op2." ;
	sh:parameter [
		sh:path ex:op1 ;
		sh:datatype xsd:integer ;
		sh:description "The first operand" ;
	] ;
	sh:parameter [
		sh:path ex:op2 ;
		sh:datatype xsd:integer ;
		sh:description "The second operand" ;
	] ;
	sh:returnType xsd:integer ;
	sh:select """
		SELECT ($op1 * $op2 AS ?result)
		WHERE {
		}
		""" .

then you see that SELECT ($op1 * $op2 AS ?result) is not a valid SPARQL query. Of course you can quickly write a function which replaces these arguments with Python arguments (I used the existing initBindings from rdflib), but you could also use a native f-string approach in Python:

def multiply(graph, op1, op2):
    # substitute the operands directly into the query string
    return graph.query(f"SELECT ({op1} * {op2} AS ?result) WHERE {{}}")

But that would again be at definition time and not at runtime. So the issue is about using metaprogramming to define such functions and register them at runtime.

This is necessary because the SHACL graph of a qualified data pipeline has to be reusable. It wouldn't work if you had to pull all the Python implementations and manually register them.
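
To sketch what I mean by registering from the graph at runtime (heavily simplified, and not how pySHACL implements SPARQLFunctions; ordering parameters by local name and plain string substitution are naive assumptions):

# Naive sketch only: pull an sh:SPARQLFunction definition out of a shapes graph
# and register it with RDFLib at runtime.
from rdflib import Graph, Namespace, URIRef
from rdflib.plugins.sparql.operators import register_custom_function

SH = Namespace("http://www.w3.org/ns/shacl#")

def register_sparql_function(shapes: Graph, func_iri: URIRef) -> None:
    select_body = str(shapes.value(func_iri, SH.select))
    # e.g. ex:op1 -> "op1", ex:op2 -> "op2"
    param_names = sorted(
        str(shapes.value(param, SH.path)).rsplit("#", 1)[-1]
        for param in shapes.objects(func_iri, SH.parameter)
    )

    def impl(*args):
        query = select_body
        for name, arg in zip(param_names, args):
            # replace the pre-bound $-variables with concrete RDF terms
            query = query.replace(f"${name}", arg.n3())
        return next(iter(Graph().query(query)))[0]

    register_custom_function(func_iri, impl)

# usage: parse the shapes graph containing ex:multiply and register it on the fly
# shapes = Graph().parse("pipeline-shapes.ttl", format="turtle")
# register_sparql_function(shapes, URIRef("http://example.com/ns#multiply"))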


Let me try to give a more detailed explanation:
[image]

The COMPANYAGRAPH would be described by a custom SHACL shape COMPANYASHAPE of company A. The FIWARE2SIM SHACL shape contains all the mapping behaviour to transform a COMPANYASHAPE graph into a SIM shape graph, which itself contains all the mapping behaviour to translate into SIMREP, which itself contains all the mapping behaviour to translate into VISREP.

The original shape has geometry (3D), the SIM shape has simulation-related information (2D plus energy characteristics such as how many people per m²) and calculates the energy demand (simply by multiplying areas with usage) for individual rooms. The SIMREP shape is about reporting energy behaviour (e.g. in relative units such as kWh/m²·year, which divides the energy use per m², etc.).

You can see it as a general-purpose programming language which accepts one shape and returns another shape.

Here is a totally different application:
[image]

It would be a pipeline for computing transdisciplinary connections from an article.

Hopefully these examples on the graph level help to understand the idea. Feel free to tell me if things are still unclear to you!
