Failed to catch violations with properties that are not defined #215

Open
mhoangvslev opened this issue Nov 15, 2023 · 8 comments

Comments

@mhoangvslev

mhoangvslev commented Nov 15, 2023

Given the data graph below, which is JSON-LD markup:

{
    "@context": "http://schema.org",
    "@type": "Product",
    "@id": "https://www.slimmingeats.com/blog/hoisin-chicken-actifry-stove-top",
    "name": "Hoisin Chicken (Actifry or Stove top) | Slimming Eats",
    "description": "Heavenly Tender Hoisin Chicken - a quick simple dish that is ready in less than 20 minutes and can be cooked in an Actifry or on the Stove Top. Gluten Free, Dairy Free, Slimming Eats and Weight Watchers friendly.",
    "url": "https://www.slimmingeats.com/blog/hoisin-chicken-actifry-stove-top",
    "image": "https://www.slimmingeats.com/blog/wp-content/uploads/2016/06/hoisin-chicken-24-480x480.jpg",
    "brand": {
        "@type": "Brand",
        "name": "Slimming Eats"
    },
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": 5.0,
        "reviewCount": 16
    },
    "offers": {
        "@type": "Offer",
        "price": 0.0,
        "priceCurrency": "USD",
        "availability": "http://schema.org/InStock"
    },
    "recipeCategory": "Main Dish",
    "recipeCuisine": "Chinese",
    "cookTime": "PT26M",
    "nutrition": {
        "@type": "NutritionInformation",
        "calories": "225",
        "totalFat": "2.8g",
        "saturatedFat": "0.1g",
        "cholesterol": "95mg",
        "sodium": "403mg",
        "carbohydrates": "10.3g",
        "fiber": "1.2g",
        "sugar": "5.5g",
        "protein": "39.2g"
    },
    "pouet": "pouet"
}

Given the shapes graph and the ontology, and since nutrition is not a property of Product or Thing, I expected a violation, but none was reported.

I ran using:

pyshacl -e schemaorg-all-http.ttl -s schemaorg-datashapes.ttl -a -im -df json-ld test.json

I added sh:closed true to Thing, but I ran into the inheritance issues discussed in #167 and #141.

I tried to follow the recommendation here, but then pySHACL reports an error:

Validator encountered a Constraint Load Error:
Cannot select a validator to use, according to the rules.
For reference, see https://www.w3.org/TR/shacl/#constraint-components-validators

@ajnelson-nist
Contributor

It looks to me like you've stumbled into an undesired extension of schema:nutrition, and you're looking for why it's not prevented by SHACL. The short answer is, this particular use is part of the open-world behavior of RDF (not necessarily just OWL), and hasn't been closed off by any closed-world rule from SHACL. schema.org's rdf:Propertys are written in a way that gently suggests what classes they should be associated with, but using structural predicates that don't trigger behaviors from RDFS or OWL inferencing engines.

First, on trying to flag this as an error from inferencing: The ontology you linked is not an OWL ontology, and the nearby OWL transcoding does not include any owl:disjointWith statements. Without owl:disjointWith, even if the properties were defined in a way supporting OWL inferencing, you would not find any errors from OWL inferencing by using schema:nutrition on something unexpected, like a schema:Concert. The thing using schema:nutrition would just become a schema:Concert and either a schema:MenuItem or schema:Recipe.

Neither OWL nor RDFS inferencing using the schema.org ontology (i.e. the RDF Schema or OWL Ontology) would cause new @type values to be inferred. schema.org's properties in the RDF Schema eschew rdfs:domain, instead using schema:domainIncludes which (IIRC) has no entailment (/inference) semantics to generate new RDF triples.
And, while the OWL transcoding has rdfs:domain statements, they all (IIRC - a little hard to grep at a skim) tie to owl:unionOf anonymous classes, which need an OWL reasoner to whittle down to a named class from other axioms describing the object. But again, this still won't raise an error for you, because the OWL transcoding does not include any owl:disjointWith statements that would let you arrive at a logical inconsistency.
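
To make the difference concrete, here is a minimal plain-Python sketch. It is not an RDFS reasoner and does not use rdflib: triples are plain tuples of strings, and only the RDFS domain rule (rdfs2) is applied. The `ex:` identifiers and the variant that puts a plain rdfs:domain on schema:nutrition are assumptions made for illustration; schema.org does not publish that statement.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"
DOMAIN_INCLUDES = "schema:domainIncludes"


def apply_rdfs_domain(triples):
    """Return the new type triples entailed by rdfs:domain statements only.

    Rule rdfs2: (p rdfs:domain C) and (s p o) entail (s rdf:type C).
    schema:domainIncludes is deliberately ignored: it has no such rule.
    """
    domains = {(s, o) for s, p, o in triples if p == RDFS_DOMAIN}
    inferred = set()
    for s, p, _ in triples:
        for prop, cls in domains:
            if p == prop:
                inferred.add((s, RDF_TYPE, cls))
    return inferred - set(triples)


# Hypothetical data: a Product that carries schema:nutrition.
data = {
    ("ex:hoisinChicken", RDF_TYPE, "schema:Product"),
    ("ex:hoisinChicken", "schema:nutrition", "ex:nutritionInfo"),
}

# The style used by schema.org's RDF Schema: nothing is entailed.
schema_org_style = data | {("schema:nutrition", DOMAIN_INCLUDES, "schema:Recipe")}

# Hypothetical variant with a plain rdfs:domain on a named class.
rdfs_domain_style = data | {("schema:nutrition", RDFS_DOMAIN, "schema:Recipe")}

print(apply_rdfs_domain(schema_org_style))   # set()
print(apply_rdfs_domain(rdfs_domain_style))  # {('ex:hoisinChicken', 'rdf:type', 'schema:Recipe')}
```

In neither case is an error raised. The rdfs:domain variant only adds a second type to the node, which is the "would just become ... a schema:Recipe" behavior described above.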

In a purposefully-unimaginative and closed-world style, I could declare that the set of concerts and set of cooking recipes are disjoint. But, that's not encoded in schema.org, so I'd need to write a SHACL rule for my own data. And, respecting imagination and open-world modeling, someone could probably link a video to a counterexample, with a rock band playing while someone's baking a cake. Depending on any other ontology foundations your graph is built on, that concert could rightly get a nutrition value.

If you wanted to be really careful with your usage of schema:nutrition and make sure it only appears on MenuItem or Recipe, you would need this shape:

@prefix schema: <http://schema.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

<urn:example:schema-nutrition-subjects-shape>
  a sh:NodeShape ;
  sh:targetSubjectsOf schema:nutrition ;
  sh:or (
    [
      a sh:NodeShape ;
      sh:class schema:MenuItem ;
    ]
    [
      a sh:NodeShape ;
      sh:class schema:Recipe ;
    ]
  ) ;
  .
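
As a rough illustration of what that shape checks, here is a plain-Python sketch under stated assumptions: triples are tuples of strings, the `ex:` nodes are made up, and the optional `subclass_of` map stands in for rdfs:subClassOf statements. It is not how pySHACL evaluates the shape; it only mirrors the logic of sh:targetSubjectsOf combined with sh:or over two sh:class checks.

```python
RDF_TYPE = "rdf:type"
ALLOWED_CLASSES = {"schema:MenuItem", "schema:Recipe"}


def nutrition_subject_violations(triples, subclass_of=None):
    """Return subjects of schema:nutrition that are neither a MenuItem nor a Recipe."""
    subclass_of = subclass_of or {}

    def with_superclasses(cls):
        # sh:class also accepts instances of subclasses, so walk upward.
        seen, stack = set(), [cls]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(subclass_of.get(current, ()))
        return seen

    # Collect every class each node belongs to, directly or by inheritance.
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).update(with_superclasses(o))

    # Focus nodes: every subject that uses the predicate, whatever its type.
    focus_nodes = {s for s, p, _ in triples if p == "schema:nutrition"}
    return sorted(s for s in focus_nodes if not (types.get(s, set()) & ALLOWED_CLASSES))


triples = [
    ("ex:hoisinChicken", RDF_TYPE, "schema:Product"),
    ("ex:hoisinChicken", "schema:nutrition", "_:n1"),
    ("ex:stew", RDF_TYPE, "schema:Recipe"),
    ("ex:stew", "schema:nutrition", "_:n2"),
]
print(nutrition_subject_violations(triples))  # ['ex:hoisinChicken']
```

The point of the design is that the check starts from the predicate rather than from a class, so a node of any type (or of no type at all) is examined as soon as it uses schema:nutrition.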

@mhoangvslev
Author

Thank you for the prompt reply!

If I understand it right, in a few words, schema.org defines what a class is but does not define what a class is not.
Is there a quick way to assume the owl:disjointWith relation with all other classes unless specified by the author?

The use case is this:

  • I generated a markup and I want to validate it (similar to https://validator.schema.org/ but automatically).
  • The schema.org validator is able to say that "nutrition" is not a property of "Product" or any superclass of "Product".

> If you wanted to be really careful with your usage of schema:nutrition and make sure it only appears on MenuItem or Recipe, you would need this shape:

In the shape graph, nutrition only appears in MenuItem or Recipe through:

schema:Recipe
  a rdfs:Class ;
  a sh:NodeShape ;
  rdfs:comment "A recipe. For dietary restrictions covered by the recipe, a few common restrictions are enumerated via [[suitableForDiet]]. The [[keywords]] property can also be used to add more detail."^^rdf:HTML ;
  rdfs:label "Recipe" ;
  rdfs:subClassOf schema:HowTo ;
  sh:property schema:Recipe-cookTime ;
  sh:property schema:Recipe-cookingMethod ;
  sh:property schema:Recipe-ingredients ;
  sh:property schema:Recipe-nutrition ;
  sh:property schema:Recipe-recipeCategory ;
  sh:property schema:Recipe-recipeCuisine ;
  sh:property schema:Recipe-recipeIngredient ;
  sh:property schema:Recipe-recipeInstructions ;
  sh:property schema:Recipe-recipeYield ;
  sh:property schema:Recipe-suitableForDiet ;

schema:Recipe-nutrition
  a sh:PropertyShape ;
  sh:path schema:nutrition ;
  sh:class schema:NutritionInformation ;
  sh:description "Nutrition information about the recipe or menu item."^^rdf:HTML ;
  sh:name "nutrition" ;

@ajnelson-nist
Contributor

> Thank you for the prompt reply!

Thanks, it was fun to write and think through.

> If I understand it right, in a few words, schema.org defines what a class is but does not define what a class is not. [...]

Yes.

> [...] Is there a quick way to assume the owl:disjointWith relation with all other classes unless specified by the author?

This will likely be very difficult when you start considering subclasses.

When I mentioned "Other ontology foundations," I was alluding to how some other ontologies rely on some foundational ontology that divides "Everything" into subsets, and sometimes those subsets are disjoint, sometimes they aren't. For instance, one division drawn from the philosophical literature is endurants vs. perdurants, which behave differently with how they relate to time. (Take thing X. Freeze time. Is X wholly contained in that time slice---an endurant---or must you look outside that time slice to have the full definition of X---a perdurant? My body is an endurant. My life is a perdurant.)

Endurants and perdurants are disjoint. If you had those in your ontology near the top (near owl:Thing), you'd probably pick up that a rock concert, which would be an event, which would be a perdurant, can't also be a recipe, which is wholly containable in a time slice and is thus an endurant. So, schema:nutrition on a concert would fail because of a disjointness several superclasses up.

Adding disjointness statements to an ontology is unlikely to be quick, or to remain quick once you start.
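
A small plain-Python sketch of that "several superclasses up" check follows. The class hierarchy and the disjointness axiom are entirely hypothetical (the `ex:` names are made up, and schema.org declares nothing of the sort); the sketch only shows how a clash between two asserted types can surface far above them.

```python
# Hypothetical hierarchy with a foundational split near the top.
SUBCLASS_OF = {
    "ex:Concert": {"ex:Event"},
    "ex:Event": {"ex:Perdurant"},
    "ex:Recipe": {"ex:CreativeWork"},
    "ex:CreativeWork": {"ex:Endurant"},
}

# Hypothetical disjointness axioms, each stored as an unordered pair.
DISJOINT_PAIRS = [frozenset({"ex:Endurant", "ex:Perdurant"})]


def superclasses(cls):
    """Return cls together with all of its transitive superclasses."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, ()))
    return seen


def disjointness_clashes(asserted_types):
    """Return the disjoint pairs that a node with these types falls into."""
    closure = set()
    for cls in asserted_types:
        closure |= superclasses(cls)
    return sorted(tuple(sorted(pair)) for pair in DISJOINT_PAIRS if pair <= closure)


print(disjointness_clashes({"ex:Concert"}))               # []
print(disjointness_clashes({"ex:Concert", "ex:Recipe"}))  # [('ex:Endurant', 'ex:Perdurant')]
```

Even in this toy form, every new class has to be placed under one side of the split before the check says anything useful about it, which is where the maintenance cost comes from.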

> The use case is this:
>
>   • I generated a markup and I want to validate it (similar to https://validator.schema.org/ but automatically).
>   • The schema.org validator is able to say that "nutrition" is not a property of "Product" or any superclass of "Product".

The shape I sketched would satisfy this use case.

> If you wanted to be really careful with your usage of schema:nutrition and make sure it only appears on MenuItem or Recipe, you would need this shape:
>
> In the shape graph, nutrition only appears in MenuItem or Recipe through:
>
> schema:Recipe
>   a rdfs:Class ;
>   a sh:NodeShape ;
>   rdfs:comment "A recipe. For dietary restrictions covered by the recipe, a few common restrictions are enumerated via [[suitableForDiet]]. The [[keywords]] property can also be used to add more detail."^^rdf:HTML ;
>   rdfs:label "Recipe" ;
>   rdfs:subClassOf schema:HowTo ;
>   sh:property schema:Recipe-cookTime ;
>   sh:property schema:Recipe-cookingMethod ;
>   sh:property schema:Recipe-ingredients ;
>   sh:property schema:Recipe-nutrition ;
>   sh:property schema:Recipe-recipeCategory ;
>   sh:property schema:Recipe-recipeCuisine ;
>   sh:property schema:Recipe-recipeIngredient ;
>   sh:property schema:Recipe-recipeInstructions ;
>   sh:property schema:Recipe-recipeYield ;
>   sh:property schema:Recipe-suitableForDiet ;
>
> schema:Recipe-nutrition
>   a sh:PropertyShape ;
>   sh:path schema:nutrition ;
>   sh:class schema:NutritionInformation ;
>   sh:description "Nutrition information about the recipe or menu item."^^rdf:HTML ;
>   sh:name "nutrition" ;

These shapes prescribe how schema:nutrition behaves when on schema:Recipe. To check that schema:nutrition is on a schema:Recipe, you'd need to orient your shape around the predicate schema:nutrition, not the subject schema:Recipe. That's why the shape I sketched uses sh:targetSubjectsOf.

@mhoangvslev
Author

> The shape I sketched would satisfy this use case.

I just updated the example markup to include a rubbish field:

"pouet": "pouet"

Since I never know what the end user might type, I can't anticipate every misadventure with a shape.

@ajnelson-nist
Contributor

Unfortunately, IRI typo detection is another hard problem in RDF because of the open-world nature. Think of it from the future-proofing perspective. schema:pouet doesn't exist today. (I'm blindly assuming that, but the URL does 404 at the moment.) If you write some mechanism to flag schema:pouet as an error today, it'd be appropriate for now. But say in a year, schema:pouet exists, and is added as an isolated new property, with no other risk from the perspective of the schema.org maintainers. Now your schema:pouet detector would flag new valid data as wrong.

Also, syntax-check - the way you wrote "pouet": "pouet" does not actually resolve to an RDF triple, because it's not part of a JSON-LD context dictionary (being a non-existent term). My recollection is RDFLib currently silently drops such JSON that doesn't function as JSON-LD. (A string-literal can't be a predicate...except in n-triples, IIRC? In any case, while I forget the detailed reasoning and history, what you wrote would silently drop.) "schema:pouet": "1234" would enact the "User invented something by typo" scenario you wanted.
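
If keys like that are silently dropped, one way to notice them is to inspect the JSON before it is parsed as RDF. The sketch below is a plain-Python pre-check, not part of RDFLib or pySHACL. It assumes you supply the set of terms your context defines (`KNOWN_TERMS` here is a hand-picked, hypothetical subset) and that the context declares no `@vocab` default, which would change how unknown keys expand.

```python
import json


def unmapped_keys(node, known_terms, path=""):
    """Yield JSON paths of keys that are neither JSON-LD keywords,
    IRIs or compact IRIs (containing ':'), nor terms in known_terms."""
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}/{key}"
            if not (key.startswith("@") or ":" in key or key in known_terms):
                yield here
            if key != "@context":  # context contents are definitions, not data
                yield from unmapped_keys(value, known_terms, here)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from unmapped_keys(item, known_terms, f"{path}/{index}")


# Hypothetical, incomplete list of terms the context is assumed to define.
KNOWN_TERMS = {"name", "nutrition", "calories"}

doc = json.loads("""
{
    "@context": "http://schema.org",
    "@type": "Product",
    "name": "Hoisin Chicken",
    "nutrition": {"@type": "NutritionInformation", "calories": "225"},
    "pouet": "pouet",
    "schema:pouet": "1234"
}
""")
print(list(unmapped_keys(doc, KNOWN_TERMS)))  # ['/pouet']
```

Note that "schema:pouet" passes this check because it is a syntactically valid compact IRI; catching that case needs the fixed-set comparison discussed next.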

FWIW, with an ontology community I work with, we introduced a "Concept typo checker" as part of an extension to pyshacl, here. My current understanding of the state of the RDF world is that a general concept typo checker across all namespaces is not possible because of the open-world assumption, but a "fixed set" of concepts can be constructed for specific ontologies, though it has to be kept aligned with the ontologies' versions.

It might also be possible to do such a concept set with SKOS Concept Schemes if you want a SHACL-oriented solution. But, again, this set of skos:inScheme statements has to be an artifact maintained in some way tied to an ontology's specific version, because the set of concepts will likely grow over time.
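
A fixed-set check of this kind can be sketched in a few lines of plain Python. This is not the extension mentioned above, and it uses no RDF library: triples are tuples of IRI strings, and `KNOWN_CONCEPTS` is a hypothetical, hand-written snapshot that would in practice be generated from one specific version of the ontology.

```python
import difflib

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA_NS = "http://schema.org/"

# Hypothetical snapshot of the namespace; must be regenerated per ontology version.
KNOWN_CONCEPTS = frozenset(
    SCHEMA_NS + name for name in ("Product", "Recipe", "name", "nutrition", "offers")
)


def unknown_concepts(triples, namespace=SCHEMA_NS, known=KNOWN_CONCEPTS):
    """Map each predicate or class IRI in `namespace` that is missing from
    `known` to a list of similarly spelled known local names."""
    local_names = sorted(iri[len(namespace):] for iri in known)
    report = {}
    for _, predicate, obj in triples:
        # Check predicates always, and objects only when they name a class.
        candidates = [predicate] + ([obj] if predicate == RDF_TYPE else [])
        for iri in candidates:
            if iri.startswith(namespace) and iri not in known:
                local = iri[len(namespace):]
                report[iri] = difflib.get_close_matches(local, local_names, n=3)
    return report


triples = [
    ("http://example.org/dish", RDF_TYPE, SCHEMA_NS + "Product"),
    ("http://example.org/dish", SCHEMA_NS + "nutrtion", "_:n1"),
    ("http://example.org/dish", SCHEMA_NS + "pouet", "1234"),
]
for iri, suggestions in unknown_concepts(triples).items():
    print(iri, "is not in the snapshot; close matches:", suggestions)
```

IRIs outside the chosen namespace are never flagged, which keeps the check compatible with the open-world assumption everywhere except the one vocabulary you have pinned.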

@mfhepp

mfhepp commented Nov 17, 2023

FYI: Some more background from a schema.org perspective of the problem is in schemaorg/schemaorg#3408 (comment).

@mfhepp

mfhepp commented Nov 17, 2023

Quick and practical advice for this and similar data validation tasks:

You need to produce your own custom SHACL shape definitions, either written manually or derived from the RDFS or OWL vocabulary (or vocabularies) and possibly augmented by hand. Neither OWL nor RDFS is particularly suited for defining constraints on the domain or range of properties.

After all, that is the entire motivation for the notion of data shapes and shape languages like SHACL; worthwhile reading is e.g. Tim Berners-Lee's piece Linked Data Shapes, Forms and Footprints.

@mhoangvslev
Author

After a while, I figured out a quick and dirty way to perform the type checking under CWA:

  • Recursively bring all parents' properties to the child class, then close the definition with sh:closed true.
  • Here is the Python function that works with this version of the schema.org shapes.

from rdflib import BNode, ConjunctiveGraph, Literal, Namespace
from rdflib.namespace import OWL, RDF, XSD

SH = Namespace("http://www.w3.org/ns/shacl#")


def close_ontology(graph: ConjunctiveGraph):
    """Close each class shape in a SHACL shapes graph.

    Copy every property shape from the parent classes onto the current
    class shape, then add sh:closed true to that shape.
    """
    # Pair each class shape with the property shapes of its (transitive) parents.
    query = """
    SELECT DISTINCT ?shape ?parentShape ?parentProp WHERE {
        ?shape  a <http://www.w3.org/ns/shacl#NodeShape> ;
                a <http://www.w3.org/2000/01/rdf-schema#Class> ;
                <http://www.w3.org/2000/01/rdf-schema#subClassOf>* ?parentShape .

        ?parentShape <http://www.w3.org/ns/shacl#property> ?parentProp .
        FILTER(?parentShape != ?shape)
    }
    """

    results = graph.query(query)
    visited_shapes = set()
    for result in results:
        shape = result.get("shape")
        parent_prop = result.get("parentProp")
        graph.add((shape, SH.property, parent_prop))
        graph.add((shape, SH.closed, Literal(True)))

        # Once per shape: shape sh:ignoredProperties ( rdf:type owl:sameAs )
        # https://www.w3.org/TR/turtle/#collections
        if shape not in visited_shapes:
            ignored_props = graph.collection(BNode())
            ignored_props += [RDF.type, OWL.sameAs]

            graph.add((shape, SH.ignoredProperties, ignored_props.uri))
            visited_shapes.add(shape)

    # Replace xsd:float with xsd:double.
    # Materialize the subjects first, because the loop modifies the graph.
    for prop in list(graph.subjects(SH.datatype, XSD.float)):
        graph.set((prop, SH.datatype, XSD.double))

    return graph
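
To show the effect of that closure without any RDF tooling, here is a plain-Python sketch of the same idea. The class tree and property lists are a hand-picked, incomplete subset of schema.org used only for illustration, and the sketch assumes a single parent per class, which schema.org does not guarantee.

```python
# Illustrative subset only: parent of each class (single inheritance assumed).
PARENT = {
    "Product": "Thing",
    "Recipe": "HowTo",
    "HowTo": "CreativeWork",
    "CreativeWork": "Thing",
}

# Illustrative subset only: properties declared directly on each class shape.
OWN_PROPERTIES = {
    "Thing": {"name", "description", "url", "image"},
    "Product": {"brand", "aggregateRating", "offers"},
    "Recipe": {"nutrition", "cookTime", "recipeCategory", "recipeCuisine"},
}


def allowed_properties(cls):
    """Collect the class's own properties plus everything inherited from its parents."""
    allowed = set()
    while cls is not None:
        allowed |= OWN_PROPERTIES.get(cls, set())
        cls = PARENT.get(cls)
    return allowed


def closed_world_violations(cls, used_properties):
    """Return the used properties that a closed shape for cls would reject."""
    return sorted(set(used_properties) - allowed_properties(cls))


used = ["name", "brand", "offers", "nutrition", "cookTime", "pouet"]
print(closed_world_violations("Product", used))                  # ['cookTime', 'nutrition', 'pouet']
print(closed_world_violations("Recipe", ["name", "nutrition"]))  # []
```

The inherited properties have to be copied down before closing because a closed shape only accepts the properties listed on that shape itself; closing Thing alone would reject every property declared on its subclasses.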
