Regular expressions with Unicode #44

Open
kad-beekw opened this issue Dec 17, 2020 · 4 comments

Comments

@kad-beekw

kad-beekw commented Dec 17, 2020

Thanks for maintaining this great library!

Observation

I'm unable to properly validate place names using sh:pattern. Place names may include spaces, single quotes, hyphens, and some non-ASCII Unicode characters. Examples of place names that should succeed are 's-Gravenhage, The Hague, and Köln.

If I understand the somewhat cryptic XSD standard (link) correctly, this should be expressible as follows:

prefix sh: <http://www.w3.org/ns/shacl#>
[ sh:property
    [ sh:pattern "\\p{S}+";
      sh:path <label> ];
  sh:targetClass <C> ].

But the following data does not validate:

[ a <C>;
  <label> "Köln" ].

Since many natural languages include characters that do not occur in simple ASCII ranges like [A-Za-z], and because natural language information is very common in RDF data, support for validating Unicode strings in sh:pattern is useful in many cases.

Expected

The ability to use category escapes in sh:pattern, specifically for natural language content for which simple ranges are difficult/impossible to express.
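
A minimal sketch of what such category escapes look like in practice, assuming an engine that supports `\p{...}` (here JavaScript's `RegExp` with the `u` flag, which is not necessarily what this library uses). Note that in Unicode `S` is the Symbol category, so `\p{S}+` would not find anything in "Köln" even where category escapes work; `\p{L}` (Letter) is probably closer to the intent. The helper `matchesPattern` and the pattern `placeNamePattern` are hypothetical names for illustration only.

```typescript
// Hypothetical helper, not part of any SHACL library.
// The "u" flag enables \p{...} Unicode property escapes in JavaScript (ES2018+).
function matchesPattern(pattern: string, value: string): boolean {
  return new RegExp(pattern, "u").test(value);
}

const placeNames: string[] = ["'s-Gravenhage", "The Hague", "Köln"];

// \p{S} is the Symbol category (currency signs, math operators, ...),
// so none of the example place names contain a match.
const symbolHits: boolean[] = placeNames.map((name) => matchesPattern("\\p{S}+", name));

// \p{L} = letters, \p{M} = combining marks, plus apostrophe, space and hyphen.
// Anchored with ^ and $ so that the whole label is checked, not just a substring.
const placeNamePattern = "^[\\p{L}\\p{M}' -]+$";
const letterHits: boolean[] = placeNames.map((name) => matchesPattern(placeNamePattern, name));

console.log(symbolHits, letterHits);
```

The anchors matter because the SPARQL REGEX function, which sh:pattern refers to, looks for a match anywhere in the value rather than matching the whole string.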

@tpluscode
Collaborator

tpluscode commented Dec 18, 2020

The pattern is but a simple escaped regex. You need not look into the XSD escaping rules, which I do not see mentioned by the SHACL spec.

This totally works:

[ sh:pattern "\\S+" ]
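
A small sketch of what the doubled backslash means, assuming the pattern is evaluated with search (substring) semantics as in the SPARQL REGEX function. A TypeScript string literal happens to use the same backslash escaping as a Turtle string, so `"\\S+"` below is the three-character regex `\S+`. The variable names are illustrative only.

```typescript
// Same escaping as in the Turtle literal: the string holds the 3 characters \S+
const nonWhitespacePattern = "\\S+";
const patternLength: number = nonWhitespacePattern.length;

const nonWhitespaceRegex = new RegExp(nonWhitespacePattern);

// \S matches any single non-whitespace character, including non-ASCII letters
const acceptsKoeln: boolean = nonWhitespaceRegex.test("Köln");

// A value with only whitespace contains no match
const acceptsBlank: boolean = nonWhitespaceRegex.test("   ");

// Unanchored, the pattern only needs one non-whitespace character somewhere,
// so it also accepts values that are clearly not place names
const acceptsDigits: boolean = nonWhitespaceRegex.test("12345 !!!");

console.log(patternLength, acceptsKoeln, acceptsBlank, acceptsDigits);
```

So `\\S+` does accept "Köln", but it is a much weaker check than a letter-based category escape would be.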

@kad-beekw
Author

@tpluscode Thanks for your response!

You need not look into the XSD escaping rules, which I do not see mentioned by the SHACL spec.

I see the following trail when I look through the standard:

  1. The SHACL standard refers to SPARQL 1.1 for the regular expression functionality: https://www.w3.org/TR/shacl/#PatternConstraintComponent
  2. The SPARQL 1.1 standard refers to XPath 3.1: https://www.w3.org/TR/sparql11-query/#func-regex
  3. The XPath 3.1 standard refers to XSD 1.0: https://www.w3.org/TR/xpath-functions/#regex-syntax
  4. But I think that the XSD 1.1 standard supersedes version 1.0: https://www.w3.org/TR/xmlschema11-2/#cces

The main question, I think, is whether XSD 1.0 or XSD 1.1 should be used.

To be honest, I like your regex notation better, since it is a bit simpler :-). However, I can imagine that there is a benefit to following the specification. There may be cases in which a regular expression stored in SHACL can be reused in a SPARQL query. (I'm not sure whether this is a good use case, but what I'm getting at is that using the same regex notation across SHACL, SPARQL, and XSD may facilitate crossover use cases.)

@tpluscode
Collaborator

I must admit I am a little confused myself, not having dug deep before.

You seem correct about how you followed your nose from the SHACL to the XSD specs. Section 7.1 of XPath seems to suggest that XSD 1.1 should be used, doesn't it?

That said, the examples in the SHACL spec do use the simple escaping (it's pretty much just the backslash). And FWIW the section for sh:pattern says:

The values of sh:pattern in a shape are valid pattern arguments for the SPARQL REGEX function.

This is definitely valid SPARQL :)

filter ( regex( ?name, "^\\S+" ) )

@kad-beekw
Author

kad-beekw commented Dec 21, 2020

@tpluscode Thanks, XSD 1.1 indeed seems to be the intended standard for regex in SHACL (and SPARQL). I do not have enough knowledge of XSD to determine whether \S is also valid. When I look at the XSD 1.1 standard, I can only find the charProp grammar rule, which is used within the \p{...} or \P{...} notation:

[85] catEsc   ::= '\p{' charProp '}'
[86] complEsc ::= '\P{' charProp '}'
[87] charProp ::= IsCategory | IsBlock

Maybe \p{S} is commonly written as \S in SPARQL? If so, this may be a de facto extension of the XSD 1.1 syntax?
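
A quick sketch comparing the two escapes, again assuming JavaScript's `RegExp` with the `u` flag as a stand-in engine. As far as I can tell, `\S` is not shorthand for `\p{S}`: in the XSD grammar it appears to be one of the multi-character escapes (the complement of `\s`, i.e. any non-whitespace character), whereas `\p{S}` selects the Unicode Symbol category. The names below are illustrative.

```typescript
// Anchored so that each regex tests exactly one character
const nonWhitespaceChar = new RegExp("^\\S$", "u");
const symbolChar = new RegExp("^\\p{S}$", "u");

// A letter is non-whitespace, but it is not a symbol
const letterIsNonWhitespace: boolean = nonWhitespaceChar.test("K");
const letterIsSymbol: boolean = symbolChar.test("K");

// A currency sign is both non-whitespace and a symbol (category Sc)
const euroIsNonWhitespace: boolean = nonWhitespaceChar.test("€");
const euroIsSymbol: boolean = symbolChar.test("€");

// A space is neither
const spaceIsNonWhitespace: boolean = nonWhitespaceChar.test(" ");
const spaceIsSymbol: boolean = symbolChar.test(" ");

console.log(letterIsNonWhitespace, letterIsSymbol, euroIsNonWhitespace, euroIsSymbol);
```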

Whatever the case may be, some regex strings that seem to be valid in XSD 1.1 do not seem to be supported by this SHACL library. Maybe this is not so bad: the XSD 1.1 standard is sufficiently unreadable to prevent large groups of users from picking up the regex grammar described in it. Maybe the de facto way of writing regex is more popular.
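
One possible explanation for the missing support, which this thread does not confirm: if the pattern string is handed to JavaScript's `RegExp` without the `u` flag, `\p{...}` is not treated as a category escape at all. The sketch below only demonstrates that engine behavior; whether the library constructs its regexes this way is an assumption.

```typescript
// Assumption: the sh:pattern value is passed to JavaScript's RegExp constructor.
const xsdStylePattern = "\\p{S}+";

const withoutUnicodeFlag = new RegExp(xsdStylePattern);
const withUnicodeFlag = new RegExp(xsdStylePattern, "u");

// Without "u", \p is read as a plain "p" and {S} as literal text,
// so the regex matches the literal string "p{S}" instead of a symbol.
const literalMatchWithoutFlag: boolean = withoutUnicodeFlag.test("p{S}");
const euroMatchWithoutFlag: boolean = withoutUnicodeFlag.test("€");

// With "u", \p{S} is a Unicode property escape for the Symbol category.
const literalMatchWithFlag: boolean = withUnicodeFlag.test("p{S}");
const euroMatchWithFlag: boolean = withUnicodeFlag.test("€");

console.log(literalMatchWithoutFlag, euroMatchWithoutFlag, literalMatchWithFlag, euroMatchWithFlag);
```

If that is what happens, supporting category escapes might be as small a change as adding the `u` flag, although that flag also makes some legacy escapes stricter, so existing patterns would need to be checked.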
