Dimitris Kontokostas edited this page Jul 18, 2022 · 8 revisions

The architecture of RDFUnit decouples test case generation from test execution. This makes it possible to run the same tests against different kinds of data sources, and SPARQL endpoints are one of them.

An overview of the command line options, including those for SPARQL endpoints, is available on the CLI wiki page. This page describes in more detail the options that can be tweaked to get better results and/or performance.

Schemas

Missing schemas

Some constraints need the schema to be present in the dataset, e.g. to traverse the type hierarchy. When you validate in-memory datasets, RDFUnit loads the schema to avoid this problem. However, this is not possible for an immutable dataset such as a SPARQL endpoint. Make sure you keep the relevant schemas together with the data or in a named graph. If this is not the case, you might get some false negatives.

Auto-schema discovery

Automated schema discovery is also available for SPARQL endpoints. RDFUnit profiles the endpoint to do this, which creates some load but should be quick. Note that RDFUnit cannot yet load or discover SHACL constraints directly from the endpoint. SHACL constraints need to be passed with the -s command option.

Targeting specific graphs

By default, RDFUnit targets the default graph. Depending on the configuration of the endpoint, all graphs might be queried as well. Check your configuration to make sure.

With the -g option you may specify the graph or graphs that RDFUnit will validate. When this option is used, the contents of all other graphs are ignored. If you want to validate the default graph together with other graphs, please consult your endpoint documentation to get the IRI of your default graph, e.g. for Stardog this is <tag:stardog:api:context:default>.
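To make the effect of graph targeting concrete, here is a minimal sketch of what restricting a query to specific graphs looks like at the SPARQL level. It is an illustration only and not RDFUnit's actual implementation; the function name restrict_to_graphs and the http://example.org graph IRI are assumptions made for this example.

```python
from typing import Iterable


def restrict_to_graphs(select_clause: str, where_body: str, graphs: Iterable[str]) -> str:
    """Return a SELECT query whose dataset is built only from the given graph IRIs."""
    lines = [select_clause]
    # Each FROM clause adds one graph to the query's default graph;
    # graphs that are not listed are not part of the dataset.
    lines.extend(f"FROM <{iri}>" for iri in graphs)
    lines.append(f"WHERE {{ {where_body} }}")
    return "\n".join(lines)


query = restrict_to_graphs(
    "SELECT ?s",
    "?s a ?type .",
    [
        "http://example.org/graph/data",    # hypothetical named graph
        "tag:stardog:api:context:default",  # Stardog's default graph IRI
    ],
)
print(query)
```

Listing the default graph IRI next to the named graph is what lets both be validated together, as described above.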

Default delay between queries

To avoid abuse of public SPARQL endpoints, RDFUnit applies a default delay between queries. This also helps to keep the load on your own endpoint down. If you want to remove this delay, use -D 0. The value is in milliseconds.
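The idea behind the delay can be sketched as a simple throttled loop. This is an illustrative sketch, not RDFUnit code: run_throttled, its parameters, and the 250 ms value are assumptions chosen for the example, and the sleep function is injectable only so that the behaviour can be observed without actually waiting.

```python
import time
from typing import Callable, Iterable, List


def run_throttled(
    queries: Iterable[str],
    execute: Callable[[str], object],
    delay_ms: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[object]:
    """Execute queries one by one, pausing delay_ms milliseconds between them."""
    results = []
    for index, query in enumerate(queries):
        # No pause before the first query, and none at all when delay_ms is 0.
        if index > 0 and delay_ms > 0:
            sleep(delay_ms / 1000.0)
        results.append(execute(query))
    return results


pauses: List[float] = []  # records the pauses instead of really sleeping
answers = run_throttled(
    ["ASK { ?s ?p ?o }"] * 3,
    execute=lambda query: True,  # stand-in for a real endpoint call
    delay_ms=250,
    sleep=pauses.append,
)
print(pauses)  # [0.25, 0.25]
```

With delay_ms set to 0 the loop never pauses, which mirrors the effect described for -D 0.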

Limit for error sampling

By default, no limit is set on the results you request. When you use the aggregated or status execution type, limit has no effect, since every query returns only one result. For shacl or shacl-lite you can specify limit to force the endpoint to return a maximum of X violations per constraint. Besides reducing the load on the server, this option can be used to return a sample of errors per constraint. This is very useful for large datasets, where the same constraint can have thousands or millions of violation instances.

Pagination

By default, results are not paginated. If your endpoint limits the number of results it can return in a single query, use e.g. -P 1000. This splits the results into chunks of 1000 and assembles them internally. As with limit, this option is relevant only when you use the shacl or shacl-lite execution type.
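The chunking described above can be sketched with SPARQL's LIMIT and OFFSET modifiers. This is a simplified illustration under stated assumptions, not RDFUnit's internal code: fetch_paginated, fake_endpoint, and the in-memory DATA list are hypothetical, and a real paginated query should also carry an ORDER BY so that pages are stable across requests.

```python
import re
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]


def fetch_paginated(
    base_query: str,
    run_query: Callable[[str], List[Row]],
    page_size: int,
    limit: Optional[int] = None,
) -> List[Row]:
    """Fetch results in chunks of page_size and assemble them into one list."""
    rows: List[Row] = []
    offset = 0
    while True:
        # When an overall limit is set, never ask for more than what is left.
        size = page_size if limit is None else min(page_size, limit - len(rows))
        if size <= 0:
            break
        page = run_query(f"{base_query}\nLIMIT {size} OFFSET {offset}")
        rows.extend(page)
        if len(page) < size:
            break  # a short page means there is nothing more to fetch
        offset += size
    return rows


# Hypothetical stand-in for an endpoint holding 2500 violation rows.
DATA: List[Row] = [{"s": f"http://example.org/r{i}"} for i in range(2500)]


def fake_endpoint(query: str) -> List[Row]:
    match = re.search(r"LIMIT (\d+) OFFSET (\d+)", query)
    size, offset = int(match.group(1)), int(match.group(2))
    return DATA[offset:offset + size]


all_rows = fetch_paginated("SELECT ?s WHERE { ?s ?p ?o }", fake_endpoint, page_size=1000)
sample = fetch_paginated("SELECT ?s WHERE { ?s ?p ?o }", fake_endpoint, page_size=1000, limit=1500)
print(len(all_rows), len(sample))  # 2500 1500
```

The optional limit argument shows how a per-constraint cap and pagination can coexist: pages are requested until either the data or the cap runs out.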

Default cache

By default, RDFUnit creates a file cache when you validate SPARQL endpoints. This speeds up consecutive validations, assuming the data in the endpoint remain static or change very infrequently. If this is not your case, add -T 0 to disable caching, or -T xx to keep a query cache for xx minutes.
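The trade-off behind the cache lifetime can be illustrated with a small time-to-live file cache. This is a sketch of the general technique only; RDFUnit's own cache may be stored and keyed differently, and QueryFileCache, fake_endpoint, and the 10 minute lifetime are assumptions for this example.

```python
import hashlib
import json
import os
import tempfile
import time
from typing import Callable, List


class QueryFileCache:
    """Hypothetical file cache keyed by query text, with a lifetime in minutes."""

    def __init__(self, directory: str, ttl_minutes: float,
                 clock: Callable[[], float] = time.time):
        self.directory = directory
        self.ttl_seconds = ttl_minutes * 60
        self.clock = clock

    def _path(self, query: str) -> str:
        digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, digest + ".json")

    def get_or_run(self, query: str, run_query: Callable[[str], List[dict]]) -> List[dict]:
        if self.ttl_seconds <= 0:
            return run_query(query)  # a lifetime of 0 means caching is off
        path = self._path(query)
        if os.path.exists(path):
            with open(path, encoding="utf-8") as handle:
                entry = json.load(handle)
            if self.clock() - entry["stored_at"] < self.ttl_seconds:
                return entry["rows"]  # still fresh: the endpoint is not contacted
        rows = run_query(query)
        with open(path, "w", encoding="utf-8") as handle:
            json.dump({"stored_at": self.clock(), "rows": rows}, handle)
        return rows


calls: List[str] = []


def fake_endpoint(query: str) -> List[dict]:
    calls.append(query)  # count how often the "endpoint" is really hit
    return [{"s": "http://example.org/r1"}]


cache = QueryFileCache(tempfile.mkdtemp(), ttl_minutes=10)
first = cache.get_or_run("SELECT ?s WHERE { ?s ?p ?o }", fake_endpoint)
second = cache.get_or_run("SELECT ?s WHERE { ?s ?p ?o }", fake_endpoint)
print(len(calls))  # 1
```

The second call is answered from disk, which is exactly why a cached validation can report stale results if the endpoint data changed in the meantime.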