
Tracing SHACL Validations towards a Better Understanding of SPARQL Query Results


TracedSPARQL

TracedSPARQL traces SHACL validations during SPARQL query processing to enable a better understanding of SPARQL query results. A brief explanation of TracedSPARQL is available in the directory doc.

Table of Contents

  1. Preparation of the Environment
    1. Machine Requirements
    2. Software
    3. Bash Commands
  2. Experiments
    1. Research Questions
    2. Data & SHACL Shape Schemas
    3. Engines
    4. Setups
    5. How to reproduce?
    6. Results
  3. License
  4. References

Preparation of the Environment

Machine Requirements

  • OS: Ubuntu 16.04.6 LTS or newer
  • Memory: 128 GiB
  • HDD: approx. 50 GiB free disk space

Software

  • Docker - v19.03.6 or newer
  • docker-compose - v1.26.0 or newer
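
The minimum versions above can be verified before starting. The sketch below is illustrative and not part of the repository; `check_version` is a hypothetical helper that compares version strings with `sort -V`:

```shell
# Hedged sketch (not part of this repository): verify that the installed
# Docker tooling meets the minimum versions listed above.
check_version() {
  # $1: tool name, $2: installed version, $3: minimum required version
  if [ "$(printf '%s\n' "$3" "$2" | sort -V | head -n1)" = "$3" ]; then
    echo "$1 $2: OK (>= $3)"
  else
    echo "$1 $2: too old (need >= $3)" >&2
    return 1
  fi
}

if command -v docker >/dev/null 2>&1; then
  # 'docker --version' prints e.g. "Docker version 19.03.6, build ..."
  check_version "Docker" "$(docker --version | grep -o '[0-9][0-9.]*' | head -n1)" "19.03.6"
else
  echo "docker not installed"
fi

if command -v docker-compose >/dev/null 2>&1; then
  check_version "docker-compose" "$(docker-compose version --short)" "1.26.0"
else
  echo "docker-compose not installed"
fi
```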

Bash Commands

The experiment scripts use the following bash commands:

  • basename
  • cd
  • chown
  • declare (with options -a and -A)
  • echo
  • logname
  • rm
  • sleep
  • source
  • unzip
  • wget

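Since the scripts depend on these commands, their availability can be checked up front. The sketch below uses a hypothetical `check_commands` helper, not part of the repository; the builtins `cd`, `declare`, `echo`, and `source` are always present in bash, so only the external tools are listed:

```shell
# Hedged sketch (not part of this repository): fail early if any external
# command required by the experiment scripts is missing from the PATH.
check_commands() {
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      return 1
    fi
  done
  echo "all commands present"
}

# External tools used by the scripts (shell builtins excluded).
check_commands basename chown logname rm sleep unzip wget \
  || echo "install the missing tools before running the experiments"
```
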
Experiments

Research Questions

  1. What is the overhead of adding online SHACL validation to the SPARQL query processing?
  2. Do the proposed optimizations increase the performance?
  3. Which heuristic has the highest single effect?

Data & SHACL Shape Schemas

Data from three benchmarks are used in the evaluation of TracedSPARQL. The following benchmarks are covered:

  • Lehigh University Benchmark (LUBM) [1]
  • Waterloo SPARQL Diversity Test Suite (WatDiv) [2]
  • DBpedia [3]

For LUBM and WatDiv, knowledge graphs of three different sizes are used; together with DBpedia, a total of seven knowledge graphs are evaluated. For LUBM and WatDiv, two SHACL shape schemas of different complexity are validated; in the case of DBpedia, a single SHACL shape schema is used. The evaluation includes 10 SPARQL queries from the LUBM benchmark, 18 from WatDiv, and 20 created for DBpedia. Each SPARQL query covers at least one SHACL shape schema of the respective benchmark. All data used are publicly available [4].

Engines

TracedSPARQL is compared with a naive approach, referred to as the baseline. The federated SPARQL query engine used is DeTrusty [5]. The SHACL validation is performed by Trav-SHACL [6] and SHACL2SPARQLpy [7], a Python implementation of SHACL2SPARQL [8]. This results in the following engines being included in the evaluation:

Name             | SHACL Validator | Heuristics
---------------- | --------------- | ----------
Baseline         | Trav-SHACL      | none
Baseline S2S     | SHACL2SPARQLpy  | none
TracedSPARQL     | Trav-SHACL      | all
TracedSPARQL S2S | SHACL2SPARQLpy  | all

Setups

The combination of a knowledge graph, engine, SHACL shape schema, and SPARQL query is referred to as a testbed; this leads to a total of 1,065 testbeds. Each testbed is executed five times. Caches are flushed between the execution of two consecutive testbeds.

How to reproduce?

To facilitate the reproduction of the results, all components are encapsulated in Docker containers and the experiments are controlled via shell scripts. The entire pipeline can be run by executing:

sudo ./00_auto.sh

The individual scripts are briefly described below.

  • 00_auto.sh: Executes the entire experiment automatically
  • 01_preparation.sh: Prepares the experimental environment, i.e., downloads the data and sets up the Docker containers
  • 02_experiments_lubm.sh: Executes the experiments for LUBM
  • 03_experiments_watdiv.sh: Executes the experiments for WatDiv
  • 04_experiments_dbpedia.sh: Executes the experiments for DBpedia
  • 05_ablation_study.sh: Executes the ablation study
  • 06_plots.sh: Creates the plots presented in the paper
  • 07_cleanup.sh: Cleans up the experimental environment, including changing the ownership of the result files to the user executing the script
  • run_testbeds.sh: Contains functions for performing the experiments
  • variables.sh: Contains variables used for performing the experiments

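Assuming that 00_auto.sh simply chains the numbered scripts in order (a plausible reading of the list above, not verified against the repository), a step-by-step run can be sketched as follows, e.g., to repeat only one benchmark:

```shell
# Hedged sketch: the assumed execution order of the numbered scripts.
# The script names are taken from the list above; the chaining is an assumption.
steps="01_preparation.sh 02_experiments_lubm.sh 03_experiments_watdiv.sh \
04_experiments_dbpedia.sh 05_ablation_study.sh 06_plots.sh 07_cleanup.sh"

for script in $steps; do
  echo "would run: sudo ./$script"
  # Inside the repository, replace the echo with: sudo "./$script"
done
```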
Results

The result plots included in the paper and a brief summary are available in the results directory.

License

TracedSPARQL is licensed under GPL-3.0; see the license file.

References

[1] Y. Guo, Z. Pan, J. Heflin. LUBM: A Benchmark for OWL Knowledge Base Systems. Journal of Web Semantics 3(2-3), 158-182 (2005). DOI: 10.1016/j.websem.2005.06.005

[2] G. Aluç, O. Hartig, M.T. Özsu, K. Daudjee. Diversified Stress Testing of RDF Data Management Systems. In: The Semantic Web -- ISWC 2014, Springer, Cham, 2014. DOI: 10.1007/978-3-319-11964-9_13

[3] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In: The Semantic Web, Springer, Berlin, Heidelberg, 2007. DOI: 10.1007/978-3-540-76298-0_52

[4] P.D. Rohde, M.-E. Vidal. Dataset: TracedSPARQL Benchmarks. Leibniz Data Manager (2023). DOI: 10.57702/wfl730bc

[5] P.D. Rohde, M. Bechara, Avellino. DeTrusty v0.15.0. Zenodo (2023). DOI: 10.5281/zenodo.10245898

[6] M. Figuera, P.D. Rohde, M.-E. Vidal. Trav-SHACL: Efficiently Validating Networks of SHACL Constraints. In: The Web Conference, ACM, New York, NY, USA, 2021. DOI: 10.1145/3442381.3449877

[7] M. Figuera, P.D. Rohde. SHACL2SPARQLpy v1.3.0. GitHub (2023). URL: https://github.com/SDM-TIB/SHACL2SPARQLpy

[8] J. Corman, F. Florenzano, J.L. Reutter, O. Savković. SHACL2SPARQL: Validating a SPARQL Endpoint against Recursive SHACL Constraints. In: Proceedings of the ISWC 2019 Satellite Tracks, CEUR-WS, Aachen, Germany, 2019. URL: https://ceur-ws.org/Vol-2456/paper43.pdf
