KGQA-datasets-generalization

License: Apache-2.0

Existing approaches to Question Answering over Knowledge Graphs (KGQA) generalize poorly, often because of the standard i.i.d. assumption on the underlying dataset. Recently, three levels of generalization for KGQA were defined: i.i.d., compositional, and zero-shot. We analyze 25 well-known KGQA datasets spanning 5 different Knowledge Graphs (KGs) and show that, according to this definition, many existing and publicly available KGQA datasets are either not suited to train a generalizable KGQA system or based on discontinued, outdated KGs. Generating new datasets is a costly process and thus not a viable option for smaller research groups and companies. In this work, we propose a mitigation method for re-splitting available KGQA datasets to make them applicable to generalization evaluation, without any cost or manual effort. We test our hypothesis on three KGQA datasets: LC-QuAD 1.0, LC-QuAD 2.0, and QALD-9.

Table of contents

  1. Datasets
    1. Overview
    2. Statistics
    3. Use of the datasets
  2. Reproduction
    1. Requirements
    2. Parameters
    3. Run Scripts
  3. Citation
  4. License

Datasets

Overview

By analyzing 25 existing KGQA datasets, we identified a large gap in the generalization evaluation of KGQA systems in the Semantic Web community. The main goal of this work is to reuse existing datasets from nearly a decade of research and thus to generate new datasets applicable to generalization evaluation. We propose a simple, novel method to achieve this goal, and we evaluate both the effectiveness of our method and the quality of the new datasets it generates for building generalizable KGQA systems.

Evaluation of Existing KGQA Datasets

The table below shows the evaluation results w.r.t. the three levels of generalization defined in (Gu et al., 2021).

| Dataset | KG | Year | I.I.D. | Compositional | Zero-Shot |
|---|---|---|---|---|---|
| WebQuestions | Freebase | 2013 | | | |
| SimpleQuestions | Freebase | 2015 | | | |
| ComplexQuestions | Freebase | 2016 | - | - | - |
| GraphQuestions | Freebase | 2016 | | | |
| WebQuestionsSP | Freebase | 2016 | | | |
| The 30M Factoid QA | Freebase | 2016 | | | |
| SimpleQuestionsWikidata | Wikidata | 2017 | | | |
| LC-QuAD 1.0 | DBpedia | 2017 | | | |
| ComplexWebQuestions | Freebase | 2018 | | | |
| QALD-9 | DBpedia | 2018 | | | |
| PathQuestion | Freebase | 2018 | - | - | - |
| MetaQA | WikiMovies | 2018 | - | - | - |
| SimpleDBpediaQA | DBpedia | 2018 | | | |
| TempQuestions | Freebase | 2018 | - | - | - |
| LC-QuAD 2.0 | Wikidata | 2019 | | | |
| FreebaseQA | Freebase | 2019 | - | - | - |
| Compositional Freebase Questions | Freebase | 2020 | | | |
| RuBQ 1.0 | Wikidata | 2020 | - | - | - |
| GrailQA | Freebase | 2020 | | | |
| Event-QA | EventKG | 2020 | - | - | - |
| RuBQ 2.0 | Wikidata | 2021 | - | - | - |
| MLPQ | DBpedia | 2021 | - | - | - |
| Compositional Wikidata Questions | Wikidata | 2021 | | | |
| TimeQuestions | Wikidata | 2021 | - | - | - |
| CronQuestions | Wikidata | 2021 | - | - | - |
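
As a reading aid, the following is a minimal sketch of how the three levels of (Gu et al., 2021) can be operationalized; it is illustrative only and not the code used in this repository. It assumes each question record exposes the set of KG schema items (classes/relations) used by its query and an anonymized query template, under the hypothetical field names schema_items and composition:

def classify_generalization(train_questions, test_questions):
    # Collect the schema items and query templates seen during training.
    seen_items, seen_compositions = set(), set()
    for q in train_questions:
        seen_items.update(q["schema_items"])
        seen_compositions.add(q["composition"])
    # Label each test question according to the definitions in Gu et al. (2021).
    labels = {}
    for q in test_questions:
        if not set(q["schema_items"]) <= seen_items:
            labels[q["id"]] = "zero-shot"       # at least one unseen schema item
        elif q["composition"] not in seen_compositions:
            labels[q["id"]] = "compositional"   # known items in a novel combination
        else:
            labels[q["id"]] = "i.i.d."          # items and composition both seen
    return labels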

Statistics

The statistics of the original datasets and of their counterparts (*) generated by our approach are shown below.

| Dataset | Total | Train | Validation | Test | I.I.D. | Compositional | Zero-Shot |
|---|---|---|---|---|---|---|---|
| QALD-9 | 558 | 408 | - | 150 | 46 | 53 | 51 |
| LC-QuAD 1.0 | 5000 | 4000 | - | 1000 | 434 | 559 | 7 |
| LC-QuAD 2.0 | 30221 | 24177 | - | 6044 | 4624 | 948 | 472 |
| QALD-9* | 558 | 385 | - | 173 | 14 | 41 | 118 |
| LC-QuAD 1.0* | 5000 | 3420 | 521 | 1059 | 331 | 1021 | 228 |
| LC-QuAD 2.0* | 30221 | 20321 | 3267 | 6633 | 4014 | 3235 | 2651 |

Use of the datasets

  • The datasets are available in JSON format.
  • All datasets are stored in the output_dir directory, which contains one sub-directory each for LC-QuAD 1.0, LC-QuAD 2.0, and QALD-9. Each dataset directory in turn holds two sub-directories, one with the original version and one with the new version; a minimal loading sketch is shown below.
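
A minimal loading sketch in Python (the sub-directory and file names below are assumptions; adjust them to the files you find on disk):

import json

# Hypothetical path to one of the re-split QALD-9 files.
with open("output_dir/qald/new/train.json", encoding="utf-8") as f:
    train_set = json.load(f)
print(len(train_set), "training questions loaded")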

Reproduction

Requirements

  • rdflib==6.0.2
  • datasets==1.16.1
  • scikit-learn==1.0.1
  • numpy==1.20.3
  • pandas==1.3.5
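
Assuming a standard Python 3 environment, the pinned dependencies can be installed in one step:

pip install rdflib==6.0.2 datasets==1.16.1 scikit-learn==1.0.1 numpy==1.20.3 pandas==1.3.5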

This project depends on the kgqa_datasets repository (see link); clone it into the root directory of this project.

Parameters

To ensure reproducibility, we set random_seed to 42 for all KGQA datasets (i.e., LC-QuAD 1.0, LC-QuAD 2.0, and QALD-9).

QALD

  • dataset_id: dataset-qald
  • input_path: data_dir/qald/data_sets.json
  • output_dir: output_dir/qald
  • sampling_ratio_zero: .4
  • sampling_ratio_compo: .1
  • sampling_ratio_iid: .1
  • n_splits_compo: 1
  • n_splits_zero: 1
  • validation_size: 0.0
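
Plugged into the resplit.py interface shown under Run Scripts, these values yield the following call:

python resplit.py --dataset_id dataset-qald --input_path data_dir/qald/data_sets.json --output_dir output_dir/qald --sampling_ratio_zero .4 --sampling_ratio_compo .1 --sampling_ratio_iid .1 --random_seed 42 --n_splits_compo 1 --n_splits_zero 1 --validation_size 0.0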

LC-QuAD 1.0

  • dataset_id: dataset-lcquad
  • input_path: data_dir/lcquad/data_sets.json
  • output_dir: output_dir/lcquad
  • sampling_ratio_zero: .6
  • sampling_ratio_compo: .1
  • sampling_ratio_iid: .2
  • n_splits_compo: 1
  • n_splits_zero: 1

LC-QuAD 2.0

  • dataset_id: dataset-lcquad2
  • input_path: data_dir/lcquad2/data_sets.json
  • output_dir: output_dir/lcquad2
  • sampling_ratio_zero: .6
  • sampling_ratio_compo: .1
  • sampling_ratio_iid: .2
  • n_splits_compo: 1
  • n_splits_zero: 1
  • validation_size: 0.0

Run Scripts

  1. Before re-splitting a given KGQA dataset, preprocess the raw datasets by running the following command:
python preprocess.py --tasks <dataset_name> --data_dir <data_dir> --shuffle True --random_seed 42
  2. Then re-split the given dataset by running the following command:
python resplit.py --dataset_id <dataset_id> --input_path <input_path> --output_dir <output_dir> --sampling_ratio_zero .4 --sampling_ratio_compo .1 --sampling_ratio_iid .1 --random_seed 42 --n_splits_compo 1 --n_splits_zero 1 --validation_size 0.0
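
For example, both steps for LC-QuAD 1.0 with the parameters listed above would look as follows (the --tasks and --data_dir values are assumptions; check preprocess.py for the exact names it expects):

python preprocess.py --tasks lcquad --data_dir data_dir --shuffle True --random_seed 42
python resplit.py --dataset_id dataset-lcquad --input_path data_dir/lcquad/data_sets.json --output_dir output_dir/lcquad --sampling_ratio_zero .6 --sampling_ratio_compo .1 --sampling_ratio_iid .2 --random_seed 42 --n_splits_compo 1 --n_splits_zero 1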

Citation

Please cite our paper if you use any of the tools or datasets provided in this repository:

@article{jiang2022knowledge,
  title={Knowledge Graph Question Answering Datasets and Their Generalizability: Are They Enough for Future Research?},
  author={Jiang, Longquan and Usbeck, Ricardo},
  journal={arXiv preprint arXiv:2205.06573},
  year={2022}
}

License

This work is licensed under the Apache 2.0 License - see the LICENSE file for details.