
CoDEx is a set of knowledge graph Completion Datasets Extracted from Wikidata and Wikipedia. As introduced and described by our EMNLP 2020 paper CoDEx: A Comprehensive Knowledge Graph Completion Benchmark, CoDEx offers three rich knowledge graph datasets that contain positive and hard negative triples, entity types, entity and relation descriptions, and Wikipedia page extracts for entities. We provide baseline performance results, configuration files, and pretrained models on CoDEx using the LibKGE framework for two knowledge graph completion tasks, link prediction and triple classification.

The statistics for each CoDEx dataset are as follows:

	Entities	Relations	Train	Valid (+)	Test (+)	Valid (-)	Test (-)	Total triples
CoDEx-S	2,034	42	32,888	1,827	1,828	1,827	1,828	36,543
CoDEx-M	17,050	51	185,584	10,310	10,311	10,310	10,311	206,205
CoDEx-L	77,951	69	551,193	30,622	30,622	-	-	612,437
Raw dump	380,038	75	-	-	-	-	-	1,156,222

Note: If you are interested in contributing to the CoDEx corpus, feel free to open an issue or a PR!

Quick start

If you'd like to download the CoDEx data, code, and/or pretrained models locally to your machine, run the following commands. If you only want to play with the data in a remote environment, head to the next section on data exploration and analysis, and follow the instructions to view the CoDEx data with Colab.

# clone the repository
git clone https://github.com/tsafavi/codex.git
cd codex

# extract English Wikipedia plain-text excerpts for entities
# other language codes available: ar, de, es, ru, zh
./extract.sh en

# set up a virtual environment and install the Python requirements
python3.7 -m venv myenv
source myenv/bin/activate
pip install -r requirements.txt

# finally, install the codex data-loading API
pip install -e .

Data exploration and analysis

To get familiar with the CoDEx datasets and the data-loading API in an easy-to-use interface, we have provided an exploration Jupyter notebook called Explore CoDEx.ipynb.

You have two options for running the notebook:

Run on Google Colab: Open the notebook on Google's Colab platform and follow the instructions in the first cell to install all the requirements and data remotely. Make sure to restart the Colab runtime after installing the requirements before you run any of the following cells.
Run locally: Run the following commands to register your virtual environment with JupyterLab and launch JupyterLab:
```
# run from codex/
python -m ipykernel install --user --name=myenv
jupyter lab
```
Now, navigate to JupyterLab in your browser and open the Explore CoDEx.ipynb notebook.

Pretrained models and results

LibKGE setup

To use the pretrained models or run any scripts that involve pretrained models, you will need to set up LibKGE. Run the following:

# run from codex/
# this may take a few minutes
./libkge_setup.sh

This script will install the library inside codex/kge/, download the FB15K-237 dataset (which we use in our experiments) to kge/data/, copy each CoDEx dataset to kge/data/, and preprocess each dataset into the format that LibKGE requires.

Reproducing our results

We provide evaluation scripts to reproduce the results in our paper. Before running them, you must set up LibKGE by following the instructions above.

Link prediction

scripts/lp_gpu.sh and scripts/lp_cpu.sh run link prediction on all models and datasets using the LibKGE evaluation API. To run on GPU:

# run from codex/
# this may take a few minutes
scripts/lp_gpu.sh  # change to lp_cpu.sh to run on CPU

Note that this script first downloads all link prediction models for CoDEx-S, CoDEx-M, and CoDEx-L and saves them to models/link-prediction/codex-{s,m,l}/ if they do not already exist.

Triple classification

scripts/tc.sh runs triple classification and outputs validation and test accuracy/F1. To run:

# run from codex/
# this may take a few minutes
scripts/tc.sh  # runs on CPU

Note that this script first downloads all triple classification models on CoDEx-S and CoDEx-M and saves them to models/triple-classification/codex-{s,m}/ if they do not already exist.

Comparison to FB15K-237

scripts/baseline.sh compares a simple frequency baseline to the best model on CoDEx-M and on the FB15K-237 benchmark. The results are saved to CSV files named codex.csv and fb.csv, respectively. To run:

# run from codex/
# this may take a few minutes
scripts/baseline.sh  # runs on CPU

Note that this script first downloads the best pretrained LibKGE model on FB15K-237 to models/link-prediction/fb15k-237/rescal/ and the best link prediction model on CoDEx-M to models/link-prediction/codex-m/complex/ if they do not already exist.
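
For intuition, one way a frequency baseline for link prediction could work is to rank candidate tail entities for a relation by how often they appear as tails of that relation in the training triples. The sketch below illustrates that general idea only; it is an assumption and not the logic of scripts/baseline.sh. The function names and toy triples are hypothetical.

```python
from collections import Counter, defaultdict


def fit_tail_frequencies(train_triples):
    """Count how often each tail entity appears with each relation."""
    counts = defaultdict(Counter)
    for _head, relation, tail in train_triples:
        counts[relation][tail] += 1
    return counts


def rank_of_tail(counts, relation, tail):
    """Return the 1-based rank of `tail` among tails seen with `relation`.

    Tails are ordered by descending training frequency. Ties are broken
    pessimistically: tails with an equal count are placed before `tail`.
    Returns None if `tail` was never seen with `relation` in training.
    """
    freq = counts.get(relation, Counter())
    if tail not in freq:
        return None
    target = freq[tail]
    better_or_equal = sum(
        1 for other, count in freq.items() if other != tail and count >= target
    )
    return better_or_equal + 1


# Toy training triples (hypothetical IDs in the Wikidata style)
toy_train = [("Q1", "P27", "Q30"), ("Q2", "P27", "Q30"), ("Q3", "P27", "Q145")]
toy_counts = fit_tail_frequencies(toy_train)
print(rank_of_tail(toy_counts, "P27", "Q30"))
```

A baseline of this kind ignores the head entity entirely, which is why it is a useful sanity check: a benchmark on which it scores well can be solved largely by memorizing popular answers.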

Downloading pretrained models via the command line

To download pretrained models via the command line, use our download_pretrained.py Python script. The arguments are as follows:

usage: download_pretrained.py [-h]
                              {s,m,l} {triple-classification,link-prediction}
                              {rescal,transe,complex,conve,tucker}
                              [{rescal,transe,complex,conve,tucker} ...]

positional arguments:
  {s,m,l}               CoDEx dataset to download model(s)
  {triple-classification,link-prediction}
                        Task to download model(s) for
  {rescal,transe,complex,conve,tucker}
                        Model(s) to download for this task

For example, if you want to download the pretrained link prediction models for ComplEx and ConvE on CoDEx-M:

# run from codex/
python download_pretrained.py m link-prediction complex conve

This script will place a checkpoint_best.pt LibKGE checkpoint file in models/link-prediction/codex-m/complex/ and models/link-prediction/codex-m/conve/, respectively.

Alternatively, you can download the models manually following the links we provide here.
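
If you script around the downloader, it can help to check which checkpoints are already on disk. The following is a minimal sketch that rebuilds the checkpoint location from the directory layout described above; the helper names `checkpoint_path` and `missing_checkpoints` are hypothetical and not part of download_pretrained.py.

```python
from pathlib import Path

# Choices mirror the positional arguments of download_pretrained.py
SIZES = ("s", "m", "l")
TASKS = ("triple-classification", "link-prediction")
MODELS = ("rescal", "transe", "complex", "conve", "tucker")


def checkpoint_path(size, task, model, root="models"):
    """Build the expected location of a downloaded LibKGE checkpoint."""
    if size not in SIZES or task not in TASKS or model not in MODELS:
        raise ValueError(f"unsupported combination: {size!r}, {task!r}, {model!r}")
    return Path(root) / task / f"codex-{size}" / model / "checkpoint_best.pt"


def missing_checkpoints(size, task, models, root="models"):
    """Return the models whose checkpoint file is not present under `root`."""
    return [m for m in models if not checkpoint_path(size, task, m, root).is_file()]


# Example: which of the two models from the command above still need downloading?
print(missing_checkpoints("m", "link-prediction", ["complex", "conve"]))
```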

Link prediction results

CoDEx-S

	MRR	Hits@1	Hits@3	Hits@10	Config file	Pretrained model
RESCAL	0.404	0.293	0.4494	0.623	config.yaml	1vsAll-kl
TransE	0.354	0.219	0.4218	0.634	config.yaml	NegSamp-kl
ComplEx	0.465	0.372	0.5038	0.646	config.yaml	1vsAll-kl
ConvE	0.444	0.343	0.4926	0.635	config.yaml	1vsAll-kl
TuckER	0.444	0.339	0.4975	0.638	config.yaml	KvsAll-kl

CoDEx-M

	MRR	Hits@1	Hits@3	Hits@10	Config file	Pretrained model
RESCAL	0.317	0.244	0.3477	0.456	config.yaml	1vsAll-kl
TransE	0.303	0.223	0.3363	0.454	config.yaml	NegSamp-kl
ComplEx	0.337	0.262	0.3701	0.476	config.yaml	KvsAll-kl
ConvE	0.318	0.239	0.3551	0.464	config.yaml	NegSamp-kl
TuckER	0.328	0.259	0.3599	0.458	config.yaml	KvsAll-kl

CoDEx-L

	MRR	Hits@1	Hits@3	Hits@10	Config file	Pretrained model
RESCAL	0.304	0.242	0.3313	0.419	config.yaml	1vsAll-kl
TransE	0.187	0.116	0.2188	0.317	config.yaml	NegSamp-kl
ComplEx	0.294	0.237	0.3179	0.400	config.yaml	1vsAll-kl
ConvE	0.303	0.240	0.3298	0.420	config.yaml	1vsAll-kl
TuckER	0.309	0.244	0.3395	0.430	config.yaml	KvsAll-kl
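
As a reminder of what the columns in the tables above measure, here is a minimal sketch of the usual definitions of MRR and Hits@k, computed from the rank of the correct entity for each test query. This is not LibKGE's evaluation code, which may additionally handle details such as filtering of known triples and ties; the function names and the toy ranks are hypothetical.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries (ranks are 1-based)."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of queries whose correct answer is ranked in the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Toy ranks of the correct entity for four queries
toy_ranks = [1, 2, 5, 20]
print(mean_reciprocal_rank(toy_ranks))
for k in (1, 3, 10):
    print(k, hits_at_k(toy_ranks, k))
```

Both metrics lie in [0, 1] and higher is better; MRR rewards placing the correct answer near the top, while Hits@k only checks whether it falls within the first k positions.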

Triple classification results

CoDEx-S

	Acc	F1	Config file	Pretrained model
RESCAL	0.843	0.852	config.yaml	1vsAll-kl
TransE	0.829	0.837	config.yaml	NegSamp-kl
ComplEx	0.836	0.846	config.yaml	1vsAll-kl
ConvE	0.841	0.846	config.yaml	1vsAll-kl
TuckER	0.840	0.846	config.yaml	KvsAll-kl

CoDEx-M

	Acc	F1	Config file	Pretrained model
RESCAL	0.818	0.815	config.yaml	KvsAll-kl
TransE	0.797	0.803	config.yaml	NegSamp-kl
ComplEx	0.824	0.818	config.yaml	KvsAll-kl
ConvE	0.826	0.829	config.yaml	KvsAll-kl
TuckER	0.823	0.816	config.yaml	KvsAll-kl
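
For readers unfamiliar with the metrics above, the sketch below shows how accuracy and F1 can be computed once each candidate triple has a model score. It assumes a triple is classified as true when its score reaches a threshold; how scripts/tc.sh chooses thresholds is not shown here, and the function names, scores, and threshold are hypothetical.

```python
def classify(scores, threshold):
    """Label a triple as positive when its score reaches the threshold."""
    return [score >= threshold for score in scores]


def accuracy_and_f1(labels, predictions):
    """Compute accuracy and F1 for binary labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    tn = sum(1 for y, p in pairs if not y and not p)
    accuracy = (tp + tn) / len(pairs)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Two positive triples followed by two (hard) negative triples
toy_labels = [True, True, False, False]
toy_scores = [0.9, 0.8, 0.3, 0.6]
print(accuracy_and_f1(toy_labels, classify(toy_scores, threshold=0.5)))
```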

Data directory structure

The data/ directory is structured as follows:

.
├── entities
│   ├── ar
│   ├── de
│   ├── en
│   ├── es
│   ├── ru
│   └── zh
├── relations
│   ├── ar
│   ├── de
│   ├── en
│   ├── es
│   ├── ru
│   └── zh
├── triples
│   ├── codex-l
│   ├── codex-m
│   ├── codex-s
│   └── raw.zip
└── types
    ├── ar
    ├── de
    ├── en
    ├── entity2types.json
    ├── es
    ├── ru
    └── zh

We provide an overview of each subdirectory in this section.

Entities and entity types

We provide entity labels, Wikidata descriptions, and Wikipedia page extracts for entities and entity types in six languages: Arabic (ar), German (de), English (en), Spanish (es), Russian (ru), and Chinese (zh).

Each subdirectory of data/entities/ contains an entities.json file formatted as follows:

{
  <Wikidata entity ID>:{
    "label":<label in respective language if available>,
    "description":<Wikidata description in respective language if available>,
    "wiki":<Wikipedia page URL in respective language if available>
  }
}

For the labels, descriptions, or Wikipedia URLs that are not available in a given language, the value will be the empty string.
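
Because missing values are stored as empty strings, code that reads these files usually needs a fallback. The following is a minimal sketch that loads entities.json directly with the standard library (it does not use the codex data-loading API) and returns the first non-empty label across a list of preferred languages. The helper names are hypothetical, and the `data` root assumes you run from codex/.

```python
import json
from pathlib import Path


def load_entities(data_dir, lang):
    """Load data/entities/<lang>/entities.json into a dict keyed by entity ID."""
    path = Path(data_dir) / "entities" / lang / "entities.json"
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def first_available_label(entity_id, entities_by_lang, preferred=("en",)):
    """Return the first non-empty label in `preferred` order, else the ID itself."""
    for lang in preferred:
        record = entities_by_lang.get(lang, {}).get(entity_id, {})
        label = record.get("label", "")
        if label:  # an empty string means the label is unavailable
            return label
    return entity_id


if __name__ == "__main__" and Path("data/entities/en/entities.json").is_file():
    by_lang = {lang: load_entities("data", lang) for lang in ("de", "en")}
    some_id = next(iter(by_lang["en"]))
    print(some_id, first_available_label(some_id, by_lang, ("de", "en")))
```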

The file data/types/entity2types.json maps each Wikidata entity ID to a list of Wikidata type IDs, i.e.,

{
  "<Wikidata entity ID>":[
    <Wikidata type ID 1>,
    <Wikidata type ID 2>,
    ...
  ]
}

Each subdirectory of data/types/ contains a types.json file formatted as follows:

{
  <Wikidata type ID>:{
    "label":<label in respective language if available>,
    "description":<Wikidata description in respective language if available>,
    "wiki":<Wikipedia page URL in respective language if available>
  }
}
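
The two type files are meant to be used together: entity2types.json gives the type IDs of an entity, and types.json gives a readable label for each type ID. Here is a minimal sketch of that join on already-loaded dictionaries; the function names and the toy records are hypothetical.

```python
from collections import Counter


def entity_type_labels(entity_id, entity2types, types):
    """Return readable type labels for an entity.

    Falls back to the raw type ID when a type has no record or an
    empty label in the chosen language.
    """
    labels = []
    for type_id in entity2types.get(entity_id, []):
        label = types.get(type_id, {}).get("label", "")
        labels.append(label or type_id)
    return labels


def type_frequencies(entity2types):
    """Count how many entities carry each type ID."""
    return Counter(t for type_ids in entity2types.values() for t in type_ids)


# Toy records in the documented shapes
toy_entity2types = {"Q42": ["Q5"], "Q64": ["Q515", "Q999"]}
toy_types = {
    "Q5": {"label": "human", "description": "", "wiki": ""},
    "Q515": {"label": "", "description": "", "wiki": ""},
}
print(entity_type_labels("Q42", toy_entity2types, toy_types))
```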

Each extracts.zip file contains zipped files of entity descriptions from Wikipedia. Each file is named <Wikidata entity ID>.txt. We provide the extract.sh script to unzip all entity and entity type extracts for a given language; for example, ./extract.sh en unzips the English-language extracts. Pass a different language code (ar for Arabic, de for German, es for Spanish, ru for Russian, or zh for Chinese) to extract descriptions for other languages.
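
If you only need a handful of extracts, you can also read them straight from an archive without unzipping everything. The sketch below is one way to do that with Python's zipfile module. It assumes the extracts are UTF-8 text, and because the directory layout inside the archive is not documented here, it matches on the file name only; `read_extract` is a hypothetical helper.

```python
import zipfile


def read_extract(zip_path, entity_id, encoding="utf-8"):
    """Return the text of <entity_id>.txt from an extracts.zip, or None if absent."""
    target = f"{entity_id}.txt"
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Compare only the final path component so nested folders still match
            if name.rsplit("/", 1)[-1] == target:
                return archive.read(name).decode(encoding)
    return None
```

Scanning the member list on every call is fine for a few lookups; for many entities, running the extraction script once and reading the resulting .txt files is likely simpler.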

Relations

We provide relation labels and Wikidata descriptions for relations in six languages: Arabic (ar), German (de), English (en), Spanish (es), Russian (ru), and Chinese (zh).

Each subdirectory of data/relations/ contains a relations.json file formatted as follows:

{
  <Wikidata relation ID>:{
    "label":<label in respective language if available>,
    "description":<Wikidata description in respective language if available>
  }
}

Triples

Each triple file follows the format

<Wikidata head entity ID>\t<Wikidata relation ID>\t<Wikidata tail entity ID>

without any header or extra information per line.
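
Since the format is plain tab-separated text, a triple file can be parsed with a few lines of standard-library Python. The following is a minimal sketch, independent of the codex data-loading API; the function names are hypothetical, and the file path you pass in depends on which split you want to read.

```python
def read_triples(path):
    """Read a triple file into a list of (head, relation, tail) ID tuples."""
    triples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\r\n")
            if not line:
                continue  # tolerate a trailing blank line
            head, relation, tail = line.split("\t")
            triples.append((head, relation, tail))
    return triples


def summarize(triples):
    """Count triples, distinct entities, and distinct relations."""
    entities = {h for h, _, _ in triples} | {t for _, _, t in triples}
    relations = {r for _, r, _ in triples}
    return {
        "triples": len(triples),
        "entities": len(entities),
        "relations": len(relations),
    }
```

Running `summarize` over the union of a dataset's splits is a quick way to sanity-check a download against the statistics table at the top of this README.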

If you'd like to use the raw data dump, run

cd data/triples
unzip raw.zip

This will create a new data/triples/raw/ directory containing a single file, triples.txt, in the same tab-separated format as the other triple files.

How to cite

You can find the full text of our paper here.

If you used our work or found it helpful, please use the following citation:

@inproceedings{safavi-koutra-2020-codex,
    title = "{C}o{DE}x: A {C}omprehensive {K}nowledge {G}raph {C}ompletion {B}enchmark",
    author = "Safavi, Tara  and
      Koutra, Danai",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.669",
    doi = "10.18653/v1/2020.emnlp-main.669",
    pages = "8328--8350",
    abstract = "We present CoDEx, a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph completion benchmarks in scope and level of difficulty. In terms of scope, CoDEx comprises three knowledge graphs varying in size and structure, multilingual descriptions of entities and relations, and tens of thousands of hard negative triples that are plausible but verified to be false. To characterize CoDEx, we contribute thorough empirical analyses and benchmarking experiments. First, we analyze each CoDEx dataset in terms of logical relation patterns. Next, we report baseline link prediction and triple classification results on CoDEx for five extensively tuned embedding models. Finally, we differentiate CoDEx from the popular FB15K-237 knowledge graph completion dataset by showing that CoDEx covers more diverse and interpretable content, and is a more difficult link prediction benchmark. Data, code, and pretrained models are available at https://bit.ly/2EPbrJs.",
}

References and acknowledgements

We thank HeadsOfBirds for the lightbulb icon and Evan Bond for the book icon in our logo.

This project is licensed under the terms of the MIT license.
