How to add a new Dataset

Michael Röder edited this page Nov 10, 2015 · 4 revisions

At the moment, there are two ways to add a new dataset to GERBIL:

  • transform your dataset into a NIF file, or
  • implement your own Dataset class.

Note that generating a NIF file is the recommended way. Otherwise, you have to add your dataset permanently to GERBIL (step 3).

Whichever way you choose, you might have to perform some of the following steps.

1. Find the correct category

First, you need to find the correct Experiment Type for your dataset. The types are described here.

For example:

  • If your dataset has named entities annotated (position and URI), it can be used for A2KB.
  • If your dataset only has tags attached to the documents, containing the entities mentioned in the text, it can be used for C2KB.

2. Prepare your dataset

You can either generate a NIF file containing your dataset or implement an adapter for it.

Solution A: implement an adapter

You have to write an adapter that implements the Dataset interface. You might want to examine existing adapters such as org.aksw.gerbil.dataset.impl.msnbc.MSNBCDataset. Afterwards, you will have to follow step 3 to register your dataset so that the GERBIL system can find it.

Solution B: a NIF dataset

For this solution, you have to transform your dataset into the NLP Interchange Format (NIF). The resulting RDF could look like the following example.

        @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
        @prefix nif:     <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
        @prefix itsrdf:  <http://www.w3.org/2005/11/its/rdf#> .
        @prefix owl:     <http://www.w3.org/2002/07/owl#> .
        @prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
        @prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
        @prefix aksw:    <http://aksw.org/> .

        <http://aksw.org/N3/Reuters-128/81#char=0,589>
                a       nif:String , nif:Context , nif:RFC5147String ;
                nif:beginIndex "0"^^xsd:nonNegativeInteger ;
                nif:endIndex "589"^^xsd:nonNegativeInteger ;
                nif:isString "General Motors Acceptance Corp, a unit of General Motors Corp..."@en ;
                nif:sourceUrl <http://www.research.att.com/~lewis/Reuters-21578/15108> .

        <http://aksw.org/N3/Reuters-128/81#char=0,30>
                a       nif:RFC5147String ;
                nif:anchorOf "General Motors Acceptance Corp"^^xsd:string ;
                nif:beginIndex "0"^^xsd:nonNegativeInteger ;
                nif:endIndex "30"^^xsd:nonNegativeInteger ;
                nif:referenceContext <http://aksw.org/N3/Reuters-128/81#char=0,589> ;
                itsrdf:taIdentRef <http://dbpedia.org/resource/Ally_Financial> ;
                itsrdf:taSource "DBpedia_en_3.9"^^xsd:string .

For this step, the articles https://github.com/AKSW/gerbil/wiki/How-to-generate-a-NIF-dataset and https://github.com/AKSW/gerbil/wiki/Generating-a-NIF-dataset-using-Java may be useful.
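The offsets are the part of a NIF annotation that is easiest to get wrong: in the example above, the `#char=0,30` fragment, `nif:beginIndex`, `nif:endIndex` and `nif:anchorOf` all have to agree with the document text. The following is a minimal, language-independent sketch in Python of how such an annotation resource could be derived from the text and a mention. The helper names (`escape_turtle`, `mention_offsets`, `nif_annotation`) are hypothetical and not part of GERBIL, the prefixes are assumed to be declared as in the example, and the `itsrdf:taSource` triple is left out.

```python
from __future__ import annotations


def escape_turtle(literal: str) -> str:
    """Escape the characters that would break a double-quoted Turtle literal."""
    return (literal.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n")
                   .replace("\r", "\\r"))


def mention_offsets(text: str, mention: str, start: int = 0) -> tuple[int, int]:
    """Return (beginIndex, endIndex) of the first occurrence of mention at or after start."""
    begin = text.find(mention, start)
    if begin < 0:
        raise ValueError(f"mention {mention!r} not found in text")
    return begin, begin + len(mention)


def nif_annotation(doc_uri: str, text: str, mention: str,
                   entity_uri: str, start: int = 0) -> str:
    """Build the Turtle for one entity annotation (hypothetical helper)."""
    begin, end = mention_offsets(text, mention, start)
    # The context resource spans the whole document text.
    context_uri = f"{doc_uri}#char=0,{len(text)}"
    lines = [
        f"<{doc_uri}#char={begin},{end}>",
        "        a       nif:RFC5147String ;",
        f'        nif:anchorOf "{escape_turtle(mention)}"^^xsd:string ;',
        f'        nif:beginIndex "{begin}"^^xsd:nonNegativeInteger ;',
        f'        nif:endIndex "{end}"^^xsd:nonNegativeInteger ;',
        f"        nif:referenceContext <{context_uri}> ;",
        f"        itsrdf:taIdentRef <{entity_uri}> .",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    text = "General Motors Acceptance Corp, a unit of General Motors Corp"
    print(nif_annotation("http://aksw.org/N3/Reuters-128/81", text,
                         "General Motors Acceptance Corp",
                         "http://dbpedia.org/resource/Ally_Financial"))
```

Because the end offset is computed from the mention itself, the substring `text[begin:end]` always equals the `nif:anchorOf` value. Note that Python counts offsets in Unicode code points; if your texts contain characters outside the Basic Multilingual Plane, check that this matches the way your other tools count characters.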

After creating the NIF file, you can already use the dataset for experiments by uploading it through the user interface. However, if the dataset should be added permanently, perform step 3.

3. Add it permanently

Adding a dataset permanently means that it can be chosen from the list of known datasets in the GUI. To achieve this, the configuration of the dataset has to be added to the datasets.properties file. The configuration of an example NIF-based dataset could look like this:

org.aksw.gerbil.datasets.definition.MyDataset.name=My first dataset
org.aksw.gerbil.datasets.definition.MyDataset.class=org.aksw.gerbil.dataset.impl.nif.FileBasedNIFDataset
org.aksw.gerbil.datasets.definition.MyDataset.constructorArgs=a/path/to/my/dataset.ttl
org.aksw.gerbil.datasets.definition.MyDataset.cacheable=true
org.aksw.gerbil.datasets.definition.MyDataset.experimentType=A2KB

All properties start with org.aksw.gerbil.datasets.definition, followed by a key (here MyDataset) that groups the properties of this example dataset. The properties define the name of the dataset, the class that is used to load the dataset and the constructor argument with which the class instance is created. In this example, we use one of the constructors of the FileBasedNIFDataset that takes the path to the NIF file. The last two properties define that results for this dataset can be cached and that it can be used for A2KB experiments.
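If you maintain several NIF datasets, the five lines can be generated instead of typed by hand. The following is a minimal sketch in Python that reproduces the layout of the example configuration; the function `dataset_properties` is a hypothetical helper and not part of GERBIL, and it does not escape special characters in the values, so the output should be reviewed before it is appended to datasets.properties.

```python
PREFIX = "org.aksw.gerbil.datasets.definition"
NIF_DATASET_CLASS = "org.aksw.gerbil.dataset.impl.nif.FileBasedNIFDataset"


def dataset_properties(key: str, name: str, nif_path: str,
                       experiment_type: str, cacheable: bool = True) -> str:
    """Render the property lines for one file-based NIF dataset (hypothetical helper)."""
    entries = [
        ("name", name),
        ("class", NIF_DATASET_CLASS),
        ("constructorArgs", nif_path),
        ("cacheable", "true" if cacheable else "false"),
        ("experimentType", experiment_type),
    ]
    # Every line has the form <prefix>.<dataset key>.<property>=<value>.
    return "\n".join(f"{PREFIX}.{key}.{prop}={value}" for prop, value in entries)


if __name__ == "__main__":
    print(dataset_properties("MyDataset", "My first dataset",
                             "a/path/to/my/dataset.ttl", "A2KB"))
```

Called with the values from the example, the function produces the same five lines as shown in the configuration in this step.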