Configuring SameAs retrieval

Michael Röder edited this page Nov 9, 2020 · 1 revision

For matching URIs, GERBIL tries to make use of owl:sameAs links. This is done based on different retrievers and a mapping of well-known domains to retriever implementations. Given a URI, the retrieval process extracts the domain and uses the retriever registered for this domain to retrieve data about the entity. If the retrieved data contains owl:sameAs links connecting the given URI to other URIs, these new URIs are added to the set of URIs and are used for retrieval as well. This is repeated until no new URIs are found.
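The iterative process described above can be sketched as a simple worklist loop. This is only an illustration in Python, not GERBIL's Java implementation: the function name `expand_same_as`, the `retrievers` dictionary and the toy link table are all assumptions made for this example.

```python
from urllib.parse import urlparse


def expand_same_as(start_uri, retrievers):
    """Collect all URIs reachable from start_uri via sameAs links.

    `retrievers` (hypothetical) maps a domain such as "dbpedia.org" to a
    function that takes a URI and returns the URIs linked to it.
    """
    known = {start_uri}
    pending = [start_uri]
    while pending:
        uri = pending.pop()
        # Pick the retriever by the domain of the URI.
        retrieve = retrievers.get(urlparse(uri).netloc)
        if retrieve is None:
            continue  # no retriever registered for this domain
        for linked in retrieve(uri):
            if linked not in known:
                # A new URI is added to the set and used for retrieval too.
                known.add(linked)
                pending.append(linked)
    return known


# Toy link table standing in for a real index or HTTP lookup.
LINKS = {
    "http://dbpedia.org/resource/Berlin": [
        "http://de.dbpedia.org/resource/Berlin",
    ],
    "http://de.dbpedia.org/resource/Berlin": [
        "http://dbpedia.org/resource/Berlin",
        "http://www.wikidata.org/entity/Q64",
    ],
}


def lookup(uri):
    return LINKS.get(uri, [])


retrievers = {"dbpedia.org": lookup, "de.dbpedia.org": lookup}
result = expand_same_as("http://dbpedia.org/resource/Berlin", retrievers)
print(sorted(result))
```

The loop terminates because every URI is queued at most once, which mirrors the "repeat until no new URIs are found" rule.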

Overall, there are three types of retrievers.

Index-based retriever

The recommended variant is to use a prepared index. We offer an index for DBpedia URIs. When starting GERBIL using the start.sh file, the user is asked whether the index should be downloaded. It is extracted to gerbil_data/indexes/dbpedia, which is the default path for this index.

The path of the index as well as the domains for which it should be used are defined in the gerbil.properties file:

org.aksw.gerbil.semantic.sameas.impl.index.IndexBasedSameAsRetriever.domain=dbpedia.org
org.aksw.gerbil.semantic.sameas.impl.index.IndexBasedSameAsRetriever.folder=${org.aksw.gerbil.DataPath}/indexes/dbpedia

HTTP-based retriever

GERBIL can try to retrieve owl:sameAs links from the web at runtime. This has the advantage that the retrieved data is up to date. However, requesting URIs one by one is costly, so using this kind of retriever significantly increases the runtime of an evaluation.

Dereferencing retriever

The dereferencing retriever uses Apache Jena to retrieve RDF data for the given URI. It can be configured for several domains in the gerbil.properties file:

org.aksw.gerbil.semantic.sameas.impl.http.HTTPBasedSameAsRetriever.domain=de.dbpedia.org
org.aksw.gerbil.semantic.sameas.impl.http.HTTPBasedSameAsRetriever.domain=fr.dbpedia.org

Wikipedia API retriever

In practice, redirects within Wikipedia can be very helpful, especially when older datasets are used for the evaluation. Hence, GERBIL can make use of the Wikipedia API to retrieve additional URIs and use them similarly to owl:sameAs links. The usage of the Wikipedia API can be configured in the gerbil.properties file by defining the Wikipedia domain for which it should be used:

org.aksw.gerbil.semantic.sameas.impl.wiki.WikipediaApiBasedSingleUriSameAsRetriever.domain=en.wikipedia.org
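To illustrate how redirects can be treated like owl:sameAs links, the following Python sketch builds a redirect query for the MediaWiki Action API and turns a response into Wikipedia URIs. It is not GERBIL's code: the helper names and the sample response are made up for this example, and sending the HTTP request is deliberately left out.

```python
from urllib.parse import urlencode


def build_redirect_query(domain, title):
    """Build a MediaWiki Action API URL that resolves redirects for a title."""
    params = {
        "action": "query",
        "titles": title,
        "redirects": "1",
        "format": "json",
    }
    return "https://" + domain + "/w/api.php?" + urlencode(params)


def redirect_targets(response, domain):
    """Turn the 'redirects' part of a parsed JSON response into page URIs."""
    redirects = response.get("query", {}).get("redirects", [])
    return [
        "http://" + domain + "/wiki/" + entry["to"].replace(" ", "_")
        for entry in redirects
    ]


# Hand-written sample in the shape of an API response (illustrative only).
sample_response = {
    "query": {"redirects": [{"from": "Big Apple", "to": "New York City"}]}
}
query_url = build_redirect_query("en.wikipedia.org", "Big Apple")
targets = redirect_targets(sample_response, "en.wikipedia.org")
print(query_url)
print(targets)
```

Each returned URI would then be added to the URI set in the same way as a URI found through an owl:sameAs link.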

Caching

The costly retrievers should be used with caches to avoid at least some of the HTTP requests. To this end, two cache implementations are available. A simple in-memory cache can be configured with the maximum number of URIs it should store:

org.aksw.gerbil.semantic.sameas.InMemoryCachingSameAsRetriever.cacheSize=5000
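The idea of a size-bounded in-memory cache in front of a costly retriever can be sketched as follows. The class name and the least-recently-used eviction are assumptions for this illustration; this page does not state which eviction strategy GERBIL actually uses.

```python
from collections import OrderedDict


class InMemoryCache:
    """Wrap a retriever function and keep at most cache_size results."""

    def __init__(self, retrieve, cache_size):
        self.retrieve = retrieve
        self.cache_size = cache_size
        self.entries = OrderedDict()

    def get(self, uri):
        if uri in self.entries:
            # Cache hit: mark the entry as recently used.
            self.entries.move_to_end(uri)
            return self.entries[uri]
        # Cache miss: ask the wrapped (costly) retriever.
        result = frozenset(self.retrieve(uri))
        self.entries[uri] = result
        if len(self.entries) > self.cache_size:
            # Assumed policy: drop the least recently used entry.
            self.entries.popitem(last=False)
        return result


calls = []


def slow_retrieve(uri):
    calls.append(uri)  # record every real lookup
    return {uri, uri + "#same"}


cache = InMemoryCache(slow_retrieve, cache_size=2)
for uri in ["http://a", "http://a", "http://b", "http://c", "http://a"]:
    cache.get(uri)
print(calls)
```

The second request for `http://a` is served from the cache, while the last one triggers a new lookup because the entry was evicted in the meantime.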

A second, heavier caching method is the file-based cache. This implementation persists the results in a file and can reuse them over a longer period of time. It is activated by setting the path to a cache file:

org.aksw.gerbil.semantic.sameas.CachingSameAsRetriever.cacheFile=${org.aksw.gerbil.CachePath}/sameAs.cache
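A file-based cache differs from the in-memory variant in that results survive a restart. The sketch below shows the principle with a JSON file; the class name and the file format are assumptions chosen for this example and say nothing about the format GERBIL writes.

```python
import json
import os
import tempfile


class FileCache:
    """Wrap a retriever function and persist its results in a JSON file."""

    def __init__(self, retrieve, cache_file):
        self.retrieve = retrieve
        self.cache_file = cache_file
        self.entries = {}
        if os.path.exists(cache_file):
            # Reuse results stored by an earlier run.
            with open(cache_file, encoding="utf-8") as f:
                self.entries = json.load(f)

    def get(self, uri):
        if uri not in self.entries:
            self.entries[uri] = sorted(self.retrieve(uri))
            self._save()
        return set(self.entries[uri])

    def _save(self):
        with open(self.cache_file, "w", encoding="utf-8") as f:
            json.dump(self.entries, f)


def failing_retrieve(uri):
    raise RuntimeError("retriever should not be called for cached URIs")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sameAs.cache")
    first = FileCache(lambda uri: {uri, "http://example.org/other"}, path)
    first.get("http://example.org/a")
    # A new instance, e.g. after a restart, reads the stored result.
    second = FileCache(failing_retrieve, path)
    reused = second.get("http://example.org/a")
print(sorted(reused))
```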

Deactivation

In some cases, HTTP-based retrieval is inefficient because of its high cost. It can then be deactivated by removing all statements that define domains for the dereferencing retriever and the Wikipedia API retriever.

Standard retrievers

The retriever implementation comes with some additional retrievers that do not need further configuration. We list them just for completeness.

  • The DBpedia Wikipedia bridge transforms DBpedia URIs into Wikipedia URIs and vice versa.
  • The URI encoding retriever handles the encoding of special characters and, hence, works like a bridge between URIs and IRIs.
  • The error fixing retriever is used to implement fixes of common errors. At the moment, it simply transforms faulty en.dbpedia.org URIs into dbpedia.org URIs.
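The three standard retrievers listed above are essentially string transformations on URIs. The following Python sketch shows one possible reading of each; the function names, the URI prefixes and the exact encoding rules are assumptions for illustration and may differ from GERBIL's behavior.

```python
from urllib.parse import quote, unquote

DBPEDIA_PREFIX = "http://dbpedia.org/resource/"
WIKIPEDIA_PREFIX = "http://en.wikipedia.org/wiki/"
FAULTY_PREFIX = "http://en.dbpedia.org/"


def bridge(uri):
    """Map a DBpedia URI to its Wikipedia counterpart and vice versa."""
    if uri.startswith(DBPEDIA_PREFIX):
        return WIKIPEDIA_PREFIX + uri[len(DBPEDIA_PREFIX):]
    if uri.startswith(WIKIPEDIA_PREFIX):
        return DBPEDIA_PREFIX + uri[len(WIKIPEDIA_PREFIX):]
    return None  # neither a DBpedia nor an English Wikipedia URI


def encoding_variants(uri):
    """Return the URI in decoded (IRI-like) and percent-encoded form."""
    decoded = unquote(uri)
    encoded = quote(decoded, safe=":/")
    return {uri, decoded, encoded}


def fix_en_dbpedia(uri):
    """Rewrite faulty en.dbpedia.org URIs to dbpedia.org URIs."""
    if uri.startswith(FAULTY_PREFIX):
        return "http://dbpedia.org/" + uri[len(FAULTY_PREFIX):]
    return uri


print(bridge("http://dbpedia.org/resource/Berlin"))
print(sorted(encoding_variants("http://dbpedia.org/resource/Michael_Röder")))
print(fix_en_dbpedia("http://en.dbpedia.org/resource/Berlin"))
```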

URI Filter

Not all owl:sameAs links are helpful. Since many links between datasets are generated automatically, some of them connect entities that should not be connected. For this reason, GERBIL comes with a filter that removes URIs of certain domains from the URI set. The filter can be configured in the gerbil.properties file:

org.aksw.gerbil.semantic.sameas.impl.UriFilteringSameAsRetrieverDecorator.domainBlacklist=data.nytimes.com
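The effect of such a blacklist can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes that a URI is dropped only when its host matches a blacklisted domain exactly; whether GERBIL also matches subdomains is not stated here.

```python
from urllib.parse import urlparse


def filter_uris(uris, domain_blacklist):
    """Remove all URIs whose host is listed in domain_blacklist."""
    return {
        uri for uri in uris
        if urlparse(uri).netloc not in domain_blacklist
    }


uris = {
    "http://dbpedia.org/resource/Berlin",
    "http://data.nytimes.com/N50987186835223032381",
}
kept = filter_uris(uris, {"data.nytimes.com"})
print(kept)
```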