Generating a NIF dataset using Java

Daniel Mietchen edited this page Apr 27, 2019 · 4 revisions

This article describes how a developer can use the classes defined in our gerbil.nif.transfer library (version 1.1.0) to generate a dataset in the Natural Language Processing Interchange Format (NIF). Note that this article is not a complete NIF tutorial, and our library does not handle all of the possibilities, classes and properties offered by the NIF ontology.

Throughout this article, we use the following text as a small example.

Japan (Japanese: 日本 Nippon or Nihon) is a stratovolcanic archipelago of 6,852 islands.

Inside this text, we would like to mark Japan and stratovolcanic archipelago as named entities. Furthermore, we would like to express that Japan is a country and a stratovolcanic archipelago, and that a stratovolcanic archipelago is a special type of archipelago. Additionally, we would like to add the topic (or tag) "Geography", since the text contains geographical information.

The implementation of the example can be found here.

Simple Corpus Creation

We start by creating a document object using the text and a document URI.

String text = "Japan (Japanese: 日本 Nippon or Nihon) is a stratovolcanic archipelago of 6,852 islands.";
Document document = new DocumentImpl(text, "http://example.org/document0");

A document has a list of so-called Markings. A Marking is a piece of additional information that can be attached to the text, e.g., the occurrence of a named entity. A list of available Marking classes can be found here.

For our two named entities, we create TypedNamedEntity objects and add them to the document.

Set<String> uris = new HashSet<String>();
uris.add("http://example.org/Japan");
Set<String> types = new HashSet<String>();
types.add("http://example.org/Country");
types.add("http://example.org/StratovolcanicArchipelago");
document.addMarking(new TypedNamedEntity(0, 5, uris, types));

uris = new HashSet<String>();
uris.add("http://example.org/StratovolcanicArchipelago");
types = new HashSet<String>();
types.add("http://example.org/Archipelago");
types.add("http://www.w3.org/2000/01/rdf-schema#Class");
// Start position 42, length 26 ("stratovolcanic archipelago"), i.e. the span ends at 68
document.addMarking(new TypedNamedEntity(42, 26, uris, types));
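The start position and length passed to each TypedNamedEntity can be derived from the text instead of being counted by hand. The following is a minimal sketch using only the Java standard library; the class MentionOffsets and its locate method are hypothetical helpers and not part of gerbil.nif.transfer. It assumes that positions are counted in Java char units, which holds for the example text because all of its characters lie in the Basic Multilingual Plane.

```java
public class MentionOffsets {

    // The example text; the two kanji are written as Unicode escapes
    // so that this source file stays ASCII-only.
    static final String TEXT =
            "Japan (Japanese: \u65E5\u672C Nippon or Nihon) is a stratovolcanic archipelago of 6,852 islands.";

    /**
     * Hypothetical helper: returns {start, length} of the first occurrence
     * of the mention inside the text, or null if the mention does not occur.
     */
    static int[] locate(String text, String mention) {
        int start = text.indexOf(mention);
        if (start < 0) {
            return null;
        }
        return new int[] { start, mention.length() };
    }

    public static void main(String[] args) {
        int[] japan = locate(TEXT, "Japan");
        int[] archipelago = locate(TEXT, "stratovolcanic archipelago");
        System.out.println(japan[0] + "," + japan[1]);             // prints 0,5
        System.out.println(archipelago[0] + "," + archipelago[1]); // prints 42,26
    }
}
```

The two values of each pair can be passed as the first two constructor arguments of TypedNamedEntity. Note that indexOf only finds the first occurrence, so a mention that appears several times needs additional handling.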

The topic "Geography" is added using the Annotation class.

uris = new HashSet<String>();
uris.add("http://example.org/Geography");
document.addMarking(new Annotation(uris));

Since a "real" corpus comprises more than one document, we add our generated document to a list, to which further documents can be added.

List<Document> documents = new ArrayList<Document>();
documents.add(document);

Writing our new list of documents to an OutputStream, a Writer or a simple String can be done using an instance of the org.aksw.gerbil.io.nif.NIFWriter interface. In our example, we use a writer for Turtle, i.e., an org.aksw.gerbil.io.nif.impl.TurtleNIFWriter object. (Note that there are no other implementations of this interface at the moment.)

NIFWriter writer = new TurtleNIFWriter();
String nifString = writer.writeNIF(documents);
System.out.println(nifString);

This prints the following Turtle to the console:

@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix nif:   <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<http://example.org/document0#char=0,86>
        a               nif:RFC5147String , nif:String , nif:Context ;
        nif:beginIndex  "0"^^xsd:nonNegativeInteger ;
        nif:endIndex    "86"^^xsd:nonNegativeInteger ;
        nif:isString    "Japan (Japanese: 日本 Nippon or Nihon) is a stratovolcanic archipelago of 6,852 islands."^^xsd:string ;
        nif:topic       <http://example.org/document0#annotation0> .

<http://example.org/document0#char=0,5>
        a                     nif:RFC5147String , nif:String ;
        nif:anchorOf          "Japan"^^xsd:string ;
        nif:beginIndex        "0"^^xsd:nonNegativeInteger ;
        nif:endIndex          "5"^^xsd:nonNegativeInteger ;
        nif:referenceContext  <http://example.org/document0#char=0,86> ;
        itsrdf:taClassRef     <http://example.org/Country> , <http://example.org/StratovolcanicArchipelago> ;
        itsrdf:taIdentRef     <http://example.org/Japan> .

<http://example.org/document0#char=42,68>
        a                     nif:RFC5147String , nif:String ;
        nif:anchorOf          "stratovolcanic archipelago"^^xsd:string ;
        nif:beginIndex        "42"^^xsd:nonNegativeInteger ;
        nif:endIndex          "68"^^xsd:nonNegativeInteger ;
        nif:referenceContext  <http://example.org/document0#char=0,86> ;
        itsrdf:taClassRef     <http://example.org/Archipelago> , rdfs:Class ;
        itsrdf:taIdentRef     <http://example.org/StratovolcanicArchipelago> .

<http://example.org/document0#annotation0>
        a                  nif:Annotation ;
        itsrdf:taIdentRef  <http://example.org/Geography> .
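In the output above, the subject URI of every span consists of the document URI followed by a #char=begin,end fragment, where the end index equals the start position plus the length. The sketch below reproduces this pattern with plain string concatenation. The class NifUris and its spanUri method are hypothetical names used for illustration only; they are not how the library builds its URIs internally.

```java
public class NifUris {

    /**
     * Hypothetical helper: builds a span URI of the form
     * documentUri#char=begin,end with end = start + length.
     */
    static String spanUri(String documentUri, int start, int length) {
        return documentUri + "#char=" + start + "," + (start + length);
    }

    public static void main(String[] args) {
        String doc = "http://example.org/document0";
        System.out.println(spanUri(doc, 0, 86));  // the whole document (context)
        System.out.println(spanUri(doc, 0, 5));   // "Japan"
        System.out.println(spanUri(doc, 42, 26)); // "stratovolcanic archipelago"
    }
}
```

Comparing such expected URIs with the subjects in the generated Turtle is a quick way to spot a wrong start position or length.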

Checking the created NIF

After generating a NIF corpus, it can be helpful to parse the NIF using a NIFParser instance. In our example, we can do this in the following way.

NIFParser parser = new TurtleNIFParser();
parser.parseNIF(nifString);

Our parser implementation checks the position of Span instances by inspecting their first and last characters as well as the single characters preceding and following the span. Changing the length of the named entity "Japan" to the incorrect value 6 would lead to the following warning message:

... WARN [org.aksw.gerbil.io.nif.utils.NIFPositionHelper] - <Found an anormal marking that ends with a whitespace: "'Japan '(Japanese: 日本 Nippon...">

Parsing the created NIF and looking for such warnings can help to find mistakes.
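A similar check can be run before the NIF is written. The following is a rough, stand-alone approximation of the idea described above, using only the Java standard library. The class SpanCheck, its check method and the warning texts are assumptions made for this sketch; the rules applied by the library's NIFPositionHelper may differ.

```java
import java.util.ArrayList;
import java.util.List;

public class SpanCheck {

    // The example text; the two kanji are written as Unicode escapes
    // so that this source file stays ASCII-only.
    static final String TEXT =
            "Japan (Japanese: \u65E5\u672C Nippon or Nihon) is a stratovolcanic archipelago of 6,852 islands.";

    /** Hypothetical helper: returns human-readable warnings for a suspicious span. */
    static List<String> check(String text, int start, int length) {
        List<String> warnings = new ArrayList<>();
        int end = start + length;
        if (start < 0 || length <= 0 || end > text.length()) {
            warnings.add("span is outside the text");
            return warnings;
        }
        // First and last character of the span itself
        if (Character.isWhitespace(text.charAt(start))) {
            warnings.add("starts with whitespace");
        }
        if (Character.isWhitespace(text.charAt(end - 1))) {
            warnings.add("ends with whitespace");
        }
        // Single characters directly before and after the span
        if (start > 0 && Character.isLetterOrDigit(text.charAt(start - 1))) {
            warnings.add("preceded by a letter or digit");
        }
        if (end < text.length() && Character.isLetterOrDigit(text.charAt(end))) {
            warnings.add("followed by a letter or digit");
        }
        return warnings;
    }

    public static void main(String[] args) {
        System.out.println(check(TEXT, 0, 5));  // correct span for "Japan"
        System.out.println(check(TEXT, 0, 6));  // one character too long
        System.out.println(check(TEXT, 42, 5)); // cut short inside a word
    }
}
```

In this sketch, the correct span yields no warnings, the span that is one character too long is reported as ending with whitespace, and the span that stops in the middle of "stratovolcanic" is reported as being followed by a letter.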

Generating an RDF model object

Instead of serializing the NIF information as text, the documents can be written into a Jena RDF Model.

DocumentListWriter listWriter = new DocumentListWriter();
Model nifModel = ModelFactory.createDefaultModel();
listWriter.writeDocumentsToModel(nifModel, documents);