Guidelines for Linked Data corpus creation using NIF

This document describes best practices for generating Linked Data text corpora using the NLP Interchange Format (NIF). NIF is an RDF/OWL-based format that aims to achieve interoperability between NLP tools, language resources and annotations. It can be used to assign URIs to strings and to annotate the resulting resources. The Brown corpus serves as the running example throughout these guidelines.

Table 1: Namespaces of the relevant vocabularies
owl	<http://www.w3.org/2002/07/owl#>
rdfs	<http://www.w3.org/2000/01/rdf-schema#>
nif	<http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>
itsrdf	<http://www.w3.org/2005/11/its/rdf#>
olia	<http://purl.org/olia/olia.owl#>
nerd	<http://nerd.eurecom.fr/ontology#>
prov	<http://www.w3.org/ns/prov#>
xsd	<http://www.w3.org/2001/XMLSchema#>
brown	<http://purl.org/olia/brown.owl#>

String annotation

After URIs have been assigned to the meaningful strings of the corpus, these URIs can be annotated using the NIF core ontology. The central class is nif:String, the class of all strings of Unicode characters (words in the formal-language sense, not only linguistic words). Strings offer properties to describe their content (the Unicode character string itself, via nif:anchorOf), their position relative to other strings by means of text indices (via nif:beginIndex and nif:endIndex, both mandatory), and various kinds of semantic information, such as part-of-speech tags, sentiment values or word stems.

Context

nif:Context is a subclass of nif:String. The Context represents the whole document and serves as the reference point for all of its substrings. It must have either a nif:isString property, which contains the string content of the document, or a nif:contextStringRef property pointing to a URL where the same string content can be obtained. In both cases, the string content must be stripped of any markup and encoded as Unicode. Furthermore, these Unicode strings should be in Unicode Normalization Form C (NFC), barring compelling reasons to use Normalization Form D (NFD). In our case, embedding the primary text data looks like this:

<http://brown.nlp2rdf.org/corpus/a01.xml#offset_0_161>
        a nif:String , nif:Context , nif:OffsetBasedString ;
        nif:isString """The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced ``no evidence'' that any irregularities took place. [...]"""^^xsd:string ;
        nif:beginIndex "0"^^xsd:int ;
        nif:endIndex "161"^^xsd:int ;
        nif:sourceUrl <http://icame.uib.no/brown/bcm.html> .

Here is the equivalent context representation using a stand-off annotation approach:

<http://brown.nlp2rdf.org/corpus/a01.xml#offset_0_161>
        a nif:String , nif:Context , nif:OffsetBasedString ;
        nif:contextStringRef <http://brown.nlp2rdf.org/corpus/a01.text> ; 
        nif:beginIndex "0"^^xsd:int ;
        nif:endIndex "161"^^xsd:int ;
        nif:sourceUrl <http://icame.uib.no/brown/bcm.html> .

The nif:beginIndex of the Context is always 0, because the Context represents the whole document. Its nif:endIndex is simply the length of the string.
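The index rules for a Context can be sketched in a few lines of Python. This is only an illustration, not part of NIF itself: the function name context_resource and the returned dictionary layout are hypothetical, and the sketch assumes that offsets count Unicode code points (which is what Python's len returns) after normalization to NFC.

```python
import unicodedata

# Document URI taken from the Brown corpus example above.
BASE = "http://brown.nlp2rdf.org/corpus/a01.xml"


def context_resource(text: str, base: str = BASE) -> dict:
    """Return URI, string content and indices for a whole-document context."""
    # Normalize to NFC first, so that offsets refer to the recommended form.
    normalized = unicodedata.normalize("NFC", text)
    begin = 0                # a context always starts at index 0
    end = len(normalized)    # assumption: offsets count Unicode code points
    return {
        "uri": f"{base}#offset_{begin}_{end}",
        "isString": normalized,
        "beginIndex": begin,
        "endIndex": end,
    }


# "e" followed by a combining acute accent collapses to one code point in NFC,
# so the end index differs from the length of the raw input.
raw = "Cafe\u0301 au lait"
ctx = context_resource(raw)
print(len(raw), ctx["endIndex"], ctx["uri"])
```

Normalizing before counting matters: if offsets were computed on the raw input and the normalized text were published, every index after the first combining sequence would be shifted.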

Sentences, Words and other strings

Substrings of the nif:Context can be anything from a single word to sentences and paragraphs. They link to the relevant Context resource via nif:referenceContext. Beginning and end indices always refer to the string content represented by the context.
The first sentence of our document is represented as follows:

<http://brown.nlp2rdf.org/corpus/a01.xml#offset_0_155>
        a nif:String , nif:Sentence , nif:OffsetBasedString ;
        nif:anchorOf """The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced ``no evidence'' that any irregularities took place."""^^xsd:string ;
        nif:referenceContext <http://brown.nlp2rdf.org/corpus/a01.xml#offset_0_161> ;
        nif:beginIndex "0"^^xsd:int ;
        nif:endIndex "155"^^xsd:int .

Note that the property nif:anchorOf may be used to explicate the annotated string. Words are annotated the same way. The first word of the document is annotated as follows:

<http://brown.nlp2rdf.org/corpus/a01.xml#offset_0_3>
        a nif:String , nif:Word , nif:OffsetBasedString ;
        nif:anchorOf "The"^^xsd:string ;
        nif:referenceContext <http://brown.nlp2rdf.org/corpus/a01.xml#offset_0_161> ;
        nif:oliaLink brown:AT ;
        nif:nextWord <http://brown.nlp2rdf.org/corpus/a01.xml#offset_4_10> ;
        nif:sentence <http://brown.nlp2rdf.org/corpus/a01.xml#offset_0_155> ;
        nif:beginIndex "0"^^xsd:int ;
        nif:endIndex "3"^^xsd:int .

Words may link to the following word (nif:nextWord) as well as to their sentence (nif:sentence). Further annotations can be added to words. Note the nif:oliaLink property, which assigns a part-of-speech (POS) tag to the word.
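The offset-based URIs and the nif:nextWord chain shown above can be derived mechanically from the context string. The following Python sketch illustrates this under a loud assumption: it splits on whitespace only, whereas the Brown corpus has its own tokenization (for example, punctuation marks are separate tokens). The helper names offset_uri and word_resources are hypothetical.

```python
import re

# Document URI taken from the Brown corpus example above.
BASE = "http://brown.nlp2rdf.org/corpus/a01.xml"


def offset_uri(begin: int, end: int, base: str = BASE) -> str:
    """Build an offset-based string URI of the form <base>#offset_<begin>_<end>."""
    return f"{base}#offset_{begin}_{end}"


def word_resources(context: str) -> list:
    """Describe each whitespace-delimited token of the context string.

    Assumption: whitespace tokenization; a real corpus brings its own tokens.
    """
    spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", context)]
    words = []
    for i, (begin, end) in enumerate(spans):
        following = spans[i + 1] if i + 1 < len(spans) else None
        words.append({
            "uri": offset_uri(begin, end),
            "anchorOf": context[begin:end],   # indices refer to the context
            "beginIndex": begin,
            "endIndex": end,
            "nextWord": offset_uri(*following) if following else None,
        })
    return words


text = "The Fulton County Grand Jury said Friday"
words = word_resources(text)
for w in words[:2]:
    print(w["uri"], repr(w["anchorOf"]), w["nextWord"])
```

Because every index refers to the context string, a cheap consistency check for any substring resource is that the context sliced from beginIndex to endIndex equals the nif:anchorOf value.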

Annotations - general concepts

NIF 2.1 distinguishes between two general kinds of annotations:

text span annotations (nif:TextSpanAnnotation) mark portions of text as carrying an intrinsic characteristic that does not need to be specified further using property assertions (e.g. marking a word or phrase as a mention of a named entity, or marking a phrase as a direct quotation)
property assertion annotations (nif:PropertyAssertionAnnotation) associate a portion of text with fact assertions (e.g. providing a part-of-speech annotation for a word, or linking an occurrence of a named entity in the text to an information resource representing the entity, for instance a DBpedia resource)

It is also often important to provide provenance and confidence information for annotations, especially when they were generated (semi-)automatically by NLP tools. A generic property, nif:confidence, is provided to express confidence values. To express provenance, appropriate elements of the PROV ontology should be used, particularly prov:wasGeneratedBy and prov:wasAttributedTo. When several annotations are to be expressed for the same NIF string instance, additional resources typed nif:AnnotationUnit may need to be introduced to associate provenance and confidence information unambiguously. The following section on named entity annotation illustrates these concepts.

Named Entities

Depending on how much information is available, named entities have to be annotated in different ways. If some annotator or tool was just able to spot the occurrence of a named entity without identifying it conclusively, nif:EntityMention is called for (which is a plain text span annotation).

<http://brown.nlp2rdf.org/corpus/a01.xml#offset_4_10>
        a nif:String , nif:Word , nif:OffsetBasedString ;
        nif:anchorOf "Fulton"^^xsd:string ;
        nif:referenceContext <http://brown.nlp2rdf.org/corpus/a01.xml#offset_0_161> ;
        nif:beginIndex "4"^^xsd:int ;
        nif:endIndex "10"^^xsd:int ;
        a nif:EntityMention ;
        nif:confidence "0.98"^^xsd:decimal .

In most cases, you will know the coarse-grained class the entity belongs to (i.e. whether it is a Person, a Location, an Organization, etc.). To annotate these classes, please refer to the NERD ontology. As with OLiA for part-of-speech tags, it maps different named entity types to single resources, increasing interoperability. The relevant property for annotation is nif:taNerdCoreClassRef.

On the other hand, if you have a direct link to, for example, the respective DBpedia resource, you should link it as well. For this purpose you can use the itsrdf:taIdentRef property from the Internationalization Tag Set (ITS) Version 2.0 vocabulary, which enables the integration of automated processing of human language into core Web technologies. ITS 2.0 provides properties, also called data categories, that can be used to express information related to machine translation, terminology, text analysis, provenance, confidence, etc. All ITS 2.0 properties are compatible with NIF and can be used in combination with it.

Although there are no named entity annotations in the Brown corpus itself, we add some for illustration in the following example. Since three separate pieces of annotation information are added to the same NIF string, nif:AnnotationUnit resources are used to associate provenance and confidence information unambiguously:

<http://brown.nlp2rdf.org/corpus/a01.xml#offset_4_10>
        a nif:String , nif:Word , nif:OffsetBasedString ;
        nif:anchorOf "Fulton"^^xsd:string ;
        nif:referenceContext <http://brown.nlp2rdf.org/corpus/a01.xml#offset_0_161> ;
        nif:oliaLink brown:NP ;
        nif:beginIndex "4"^^xsd:int ;
        nif:endIndex "10"^^xsd:int ;
        nif:annotationUnit [
            a nif:EntityMention ;
            nif:confidence "0.98"^^xsd:decimal ;
            prov:wasAttributedTo <http://aksw.org/MarkusAckermann.rdf>
        ] ;
        nif:annotationUnit [
            nif:taNerdCoreClassRef nerd:Location ;
            nif:confidence "0.85"^^xsd:decimal ;
            prov:wasAttributedTo <http://aksw.org/MartinBruemmer.rdf>
        ] ;
        nif:annotationUnit [
            itsrdf:taIdentRef <http://dbpedia.org/page/Fulton_County,_Georgia> ;
            nif:confidence "0.72"^^xsd:decimal ;
            prov:wasAttributedTo <http://aksw.org/MartinBruemmer.rdf>
        ] .
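Each annotation unit in the example above follows the same pattern: one or more assertions, a confidence value, and a provenance link, separated by semicolons inside a blank node. When such fragments are generated by a script, the separators are easy to get wrong, so the following Python sketch renders one unit from its parts. The function annotation_unit and its parameters are hypothetical; in practice an RDF library would normally handle serialization. The confidence is passed as a string so that its lexical form is written out unchanged.

```python
def annotation_unit(assertions, confidence, attributed_to, indent="        "):
    """Render one nif:annotationUnit blank node as a Turtle fragment.

    assertions    -- list of (predicate, object) pairs already in Turtle syntax
    confidence    -- lexical form of an xsd:decimal, e.g. "0.98"
    attributed_to -- URI of the agent the annotation is attributed to
    """
    lines = [f"{predicate} {obj}" for predicate, obj in assertions]
    lines.append(f'nif:confidence "{confidence}"^^xsd:decimal')
    lines.append(f"prov:wasAttributedTo <{attributed_to}>")
    # Statements inside the blank node are joined with " ;"; the last has none.
    inner = " ;\n".join(f"{indent}    {line}" for line in lines)
    return f"{indent}nif:annotationUnit [\n{inner}\n{indent}]"


unit = annotation_unit(
    [("a", "nif:EntityMention")],
    "0.98",
    "http://aksw.org/MarkusAckermann.rdf",
)
print(unit)
```

Several units for the same string can then be joined with " ;" and the subject's final unit closed with " .", as in the Turtle example.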

POS Annotations

Part-of-speech annotations are handled in NIF via OLiA. In general, OLiA is a set of ontologies that map corpus- or tool-specific annotations to a reference model. While NIF aims to provide syntactic interoperability between NLP tools and relevant corpora, OLiA provides a component of semantic interoperability by mapping the disparate annotation terms used by different tools and corpora to entities of a common reference model. Differences in annotation terminology range from minor differences in the choice of tag names to more fundamental variations. For each of these annotation schemes, OLiA provides an Annotation Model, as well as a Linking Model that connects it to the common Reference Model. The Reference Model contains basic POS classes that abstract from the specific POS tags used in individual tagsets.

To transform POS annotations into OLiA, visit the OLiA page and search for the annotation model that matches your tagset. In our case, it is the Brown annotation model. Now POS tags found in the corpus can just be appended to the annotation model’s URI:

<http://brown.nlp2rdf.org/corpus/a01.xml#offset_0_3>
        nif:oliaLink <http://purl.org/olia/brown.owl#AT> .
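The URI construction described above amounts to appending the tag to the annotation model's URI as a fragment. A minimal Python sketch follows; the function names olia_link and olia_triple are hypothetical, and the sketch assumes that the tag is a plain alphanumeric name such as AT or NP. Tags containing characters that are not valid in a URI fragment would need a mapping to the names actually defined in the annotation model, which this sketch does not attempt.

```python
# URI of the Brown annotation model, as used in the example above.
BROWN_MODEL = "http://purl.org/olia/brown.owl"


def olia_link(tag: str, model_uri: str = BROWN_MODEL) -> str:
    """Append a POS tag to the annotation model URI as a fragment.

    Assumption: the tag is a simple name that is valid as a URI fragment.
    """
    return f"{model_uri}#{tag}"


def olia_triple(word_uri: str, tag: str) -> str:
    """Render the nif:oliaLink statement for one word as a Turtle fragment."""
    return f"<{word_uri}>\n        nif:oliaLink <{olia_link(tag)}> ."


print(olia_triple("http://brown.nlp2rdf.org/corpus/a01.xml#offset_0_3", "AT"))
```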

In practice, aggregations of different corpora and tool outputs can then be queried via their links to the OLiA reference model. This makes it possible, for example, to aggregate the output of NLP tools that use two different tagsets and to query it for all words of the type "adjective".

Introduction