Guidelines for Linguistic Linked Data Generation: Bilingual Dictionaries

This document is aimed to guide in the process of creating a linked data (LD) version of a lexical resource, particularly a bilingual dictionary. It contains advice on the vocabularies selection, RDF generation process, and publication of the results. As result of publishing the data as LD, the converted language resource will be more interoperable and easily accessible on the Web of Data by means of standard Semantic Web technologies. The process described in this document has been illustrated with real examples extracted from Apertium RDF, an open-source machine translation system which has their data available for download.

There are a number of ways that one may participate in the development of this report:

Mailing list: public-bpmlod@w3.org
Wiki: Main page
More information about meetings of the BPMLOD group can be obtained here
Source code for this document can be found on Github.

Description of the type of resource

The type of language resources covered in this document is bilingual electronic dictionaries. A bilingual dictionary is a specialized dictionary used to translate words or phrases from one language to another. They can be unidirectional or bidirectional, allowing translation, in the latter case, to and from both languages. In addition to the translation, a bilingual dictionary usually indicates the part of speech, gender, verb type, declination model and other grammatical properties to help a non-native speaker use the word.

We are interested in bilingual dictionaries that have their data in a machine-processable format, no matter whether it is stored locally or is accessible on the Web (e.g., for download). We assume that the data is represented in a structured or semi-structured way (e.g., relational database, xml, csv, etc.).

We will illustrate our discussion with real examples from the conversion of the Apertium dictionaries into RDF [AP_RDF]. Apertium [AP_PAPER] is a free/open-source machine translation platform originally designed to translate between closely related languages, although it has recently been expanded to treat more divergent language pairs. There exist Lexical Markup Framework [LMF] versions of their linguistic data which can be found here and have been used as starting point for the RDF version.

Selection of vocabularies

We propose lemon (LExicon Model for ONtologies) [LEMON, LEMON_PAPER] to model the RDF representation of the linguistic descriptions contained in the bilingual dictionaries. lemon has been designed to extend the lexical layer of ontologies with as much linguistic information as needed, and to provide it as linked data on the Web. From lemon we take mechanisms to represent lexicons, lexical entries, forms, and lexical senses.
The use of lemon is complemented with Lexinfo [LEXINFO]. Lexinfo is an ontology of types, values and properties to be used with the lemon model, partially derived from ISOcat. We use Lexinfo as a catalog of data categories (e.g., to denote gender, number, part of speech, etc.).
We will use the lemon Translation Module [TR, TR_PAPER] for representing translations. The translation module consists essentially of two classes: Translation and TranslationSet. Translation is a reification of the relation between two lemon lexical senses associated to terms in different languages. The idea of using a reified class allows us to describe some attributes of the Translation object itself, basically: translationSource, translationTarget, translationConfidence, context, and translationCategory.
Translation categories are represented by pointing to an external catalog (e.g. to state that a translation is a "cultural equivalent"). We propose the one at [TRCAT] but any other could be used instead.
Other extendedly used vocabularies such as Dublin Core [DC] are used to attach valuable information about provenance, authoring, versioning, or licensing.
Finally, the Data Catalogue Vocabulary [DCAT] will be used to represent other metadata information associated to the publication of the RDF dataset.

Both lemon and the Translation Module are currently under revision by the W3C Ontolex community group [ONTOLEX]. Nevertheless, the resultant model is expected to be backwards compatible with the current ones. Thus, the content of this guideline should remain valid for its use with the future model.

We summarize in the following table a list of relevant namespaces that will be used in the rest of this document.

Table 1: Namespaces of the relevant vocabularies
owl	<http://www.w3.org/2002/07/owl#>
rdfs	<http://www.w3.org/2000/01/rdf-schema#>
lemon	<http://www.lemon-model.net/lemon#>
lexinfo	<http://www.lexinfo.net/ontology/2.0/lexinfo#>
tr	<http://purl.org/net/translation#>
trcat	<http://purl.org/net/translation-categories#>
dc	<http://purl.org/dc/elements/1.1/>
dct	<http://purl.org/dc/terms/>
dcat	<http://www.w3.org/ns/dcat#>
apertium	<http://linguistic.linkeddata.es/id/apertium/>

RDF generation

For the generation and publication processes we have followed the recommendations in [GUIDE_MLD], adapted to our particular case.

Analysis of the data sources

The first activity of the publication of Linked Data is to analyse and specify the resources that will be used as source of data, as well as the data model(s) used within such sources. The analysis covers two aspects

Data model. All available information about the data model used in the sources has to be analysed, comprising standards, terminologies, etc.
Content. The data underlying such models has to be analysed also, and their linguistic features examined: e.g., to identify language dependent/independent information, to understand how names and identifiers have been constructed in the source data, how language have been encoded, etc.

The result of this phase is strongly dependent on the particular data source and its representation formalism. The general advice would be to get a good understanding of how the original dictionary is represented in order to define proper conversion rules of the original data into RDF.

Regarding our illustrating example (the Apertium EN-ES dictionary), the model used for representing the data is [LMF]. The following lines of code illustrate how the content is represented in LMF/XML for a single translation:

	
<Lexicon>
   <feat att="language" val="en"/>
   ...
   <LexicalEntry id="bench-n-en">
      <feat att="partOfSpeech" val="n"/>
	  <Lemma>
	     <feat att="writtenForm" val="bench"/>
	  </Lemma>
      <Sense id="bench_banco-n-l"/>
   </LexicalEntry>
   ...
</Lexicon>
<Lexicon>
    <feat att="language" val="es"/>
    ...
    <LexicalEntry id="banco-n-es">	
	   <feat att="partOfSpeech" val="n"/>
	   <Lemma>
	       <feat att="writtenForm" val="banco"/>
	   </Lemma>
	   <Sense id="banco_bench-n-r"/>
    </LexicalEntry>
    ...
</Lexicon>
...
<SenseAxis id="bench_banco-n-banco_bench-n" senses="bench_banco-n-l banco_bench-n-r"/>
...

Modelling

The first step in the modelling phase is the selection of the domain vocabularies to be used. This has been already discussed in the above section "selection of vocabularies". Next, it has to be decided how the representation scheme of the source data has to be mapped into the new model. In the case of bilingual dictionaries, each dictionary is converted into three different objects in RDF (no matter if the original data comes in one or several files):

Source lexicon
Target lexicon
Translation Set

This is illustrated in the following figure, for the conversion of an English-Spanish bilingual dictionary.

Example RDF dictionary conversion — Conversion of a bilingual electronic dictionary into RDF

In our opinion, this is the division that fits more naturally in the scheme of lemon and the translation module. As result, two independent monolingual lexicons will be published on the Web of Data, along with a set of translations that connects them. The publication of additional bilingual dictionaries (following the same scheme) would imply the creation of a pool of online monolingual lexicons that grows with time, all of them potentially connected within the same RDF graph by sets of translations.

Going into the details of the model, the following figure illustrates the representation scheme used for a single translation, in terms of lemon and the translation module:

Model for representing a single translation — Modelling a translation in RDF

In short, lemon:LexicalEntry and their associated properties are used to account for the lexical information, while the tr:Translation class puts them in connection through lemon:LexicalSense. Other options are possible, of course, such as connecting directly the lexical entries without defining "intermediate" senses. Nevertheless, we understand that translations occur between specific meanings of the words and the class lemon:LexicalSense allows us to represent this fact explicitly.

URIs design

Among the different patterns and recommendations for defining URIs we propose the one at [ISA_URIS] although others could be used instead. In short, the ISA pattern is as follows: http://{domain}/{type}/{concept}/{reference}, where {type} should be one of a small number of possible values that declare the type of resource that is being identified. Typical examples include: 'id' or 'item' for real world objects; 'doc' for documents that describe those objects; 'def' for concepts; 'set' for datasets; or a string specific to the context, such as 'authority' or 'dcterms'.

In our example, the main components (lexicons and translation set) of the RDF bilingual dictionary are named as follows:

	
Apertium English lexicon:  http://linguistic.linkeddata.es/id/apertium/lexiconEN 
Apertium Spanish lexicon:  http://linguistic.linkeddata.es/id/apertium/lexiconES 
Apertium English-Spanish translation set:  http://linguistic.linkeddata.es/id/apertium/tranSetEN-ES

In order to construct the URIs of the lexical entries, senses, and rest of lexical elements, we have preserved the identifiers of the original data whenever possible, propagating them into the RDF representation. Some minor changes have been introduced, though. For instance, in the original data the identifier of the lexical entries ended with the particle "-l" or "-r" depending on their role as "source" or "target" in the translation. In our case, the directionality is not preserved at the level of Lexicon (but in the Translation class) so these particles are removed from the name. In addition, some other suffixes have been added for readability (this step is optional): "-form" for lexical forms, "-sense" for lexical senses, and "-trans" for translation. See the following section "generation" for particular examples.

Generation

This activity deals with the transformation into RDF of the selected data sources using the representation scheme chosen in the modelling activity. Technically speaking, there are a number of tools that can be used to assist the developer in this task (see here for a survey), depending on the format of the data source. In our case, Open Refine (with its RDF extension) was used for defining the transformations from XML into RDF.

As result of the transformation, three RDF files were generated, one per component (lexicons and translation set). The following examples contain the RDF code (in turtle) of a single translation. The three pieces of code come from the EN and ES lexicons and from the EN_ES translation set, respectively, of the Apertium example:

apertium:lexiconEN a lemon:Lexicon ;
	dc:source <http://hdl.handle.net/10230/17110> .
...
apertium:lexiconEN lemon:entry apertium:lexiconEN/bench-n-en .
apertium:lexiconEN/bench-n-en a lemon:LexicalEntry ;
	lemon:lexicalForm apertium:lexiconEN/bench-n-en-form ;
	lexinfo:partOfSpeech lexinfo:noun .
apertium:lexiconEN/bench-n-en-form a lemon:Form ;
	lemon:writtenRep "bench"@en .

apertium:lexiconES a lemon:Lexicon ;
	dc:source <http://hdl.handle.net/10230/17110> .
...
apertium:lexiconES lemon:entry apertium:lexiconES/banco-n-es .
apertium:lexiconES/banco-n-es a lemon:LexicalEntry ;
	lemon:lexicalForm apertium:lexiconES/banco-n-es-form ;
	lexinfo:partOfSpeech lexinfo:noun .
apertium:lexiconES/banco-n-es-form a lemon:Form ;
	lemon:writtenRep "banco"@es .

apertium:tranSetEN-ES a tr:TranslationSet ;
	dc:source <http://hdl.handle.net/10230/17110> ;
...	
apertium:tranSetEN-ES tr:trans apertium:tranSetEN-ES/bench_banco-n-en-sense-banco_bench-n-es-sense-trans .
apertium:tranSetEN-ES/bench_banco-n-en-sense a lemon:LexicalSense ;
	lemon:isSenseOf apertium:lexiconEN/bench-n-en .
apertium:tranSetEN-ES/banco_bench-n-es-sense a lemon:LexicalSense ;
	lemon:isSenseOf apertium:lexiconES/banco-n-es .
apertium:tranSetEN-ES/bench_banco-n-en-sense-banco_bench-n-es-sense-trans a tr:Translation ;
	tr:translationSource apertium:tranSetEN-ES/bench_banco-n-en-sense ;
	tr:translationTarget apertium:tranSetEN-ES/banco_bench-n-es-sense .

Reproducibility is an important feature, so the mappings between the original data and the new RDF-based model, as well as the scripts for the RDF generation, should be recorded and stored to enable their later reuse.

Publication

The publication step involves: (1) dataset publication, (2) metadata publication, and (3) enabling effective discovery. Here we will focus on the second task. In the context of Linked Data, there are two major vocabularies for publishing metadata for describing datasets and catalogues: VoID (Vocabulary of Interlinked Datasets) [VOID], and DCAT (Data Catalogue Vocabulary) [DCAT]. In principle, we think that DCAT suffices for the purposes of describing the elements generated in the RDF conversion of bilingual dictionaries. Further, some data management platforms such as Datahub use DCAT in a preferred way for representing metadata. In any case, DCAT can be complemented with VoID or other vocabularies if required.

The RDF version of Apertium EN-ES was published in Datahub. The Datahub platform created a metadata file for the Apertium EN-ES dataset based on DCAT. We extended such metadata file with some additional missing information such as provenance, license, and related resources. The extended metadata was published as part of the Apertium EN-ES Datahub entry. The following lines are a fragment of it:

<dcat:Dataset rdf:about="http://linguistic.linkeddata.es/set/apertium/EN-ES">
   <owl:sameAs rdf:resource="http://datahub.io/dataset/apertium-en-es"></owl:sameAs>
   <dct:source rdf:resource="http://hdl.handle.net/10230/17110"></dct:source>
   <dct:license rdf:resource="http://purl.oclc.org/NET/rdflicense/gpl-3.0"></dct:license>
   <rdfs:seeAlso rdf:resource="http://dbpedia.org/resource/Apertium"></rdfs:seeAlso>
   <rdfs:seeAlso rdf:resource="http://purl.org/ms-lod/UPF-MetadataRecords.ttl#Apertium-en-es_resource-5v2"></rdfs:seeAlso>
</dcat:Dataset>

Recommendations

Separate the monolingual lexicons from the translation sets (different graphs and/or files).
Lexical senses should play the role of connectors between translations and lexical entries.
Be consistent with the rules for naming and URIs creation.
Keep the identifiers of the legacy data if possible, but removing indicators of directionality if any (e.g., "l", "r", "left", "right", ...)