This document is intended as a guide to creating a Linked Data (LD) version of a wordnet, and may also be of interest for other kinds of resources. These guidelines contain advice on vocabulary selection, the RDF generation process, and publication of the results. As a result, the converted language resource is more interoperable and more easily accessible on the Web of Data by means of standard Semantic Web technologies. This document describes the models used and the recommended format (WN-JSON-LD) for representing wordnets on the Web, which is based on lemon. It then describes the InterLingual Index, a common resource for providing interlingual links between wordnets, which is administered by the Global WordNet Association.
This document was published by the Best Practices for Multilingual Linked Open Data community group. It is not a W3C Standard nor is it on the W3C Standards Track.
There are a number of ways that one may participate in the development of this report:
WordNet is one of the most widely used lexical resources in natural language processing. Since the first version of WordNet was released, many resources have been produced that complement WordNet or extend it to other languages.
WordNet is a large lexical database of English nouns, verbs, adjectives and adverbs. Word forms are grouped into more than 117,000 sets of (roughly) synonymous word forms, so-called synsets. These are interconnected by bidirectional arcs that stand for lexical (word-word) and semantic (synset-synset) relations, including hyper/hyponymy (tree-oak), meronymy (tree-branch), antonymy (long-short) and various entailment relations (buy-pay, show-see, untie-tie).
WordNet’s synsets and its network structure yield a rough measure of semantic similarity among words and concepts, in terms of synset membership as well as the number of arcs separating synsets. Due to its availability under open licenses, WordNet has become a popular tool for Word Sense Disambiguation (WSD) and Natural Language Processing in general. Wordnets have been built for around 100 different languages. Most are mapped onto the Princeton WordNet, enabling translation on the lexical level as well as cross-lingual WSD and applications. WordNet continues to evolve both in terms of coverage and representation of meaning. Recent enhancements include the addition of internet language and partially compositional multi-word units. Finally, WordNet has been mapped to formal ontologies, including SUMO and KYOTO.
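The "number of arcs separating synsets" can be illustrated with a breadth-first search over a small graph. The following is a minimal sketch, not the method of any particular WordNet toolkit: the synset labels are toy placeholders (only the tree-oak and tree-branch pairs come from the examples above, and "plant" is an assumed extra node), and the function names `build_graph` and `arc_distance` are hypothetical.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (synset, synset) pairs.

    Arcs are treated as bidirectional, as described above.
    """
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def arc_distance(graph, start, goal):
    """Return the smallest number of arcs between two synsets, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # no path: the synsets are not connected


# Toy data only; these labels are not real WordNet identifiers.
graph = build_graph([
    ("tree", "oak"),      # hyper/hyponymy
    ("tree", "branch"),   # meronymy
    ("plant", "tree"),    # assumed extra hypernym for illustration
])

print(arc_distance(graph, "oak", "branch"))  # 2 (oak - tree - branch)
```

A shorter distance suggests closer meaning, but this is only a rough measure: it treats every relation type as equally informative.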
lemon is a model that has been proposed for the representation of lexicons relative to ontologies. As such, this model is well suited to the representation of semantic networks such as WordNet and defines many useful features for linking a WordNet to wider objects in the Semantic Web/Linked Open Data Cloud. lemon models lexicons by means of a core consisting of the following elements:
In addition to using the lemon vocabulary to model the semantics of WordNet, we use the SKOS vocabulary, as this is better suited to WordNet’s structural model than a formal ontology language such as OWL. Furthermore, we introduce a new vocabulary at http://wordnet-rdf.princeton.edu/ontology# to include properties found only in WordNet.
It is not trivial to apply lemon to the case of a WordNet as there is no clear ontology in WordNet. Clearly, WordNet’s words can be regarded as lemon lexical entries and the word senses correspond well to lemon’s lexical senses. WordNet has lemmas and a separate list of variants of these, and as such we recommend creating a canonical form for each lemma and a form object for each of these variants. Since there is currently no indication in WordNet of what grammatical properties these variants have, we do not recommend attaching additional properties to these variants/forms. As lemon is a model for ontology-lexica, the main question is what the reference of the lexical senses should be. We recommend regarding WordNet’s synsets as ontological references, but instead of assigning them a formal ontological type (e.g., class, property or individual), we introduce a new type Synset as a subclass of Concept in SKOS.
This allows the nature of synsets to be captured without ontologizing the semantic network. Similarly, we introduce relations such as hypernymy and meronymy as new properties rather than attempting to relate them to existing ontological properties such as the subClassOf property used in OWL. In order to capture the new properties, an ontology has been created at http://wordnet-rdf.princeton.edu/ontology
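The modelling decisions above can be sketched as a handful of RDF triples. This is a minimal illustration using only the Python standard library, under several assumptions: the base URI `http://example.org/wn/` and all identifiers under it are hypothetical, the property name `hypernym` in the WordNet ontology namespace is assumed for illustration, and the lemon namespace `http://lemon-model.net/lemon#` is assumed to be the one in use. The helper names `triple` and `model_sense` are likewise hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
LEMON = "http://lemon-model.net/lemon#"              # assumed lemon namespace
WN = "http://wordnet-rdf.princeton.edu/ontology#"    # namespace named in the text
BASE = "http://example.org/wn/"                      # hypothetical data namespace


def triple(subject, predicate, obj):
    """Format one triple in N-Triples syntax (URI objects only)."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def model_sense(entry_id, sense_id, synset_id, hypernym_id=None):
    """Return triples linking a lexical entry to a synset through a sense."""
    entry = BASE + entry_id
    sense = BASE + sense_id
    synset = BASE + synset_id
    triples = [
        # Synset is a subclass of skos:Concept, not an OWL class hierarchy node.
        triple(WN + "Synset", RDFS + "subClassOf", SKOS + "Concept"),
        # A WordNet word becomes a lemon lexical entry.
        triple(entry, RDF + "type", LEMON + "LexicalEntry"),
        # A WordNet word sense becomes a lemon lexical sense.
        triple(entry, LEMON + "sense", sense),
        triple(sense, RDF + "type", LEMON + "LexicalSense"),
        # The synset serves as the reference of the sense.
        triple(sense, LEMON + "reference", synset),
        triple(synset, RDF + "type", WN + "Synset"),
    ]
    if hypernym_id is not None:
        # Hypernymy as a dedicated property; "hypernym" is an assumed name.
        triples.append(triple(synset, WN + "hypernym", BASE + hypernym_id))
    return triples


for line in model_sense("tree-n", "tree-n-1", "synset-tree", "synset-plant"):
    print(line)
```

Keeping hypernymy as its own property, rather than reusing a class-hierarchy property, means consumers are not led to draw formal subclass inferences that the semantic network does not license.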
Another key question concerns the identifiers to use for each element in the data. In order to enable browsability, we recommend that each entry and each synset has its own page and thus its own URI. We do not recommend distributing wordnets as a single flat file, as they are large and this impedes their access on the web. You should derive new identifiers from the existing identifiers in your wordnet. Furthermore, as wordnets have been released in several versions and are still under development, we consider it important to include the version number in the URI. As such, we recommend the following scheme for URIs, as exemplified below:
In collaboration with the Global WordNet Association, we have developed a set of guidelines for creating wordnets as linked data. This can be done using either the LMF-compliant WN-LMF format or the JSON-LD enabled schema below. The JSON-LD schema consists of the following elements: