This document aims to give advice on representing lexicographic data as linked data on the Web. Lexicographic linked data can either be migrated from earlier, non-Semantic Web data sources or built from scratch using linked data mechanisms. We will not discuss the technical aspects of the data creation or conversion process here, but will focus on modelling issues related to the lexicographic nature of the data. To that end, we focus on Ontolex lemon and its lexicog module as modelling choices, giving advice on how to use them with real data examples. TBC
There are a number of ways that one may participate in the development of this report:
-------------------
UNDER CONSTRUCTION
-------------------
Types of resources
The main type of language resource covered in this document is the dictionary, or any other lexicographic resource whose data exists in a machine-processable format, whether it is stored locally or is accessible on the Web (e.g., for download). We assume that the data is represented in a structured or semi-structured way (e.g., relational database, XML, CSV).
We will illustrate our discussion with real examples from the LiLa project [LILA]. DESCRIBE LILA BRIEFLY
Selected Vocabularies
TBC
Here we will mention Ontolex and lexicog
- We propose Ontolex lemon (LExicon Model for ONtologies) [ONTOLEX, ONTOLEX_PAPER] to model the RDF representation of the linguistic descriptions contained in the electronic dictionaries. Ontolex lemon has been designed to extend the lexical layer of ontologies with as much linguistic information as needed, and to provide it as linked data on the Web. It is made up of a series of modules, of which we will use the following three:
- core, where lexical entries, forms, and lexical senses are defined,
- lime, where the lexicon is defined,
- lexicog, where a number of entities to support lexicographic information are defined (see Section 3).
- The use of lemon is complemented with Lexinfo [LEXINFO]. Lexinfo is an ontology of types, values and properties to be used with the lemon model, partially derived from ISOcat. We use Lexinfo as a catalog of data categories (e.g., to denote gender, number, part of speech, etc.).
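As a rough illustration of how the core and lime modules combine with Lexinfo, the following Python sketch builds a handful of triples for a single Latin noun and prints them as N-Triples. The `http://example.org/lexicon/` namespace, the resource names (`ex:rosa`, `ex:rosa_form`, `ex:rosa_sense1`) and the choice of entry are assumptions made for this example and are not LiLa data; only the ontolex, lime and lexinfo terms come from the vocabularies themselves.

```python
# Minimal sketch: one lexicon with one lexical entry, its canonical form,
# one sense, and Lexinfo data categories. All "ex:" names are hypothetical.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ontolex": "http://www.w3.org/ns/lemon/ontolex#",
    "lime": "http://www.w3.org/ns/lemon/lime#",
    "lexinfo": "http://www.lexinfo.net/ontology/3.0/lexinfo#",
    "ex": "http://example.org/lexicon/",  # assumed namespace
}

def expand(term):
    """Turn a CURIE such as 'ontolex:Form' into an N-Triples IRI.

    Terms that start with a double quote are treated as literals and
    returned unchanged.
    """
    if term.startswith('"'):
        return term
    prefix, local = term.split(":", 1)
    return "<" + PREFIXES[prefix] + local + ">"

triples = [
    # lime: the lexicon and its language
    ("ex:lexicon", "rdf:type", "lime:Lexicon"),
    ("ex:lexicon", "lime:language", '"la"'),
    ("ex:lexicon", "lime:entry", "ex:rosa"),
    # core: lexical entry, form and sense
    ("ex:rosa", "rdf:type", "ontolex:LexicalEntry"),
    ("ex:rosa", "ontolex:canonicalForm", "ex:rosa_form"),
    ("ex:rosa_form", "rdf:type", "ontolex:Form"),
    ("ex:rosa_form", "ontolex:writtenRep", '"rosa"@la'),
    ("ex:rosa", "ontolex:sense", "ex:rosa_sense1"),
    ("ex:rosa_sense1", "rdf:type", "ontolex:LexicalSense"),
    # Lexinfo: data categories attached to the entry
    ("ex:rosa", "lexinfo:partOfSpeech", "lexinfo:noun"),
    ("ex:rosa", "lexinfo:gender", "lexinfo:feminine"),
]

def to_ntriples(triples):
    """Serialize (subject, predicate, object) CURIE tuples as N-Triples lines."""
    return "\n".join(
        " ".join(expand(term) for term in triple) + " ." for triple in triples
    )

print(to_ntriples(triples))
```

In a real conversion an RDF library would normally take care of serialization; plain tuples are used here only to keep the example free of dependencies.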
Table 1: Namespaces of the relevant vocabularies
owl | <http://www.w3.org/2002/07/owl#> |
rdfs | <http://www.w3.org/2000/01/rdf-schema#> |
ontolex | <http://www.w3.org/ns/lemon/ontolex#> |
lime | <http://www.w3.org/ns/lemon/lime#> |
vartrans | <http://www.w3.org/ns/lemon/vartrans#> |
lexicog | <http://www.w3.org/ns/lemon/lexicog#> |
lexinfo | <http://www.lexinfo.net/ontology/3.0/lexinfo#> |
lila | <http://lila-erc.eu/data/> |
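The namespaces in Table 1 are typically declared once at the top of a Turtle file and then used to abbreviate IRIs. The sketch below is one possible way to do this in Python; the helper names `turtle_header` and `compact` are hypothetical, and the vartrans and lexicog namespaces are written with the trailing `#` used in the OntoLex specifications.

```python
# Prefix map mirroring Table 1.
NAMESPACES = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "ontolex": "http://www.w3.org/ns/lemon/ontolex#",
    "lime": "http://www.w3.org/ns/lemon/lime#",
    "vartrans": "http://www.w3.org/ns/lemon/vartrans#",
    "lexicog": "http://www.w3.org/ns/lemon/lexicog#",
    "lexinfo": "http://www.lexinfo.net/ontology/3.0/lexinfo#",
    "lila": "http://lila-erc.eu/data/",
}

def turtle_header(namespaces):
    """Return one Turtle '@prefix' declaration per namespace."""
    return "\n".join(
        f"@prefix {prefix}: <{iri}> ." for prefix, iri in namespaces.items()
    )

def compact(iri, namespaces):
    """Shorten a full IRI to 'prefix:local' when a namespace matches.

    Falls back to the bracketed full IRI when no namespace matches or when
    the local part is not a plain name (for example, it contains '/'),
    since such names would need escaping in Turtle.
    """
    matches = [(p, ns) for p, ns in namespaces.items() if iri.startswith(ns)]
    if not matches:
        return f"<{iri}>"
    prefix, ns = max(matches, key=lambda match: len(match[1]))
    local = iri[len(ns):]
    if not local.replace("_", "").replace("-", "").isalnum():
        return f"<{iri}>"
    return f"{prefix}:{local}"

print(turtle_header(NAMESPACES))
print(compact("http://www.w3.org/ns/lemon/lexicog#Entry", NAMESPACES))
```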
Lexicog in a nutshell
TBC
Guidelines/Best practices
TBC
When to choose lexicog
[extracted from the lexicog specification:]
- As long as the entities in OntoLex and the other lemon modules, together with those of catalogues of linguistic categories (e.g. LexInfo), suffice to represent the information encoded in the lexicographic resource (e.g., lexical entry, part of speech, translation, ...), the OntoLex lexicography module need not be applied.
- In the case of lexicographic information that cannot be modelled by using either OntoLex or any of the other lemon modules (e.g., to denote sense ordering), the OntoLex lexicography module should be used, while avoiding redundancies and keeping additional information to a minimum.
The reason behind this is that this module adds some complexity by providing additional description capabilities on top of the purely lexical description accounted for by OntoLex. If this information is not needed for a specific conversion, i.e., if the lexicographic view is not essential, reusing lemon alone keeps the representation simpler yet sufficient.
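Sense ordering, the case mentioned above, can be sketched as follows. This Python example assumes that the order of senses in a dictionary entry is recorded by wrapping each sense in a `lexicog:LexicographicComponent` and attaching the components to a `lexicog:Entry` through the RDF container membership properties `rdf:_1`, `rdf:_2`, and so on. The function name `ordered_sense_triples` and all `ex:` identifiers are hypothetical and only serve the illustration.

```python
def ordered_sense_triples(entry, lexical_entry, senses):
    """Build CURIE triples that record the order of senses with lexicog.

    entry          -- identifier of the lexicog:Entry (dictionary entry)
    lexical_entry  -- identifier of the ontolex:LexicalEntry it describes
    senses         -- sense identifiers, in the order given by the dictionary
    """
    triples = [
        (entry, "rdf:type", "lexicog:Entry"),
        (entry, "lexicog:describes", lexical_entry),
    ]
    for position, sense in enumerate(senses, start=1):
        # One component per sense; rdf:_n carries the position.
        component = f"{entry}_comp{position}"
        triples += [
            (entry, f"rdf:_{position}", component),
            (component, "rdf:type", "lexicog:LexicographicComponent"),
            (component, "lexicog:describes", sense),
        ]
    return triples

# Hypothetical entry with two ordered senses.
triples = ordered_sense_triples(
    "ex:rosa_entry", "ex:rosa", ["ex:rosa_sense1", "ex:rosa_sense2"]
)
for subject, predicate, obj in triples:
    print(subject, predicate, obj, ".")
```

If the order of senses carries no meaning in the source dictionary, none of these extra resources are needed and plain `ontolex:sense` links from the lexical entry suffice, in line with the first recommendation above.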
When to go beyond lexicog
TBC
How to choose proper metadata for dictionaries
TBC
Some modelling examples
TBC
Different POS in the same entry
TBC
Modelling usage examples
TBC
Collocation dictionary
TBC
Acknowledgements
The authors would like to thank the BPMLOD community group members for their valuable feedback.
References
[AP_PAPER]
M. Forcada, M. Ginestí-Rosell, J. Nordfalk, J. O'Regan, S. Ortiz-Rojas, J. Pérez-Ortiz, F. Sánchez-Martínez, G. Ramírez-Sánchez, and F. Tyers, Apertium: a free/open-source platform for rule-based machine translation. Machine Translation, vol. 25, no. 2, pp. 127-144, 2011.
[AP_RDF]
RDF version of the Apertium bilingual dictionaries. URL: http://linguistic.linkeddata.es/apertium/
[DC]
DCMI Metadata Terms. URL: http://purl.org/dc/elements/1.1/
[DCAT]
F. Maali, J. Erickson (Eds.). Data Catalog Vocabulary (DCAT). W3C Recommendation. January 2014. URL: http://www.w3.org/TR/vocab-dcat/
[GUIDE_MLD]
A. Gómez-Pérez, D. Vila-Suero, E. Montiel-Ponsoda, J. Gracia, and G. Aguado-de Cea, Guidelines for multilingual linked data, in Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics (WIMS'13). New York, NY, USA: ACM, Jun. 2013.
[ISA_URIS]
P. Archer, S. Goedertier, and N. Loutas, Study on persistent URIs. Tech. Rep., ISA, Dec. 2012.
[LEMON_PAPER]
J. McCrae, G. Aguado-de Cea, P. Buitelaar, P. Cimiano, T. Declerck, A. Gómez-Pérez, J. Gracia, L. Hollink, E. Montiel-Ponsoda, D. Spohr, and T. Wunner, Interchanging lexical resources on the Semantic Web. Language Resources and Evaluation, vol. 46, 2012.
[LEMON]
The lemon model. URL: http://lemon-model.net/
[LEXINFO]
Lexinfo. URL: https://lexinfo.net/index.html
[LILA]
Marco C Passarotti, Flavio Massimiliano Cecchini, Greta Franzini, Eleonora Litta, Francesco Mambrini, and Paolo Ruffolo, The LiLa Knowledge Base of Linguistic Resources and NLP Tools for Latin. LDK-PS 2019.
[LMF]
Lexical Markup Framework (LMF). URL: http://www.lexicalmarkupframework.org/
[ONTOLEX]
Lexicon Model for Ontologies. URL: https://www.w3.org/2016/05/ontolex/
[ONTOLEX_PAPER]
John P McCrae, Julia Bosque-Gil, Jorge Gracia, Paul Buitelaar, and Philipp Cimiano, The Ontolex-Lemon model: development and applications. Proceedings of eLex 2017 conference.
[TR]
Translation Module. URL: http://purl.org/net/translation
[TR_PAPER]
J. Gracia, E. Montiel-Ponsoda, D. Vila-Suero, and G. Aguado-de Cea, Enabling language resources to expose translations as linked data on the web, in Proc. of 9th Language Resources and Evaluation Conference (LREC'14), Reykjavik (Iceland), May 2014.
[TRCAT]
OEG Translation Categories. URL: http://purl.org/net/translation-categories
[VOID]
K. Alexander, R. Cyganiak, M. Hausenblas, J. Zhao, Describing Linked Datasets with the VoID Vocabulary. W3C Interest Group Note. March 2011. URL: http://www.w3.org/TR/void/