Guidelines for Cross-lingual linking

This document gives advice on different representation mechanisms for linking data across languages on the Semantic Web. We first outline the scope of the guidelines, clarifying the intended audience and the types of language resources to which they apply. Then, inspired by an adapted version of the Vauquois Triangle, we review the levels of representation at which cross-lingual linking can be implemented. Next, we give concrete advice on how to model cross-lingual links in the light of this adapted triangle and illustrate it with examples from existing resources. Finally, we review, from a practical point of view, the representation mechanisms available to describe links across linguistic data in different languages. Note that, whilst we recognise the relevance of techniques for automated cross-lingual link discovery, these guidelines focus on general representation issues of cross-lingual data.

Scope of guidelines

These guidelines are intended for producers of resources that cover several languages and the links among them. Inevitably, such resources involve the modelling and creation of cross-lingual (CL) links. Such links enable CL access to lexical and semantic representations. They also facilitate dynamic computation (via federated queries) of lexical equivalences between languages. There are diverse ways in which such links can be implemented, depending on which kinds of entities are linked and the relations available to link them together. This diversity gives rise to the problem of choosing an implementation method. Our main objective is to provide a perspective from which to resolve this problem, based on an adapted version of the Vauquois Triangle [VAUQ]. This offers a point of departure for comparing different CL linking methods within the linguistic linked (open) data (LLD) framework and thus a basis for practical advice on how to proceed with linking data associated with different languages.

Types of resources

We are concerned here with language resources that contain descriptions of lexical items, such as dictionaries, thesauri, ontologies, terminologies, taxonomies, and lexical and encyclopaedic resources. This is not meant to be an exhaustive list, and it excludes resources that are highly composite, such as corpora, which are essentially collections and may incorporate CL links internally. Such resources may be within the scope of future versions of these guidelines.

Conventions in this document

Throughout this document, we will use Turtle RDF syntax to provide RDF examples. The following table shows a list of relevant namespaces that will be used in the rest of this document.

Namespaces of the relevant vocabularies
owl	<http://www.w3.org/2002/07/owl#>
rdfs	<http://www.w3.org/2000/01/rdf-schema#>
ontolex	<http://www.w3.org/ns/lemon/ontolex#>
vartrans	<http://www.w3.org/ns/lemon/vartrans#>
lexinfo	<http://www.lexinfo.net/ontology/2.0/lexinfo#>
skos	<http://www.w3.org/2004/02/skos/core#>
ili	<http://globalwordnet.org/ili/>
dc	<http://purl.org/dc/elements/1.1/>
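
For convenience, the table above can be written as Turtle prefix declarations, as sketched below. The skos prefix is bound here to the standard SKOS core namespace. The prefixes pwn30, pwn31, odwn13, ontology1, ontology2 and the default prefix that appear in later examples are not listed in the table, so they are left undeclared in this sketch and would have to be bound to the IRIs of the corresponding datasets.

@prefix owl:      <http://www.w3.org/2002/07/owl#> .
@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ontolex:  <http://www.w3.org/ns/lemon/ontolex#> .
@prefix vartrans: <http://www.w3.org/ns/lemon/vartrans#> .
@prefix lexinfo:  <http://www.lexinfo.net/ontology/2.0/lexinfo#> .
@prefix skos:     <http://www.w3.org/2004/02/skos/core#> .
@prefix ili:      <http://globalwordnet.org/ili/> .
@prefix dc:       <http://purl.org/dc/elements/1.1/> .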

Levels of cross-lingual linking

CL links associate items in one language with items expressed in another language. Such links are usually directional with the two languages being referred to as source and target. They are used to create resources for a variety of applications such as CL information retrieval, word translation, CL ontology mapping, and machine translation.

Previous works have recognised that both monolingual and CL linking can occur at different levels. For instance, writing from a Semantic Web perspective, Gracia et al. [CHALLENGES] consider that links can be established at three different levels: conceptual level, where links are established between classes or properties as modelled in an ontology or vocabulary; instance level, where links are established between specific individuals/entities belonging to certain ontology classes; and linguistic level, where links are established between the linguistic manifestation of the ontology entities (e.g., their associated lexical entries).

To clarify the usage of levels in CL linking, we make use of the Adapted Vauquois Triangle, shown in Figure 1, to present the various ways in which CL links can be implemented. This is an adaptation of the Vauquois Triangle, first proposed by Bernard Vauquois in 1968 [VAUQ] to illustrate a range of possible architectures for rule-driven transfer-based MT systems current during the 1970s and 80s. Two important aspects of the original triangle are retained. First, the horizontal dimension reflects cross-linguality, the left and right sides referring to source and target languages respectively. Second, the vertical dimension represents different levels of semantic abstraction. The original triangle pre-specifies the intermediate levels. However, the adapted version differs in the number and choice of levels. At the bottom is the surface linguistic form, e.g., a word, and at the top is an ontological referent, i.e., something conceptual and purely semantic, constituting an entity or element within an ontology. In between these extremes, moving upward, are a series of levels that proceed from the surface towards the meaning.

The adapted Vauquois triangle, for the cross-lingual linking case, is illustrated in the following figure.

The ontological referent serves as a natural vertex that joins the two edges, one for each language, and allows the triangle to be traversed from one edge to the other, reaching any lexical-semantic level from source to target and vice versa. In reality, however, the common vertex is often missing, which makes it necessary to link at other levels.

The choice of levels is largely inspired by Ontolex [ONTOLEX, ONTOLEX_PAPER], a W3C lexicon model for ontologies, from which some of the definitions below are reproduced:

We understand the ontological referent as a conceptual element, a unit of thought, that can be represented in terms of an ontology entity (class, property, or individual). We assume that it is language neutral and can be lexicalised differently in different languages and can be adapted (or not) to a particular cultural and/or linguistic view. Its adaptation to such different languages or cultural views results in the concept that we refer to in Figure 1, which belongs to the edge of a particular language.

In practice, the concept (as in the case of the ontological referent) can be defined through an ontology entity (e.g., class, property, or individual in OWL or RDFS; but also through a skos:Concept or ontolex:LexicalConcept).

As an illustrative example of an ontological referent, we show here a concept defined in the Collaborative InterLingual Index (CILI) of the Global Wordnet (note that our example is an oversimplification of the real implementation). Although a description is given in English (other languages can be added), the intention is to have a common referent for such a concept regardless of the particular language:


ili:i36300 a skos:Concept ;
    skos:definition "a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)"@en ;
    dc:source pwn30:00169305-n .

Such a common ontological referent can be associated with specific monolingual concepts. In our example, it is associated with wordnet synsets in English and Dutch:

		
ili:i36300 owl:sameAs pwn31:100170126-n .   # bank
ili:i36300 owl:sameAs odwn13:00169305-n .   # rol, wenden, wending, rollen

A lexical sense represents the lexical meaning of a lexical entry when interpreted as referring to the corresponding ontology element. A lexical sense thus represents a reification of a pair of a uniquely determined lexical entry and a uniquely determined ontology entity it refers to. A link between a lexical entry and an ontology entity via a lexical sense object implies that the lexical entry can be used to refer to the ontology entity in question.

A lexical entry represents a unit of analysis of the lexicon consisting of a set of forms that are grammatically related and a set of base meanings that are associated with all of these forms. Thus, a lexical entry is a word, multiword expression or affix with a single part-of-speech, morphological pattern, etymology and set of senses.

In the following example, different lexical senses are defined to associate different WordNet concepts with the lexical entry for “bank”:

	
:bank_n1 a ontolex:LexicalEntry;
    ontolex:sense :bank_sense1;
    ontolex:sense :bank_sense2.
	
:bank_sense1 a ontolex:LexicalSense;
    ontolex:reference pwn31:100170126-n. #bank as a “flight maneuver”
	
:bank_sense2 a ontolex:LexicalSense;
    ontolex:reference pwn31:108437235-n.   #bank as a “depository financial institution”

The lexical form property relates a lexical entry to one grammatical form variant of the lexical entry. It represents one grammatical realisation of a lexical entry. Different forms are used to express different morphological realisations of the entry (e.g., plural vs singular or canonical vs non-canonical). Finally, a form can have an associated written representation, as well as others such as a phonetic representation. We refer to such a “bottom” representation level as the surface form in this document, although the term is not used as such in the Ontolex specification. For example, the plural form of the English noun 'bank' may be described by the following lexical form, with its corresponding surface forms (written and phonetic representations, in the example):

	
:bank_n1 ontolex:lexicalForm :bank_n1_pl_form .

:bank_n1_pl_form a ontolex:Form ;
                 ontolex:writtenRep "banks"@en ;
                 ontolex:phoneticRep "/bæŋks/"@en-fonipa ;
                 lexinfo:number lexinfo:plural .

Notice that a common surface form may correspond to several lexical forms. For example, the surface form “banks”@en may correspond to the plural form of the noun 'bank' or to the third person singular form of the verb 'bank', among others. The same goes for the surface form “/bæŋks/”@en-fonipa.
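
As a sketch of this ambiguity, the verb reading could be modelled as follows. The lexical entry :bank_v1 and its form identifier are hypothetical, and the form is typed with ontolex:Form, the OntoLex class for forms; the LexInfo terms lexinfo:person, lexinfo:thirdPerson and lexinfo:singular are assumed to be used in the same way as lexinfo:number above. This form carries the same written and phonetic representations as the plural noun form.

:bank_v1 a ontolex:LexicalEntry ;
    lexinfo:partOfSpeech lexinfo:verb ;
    ontolex:lexicalForm :bank_v1_3sg_form .

:bank_v1_3sg_form a ontolex:Form ;
    ontolex:writtenRep "banks"@en ;
    ontolex:phoneticRep "/bæŋks/"@en-fonipa ;
    lexinfo:person lexinfo:thirdPerson ;
    lexinfo:number lexinfo:singular .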

Guidelines for cross-lingual linking

In this section we give advice on how to leverage an Adapted Vauquois Triangle-based representation in order to make more informed decisions when establishing cross-lingual (CL) links within the same or different language resources. This is the recommended procedure:

Analyse. Analyse the different lexical-semantic representation levels described in your data and draw your own version of the two edges of the Adapted Vauquois Triangle.
Decide. Among the identified representation levels, decide the level at which the CL links will be defined. Choose the highest possible level first. If the description of the chosen level is not rich enough in your data, or the foreseen linking algorithm would be unnecessarily complex, choose the next lower level and iterate until you find the appropriate representation level. Always move from the top (more expressive semantics) to the bottom (less expressive semantics). Draw the chosen linking level in your triangle.
- Note: Try to always establish links at the same level, and only in exceptional cases across different levels.
Implement. Then decide on the ontology/model/properties that will be used to encode the CL links at the chosen level.
- Note: If (and only if) a standard existing model is not applicable, define your own, BUT (1) document it using the Adapted Vauquois Triangle and (2) relate your model to standard existing models mentioned in this document as much as possible.

The following flowchart represents the above recommended procedure:

The rest of this document presents examples of the application of these guidelines to existing resources and explains in detail the available models and properties you should use when applicable.

Examples of cross-lingual linking

Below we show some examples of datasets that incorporate CL linking in different ways, as can be readily seen using the Adapted Vauquois Triangle.

Global/Open Multilingual Wordnet

Building on experience from Princeton WordNet and Open Multilingual Wordnet, the Global Wordnet Grid gathers wordnets in different languages aligned through the Collaborative InterLingual Index (CILI), which serves as a pivot between languages [GWNET]. See Figure 3.

Figure 3. Global Wordnet Grid / Open Multilingual Wordnet cross-lingual links are possible through the CILI used as a pivot between languages.

Wikidata

Wikidata is a free and open knowledge base of structured data and a sister project of Wikipedia [WDATA]. In 2018, Wikidata extended its model with the Wikibase Lexeme extension to improve the modelling of lexical entities such as words and phrases. Wikibase uses its own Lexeme data model, in which basic concepts and relationships are defined to describe lexemes, but it also provides an alignment with the Ontolex lemon model. Within Wikibase, there are both horizontal CL links at the sense level, e.g., translations (wdt:P5972), and compositional ones through properties such as “item for this sense” (wdt:P5137), which relates a sense to an ontological referent and thus offers an indirect translation, as shown in Figure 4.

Apertium RDF

In 2018, many bilingual dictionaries from the Apertium Machine Translation system [AP] were converted into RDF [AP_RDF] following the first version of the lemon model. Later, in 2020, Apertium RDF was expanded to 53 dictionaries and updated to the Ontolex lemon model [AP_RDF2]. In this process, the lexica of both source and target languages were extracted and translations were encoded as links from the source ontolex:LexicalSense to the target ontolex:LexicalSense. Hence, the resulting dataset contains "horizontal" CL links at the lexical sense level, as illustrated in Figure 5.

Figure 5. Apertium RDF cross-lingual links modelled as relations at the lexical sense level.

DBnary

With DBnary [DBNARY], many Wiktionary language editions are extracted and modelled using Ontolex lemon. In this process, many translations are extracted and represented in RDF. However, the different language editions are independent of one another. Hence, translations are encoded in the lexicon of the source language, whilst the translation itself is a string in the target language that is not connected to the target lexicon. Indeed, in some cases, the given string is descriptive and cannot be linked to any Lexical Entry in the target language. In such a context, it is not possible to systematically encode translations as “horizontal” links, i.e., between Lexical Senses or between Lexical Entries. The DBnary dataset therefore adopts an ad hoc representation of translation that systematically links the source Lexical Entry with a surface form in the target language. Additionally, when possible, the source Lexical Sense is also linked to the Translation. This modelling is illustrated in Figure 6.
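
A minimal sketch of this pattern is given below. The dbnary namespace IRI and the terms dbnary:Translation, dbnary:isTranslationOf, dbnary:targetLanguage and dbnary:writtenForm are given as we recall them from the DBnary vocabulary and should be checked against the published ontology. The ex namespace and all entry, sense and translation identifiers are hypothetical.

@prefix dbnary: <http://kaiko.getalp.org/dbnary#> .
@prefix lexvo:  <http://lexvo.org/id/iso639-3/> .
@prefix ex:     <http://example.org/dbnary-sketch/> .

# The translation lives in the source (English) lexicon and points to a
# target-language string, not to an entry of a Spanish lexicon.
ex:tr_spa_1_bench a dbnary:Translation ;
    dbnary:isTranslationOf ex:bench_Noun_1 , ex:bench_Noun_1_sense_1 ;
    dbnary:targetLanguage lexvo:spa ;
    dbnary:writtenForm "banco"@es .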

These examples illustrate the fact that even when using Ontolex as a unique modelling ontology, a given dataset may have specific characteristics that need to be clearly identified when it comes to the way in which CL links are modelled.

Cross-lingual links representation mechanisms

We now turn to look at the mechanisms that are currently available to express CL linking. There are a number of available mechanisms for representing cross-lingual links in RDF (see [LLD_BOOK]), which we summarise here.

Cross-lingual links at the conceptual level

Ideally, ontological referents could act as a pivot to move from one language to the other. However, this language-independent pivot is not always available. When the highest available level is the (monolingual) concept, relations can be established between concepts using identity links, e.g., owl:sameAs or owl:equivalentClass, whenever the two entities refer to exactly the same thing. For example:


ontology1:Banco rdfs:label "banco"@es ;
                rdfs:comment "Una institución financiera que [...]"@es .
ontology2:Bank  rdfs:label "bank"@en ;
                rdfs:comment "A financial institution that [...]"@en .
ontology1:Banco owl:equivalentClass ontology2:Bank .

In case the matching is not exact but approximate, we can use soft links such as rdfs:seeAlso or skos:closeMatch (skos:exactMatch, by contrast, signals that two concepts can be used interchangeably). When linking at the conceptual level, one may face lexical gaps (concepts that are not realised in a specific language) or conceptual mismatches (lexical concepts that only partly share a common meaning), e.g., "rice"@en in English (cooked or uncooked grain) vs "beras"@id (raw grain) and "nasi"@id (cooked grain) in Indonesian. To address this issue, ontology properties that account for taxonomical relations can be used between lexical concepts or ontology entities in different languages. For this purpose, we can consider SKOS properties such as skos:broader and skos:narrower, or rdfs:subClassOf whenever appropriate. Example:


ontology1:Rice rdfs:label "rice"@en .
ontology2:Beras rdfs:label "beras"@id .
ontology2:Beras rdfs:subClassOf ontology1:Rice .
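
The same pattern extends to the second Indonesian concept. The sketch below adds a hypothetical ontology2:Nasi class and, as an alternative for resources modelled as SKOS concepts rather than OWL classes, expresses the same hierarchy with skos:broader. The scheme1 and scheme2 prefixes and their concept identifiers are hypothetical.

ontology2:Nasi rdfs:label "nasi"@id .
ontology2:Nasi rdfs:subClassOf ontology1:Rice .

# Alternative with SKOS concepts (hypothetical concept schemes)
scheme1:rice  a skos:Concept ; skos:prefLabel "rice"@en .
scheme2:beras a skos:Concept ; skos:prefLabel "beras"@id ; skos:broader scheme1:rice .
scheme2:nasi  a skos:Concept ; skos:prefLabel "nasi"@id ;  skos:broader scheme1:rice .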

Cross-lingual links at the linguistic level (the Vartrans module)

The Ontolex lemon model proposes the vartrans module to encode CL relations at the linguistic levels. This module should be used when applicable (i.e., when enough linguistic information is available to do so).

Vartrans defines three top level object properties:

vartrans:lexicalRel defines relations between two lexical entries.
vartrans:senseRel defines relations between two lexical senses.
vartrans:conceptRel defines relations between two lexical concepts.

It also defines two specialisations of the preceding relations for standard cross-lingual link representations: vartrans:translatableAs, which links two lexical entries that are translations of each other, and vartrans:translation, which links two lexical senses that are translations of each other.

Finally, it introduces the vartrans:Translation class for the reification of the latter relation.

We can illustrate the expressive power of this model on the adapted Vauquois Triangle in Figure 7.

Figure 7. The vartrans cross-lingual relations and their levels in the adapted Vauquois Triangle.

Let us look at an example of a translation expressed with the vartrans:translatableAs property.


:bench-en a ontolex:LexicalEntry ;
            ontolex:lexicalForm [ontolex:writtenRep "bench"@en] .
:banco-es a ontolex:LexicalEntry ;
            ontolex:lexicalForm [ontolex:writtenRep "banco"@es] .
:bench-en vartrans:translatableAs :banco-es .

In case more information needs to be attached to the translation itself (e.g., confidence degree, provenance, …) we can use the class vartrans:Translation that reifies the vartrans:translation property, as in the following example (taken from [LLD_BOOK]):


:bench-en-sense a ontolex:LexicalSense ;
                ontolex:isSenseOf :bench-en ;
                ontolex:reference ontology1:bench .
:banco-es-sense a ontolex:LexicalSense ;
                ontolex:isSenseOf :banco-es ;
                ontolex:reference ontology2:banco .

:bench_banco-trans a vartrans:Translation ;
                   vartrans:source :bench-en-sense ;
                   vartrans:target :banco-es-sense .
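
Additional statements can then be attached to the reified translation. The sketch below records provenance with dc:source and a confidence score with ex:confidence. The ex namespace, the ex:confidence property, the source IRI and the score are all hypothetical and only illustrate the pattern.

@prefix ex: <http://example.org/linking-sketch#> .

:bench_banco-trans dc:source <http://example.org/bilingual-dictionary> ;
                   ex:confidence 0.9 .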

We should note that, formally, vartrans:LexicoSemanticRelation allows for linking between different semantic levels (it may relate a LexicalSense with a LexicalEntry); however, other classes are restricted to links on the same semantic level. Hence, a resource which systematically gives CL links from a source LexicalSense to a target LexicalEntry may use a reified class ontology:TranslationRel, which should be defined as a subclass of vartrans:LexicoSemanticRelation. However, a full discussion of this topic takes us beyond the scope of these guidelines.
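
As a rough sketch of such a cross-level link, assuming that the ontology prefix is bound to the resource's own vocabulary and reusing the sense and entry identifiers from the examples above (the relation identifier is hypothetical):

ontology:TranslationRel a owl:Class ;
    rdfs:subClassOf vartrans:LexicoSemanticRelation .

:bench-sense_banco-rel a ontology:TranslationRel ;
    vartrans:source :bench-en-sense ;   # a LexicalSense
    vartrans:target :banco-es .         # a LexicalEntry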

References

[AP]

M. Forcada, M. Ginestí-Rosell, J. Nordfalk, J. O'Regan, S. Ortiz-Rojas, J. Pérez-Ortiz, F. Sánchez-Martínez, G. Ramírez-Sánchez, and F. Tyers. Apertium: a free/open-source platform for rule-based machine translation. Machine Translation, 25(2), 127–144. 2011.

[AP_RDF]

J. Gracia, M. Villegas, A. Gómez-Pérez, and N. Bel. The Apertium bilingual dictionaries on the web of data. Semantic Web, 9(2), 231–240. 2018.

[AP_RDF2]

J. Gracia, C. Fäth, M. Hartung, M. Ionov, J. Bosque-Gil, S. Veríssimo, C. Chiarcos, and M. Orlikowski. Leveraging Linguistic Linked Data for Cross-Lingual Model Transfer in the Pharmaceutical Domain. In B. Fu & A. Polleres (Eds.), Proc. of 19th International Semantic Web Conference (ISWC 2020) (pp. 499–514). Springer. 2020.

[CHALLENGES]

J. Gracia, E. Montiel-Ponsoda, P. Cimiano, A. Gomez-Perez, P. Buitelaar, and J.P. McCrae. Challenges for the multilingual web of data. Journal of Web Semantics, 11, 63–71. 2012.

[DBNARY]

G. Sérasset. DBnary: Wiktionary as a Lemon-Based Multilingual Lexical Resource in RDF. Semantic Web – Interoperability, Usability, Applicability, Multilingual Linked Open Data, 6(4), 355–361. 2015.

[GWNET]

P. Vossen, F. Bond and J.P. McCrae. Toward a truly multilingual GlobalWordnet Grid. In C. Fellbaum, P. Vossen, V. B. Mititelu, & C. Forascu (Eds.), Proceedings of the 8th Global WordNet Conference (GWC) (pp. 424–431). Global Wordnet Association. 2016.

[LLD_BOOK]

P. Cimiano, C. Chiarcos, J.P. McCrae, and J. Gracia. Link representation and discovery. In Linguistic Linked Data: Representation, Generation and Applications (pp. 181–196). Springer International Publishing. 2020.

[ONTOLEX]

Lexicon Model for Ontologies. URL: https://www.w3.org/2016/05/ontolex/

[ONTOLEX_PAPER]

J.P. McCrae, J. Bosque-Gil, J. Gracia, P. Buitelaar, and P. Cimiano. The Ontolex-Lemon model: development and applications. In Proceedings of the eLex 2017 conference. 2017.

[VAUQ]

B. Vauquois. A survey of formal grammars and algorithms for recognition and transformation in mechanical translation. In A. J. H. Morrel (Ed.), IFIP Congress (2) (pp. 1114–1122). 1968.

[WDATA]

D. Vrandečić and M. Krötzsch. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10), 78–85. 2014.