General Patterns for Multilingual Linked Open Data

Linked data usually contains labels, comments, descriptions, etc. that depend on the natural language used. Publishing and consuming linked data in a multilingual setting is therefore a challenge. This document presents a survey of the work done by the Best Practices for Multilingual Linked Open Data Group to identify a set of solutions and their benefits and drawbacks when applied in a given context.

General Patterns

Naming

Descriptive URIs

Descriptive URIs use ASCII characters that are combined to represent terms, or abbreviations of terms, in some natural language. This is usually done with terms in English or in other languages written in the Latin alphabet, such as French or Spanish, where only a small fraction of the alphabet falls outside ASCII.

http://example.org/Armenia

Some arguments in favour

Readability (but only for readers of English or another Latin-alphabet language)
Tool support
The only way of describing the concept in ontologies that lack labels and comments

Some drawbacks

Unreadable for non-Latin alphabet users
Difficult to be descriptive enough in a URI in certain contexts (biomedical, financial, ...); descriptive names are sometimes cryptic

Opaque URIs

Opaque URIs are resource identifiers which are not intended to represent terms in a natural language.

http://example.org/#I23AX45

Arguments in favor

Independence between content and language
Changes in textual descriptions do not imply changes in URIs
Suitable for automatic linked data generation from existing resources

Arguments against

Not human-readable
Opaque URIs may be difficult for developers to handle
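
The independence between identifier and language can be illustrated with a minimal Python sketch. The function names, the random identifier scheme, and the tuple layout for triples are assumptions made for illustration only; they are not part of the pattern itself.

```python
import uuid

# Namespace reused from the example above.
BASE = "http://example.org/#"


def mint_opaque_uri(base: str = BASE) -> str:
    """Return an identifier whose local part carries no natural-language meaning.

    The scheme ("I" followed by 8 random hex digits) is hypothetical.
    """
    return f"{base}I{uuid.uuid4().hex[:8].upper()}"


def label_triples(uri: str, labels: dict) -> list:
    """Attach one label per language tag as (subject, predicate, text, lang)."""
    return [(uri, "rdfs:label", text, lang) for lang, text in sorted(labels.items())]


armenia = mint_opaque_uri()
triples = label_triples(armenia, {"en": "Armenia", "hy": "Հայաստան"})
```

Because the labels live in the data rather than in the identifier, correcting or adding a label changes `triples` but never `armenia`.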

Full IRIs

This pattern consists of using unrestricted IRIs, which can contain Unicode characters outside the ASCII repertoire.

http://օրինակ.օրգ/#Հայաստան

IRIs (Internationalized Resource Identifiers) are described in [[RFC3987]] and they are not restricted to ASCII characters.

Arguments in favor

More readable (for speakers of that language)

Arguments against

Security issues (spoofing)
Unreadable for speakers of other languages
Writing IRI domain names is difficult in certain languages (e.g., Arabic, which is written right to left)
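
Tools that only accept ASCII URIs need the IRI mapped to a URI first. The following Python sketch is a simplified version of the IRI-to-URI mapping described in [[RFC3987]]: the host is converted to its ASCII-compatible form and the remaining components are percent-encoded as UTF-8. The function name `iri_to_uri` is an assumption, and Python's built-in `idna` codec implements the older IDNA 2003 rules, so a production system may need a dedicated library.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched when percent-encoding (simplified choice).
_SAFE = "/:@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Map an IRI to an ASCII-only URI (simplified sketch)."""
    parts = urlsplit(iri)
    # Host labels become "xn--..." Punycode labels.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Non-ASCII characters elsewhere become percent-encoded UTF-8 bytes.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE + "?")
    fragment = quote(parts.fragment, safe=_SAFE + "?")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


uri = iri_to_uri("http://օրինակ.օրգ/#Հայաստան")
```

The resulting URI is pure ASCII and no longer readable in any language, which is also why visually similar characters in the original IRI can be used for spoofing: two IRIs that look alike may map to entirely different hosts.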

Internationalized paths only

This pattern uses IRIs in which the domain part is restricted to ASCII characters while the path can use Unicode characters.

http://example.org#Հայաստան

Arguments in favor

Fewer security risks
Path more human readable (for a given language)

Arguments against

Unreadable for speakers of other languages
Problems arise when the namespace does not end in / or #: it may be difficult to see where the term begins (e.g., the namespace is http://w3.org/html and the full name is http://w3.org/htmldiv). Prefixes should therefore be defined so that they end with / or #
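
The namespace problem can be made concrete with a short Python sketch. The helper `split_term` is hypothetical; it mimics what many tools do when they split an IRI at the last / or # to recover the namespace and the local name.

```python
def split_term(iri: str) -> tuple[str, str]:
    """Split an IRI into (namespace, local name) at the last '#' or '/'."""
    cut = max(iri.rfind("#"), iri.rfind("/"))
    return iri[: cut + 1], iri[cut + 1 :]


# The intended namespace http://w3.org/html cannot be recovered.
ambiguous = split_term("http://w3.org/htmldiv")

# With a namespace that ends in '#', the split is unambiguous.
clear = split_term("http://w3.org/html#div")
localized = split_term("http://example.org#Հայաստան")
```

In the first case the split yields the local name `htmldiv` under the namespace `http://w3.org/`, which is not what the publisher intended.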

Include language in host name

http://hy.example.org/#Հայաստան

Arguments in favor

Practical reasons: the data can be divided into different datasets, as in DBpedia

Arguments against

Unclear where the language code should go (beginning, end, ...)
Dialects
In DBpedia, the "es" in the URI identifies the source of the data, not the language
May be technically challenging (requires DNS entry for each language)
Confuses server/data distinction

Include language in path

http://example.com/en/Armenia
http://example.com/Armenia.en
http://example.com/Armenia?lang=en

Arguments in favor

Compatible with content negotiation

Arguments against

Dereferencing

No language content negotiation

The server always returns the same triples, regardless of the HTTP Accept-Language header.

Arguments in favor

The user agent gets all the information.

Arguments against

The user agent may get triples with languages that it is not interested in

Language content negotiation

The server honors the language preferences that the user agent presents in the Accept-Language header and returns different data for each language preference.

Arguments in favor

Saves bandwidth. The user agent gets only the triples that match the language preferences it has presented.

Arguments against

The user agent can lose information, especially if there is multilingual content
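
A minimal Python sketch of server-side language selection follows. The function names and the dictionary of labels are assumptions, and the matching is a simplified prefix match on language tags rather than a complete implementation of HTTP language negotiation. Falling back to all labels when nothing matches is one possible design choice for limiting the information loss noted above.

```python
def parse_accept_language(header: str) -> list[tuple[str, float]]:
    """Return (language-range, q) pairs sorted by descending q value."""
    prefs = []
    for item in header.split(","):
        tag, _, param = item.strip().partition(";")
        tag, param = tag.strip().lower(), param.strip()
        if not tag:
            continue
        try:
            q = float(param[2:]) if param.startswith("q=") else 1.0
        except ValueError:
            q = 0.0  # treat a malformed weight as "not acceptable"
        prefs.append((tag, q))
    # sorted() is stable, so equal weights keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def select_labels(labels: dict, header: str) -> dict:
    """Keep only the labels for the best-matching language range."""
    for tag, q in parse_accept_language(header):
        if q <= 0:
            continue
        if tag == "*":
            return dict(labels)
        matches = {
            lang: text
            for lang, text in labels.items()
            if lang.lower() == tag or lang.lower().startswith(tag + "-")
        }
        if matches:
            return matches
    return dict(labels)  # assumption: no match means "send everything"


labels = {"en": "Armenia", "hy": "Հայաստան", "es": "Armenia"}
for_armenian_reader = select_labels(labels, "hy, en;q=0.5")
```

Here a user agent that prefers Armenian receives only the `hy` label, while a user agent asking for a language the dataset lacks still receives all three.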

Language content redirection

The server honors the language preferences that the user agent presents in the Accept-Language header and returns a 303 (See Other) redirect to a resource with triples in that language.

Arguments in favor

Maintains the difference between the generic representation of a resource in any language and the representation of that resource in a given language

Arguments against

It is not feasible for every resource to have representations in different languages.
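
The redirection logic can be sketched in Python as a pure function that computes the response status and headers. The function names, the set of available languages, and the default language are assumptions; the target URL shape (`/Armenia.en`) follows one of the language-in-path examples given earlier.

```python
def preferred_languages(header: str) -> list[str]:
    """Language ranges from an Accept-Language header, highest q first."""
    ranked = []
    for item in header.split(","):
        tag, _, param = item.strip().partition(";")
        tag, param = tag.strip().lower(), param.strip()
        try:
            q = float(param[2:]) if param.startswith("q=") else 1.0
        except ValueError:
            q = 0.0
        if tag and q > 0:
            ranked.append((q, tag))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [tag for _, tag in ranked]


def language_redirect(path: str, accept_language: str, available: set, default: str):
    """Return (status, headers) redirecting to a language-specific resource."""
    chosen = default
    for tag in preferred_languages(accept_language):
        primary = tag.split("-")[0]  # "hy-am" matches the "hy" representation
        if primary in available:
            chosen = primary
            break
    # Vary tells caches that the response depends on Accept-Language.
    return 303, {"Location": f"{path}.{chosen}", "Vary": "Accept-Language"}


status, headers = language_redirect("/Armenia", "hy-AM, en;q=0.8", {"en", "hy"}, "en")
```

The generic resource `/Armenia` keeps its own identity, and only the redirect target is language-specific. When no requested language is available, this sketch falls back to the default representation.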

Textual information

Label everything

Linked data datasets should provide labels for all resources: individuals, concepts and properties, not just the main entities.
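
One way to check a dataset against this pattern is to list the resources that never receive a label. The Python sketch below assumes a simplified in-memory representation in which IRIs are strings and literals are (text, language) tuples; the function name and the example triples are hypothetical.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def unlabeled_resources(triples, namespace: str) -> list:
    """Return the IRIs in `namespace` that are never the subject of rdfs:label."""
    mentioned, labeled = set(), set()
    for subject, predicate, obj in triples:
        # Properties count as resources too, so the predicate is included.
        for term in (subject, predicate, obj):
            if isinstance(term, str) and term.startswith(namespace):
                mentioned.add(term)
        if predicate == RDFS_LABEL:
            labeled.add(subject)
    return sorted(mentioned - labeled)


data = [
    ("http://example.org/Armenia", "http://example.org/capital", "http://example.org/Yerevan"),
    ("http://example.org/Armenia", RDFS_LABEL, ("Armenia", "en")),
    ("http://example.org/Armenia", RDFS_LABEL, ("Հայաստան", "hy")),
]
missing = unlabeled_resources(data, "http://example.org/")
```

In this example the main entity is labeled, but the property `capital` and the individual `Yerevan` are reported as missing labels.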

Introduction