Linked data usually contains labels, comments, descriptions, etc. that depend on the natural language used. When linked data appears in a multilingual setting, publishing and consuming it is a challenge. This document surveys the work done by the Best Practices for Multilingual Linked Open Data Community Group to identify a set of solutions, together with their benefits and drawbacks when applied in a given context.

This document was published by the Best Practices for Multilingual Linked Open Data Community Group. It is not a W3C Standard nor is it on the W3C Standards Track.

There are a number of ways that one may participate in the development of this report:

Introduction

This report describes the work done by the Best Practices for Multilingual Linked Open Data Community Group.

The goal of the group is to identify common practices and patterns that can be applied to publish linked data in a multilingual context.

General Patterns

Naming

Descriptive URIs

Descriptive URIs combine ASCII characters to represent terms, or abbreviations of terms, in some natural language. This is usually done with terms in English or in other languages written in the Latin alphabet, such as French or Spanish, where only a small fraction of the letters fall outside ASCII.

http://example.org/Armenia
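
As a minimal sketch of how such a URI might be derived from a label, the hypothetical helper below (`descriptive_uri` is not part of any vocabulary or tool discussed here) strips accents and any remaining non-ASCII characters before appending the result to the example namespace. It assumes Python and only the standard library.

```python
import re
import unicodedata

BASE = "http://example.org/"  # example namespace used in this report


def descriptive_uri(label: str, base: str = BASE) -> str:
    """Build an ASCII-only descriptive URI from a label (illustrative only)."""
    # Decompose accented letters, then drop everything outside ASCII.
    decomposed = unicodedata.normalize("NFKD", label)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of non-alphanumeric characters into single underscores.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", ascii_only).strip("_")
    if not slug:
        raise ValueError(f"label {label!r} has no ASCII representation")
    return base + slug


print(descriptive_uri("Armenia"))
print(descriptive_uri("España"))
```

A label written entirely in a non-Latin script, such as Հայաստան, leaves nothing after the ASCII filter, so the helper raises `ValueError`. That is one concrete form of the readability drawback for non-Latin alphabet users.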

Some arguments in favor

  • Readability (but only for readers of English or other Latin-alphabet languages)
  • Tool support
  • The only way of describing the concept in ontologies that lack labels and comments

Some drawbacks

  • Unreadable for non-Latin alphabet users
  • In certain contexts (biomedical, financial, ...) it is difficult to be descriptive enough in a URI, and descriptive names are sometimes cryptic

Opaque URIs

Opaque URIs are resource identifiers which are not intended to represent terms in a natural language.

http://example.org/#I23AX45
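
One possible way to mint such identifiers automatically is sketched below in Python. The record key `country:51`, the `I` prefix, and the eight-character digest length are assumptions made for illustration; the point is only that the URI is derived from an internal key while the labels live in separate data.

```python
import hashlib

BASE = "http://example.org/#"  # example namespace used in this report


def opaque_uri(record_key: str) -> str:
    """Derive a stable, language-neutral URI from an internal record key."""
    digest = hashlib.sha256(record_key.encode("utf-8")).hexdigest()[:8].upper()
    return f"{BASE}I{digest}"


# Labels are kept apart from the identifier, one per language tag.
uri = opaque_uri("country:51")
labels = {uri: {"en": "Armenia", "hy": "Հայաստան"}}

# Renaming a label leaves the URI untouched.
labels[uri]["en"] = "Republic of Armenia"
print(uri, labels[uri])
```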

Arguments in favor

  • Independence between content and language
  • Changes in textual descriptions do not imply changes in URIs
  • Suitable for automatic generation of linked data from existing resources

Arguments against

  • Not human-readable
  • Opaque URIs may be difficult for developers to handle

Full IRIs

This pattern consists of using unrestricted IRIs, which can contain Unicode characters outside the ASCII repertoire.

http://օրինակ.օրգ/#Հայաստան

IRIs (Internationalized Resource Identifiers) are described in [[RFC3987]] and they are not restricted to ASCII characters.
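
Software that only accepts ASCII URIs needs the IRI mapped to a URI first. The Python sketch below approximates that mapping under simplifying assumptions: it uses Python's built-in `idna` codec for the host (which implements the older IDNA 2003 rules), percent-encodes the UTF-8 bytes of other non-ASCII characters, and drops any user information. The function name `iri_to_uri` is hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Map an IRI to an ASCII-only URI (simplified sketch)."""
    parts = urlsplit(iri)
    # Host: convert each non-ASCII label to its "xn--" form.
    host = (parts.hostname or "").encode("idna").decode("ascii")
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Other components: percent-encode non-ASCII characters as UTF-8,
    # leaving reserved characters and existing escapes untouched.
    safe = "/:@!$&'()*+,;=%~"
    path = quote(parts.path, safe=safe)
    query = quote(parts.query, safe=safe + "?")
    fragment = quote(parts.fragment, safe=safe + "?")
    return urlunsplit((parts.scheme, host, path, query, fragment))


print(iri_to_uri("http://օրինակ.օրգ/#Հայաստան"))
```

The resulting URI is pure ASCII but no longer readable in any language, which is worth keeping in mind when weighing the readability argument below.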

Arguments in favor

  • More readable (for speakers of that language)

Arguments against

  • Security issues (spoofing)
  • Unreadable for speakers of other languages
  • Writing IRI domain names is difficult in certain languages (e.g., Arabic, which is written right to left)

Internationalized paths only

This pattern uses IRIs in which the domain part is restricted to ASCII characters while the path can use Unicode characters.

http://example.org#Հայաստան
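
A publisher who adopts this pattern could check identifiers mechanically. The following Python sketch (the function name `has_ascii_host` is an assumption, not an established API) accepts an IRI only if its host is non-empty and pure ASCII, whatever the path or fragment contains.

```python
from urllib.parse import urlsplit


def has_ascii_host(iri: str) -> bool:
    """Return True if the IRI has a host and that host is ASCII-only."""
    host = urlsplit(iri).hostname or ""
    return bool(host) and host.isascii()


print(has_ascii_host("http://example.org#Հայաստան"))
print(has_ascii_host("http://օրինակ.օրգ/#Հայաստան"))
```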

Arguments in favor

  • Fewer security risks
  • Path more human readable (for a given language)

Arguments against

  • Unreadable for speakers of other languages
  • Problems arise when the namespace does not end in / or #, because it may be difficult to see where the term begins (e.g., the namespace is http://w3.org/html and the full name is http://w3.org/htmldiv). Prefixes should be defined so that they end with / or #
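
The namespace problem in the last item can be made concrete with a small Python sketch. The hypothetical helper `split_iri` uses a common heuristic (split at the last `#`, otherwise at the last `/`), which is an assumption about how a consumer might recover the local name, not a rule taken from any specification.

```python
def split_iri(iri: str) -> tuple[str, str]:
    """Split an IRI into (namespace, local name) at the last '#' or '/'."""
    for separator in ("#", "/"):
        head, found, tail = iri.rpartition(separator)
        if found:
            return head + separator, tail
    return "", iri


# Namespace ends in '#': the term "div" is recovered.
print(split_iri("http://w3.org/html#div"))
# Namespace "http://w3.org/html" has no trailing separator: the term is lost.
print(split_iri("http://w3.org/htmldiv"))
```

In the second call the heuristic returns `htmldiv` as the local name, so a consumer cannot tell that the intended term was `div`.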

Include language in host name

http://hy.example.org/#Հայաստան

Arguments in favor

  • Practical reasons: the data can be divided into different datasets, as in DBpedia

Arguments against

  • It is unclear where the language code should go (beginning, end, ...)
  • Dialects
  • In DBpedia URIs, "es" actually identifies the source, NOT the language
  • May be technically challenging (requires DNS entry for each language)
  • Confuses server/data distinction

Include language in path

      http://example.com/en/Armenia
      http://example.com/Armenia.en 
      http://example.com/Armenia?lang=en
    

Arguments in favor

  • Compatible with content negotiation

Arguments against

Dereferencing

No language content negotiation

Always return the same triples, regardless of the HTTP Accept-Language header.

Arguments in favor

  • The user agent gets all the information.

Arguments against

  • The user agent may get triples with languages that it is not interested in

Language content negotiation

The server honors the language preferences that the user agent presents in the Accept-Language header and returns different data for each language preference.
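
A minimal Python sketch of this behavior follows. It parses the header into language ranges ordered by their `q` weights and keeps only the labels whose language tag matches the best available range. The function names, the `(text, language)` tuple layout, and the decision to return every label when nothing matches are assumptions made for the example.

```python
def parse_accept_language(header: str) -> list[str]:
    """Return the language ranges of the header, highest q-value first."""
    weighted = []
    for item in header.split(","):
        tag, _, param = item.strip().partition(";")
        tag = tag.strip().lower()
        param = param.strip()
        quality = 1.0
        if param.startswith("q="):
            try:
                quality = float(param[2:])
            except ValueError:
                quality = 0.0
        if tag and quality > 0:
            weighted.append((quality, tag))
    # sort() is stable, so equal weights keep their order in the header.
    weighted.sort(key=lambda pair: -pair[0])
    return [tag for _, tag in weighted]


def select_labels(labels, header):
    """Keep the (text, language) labels matching the most preferred range."""
    for language_range in parse_accept_language(header):
        if language_range == "*":
            return list(labels)
        matched = [
            (text, lang)
            for text, lang in labels
            if lang.lower() == language_range
            or lang.lower().startswith(language_range + "-")
        ]
        if matched:
            return matched
    return list(labels)  # assumption: fall back to everything


labels = [("Armenia", "en"), ("Հայաստան", "hy"), ("Arménie", "fr")]
print(select_labels(labels, "hy, en;q=0.5"))
```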

Arguments in favor

  • Saves bandwidth. The user agent gets only the triples that match the language preferences it has presented.

Arguments against

  • The user agent can lose information, especially if there is multilingual content

Language content redirection

The server honors the language preferences that the user agent presents in the Accept-Language header and returns a 303 (See Other) redirect to a resource with triples in that language.
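
The redirect decision can be sketched in Python without committing to any web framework. The set `AVAILABLE`, the fallback `DEFAULT`, and the `/{language}{path}` target layout are assumptions for illustration; the function only computes the status code and headers that a server would send.

```python
AVAILABLE = {"en", "hy"}  # hypothetical languages that have their own resource
DEFAULT = "en"            # hypothetical fallback language


def preferred_languages(header: str) -> list[str]:
    """Return the language ranges of the header, highest q-value first."""
    weighted = []
    for item in header.split(","):
        tag, _, param = item.strip().partition(";")
        tag = tag.strip().lower()
        param = param.strip()
        quality = 1.0
        if param.startswith("q="):
            try:
                quality = float(param[2:])
            except ValueError:
                quality = 0.0
        if tag and quality > 0:
            weighted.append((quality, tag))
    weighted.sort(key=lambda pair: -pair[0])
    return [tag for _, tag in weighted]


def redirect_for(path: str, accept_language: str):
    """Return (status, headers) redirecting to a language-specific resource."""
    chosen = DEFAULT
    for language_range in preferred_languages(accept_language):
        primary = language_range.split("-")[0]
        if primary in AVAILABLE:
            chosen = primary
            break
    # Vary tells caches that the response depends on Accept-Language.
    headers = {"Location": f"/{chosen}{path}", "Vary": "Accept-Language"}
    return 303, headers


print(redirect_for("/Armenia", "hy-AM, en;q=0.5"))
```

The generic resource `/Armenia` keeps its own identity, and each language-specific resource gets a separate URI that can be bookmarked or cached independently.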

Arguments in favor

  • Maintains the difference between the generic representation of a resource in any language and the representation of that resource in a given language

Arguments against

  • It may not be feasible for every resource to have representations in different languages.

Textual information

Label everything

Linked data datasets should provide labels for all resources: individuals, concepts and properties, not just the main entities.
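
A publisher could check this recommendation mechanically. The Python sketch below assumes a dataset held in memory as `(subject, predicate, object)` tuples, with IRIs as strings and literals as `(text, language)` pairs. It reports every IRI in the dataset's own namespace that is used somewhere but never given an `rdfs:label`. The function name and the data layout are assumptions, not the API of any RDF library.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
EX = "http://example.org/"  # example namespace used in this report


def unlabeled_resources(triples, namespace):
    """List IRIs in the namespace that appear in triples but have no label."""
    labeled = {s for s, p, _ in triples if p == RDFS_LABEL}
    used = set()
    for triple in triples:
        # Strings are IRIs; literals are (text, language) tuples and skipped.
        used.update(term for term in triple if isinstance(term, str))
    return sorted(
        term for term in used
        if term.startswith(namespace) and term not in labeled
    )


triples = [
    (EX + "Armenia", RDFS_LABEL, ("Armenia", "en")),
    (EX + "Armenia", RDFS_LABEL, ("Հայաստան", "hy")),
    (EX + "Armenia", EX + "capital", EX + "Yerevan"),
    (EX + "Yerevan", RDFS_LABEL, ("Yerevan", "en")),
]
print(unlabeled_resources(triples, EX))
```

Here the individuals are labeled but the property `capital` is not, which is the kind of gap the recommendation is meant to catch.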

Multilingual labels

Labels without language tags

Divide longer descriptions

Provide lexical information

Structured literals

Linking

Ontologies and vocabularies

Quality

References

Acknowledgements

Many thanks go to the BPMLOD Community Group participants who worked through many of the technical issues on the mailing list and the telecons.

Thanks to the following individuals, in order of their first name, for their input on the report: ...here we should include the list of people that contributed