Guidelines for LLD exploitation

This best practice describes how to exploit Linguistic Linked Data resources. The suggested steps for exploitation comprise:

searching for and discovering relevant resources
verifying the license of the dataset
navigating to the distribution of the data (download or SPARQL endpoint)
extracting the part of the data that is relevant for a particular purpose or application

This document was published by the Best Practices for Multilingual Linked Open Data community group. It is not a W3C Standard nor is it on the W3C Standards Track.

There are a number of ways that one may participate in the development of this report:

Mailing list: public-bpmlod@w3.org
Wiki: Main page
More information about meetings of the BPMLOD group can be obtained here
Source code for this document can be found on GitHub.

Use Case

Consider a company that develops sentiment analysis and opinion mining software. It has a working system for English and wants to port it to German. To do so, the company wants to find a corpus that is annotated at the sentiment level and extract from it a first seed lexicon of German subjective expressions with their polarity (positive, negative, neutral).

Method

To exploit Linguistic Linked Data resources, the steps listed above can be implemented as follows:

Search and discovery: relevant linguistic resources can be discovered using LingHub, which has been developed by the LIDER project.
Licensing: once a relevant dataset has been found using LingHub, clicking on the link of the resource leads to a page containing all the metadata about the resource, including its license.
Distribution: from the metadata page in LingHub, one can either download the dataset or discover where the SPARQL endpoint of the data is.
Extraction: using W3C standards, in particular SPARQL as the RDF query language, one can extract the portion of the data that is needed for a particular purpose.

If the LIDER guidelines are followed during publication and metadata provision, and if the resource is registered at META-SHARE, CLARIN VLO, LRE Map or DataHub, LingHub will crawl the resource and index it with the appropriate metadata. Further, if the de facto standards and vocabularies recommended by LIDER are followed, the same extraction patterns can be used to extract data from different datasets.
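As a rough illustration of a reusable extraction pattern, the following Python sketch builds a SPARQL query for phrases and their polarity, with the language URI as the only parameter. The property names are the NIF and Marl properties used in the extraction query of the Use Case Revisited section; the function name `build_polarity_query` and the simple input check are assumptions made for this example and are not part of any LIDER tooling.

```python
# Namespaces of the vocabularies used in this document's extraction query.
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
MARL = "http://www.gsi.dit.upm.es/ontologies/marl/ns#"


def build_polarity_query(language_uri):
    """Return a SPARQL query selecting phrases and their polarity.

    Hypothetical helper: only the language URI varies, so the same
    pattern can be sent to any dataset that uses NIF and Marl.
    """
    # Reject characters that cannot appear inside a SPARQL IRI reference.
    if any(ch in language_uri for ch in '<>" \n\t'):
        raise ValueError("language_uri is not a plain IRI")
    return (
        f"PREFIX nif: <{NIF}>\n"
        f"PREFIX marl: <{MARL}>\n"
        "SELECT ?string ?polarity\n"
        "WHERE {\n"
        "  ?phrase nif:anchorOf ?string ;\n"
        f"          nif:lang <{language_uri}> ;\n"
        "          marl:hasPolarity ?polarity .\n"
        "}\n"
    )


# Example: the German language URI used later in this document.
german_query = build_polarity_query("http://www.lexvo.org/page/iso639-3/deu")
```

Whether a given dataset identifies languages with the same URIs has to be checked against its metadata; the sketch only fixes the shape of the query.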

Use Case Revisited

The company looking for a German lexicon would apply the methodology sketched above as follows:

Search and discovery: the company would enter the query http://linghub.lider-project.eu/search/?property=&query=sentiment+corpus+german into LingHub and get two results. Clicking, for instance, on the usage review dataset, it would reach the metadata page http://linghub.lider-project.eu/datahub/usage-review-corpus#Nedfa753871df4052a5e6074d9389e901.
Licensing: it would check the license http://opendatacommons.org/licenses/by/1.0 and confirm that it is compatible with its purposes.
Distribution: on the metadata page of the usage review dataset, the company can find the available download at http://data.lider-project.eu/usage/usage.nt.gz, as well as the SPARQL endpoint: http://data.lider-project.eu/usage/sparql

Extraction: Using the following SPARQL query

    SELECT ?string ?polarity
    WHERE {
      ?phrase <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#anchorOf> ?string ;
              <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#lang> <http://www.lexvo.org/page/iso639-3/deu> ;
              <http://www.gsi.dit.upm.es/ontologies/marl/ns#hasPolarity> ?polarity .
    }

the company would obtain a seed lexicon of subjective phrases with their polarity as a result.
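To show how such a query result could be turned into a lexicon programmatically, the following Python sketch encodes a query as a GET request following the SPARQL 1.1 Protocol and converts a response in the SPARQL JSON results format into a dictionary. It uses only the standard library. The function names are hypothetical, it is an assumption that the endpoint serves JSON results, and the sample bindings are made up for illustration and are not taken from the usage review dataset.

```python
import json
import urllib.parse
import urllib.request


def build_request_url(endpoint, query):
    """Encode a SPARQL query as a GET request URL (SPARQL 1.1 Protocol)."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def lexicon_from_results(results):
    """Map each ?string binding to its ?polarity value."""
    lexicon = {}
    for row in results["results"]["bindings"]:
        lexicon[row["string"]["value"]] = row["polarity"]["value"]
    return lexicon


def fetch_lexicon(endpoint, query):
    """Send the query and return the lexicon (needs network access).

    Assumes the endpoint honours the JSON results media type.
    """
    request = urllib.request.Request(
        build_request_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return lexicon_from_results(json.load(response))


# Made-up sample response, for illustration only.
MARL = "http://www.gsi.dit.upm.es/ontologies/marl/ns#"
SAMPLE_RESULTS = {
    "head": {"vars": ["string", "polarity"]},
    "results": {
        "bindings": [
            {
                "string": {"type": "literal", "value": "sehr gut"},
                "polarity": {"type": "uri", "value": MARL + "Positive"},
            },
            {
                "string": {"type": "literal", "value": "schlecht"},
                "polarity": {"type": "uri", "value": MARL + "Negative"},
            },
        ]
    },
}

sample_lexicon = lexicon_from_results(SAMPLE_RESULTS)
```

With network access, `fetch_lexicon("http://data.lider-project.eu/usage/sparql", query)` would apply the same conversion to the live endpoint, where `query` holds the SPARQL query given above. If a phrase occurs with more than one polarity, this sketch keeps only the last one; a production lexicon would probably aggregate such cases instead.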