Project Marmotta has retired. For details please refer to its Attic page.
Apache Marmotta - Linked Data Client - Custom Data Providers

Linked Data Client: Custom Data Providers

The LDClient library is modular and pluggable, which makes it easy to extend the supported types of data sources with custom data providers. Plugins are loaded using the Java ServiceLoader API. Implementing a custom data provider therefore involves the following steps:

  1. implementing the DataProvider interface to define how to request the data for a class of resources and how to map the returned data to RDF triples
  2. adding the data provider as a service to the library in META-INF/services/org.apache.marmotta.ldclient.api.provider.DataProvider so it can be loaded by the LDClient service loader
  3. (optional) implementing one or more default Endpoint definitions for the new data providers for resources where the data provider should always be used
  4. (optional) adding the endpoint implementation as a service to the library in META-INF/services/org.apache.marmotta.ldclient.api.endpoint.Endpoint so it can be auto-registered with the LDClient

The following sections describe each of these steps in more detail.

Implementing DataProviders

Implementing a DataProvider requires you to provide the following kinds of information and functionality for a data source:

  • a unique name for the data provider (used to identify it and refer to it in endpoint definitions); the name is returned by a getName() method
  • a list of supported MIME types (used as the Accept header, sometimes in the client response, and for deciding which provider to use); the MIME types are returned by a listMimeTypes() method
  • a method returning a ClientResponse for a resource given as argument; the LDClient instance calls this method when it decides that the data provider is responsible for the resource it is currently processing

While the first two methods are more or less trivial, the third method does most of the work and is usually the hardest to implement. It needs to retrieve the resource data either directly or from a service handling the data for this resource, parse the returned content and the HTTP status codes, and create a ClientResponse object. The ClientResponse contains two kinds of information:

  • the triples representing the data of the requested resource; each triple should have the requested resource as subject, or otherwise the retrieval is not Linked Data compliant
  • metadata about the request; most importantly, this includes the expiry date of the requested data which is used by the LDCache library for deciding when to refresh a resource

The triples are stored in an OpenRDF Sesame Repository. In principle, any Sesame repository is possible, but in most cases it only makes sense to use an in-memory repository (MemoryStore sail).
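The three responsibilities described above can be sketched as follows. This is a minimal illustration only: SketchDataProvider, SketchResponse and Triple are hypothetical stand-ins defined for this example and are not the real LDClient DataProvider and ClientResponse types, whose exact signatures should be taken from the Javadoc. A real provider would also fetch remote content and store the triples in a Sesame repository rather than a plain list.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

/** Hypothetical stand-in for an RDF triple (not a Sesame type). */
final class Triple {
    final String subject;
    final String predicate;
    final String object;

    Triple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }
}

/** Hypothetical stand-in for ClientResponse: triples plus request metadata. */
final class SketchResponse {
    final List<Triple> triples = new ArrayList<>();
    final int httpStatus;
    final Date expires;

    SketchResponse(int httpStatus, Date expires) {
        this.httpStatus = httpStatus;
        this.expires = expires;
    }
}

/** Hypothetical, simplified contract mirroring the three responsibilities. */
interface SketchDataProvider {
    String getName();
    String[] listMimeTypes();
    SketchResponse retrieveResource(String resourceUri);
}

/** Toy provider that returns a fixed title instead of fetching anything. */
final class StaticTitleProvider implements SketchDataProvider {
    static final String DC_TITLE = "http://purl.org/dc/terms/title";

    @Override
    public String getName() {
        return "Static Title";
    }

    @Override
    public String[] listMimeTypes() {
        return new String[] {"text/plain"};
    }

    @Override
    public SketchResponse retrieveResource(String resourceUri) {
        // Assumed expiry policy for this sketch: one hour from now.
        Date expires = new Date(System.currentTimeMillis() + 3_600_000L);
        SketchResponse response = new SketchResponse(200, expires);
        // Every triple uses the requested resource as its subject.
        response.triples.add(new Triple(resourceUri, DC_TITLE, "Example title"));
        return response;
    }
}
```

Note that the single triple uses the requested resource as its subject, in line with the Linked Data requirement mentioned above.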

Implementing Endpoints

Endpoint configurations define for which kinds of resources to use which data provider, and how this data provider is to be configured. The endpoint that matched a resource and selected a data provider is handed over to the retrieveResource(...) method to allow the data provider to access the configuration. Endpoints contain the following configuration parameters:

  • a name to uniquely identify the endpoint configuration in the LDClient instance
  • a type, i.e. the name of the data provider this endpoint is referring to
  • a URI pattern specified as a Java regular expression to indicate which resources are handled by this endpoint configuration
  • an (optional) endpoint URL pointing to the service endpoint to use for accessing the resource data; this parameter is only used by providers accessing a resource through a service instead of directly
  • an (optional) collection of content types that are used by some data providers as accept headers in requests
  • a default expiry time that will be used in case the service endpoint or resource request does not return information about resource expiry
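To illustrate how a URI pattern given as a Java regular expression can select an endpoint for a resource, here is a small sketch. EndpointSketch and EndpointSelection are hypothetical classes written for this example, the endpoint names, provider names and example.org patterns are invented, and the choice of "first matching endpoint wins" is an assumption rather than a description of how LDClient resolves competing endpoints.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical, reduced endpoint configuration (not the library's Endpoint class). */
final class EndpointSketch {
    final String name;                 // unique name of the configuration
    final String type;                 // name of the data provider to use
    final Pattern uriPattern;          // which resource URIs this endpoint handles
    final long defaultExpirySeconds;   // fallback when the source reports no expiry

    EndpointSketch(String name, String type, String uriPattern, long defaultExpirySeconds) {
        this.name = name;
        this.type = type;
        this.uriPattern = Pattern.compile(uriPattern);
        this.defaultExpirySeconds = defaultExpirySeconds;
    }

    boolean handles(String resourceUri) {
        return uriPattern.matcher(resourceUri).find();
    }
}

final class EndpointSelection {
    /** Invented sample configurations for two made-up sites. */
    static List<EndpointSketch> sampleEndpoints() {
        return Arrays.asList(
            new EndpointSketch("Example Videos", "Example XML Provider",
                "^http://videos\\.example\\.org/.*", 86_400L),
            new EndpointSketch("Example Forum", "Example HTML Provider",
                "^http://forum\\.example\\.org/.*", 3_600L));
    }

    /** Returns the first endpoint whose pattern matches, or null if none does. */
    static EndpointSketch select(List<EndpointSketch> endpoints, String resourceUri) {
        for (EndpointSketch endpoint : endpoints) {
            if (endpoint.handles(resourceUri)) {
                return endpoint;
            }
        }
        return null;
    }
}
```

Anchoring the patterns with ^ keeps an endpoint from matching URIs that merely contain the host name somewhere in a query string.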

In almost all cases, creating an instance of the default Endpoint is sufficient for configuring an LDClient instance. However, there are a few exceptions to this rule:

  • you want to auto-register an endpoint configuration for certain resources (and therefore need a class with a zero-argument constructor to add to the service loader configuration); in this case you would create a subclass of Endpoint with a zero-argument constructor and pass the necessary default configuration to the call of the super constructor; see the LinkedDataEndpoint as an example
  • you want to provide a simplified way of configuring certain data providers, e.g. by pre-defining some of the configuration parameters (data provider or the endpoint URL); in this case you would create a sub-class that offers a simpler constructor and creates the configuration in the call to the super constructor; see the SPARQLEndpoint as an example
  • you want to provide additional configuration to your data provider, e.g. defining username and password for accessing a protected service; in this case you would create a subclass that offers additional methods with your configuration options, and call these methods in the retrieveResource method of your data provider (obviously, you’ll need to check with instanceof for the correct endpoint type)
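The first and third cases above might look roughly like the following sketch. BaseEndpoint is a hypothetical stand-in for the library's Endpoint class with an assumed constructor, and the subclass names, provider name and URL patterns are invented for illustration. The classes are package-private only so that the example fits in one file; a class that is to be listed in a service file must be public.

```java
/** Hypothetical stand-in for the library's Endpoint class (constructor is assumed). */
class BaseEndpoint {
    final String name;
    final String type;
    final String uriPattern;
    final String endpointUrl;
    final long defaultExpiry;

    BaseEndpoint(String name, String type, String uriPattern, String endpointUrl, long defaultExpiry) {
        this.name = name;
        this.type = type;
        this.uriPattern = uriPattern;
        this.endpointUrl = endpointUrl;
        this.defaultExpiry = defaultExpiry;
    }
}

/** Case 1: zero-argument constructor that passes fixed defaults to the super constructor. */
class ExampleVideoEndpoint extends BaseEndpoint {
    ExampleVideoEndpoint() {
        super("Example Videos", "Example XML Provider",
              "^http://videos\\.example\\.org/.*", null, 86_400L);
    }
}

/** Case 3: additional configuration (credentials) on top of the base parameters. */
class AuthenticatedEndpoint extends BaseEndpoint {
    private final String username;
    private final String password;

    AuthenticatedEndpoint(String name, String type, String uriPattern, String endpointUrl,
                          long defaultExpiry, String username, String password) {
        super(name, type, uriPattern, endpointUrl, defaultExpiry);
        this.username = username;
        this.password = password;
    }

    String getUsername() { return username; }
    String getPassword() { return password; }
}

final class CredentialLookup {
    /** What a provider might do with the endpoint it is handed during retrieval. */
    static String describeAuth(BaseEndpoint endpoint) {
        if (endpoint instanceof AuthenticatedEndpoint) {
            AuthenticatedEndpoint auth = (AuthenticatedEndpoint) endpoint;
            return "basic auth as " + auth.getUsername();
        }
        // Any other endpoint type carries no credentials in this sketch.
        return "anonymous";
    }
}
```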

Auto-Registering Providers and Endpoints

LDClient uses the Java ServiceLoader API to automatically register data providers and endpoints. To auto-register a data provider to be used by LDClient, create a file with the following name (usually in your resources directory):

META-INF/services/org.apache.marmotta.ldclient.api.provider.DataProvider

and add to this file a line with the fully-qualified class name of the data provider you implemented. Likewise, if you want to auto-register one or more endpoint configurations, create a file with the following name:

META-INF/services/org.apache.marmotta.ldclient.api.endpoint.Endpoint

and add to this file all fully-qualified class names of the endpoint configurations you want to register. Please keep in mind that (1) all classes need to have a zero-argument constructor, and (2) you should carefully choose priorities for endpoints you are auto-registering.
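The discovery mechanism itself is plain Java and can be sketched independently of LDClient. In the example below, PluggableProvider is a hypothetical service interface standing in for DataProvider; how LDClient iterates over and registers the loaded instances internally is not shown here. ServiceLoader reads the file META-INF/services/ followed by the interface's fully-qualified name and instantiates each listed class through its zero-argument constructor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Hypothetical service interface standing in for DataProvider. */
interface PluggableProvider {
    String getName();
}

final class ProviderDiscovery {
    /** Collects the names of all implementations registered via service files. */
    static List<String> discoverNames() {
        List<String> names = new ArrayList<>();
        // Each iteration lazily instantiates one class listed in the service file.
        for (PluggableProvider provider : ServiceLoader.load(PluggableProvider.class)) {
            names.add(provider.getName());
        }
        return names;
    }
}
```

If no service file for the interface is present on the classpath, the loop body never runs and the returned list is empty, which is why a missing or misnamed file tends to show up as a provider that silently never gets used rather than as an error.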

Support Modules

The LDClient library provides a number of support modules in the form of abstract base classes that can be used to implement typical cases of data providers. Currently, there are base classes for HTTP requests, for processing XML data, and for processing HTML data.

HTTP Module

The HTTP module is part of the LDClient Core. It offers support for retrieving HTTP resources (of any type) using the Apache HTTPClient library. Since HTTP is the most common way of accessing Linked Data resources, the LDClient Core offers advanced connection management for HTTP connections using a connection pool and keep-alive connections.

Implementing a data provider that uses HTTP to retrieve resource data requires subclassing the class AbstractHttpProvider. This class implements the retrieveResource(...) method of DataProvider, but requires subclasses to implement a parseResponse(...) method instead. Please see the Javadoc documentation for details.
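The division of labour described here is an instance of the template method pattern, which the following sketch illustrates. HttpProviderSketch is a hypothetical, heavily simplified base class and not AbstractHttpProvider itself; in particular, the parseResponse(...) parameters shown are assumptions, the HTTP request is replaced by canned in-memory content so that the example needs no network, and triples are represented as plain N-Triples-like strings.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical base class: fixed retrieval workflow, parsing left to subclasses. */
abstract class HttpProviderSketch {

    /** Template method that subclasses do not override. */
    final List<String> retrieveResource(String resourceUri) {
        List<String> triples = new ArrayList<>();
        try (InputStream in = openStream(resourceUri)) {
            parseResponse(resourceUri, in, triples);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return triples;
    }

    /** Stand-in for the pooled HTTP request; returns canned content here. */
    InputStream openStream(String resourceUri) {
        return new ByteArrayInputStream("Hello Linked Data".getBytes(StandardCharsets.UTF_8));
    }

    /** The only method a concrete provider has to supply in this sketch. */
    abstract void parseResponse(String resourceUri, InputStream in, List<String> triples)
            throws IOException;
}

/** Toy subclass: treats the first line of the body as the resource's title. */
final class PlainTextProvider extends HttpProviderSketch {
    @Override
    void parseResponse(String resourceUri, InputStream in, List<String> triples)
            throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String firstLine = reader.readLine();
        triples.add("<" + resourceUri + "> <http://purl.org/dc/terms/title> \"" + firstLine + "\" .");
    }
}
```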

XML Module

The XML module provides an abstract base class for all XML-based data sources, e.g. web services like YouTube or Vimeo offering their data in a proprietary XML format. It is based on the HTTP module, implements basic XML parsing functionality (using JDOM) and allows subclasses to specify mappings from XPath statements to RDF properties. Subclasses will need to provide this mapping as well as a method specifying how to build the request URL for the service based on the resource URI of the retrieved resource. To build a data provider based on an XML data source, you will need to include the following Maven artifact:

<dependency>
    <groupId>org.apache.marmotta</groupId>
    <artifactId>ldclient-provider-xml</artifactId>
    <version>3.3.0</version>
</dependency>

To actually use the XML provider with a data source, you need to create a subclass of AbstractXMLDataProvider and override the following methods:

  • Map<String,XPathValueMapper> getXPathMappings(String requestUrl): a mapping table mapping from RDF properties to XPath Value Mappers. Each entry in the map is evaluated in turn; in case the XPath expression yields a result, the property is added for the processed resource
  • List<String> getTypes(URI resource): should return a list of URIs that are added as RDF types (using rdf:type) to each retrieved resource
  • List<String> buildRequestUrl(String resourceUri, Endpoint endpoint): build the URLs used for accessing the actual resource data through a web service; in the simplest case, this can be the same as the resource URI, but in most real-world scenarios you will need to do some sort of rewriting
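The idea behind the XPath mapping table can be sketched with the XPath support built into the JDK. This is an illustration of the technique only: it uses javax.xml.xpath instead of JDOM, plain XPath strings instead of XPathValueMapper objects, and an invented XML format, so none of the names below belong to the LDClient API.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XPathMappingSketch {

    /** RDF property URI -> XPath expression, for a made-up <video> format. */
    static Map<String, String> mappings() {
        Map<String, String> mappings = new LinkedHashMap<>();
        mappings.put("http://purl.org/dc/terms/title", "/video/title");
        mappings.put("http://purl.org/dc/terms/description", "/video/description");
        mappings.put("http://purl.org/dc/terms/creator", "/video/owner/name");
        return mappings;
    }

    /** Evaluates each mapping in turn; a property is kept only if its XPath yields a value. */
    static Map<String, String> extract(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> properties = new LinkedHashMap<>();
            for (Map.Entry<String, String> mapping : mappings().entrySet()) {
                // evaluate(...) returns an empty string when nothing matches
                String value = xpath.evaluate(mapping.getValue(), doc);
                if (!value.isEmpty()) {
                    properties.put(mapping.getKey(), value);
                }
            }
            return properties;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath mappings", e);
        }
    }
}
```

For an input document that has a title and an owner name but no description, the result contains only the title and creator properties, mirroring the rule that a property is added only when its expression yields a result.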

Good examples of how to use the XML module can be found in the Vimeo and YouTube modules in the source code.

HTML Module

Like the XML module, the HTML module provides an abstract base class for all HTML-based data sources (simple web pages or also more complex web applications). It is for example used in the PHPBB module to access posts and threads in an online forum. It implements basic HTML parsing functionality, even for messy HTML, using JSoup. Since XPath is usually not very convenient for HTML documents, mappings from element values to RDF properties in the HTML module are specified using CSS selectors similar to jQuery. Subclasses will need to provide this mapping as well as a method specifying how to build the request URL for the service based on the resource URI of the retrieved resource. To build a data provider based on an HTML data source, you will need to include the following Maven artifact:

<dependency>
    <groupId>org.apache.marmotta</groupId>
    <artifactId>ldclient-provider-html</artifactId>
    <version>3.3.0</version>
</dependency>

To actually use the HTML provider with a data source, you need to create a subclass of AbstractHTMLDataProvider and override the following methods:

  • Map<String, JSoupMapper> getMappings(String resource, String requestUrl): a mapping table mapping from RDF properties to JSoup element selections in the HTML document. Each entry in the map is evaluated in turn; in case the CSS expression yields a result, the property is added for the processed resource
  • List<String> getTypes(URI resource): should return a list of URIs that are added as RDF types (using rdf:type) to each retrieved resource
  • List<String> buildRequestUrl(String resourceUri, Endpoint endpoint): build the URLs used for accessing the actual resource data through a web service; in the simplest case, this can be the same as the resource URI, but in most real-world scenarios you will need to do some sort of rewriting
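The kind of rewriting that buildRequestUrl typically performs can be sketched as follows. The forum URL scheme below is invented for illustration (it is not taken from the PHPBB module), and the Endpoint argument of the real method is left out to keep the example self-contained.

```java
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RequestUrlSketch {

    // Hypothetical resource URI scheme: http://forum.example.org/thread/<numeric id>
    private static final Pattern THREAD_URI =
            Pattern.compile("^http://forum\\.example\\.org/thread/(\\d+)$");

    /** Rewrites a resource URI into the URL of the page that actually holds its data. */
    static List<String> buildRequestUrl(String resourceUri) {
        Matcher matcher = THREAD_URI.matcher(resourceUri);
        if (matcher.matches()) {
            // Assumed page URL layout of the made-up forum software.
            return Collections.singletonList(
                    "http://forum.example.org/viewtopic.php?t=" + matcher.group(1));
        }
        // Simplest case: the resource URI is also the request URL.
        return Collections.singletonList(resourceUri);
    }
}
```

Returning a list rather than a single string leaves room for providers that need several requests to assemble the data for one resource.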

Currently, the only existing example is the PHPBB module.