Efficient Support of Analytical SPARQL Queries in
Federated Systems

Abstract

As more and more data is becoming accessible on the Web via SPARQL endpoints, we can often find related data at multiple endpoints. Hence, diverse query processing strategies have been developed to overcome challenges such as links between sources, heterogeneity of the data sources, the need to support implicit (derived) information and reasoning, etc. However, few of these techniques have been designed to support analytical SPARQL queries involving grouping and aggregation, which are an integral part of online analytical processing (OLAP) systems and enable interesting insights and analyses at a large scale. Hence, in this paper we propose LITE (OLAP-style AnalytIcs in a FederaTion of SPARQL Endpoints), a federated system for computing aggregate SPARQL queries over a federation of SPARQL endpoints addressing the above-mentioned challenges. In particular, LITE is able to integrate the diverse schemas of SPARQL endpoints and provide access to the data via OLAP-style hierarchies to enable uniform, efficient, and powerful analytics. The experimental evaluation shows that LITE significantly outperforms the state of the art.

Authors: Dilshod Ibragimov, Katja Hose, Torben Bach Pedersen, and Esteban Zimanyi


SSB Queries

The Star Schema Benchmark (SSB) defines 13 queries, which instantiate 4 "prototypical" queries with different selectivity factors. A brief description of the queries is given below. We converted all 13 queries into SPARQL.

Query prototypes, query numbers, and query parameters for the various selectivities:

Prototype 1. Amount of revenue increase that would have resulted from eliminating certain company-wide discounts.
    Q1   Discounts 1, 2, and 3 for quantities less than 25 shipped in 1993
    Q2   Discounts 1, 2, and 3 for quantities less than 25 shipped in 01/1993
    Q3   Discounts 5, 6, and 7 for quantities less than 35 shipped in week 6 of 1993

Prototype 2. Revenue for some product classes, for suppliers in a certain region, grouped by more restrictive product classes and all years.
    Q4   Revenue for 'MFGR#12' category, for suppliers in America
    Q5   Revenue for brands 'MFGR#2221' to 'MFGR#2228', for suppliers in Asia
    Q6   Revenue for brand 'MFGR#2239', for suppliers in Europe

Prototype 3. Revenue for customers and suppliers in a certain region and time period, grouped by customer location, supplier location, and year.
    Q7   For Asian suppliers and customers in 1992-1997
    Q8   For US suppliers and customers in 1992-1997
    Q9   For suppliers and customers in specific UK cities in 1992-1997
    Q10  For suppliers and customers in specific UK cities in 12/1997

Prototype 4. Aggregate profit, measured by subtracting supply cost from revenue.
    Q11  For American suppliers and customers, for manufacturers 'MFGR#1' or 'MFGR#2'
    Q12  For American suppliers and customers, for manufacturers 'MFGR#1' or 'MFGR#2', in 1997-1998
    Q13  For American customers and US suppliers, for category 'MFGR#14', in 1997-1998
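
To make the grouping and aggregation behind these prototypes concrete, the following is a minimal Python sketch of what Q4 computes: revenue for one product category and one supplier region, grouped by brand and year. The field names and the toy observations are hypothetical and chosen only for illustration; they are not the benchmark's actual identifiers or data, and the benchmark queries themselves are expressed in SPARQL.

```python
from collections import defaultdict

# Hypothetical toy observations; field names and values are illustrative only.
observations = [
    {"revenue": 100, "category": "MFGR#12", "brand": "MFGR#1201",
     "supplier_region": "AMERICA", "year": 1992},
    {"revenue": 50, "category": "MFGR#12", "brand": "MFGR#1201",
     "supplier_region": "AMERICA", "year": 1992},
    {"revenue": 70, "category": "MFGR#12", "brand": "MFGR#1202",
     "supplier_region": "AMERICA", "year": 1993},
    # Excluded by the region filter.
    {"revenue": 999, "category": "MFGR#12", "brand": "MFGR#1201",
     "supplier_region": "ASIA", "year": 1992},
    # Excluded by the category filter.
    {"revenue": 888, "category": "MFGR#13", "brand": "MFGR#1301",
     "supplier_region": "AMERICA", "year": 1992},
]


def revenue_by_year_and_brand(rows, category, supplier_region):
    """Sum revenue per (year, brand) for one category and one supplier region."""
    totals = defaultdict(int)
    for row in rows:
        if row["category"] == category and row["supplier_region"] == supplier_region:
            totals[(row["year"], row["brand"])] += row["revenue"]
    # Sort by (year, brand) so the result order is deterministic.
    return dict(sorted(totals.items()))


q4_result = revenue_by_year_and_brand(observations, "MFGR#12", "AMERICA")
```

Q5 and Q6 follow the same shape and differ only in the filter (a brand range or a single brand) and the supplier region.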

Schema

The global and local schemas, the mappings, the example query over the global schema and its converted counterpart can be found here.
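
As a rough illustration of what a mapping between a global and a local schema can look like, the Python sketch below expands one triple pattern over a global predicate into the triple patterns needed at a given endpoint. Everything in it is an assumption made for illustration: the prefixes g: and l:, the predicate names, the endpoint keys, and the rewriting function are hypothetical and do not reproduce the actual mappings or the conversion procedure used by LITE.

```python
# Hypothetical mappings: global predicate -> chain of local predicates per endpoint.
# A chain longer than one means the endpoint stores an intermediary node.
MAPPINGS = {
    "africa": {"g:part": ["l:partNode", "l:part"]},
    "asia": {"g:part": ["l:part"]},
}


def rewrite_pattern(subject, predicate, obj, endpoint):
    """Expand one global triple pattern into local triple patterns."""
    # Predicates without a mapping are passed through unchanged.
    path = MAPPINGS[endpoint].get(predicate, [predicate])
    patterns = []
    current = subject
    for i, local_predicate in enumerate(path):
        is_last = i == len(path) - 1
        # Intermediate hops bind fresh variables; the last hop binds the object.
        target = obj if is_last else f"?via{i}"
        patterns.append(f"{current} {local_predicate} {target} .")
        current = target
    return patterns


africa_patterns = rewrite_pattern("?obs", "g:part", "?part", "africa")
asia_patterns = rewrite_pattern("?obs", "g:part", "?part", "asia")
```

In this sketch the same global pattern becomes a two-step path at the hypothetical "africa" endpoint and a single pattern at the "asia" endpoint, which is the kind of difference a query over the global schema has to absorb.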

Datasets

The data in the SSB benchmark represent sales in a retail company; each transaction is defined as an observation described by 4 dimensions (Parts, Customers, Suppliers, and Dates). We translated the data into an RDF representation, as illustrated in the figure below. An observation is connected to its dimensions (objects) via certain predicates. The Suppliers and Customers dimensions contain information about the city, country, and world region of each supplier/customer. We linked each city and country in the dataset to its counterpart in the GeoNames dataset using the owl:sameAs predicate, thus showing how external hierarchies can be added.

To establish a federated setup, we divided the data among 5 SPARQL endpoints, each storing the observations for one of the world regions defined in SSB: Africa, America, Asia, Europe, and Middle East. The schema at each SPARQL endpoint was made slightly different from the other schemas by generating, for every endpoint, an intermediary graph node between the observation and either one of the dimensions or the revenue value (dashed lines in the figure). For example, the data schema in the Africa endpoint differs in the Parts dimension, and so on.
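
The partitioning and the per-endpoint schema variation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the generator used for the experiments: the ex: names, the "region" key used to route an observation, the blank-node labels, and the assignment of an indirect predicate to every endpoint other than Africa are all hypothetical.

```python
REGIONS = ["AFRICA", "AMERICA", "ASIA", "EUROPE", "MIDDLE EAST"]

# Which predicate is routed through an intermediary node at each endpoint.
# Only the Africa/Parts pairing comes from the description; the rest is assumed.
INDIRECT_PREDICATE = {
    "AFRICA": "ex:part",
    "AMERICA": "ex:customer",
    "ASIA": "ex:supplier",
    "EUROPE": "ex:date",
    "MIDDLE EAST": "ex:revenue",
}


def to_triples(obs, indirect_predicate):
    """Turn one observation into triples, adding an intermediary node if needed."""
    subject = obs["id"]
    triples = []
    for predicate, value in obs["links"].items():
        if predicate == indirect_predicate:
            # Hypothetical blank-node label derived from the observation id.
            node = f"_:via_{subject.split(':')[-1]}"
            triples.append((subject, predicate, node))
            triples.append((node, "ex:value", value))
        else:
            triples.append((subject, predicate, value))
    return triples


def build_federation(observations):
    """Distribute observations over one triple list per world region."""
    endpoints = {region: [] for region in REGIONS}
    for obs in observations:
        region = obs["region"]
        endpoints[region].extend(to_triples(obs, INDIRECT_PREDICATE[region]))
    return endpoints


observations = [
    {"id": "ex:obs1", "region": "AFRICA",
     "links": {"ex:part": "ex:part7", "ex:revenue": "1200"}},
    {"id": "ex:obs2", "region": "ASIA",
     "links": {"ex:part": "ex:part9", "ex:revenue": "300"}},
]
federation = build_federation(observations)
```

With these two toy observations, the Africa endpoint reaches the part through an extra node, while the Asia endpoint links the observation to its part directly.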

SSB RDF Schema

Copyright © 2014 - All Rights Reserved - EXTBI