Abstract
As more and more data is becoming accessible on the Web via SPARQL endpoints, we can often find related data at multiple endpoints. Hence, diverse query processing strategies have been developed to overcome challenges such as links between sources, heterogeneity of the data sources, the need to support implicit (derived) information and reasoning, etc. However, not many of these techniques have been designed to support analytical SPARQL queries involving grouping and aggregation, which are an integral part of online analytical processing (OLAP) systems and enable interesting insights and analyses at a large scale.
Hence, in this paper we propose LITE (OLAP-style AnalytIcs in a FederaTion of SPARQL Endpoints), a federated system for computing aggregate SPARQL queries over a federation of SPARQL endpoints that addresses the above-mentioned challenges. In particular, LITE is able to integrate the diverse schemas of SPARQL endpoints and provide access to the data via OLAP-style hierarchies to enable uniform, efficient, and powerful analytics. The experimental evaluation shows that LITE significantly outperforms the state of the art.
Authors: Dilshod Ibragimov, Katja Hose, Torben Bach Pedersen, and Esteban Zimanyi
SSB Queries
The Star Schema Benchmark (SSB) defines 13 queries, which are instances of 4 "prototypical" queries with different selectivity factors.
A brief description of the queries is given in the table below.
We converted all 13 queries into SPARQL.
| Query Prototype | Query No. | Query Parameters for Various Selectivities |
|---|---|---|
| Prototype 1. Amount of revenue increase that would have resulted from eliminating certain company-wide discounts. | Q1 | Discounts 1, 2, and 3 for quantities less than 25 shipped in 1993 |
| | Q2 | Discounts 1, 2, and 3 for quantities less than 25 shipped in 01/1993 |
| | Q3 | Discounts 5, 6, and 7 for quantities less than 35 shipped in week 6 of 1993 |
| Prototype 2. Revenue for some product classes, for suppliers in a certain region, grouped by more restrictive product classes and all years. | Q4 | Revenue for 'MFGR#12' category, for suppliers in America |
| | Q5 | Revenue for brands 'MFGR#2221' to 'MFGR#2228', for suppliers in Asia |
| | Q6 | Revenue for brand 'MFGR#2239', for suppliers in Europe |
| Prototype 3. Revenue for some product classes, for suppliers in a certain region, grouped by more restrictive product classes and all years. | Q7 | For Asian suppliers and customers in 1992-1997 |
| | Q8 | For US suppliers and customers in 1992-1997 |
| | Q9 | For suppliers and customers in specific UK cities in 1992-1997 |
| | Q10 | For suppliers and customers in specific UK cities in 12/1997 |
| Prototype 4. Aggregate profit, measured by subtracting supply cost from revenue. | Q11 | For American suppliers and customers, for manufacturers 'MFGR#1' or 'MFGR#2' |
| | Q12 | For American suppliers and customers, for manufacturers 'MFGR#1' or 'MFGR#2', in 1997-1998 |
| | Q13 | For American customers and US suppliers, for category 'MFGR#14', in 1997-1998 |
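To make Prototype 1 concrete, the following minimal Python sketch computes the kind of aggregate that Q1 asks for. It assumes the usual SSB formulation of this prototype (the sum of extended price times discount over the selected rows); the `LineOrder` record, its field names, and the sample values are hypothetical and are not taken from the benchmark data or from our SPARQL queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineOrder:
    """Hypothetical in-memory stand-in for one SSB observation."""
    year: int            # year of the linked date dimension member
    quantity: int
    discount: int        # discount in percentage points, as in SSB
    extended_price: int


def q1_revenue(rows, year, discount_lo, discount_hi, max_quantity):
    """Sum extended_price * discount over the rows that pass the Q1 filters."""
    return sum(
        r.extended_price * r.discount
        for r in rows
        if r.year == year
        and discount_lo <= r.discount <= discount_hi
        and r.quantity < max_quantity
    )


rows = [
    LineOrder(year=1993, quantity=10, discount=2, extended_price=1000),  # selected
    LineOrder(year=1993, quantity=30, discount=2, extended_price=5000),  # quantity too high
    LineOrder(year=1993, quantity=5, discount=6, extended_price=2000),   # discount out of range
    LineOrder(year=1994, quantity=5, discount=1, extended_price=3000),   # wrong year
    LineOrder(year=1993, quantity=24, discount=3, extended_price=400),   # selected
]

# Q1 parameters: discounts 1 to 3, quantities below 25, year 1993.
q1 = q1_revenue(rows, year=1993, discount_lo=1, discount_hi=3, max_quantity=25)
print(q1)
```

Q2 and Q3 would differ only in the date filter and the parameter values, which is what gives the three queries of a prototype their different selectivities.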
Schema
The global and local schemas, the mappings, the example query over the global schema and its converted counterpart can be found here.
Datasets
The data in the SSB benchmark represent sales in a retail company; each transaction is defined as an observation described by 4 dimensions (Parts, Customers, Suppliers, and Dates).
We translated the data into an RDF representation, as illustrated in the figure below. An observation is connected to its dimensions (objects) via certain predicates. The Suppliers and Customers dimensions contain information about the city, country, and world region of each supplier/customer. We linked each city and country present in the dataset to its counterpart in the GeoNames dataset using the owl:sameAs predicate, thus showing how external hierarchies can be added. To establish a federated setup, we divided the data among 5 SPARQL endpoints, each storing the observations for one of the world regions defined in SSB: Africa, America, Asia, Europe, and Middle East. The schema in each SPARQL endpoint was made slightly different from the other schemas by generating, for every SPARQL endpoint, an intermediary graph node between the observation and either one of the dimensions or the revenue value (dashed lines in the figure). For example, the data schema in the Africa endpoint differs in the Parts dimension, and similarly for the other endpoints.
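The federated setup described above can be illustrated with a small Python sketch. It is not LITE's implementation: the dictionary-based "endpoints", the `revenueNode` key standing in for an intermediary graph node, the choice of which endpoint uses it, and all sample values are assumptions made for illustration. The sketch partitions observations by world region, wraps the revenue value behind an extra node in one endpoint, and lets a mediator normalize both local schemas before merging per-endpoint partial sums.

```python
from collections import defaultdict

REGIONS = ["Africa", "America", "Asia", "Europe", "Middle East"]


def partition_by_region(observations):
    """Assign each observation to the endpoint of its world region."""
    endpoints = {region: [] for region in REGIONS}
    for obs in observations:
        endpoints[obs["region"]].append(obs)
    return endpoints


def wrap_revenue(obs):
    """Simulate a local schema with an intermediary node before the revenue value."""
    wrapped = {k: v for k, v in obs.items() if k != "revenue"}
    wrapped["revenueNode"] = {"value": obs["revenue"]}
    return wrapped


def read_revenue(obs):
    """Resolve revenue under either local schema (direct or via the extra node)."""
    if "revenueNode" in obs:
        return obs["revenueNode"]["value"]
    return obs["revenue"]


def partial_sums(endpoint_rows):
    """Aggregate revenue per year inside one endpoint."""
    sums = defaultdict(int)
    for obs in endpoint_rows:
        sums[obs["year"]] += read_revenue(obs)
    return dict(sums)


def merge(partials):
    """Combine per-endpoint partial sums into the global result."""
    total = defaultdict(int)
    for part in partials:
        for year, value in part.items():
            total[year] += value
    return dict(total)


observations = [
    {"region": "Africa", "year": 1993, "revenue": 100},
    {"region": "Asia", "year": 1993, "revenue": 250},
    {"region": "Asia", "year": 1994, "revenue": 40},
    {"region": "Europe", "year": 1994, "revenue": 60},
]

endpoints = partition_by_region(observations)
# Assumption for illustration: only the Asia endpoint uses the wrapped schema.
endpoints["Asia"] = [wrap_revenue(o) for o in endpoints["Asia"]]

result = merge(partial_sums(rows) for rows in endpoints.values())
print(result)
```

Merging partial results this way is valid for SUM because it is a decomposable aggregate; an aggregate such as AVG would need each endpoint to return both a sum and a count.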