Optimizing RDF Data Cubes for Efficient Processing of Analytical Queries

We create the snowflake pattern, star pattern, and fully denormalized pattern, and show how these patterns can be used to improve query times over the RDF version of the TPC-H dataset.

Repositories

The implementation and the code for running the experiments can be accessed at the two GitHub projects linked below.

  • SWOD algorithm implementation (GitHub)

  • Tools for running the experiments (GitHub)

SWOD Implementation

This program generates a series of SPARQL CONSTRUCT queries that create the snowflake pattern and fully denormalized pattern cubes.

This Java program uses Apache Maven to manage its dependencies.
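A typical Maven build and run might look like the sketch below; the JAR name is an assumption, so check the project's pom.xml and documentation for the actual artifact name and any required arguments.

    # Build the SWOD implementation from the project root
    mvn clean package

    # Run the query generator; the JAR name below is a placeholder,
    # use the artifact actually produced in target/
    java -jar target/swod.jar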

The SWOD Tools project already contains the generated SPARQL queries, so it is not necessary to run the SWOD program in order to run the experiments.

SWOD Tools

These tools allow you to generate the TPC-H data as triples (generate.sh), load the data into Virtuoso and Apache Jena (load.sh), run the TPC-H queries on the triple stores (query.sh), and analyse the results by comparing the query runs (extractQueryTimes.py, compareResults.py).

All scripts are written in Bash and Python, which might cause problems on Windows systems.

The Bash scripts take a series of "sources" as input; these modular configuration files are located in the "source" folder. Be aware that these configuration files need to be set up manually before running any of the programs.
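As a rough illustration, a machine-specific source file might define a handful of shell variables like the hypothetical sketch below; the variable names are placeholders, so use the existing files in the source folder as a template for the actual keys.

    # source/machine/mymachine.source  (hypothetical example)
    # The actual variable names are defined by the scripts in this repository.
    TPCH_SCALE_FACTOR=1                 # scale factor used when generating data
    DATA_DIR=/data/tpch-rdf             # where the generated triples are stored
    VIRTUOSO_HOME=/usr/local/virtuoso   # Virtuoso installation directory
    JENA_HOME=/opt/apache-jena          # Apache Jena installation directory

The scripts are then pointed at one or more of these source files when they are run, as illustrated in the workflow sketch further below.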

The Python scripts have a help flag (--help) that displays the allowed parameters.
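For example, assuming a Python interpreter is available on your PATH:

    python extractQueryTimes.py --help
    python compareResults.py --help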

Workflow

  1. Download and install the following programs

  2. Create configuration files (source files) that match your system (source/machine/) and desired configuration (scale factor, etc.)

  3. Generate or download the dataset

    • Generating the data requires Virtuoso for running the CONSTRUCT queries

  4. Install Virtuoso or Apache Jena

  5. Load the data into Jena TDB or Virtuoso using the appropriate configuration files

  6. Change the query-mix configuration (source/) to match the queries you want to execute, then run the querymix.sh script to propagate these settings

  7. Run the query.sh script with the appropriate configuration files to start the experiments

  8. Use extractQueryTimes.py on the generated log files (logs/) to extract and aggregate the query times

  9. The experiments can now be compared using the compareResults.py script (a rough end-to-end sketch of these steps follows below)
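Assuming the source files have been set up as described above, the workflow might look roughly like the following on the command line; the source file names and argument lists are placeholders, and the exact arguments each script expects may differ.

    # 3. Generate the TPC-H triples (Virtuoso is needed for the CONSTRUCT queries)
    ./generate.sh source/machine/mymachine.source

    # 5. Load the generated data into Virtuoso or Jena TDB
    ./load.sh source/machine/mymachine.source

    # 6. Propagate the query-mix settings after editing the files in source/
    ./querymix.sh

    # 7. Run the experiments
    ./query.sh source/machine/mymachine.source

    # 8.-9. Extract the query times from logs/ and compare the experiments;
    #       both scripts document their expected parameters via --help
    python extractQueryTimes.py --help
    python compareResults.py --help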

Feel free to post bug reports and ask questions.

Copyright © 2014 - All Rights Reserved - EXTBI