FedBench: A Benchmark Suite for Federated
Semantic Data Query Processing

Michael Schmidt¹, Olaf Görlitz², Peter Haase¹, Günter Ladwig³,
Andreas Schwarte¹, and Thanh Tran³

¹ fluid Operations AG, Walldorf, Germany
² Institute for Web Science and Technology, University of Koblenz-Landau, Germany
³ Institute AIFB, Karlsruhe Institute of Technology, Germany
Abstract. In this paper we present FedBench, a comprehensive bench-
mark suite for testing and analyzing the performance of federated query
processing strategies on semantic data. The major challenge lies in the
heterogeneity of semantic data use cases, where applications may face
different settings at both the data and query level, such as varying data
access interfaces, incomplete knowledge about data sources, availability
of different statistics, and varying degrees of query expressiveness. Ac-
counting for this heterogeneity, we present a highly flexible benchmark
suite, which can be customized to accommodate a variety of use cases
and compare competing approaches. We discuss design decisions, high-
light the flexibility in customization, and elaborate on the choice of data
and query sets. The practicability of our benchmark is demonstrated by
a rigorous evaluation of various application scenarios, where we indicate
both the benefits as well as limitations of the state-of-the-art federated
query processing strategies for semantic data.
1 Introduction
Driven by the success of the Linking Open Data initiative, the amount of se-
mantic data that is published on the Web in the form of RDF is increasing at
a tremendous pace. While offering great potentials for innovative applications
that integrate heterogeneous data from different sources, on the data manage-
ment side, this development comes along with a variety of new challenges, last
but not least due to the sheer amount of data that may be utilized by such appli-
cations. Most research contributions in the context of RDF data processing have
focused on the problem of query evaluation over local, centralized repositories
(see e.g. [2, 24, 19]) – and for these scenarios different benchmarks have been pro-
posed [11, 6, 21]. Accounting for the decentralized nature of the Semantic Web,
though, one can observe an ongoing shift from localized to federated semantic
data processing, where independent endpoints provide data, and semantic data
applications utilize both local repositories and remote data sources at the same
time to satisfy their information needs. In response to this paradigm shift, dif-
ferent federated RDF processing strategies targeted at different use cases and
application scenarios have been proposed [15, 14, 16, 9, 17].
With distributed semantic data processing becoming increasingly important,
we identify a clear need for a benchmark tailored to the problem of federated
semantic data query processing. We employ a broad definition of semantic data,
which includes Linked Data sources, datasets, and ontologies represented in
RDF. The main challenge here lies in the diversity and heterogeneity of se-
mantic data use cases, and the demands they pose to a benchmark: First, we
can observe heterogeneity at data level along several dimensions: applications
are facing different physical distribution of datasets, different interfaces for data
access, incomplete knowledge about the existence of entry points into the Web of
data, and different types of metadata and statistics. Apart from the challenges
at data level, applications may also exhibit different demands w.r.t. query eval-
uation, including aspects such as query languages, expressiveness, and ranking.
The overall setting in which a concrete semantic data application is situated
may have a severe impact on query processing strategies. Ultimately, there cannot
exist a single “one-size-fits-all” benchmark to measure each and every aspect of
an application or to compare the performance of orthogonal federated query
processing strategies. Hence, taking an existing benchmark and distributing its
data across several endpoints may cover some, but not all challenges that arise
in the context of federated semantic data processing. What is needed instead is
a collection of customizable benchmark scenarios that accommodate a multitude
of dimensions as well as essential challenges and from which one can choose
data sets, queries, and settings that fit the specific needs.
Contributions. (1) Based on a review of federated query processing scenar-
ios our community has dealt with so far, we discuss orthogonal dimensions at
data and query level that can be used to classify existing approaches. (2) Ac-
counting for the heterogeneity of these dimensions, we present a rich collection
of queries, data, and data statistics, which can be flexibly combined and cus-
tomized. This makes our benchmark generic enough to cover a broad range of
use cases, such as testing the performance of the underlying federation approach,
data access mechanisms, static optimization based on metadata and statistics,
queries with varying characteristics, and many more. All queries and datasets
were carefully chosen to reflect a variety of domains, query patterns, and typi-
cal challenges in query processing (in particular in distributed settings). While
some of the queries were specifically designed to test and vary in these aspects,
others were taken from prototypical, domain-specific use cases (e.g. in the Life
Science domain) built by participants in other projects. (3) In order to show
the flexibility and illustrate the broad range of scenarios covered by FedBench,
we provide results for selected scenarios and implementations, identifying areas
where ongoing work is required. Our results are published in a Wiki, where we
also maintain data, statistics, queries, and scenarios. We invite researchers and
benchmark users to customize and extend the benchmark suite according to
their own needs.
We point out that the resulting benchmark suite is available online at
http://fbench.googlecode.com/, including a flexible and extensible Open Source
Java-based evaluation framework.
Related Work. Apart from benchmarks that target structural properties of
RDF schemas (such as [18]), several benchmarks for RDF, RDFS, and OWL
data processing have been proposed. The Lehigh University Benchmark [11],
for instance, has been designed to test the reasoning capabilities of systems
over a single ontology. The SPARQL-specific, use-case driven Berlin SPARQL
Benchmark [6] comes with a set of queries implementing meaningful requests on
top of an eCommerce scenario modeled in RDF. Complementing it, the SPARQL
Performance Benchmark (SP²Bench) [21] puts a stronger focus on language-
specific features of the SPARQL query language, addressing optimization in
complex scenarios. None of the above benchmarks considers federation at data
level, nor do they provide data collections consisting of multiple interlinked datasets.
To the best of our knowledge, the only work addressing the latter issue was
our previous work in [13], which, focusing on selected federation approaches,
served as a starting point for designing FedBench. Going far beyond this initial
work, in this paper we present a holistic benchmark suite, including a variety
of new data and query sets, new scenarios such as Linked Data access (i.e., via
HTTP requests), an automated evaluation framework (with support for various
metrics like counting the number of requests, automated evaluation, interfaces for
connecting new systems etc.), a discussion and classification of state-of-the-art
approaches, a discussion of statistics, as well as novel experiments and findings.
Our benchmark has been designed to be compatible with the current state-of-
the-art in Linked Data query processing. In particular, in our benchmark we use
queries defined in SPARQL 1.0, without requiring specific extensions for explicit
specification of the endpoint services as proposed by the SPARQL 1.1 federation
extensions. To our knowledge, there is currently no public implementation of the
SPARQL 1.1 federation extension [1]. Examples of systems expected to support
SPARQL 1.1 federation in future releases include SPARQL DQP [5], Sesame [7],
and FedX [23]. While in this paper we focus on SPARQL 1.0 features, on our
project page we also provide SPARQL 1.1 versions of the benchmark queries,
ready to be used as soon as implementations become available.
An analysis and comparison of the structure of different semantic data sets
and benchmarks from the Linked Data domain is presented in [8]. It shows
that artificial datasets used in benchmarks are typically highly structured, while
Linked Data sources are less structured. The authors conclude that benchmarks should not
solely rely on artificial data but also consider real-world datasets. Taking up this
discussion, in Section 3.1 we will present an analysis and comparison of our data
sets using the methods proposed in [8].
Outline. In the next section, we identify and discuss essential dimensions of fed-
erated semantic data processing, which form the groundwork for the benchmark
suite presented in this paper. Next, in Section 3 we motivate benchmark design
goals and present the suite in detail, namely benchmark datasets (Section 3.1),
covering aspects such as their properties and associated statistics, benchmark
queries and their properties (Section 3.2), as well as the benchmark driver (Sec-
tion 3.3). We turn towards a comprehensive evaluation of concrete application
scenarios in Section 4 and wrap up with some concluding remarks in Section 5.
2 Heterogeneity of Semantic Data Use Cases
To date, several approaches to semantic data query processing have been studied,
comprising both centralized and decentralized settings. Centralized approaches,
where all data is periodically crawled and stored locally, come with the merits of
high controllability and reliability. While Google and the likes have shown that
this is a successful approach for dealing with Web documents that primarily
comprise text, it has also been adopted for semantic data, where projects like
FactForge (http://ontotext.com/factforge/) collect large amounts of RDF(S) data in centralized stores.
If federation is set up dynamically or the underlying sources exhibit high
update rates, though, it may not be affordable to keep imported semantic data
up-to-date. Then, decentralized, federated query processing strategies are re-
quired, typically implemented on top of public SPARQL endpoints or directly
on Linked Data accessible via HTTP requests. In the best case, not only the ex-
istence of data sources but also detailed statistics can be assumed and exploited
for optimization [14]. A stronger relaxation is to assume only partial knowledge
about the sources, e.g. past information that is stored and employed for future
runs, thereby obtaining more entry points and hints to explore unknown sources
in a goal-directed fashion. In fact, it has been shown that even little knowl-
edge obtained at runtime can be used to adaptively correct the query processing
strategy chosen at compile time, to improve performance [16]. In the worst case,
engines have to deal with federated scenarios where no data and knowledge are
available, so queries have to be processed based on iterative URI lookups [15].
In summary, previous approaches reveal several dimensions along the data
and query level, which we use to characterize approaches. They determine the
challenges, including the major issues of centralized vs. decentralized process-
ing and knowledge about datasets discussed above. At data level, we identify
heterogeneity along the following dimensions:
(D1) Physical Distribution: Federated query processing systems may either
access and process global data from the Web, process locally stored data
sources, or combine both paradigms, connecting local with global data.
(D2) Data Access Interface: Semantic data may be accessible through dif-
ferent interfaces. There may be native repositories, SPARQL endpoints, and
Linked Data accessible through HTTP requests. These interfaces provide
different access paths to the data, ranging from iterators at data level and URI
lookups to expressive queries in different languages.
(D3) Data Source Existence: In particular in Linked Data scenarios, not all
sources may be known a priori. Hence, applications may have only few entry
points into the data graph, which can be used to iteratively deep-dive by
exploring links (see for instance the setting described in [16]).
(D4) Data Statistics: In the best case, advanced statistical information about
properties, counts, and distributions in the form of histograms for all data
sets is available; in the worst case – in particular if data is not stored locally –
little or no information about the data sources may be given.
The concrete setting an application faces at data level – i.e., the classification
within dimensions (D1)–(D4) – implies challenges in data processing and imposes
an upper bound on the efficiency in query processing: applications built on top
of local repositories exploiting detailed statistical knowledge for query optimiza-
tion, for instance, are generally faster than applications that rely on federated
Linked Data accessible via HTTP lookups, where network delay and incomplete
knowledge about data sets impose hard limits on query efficiency.
Apart from the challenges at data level, applications may also face differ-
ent challenges at query level. Like the dimensions at data level, also those at
query level drive the challenges behind semantic data applications and should
be covered in a benchmark. In particular, we identify the following dimensions.
(Q1) Query Language: The expressiveness of the query language needed by
applications may vary from case to case: while some applications get around
with simple conjunctive queries, others may rely on the full expressive power
of RDF query languages, such as the de facto standard SPARQL [20, 22].
(Q2) Result Completeness: Certain applications may rely on complete re-
sults, while others cannot afford them when responsiveness is the first priority. In
particular in Linked Data scenarios, where complete knowledge cannot be
assumed (so that, beginning from some entry points, further sources have to be
discovered via online link traversal), not all data sources may be found [15, 16].
(Q3) Ranking: Applications may be interested in queries that enable ranking
according to some predefined metrics, or maybe only in top-k results.
3 FedBench: Benchmark Description
In order to support benchmarking of the different scenarios that emerge along
all the dimensions, FedBench consists of three components, all of which can
be customized and extended to fit the desired scenario: (i) multiple datasets,
(ii) multiple query sets, and (iii) a comprehensive evaluation framework. We
first elaborate on the datasets and statistics (addressing dimension (D4)), then
present the queries (addressing dimensions (Q1)–(Q3)), and conclude with a
discussion of our evaluation framework, which addresses dimensions (D1)–(D3).
3.1 Benchmark Data
Accounting for the heterogeneity of semantic data use cases, we provide three
data collections, each consisting of a number of interlinked datasets. The data
collections have been selected to represent both real-world and artificial data
federations over multiple representative semantic datasets. The collections differ
in size, coverage, and types of interlinkage. Two of them are subsets of the Linked
Open Data cloud: The first spans different domains of general interest, represent-
ing typical scenarios of combining cross-domain data with heterogeneous types
of interlinkage; the second contains datasets from the Life Science area, repre-
senting a federation scenario in a very domain-specific setting. Additionally, we
use a partitioned synthetic data collection, whose advantage lies in the ability to
simulate federations of varying size with well-defined characteristics of the data.
General Linked Open Data Collection. This first data collection consists of
datasets from different domains: DBpedia is a central hub in the Linked Data
world, containing structured data extracted from Wikipedia. Many datasets are
linked to DBpedia instances. GeoNames provides information about geographic
entities like countries and cities. Jamendo is a music database containing infor-
mation about artists, records, tracks, publishers and publication dates. Linked-
MDB exhibits details about films, genres, actors, producers, etc. and connects
its instances to the corresponding DBpedia entities. The New York Times
dataset contains about 10,000 subject headings covering different topics, which
are linked with people, organizations, and locations. Finally, the Semantic
Web Dog Food dataset provides information about Semantic Web conferences
of the past years, including paper descriptions, authors, and so on.
Life Science Data Collection. In this collection, we again included the DBpe-
dia subset from the General Linked Open Data dataset as a central hub. KEGG
(Kyoto Encyclopedia of Genes and Genomes) contains data about chemical com-
pounds and reactions, with a focus on information relevant for geneticists. It is
published in a number of separate modules; in the dataset we included the mod-
ules KEGG Drug, Enzyme, Reaction and Compound. Next, ChEBI (Chemical
Entities of Biological Interest) is a dictionary of molecular entities focused on
“small” chemical compounds, describing constitutionally or isotopically distinct
atoms, molecules, ions, ion pairs, radicals, radical ions, complexes, conformers,
etc. DrugBank is a bioinformatics and cheminformatics resource that combines
detailed drug (i.e. chemical, pharmacological, and pharmaceutical) data with
comprehensive drug target (i.e. sequence, structure, and pathway) information.
The datasets are linked in different ways: Drugbank is linked with DBpedia via
owl:sameAs statements, and other datasets are linked via special properties,
e.g. Drugbank links to KEGG via the property keggCompoundId. KEGG and
Drugbank can be joined via identifiers of the CAS database (Chemical Abstract
Service). Some links are implicit by the use of common identifiers in literal values,
e.g. the genericName in Drugbank corresponds to the title in ChEBI.
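To make such implicit, literal-valued links concrete, the following minimal Java sketch builds a SPARQL query that joins Drugbank drugs and ChEBI compounds sharing the same name literal. The prefix URIs and property names are illustrative placeholders chosen by us, not necessarily the datasets' exact vocabulary terms.

public class LiteralLinkExample {
    // Hedged sketch: join Drugbank and ChEBI entities via a shared name literal.
    // The namespaces and property names below are placeholders for illustration.
    public static void main(String[] args) {
        String query =
            "PREFIX drugbank: <http://example.org/drugbank/> \n" +   // placeholder namespace
            "PREFIX chebi:    <http://example.org/chebi/> \n" +      // placeholder namespace
            "SELECT ?drug ?compound WHERE { \n" +
            "  ?drug     drugbank:genericName ?name . \n" +          // literal on the Drugbank side
            "  ?compound chebi:title          ?name . \n" +          // same literal on the ChEBI side
            "}";
        System.out.println(query);
    }
}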
SP²Bench is a synthetic dataset generated by the SP²Bench data genera-
tor [21], which mirrors vital characteristics (such as power law distributions
or Gaussian curves) encountered in the DBLP bibliographic database. The data
generator provides a single dataset, from which we created a collection by clus-
tering by the types occurring in the dataset, finally obtaining 16 sub-datasets
(for persons, inproceedings, articles, etc.) that can be deployed independently in
a distributed scenario. The data collection consists of 10M triples in total.
Metadata and Statistics about data sources are important for identifying
suitable data sources for answering a given query, as well as for query optimiza-
tion. They can be used to parametrize the benchmark along dimension (D4).
Table 1: Basic Statistics of Datasets¹

Collection   Dataset          version     #triples  #subj.  #pred.  #obj.   #types  #links  strct.
Cross        DBpedia subset²  3.5.1       43.6M     9.50M   1063    13.6M   248     61.5k   0.19
Domain       NY Times         2010-01-13  335k      21.7k   36      192k    2       31.7k   0.73
             LinkedMDB        2010-01-19  6.15M     694k    222     2.05M   53      63.1k   0.73
             Jamendo          2010-11-25  1.05M     336k    26      441k    11      1.7k    0.96
             GeoNames         2010-10-06  108M      7.48M   26      35.8M   1       118k    0.52
             SW Dog Food      2010-11-25  104k      12.0k   118     37.5k   103     1.6k    0.43
Life         DBpedia subset²  3.5.1       43.6M     9.50M   1063    13.6M   248     61.5k   0.19
Sciences     KEGG             2010-11-25  1.09M     34.3k   21      939k    4       30k     0.92
             Drugbank         2010-11-25  767k      19.7k   119     276k    8       9.5k    0.72
             ChEBI            2010-11-25  7.33M     50.5k   28      772k    1       -       0.34
SP²Bench     SP²Bench 10M     v1.01       10M       1.7M    77      5.4M    12      -       0.76

¹ All datasets are available at http://code.google.com/p/fbench/.
² Includes the ontology, infobox types plus mapped properties, titles, article categories with labels,
  Geo coordinates, images, SKOS categories, and links to New York Times and Linked Geo Data.
Some Linked Data sources provide basic VoiD [3] statistics such as number
of triples, distinct subjects, predicates, objects, and information about the vo-
cabulary and links to other sources. Table 1 surveys such basic statistics for
our datasets. These and other statistics (such as predicate and type frequency,
histograms, full pattern indexes obtained by counting all combinations of values
in triple patterns, full join indexes obtained by counting all join combinations,
and link statistics) can be exploited by engines in the optimization process.
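For illustration, the following minimal Java sketch computes this kind of basic VoiD-style counts for a single dataset dump. It assumes the Sesame 2 RIO parser and an N-Triples file as input (both our assumptions); it is a simplified example rather than the exact code used to produce Table 1, and keeping all distinct values in memory is only feasible for small dumps.

import java.io.FileInputStream;
import java.util.HashSet;
import java.util.Set;

import org.openrdf.model.Statement;
import org.openrdf.rio.RDFFormat;
import org.openrdf.rio.RDFParser;
import org.openrdf.rio.Rio;
import org.openrdf.rio.helpers.RDFHandlerBase;

// Simplified sketch: streams over an N-Triples dump (first program argument)
// and derives triple count and distinct subject/predicate/object counts.
public class BasicVoidStats {
    public static void main(String[] args) throws Exception {
        final Set<String> subjects = new HashSet<String>();
        final Set<String> predicates = new HashSet<String>();
        final Set<String> objects = new HashSet<String>();
        final long[] triples = { 0 };

        RDFParser parser = Rio.createParser(RDFFormat.NTRIPLES);
        parser.setRDFHandler(new RDFHandlerBase() {
            @Override
            public void handleStatement(Statement st) {
                triples[0]++;
                subjects.add(st.getSubject().stringValue());
                predicates.add(st.getPredicate().stringValue());
                objects.add(st.getObject().stringValue());
            }
        });
        parser.parse(new FileInputStream(args[0]), "http://example.org/");

        System.out.println("#triples = " + triples[0]);
        System.out.println("#subj.   = " + subjects.size());
        System.out.println("#pred.   = " + predicates.size());
        System.out.println("#obj.    = " + objects.size());
    }
}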
Duan et al. [8] introduced the notion of structuredness, which indicates
whether the instances in a dataset have only a few or all attributes of their
types set. They show that artificial datasets are typically highly structured and
“real” datasets are less structured. As shown in the last column of Table 1,
the structuredness (range [0, 1]) varies for our datasets, e.g. DBpedia has a low
structuredness value whereas Jamendo and KEGG are highly structured.
3.2 Benchmark Queries
There are two reasonable options for the design of benchmark queries [10]:
language-specific vs. use case driven design. The query sets we propose cover
both dimensions. We choose SPARQL as a query language: It is known to be
relationally complete, allowing us to encode a broad range of queries with vary-
ing complexity, from simple conjunctive queries to complex requests involving
e.g. negation [4, 20, 22]. We restrict ourselves to general characteristics, pointing
to the FedBench project page for a complete listing and description.
Life Science (LS) and Cross Domain Queries (CD). These two query
sets implement realistic, real-life use cases on top of the cross-domain and life
science data collection, respectively. Their focus is on federation-specific aspects,
in particular (1) number of data sources involved, (2) join complexity, (3) types
of links used to join sources, and (4) varying query (and intermediate) result
size. Figure 1 exemplarily discusses three queries taken from these query sets.
Example, Life Science Query 4: For all drugs in DBpedia, find all drugs they interact with,
along with an explanation of the interaction.

SELECT ?Drug ?IntDrug ?IntEffect WHERE {
  ?Drug rdf:type dbpedia-owl:Drug .
  ?y owl:sameAs ?Drug .
  ?Int drugbank:interactionDrug1 ?y .
  ?Int drugbank:interactionDrug2 ?IntDrug .
  ?Int drugbank:text ?IntEffect . }

This query includes a star-shaped sub-pattern of drugs which is connected via an
owl:sameAs link to DBpedia drug entities.

Example, Cross Domain Query 5: Find the director and the genre of movies directed by
Italians.

SELECT ?film ?director ?genre WHERE {
  ?film dbpedia-owl:director ?director .
  ?director dbpedia-owl:nationality dbpedia:Italy .
  ?x owl:sameAs ?film .
  ?x linkedMDB:genre ?genre . }

A chain-like query for finding film entities (in LinkedMDB and in DBpedia) linked via
owl:sameAs and restricted on genre and director.

Example, Linked Data Query 4: Find authors of papers at the ESWC 2010 conference who
were also involved in the conference organization.

SELECT * WHERE {
  ?role swc:isRoleAt <http://data.semanticweb.org/conference/eswc/2010> .
  ?role swc:heldBy ?p .
  ?paper swrc:author ?p .
  ?paper swc:isPartOf ?proceedings .
  ?proceedings swc:relatedToEvent <http://data.semanticweb.org/conference/eswc/2010> }

Fig. 1: Selected Benchmark Queries
SP²Bench Queries (SP). Next, we reuse the queries from the SP²Bench
SPARQL performance benchmark, which were designed to test a variety of
SPARQL constructs and operator constellations, but also cover characteristics
like data access patterns, result size, and different join selectivities. Some of the
SP²Bench queries have high complexity, implementing advanced language con-
structs such as negation and double negation. They are intended to be run on
top of the distributed SP²Bench dataset described in Section 3.1. A thorough
discussion of the queries and their properties can be found in [21].
Linked Data Queries (LD). Today’s Linked Data engines typically focus on
basic graph patterns (Conjunctive Queries). Therefore, all LD queries are basic
graph pattern queries, designed to deliver results when processing Linked Data in
an exploration-based way (cf. the bottom-up strategy described in Section 4.1).
Queries LD1–LD4 use the SW Dog Food dataset to extract information about
conferences and associated people. LD1–LD3 all contain a single URI to be used
as a starting-point for exploration-based query processing, whereas LD4 has two
URIs that could be used to speed up processing by starting the exploration from
multiple points in the Linked Data graph (cf. Figure 1). The other queries operate
on DBpedia, LinkedMDB, New York Times, and the Life Science collection. In
summary, the Linked Data queries vary in a broad range of characteristics, such
as number of sources involved, number of query results, and query structure.
Query Characteristics. Table 2 surveys the query properties, showing that
they vastly vary in their characteristics. We indicate the SPARQL operators
used inside the query (Op.), the solution modifiers used additionally (Mod.),
and categorize the query structure (Struct.), roughly distinguishing different
join combinations like subject-subject or subject-object joins leading to dif-
ferent query structures commonly referred to as star-shaped, chain, or hybrid
queries. We further indicate the number of results (#Res.) on the associated datasets
(we provide an estimated lower bound when the precise number is unknown),
and denote the number of datasets that potentially contribute to the
result (#Src), i.e. those that match at least one triple pattern in the query.

Table 2: Query Characteristics. Operators (Op.): And ("."), Union, Filter, Optional;
Modifiers (Mod.): Distinct, Limit, Offset, OrderBy; Structure (Struct.): Star, Chain, Hybrid

Life Science (LS):
      Op.   Mod.  Struct.  #Res.  #Src
LS1   U     -     -        1159   2
LS2   AU    -     -        333    4
LS3   A     -     H        9054   2
LS4   A     -     H        3      2
LS5   A     -     H        393    3
LS6   A     -     H        28     3
LS7   AFO   -     H        144    3

Cross Domain (CD):
      Op.   Mod.  Struct.  #Res.  #Src
CD1   AU    -     S        90     2
CD2   A     -     S        1      2
CD3   A     -     H        2      5
CD4   A     -     C        1      5
CD5   A     -     C        2      5
CD6   A     -     C        11     4
CD7   A     -     C        1      5

SP²Bench (SP):
      Op.   Mod.   Struct.  #Res.   #Src
SP1   A     -      S        1       11
SP2   AO    Or     S        >500k   12
SP3a  AF    -      S        >300k   16
SP3b  AF    -      S        2209    16
SP3c  AF    -      S        0       16
SP4   AF    D      C        >40M    14
SP5a  AF    D      C        >300k   14
SP5b  AF    D      C        >300k   14
SP6   AFO   -      H        >700k   16
SP7   AFO   D      H        >2k     14
SP8   AFU   D      H        493     16
SP9   AU    D      -        4       16
SP10  -     -      -        656     12
SP11  -     LOfOr  -        10      8

Linked Data (LD):
      Op.   Mod.  Struct.  #Res.
LD1   A     -     C        309
LD2   A     -     C        185
LD3   A     -     C        162
LD4   A     -     C        50
LD5   A     -     S        10
LD6   A     -     H        11
LD7   A     -     S        1024
LD8   A     -     H        22
LD9   A     -     C        1
LD10  A     -     C        3
LD11  A     -     S        239
Note that the number of data sets actually used during evaluation depends on the evalua-
tion strategy (e.g., an engine may substitute variable bindings at runtime so that
some endpoints no longer yield results), and intermediate
results delivered by endpoints may be irrelevant for the final outcome. We ob-
serve that the (LS) queries typically address 2–3 sources, the (CD) queries up to
5 sources, while the (SP) queries have intermediate results in up to 16 sources
(where, however, typically only few sources contribute to the result).
3.3 Benchmark Evaluation Framework
To help users execute FedBench in a standardized way and to support parametriza-
tion along the dimensions from Section 2, we have developed a Java benchmark
driver, which is available as Open Source. It provides an integrated execution
engine for the different scenarios and is highly configurable. Using the Sesame
API (http://www.openrdf.org/) as a mediator, it offers support for querying local repositories, SPARQL end-
points, and Linked Data in a federated setting. Systems that are not built upon
Sesame can easily be integrated by implementing the generic Sesame interfaces.
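To illustrate this mediator role, the following minimal sketch (assuming the Sesame 2 API; the endpoint URL and the query string are placeholders of ours) shows how a local repository and a remote SPARQL endpoint are exposed through the same Repository interface, and how a query can be timed and its results counted.

import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.sail.SailRepository;
import org.openrdf.repository.sparql.SPARQLRepository;
import org.openrdf.sail.memory.MemoryStore;

// Sketch of Sesame-mediated data access: both a local in-memory repository and
// a remote SPARQL endpoint are queried through the same Repository interface.
public class SesameMediatorSketch {

    static int countResults(Repository repo, String query) throws Exception {
        RepositoryConnection con = repo.getConnection();
        try {
            TupleQueryResult result =
                con.prepareTupleQuery(QueryLanguage.SPARQL, query).evaluate();
            int count = 0;
            while (result.hasNext()) {   // count results without materializing them
                result.next();
                count++;
            }
            result.close();
            return count;
        } finally {
            con.close();
        }
    }

    public static void main(String[] args) throws Exception {
        Repository local = new SailRepository(new MemoryStore());                    // local repository case
        local.initialize();
        Repository endpoint = new SPARQLRepository("http://localhost:8080/sparql");  // placeholder endpoint URL
        endpoint.initialize();

        String query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10";                      // placeholder query

        for (Repository repo : new Repository[] { local, endpoint }) {
            long start = System.currentTimeMillis();
            int count = countResults(repo, query);
            System.out.println(count + " results in "
                + (System.currentTimeMillis() - start) + " ms");
        }
    }
}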
The driver comes with predefined configurations for the benchmark scenar-
ios that will be discussed in our experimental results. Custom scenarios can be
created intuitively by writing config files that define properties for data con-
figuration and other benchmark settings (query sets, number of runs, timeout,
output mediator, etc.). In particular, one can specify the types of repositories
(e.g., native vs. SPARQL endpoints), automatically load datasets into reposito-
ries (while measuring loading time), and execute arbitrary queries while simulat-
ing real-world conditions like execution via HTTP with a customizable network
delay. Combining this flexibility with the predefined data and query sets thus
allows the user to customize the benchmark along the dimensions relevant for
the setting under consideration. With the goal of positioning the bench-
mark as an ongoing community effort, the underlying evaluation framework is
Open Source and has been designed with extensibility in mind at different lev-
els. In particular, it is easy to specify complex evaluation settings by means of
simple configuration files (i.e., without code modifications), plug in new systems,
implement new metrics, evaluate and visualize results, etc.
To standardize the output format, the driver ships with two default mediators for
writing results in CSV and RDF; for the latter we have implemented an Infor-
mation Workbench [12] module to visualize benchmark results automatically.
4 Evaluation
The central goal of our experimental evaluation is to demonstrate the usefulness
of our benchmark suite. Thus, in order to show that our framework is a useful
tool to assess strengths and weaknesses in a variety of semantic data use cases,
we investigate different scenarios that vary in the dimensions sketched in Sec-
tion 2, in particular in data distribution, access interfaces, and query complexity.
For space limitations, aspects (Q2) Result Completeness and (Q3) Ranking are
not covered; further, there are currently no systems that improve their behavior
when an increasing amount of statistics are provided, so (D4) Data Statistics
could only be assessed by comparing systems that make use of different statis-
tics. All experiments described in the following are supported by our benchmark
driver out-of-the-box and were realized by setting up simple benchmark driver
config files specifying data and query sets, setup information, etc. We refer the
interested reader to [23] for additional results on other systems, such as DARQ
and FedX. We start with a description of the scenarios, then discuss the bench-
mark environment, and conclude with a discussion of the evaluation results.
4.1 Description of Scenarios
(A) RDF Databases vs. SPARQL Endpoints. This first set of scenarios
was chosen to demonstrate the capabilities of FedBench to compare federation
approaches for data stored in local RDF databases or accessible via SPARQL
endpoints. In particular, they were designed to test how dimensions (D1) Physical
Distribution of data and (D2) Data Access Interfaces affect query evaluation
while the remaining dimensions are fixed across the scenarios, namely
(A1) centralized processing, where all data is held in a local, central store, vs.
(A2) local federation, where we use a federation of local repository endpoints, all
of which are linked to each other in a federation layer, vs.
(A3) a federation of SPARQL endpoints, also linked to each other in a common
local federation layer, pursuing the goal to test the overhead that is imposed
by the SPARQL requests exchanged over the associated HTTP layer.
For all scenarios, we carried out experiments with the Sesame 2.3.2 engine
using AliBaba version 2.0 beta 3, a federation layer for the Sesame framework
which links integrated federation members together. In addition, we carried out
experiments with the SPLENDID federation system from [9]. In contrast to the
AliBaba federation, the latter uses statistical data about predicates and types
to select appropriate data sources for answering triple patterns, which offers a
wide range of optimization opportunities. Patterns that need to be sent to the
same source are grouped and the join order between them is optimized with
a dynamic programming approach, using e.g. the number of distinct subjects
and objects per predicate to estimate the cardinality of intermediate results.
The evaluation strategy relies on hash joins to allow for parallel execution and
to reduce the number of HTTP requests, instead of sending individual result
bindings to endpoints in a nested-loop join.
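The following schematic Java sketch (our own illustration, not the code of any of the evaluated systems) contrasts the two strategies on the level of variable bindings: a local hash join needs one request per triple pattern, whereas a bound nested-loop join issues one request per intermediate binding.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Schematic sketch of a local hash join over triple-pattern results; bindings are
// represented as simple variable-to-value maps keyed by the shared join variable.
public class JoinStrategySketch {

    /** One result row of a triple pattern, keyed by the shared join variable. */
    static class Row {
        final String joinValue;          // binding of the shared variable
        final Map<String, String> vars;  // all variable bindings of this row
        Row(String joinValue, Map<String, String> vars) {
            this.joinValue = joinValue;
            this.vars = vars;
        }
    }

    /** Hash join: both inputs are fetched independently (one request each), joined locally. */
    static List<Map<String, String>> hashJoin(List<Row> left, List<Row> right) {
        Map<String, List<Row>> index = new HashMap<String, List<Row>>();
        for (Row r : left) {                         // build phase on the left input
            List<Row> bucket = index.get(r.joinValue);
            if (bucket == null) {
                bucket = new ArrayList<Row>();
                index.put(r.joinValue, bucket);
            }
            bucket.add(r);
        }
        List<Map<String, String>> out = new ArrayList<Map<String, String>>();
        for (Row r : right) {                        // probe phase on the right input
            List<Row> matches = index.get(r.joinValue);
            if (matches == null) continue;
            for (Row m : matches) {
                Map<String, String> joined = new HashMap<String, String>(m.vars);
                joined.putAll(r.vars);
                out.add(joined);
            }
        }
        return out;
    }

    // A bound nested-loop join, in contrast, would loop over the left input and send
    // one instantiated triple pattern (i.e., one request) per row to the endpoints --
    // which is exactly what inflates the request counts shown in Fig. 2.
}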
(B) Linked Data. Complementing the previous scenarios, we also evaluated
a Linked Data scenario, where the data is distributed among a large number of
sources that can only be retrieved using URI lookup and queries are evaluated
on the combined graph formed by the union of these sources. Hence, in this
setting the focus is on the knowledge about (D3) Data Source Existence. When
all relevant sources are known, all of them are retrieved using their URIs before
executing the query on the retrieved data (top-down) [14]. Another scheme that
does not require a priori knowledge about data sources is an exploration-based
approach [15]. Here, the query is assumed to contain at least one constant that
is a URI. This URI is then used for retrieving the first source, and new sources
are iteratively discovered in a bottom-up fashion starting with links found in
that source (bottom-up). The mixed approach in [16] combines bottom-up and
top-down to discover new sources at runtime as well as leverage known sources.
All three approaches in scenario (B) were evaluated based on the Linked
Data query set (LD) using the prototype system from [16], which implements a
stream-based query processing engine based on symmetric hash join operators.
Note that all three approaches yield complete results (by design of the queries).
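For illustration, the following is a generic, single-threaded Java sketch of a symmetric hash join over variable bindings, the operator type mentioned above (a textbook-style illustration, not the implementation from [16]): every incoming binding is inserted into the hash table of its own input and immediately probed against the other input's table, so results can be produced while sources are still being retrieved.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Generic sketch of a non-blocking symmetric hash join over variable bindings.
public class SymmetricHashJoinSketch {

    private final String joinVar;  // shared join variable
    private final Map<String, List<Map<String, String>>> leftTable =
            new HashMap<String, List<Map<String, String>>>();
    private final Map<String, List<Map<String, String>>> rightTable =
            new HashMap<String, List<Map<String, String>>>();

    public SymmetricHashJoinSketch(String joinVar) {
        this.joinVar = joinVar;
    }

    /** Push a binding arriving on the left input; returns newly produced join results. */
    public List<Map<String, String>> pushLeft(Map<String, String> binding) {
        return insertAndProbe(binding, leftTable, rightTable);
    }

    /** Push a binding arriving on the right input; returns newly produced join results. */
    public List<Map<String, String>> pushRight(Map<String, String> binding) {
        return insertAndProbe(binding, rightTable, leftTable);
    }

    private List<Map<String, String>> insertAndProbe(
            Map<String, String> binding,
            Map<String, List<Map<String, String>>> ownTable,
            Map<String, List<Map<String, String>>> otherTable) {
        List<Map<String, String>> results = new ArrayList<Map<String, String>>();
        String key = binding.get(joinVar);
        if (key == null) {
            return results;                                   // unbound join variable: nothing to do
        }
        List<Map<String, String>> bucket = ownTable.get(key); // 1. insert into own hash table
        if (bucket == null) {
            bucket = new ArrayList<Map<String, String>>();
            ownTable.put(key, bucket);
        }
        bucket.add(binding);
        List<Map<String, String>> matches = otherTable.get(key); // 2. probe the other side
        if (matches != null) {
            for (Map<String, String> other : matches) {
                Map<String, String> joined = new HashMap<String, String>(other);
                joined.putAll(binding);
                results.add(joined);
            }
        }
        return results;
    }
}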
4.2 Setup and Evaluation Metrics
All experiments were carried out on an Integrated Lights-Out 2 (ILO2) HP server
ProLiant DL360 G6 with 2000MHz 4Core CPU and 128KB L1 Cache, 1024KB
L2 Cache, 4096KB L3 Cache, 32GB 1333MHz RAM, and a 160GB SCSI hard
drive. They were run on top of a 64bit Windows 2008 Server operating system
and executed with a 64bit Java VM 1.6.0_22 (all tested systems were Java-based).
In the centralized setting (A1) we reserved a maximum of 28GB RAM to the
VM, while in the distributed settings we assigned 20GB to the server process
and 1GB to the client processes (i.e., the individual endpoints). Note that we ran
all tests against local SPARQL endpoints, one for each federation member, to
avoid unpredictable effects like network delay and high load of remote sources.

[Fig. 2: Scenario (A): Evaluation Time (top) and Number of Requests to End-
points (bottom) for the Cross Domain and Life Science Queries. The plots (log
scale, seconds) compare Centralized Store (Sesame), Local Federation (AliBaba),
Endpoint Federation (AliBaba), and Endpoint Federation (SPLENDID); several
queries end in timeout or evaluation error. The request counts are:]

                                 CD1  CD2  CD3   CD4     CD5     CD6   CD7  LS1  LS2  LS3    LS4  LS5   LS6    LS7
Endpoint Federation (AliBaba)    27   22   93k   (372k)  (117k)  6k    2k   13   61   (410)  21k  17k   (130)  (876)
Endpoint Federation (SPLENDID)   22   9    20    (9)     15      2465  492  2    49   9      10   4778  322    4889
All query mixes in each setup have been run three times with a timeout of
ten minutes per query. We report on the average total time over all runs. Queries
that failed (e.g. with a system-internal exception) or delivered an unsound or
incomplete result are indicated by “Evaluation Error”. To exclude the influence
of cold start effects, for each setup we ran a “ramp-up” query mix prior to query
evaluation. We executed each query set separately and in order, counting the
number of results (but not materializing them on disk). In addition to the eval-
uation time, we counted the number of requests sent to the individual endpoints
(which is supported by our benchmark driver out-of-the-box).
For the Linked Data scenario, a CumulusRDF (http://code.google.com/p/cumulusrdf/) Linked Data server was used
to deploy the dataset on the local network. Both the server process and the
Linked Data server were started with a maximum of 10GB RAM reserved. To
simulate internet conditions, an artificial delay of 750ms was introduced, which
resembles typical response times of current Linked Data servers.
4.3 Experimental Results
Figure 2 summarizes our results for the cross domain (CD) and life science (LS)
queries in scenarios (A1)–(A3). The two plots visualize evaluation time, while
the table at the bottom shows the number of requests sent by the systems to
the federation members (numbers in parentheses are lower bounds for queries
that failed due to timeout). Comparing Sesame and AliBaba first, we observe
that the centralized approach in almost all cases outperforms the local federation
approach, for 5 out of 14 queries even by an order of magnitude or more (the
time for the centralized store for queries CD2 and LS4 was about 1ms, which
is not visible in the diagram). We observe an additional performance overhead
for the AliBaba SPARQL Endpoint federation approach, which delivered results
only for 8 out of the 14 queries. Upon closer investigation, we could identify
several reasons for the poor performance of the AliBaba federation approaches:
– Due to lack of statistics, effective join order optimization is often impossible,
  resulting in a high number of triples being exchanged in the federation.
– Also caused by the lack of statistics, the AliBaba federation layer iteratively
  sends triple patterns to all federation members to obtain candidate bind-
  ings for free variables. These bindings are then instantiated in the remaining
  part of the query, subsequently sending instantiated triple patterns to the
  federation members in a nested-loop fashion. This often results in a very
  high number of requests to the federation members (cf. the table at the bot-
  tom of Fig. 2), which causes the high evaluation time. Given that AliBaba's strategy is
  identical in the local and SPARQL Endpoint federation scenario, the results
  indicate an enormous overhead imposed by the HTTP layer in the SPARQL
  Endpoint federation, typically an order of magnitude or more.
– Sesame's ability to deal with parallelization is limited. In the SPARQL End-
  point scenario, where our driver simulates endpoints by servlets that process
  incoming HTTP requests, we experimented with different degrees of paral-
  lelization. When instantiating more than 5–10 worker threads for answering
  the queries (each having its own repository connection), we could observe
  a performance drop, manifesting in high waiting times for the worker
  threads, probably caused by Sesame's internal locking concept.
For the SPLENDID federation, we can observe that the number of HTTP
requests is significantly lower: in contrast to AliBaba, which evaluates the query
starting with a single triple pattern and iteratively substitutes results in sub-
sequent patterns, SPLENDID also generates execution plans which send the
patterns independently to relevant endpoints and join them together locally, at
the server. Therefore, SPLENDID still returns results where AliBaba’s naive
nested-loop join strategy times out. For queries CD3, CD5, LS4 and LS6 it even
beats the local federation.
Figure 3 summarizes our results in the SP²Bench scenario.

[Fig. 3: Scenario (A): Results of SP²Bench Queries — evaluation times (log scale,
seconds) for Centralized Store (Sesame), Local Federation (AliBaba), and Endpoint
Federation (AliBaba); many queries end in timeout or evaluation error.]

We observe that even the centralized Sesame store has severe problems answering the more com-
plex queries, which is in line with previous investigations from [21] (note that in
the experiments from [21] Sesame was provided with all possible index combina-
tions, whereas here we use only the standard indices; this explains why in our
setting Sesame behaves slightly worse than in [21]). Except for the outlier query
SP1, the trends are quite similar to those observed in the (LS) and (CD) scenario,
with the centralized scenario being superior to the local federation, which is
again superior to the SPARQL Endpoint federation. Query SP1 is a simple query
that asks for a specific journal in the data set. It can be answered efficiently on
top of the local federation, because the federation is split
up by type and the journals are distributed across only two federation mem-
bers, so the system benefits from the parallelization of requests. In summary,
the SP²Bench experiments show that, for more complex queries and federation
among a larger number of federation members (as indicated in Table 2), current
federation approaches are still far from being applicable in practice.
Figure 4 visualizes our results for the Linked Data scenario (B). Regarding
overall query time, the bottom-up and mixed approaches behave similarly: both
perform run-time discovery and the mixed strategy cannot use its partial knowl-
edge to restrict sources, but only to load relevant sources earlier. This leads to
earlier result reporting, but is not reflected in the overall query time. In some
cases the mixed approach is even slightly worse than bottom-up, due to the over-
head imposed by using local source indices. The top-down approach, though, is
able to restrict the number of sources to be retrieved, leading to better query
times in many cases. For example, for query LD8 the query time for bottom-up
evaluation is 19.1s, while top-down requires 4.2s, an improvement of 75%, made
possible by the lower number of sources retrieved in the top-down scenario.
Overall, the top-down approach uses its centralized, complete knowledge to
identify relevant sources and exclude non-relevant sources. In a dynamic scenario
such as Linked Data, though, it may be infeasible to keep local indexes up-to-
date, so exploration-based approaches like bottom-up or the mixed approach,
which do not rely on complete knowledge, may be more suitable.

[Fig. 4: Scenario (B): Query Evaluation Times (left) and Number of Sources
Dereferenced for Answering the Query (right):]

Query   #Srcs.
LD1     317
LD2     67
LD3     317
LD4     342
LD5     39
LD6     598
LD7     1219
LD8     217
LD9     59
LD10    44
LD11    366

5 Conclusion

As witnessed by the evaluation, our benchmark is flexible enough to cover a
wide range of semantic data application processing strategies and use cases,
ranging from centralized processing over federation to pure Linked Data pro-
cessing. Clearly, our experimental analysis is not (and does not intend to be)
complete with respect to covering all existing systems and solutions. Yet we
have provided a flexible benchmark suite that can be used and extended by
others to evaluate alternative approaches, systems, and scenarios.
Extensions we are planning to address in future work particularly include
queries targeted at the new SPARQL features that will be published in the com-
ing SPARQL 1.1 release, such as aggregation, nested subqueries, and built-in
support for federation. Further, given that the current data sets and queries
focus on instance data and query answering over ground RDF graphs, exten-
sions for testing reasoning capabilities (e.g., over RDFS and OWL data) in a
distributed setting are left as future work. With our flexible framework, though,
it is straightforward to incorporate such scenarios with little effort, and we invite
the community to contribute to FedBench with their own data and query sets.
Finally, our evaluation has revealed severe deficiencies of today’s federation
approaches, which underline the practicability of FedBench. As one of our major
findings, data statistics – which are explicitly included in our benchmark suite –
play a central role in efficient federated data processing: as indicated by our
results, they are crucial in optimization to minimize the size of results shipped
across federation members and the number of requests exchanged within the federation.
Acknowledgments
Research reported in this paper was partially supported by the German BMBF
in the project CollabCloud. http://collabcloud.de/
References
1. SPARQL 1.1 Federation Extensions. W3C Working Draft, 1 June 2010,
http://www.w3.org/TR/sparql11-federated-query/.
2. D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. Scalable Semantic
Web Data Management Using Vertical Partitioning. In VLDB, 2007.
3. K. Alexander and M. Hausenblas. Describing Linked Datasets – On the Design
and Usage of voiD. In Linked Data on the Web Workshop, 2009.
4. R. Angles and C. Gutierrez. The Expressive Power of SPARQL. In ISWC, pages
114–129, 2008.
5. C. B. Aranda, O. Corcho, and M. Arenas. Semantics and Optimization of the
SPARQL 1.1 Federation Extension. In ESWC, 2011.
6. C. Bizer and A. Schultz. The Berlin SPARQL Benchmark. Int. J. Semantic Web
Inf. Syst., 5(2):1–24, 2009.
7. J. Broekstra, A. Kampman, and F. van Harmelen. Sesame: A Generic Architecture
for Storing and Querying RDF and RDF Schema. In ISWC, pages 54–68, 2002.
8. S. Duan, A. Kementsietsidis, K. Srinivas, and O. Udrea. Apples and Oranges: A
Comparison of RDF Benchmarks and Real RDF Datasets. In SIGMOD, 2011.
9. O. Görlitz and S. Staab. Federated Data Management and Query Optimization
for Linked Open Data. In New Directions in Web Data Management. 2011.
10. J. Gray. Database and Transaction Processing Performance Handbook. In The
Benchmark Handbook. 1993.
11. Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base
systems. J. Web Sem., 3(2-3):158–182, 2005.
12. P. Haase, A. Eberhart, S. Godelet, T. Mathäß, T. Tran, G. Ladwig, and A. Wagner.
The Information Workbench - Interacting with the Web of Data. Technical report,
fluid Operations & AIFB Karlsruhe, 2009.
13. P. Haase, T. Mathäß, and M. Ziller. An Evaluation of Approaches to Federated
Query Processing over Linked Data. In I-SEMANTICS, 2010.
14. A. Harth, K. Hose, M. Karnstedt, A. Polleres, K.-U. Sattler, and J. Umbrich. Data
Summaries for On-Demand Queries over Linked Data. In WWW, 2010.
15. O. Hartig, C. Bizer, and J. C. Freytag. Executing SPARQL Queries over the Web
of Linked Data. In ISWC, pages 293–309, 2009.
16. G. Ladwig and T. Tran. Linked Data Query Processing Strategies. In ISWC, 2010.
17. G. Ladwig and T. Tran. SIHJoin: Querying Remote and Local Linked Data. In
ESWC, pages 139–153, 2011.
18. A. Magkanaraki, S. Alexaki, V. Christophides, and D. Plexousakis. Benchmarking
RDF Schemas for the Semantic Web. In ISWC, 2002.
19. T. Neumann and G. Weikum. RDF-3X: a RISC-style engine for RDF. PVLDB,
1(1), 2008.
20. J. Pérez, M. Arenas, and C. Gutierrez. Semantics and Complexity of SPARQL.
ACM Trans. Database Syst., 34(3), 2009.
21. M. Schmidt, T. Hornung, G. Lausen, and C. Pinkel. SP2Bench: A SPARQL Per-
formance Benchmark. In ICDE, pages 222–233, 2009.
22. M. Schmidt, M. Meier, and G. Lausen. Foundations of SPARQL Query Optimiza-
tion. In ICDT, pages 4–33, 2010.
23. A. Schwarte, P. Haase, K. Hose, R. Schenkel, and M. Schmidt. FedX: Optimization
Techniques for Federated Query Processing on Linked Data. In ISWC, 2011.
24. C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple Indexing for Semantic
Web Data Management. PVLDB, 1(1):1008–1019, 2008.