SANSA’s Leap of Faith: Scalable RDF and Heterogeneous Data Lakes

Half-Day Tutorial at the 16th European Semantic Web Conference (ESWC 2019)

2nd – 6th June 2019, Portorož, Slovenia

Hajira Jabeen, Mohamed Nadjib Mami, Damien Graux, Gezim Sejdiu, and Prof. Dr. Jens Lehmann

Scalable processing of Knowledge Graphs (KGs) is an important requirement for today’s KG engineers. The Scalable Semantic Analytics Stack (SANSA) is a library built on top of Apache Spark that offers several APIs tackling various facets of scalable KG processing. SANSA is organized into several layers: (1) RDF data handling, e.g. filtering, computation of RDF statistics, and quality assessment; (2) SPARQL querying; (3) inference reasoning; (4) analytics over KGs. In addition to processing native RDF, SANSA also allows users to query a wide range of heterogeneous data sources (e.g. files stored in Hadoop or other popular NoSQL stores) uniformly using SPARQL. This tutorial aims to provide an overview, a detailed discussion, and a hands-on session on SANSA, covering all of the aforementioned layers with simple use cases.

  • Introduction

    In recent years, the Semantic Web has become an established framework for data consumption, publication, conversion, and exchange, with openly defined standards and protocols. These activities and interests have resulted in an unprecedented increase in both the size and the number of Knowledge Graphs (KGs). Concurrently, KGs are increasingly being adopted as a major format for integrating heterogeneous data sources. Unfortunately, this integration is challenged by the multitude of data storage systems, each with its own formats and query languages, which makes data integration cumbersome. Therefore, there is a strong need for a unified framework that is scalable, KG-oriented, resilient, and compatible with the existing Semantic Web stack, i.e. one that provides RDF representation, querying, inference, and analytics, together with the ability to ingest and query heterogeneous data sources.

    In this tutorial, we will present SANSA [1], the Scalable Semantic Analytics Stack, which combines RDF processing and heterogeneous data integration. We will offer a hands-on session on SANSA and demonstrate RDF data processing at scale, including statistics computation, quality assessment, querying, inference, analytics, and ‘simplified’ heterogeneous data integration.

    Content Overview

    In this tutorial, we will introduce and describe the different functionalities of SANSA. First, we will give a brief introduction to SANSA’s underlying processing engine, Apache Spark. We will then dive into the details of SANSA’s building blocks, organized into the four layers described below.
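    Since every SANSA API runs inside a Spark application, the hands-on sessions start from a plain SparkSession. The following minimal Scala sketch shows such a setup; the application name and master URL are illustrative and would be adapted to the local or cluster environment.

    ```scala
    import org.apache.spark.sql.SparkSession

    object TutorialSetup {
      def main(args: Array[String]): Unit = {
        // A local session for experimentation; on a cluster the master URL would
        // point to YARN or a standalone Spark master instead of local[*].
        val spark = SparkSession.builder()
          .appName("SANSA tutorial example")
          .master("local[*]")
          .getOrCreate()

        // ... calls into SANSA's layers would go here ...

        spark.stop()
      }
    }
    ```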

    RDF Processing layer

    SANSA provides APIs to load and store RDF data in several native formats, from HDFS or a local drive. It offers a rich set of functionalities for distributed manipulation of the given data, including filtering, computation of statistics [2], and quality assessment over large-scale RDF datasets, carried out in a distributed manner that ensures resilience and horizontal scalability.
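    As a concrete illustration, the following is a minimal sketch of loading and filtering RDF with the RDF layer. The import and the `spark.rdf(...)` reader follow the pattern of the SANSA 0.x examples and may differ between releases; the HDFS path and the predicate URI are placeholders.

    ```scala
    import org.apache.jena.riot.Lang
    import org.apache.spark.sql.SparkSession
    // Package and method names follow the SANSA 0.x RDF layer; check against the release used.
    import net.sansa_stack.rdf.spark.io._

    val spark = SparkSession.builder().appName("rdf-layer-sketch").master("local[*]").getOrCreate()

    // Load an N-Triples file from HDFS (or a local path) into an RDD of Jena Triples.
    val triples = spark.rdf(Lang.NTRIPLES)("hdfs://namenode:8020/data/dataset.nt")

    // Simple distributed manipulations: count all triples, then keep only rdf:type statements.
    val total = triples.count()
    val typeTriples = triples.filter(
      _.getPredicate.getURI == "http://www.w3.org/1999/02/22-rdf-syntax-ns#type")
    println(s"$total triples, of which ${typeTriples.count()} are rdf:type statements")
    ```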

    Querying Layer

    Querying is the primary means of searching, exploring, and extracting insights from an underlying RDF graph. SPARQL is the W3C standard for querying RDF graphs. SANSA provides cross-representational transformations and partitioning strategies for efficient query processing, and it offers APIs for executing SPARQL queries directly in Spark programs (a minimal usage sketch follows the list below). We will cover the following query engines of SANSA:

    • Triplify: The default approach for querying RDF data in SANSA is based on SPARQL-to-SQL rewriting. It uses a flexible triple-based partitioning strategy on top of RDF (such as predicate tables with sub-partitioning by data types). Currently, the Sparqlify implementation serves as a baseline.
    • DataLake: SANSA’s DataLake component allows users to query heterogeneous data sources, ranging from databases to large files stored in HDFS and NoSQL stores, using SPARQL. SANSA DataLake currently supports CSV and Parquet files, Cassandra, MongoDB, Couchbase, Elasticsearch, and various JDBC sources, e.g. MySQL and SQL Server. Internally, a given SPARQL query is decomposed into subqueries, each extracting a subset of the results.
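    The sketch below shows what running a SPARQL query from a Spark program can look like. The `net.sansa_stack.query.spark.query._` import and the `sparql` extension method follow the pattern used in SANSA’s published examples; the exact entry point and return type vary between SANSA releases, so this should be checked against the version used in the notebooks.

    ```scala
    // Assumes `triples` is the RDD of Jena Triples loaded via the RDF layer (see the sketch above).
    // Import and call follow SANSA's examples; names may vary across releases.
    import net.sansa_stack.query.spark.query._

    val queryString =
      """
        |SELECT ?s ?name
        |WHERE { ?s <http://xmlns.com/foaf/0.1/name> ?name }
        |LIMIT 10
      """.stripMargin

    // The query is rewritten to SQL over the partitioned triples (Sparqlify serves as the baseline).
    val result = triples.sparql(queryString)
    result.show(false) // assuming a DataFrame-like result; adjust to the actual return type
    ```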

    Inference Layer

    RDFS and OWL are well known for expressing schema information in addition to assertions, or facts. The forward-chaining inference process iteratively applies inference rules to the existing facts in a knowledge base in order to derive new facts. In doing so, it enriches the knowledge base with new knowledge and enables the detection of inconsistencies. This process is usually both data-intensive and compute-intensive. SANSA provides an efficient rule engine for the well-known reasoning profiles RDFS (with different subsets) and OWL-Horst. Using SANSA, applications can fine-tune the rules they require and, in case of scalability issues, adjust them accordingly.
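    To illustrate the principle behind forward chaining, independently of SANSA’s reasoner API, the following Spark sketch iterates the RDFS rule rdfs9 (type propagation along rdfs:subClassOf) over a toy fact set until no new facts can be derived. The data and the abbreviated URIs are purely illustrative.

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("forward-chaining-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val RDF_TYPE = "rdf:type"
    val SUB_CLASS = "rdfs:subClassOf"

    // Toy knowledge base: one assertion plus a small class hierarchy.
    var facts = sc.parallelize(Seq(
      ("alice", RDF_TYPE, "Student"),
      ("Student", SUB_CLASS, "Person"),
      ("Person", SUB_CLASS, "Agent")
    ))

    // Apply rdfs9 repeatedly until a fixpoint is reached (no new facts in an iteration).
    var newFacts = 1L
    while (newFacts > 0) {
      val types = facts.filter(_._2 == RDF_TYPE).map { case (s, _, c) => (c, s) }
      val subClass = facts.filter(_._2 == SUB_CLASS).map { case (c1, _, c2) => (c1, c2) }
      val derived = types.join(subClass).map { case (_, (s, c2)) => (s, RDF_TYPE, c2) }
      val before = facts.count()
      facts = facts.union(derived).distinct().cache()
      newFacts = facts.count() - before
    }

    // The enriched knowledge base now also contains (alice, rdf:type, Person) and (alice, rdf:type, Agent).
    facts.filter(_._2 == RDF_TYPE).collect().foreach(println)
    ```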

    Analytics Layer

    SANSA makes full use of, and benefits from, the graph structure and the semantics of the background knowledge specified using the RDF and OWL standards. It exploits the expressivity of these knowledge structures to attain either better performance or more human-readable results. Currently, this layer provides algorithms for clustering, anomaly detection [3], and rule mining.
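    To give a flavour of graph-level analytics over RDF, the following Spark/GraphX sketch clusters resources by the connected components of their link structure. It illustrates the kind of analysis this layer targets on a toy edge list; it is not SANSA’s own clustering implementation.

    ```scala
    import org.apache.spark.graphx.{Edge, Graph}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("analytics-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // (subject, object) pairs extracted from resource-to-resource triples (toy data here).
    val resourceEdges = sc.parallelize(Seq(("alice", "bob"), ("bob", "carol"), ("dave", "erin")))

    // GraphX needs numeric vertex ids, so assign one to every resource.
    val ids = resourceEdges.flatMap { case (s, o) => Seq(s, o) }.distinct().zipWithUniqueId()
    val idMap = ids.collectAsMap()

    val edges = resourceEdges.map { case (s, o) => Edge(idMap(s), idMap(o), 1) }
    val graph = Graph.fromEdges(edges, defaultValue = 1)

    // Every connected component is reported as one cluster of resources.
    val components = graph.connectedComponents().vertices
    ids.map(_.swap).join(components)
      .map { case (_, (uri, comp)) => (comp, uri) }
      .groupByKey()
      .collect()
      .foreach { case (comp, members) => println(s"cluster $comp: ${members.mkString(", ")}") }
    ```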

    SANSA-Notebooks

    SANSA’s interactive Spark notebooks will be used to demonstrate the APIs of the different layers. SANSA-Notebooks are easy to deploy using docker-compose. The deployment stack includes Hadoop for HDFS, Spark for running the SANSA APIs, and Hue for navigating and copying files to HDFS. All examples and code snippets throughout this tutorial will be shown and prepared using SANSA-Notebooks.

    Presentation and Format

    This will be a half-day tutorial divided into short sprints. The first session will provide an overview of Apache Spark and SANSA, followed by setup instructions. The later sprints will cover the four layers of SANSA, each with a corresponding hands-on session. The tutorial will help attendees write programs in the in-memory computing framework Apache Spark and use SANSA’s out-of-the-box APIs to handle large-scale RDF data. It covers the full spectrum of scalable RDF processing: from overview to installation, from basic examples in Spark to executing SANSA’s APIs, and from parsing local or HDFS files to in-memory analytics using SANSA.

    Target Audience

    We expect our tutorial to attract attendees from several research areas, since SANSA sits at the crossroads of several domains within the Semantic Web, namely scalable RDF processing, inference, querying, machine learning, and heterogeneous data integration, while being fully open source with a steadily growing community. We therefore consider the various sessions offered in the tutorial (both the technical/theoretical and the practical ones) to be relevant to almost everyone attending the conference.

    Further, given the broad and detailed nature of the covered material, we believe our tutorial will be useful for anyone who wants to learn about scalable RDF processing. The tutorial will also serve as a good entry point for newcomers to the field of Big Data. Analytics over RDF, as well as heterogeneous data integration, could be of interest for a variety of use cases.

    Required knowledge: Knowledge of Java/Scala is required; basic knowledge of distributed computing frameworks and Docker is preferred.

    Technical Requirements

    We will only need standard projection equipment. Participants interested in following the hands-on sessions should bring their own laptops with a Linux-based OS, preferably with Docker Engine 1.13.0 and docker-compose 1.10.0 installed.

    References

    1. Lehmann, Jens, et al. “Distributed Semantic Analytics using the SANSA Stack.” In Proceedings of the International Semantic Web Conference (ISWC). Springer, Cham, 2017.
    2. Sejdiu, Gezim, et al. “DistLODStats: Distributed Computation of RDF Dataset Statistics.” In Proceedings of the 17th International Semantic Web Conference (ISWC), 2018.
    3. Jabeen, Hajira, et al. “Divided We Stand Out! Forging Cohorts fOr Numeric Outlier Detection in Large Scale Knowledge Graphs (CONOD).” In Proceedings of the European Knowledge Acquisition Workshop (EKAW). Springer, Cham, 2018.

  • The tutorial is organized as follows:

     

    Time  | Topic                                              | Presenter
    14:00 | Introduction                                       | Hajira Jabeen
    14:30 | Setting up the environment                         | All
    14:50 | RDF Layer: basics and APIs + examples              | Gezim Sejdiu
    15:10 | Query Layer (including DataLake): APIs + examples  | Damien Graux
    15:30 | Coffee break                                       | –
    16:00 | Inference Layer: APIs + examples                   | Hajira Jabeen
    16:20 | ML Layer: APIs + examples                          | Hajira Jabeen & Damien Graux

    – Hajira Jabeen (f) has broad international experience in teaching, management, and research. She is currently a senior researcher in the Smart Data Analytics (SDA) research group at the University of Bonn, where she works in the area of distributed semantic analytics. Her research interests are distributed analytics, the Semantic Web, data mining, and big data. She is actively involved in teaching and in conducting seminars and labs related to big data and analytics at the University of Bonn. Hajira also works on several projects related to big data, analytics, and training in these competency areas.

    – Mohamed Nadjib Mami (m) is a researcher in the “Enterprise Information Systems” department at Fraunhofer IAIS and a PhD student at the University of Bonn. He organized a hackathon on deploying Spark applications in a virtual environment for the EU Erasmus+-supported GraDAna project. His expertise revolves around the Semantic Web and (Big) Data Management, in particular distributed query processing over heterogeneous data sources.

    – Damien Graux (m) is a senior researcher in the “Enterprise Information Systems” department at Fraunhofer IAIS and the Smart Data Analytics group at the University of Bonn. In 2016, he received his PhD from the University of Grenoble (France). Prior to joining Fraunhofer, he contributed to research efforts in Semantic Web technologies at INRIA (France), where he mainly focused on distributed query evaluation and on designing complex data transformation pipelines. He is currently continuing these research efforts while being involved in several international projects.

    – Gezim Sejdiu (m) is a researcher in the “Smart Data Analytics” research group and a PhD student at the University of Bonn. His research interests lie in the areas of the Semantic Web, Big Data, and machine learning. He is also interested in distributed computing systems (Apache Spark, Apache Flink). He conducts a Distributed Big Data Analytics lab for master students at the University of Bonn, covering the Spark fundamentals and SANSA as a whole.

    – Prof. Dr. Jens Lehmann (m, http://www.jens-lehmann.org) leads the “Smart Data Analytics” research group, with 50 researchers, at the University of Bonn and Fraunhofer IAIS. His research interests include the Semantic Web, machine learning, question answering, distributed computing, and knowledge representation. Prof. Lehmann has authored more than 100 articles in international journals and conferences, which have been cited more than 12,000 times. He holds leading positions at several major conferences and journals. He is a founder of, leader of, or contributor to several community research projects, including SANSA, AskNow, DL-Learner, DBpedia, and LinkedGeoData. Previously, he completed his PhD summa cum laude at the University of Leipzig, with research visits to the University of Oxford.