Scalable RDF Analytics with SANSA

Half-Day Tutorial at The 19th International Semantic Web Conference (ISWC2020)

2nd – 6th November 2020, Athens, Greece

Hajira Jabeen, Damien Graux, Gezim Sejdiu, and Prof. Dr. Jens Lehmann

The size of knowledge graphs has reached the scale where centralised analytical approaches have become infeasible. Recent technological progress has enabled powerful tools for distributed in-memory analytics that have been shown to work well on elementary data structures, but these are not specialised for knowledge graph (KG) processing. The Scalable Semantic Analytics Stack (SANSA) is a library built on top of one such tool, Apache Spark, and it offers several APIs covering different facets of scalable KG processing. SANSA is organized into several layers: (1) RDF data handling, e.g. filtering, computation of RDF statistics, and quality assessment; (2) SPARQL querying; (3) inference and reasoning; (4) analytics over KGs. In addition to processing native RDF, SANSA also allows users to query a wide range of heterogeneous data sources (e.g. files stored in Hadoop or other popular NoSQL stores) uniformly using SPARQL. This tutorial aims to provide an up-to-date overview of the stack, together with detailed discussions of previous releases, technical add-ons, and developments. Furthermore, a hands-on session on SANSA, covering all the aforementioned layers through simple use-cases, will be provided.

  • Introduction

    In recent years, the Semantic Web has become an established means for data consumption, publication, conversion, and exchange, with openly established standards and protocols. These have resulted in an unprecedented increase in both the size and number of Knowledge Graphs (KGs). Concurrently, KGs are being increasingly adopted as a major format for integrating heterogeneous data sources. Therefore, there is a strong need for a unified framework that is scalable, KG-oriented, resilient, and compatible with the existing Semantic Web stack, i.e. one that provides RDF representation, querying, inference, and analytics, together with the ability to ingest and query heterogeneous data sources.

    In this tutorial, we will present the Scalable Semantic Analytics Stack (SANSA) [1], which combines distributed RDF processing and heterogeneous data integration. We will offer a hands-on session on SANSA and demonstrate RDF data processing at scale, including statistics calculation, quality assessment, querying, inference, analytics, and ‘simplified’ heterogeneous data integration.

    Detailed Description

    In this half-day tutorial, we will introduce and describe different functionalities of SANSA. First, we will cover a brief introduction to the underlying processing engine of SANSA, Apache Spark. Later, we will dive into the details of SANSA’s various building blocks organized as its four layers described below.

    RDF Processing layer

    SANSA provides APIs to load and store large-scale RDF data from HDFS or the local drive in several native formats. SANSA has a rich set of functionalities for performing distributed manipulations over the given data. This includes representing RDF data in different formats, e.g. as graphs, tensors, or data frames, and performing operations such as filtering, computation of statistics [2], and quality assessment [4] over large-scale RDF datasets in a distributed manner, ensuring resilience and horizontal scalability.
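    To give a flavour of what such a statistic looks like, the following is a minimal, non-distributed Scala sketch of one classic dataset statistic (predicate usage counts). The `Triple` case class and sample data are illustrative only, not SANSA's actual API; SANSA computes such statistics over Spark RDDs in a distributed fashion.

```scala
// Simplified, local sketch of an RDF dataset statistic: how often each
// predicate occurs. SANSA performs the same kind of aggregation over
// distributed Spark RDDs; here plain Scala collections stand in for them.
case class Triple(s: String, p: String, o: String) // illustrative, not SANSA's class

val triples = Seq(
  Triple("ex:alice", "rdf:type",  "ex:Person"),
  Triple("ex:alice", "ex:knows",  "ex:bob"),
  Triple("ex:bob",   "rdf:type",  "ex:Person")
)

// Group triples by predicate and count each group.
val predicateCounts: Map[String, Int] =
  triples.groupBy(_.p).map { case (p, ts) => p -> ts.size }
```

    In a distributed setting, the same `groupBy`-and-count shape maps naturally onto Spark's `map`/`reduceByKey` operations, which is what makes such statistics horizontally scalable.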

    Querying Layer

    Querying is the primary approach for searching, exploring, and extracting insights from an underlying RDF graph. SPARQL is the de facto W3C standard for querying RDF graphs. SANSA provides cross-representational transformations and partitioning strategies for efficient query processing, along with APIs for performing SPARQL queries directly in Spark programs. We will cover SANSA's two main query engines. The first is Sparqlify [5], based on SPARQL-to-SQL rewriting on top of a flexible triple-based partitioning strategy for RDF data (predicate tables with sub-partitioning by data types). The second is the DataLake component, which allows querying heterogeneous data sources; technically, the given SPARQL queries are internally decomposed into subqueries, each extracting a subset of the results.
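    The core idea of predicate-based (vertical) partitioning can be sketched in a few lines of plain Scala. This is a conceptual illustration under assumed names (`Triple`, `predicateTables`), not Sparqlify's implementation: one "table" per predicate means a triple pattern with a bound predicate only needs to scan that predicate's partition.

```scala
// Conceptual sketch of predicate-based (vertical) partitioning, the idea
// behind SPARQL-to-SQL engines such as Sparqlify: one table per predicate.
case class Triple(s: String, p: String, o: String) // illustrative only

val triples = Seq(
  Triple("ex:alice", "foaf:name",  "\"Alice\""),
  Triple("ex:bob",   "foaf:name",  "\"Bob\""),
  Triple("ex:alice", "foaf:knows", "ex:bob")
)

// Partition the graph into predicate tables of (subject, object) pairs.
val predicateTables: Map[String, Seq[(String, String)]] =
  triples.groupBy(_.p).map { case (p, ts) => p -> ts.map(t => (t.s, t.o)) }

// Answering the pattern { ?s foaf:name ?o } now touches a single partition
// instead of the whole triple set.
val names = predicateTables("foaf:name")
```

    Sub-partitioning by data type refines this further, so that, for example, string-valued and numeric-valued objects of the same predicate land in separate tables with appropriate SQL column types.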

    Inference Layer

    RDFS and OWL are well-known W3C standards for representing schema information in addition to assertions or facts. Inference commonly proceeds either by backward or by forward chaining. Backward chaining begins with the goal and operates backwards, chaining through rules to find known facts that support the goal (which implies query rewriting and can be expensive). Forward chaining iteratively applies inference rules to the existing facts in the knowledge base to derive new facts. In doing so, it enriches the knowledge base with new knowledge and allows inconsistency detection. This process is usually both data-intensive and compute-intensive.
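    The forward-chaining fixpoint described above can be sketched for two standard RDFS rules in plain Scala. This is a deliberately naive, local illustration (SANSA's rule engine runs the equivalent computation distributed over Spark); the `Triple` class and prefixes are assumptions for the example.

```scala
// Naive forward-chaining sketch for two RDFS entailment rules:
//   rdfs11: (c1 subClassOf c2), (c2 subClassOf c3) => (c1 subClassOf c3)
//   rdfs9:  (x type c1), (c1 subClassOf c2)        => (x type c2)
// Rules are applied repeatedly until no new facts appear (a fixpoint).
case class Triple(s: String, p: String, o: String) // illustrative only

def infer(kb: Set[Triple]): Set[Triple] = {
  val sub = kb.filter(_.p == "rdfs:subClassOf")
  val rdfs11 = for {
    t1 <- sub; t2 <- sub if t1.o == t2.s
  } yield Triple(t1.s, "rdfs:subClassOf", t2.o)
  val rdfs9 = for {
    ty <- kb if ty.p == "rdf:type"
    sc <- sub if ty.o == sc.s
  } yield Triple(ty.s, "rdf:type", sc.o)
  val next = kb ++ rdfs11 ++ rdfs9
  if (next == kb) kb else infer(next) // iterate until fixpoint
}

val kb = Set(
  Triple("ex:Student", "rdfs:subClassOf", "ex:Person"),
  Triple("ex:Person",  "rdfs:subClassOf", "ex:Agent"),
  Triple("ex:alice",   "rdf:type",        "ex:Student")
)
val closure = infer(kb)
```

    Each iteration scans the whole knowledge base, which is why materialising the closure of a large KG is data- and compute-intensive and benefits from a distributed engine.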

    SANSA provides an efficient rule engine for the well-known reasoning profiles RDFS (with different subsets) and OWL-Horst. Additionally, SANSA provides an OWL library to read OWL files. Currently, the OWL library contains builder objects to read OWL files in different formats such as Functional Syntax, Manchester Syntax, and OWL/XML Syntax. It supports distributed OWL file representations based on Spark Datasets or RDDs, which can contain either parsed OWL axiom objects or string-based representations.

    Machine-Learning Layer

    SANSA makes full use of, and derives benefit from, the graph structure and semantics of the background knowledge specified using the RDF and OWL standards. SANSA exploits the expressivity of these knowledge structures to attain either better performance or more human-readable results. Currently, this layer provides numerous algorithms for clustering, anomaly detection [3], entity linking, and rule mining. In this tutorial, we will demonstrate these algorithms and showcase their outcomes.
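    As a small taste of graph-based analytics of the kind this layer runs at scale, the following plain-Scala sketch groups entities into connected components, a basic building block behind graph clustering. The `Edge` class, data, and label-propagation scheme are illustrative assumptions, not SANSA's API.

```scala
// Simplified sketch of grouping KG entities by connectivity (connected
// components) via label propagation: every node repeatedly adopts the
// smallest label reachable over its edges until nothing changes.
case class Edge(src: String, dst: String) // illustrative only

def components(nodes: Set[String], edges: Seq[Edge]): Map[String, Set[String]] = {
  var label = nodes.map(n => n -> n).toMap
  var changed = true
  while (changed) {
    changed = false
    for (e <- edges) {
      val m = Seq(label(e.src), label(e.dst)).min
      if (label(e.src) != m) { label += e.src -> m; changed = true }
      if (label(e.dst) != m) { label += e.dst -> m; changed = true }
    }
  }
  // Group nodes by their final label: one entry per component.
  label.groupBy(_._2).map { case (l, members) => l -> members.keySet }
}

val nodes = Set("a", "b", "c", "d")
val edges = Seq(Edge("a", "b"), Edge("c", "d"))
val clusters = components(nodes, edges)
```

    Label propagation of this shape parallelises well, which is one reason iterative graph algorithms are a natural fit for an in-memory engine like Spark.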

    SANSA-Notebooks

    SANSA’s interactive Spark notebooks will be used to demonstrate the APIs of the different layers. SANSA-Notebooks are easy to deploy using docker-compose. The deployment stack includes Hadoop for HDFS, Spark for running the SANSA APIs, and Hue for browsing and copying files to HDFS. All examples and code snippets throughout this tutorial will be shown and prepared using SANSA-Notebooks.

    Presentation and Format

    This will be a half-day tutorial divided into short sprints. The first session will provide an overview of Apache Spark and SANSA, followed by setup instructions. The later sprints will cover the four layers of SANSA, each with a corresponding hands-on session. This tutorial will help attendees write programs in the in-memory computing framework Apache Spark and use SANSA’s out-of-the-box APIs to handle large-scale RDF data. It will cover the full spectrum of scalable RDF processing: from overview to installation, from basic examples in Spark to executing the APIs of SANSA, and from parsing local or HDFS files to in-memory analytics using SANSA.

    Tutorial Material

    • Presentation slides
    • Worksheets
    • Lecture Notes
    • Notebooks (in addition to the hands-on sessions)

    All the material presented in the tutorial will be made publicly available via the associated webpage.

    Audience

    We expect our tutorial to attract attendees from several research areas, since SANSA sits at the crossroads of several domains within the Semantic Web, namely scalable RDF processing, inference, querying, machine learning, and heterogeneous data integration, while being fully open-source with a steadily growing community. We therefore consider the various sessions offered in the tutorial (both the technical/theoretical and the practical ones) relevant to almost everyone attending the conference.

    Further, given the broad and detailed nature of most of the covered material, we believe our tutorial will be useful for people who want to learn about scalable RDF processing. The tutorial will also serve as a good entry point for newcomers to the field of Big Data. Analytics of RDF, as well as heterogeneous data integration, could be of interest for a variety of use-cases.

    Required knowledge: Knowledge of Java/Scala is required; basic knowledge of distributed computing frameworks and Docker is preferred.

    Related events

    We offered a SANSA tutorial at ESWC 2019, which emphasised the Data Lake aspects and features of the stack. The tutorial was very well received by approximately 35 attendees, and there were many interesting discussions during and after the tutorial. Users from academia as well as industry are keen to learn more about the platform and the advancements offered across consecutive releases.

    Requirements

    We will only need standard projection equipment. Participants interested in following the hands-on sessions should bring their own laptops with a Linux-based OS, preferably with Docker Engine 1.13.0 and docker-compose 1.10.0 (or later) installed.

    References

    1. Lehmann, Jens, et al. “Distributed Semantic Analytics Using the SANSA Stack.” International Semantic Web Conference. Springer, Cham, 2017.
    2. Sejdiu, Gezim, et al. “DistLODStats: Distributed Computation of RDF Dataset Statistics.” In Proceedings of the 17th International Semantic Web Conference, 2018.
    3. Jabeen, Hajira, et al. “Divided We Stand Out! Forging Cohorts fOr Numeric Outlier Detection in Large Scale Knowledge Graphs (CONOD).” European Knowledge Acquisition Workshop. Springer, Cham, 2018.
    4. Sejdiu, Gezim, et al. “A Scalable Framework for Quality Assessment of RDF Datasets.” International Semantic Web Conference, 2019.
    5. Stadler, Claus, Gezim Sejdiu, Damien Graux, and Jens Lehmann. “Querying Large-scale RDF Datasets Using the SANSA Framework.” In Proceedings of the 18th International Semantic Web Conference (ISWC), Posters & Demos, 2019.
    6. Jabeen, Hajira, et al. “DISE: A Distributed in-Memory SPARQL Processing Engine over Tensor Data.” IEEE International Conference on Semantic Computing, 2020.

  • The tutorial is organized as follows:

    Date: TBA

    Time | Topic                                              | Presenter
    TBA  | Introduction                                       | Hajira Jabeen
         | Setting up the environment                         | All
         | RDF Layer: basics and APIs + examples              | Gezim Sejdiu
         | Query Layer (including DataLake): APIs + examples  | Damien Graux
         | Coffee break                                       |
         | Inference Layer: APIs + examples                   | Hajira Jabeen
         | ML Layer: APIs + examples                          | Hajira Jabeen & Damien Graux

  • Dr. Hajira Jabeen (f, https://hajirajabeen.github.io/) is a senior researcher and team leader in the Smart Data Analytics (SDA) research group at the University of Bonn. She works in the area of ‘Distributed Semantic Analytics’. Her research interests are distributed analytics, the Semantic Web, data mining, and big data. She is actively involved in teaching, conducting seminars and labs related to big data and analytics at the University of Bonn. Hajira also works on several projects related to big data and analytics, and offers training in these competency areas.

    Dr. Damien Graux (m, https://dgraux.github.io/) is a research fellow at Trinity College Dublin (Ireland), based in the ADAPT Centre. Before that, he was a senior researcher at Fraunhofer IAIS and in the Smart Data Analytics group at the University of Bonn (Germany). He has been contributing to research efforts in Semantic Web technologies, mainly focusing on distributed query evaluation and on designing complex transformation pipelines for heterogeneous Big Data. Currently, he is continuing his research while being involved in several international projects.

    Gezim Sejdiu (m, https://gezimsejdiu.github.io/) is a Data Engineer at Deutsche Post DHL Group, a PhD student in the Smart Data Analytics group at the University of Bonn, and a SANSA contributor. His research interests are in the areas of the Semantic Web, Big Data, and Machine Learning. He is also interested in distributed computing systems (Apache Spark, Apache Flink). Previously, as a Research Scientist at the University of Bonn, he conducted a Distributed Big Data Analytics lab covering Spark fundamentals for master’s students.

    Heba Mohamed (f, http://sda.cs.uni-bonn.de/people/heba-ibrahim/) is a PhD student in the Smart Data Analytics group at the University of Bonn, Germany, and a SANSA contributor. Before that, she worked as an Assistant Lecturer at the Department of Mathematics and Computer Science, Faculty of Science, Alexandria University, Egypt. Her research interests include the Semantic Web, Big Data, and Machine Learning, in particular distributed machine learning algorithms and inference systems.

    Prof. Dr. Jens Lehmann (m, http://www.jens-lehmann.org) leads the “Smart Data Analytics” research group of 50 researchers at the University of Bonn and Fraunhofer IAIS. His research interests involve the Semantic Web, machine learning, question answering, distributed computing, and knowledge representation. Prof. Lehmann has authored more than 100 articles in international journals and conferences, cited more than 12,000 times, and holds leading positions in several major conferences and journals. He is a founder, leader, or contributor of several community research projects, including SANSA, AskNow, DL-Learner, DBpedia, and LinkedGeoData. Previously, he completed his PhD summa cum laude at the University of Leipzig, with visits to the University of Oxford.