One of the key features of Big Data is its complexity, which can be defined in different ways: data may come from different sources, the same data source may represent different aspects of a resource, or different data sources may represent the same property. These differences in representation, structure, and association make it difficult to apply common methodologies or algorithms to learn from and predict over different types of data. The state of the art for handling this ambiguity and complexity is to represent, or model, the data as Linked RDF Data.
Linked Data is associated with a set of standards for the integration of data and information, as well as for searching and querying it. To create Linked Data, information represented in unstructured form, or in other structured or semi-structured representations, is mapped to the RDF data model; this process is called extraction. RDF has a very flexible data model consisting of so-called triples (subject, predicate, object) that can be interpreted as a labeled directed edge (s, p, o), with s and o being arbitrary resources and p being the property between these two resources. A set of RDF triples thus forms an interlinkable graph, whose flexibility allows it to represent a large variety of highly to loosely structured datasets. RDF, which was standardized by the W3C, is increasingly being adopted to model data in a variety of scenarios, partly due to the popularity of projects like Linked Open Data and schema.org. This linked, semantically annotated data has grown steadily towards a massive scale.
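As a minimal illustration of the triple model, a set of (s, p, o) triples can be treated as a labeled directed graph and queried by pattern matching. This is a plain-Scala sketch with illustrative names; it is not part of the SANSA API.

```scala
// Illustrative model of an RDF triple: a labeled directed edge s --p--> o.
case class Triple(s: String, p: String, o: String)

object TripleGraphDemo {
  // A tiny RDF graph as a set of edges (prefixes ex:/foaf:/rdf: are illustrative).
  val graph: Seq[Triple] = Seq(
    Triple("ex:Alice", "foaf:knows", "ex:Bob"),
    Triple("ex:Alice", "rdf:type",   "foaf:Person"),
    Triple("ex:Bob",   "rdf:type",   "foaf:Person")
  )

  // Pattern matching over triples: all subjects typed as foaf:Person.
  val persons: Seq[String] =
    graph.collect { case Triple(s, "rdf:type", "foaf:Person") => s }
}
```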
SANSA-Stack’s core is a processing data flow engine that provides data distribution, communication, and fault tolerance for distributed computations over large-scale RDF datasets.
This is an overview of the SANSA Stack; each component is described in the respective section of the documentation.
SANSA can run on top of Apache Spark and Apache Flink, which allows users to test and debug SANSA programs in the most common Big Data processing environments.
SANSA RDF / OWL API Programming Guide
RDF API – RDF programs in SANSA are regular programs that implement transformations on RDF datasets (e.g., filtering, mapping, joining, grouping). The datasets are initially created from certain sources, e.g., by reading files (HDFS or local) or from collections.
The main data structures provided are distributed sets of ‘Triples’, following the RDF data model for representing graphs that consist of triples with subject, predicate, and object. RDF datasets may contain multiple RDF graphs and record information about each graph, allowing any of the upper layers of SANSA (Querying and ML) to express queries that involve information from more than one graph. Instead of directly dealing with RDF datasets, the target RDF datasets are first converted into a distributed collection of triples. We call such a collection the main dataset.
For Spark, the main dataset is based on an RDD data structure, which is a basic building block of the Spark framework. RDDs are in-memory collections of records that can be operated on in parallel on large clusters. The Flink implementation contains methods to read N-Triples files into Flink data sets.
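SANSA’s RDF layer performs the actual file reading; as a framework-independent sketch of what the conversion into a main dataset involves, the following simplistic parser turns one well-formed N-Triples line into an (s, p, o) tuple. It is an assumption for illustration only — it splits on whitespace and therefore does not handle literals containing spaces, which a real N-Triples parser must.

```scala
object NTriplesSketch {
  // Parse one well-formed N-Triples line into (s, p, o).
  // Simplification: splits on whitespace, so quoted literals with spaces
  // are not supported -- real parsers (as used by SANSA) handle these.
  def parseLine(line: String): (String, String, String) = {
    val tokens = line.trim.stripSuffix(".").trim.split("\\s+", 3)
    (tokens(0), tokens(1), tokens(2).trim)
  }

  val triple = parseLine("<http://ex.org/a> <http://ex.org/p> <http://ex.org/b> .")
}
```

In the distributed setting, a function like `parseLine` would be mapped over every line of an input file to produce the main dataset of triples.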
OWL API – OWL programs in SANSA are regular programs that implement transformations on OWL axioms (e.g., filtering, mapping, joining, grouping). The data sets are initially created from certain sources, e.g., by reading files (HDFS or local).
The main data structures provided are distributed sets of either so-called ‘expressions’ or OWL axioms. Expressions are string-based representations of single entities of the given input format. These could, for example, be single functional-style axiom descriptions like DisjointDataProperties(bar:dataProp1 bar:dataProp2), or whole Manchester Syntax frames.
In the case of distributed axiom sets, these expressions are already parsed and replaced by the corresponding OWL API OWLAxiom objects.
For Spark we provide builder objects for RDDs and Spark datasets. The Flink implementation contains builders to read OWL files into Flink data sets.
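To illustrate what a distributed set of ‘expressions’ looks like and the kind of transformations applied to it, here is a plain-Scala sketch. The strings follow OWL functional-style syntax; the object and value names are illustrative, not SANSA API names.

```scala
object OwlExpressionsSketch {
  // Functional-style axiom strings, as they would appear in a
  // distributed 'expressions' dataset before parsing into OWLAxiom objects.
  val expressions: Seq[String] = Seq(
    "DisjointDataProperties(bar:dataProp1 bar:dataProp2)",
    "Declaration(Class(bar:Cls1))",
    "SubClassOf(bar:Cls1 bar:Cls2)"
  )

  // A typical transformation: keep only axioms of one kind.
  val disjointness: Seq[String] =
    expressions.filter(_.startsWith("DisjointDataProperties"))
}
```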
SANSA Inference API
This section of the documentation describes the current support for inference available within SANSA. It includes an outline of the general inference API, together with details of the specific rule engines and configurations for RDFS and OWL inference supplied with SANSA.
The inference layer supports rule-based reasoning, i.e., given a set of rules it computes all possible inferences on the given dataset. Technically, forward chaining is applied: it starts with the available data and applies inference rules to derive additional data. This is sometimes also referred to as “materialization”.
Currently, three fixed rulesets are supported:
- RDFS rule reasoner: Implements a configurable subset of the RDFS entailments.
- OWL-Horst and OWL-EL reasoners: Implementations of useful subsets of the OWL language.
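The forward-chaining (materialization) process described above can be sketched in plain Scala with two standard RDFS entailment rules, rdfs9 and rdfs11. This is a didactic fixpoint loop, not the SANSA implementation, and the data and prefixes are illustrative.

```scala
object ForwardChainingSketch {
  type Triple = (String, String, String)

  // rdfs11: (c subClassOf d), (d subClassOf e) => (c subClassOf e)
  // rdfs9 : (x type c),       (c subClassOf d) => (x type d)
  def infer(triples: Set[Triple]): Set[Triple] = {
    val sub   = triples.collect { case (c, "rdfs:subClassOf", d) => (c, d) }
    val types = triples.collect { case (x, "rdf:type", c) => (x, c) }
    val rdfs11 = for ((c, d) <- sub; (d2, e) <- sub if d == d2)
      yield (c, "rdfs:subClassOf", e)
    val rdfs9 = for ((x, c) <- types; (c2, d) <- sub if c == c2)
      yield (x, "rdf:type", d)
    val next = triples ++ rdfs11 ++ rdfs9
    if (next == triples) triples   // fixpoint reached: nothing new derived
    else infer(next)
  }

  val materialized: Set[Triple] = infer(Set(
    ("ex:Student", "rdfs:subClassOf", "ex:Person"),
    ("ex:Person",  "rdfs:subClassOf", "ex:Agent"),
    ("ex:alice",   "rdf:type",        "ex:Student")
  ))
}
```

Starting from three triples, materialization derives, among others, that ex:alice is of type ex:Person and ex:Agent.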
Later versions will contain a generic rule-based reasoner, such that users can define their own set of rules, which will then be used to materialize the given dataset.
SANSA Querying API
SANSA uses the vertical partitioning (VP) approach and is designed to support extensible partitioning of RDF data. Instead of dealing with a single three-column table (s, p, o), the data is partitioned into multiple tables based on the RDF predicates used, the RDF term types, and the literal datatypes. The first column of these tables is always a string representing the subject. The second column always represents the literal value as a Scala/Java datatype. Tables for storing literals with language tags have an additional third string column for the language tag.
- RdfPartition, as the name suggests, represents a partition of the RDF data and defines two methods:
- matches(Triple): Boolean: This method is used to test whether a triple fits into a partition.
- layout: TripleLayout: This method returns the TripleLayout associated with the partition, as explained below.
- Furthermore, RdfPartitions are expected to be serializable, and to define equals and hash code.
- TripleLayout instances are used to obtain framework-agnostic compact tabular representations of triples according to a partition. For this purpose it defines the two methods:
- fromTriple(triple: Triple): Product: This method must, for a given triple, return its representation as a Product (the superclass of all Scala tuples).
- schema: Type: This method must return the exact Scala type of the objects returned by fromTriple, such as typeOf[Tuple2[String, Double]]. Hence, layouts are expected to only yield instances of one specific type.
See the available layouts for details.
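The core idea of vertical partitioning can be sketched in plain Scala by grouping triples into one (subject, value) table per predicate. SANSA additionally partitions by RDF term type and literal datatype, and maps values to proper Scala/Java datatypes; this illustration (with made-up data) omits both.

```scala
object VerticalPartitioningSketch {
  type Triple = (String, String, String)

  val triples: Seq[Triple] = Seq(
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob",   "foaf:name", "Bob"),
    ("ex:alice", "foaf:age",  "29")
  )

  // One two-column (subject, value) table per predicate -- instead of a
  // single three-column (s, p, o) table.
  val tables: Map[String, Seq[(String, String)]] =
    triples.groupBy(_._2).map { case (p, ts) => p -> ts.map(t => (t._1, t._3)) }
}
```

A query touching only one predicate then only has to scan that predicate’s table, which is the main benefit of the VP layout.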
SANSA ML – Machine Learning for SANSA
SANSA-ML is the Machine Learning (ML) library for SANSA. With ML we aim to provide scalable ML algorithms, an intuitive API, and tools that help minimize glue code in end-to-end ML systems.
ML currently supports the following algorithms:
- Distributed SPARQL Query Tree Learning
- Decision Tree Learning
- Frequent Pattern Mining
- Association Rule Learning