SANSA includes several APIs for creating applications:



At SANSA-Stack, when searching for a big data science and engineering framework, we look for one key thing: productivity. Big data science and engineering need a framework that makes practitioners far more productive. Instead of viewing things bottom-up, we take a top-down view of the big data stack and ask what kind of API would maximize data-engineering productivity. The result is the SANSA-Stack, a Distributed Structured ML framework.

Read / Write RDF / OWL Library

The lowest layer is the read/write layer. This layer essentially provides the facility to read and write native RDF or OWL data from HDFS or a local drive and represent it in the native distributed data structures of the frameworks. In addition, we also require a dedicated serialization mechanism for faster I/O. We aim to support the Jena and OWL API interfaces for processing RDF and OWL data, respectively. This particularly targets usability, as many users are already familiar with the corresponding libraries and thus would require less time to get productive with the SANSA stack.
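To make the idea of the read/write layer concrete, the following is a minimal sketch in plain Python of turning N-Triples lines into triples and serializing them back. It is not the SANSA API: the names `Triple`, `parse_ntriples` and `to_ntriples` are hypothetical, and the regular expression covers only a simplified subset of N-Triples (escape sequences are not decoded).

```python
import re
from typing import Iterable, Iterator, NamedTuple


class Triple(NamedTuple):
    """One RDF statement, with each term kept in its N-Triples form."""
    subject: str
    predicate: str
    obj: str


# Simplified N-Triples grammar (an assumption of this sketch):
#   subject   = IRI or blank node
#   predicate = IRI
#   object    = IRI, blank node, or literal with optional datatype / language tag
_LINE = re.compile(
    r'^(<[^>]*>|_:\S+)\s+'
    r'(<[^>]*>)\s+'
    r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?)'
    r'\s*\.$'
)


def parse_ntriples(lines: Iterable[str]) -> Iterator[Triple]:
    """Yield one Triple per statement line, skipping blanks and comments."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        match = _LINE.match(stripped)
        if match is None:
            raise ValueError(f"not a valid N-Triples line: {line!r}")
        yield Triple(*match.groups())


def to_ntriples(triples: Iterable[Triple]) -> Iterator[str]:
    """Serialize triples back to one N-Triples line each."""
    for t in triples:
        yield f"{t.subject} {t.predicate} {t.obj} ."


example = [
    '<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice"@en .',
    "# a comment line",
    "<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> _:b1 .",
]
for triple in parse_ntriples(example):
    print(triple)
```

Because each line is parsed independently, a function like this could in principle be mapped over the partitions of a distributed text file, which is what makes line-based formats such as N-Triples convenient for distributed loading.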

Querying Library

Querying an RDF graph is a primary way to extract and search information in the underlying linked data. It is essential for browsing, searching and exploring the available structured information in a fast and user-friendly manner. SPARQL (SPARQL Protocol and RDF Query Language) is the W3C standard for querying RDF graphs. It is very expressive and allows complex relationships to be extracted with a single query. A SPARQL query describes the information sought as graph patterns, and the result is returned as a set of bindings or as an RDF graph.
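The notion of a query returning "a set of bindings" can be illustrated with a small in-memory matcher for basic graph patterns, the conjunctive core of a SPARQL WHERE clause. This is a naive sketch in plain Python, not SANSA's query engine: the function names are hypothetical, terms are plain strings, and variables are written with a leading `?`.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]


def is_var(term: str) -> bool:
    """Terms starting with '?' are treated as query variables."""
    return term.startswith("?")


def match_pattern(pattern: Triple, triple: Triple,
                  binding: Binding) -> Optional[Binding]:
    """Return binding extended so that pattern matches triple, else None."""
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or check consistency with an earlier binding.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result


def evaluate_bgp(patterns: Sequence[Triple],
                 triples: Sequence[Triple]) -> List[Binding]:
    """Evaluate a conjunction of triple patterns by nested-loop joins."""
    solutions: List[Binding] = [{}]
    for pattern in patterns:
        next_solutions: List[Binding] = []
        for binding in solutions:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


graph = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:bob", "foaf:name", '"Bob"'),
]
# Roughly: SELECT ?x ?y ?n WHERE { ?x foaf:knows ?y . ?y foaf:name ?n }
query = [("?x", "foaf:knows", "?y"), ("?y", "foaf:name", "?n")]
print(evaluate_bgp(query, graph))
```

For the example graph, the only solution binds `?x` to `ex:alice`, `?y` to `ex:bob` and `?n` to `"Bob"`. A distributed engine would replace the nested loops with joins over partitioned data, but the result has the same shape.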

In order to efficiently answer runtime queries over large RDF data, we are exploring the different representation formats offered by distributed frameworks, namely graphs, tables and tensors. Our aim is to support cross-representational transformations for efficient query answering. Our conclusion so far is that Spark GraphX is not very efficient, due to the complex querying related to graph structure. An RDD-based representation, on the other hand, is efficient for queries such as filters or for applying a User Defined Function (UDF) to specific resources, while data frames have been found efficient for calculating the support of rules. We will continue to explore the performance of the different data structures and analyse which representations suit particular types of queries and workflows.

SANSA contains methods to perform queries directly in programs instead of writing the code corresponding to those queries (grouping, sorting, filtering etc.). It also provides a W3C standard compliant SPARQL endpoint for externally querying data that has been loaded using SANSA.

Inference Library

Both RDFS and OWL contain schema information in addition to links between different resources. This additional information, together with the associated rules, makes it possible to reason over knowledge bases in order to infer new knowledge and expand the existing knowledge. The core of the inference process is to repeatedly apply schema-related rules to the input data to infer new facts. This process is helpful for deriving new knowledge and for detecting inconsistencies. SANSA contains an adaptive rule engine that can take a given set of rules and derive an efficient execution plan from them.
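The repeated application of schema rules can be sketched as a naive fixpoint loop. The example below, in plain Python, applies two standard RDFS entailment rules (transitivity of rdfs:subClassOf and propagation of rdf:type along rdfs:subClassOf) until no new facts appear. It only illustrates the principle and is not SANSA's adaptive rule engine; the function name `rdfs_closure` and the prefixed-name strings are assumptions of this sketch.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def rdfs_closure(triples: Iterable[Triple]) -> Set[Triple]:
    """Apply two RDFS rules repeatedly until a fixpoint is reached."""
    facts: Set[Triple] = set(triples)
    while True:
        subclass_pairs = {(s, o) for s, p, o in facts if p == SUBCLASS_OF}
        inferred: Set[Triple] = set()

        # Rule rdfs11: (a subClassOf b), (b subClassOf c) => (a subClassOf c)
        for a, b in subclass_pairs:
            for b2, c in subclass_pairs:
                if b == b2:
                    inferred.add((a, SUBCLASS_OF, c))

        # Rule rdfs9: (x type a), (a subClassOf b) => (x type b)
        for s, p, o in facts:
            if p == RDF_TYPE:
                for a, b in subclass_pairs:
                    if a == o:
                        inferred.add((s, RDF_TYPE, b))

        inferred -= facts
        if not inferred:
            # Nothing new was derived, so the closure is complete.
            return facts
        facts |= inferred


kb = [
    ("ex:Dog", SUBCLASS_OF, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS_OF, "ex:Animal"),
    ("ex:rex", RDF_TYPE, "ex:Dog"),
]
for fact in sorted(rdfs_closure(kb)):
    print(fact)
```

On the three input facts, the loop derives that `ex:Dog` is a subclass of `ex:Animal` and that `ex:rex` is both an `ex:Mammal` and an `ex:Animal`. Recomputing every rule on the full fact set in each round is wasteful, which is one reason a rule engine that plans the execution order matters at scale.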

ML Library

SANSA-ML is the Machine Learning (ML) library in SANSA. While most machine learning algorithms are based on processing simple features, the machine learning algorithms in SANSA exploit the graph structure and semantics of the background knowledge specified using the RDF and OWL standards. In many cases, this yields results that are either more accurate or more human-understandable. The ML layer currently supports the following algorithms:

Supervised Learning
  • Classification
    • Decision Tree Learning
  • Regression
    • Decision Tree Learning

Unsupervised Learning
  • Clustering
    • RDF Modularity Clustering
  • Frequent Pattern Mining
    • Association Rule Learning