Overview

One of the key characteristics of Big Data is its complexity, which can be defined in different ways: data may come from different sources, the same source may describe different aspects of a resource, or different sources may represent the same property. These differences in representation, structure, and association make it difficult to apply common methodologies or algorithms to learn and predict from different types of data. The state of the art for handling this ambiguity and complexity is to represent or model the data as Linked RDF Data.

Linked Data is associated with a set of standards for the integration of data and information, as well as for searching and querying it. To create Linked Data, information available in unstructured form or in other structured or semi-structured representations is mapped to the RDF data model; this process is called extraction. RDF has a very flexible data model consisting of so-called triples (subject, predicate, object) that can be interpreted as labeled directed edges (s, p, o), with s and o being arbitrary resources and p being the property that relates them. A set of RDF triples thus forms an interlinkable graph whose flexibility allows representing a large variety of highly to loosely structured datasets. RDF, which was standardized by the W3C, is increasingly being adopted to model data in a variety of scenarios, partly due to the popularity of projects like Linked Open Data and schema.org. This linked, semantically annotated data has grown steadily towards a massive scale.
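As a small illustration, a triple such as <http://example.org/Alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/Bob> can be created programmatically with Apache Jena, the RDF library SANSA builds on (the URIs below are made up for the example):

    import org.apache.jena.graph.{NodeFactory, Triple}

    // subject and object are resources, the predicate is the property relating them
    val s = NodeFactory.createURI("http://example.org/Alice")
    val p = NodeFactory.createURI("http://xmlns.com/foaf/0.1/knows")
    val o = NodeFactory.createURI("http://example.org/Bob")

    // the triple corresponds to the labeled directed edge (s, p, o)
    val triple = Triple.create(s, p, o)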

SANSA-Stack’s core is a processing data flow engine that provides data distribution, communication, and fault tolerance for distributed computations over large-scale RDF datasets.

SANSA can run on top of Spark and Flink, which allows users to test and debug SANSA programs in the most common Big Data processing environments.

SANSA RDF / OWL API Programming Guide


RDF API – RDF programs in SANSA are regular programs that implement transformations on RDF datasets (e.g., filtering, mapping, joining, grouping). The datasets are initially created from certain sources, e.g., by reading files (from HDFS or the local file system) or from collections.

The main data structures provided are distributed sets of triples. SANSA uses the RDF data model for representing graphs consisting of triples with subject, predicate and object. RDF datasets may contain multiple RDF graphs and record information about each graph, allowing any of the upper layers of SANSA (Querying and ML) to issue queries that involve information from more than one graph. Instead of directly dealing with RDF datasets, the target RDF datasets are converted into a distributed collection of triples. We call such a collection a main dataset.

For Spark, the main dataset is based on an RDD data structure, which is a basic building block of the Spark framework. RDDs are in-memory collections of records that can be operated on in parallel on large clusters. The Flink implementation contains methods to read N-Triples files into Flink data sets.
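As a minimal sketch of how such a main dataset is created on Spark, assuming the implicit rdf reader from the net.sansa_stack.rdf.spark.io package used in the TripleOps example linked below (the input path is a placeholder; the exact reader API may differ between SANSA versions):

    import net.sansa_stack.rdf.spark.io._          // assumed location of the implicit RDF readers
    import org.apache.jena.riot.Lang
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("SANSA triple reader")
      .getOrCreate()

    // read an N-Triples file (local or HDFS) into a distributed collection of Jena triples
    val triples = spark.rdf(Lang.NTRIPLES)("hdfs://namenode/data/dataset.nt")

    triples.take(5).foreach(println)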

RDF Processing


The following RDF formats are supported by SANSA.

  • N-Triples
  • N-Quads
  • TURTLE
  • RDF/XML

In addition, we are working on a generalization of the TripleWriter and TripleReader.



  • Full example code (Spark): https://github.com/SANSA-Stack/SANSA-Examples/blob/master/sansa-examples-spark/src/main/scala/net/sansa_stack/examples/spark/rdf/TripleOps.scala

  • Full example code (Flink): https://github.com/SANSA-Stack/SANSA-Examples/blob/master/sansa-examples-flink/src/main/scala/net/sansa_stack/examples/flink/rdf/TripleOps.scala





  • By implementing your own user-defined function (UDF) over triples and passing it to a filter transformation, for example:
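    A small sketch, reusing the triples RDD from the reader sketch above (the predicate URI is just an example):

    import org.apache.jena.graph.Triple

    // user-defined function: keep only triples whose predicate is rdf:type
    def isTypeTriple(triple: Triple): Boolean =
      triple.getPredicate.getURI == "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

    val typeTriples = triples.filter(isTypeTriple)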


  • By applying your filter operations directly over the distributed collection of triples, for example:
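    Again reusing the triples RDD from above:

    // keep triples that have a URI subject and a literal object
    val literalTriples = triples
      .filter(_.getSubject.isURI)
      .filter(_.getObject.isLiteral)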


  • In Spark, the method textFile() takes a URI for the file (either a local path or an hdfs:// URI). You can run this method on a single file or on a directory that may contain more than one file, for example:
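    The paths below are placeholders:

    // a single N-Triples file
    val lines = spark.sparkContext.textFile("hdfs://namenode/data/dataset.nt")

    // or a whole directory containing several files
    val allLines = spark.sparkContext.textFile("hdfs://namenode/data/triples-dir")

    println(s"number of lines: ${lines.count()}")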


  • The PageRank algorithm computes the importance of each vertex (here representing an RDF resource) in a graph. Resource PageRank is built on top of Spark GraphX; a simplified sketch follows the example link below.

    Full example code: https://github.com/SANSA-Stack/SANSA-Examples/blob/master/sansa-examples-spark/src/main/scala/net/sansa_stack/examples/spark/rdf/PageRank.scala
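    A simplified sketch of the idea, again reusing the triples RDD from above and building the GraphX graph by hand (the linked example is the authoritative version; collecting the id table to the driver is fine for a sketch but not for very large graphs):

    import org.apache.spark.graphx.{Edge, Graph}

    // assign a unique numeric vertex id to every resource occurring as subject or object
    val resources = triples.flatMap(t => Seq(t.getSubject.toString, t.getObject.toString)).distinct()
    val vertices = resources.zipWithUniqueId().map { case (label, id) => (id, label) }

    // turn every triple into a GraphX edge labeled with its predicate
    val idByNode = vertices.map { case (id, label) => (label, id) }.collectAsMap()
    val edges = triples.map { t =>
      Edge(idByNode(t.getSubject.toString), idByNode(t.getObject.toString), t.getPredicate.toString)
    }

    val graph = Graph(vertices, edges)

    // run GraphX PageRank until convergence and print the ten most important resources
    val ranks = graph.pageRank(0.0001).vertices
    ranks.join(vertices).sortBy(-_._2._1).take(10).foreach(println)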

Usage examples and further information can be found in the README file of the SANSA RDF GitHub repository.

OWL API – OWL programs in SANSA are regular programs that implement transformations on OWL axioms (e.g., filtering, mapping, joining, grouping). The data sets are initially created from certain sources, e.g., by reading files (HDFS or local).

The main data structures provided are distributed sets of either so-called ‘expressions’ or OWL axioms. Expressions are string-based representations of single entities of the given input format. These could, for example, be single functional-style axiom descriptions or whole Manchester Syntax frames, as illustrated below.

In the case of distributed axiom sets, these expressions are already parsed and replaced by the corresponding OWL API axiom (OWLAxiom) objects.
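For illustration (the URIs are made up for the example), a functional-style axiom expression could look like

    SubClassOf(<http://example.org/Dog> <http://example.org/Animal>)

while a corresponding Manchester Syntax frame could look like

    Class: <http://example.org/Dog>
        SubClassOf: <http://example.org/Animal>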

For Spark we provide builder objects for RDDs and Spark datasets. The Flink implementation contains builders to read OWL files into Flink data sets.

OWL Processing

    • For example, all axioms of a given type, e.g. OWLSubClassOfAxiom, can be selected from the distributed axiom set:
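      A minimal sketch for Spark, given the SparkSession spark from the earlier sketches; the builder name FunctionalSyntaxOWLAxiomsRDDBuilder, its package, and the input path are assumptions based on the SANSA OWL README, which should be consulted for the exact builder API:

      import net.sansa_stack.owl.spark.rdd.FunctionalSyntaxOWLAxiomsRDDBuilder  // assumed builder location
      import org.semanticweb.owlapi.model.OWLSubClassOfAxiom

      // build a distributed set of parsed OWL API axioms from a functional-syntax file
      val axioms = FunctionalSyntaxOWLAxiomsRDDBuilder.build(spark, "hdfs://namenode/data/ontology.owl")

      // keep only the SubClassOf axioms
      val subClassOfAxioms = axioms.filter(_.isInstanceOf[OWLSubClassOfAxiom])

      println(s"number of SubClassOf axioms: ${subClassOfAxioms.count()}")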

Usage examples and further information can be found in the README file of the SANSA OWL GitHub repository.

 

SANSA Inference API


This section of the documentation describes the current support for inference available within SANSA. It includes an outline of the general inference API, together with details of the specific rule engines and configurations for RDFS and OWL inference supplied with SANSA.

The inference layer supports rule-based reasoning, i.e. given a set of rules it computes all possible inferences on the given dataset. Technically, forward-chaining [1] is applied, i.e. it starts with the available data and uses inference rules to extract more data. This is sometimes also referred to as “materialization”.

Currently, three fixed rulesets are supported:

  1. RDFS rule reasoner: Implements a configurable subset of the RDFS entailments.
  2. OWL-Horst and OWL-EL reasoners: Rule-based reasoners covering useful fragments of the OWL language.

Later versions will contain a generic rule-based reasoner so that users can define their own set of rules, which will then be used to materialize the given dataset.

Inference


The easiest way is to use the RDFGraphMaterializer:
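A sketch of the typical pipeline, based on the RDFGraphInference example linked below and given a SparkSession called spark; the class and method names used here (RDFGraphLoader, ForwardRuleReasonerRDFS, RDFGraphWriter) and their packages are assumptions that may differ between SANSA versions, so treat the linked example as authoritative:

    import net.sansa_stack.inference.spark.data.loader.RDFGraphLoader        // assumed package layout
    import net.sansa_stack.inference.spark.data.writer.RDFGraphWriter
    import net.sansa_stack.inference.spark.forwardchaining.ForwardRuleReasonerRDFS

    // load an RDF graph from disk (local or HDFS) with the desired degree of parallelism
    val graph = RDFGraphLoader.loadFromDisk(spark, "hdfs://namenode/data/dataset.nt", 4)

    // create an RDFS forward-chaining reasoner and materialize the graph
    val reasoner = new ForwardRuleReasonerRDFS(spark.sparkContext)
    val inferredGraph = reasoner.apply(graph)

    // write the materialized graph back to disk
    RDFGraphWriter.writeToDisk(inferredGraph, "hdfs://namenode/data/dataset-inferred")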



  • Full example code (Spark): https://github.com/SANSA-Stack/SANSA-Examples/blob/master/sansa-examples-spark/src/main/scala/net/sansa_stack/examples/spark/inference/RDFGraphInference.scala



  • Full example code (Flink): https://github.com/SANSA-Stack/SANSA-Examples/blob/master/sansa-examples-flink/src/main/scala/net/sansa_stack/examples/flink/inference/RDFGraphInference.scala

Usage examples and further information can be found in the README file of the SANSA Inference GitHub repository.

[1] https://en.wikipedia.org/wiki/Forward_chaining

SANSA Querying API


SANSA uses a vertical partitioning (VP) approach and is designed to support extensible partitioning of RDF data. Instead of dealing with a single three-column table (s, p, o), the data is partitioned into multiple tables based on the RDF predicates, RDF term types and literal datatypes used. The first column of these tables is always a string representing the subject. The second column always represents the literal value as a Scala/Java datatype. Tables for storing literals with language tags have an additional third string column for the language tag.

On Spark, the method for partitioning an RDD[Triple] is located in RdfPartitionUtilsSpark. It uses an RdfPartitioner, which maps a Triple to a single RdfPartition instance.

  • RdfPartition, as the name suggests, represents a partition of the RDF data and defines two methods:
    • matches(Triple): Boolean: This method is used to test whether a triple fits into the partition.
    • layout => TripleLayout: This method returns the TripleLayout associated with the partition, as explained below.
    • Furthermore, RdfPartitions are expected to be serializable and to define equals and hashCode.
  • TripleLayout instances are used to obtain framework-agnostic compact tabular representations of triples according to a partition. For this purpose they define two methods:
    • fromTriple(triple: Triple): Product: This method must, for a given triple, return its representation as a Product (the superclass of all Scala tuples).
    • schema: Type: This method must return the exact Scala type of the objects returned by fromTriple, such as typeOf[Tuple2[String, Double]]. Hence, layouts are expected to only yield instances of one specific type.

See the available layouts for details.
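A small sketch of how partitioning might be invoked on Spark, reusing the triples RDD from the RDF section above; the method name partitionGraph, its package, and the returned map structure are assumptions (only the RdfPartitionUtilsSpark object is named above), so check the SANSA documentation for the exact API:

    import net.sansa_stack.rdf.spark.partition.core.RdfPartitionUtilsSpark   // assumed package location

    // partition the RDD[Triple]; each partition holds a compact tabular RDD of its triples
    val partitions = RdfPartitionUtilsSpark.partitionGraph(triples)

    partitions.foreach { case (partition, tabularRdd) =>
      println(s"$partition -> ${tabularRdd.count()} rows")
    }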

SPARQL Queries


Usage examples and further information can be found in the README file of the SANSA Query GitHub repository.

 

SANSA ML – Machine Learning for SANSA


SANSA-ML is the Machine Learning (ML) library for SANSA. With SANSA-ML we aim to provide scalable ML algorithms, an intuitive API, and tools that help minimize glue code in end-to-end ML systems.

SANSA-ML currently supports the following algorithms:

Supervised Learning

  • Classification
    • Distributed SPARQL Query Tree Learning
    • Decision Tree Learning
  • Regression
    • Decision Tree Learning

Unsupervised Learning

  • Clustering
    • RDF Modularity Clustering
    • BorderFlow Clustering
    • Power Iteration Clustering
    • Link-based Silvia Clustering
  • Frequent Pattern Mining
    • Association Rule Learning
  • Relational Learning for Knowledge Graphs
    • Translation Embedding (TransE)
    • DistMult (Bilinear-Diag)

Machine Learning

SANSA contains the implementation of a partitioning algorithm for RDF graphs given as N-Triples. The algorithm uses the structure of the underlying undirected graph to partition the nodes into different clusters. SANSA’s clustering procedure follows a standard algorithm for partitioning undirected graphs that aims to maximize a modularity function, which was first introduced by Newman.

You will need your RDF graph in the form of a text file, with each line containing exactly one triple of the graph. You then specify the number of iterations and supply a file path where the resulting clusters should be saved.

Rule mining over knowledge bases is used to discover new facts and to identify errors in the data. The mined rules can be used for reasoning, and rules that capture regularities in the data also help to understand it better.

SANSA uses the AMIE+ algorithm to mine association rules, i.e. correlations, in an RDF dataset. These rules have the form r(x,y) <= B1 & B2 & ... & Bn, where r(x,y) is the head and B1 & ... & Bn the body, a conjunction of atoms. The process starts with rules consisting of only one atom, which are then refined by adding more atoms with fresh or previously used variables. A rule is accepted as an output rule if its support is above a given threshold.
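As an illustrative example (not taken from a concrete dataset), such a rule might state that people tend to live where their spouse lives: livesIn(x, y) <= isMarriedTo(x, z) & livesIn(z, y).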

  • Within Spark, the support of a rule is calculated using DataFrames. For rules with two atoms, the predicates of head and body are filtered against a DataFrame that contains all instantiated atoms with a particular predicate. The resulting DataFrames are then joined, and only the rows with matching variables are kept. For longer rules, new atoms are joined with the DataFrames of the previously refined rules, which are stored in Parquet format, with the rule serving as the name of the corresponding folder. A simplified illustration follows the example link below.


    Full example code: https://github.com/SANSA-Stack/SANSA-Examples/blob/master/sansa-examples-spark/src/main/scala/net/sansa_stack/examples/spark/ml/mining/MineRules.scala
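    A simplified illustration of the support computation with plain Spark DataFrames (this is not SANSA’s actual implementation; a triples DataFrame triplesDf with columns s, p, o and the two example predicates are assumed):

    import org.apache.spark.sql.functions.countDistinct
    import spark.implicits._

    // facts for the head predicate livesIn(x, y) and the body predicate isMarriedTo(x, z)
    val head = triplesDf.filter($"p" === "livesIn").select($"s".as("x"), $"o".as("y"))
    val body = triplesDf.filter($"p" === "isMarriedTo").select($"s".as("x"), $"o".as("z"))

    // support of livesIn(x, y) <= isMarriedTo(x, z): head facts for which a matching body fact exists
    val support = head.join(body, Seq("x"))
      .agg(countDistinct($"x", $"y"))
      .first()
      .getLong(0)

    println(s"support: $support")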

Currently, two Knowledge Graph Embedding (KGE) models are implemented: TransE [1] and DistMult (Bilinear-Diag) [2].
The following code snippets show how you can load your dataset and apply the cross-validation techniques supported by SANSA KGE.
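A generic sketch of the workflow with plain Spark, given a SparkSession called spark (the tab-separated input file, the column names and the 80/20 hold-out split are placeholders; SANSA KGE ships its own dataset readers and cross-validation utilities, which are described in the SANSA ML README):

    // load a knowledge graph given as tab-separated (subject, predicate, object) lines
    val tripleDf = spark.read
      .option("delimiter", "\t")
      .csv("hdfs://namenode/data/train.tsv")
      .toDF("s", "p", "o")

    // simple hold-out cross-validation: split the triples into training and test sets
    val Array(train, test) = tripleDf.randomSplit(Array(0.8, 0.2), seed = 42)

    println(s"training triples: ${train.count()}, test triples: ${test.count()}")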


[1] Bordes et al., Translating Embeddings for Modeling Multi-relational Data

[2] Yang et al., Embedding Entities and Relations for Learning and Inference in Knowledge Graphs

Usage examples and further information can be found in the README file of the SANSA ML GitHub repository.
