SANSA Frequently Asked Questions (FAQ)

The following questions are frequently asked about the SANSA project in general. If you have further questions, have a look at the documentation or ask the community.


1. What does SANSA stand for?

Semantic Analytics Stack

Here, we see analytics as the combination of querying, inference and machine learning tasks.

2. Is the project inspired by Game of Thrones?

Any relationship between the SANSA Stack (or SANSA on Spark) and the character Sansa Stark in the above series is purely coincidental.

3. What is the idea behind SANSA?

In SANSA, we combine distributed computing frameworks (specifically Spark and Flink) with the semantic technology stack. Here is an illustration:

[Figure: the SANSA vision, combining distributed analytics and semantic technologies]

The SANSA vision combines distributed analytics (left) and semantic technologies (right) into a scalable semantic analytics stack (top). The colours encode which parts of the two original stacks influence which part of the SANSA stack. The main objective of SANSA is to investigate whether the characteristics of each technology stack (bottom) can be combined to retain the respective advantages.

4. Why is SANSA useful?

The combination of distributed computing and semantic technologies has the potential to exploit several critical advantages. We are clearly not there yet, but initial steps have already been made.

SANSA inherits the following advantages from the semantic technology stack:

a) Powerful Data Integration: Current analytics pipelines have to handle increasing data variety and complexity. Moving from common short-term ad hoc solutions that require a lot of engineering effort to standardised and well-understood semantic technology approaches has had, and will continue to have, a significant impact.

b) Expressive Modelling: The vast majority of machine learning algorithms have to rely on simple input formats, such as feature vectors, rather than being able to use expressive modelling via the Resource Description Framework (RDF) and the Web Ontology Language (OWL). While this has been researched in fields such as Statistical Relational Learning and Inductive Logic Programming, these methods usually do not scale horizontally. Initial work on horizontally scalable machine learning on structured data has been performed, particularly in terms of adding graph processing capabilities to distributed computing frameworks, but those are not aimed at semantic technologies and currently provide limited capabilities.

c) Standards: The use of W3C standards can generally reduce pre-processing time in those cases where data sources are used for more than one analytics task. This is the case for knowledge graphs, which are often combined with several applications including search, information retrieval, advanced querying and filtering of information, as well as visualisation. Beyond this, standardisation makes it possible to draw on generic approaches, e.g. for querying and merging data, rather than developing ad hoc solutions, which are less reusable and often less efficient and effective. The use of standards will also enable a clearer separation of the data pre-processing step, i.e. RDF modelling, and the actual analytics step. This allows experts in either step to focus their efforts on their area of expertise, increasing overall efficiency.

SANSA inherits the following advantages from machine learning research and distributed computing:

d) Measurable Benefits: A key driver for the success of machine learning is that its benefits are often directly measurable, e.g. an accuracy improvement can often be directly translated into a financial benefit. This is not really the case for semantic technologies where the benefits gained through the effort of modelling, editing and extracting knowledge are often not easily measurable. A seamless integration of semantic technologies and machine learning, as envisioned in SANSA, will also help to make the benefits of semantic technologies more visible, as they will translate to machine learning results which are more accurate and easier to understand.

e) Horizontal Scalability: Distributed in-memory computing can provide the horizontal scalability required by the high computation and storage demands of large-scale semantic knowledge graph analytics. However, it does not magically result in higher scalability and requires a deep understanding of the underlying structures and models. For instance, distributed machine learning for expressive logics and the inclusion of inference in knowledge graph embedding models are challenging problems with many open research questions.

5. What architecture does SANSA use?

SANSA uses a technology stack consisting of several layers as shown below:

[Figure: SANSA architecture layers]

6. How can I use SANSA?


RDF Processing

0. How does the distribution of RDF data in SANSA work?

SANSA uses the RDF data model for representing graphs consisting of triples with subject, predicate and object. RDF datasets may contain multiple RDF graphs and record information about each graph, allowing any of the upper layers of SANSA (Querying and ML) to make queries that involve information from more than one graph. Instead of dealing with RDF datasets directly, the target RDF datasets need to be converted into an RDD of triples. We call such an RDD a main dataset. The main dataset is based on the RDD data structure, which is a basic building block of the Spark framework. RDDs are in-memory collections of records that can be operated on in parallel on large clusters.
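The idea of a partitioned collection of triples can be illustrated without Spark. The following is a minimal plain-Python sketch, not SANSA's or Spark's API: the helper names parse_ntriples_line and partition are hypothetical, the parser only handles simple N-Triples lines, and round-robin assignment is just one possible way to spread records over partitions.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def parse_ntriples_line(line: str) -> Triple:
    """Split one simple N-Triples line into (subject, predicate, object)."""
    body = line.strip()
    if body.endswith("."):
        body = body[:-1].strip()  # drop the terminating dot
    subject, predicate, obj = body.split(None, 2)
    return subject, predicate, obj


def partition(records: List[Triple], num_partitions: int) -> List[List[Triple]]:
    """Distribute records round-robin over a fixed number of partitions."""
    parts: List[List[Triple]] = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        parts[index % num_partitions].append(record)
    return parts


lines = [
    '<http://example.org/alice> <http://example.org/knows> <http://example.org/bob> .',
    '<http://example.org/bob> <http://example.org/knows> <http://example.org/carol> .',
    '<http://example.org/alice> <http://example.org/name> "Alice" .',
]
triples = [parse_ntriples_line(line) for line in lines]
partitions = partition(triples, 2)

# Each partition could now be processed independently, e.g. on a different worker.
partition_sizes = [len(part) for part in partitions]
```

In a real cluster the partitions would live on different machines; here they are simply separate Python lists.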

1. How can I read an RDF file and retrieve a Spark RDD representation of it?

2. Does SANSA support different serialisation formats for RDF?

The following RDF formats are supported by SANSA.

  • N-Triples
  • N-Quads

In addition, we are working on generalizing TripleWritter and TripleReader.

3. How can I filter all triples with a certain subject / predicate / object in an RDF file?

  • Full example code:
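As a framework-independent illustration of such a filter (plain Python, not SANSA's API; the function name filter_triples and the prefixed example terms are assumptions made for this sketch), triples can be matched against an optional subject, predicate and object:

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]


def filter_triples(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Keep the triples that match every position that was specified."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:carol"),
    ("ex:alice", "ex:name", '"Alice"'),
]

knows = filter_triples(triples, predicate="ex:knows")
about_alice = filter_triples(triples, subject="ex:alice")
```

On an RDD the same predicate would typically be passed to a filter operation, so that each partition is filtered in parallel.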

4. How can I count the number of subjects / predicates / objects / triples of my RDF file?
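Conceptually, these counts are distinct-value counts over the three positions of a triple. A minimal plain-Python sketch follows; count_terms is a hypothetical helper rather than a SANSA function, and it assumes that distinct subjects, predicates and objects are what should be counted.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def count_terms(triples: List[Triple]) -> Dict[str, int]:
    """Count triples and the distinct subjects, predicates and objects."""
    return {
        "triples": len(triples),
        "subjects": len({s for s, _, _ in triples}),
        "predicates": len({p for _, p, _ in triples}),
        "objects": len({o for _, _, o in triples}),
    }


triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:carol"),
    ("ex:alice", "ex:name", '"Alice"'),
]
counts = count_terms(triples)
```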

5. How can I apply a user defined function to all literals / URIs?

  • By implementing your udf into :

6. How can I search for entities?

  • By applying your filter operations over:

7. Can I load several files without merging them beforehand?

  • In Spark, the method textFile() takes a URI for the file (either a local path or an hdfs:// URI). You can run this method on a single file or on a directory that may contain more than one file by calling:

8. How do I write RDF files?

9. How can I compute the pagerank of resources in RDF files?

  • The PageRank algorithm computes the importance of each vertex (representing a resource) in a graph. Resource PageRank is built on top of Spark GraphX.

    Full example code:
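As a framework-independent illustration of the idea (plain Python, not Spark GraphX or SANSA's API), the following sketch runs a textbook power-iteration PageRank over resource-to-resource edges. The damping factor, the iteration count and the handling of dangling nodes are assumptions of this sketch and may differ from what GraphX does.

```python
from typing import Dict, List, Tuple


def pagerank(
    edges: List[Tuple[str, str]],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Textbook PageRank by power iteration over directed edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links: Dict[str, List[str]] = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            # A node without outgoing links spreads its rank over all nodes.
            targets = out_links[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Edges derived from triples whose object is itself a resource.
edges = [("ex:a", "ex:c"), ("ex:b", "ex:c"), ("ex:c", "ex:a")]
ranks = pagerank(edges)
```

In this small graph ex:c is linked from two resources and therefore ends up with the highest rank, while ex:b, which nothing links to, ends up with the lowest.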


OWL Processing

1. How can I load an OWL file in format XYZ?

2. How can I retrieve all axioms of type XYZ?

    • E.g. OWLSubClassOfAxiom:

3. How can I print loaded OWL axioms?


SPARQL Queries

1. How does SANSA perform distributed RDF querying?

SANSA uses a vertical partitioning (VP) approach and is designed to support extensible partitioning of RDF data. Instead of dealing with a single three-column table (s, p, o), data is partitioned into multiple tables based on the RDF predicates used, the RDF term types and the literal datatypes. The first column of these tables is always a string representing the subject. The second column always represents the literal value as a Scala/Java datatype. Tables for storing literals with language tags have an additional third string column for the language tag.

  • The method for partitioning an RDD[Triple] is located in RdfPartitionUtilsSpark. It uses an RdfPartitioner which maps a Triple to a single RdfPartition instance.

    • RdfPartition, as the name suggests, represents a partition of the RDF data and defines two methods:
      • matches(Triple): Boolean: This method is used to test whether a triple fits into a partition.
      • Layout => TripleLayout: This method returns the TripleLayout associated with the partition, as explained below.
      • Furthermore, RdfPartitions are expected to be serializable, and to define equals and hash code.
    • TripleLayout instances are used to obtain framework-agnostic compact tabular representations of triples according to a partition. For this purpose it defines two methods:
      • fromTriple(triple:Triple): Product: This method must, for a given triple, return its representation as a Product (this is the superclass of all Scala tuples).
      • schema:Type: This method must return the exact Scala type of the objects returned by fromTriple, such as typeOf[Tuple2[String,Double]]. Hence, layouts are expected to only yield instances of one specific type.

    See the available layouts for details.
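The partitioning scheme described above can be sketched in plain Python. This is a simplified illustration, not the RdfPartition or TripleLayout implementation: the functions partition_key, to_row and vertical_partition are hypothetical, and objects are modelled as Python values (a string starting with "http" for an IRI, a float for a double literal, a (text, language) tuple for a language-tagged literal).

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Triple = Tuple[str, str, Any]


def partition_key(triple: Triple) -> Tuple[str, str]:
    """Choose a table by predicate and by the kind of object term."""
    _, predicate, obj = triple
    if isinstance(obj, tuple):
        kind = "lang-string"
    elif isinstance(obj, float):
        kind = "double"
    elif isinstance(obj, str) and obj.startswith("http"):
        kind = "iri"
    else:
        kind = "string"
    return predicate, kind


def to_row(triple: Triple) -> tuple:
    """Compact row: (subject, value), plus a language column if tagged."""
    subject, _, obj = triple
    if isinstance(obj, tuple):
        text, language = obj
        return subject, text, language
    return subject, obj


def vertical_partition(triples: List[Triple]) -> Dict[Tuple[str, str], List[tuple]]:
    """Group triples into one table per (predicate, object kind)."""
    tables: Dict[Tuple[str, str], List[tuple]] = defaultdict(list)
    for triple in triples:
        tables[partition_key(triple)].append(to_row(triple))
    return dict(tables)


triples = [
    ("ex:alice", "ex:knows", "http://example.org/bob"),
    ("ex:alice", "ex:height", 1.68),
    ("ex:alice", "ex:label", ("Alice", "en")),
    ("ex:bob", "ex:height", 1.80),
]
tables = vertical_partition(triples)
```

Every table has a single row shape, which mirrors the requirement that a layout only yields instances of one specific type.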

2. How can I query an RDF file using SPARQL?

3. How can I start an HTTP SPARQL server?



Inference

1. How does the SANSA inference module work?

The inference layer supports rule-based reasoning, i.e. given a set of rules it computes all possible inferences on the given dataset. Technically, forward-chaining [1] is applied, i.e. it starts with the available data and uses inference rules to extract more data. This is sometimes also referred to as “materialization”.

Currently, three fixed rulesets are supported, namely RDFS, OWL-Horst, and OWL-EL. Later versions will contain a generic rule-based reasoner so that users can define their own set of rules, which will then be used to materialize the given dataset.
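Forward chaining can be illustrated with a tiny plain-Python sketch. It is not SANSA's reasoner: it only applies two RDFS rules (transitivity of rdfs:subClassOf and type propagation along rdfs:subClassOf), the function name materialize is hypothetical, and the nested loops are written for readability rather than for scale.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def materialize(triples: Set[Triple]) -> Set[Triple]:
    """Apply two RDFS rules repeatedly until no new triple is derived."""
    inferred = set(triples)
    while True:
        new: Set[Triple] = set()
        for (sub, p1, sup) in inferred:
            if p1 != SUBCLASS_OF:
                continue
            for (s2, p2, o2) in inferred:
                # (A subClassOf B), (B subClassOf C) => (A subClassOf C)
                if p2 == SUBCLASS_OF and s2 == sup:
                    new.add((sub, SUBCLASS_OF, o2))
                # (x type A), (A subClassOf B) => (x type B)
                if p2 == RDF_TYPE and o2 == sub:
                    new.add((s2, RDF_TYPE, sup))
        if new <= inferred:
            return inferred  # fixpoint reached
        inferred |= new


data = {
    ("ex:Cat", SUBCLASS_OF, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS_OF, "ex:Animal"),
    ("ex:tom", RDF_TYPE, "ex:Cat"),
}
closure = materialize(data)
```

Starting from three asserted triples, the loop derives three more, for example that ex:tom is an ex:Animal, and then stops because another pass adds nothing new.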


2. How can I use the inference layer?

The easiest way is to use the RDFGraphMaterializer:

  • Full example code:

  • Full example code:


Machine Learning

1. How can I use SANSA for clustering on RDF graphs?

  • SANSA contains an implementation of a partitioning algorithm for RDF graphs given as N-Triples. The algorithm uses the structure of the underlying undirected graph to partition the nodes into different clusters. SANSA’s clustering procedure follows a standard algorithm for partitioning undirected graphs that aims to maximize a modularity function, which was first introduced by Newman.

    You will need your RDF graph in the form of a text file, with each line containing exactly one triple of the graph. Then you specify the number of iterations and supply a file path where the resulting clusters should be saved.

    Full example code:
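To make the objective concrete, the following plain-Python sketch computes Newman's modularity score for a given assignment of nodes to clusters. It only evaluates the function that such an algorithm tries to maximize; it is not SANSA's clustering procedure, and the function name modularity and the example graph are assumptions of this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def modularity(edges: List[Tuple[str, str]], community: Dict[str, int]) -> float:
    """Newman modularity of an undirected graph for a given clustering.

    Q = sum over clusters c of  e_c / m - (d_c / (2 m))^2
    where m is the number of edges, e_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    intra_edges: Dict[int, int] = defaultdict(int)
    degree_sum: Dict[int, int] = defaultdict(int)
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            intra_edges[community[u]] += 1
    return sum(
        intra_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Two triangles connected by a single bridge edge.
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
two_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_cluster = {node: 0 for node in two_clusters}

q_two = modularity(edges, two_clusters)
q_one = modularity(edges, one_cluster)
```

Splitting the graph into its two triangles scores higher than putting every node into one cluster, which is the kind of difference a modularity-maximizing algorithm exploits.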

2. How can I use SANSA for mining rules?

Rule mining over knowledge bases can be used to discover new facts, and the mined rules can also be used to identify errors in the knowledge base. These rules can be used for reasoning, and rules that capture regularities in the data can help to understand the data better.

SANSA uses the AMIE+ algorithm to mine association rules or correlations in the RDF dataset. These rules have the form r(x,y) <= B1 & B2 & … & Bn, where r(x,y) is the head and B1 & … & Bn is the body, a conjunction of atoms. The process starts with rules with only one atom, which are then refined by adding more atoms with fresh or previously used variables. A rule is accepted as an output rule if its support is above a support threshold.


    Within Spark, the support of a rule is calculated using DataFrames. For rules with two atoms, the predicates of head and body are filtered against a DataFrame which contains all the instantiated atoms with a particular predicate. The different DataFrames are then joined and only the rows with matching variables are kept. For larger rules, new atoms are joined with the previous DataFrames (previously refined rules), which are stored in the Parquet format with the rules as names of the corresponding folders.

    Full example code:
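The support computation for a two-atom rule can be illustrated in plain Python. This is a simplified sketch, not SANSA's DataFrame-based implementation: it only covers rules of the form head(x,y) <= body(x,y), uses a set intersection in place of a DataFrame join, and the names rule_support and is_accepted as well as the threshold value are assumptions.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def rule_support(triples: List[Triple], head: str, body: str) -> int:
    """Support of head(x,y) <= body(x,y).

    Counts the distinct (x, y) pairs for which both the body atom and
    the head atom are present in the data.
    """
    head_pairs = {(s, o) for s, p, o in triples if p == head}
    body_pairs = {(s, o) for s, p, o in triples if p == body}
    return len(head_pairs & body_pairs)  # the "join" on x and y


def is_accepted(triples: List[Triple], head: str, body: str, threshold: int) -> bool:
    """A rule is kept if its support is above the support threshold."""
    return rule_support(triples, head, body) > threshold


triples = [
    ("ex:anna", "ex:marriedTo", "ex:ben"),
    ("ex:anna", "ex:livesWith", "ex:ben"),
    ("ex:carl", "ex:marriedTo", "ex:dora"),
    ("ex:carl", "ex:livesWith", "ex:dora"),
    ("ex:eve", "ex:marriedTo", "ex:finn"),
]

# Candidate rule: livesWith(x, y) <= marriedTo(x, y)
support = rule_support(triples, head="ex:livesWith", body="ex:marriedTo")
accepted = is_accepted(triples, "ex:livesWith", "ex:marriedTo", threshold=1)
```

Rules with more body atoms would require joining on shared variables rather than intersecting pairs, which is where the DataFrame joins described above come in.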
