3. How can I use SANSA for Knowledge Graph Embedding Models?

Currently, two Knowledge Graph Embedding (KGE) models are implemented: TransE [1] and DistMult (Bilinear-Diag) [2].
The code snippet below shows how you can load your dataset and apply the cross-validation techniques supported by SANSA's KGE module.
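A minimal sketch of such a pipeline is given below. Only the Spark calls are standard here; the commented-out TransE constructor and its parameters are assumptions about SANSA's KGE API, so check the module's documentation for the exact names.

    import org.apache.spark.sql.SparkSession

    object KGEPipelineSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("KGEPipelineSketch")
          .master("local[*]")
          .getOrCreate()

        // Load the dataset as (subject, predicate, object) rows; a tab-separated
        // triple file is assumed here -- adapt the reader to your input format.
        val triples = spark.read
          .option("delimiter", "\t")
          .csv("path/to/triples.tsv") // illustrative path
          .toDF("subject", "predicate", "object")

        // Hold out a test split, the simplest cross-validation scheme.
        val Array(train, test) = triples.randomSplit(Array(0.8, 0.2), seed = 42)

        // Hypothetical call into the KGE module -- the class name and its
        // parameters are assumptions, not the confirmed SANSA API:
        // val model = new TransE(train, dimension = 100, margin = 1.0f)
        // model.run()

        spark.stop()
      }
    }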


2. How can I use SANSA for mining rules?

Rule mining over knowledge bases is used to discover new facts, and the mined rules can also be used to identify errors in the data. Such rules can serve as input for reasoning, and rules that capture regularities in the data help to understand it better.

SANSA uses the AMIE+ algorithm to mine association rules, i.e. correlations, in an RDF dataset. These rules have the form r(x,y) <= B1 & B2 & ... & Bn, where r(x,y) is the head and the body B1 & ... & Bn is a conjunction of atoms. The process starts with rules consisting of only one atom, which are then refined by adding more atoms with fresh or previously used variables. A rule is accepted as an output rule if its support is above a given support threshold.
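As an illustration (this example rule is for exposition only, not taken from the SANSA distribution): the rule livesIn(x,y) <= isMarriedTo(x,z) & livesIn(z,y) predicts that a person lives where their spouse lives; here livesIn(x,y) is the head, and the two body atoms share the fresh variable z.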

  • Within Spark, the support of a rule is calculated using DataFrames. For rules with two atoms, the predicates of the head and body are each filtered against a DataFrame that contains all instantiated atoms with that particular predicate. The resulting DataFrames are then joined, and only the rows with matching variable bindings are kept. For larger rule sizes, new atoms are joined with the DataFrames of the previously refined rules, which are stored in Parquet format, using the rule names as the names of the corresponding folders. A sketch of this support computation follows this list.
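The following sketch mirrors that computation for the two-body-atom example rule above. The Spark DataFrame calls are standard; the input path, delimiter, and predicate names are illustrative assumptions, and SANSA's actual implementation is in the MineRules example linked below.

    import org.apache.spark.sql.SparkSession

    object RuleSupportSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("RuleSupportSketch")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Triples as (s, p, o) rows; the path and delimiter are illustrative.
        val triples = spark.read
          .option("delimiter", "\t")
          .csv("path/to/triples.tsv")
          .toDF("s", "p", "o")

        // One DataFrame per predicate: all instantiated atoms with that
        // predicate, columns renamed to the rule's variables.
        val head  = triples.filter($"p" === "livesIn").select($"s".as("x"), $"o".as("y"))
        val body1 = triples.filter($"p" === "isMarriedTo").select($"s".as("x"), $"o".as("z"))
        val body2 = triples.filter($"p" === "livesIn").select($"s".as("z"), $"o".as("y"))

        // Join on the shared variables so only consistent bindings survive;
        // the distinct head bindings that remain give the rule's support.
        val support = head
          .join(body1, Seq("x"))
          .join(body2, Seq("z", "y"))
          .select("x", "y")
          .distinct()
          .count()

        println(s"support = $support")

        // Intermediate refined rules could be persisted for reuse, e.g.:
        // head.join(body1, Seq("x")).write.parquet("rules/<ruleName>")

        spark.stop()
      }
    }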


    Full example code: https://github.com/SANSA-Stack/SANSA-Examples/blob/master/sansa-examples-spark/src/main/scala/net/sansa_stack/examples/spark/ml/mining/MineRules.scala

1. How can I use SANSA for clustering on RDF graphs?

SANSA contains the implementation of a partitioning algorithm for RDF graphs given as N-Triples. The algorithm uses the structure of the underlying undirected graph to partition the nodes into different clusters. SANSA’s clustering procedure follows a standard algorithm for partitioning undirected graphs that aims to maximize a modularity function, first introduced by Newman.
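For context (the formula is supplied here, not quoted from SANSA), Newman's modularity of a partition is Q = (1/2m) * Σ_ij [A_ij − (k_i * k_j)/(2m)] * δ(c_i, c_j), where A is the adjacency matrix, k_i the degree of node i, m the number of edges, and δ(c_i, c_j) is 1 exactly when nodes i and j are in the same cluster.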

You will need your RDF graph in the form of a text file, with each line containing exactly one triple of the graph. You then specify the number of iterations and supply a file path where the resulting clusters should be saved. A sketch of preparing such an input is shown below.
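The sketch below covers only the loading side under those assumptions. The N-Triples handling is deliberately crude and the paths are illustrative; the clustering entry point itself is left as a comment, since its exact name should be taken from the example code in the SANSA-Examples repository.

    import org.apache.spark.graphx.{Edge, Graph}
    import org.apache.spark.sql.SparkSession

    object RDFClusteringSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("RDFClusteringSketch")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        // One N-Triples statement per line; the path is illustrative.
        val lines = sc.textFile("path/to/graph.nt")

        // Crude split into (subject, predicate, object); a real parser, e.g.
        // SANSA's RDF layer, should be used beyond a sketch (literals with
        // spaces would break this).
        val triples = lines.map(_.split(" ", 3)).filter(_.length == 3)

        // Build the graph whose nodes are the subjects and objects. GraphX
        // stores directed edges; the clustering treats the graph as undirected.
        val nodes = triples
          .flatMap(t => Seq(t(0), t(2).stripSuffix(" .")))
          .distinct()
          .zipWithUniqueId()
        val idOf = nodes.collectAsMap()
        val edges = triples.map(t => Edge(idOf(t(0)), idOf(t(2).stripSuffix(" .")), t(1)))
        val graph = Graph(nodes.map(_.swap), edges)

        println(s"nodes = ${graph.numVertices}, edges = ${graph.numEdges}")

        // The modularity-based partitioning itself is provided by SANSA; the
        // exact entry point is not reproduced here -- see SANSA-Examples.

        spark.stop()
      }
    }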