Rule mining for knowledge bases is used to look for new facts, or such rules can be used to identify errors in the knowledge bases. These rules can be used for reasoning, and the rules that define regularities in the data can be used to understand the data better.

SANSA uses AMIE+ algorithm to mine association rules or correlations in the RDF dataset. These rules have the form r(x,y) <= B1 & B2 & ... Bn while r(x,y) is the head and B the body, a conjunction of atoms of the rule. The process starts with rules with only one atom, which are then refined to add more atoms with fresh or previously used variables. The rules is accepted as output rule, if its support is above a support threshold.

  • Within spark, the support of a rule is calculated using DataFrames. For the rules with two atoms, the predicates of head and body are filtered against a dataframe, which contains all the instantiated atoms with a particular predicate. Different dataframes are then joined and only the rows with correct variables are kept. For greater sizes, new atoms are joined with the previous dataframes (previously refined rules), which are stored in the parquet format with rules as name for corresponding folders.

    Full example code: