SANSA Frequently Asked Questions (FAQ)
Semantic Analytics Stack
Here, we see analytics as the combination of querying, inference and machine learning tasks.
In SANSA, we combine distributed computing frameworks (specifically Spark and Flink) with the semantic technology stack. Here is an illustration:
The SANSA vision combines distributed analytics (left) and semantic technologies (right) into a scalable semantic analytics stack (top). The colours encode what part of the two original stacks influence which part of the SANSA stack. The main objective of SANSA is to investigate whether the characteristics of each technology stack (bottom) can be combined to retain the respective advantages.
The combination of distributed computing and semantic technologies has the potential to exploit several critical advantages. We are clearly not there yet, but initial steps have already been made.
SANSA inherits the following advantages from the semantic technology stack:
Powerful Data Integration
Current analytics pipelines have to handle increasing data variety and complexity. Moving from common short term ad hoc solutions that require a lot of engineering effort, to standardised and well-understood semantic technology approaches had and will have significant impact.
The vast majority of machine learning algorithms have to rely on simple input formats, such as feature vectors, rather than being able to use expressive modelling via the Resource Description Framework (RDF) and the Web Ontology Language (OWL). While this has been researched in fields such as Statistical Relational Learning and Inductive Logic Programming, these methods usually do not scale horizontally. Initial work on horizontally scalable machine learning on structured data has been performed, particularly in terms of adding graph processing capabilities to distributed computing frameworks, but those are not aimed at semantic technologies and currently provide limited capabilities.
The usage of W3C standards can generally reduce pre-processing time in those cases when data sources are used for more than one analytics task. This is the case for knowledge graphs, which are often combined with several applications including search, information retrieval, advanced querying and filtering of information, as well as visualisation. Beyond this, the standardisation allows to draw on generic approaches, e.g. for querying and merging data, rather than developing ad hoc solutions, which are less reusable and often less efficient and effective. The use of standards will also enable a clearer separation of the data pre-processing step, i.e. RDF modelling, and the actual analytics step. This allows experts in either step to focus efforts on their expertise, increasing overall efficiency.
SANSA inherents the following advantages from machine learning research and distributed computing:
A key driver for the success of machine learning is that its benefits are often directly measurable, e.g. an accuracy improvement can often be directly translated into a financial benefit. This is not really the case for semantic technologies where the benefits gained through the effort of modelling, editing and extracting knowledge are often not easily measurable. A seamless integration of semantic technologies and machine learning, as envisioned in SANSA, will also help to make the benefits of semantic technologies more visible, as they will translate to machine learning results which are more accurate and easier to understand.
Distributed in-memory computing can provide the horizontal scalability required by the high computation and storage demands of large-scale semantic knowledge graphs analytics. However, it does not magically result in higher scalability and requires a deep understanding of the underlying structures and models. For instance, distributed machine learning for expressive logics and the inclusion of inference in knowledge graph embedding models are challenging problems with many open research questions.