Quality Assessment of RDF Datasets at Scale

“Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.”
— Clive Humby (UK Mathematician and Chief Data Scientist at Starcount)

Extracting meaningful information from what is essentially chaotic data has been a challenge for many years. This data often arrives in unstructured form and grows daily, which makes it difficult to analyze and use. At this scale, data quality becomes a central concern: unstructured data acquired from different sources often delays insights and analysis because of quality issues.

One way to handle this ambiguity and complexity is to model the data using Semantic Web technologies. Over the last years, Linked Data has grown continuously; today, more than 10,000 datasets are available online following Linked Data standards. These standards allow data to be machine-readable and interoperable. Nevertheless, many applications, such as data integration, search, and interlinking, cannot take full advantage of Linked Data if it is of low quality. A few approaches for the quality assessment of Linked Data exist, but their performance degrades as the data size increases and quickly grows beyond the capabilities of a single machine.

In order to overcome this, we present DistQualityAssessment [1] — an open-source implementation of quality assessment of large RDF datasets (integrated into SANSA) that can scale out to a cluster of machines.

Getting Started

Within the scope of this post, we will be using SANSA-Notebooks as the base setup for running SANSA and, in particular, distributed quality assessment. You can also run it using your preferred IDE (see here for more details on how to set up SANSA in an IDE).

SANSA Notebooks (or SANSA Workbench) are interactive Spark notebooks for running the SANSA examples and are easy to deploy with docker-compose. The deployment stack includes Hadoop for HDFS, Spark for running the SANSA examples, and Hue for navigating HDFS and copying files to it. The notebooks are created and run using Apache Zeppelin. The setup requires “Docker Engine >= 1.13.0, docker-compose >= 1.10.0, and around 10 GB of disk space for Docker images”.

After you are done with the docker installation, clone the SANSA-Notebooks git repository:
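    git clone https://github.com/SANSA-Stack/SANSA-Notebooks
    cd SANSA-Notebooks

The remaining setup steps below use the Makefile targets documented in the SANSA-Notebooks repository; if they do not work for you, check the repository’s README, as the target names may have changed.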

Get the SANSA Examples jar file (requires wget):
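    make    # fetches the pre-built SANSA-Examples jar via wget (default Makefile target; see the README if this has changed)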

Start the cluster (this will lead to downloading BDE docker images):
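    make up    # starts the Hadoop/Spark/Hue/Zeppelin stack via docker-compose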

When start-up is done you will be able to access the SANSA notebooks.

To load the data to your cluster simply do:
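    make load-data    # copies the example data into HDFS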

After you have loaded the data (you can upload your own data using the Hue HDFS file browser, or simply add it to the data folder and run the command above to load it), open the SANSA Notebooks, choose any of the available notebooks or create a new one (choose Spark as the interpreter), and add the following code snippets.
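The sketch below covers the read step. It assumes the SANSA RDF Spark API (the net.sansa_stack.rdf.spark.io and net.sansa_stack.rdf.spark.qualityassessment packages) and the spark session provided by Zeppelin’s Spark interpreter; the HDFS path is a placeholder that you should adapt to the data you loaded.

    import org.apache.jena.riot.Lang
    import net.sansa_stack.rdf.spark.io._                 // SANSA io API for reading RDF into Spark
    import net.sansa_stack.rdf.spark.qualityassessment._  // SANSA quality assessment metrics

    val input = "hdfs://namenode:8020/data/rdf.nt"        // placeholder HDFS path, adapt to your setup
    val lang  = Lang.NTRIPLES                             // serialization of the input file
    val triples = spark.rdf(lang)(input)                  // RDD of triples, distributed over the cluster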

In order to read data in an efficient and scalable way, SANSA provides its own readers for different RDF serialization formats. In our case, we are going to use N-Triples (Lang.NTRIPLES in the snippet above), which requires importing the SANSA io API so that the reading operations become available.

As the data is kept in HDFS, we have to specify the path where the data resides (the input value in the snippet) and the serialization it is represented in (in our case N-Triples, the lang value). Afterwards, we generate a Spark RDD (Resilient Distributed Dataset) representation of the triples (the triples value), which allows us to perform quality assessment at scale.

The SANSA quality assessment component is part of the framework, so its metrics can be invoked simply by importing the qualityassessment namespace (the net.sansa_stack.rdf.spark.qualityassessment import above). It comes with a range of ready-to-use quality assessment metrics.

Evaluating a quality assessment metric generates a numerical representation of the quality check. Here, for example, we want to assess the Schema Completeness of a given dataset (see the paper “Quality Assessment for Linked Data: A Survey” for more details on the metric definitions). We can assess it as follows.
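The one-liner below is a sketch; it assumes the assessSchemaCompleteness() function that the qualityassessment namespace adds to the triples RDD:

    // compute the Schema Completeness metric on the triples RDD (returns a numeric score)
    val schemaCompleteness = triples.assessSchemaCompleteness()
    println(s"Schema completeness: $schemaCompleteness")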

Within the SANSA Notebooks, we can visualize the metric values and export the results if needed for further analysis.


For more details about the architecture of the approach, its technical underpinnings, and its evaluation and comparison with other systems, please check out our paper [1].

After you are done, you can stop the whole stack:
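    make down    # stops and removes the docker-compose stack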

That’s it, we hope you enjoyed it. If you have any questions or suggestions, or if you are planning a project with SANSA, perhaps using this quality assessment component, we would love to hear about it! By the way, we always welcome new contributors to the project: please see our contribution guide for more details on how to get started contributing to SANSA.

References

[1] Gezim Sejdiu, Anisa Rula, Jens Lehmann, and Hajira Jabeen. “A Scalable Framework for Quality Assessment of RDF Datasets”. In Proceedings of the 18th International Semantic Web Conference (ISWC), 2019.

SANSA Parser Performance Improved

A more efficient RDF N-Triples parser has been introduced in SANSA, with parsing improvements of up to an order of magnitude; for example, DBpedia can be read in under 100 seconds on a 7-node cluster.

SANSA provides a set of readers for different RDF serializations, and for most of them the Jena RIOT reader has been used. Unfortunately, this represented a bottleneck for a range of use cases in which SANSA has been applied, where tens of billions of triples need to be processed.


Fig. 1. Different RDF serialization readers supported on SANSA.

In order to improve the efficiency of the N-Triples reader, we have been working on a Spark-based implementation over the last weeks and are happy to announce that we could speed up the processing time by up to an order of magnitude. As an example, the data currently loaded in the main public DBpedia endpoint (450 million triples) can now be read in less than 100 seconds on a cluster with 6 worker nodes. All improvements will be integrated into SANSA with the new version (SANSA 0.4), which we will ship at the end of June.

For people interested in performance details: we did some experiments during development, also on publicly available datasets, which we report here. We tested on three well-known RDF datasets, namely LUBM, DBpedia, and Wikidata, and measured the time it takes to read and distribute the datasets on the cluster. As you can see in the figure below, we reduced the processing time of those datasets by up to an order of magnitude.

Fig. 2. Processing times of the improved N-Triples reader on the LUBM and DBpedia datasets.

For future work, we will work on the Flink part of the RDF layer to align its performance with the Spark implementation, and investigate how different optimization techniques, such as compression and partitioning strategies, interact with the Spark and Flink implementations.

Datasets:

Dataset              #triples
LUBM 10              1 316 342
LUBM 100             13 876 156
LUBM 1000            138 280 374
LUBM 5000            690 895 862
DBpedia 2016-10      451 685 478
Wikidata (truthy)    2 786 548 055

Notes:
  • LUBM: the datasets have been generated via an improved version of the generator maintained by Rob Vesse, available at https://github.com/rvesse/lubm-uba
  • DBpedia 2016-10: we loaded the data which is also published on the public DBpedia SPARQL endpoint, as described at http://wiki.dbpedia.org/public-sparql-endpoint
  • Wikidata (truthy): latest truthy dump taken from https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2 (we used the version of May 07, 2018)

Experimental Settings: 

  • Server Setup: 7 nodes, 128 GB RAM, 16 physical cores
  • Spark Setup:  1 master node, 6 slave nodes, Spark version 2.3.1
  • Spark Submit Settings:

 

SANSA Collaboration with Alethio

The SANSA team is excited to announce our collaboration with Alethio (a ConsenSys formation). SANSA is the major distributed, open source solution for RDF querying, reasoning and machine learning. Alethio is building an Ethereum analytics platform that strives to provide transparency over what’s happening on the Ethereum p2p network, the transaction pool and the blockchain, and to offer “blockchain archeology”. Their 5 billion triple dataset contains large-scale blockchain transaction data modelled as RDF according to the structure of the Ethereum ontology. EthOn – The Ethereum Ontology – is a formalization of the concepts/entities and relations of the Ethereum ecosystem, represented in RDF and OWL. It describes all Ethereum terms, including blocks, transactions, contracts, nonces, etc., as well as their relationships. Its main goal is to serve as a data model and learning resource for understanding Ethereum.

Alethio is interested in using SANSA as a scalable processing engine for their large-scale batch and stream processing tasks, such as querying the data in real time via SPARQL and performing related analytics on a wide range of subjects (e.g. asset turnover for sets of accounts, attack pattern detection or opcode usage statistics). At the same time, SANSA is interested in further industrial pilot applications for testing its scalability on larger datasets, maturing its code base and gaining experience in running the stack on production clusters. Specifically, the initial goal of Alethio was to load a 2 TB EthOn dataset containing more than 5 billion triples and then perform several analytic queries on it with up to three inner joins. The queries are used to characterize movement between groups of Ethereum accounts (e.g. exchanges or investors in ICOs) and aggregate their in and out value flow over the history of the Ethereum blockchain. The experiments were successfully run by Alethio on a cluster with up to 100 worker nodes and 400 cores, with a total of over 3 TB of memory available.

“I am excited to see that SANSA works and scales well to our data. Now, we want to experiment with more complex queries and tune the Spark parameters to gain the optimal performance for our dataset,” said Johannes Pfeffer, co-founder of Alethio. “I am glad that Alethio managed to run their workload and to see how well our methods scale to a 5 billion triple dataset,” added Gezim Sejdiu, PhD student at the Smart Data Analytics Group and SANSA core developer.

Parts of the SANSA team, including its leader Prof. Jens Lehmann as well as Dr. Hajira Jabeen, Dr. Damien Graux and Gezim Sejdiu, will now continue the collaboration together with the data science team of Alethio after those successful experiments. Beyond the above initial tests, we are jointly discussing possibilities for efficient stream processing in SANSA, further tuning of aggregate queries as well as suitable Apache Spark parameters for efficient processing of the data. In the future, we want to join hands to optimize the performance of loading the data (e.g. reducing the disk footprint of datasets using compression techniques allowing then more efficient SPARQL evaluation), handling the streaming data, querying, and analytics in real time.

The SANSA team is happily looking forward to further interesting scientific research as well as industrial adaptation.

Core model of the fork history of the Ethereum Blockchain modeled in EthOn

SANSA 0.3 (Semantic Analytics Stack) Released

We are happy to announce SANSA 0.3 – the third release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark and Flink in order to allow scalable machine learning, inference and querying capabilities for large knowledge graphs.

You can find the FAQ and usage examples at http://sansa-stack.net/faq/.

The following features are currently supported by SANSA:

  • Reading and writing RDF files in N-Triples, Turtle, RDF/XML, and N-Quads formats
  • Reading OWL files in various standard formats
  • Support for multiple data partitioning techniques
  • SPARQL querying via Sparqlify (with some known limitations until the next Spark 2.3.* release)
  • SPARQL querying via conversion to Gremlin path traversals (experimental)
  • RDFS, RDFS Simple, OWL-Horst (all in beta status), EL (experimental) forward chaining inference
  • Automatic inference plan creation (experimental)
  • RDF graph clustering with different algorithms
  • Rule mining from RDF graphs based on AMIE+
  • Terminological decision trees (experimental)
  • Anomaly detection (beta)
  • Distributed knowledge graph embedding approaches: TransE (beta), DistMult (beta), several further algorithms planned

Deployment and getting started:

  • There are template projects for SBT and Maven for Apache Spark as well as for Apache Flink available to get started.
  • The SANSA jar files are in Maven Central, i.e. in most IDEs you can just search for “sansa” to include the dependencies in Maven projects (a dependency sketch follows after this list).
  • There is example code for various tasks available.
  • We provide interactive notebooks for running and testing code via Docker.
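For SBT-based projects, a dependency declaration would look roughly like the sketch below; the group and artifact names used here are assumptions and should be verified by searching for “sansa” on Maven Central:

    // build.sbt (illustrative only; verify the exact coordinates and version on Maven Central)
    libraryDependencies ++= Seq(
      "net.sansa-stack" %% "sansa-rdf-spark"   % "0.3.0",
      "net.sansa-stack" %% "sansa-query-spark" % "0.3.0"
    )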

We want to thank everyone who helped to create this release, in particular the projects Big Data Europe, HOBBIT, SAKE, Big Data Ocean, SLIPO, QROWD and BETTER.

Greetings from the SANSA Development Team

SANSA at ISWC 2017 and a Demo Award


The International Semantic Web Conference (ISWC) is the premier international forum where Semantic Web / Linked Data researchers, practitioners, and industry specialists come together to discuss, advance, and shape the future of semantic technologies on the web, within enterprises, and in public institutions.

We are very pleased to announce that we had a paper accepted at ISWC 2017 for presentation at the main conference. Additionally, we had a demo paper accepted.

 

“Distributed Semantic Analytics using the SANSA Stack” by Jens Lehmann, Gezim Sejdiu, Lorenz Bühmann, Patrick Westphal, Claus Stadler, Ivan Ermilov, Simon Bin, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo and Hajira Jabeen.
Prof. Dr. Jens Lehmann presented the work done in the SANSA project, with the main focus on offering a compact, scalable engine for the whole Semantic Web stack. The audience showed high interest in the project; the room was packed, with around 150 attendees and many people standing.

Website: http://sansa-stack.net/
GitHub: https://github.com/SANSA-Stack
Slides: https://www.slideshare.net/JensLehmann/sansa-iswc-international-semantic-web-conference-2017-talk

Furthermore, we are very happy to announce that we won the Best Demo Award for the SANSA Notebooks:
“The Tale of Sansa Spark” by Ivan Ermilov, Jens Lehmann, Gezim Sejdiu, Lorenz Bühmann, Patrick Westphal, Claus Stadler, Simon Bin, Nilesh Chakraborty, Henning Petzka, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo and Hajira Jabeen.


Here are some further pointers in case you want to know more about SANSA:

The audience displayed enthusiasm during the demonstration, appreciating the work and asking questions about the future of SANSA, technical details, and possible synergies with industrial partners and projects. Gezim Sejdiu and Jens Lehmann, who were presenting the demo, were talking for 3+ hours non-stop (without even time to eat 😉 ).

ISWC17 was a great venue to meet the community, create new connections, talk about current research challenges, share ideas and settle new collaborations.

SANSA @ 4th Big Data Europe Plenary at Leipzig University

The meeting, hosted by our partner InfAI e. V., took place on the 14th and 15th of December at the University of Leipzig.
The 29 attendees, representing 15 partners, discussed and reviewed the progress of all work packages in 2016 and planned the activities and workshops taking place in the next six months.

 

During the first day of the meeting, Prof. Dr. Jens Lehmann presented the current status of SANSA.


The audience showed high interest in his presentation and appreciated the usage of distributed frameworks applied to the Web of Data. The following discussion included further challenges on SANSA-specific layers and constructive suggestions for possible improvements.

 

SANSA 0.1 (Semantic Analytics Stack) Released

Dear all,

We’re very happy to announce SANSA 0.1 – the initial release of the Scalable Semantic Analytics Stack. SANSA combines distributed computing and semantic technologies in order to allow powerful machine learning, inference and querying capabilities for large knowledge graphs.

Website: http://sansa-stack.net
GitHub: https://github.com/SANSA-Stack
Download: http://sansa-stack.net/downloads-usage/
ChangeLog: https://github.com/SANSA-Stack/SANSA-Stack/releases

You can find the FAQ and usage examples at http://sansa-stack.net/faq/.

The following features are currently supported by SANSA:

  • Support for reading and writing RDF files in N-Triples format
  • Support for reading OWL files in various standard formats
  • Querying and partitioning based on Sparqlify
  • Support for RDFS/RDFS Simple/OWL-Horst forward chaining inference
  • Initial RDF graph clustering support
  • Initial support for rule mining from RDF graphs

Visit the release notes to read about the new features, or download the release today.

We want to thank everyone who helped to create this release, in particular, the projects Big Data Europe, HOBBIT and SAKE.

Kind regards,

The SANSA Development Team