Many governments and governmental organizations throughout the world allow the public to access the data they produce or collect (e.g. http://data.gv.at/ or http://data.opendataportal.at). A large portion of this open data, very often statistical in nature, is published in the form of tables encoded as common CSV files. This practice is not ideal, however, because the data would be considerably more useful as Linked Open Data. In general, this means assigning globally unique identifiers to the individual pieces of content and linking them to other, external sources. In detail, this involves:
- Classifying the table columns against existing knowledge bases, based on their content and context.
- Assigning globally unique identifiers (URIs, or IRIs, which also support national characters) to cell values according to Linked Data principles. Such identifiers may be reused from one of the existing knowledge bases (e.g. DBpedia).
- Discovering relations between columns, based on the evidence for those relations in the existing knowledge bases.
- Converting the data in the tables and the annotations produced in the previous steps to RDF, using proper data types, language tags, well-known Linked Data vocabularies (e.g. RDF Schema, DBpedia Ontology, ...) and other RDF-related tools and technologies.
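To make the last step more concrete, the following is a minimal sketch of turning annotated CSV rows into RDF, serialized as N-Triples. It is not Odalic's implementation: the sample table, the `SUBJECT_COLUMN` and `COLUMN_ANNOTATIONS` mappings and the function name `rows_to_ntriples` are assumptions made for illustration, and the annotations are written by hand rather than computed.

```python
import csv
import io

# Hypothetical, hand-written annotations standing in for the results of the
# classification, disambiguation and relation discovery steps.
SUBJECT_COLUMN = "city"
ENTITY_BASE = "http://dbpedia.org/resource/"
COLUMN_ANNOTATIONS = {
    # column name -> (predicate URI, literal datatype URI)
    "population": (
        "http://dbpedia.org/ontology/populationTotal",
        "http://www.w3.org/2001/XMLSchema#integer",
    ),
}
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def rows_to_ntriples(csv_text, language="en"):
    """Turn annotated CSV rows into a list of N-Triples lines.

    Simplification: cell values are assumed to need no escaping; real data
    would require literal escaping and percent-encoding of the identifiers.
    """
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row[SUBJECT_COLUMN]
        subject = f"<{ENTITY_BASE}{name.replace(' ', '_')}>"
        # Keep the original cell text as a language-tagged label.
        triples.append(f'{subject} <{RDFS_LABEL}> "{name}"@{language} .')
        # Emit one typed literal per annotated column.
        for column, (predicate, datatype) in COLUMN_ANNOTATIONS.items():
            triples.append(
                f'{subject} <{predicate}> "{row[column]}"^^<{datatype}> .'
            )
    return triples


SAMPLE = "city,population\nVienna,1897000\nGraz,291000\n"
NTRIPLES = rows_to_ntriples(SAMPLE)
```

Each row yields one label triple plus one typed triple per annotated column, so the two sample rows produce four lines that any RDF tool accepting N-Triples should be able to load.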
As a platform, Odalic guides its users through this process and makes it easy to reproduce, automate and customize. Building on the work of Ziqi Zhang and the prototype implementation of the TableMiner+ algorithm, Odalic turns it into a working, user-focused application and introduces several major extensions and improvements to the original idea, while shifting the focus toward already prepared CSV files to serve the needs of its parent project, ADEQUATe. The original TableMiner+ algorithm has established itself as a leading solution to the problem (see the original paper, which compares it with alternative and legacy methods and distances itself strictly from attempts to apply general NLP solutions, wrapper induction methods and other means not tailored to work on actual tables). A development version of Odalic was the subject of one of the workshops at the Semantics 2016 conference, where it was met with a positive response and genuine interest. In cooperation with Mr. Zhang, a broader user evaluation and feedback gathering is planned for the near future.
The Odalic platform consists of three key components:
Odalic Semantic Table Interpretation
- It is a server, deployable to Apache Tomcat as a web archive, accessible and controllable through the REST API specified here. This makes it possible to draw upon its resources (as demonstrated by the UnifiedViews plugin) in new ways, unforeseen by the authors.
- Its users can provide extensive feedback on the results of the automatic conversion, which our modification of the original algorithm takes into account during subsequent runs.
- Users can add their own custom resources and use them for feedback and in the exported data.
- It introduces the ability to employ multiple knowledge bases at once.
- It allows export of results conforming to the CSV on the Web draft specification or in popular RDF serialization formats, such as Turtle and JSON-LD.
- It supports running the conversions as independent tasks and managing them comfortably.
- It supports multiple users plus an administrator, employing token-based authentication and authorization that is friendly to further extensions.
- It includes the necessary management of local and remote CSV files.
- Task configuration is exportable in RDF for easier data provenance.
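The feedback mechanism mentioned in the list above can be pictured with a small sketch. This is only an illustration of the idea that user choices take precedence over automatic candidates; it is not how Odalic or TableMiner+ actually integrate feedback, and the function name `apply_feedback` and the data shapes are assumptions.

```python
def apply_feedback(candidates, feedback):
    """Pick one classification per column, letting user feedback win.

    candidates: {column: [(class_uri, score), ...]} from an automatic run.
    feedback:   {column: class_uri} chosen by the user (may be partial).
    """
    result = {}
    for column, scored in candidates.items():
        if column in feedback:
            # A user decision overrides whatever the automatic run proposed.
            result[column] = feedback[column]
        elif scored:
            # Otherwise keep the highest-scoring automatic candidate.
            result[column] = max(scored, key=lambda pair: pair[1])[0]
        else:
            result[column] = None
    # Columns the automatic run produced nothing for can still be annotated.
    for column, class_uri in feedback.items():
        result.setdefault(column, class_uri)
    return result


CANDIDATES = {
    "city": [
        ("http://dbpedia.org/ontology/Person", 0.61),
        ("http://dbpedia.org/ontology/City", 0.58),
    ],
    "country": [("http://dbpedia.org/ontology/Country", 0.90)],
}
# Hypothetical correction: the user marks the "city" column as cities.
FEEDBACK = {"city": "http://dbpedia.org/ontology/City"}
CHOSEN = apply_feedback(CANDIDATES, FEEDBACK)
```

In this sample the automatic run slightly prefers the wrong class for the `city` column, and the user's correction replaces it while the `country` column keeps its automatic result.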
Summary of major implemented improvements over the original TableMiner+ algorithm
- Queries resolving the table content are now, where possible, constrained to the appropriate types, thus eliminating evidently wrong results.
- The original TableMiner+ algorithm used the now deprecated Freebase knowledge base. It was replaced with support for the DBpedia family of knowledge bases, general support for knowledge bases accessible through SPARQL endpoints, and the ADEQUATe PoolParty.
- All of the resources and predicates used in the code searching the bases have been externalized to configuration files and are no longer hard-coded.
- The algorithm is now more robust and handles errors in communication with the knowledge bases gracefully.
- The algorithm now accepts constraints originating from feedback provided by the user on top of results from previous runs. This helps to fix mistakes in the automatic conversion and provides a way to introduce custom annotations made by the users into the exported data.
- The algorithm was modified to allow computation of results forming a data cube, which makes it much more usable for processing statistical data.
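The first improvement in the list above, constraining queries to appropriate types, might look roughly like the following when the knowledge base is reached through a SPARQL endpoint. This is a sketch under assumptions: the function name `build_lookup_query`, the use of `rdfs:label` for matching and the query shape are illustrative and are not taken from the Odalic code base or its configuration files.

```python
def build_lookup_query(label, type_uri=None, language="en", limit=10):
    """Build a SPARQL query that finds entities by label.

    When type_uri is given, candidates are restricted to instances of that
    class, which filters out entities that merely share the label.
    """
    # Escape backslashes and double quotes so the label is a valid literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "SELECT DISTINCT ?entity WHERE {",
        f'  ?entity rdfs:label "{escaped}"@{language} .',
    ]
    if type_uri is not None:
        # The type constraint: "a" is SPARQL shorthand for rdf:type.
        lines.append(f"  ?entity a <{type_uri}> .")
    lines.append("}")
    lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


UNCONSTRAINED = build_lookup_query("Vienna")
CONSTRAINED = build_lookup_query(
    "Vienna", type_uri="http://dbpedia.org/ontology/City"
)
```

The unconstrained query would match any resource labelled "Vienna", whereas the constrained one only returns resources typed as cities, which is the kind of evidently wrong result the constraint is meant to remove.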
Odalic UI
Odalic UI is a web application serving as a graphical user interface to the Odalic Semantic Table Interpretation backend, allowing its users to extract and export Linked Data from provided CSV files. Its key properties are the following:
- Pleasant, easy-to-use single-page user interface.
- The data and the computed annotations are presented in the form of interactive tables, which makes it easy to review the results, provide appropriate feedback and customize the output.
- Relations between columns are neatly visualized and modifiable in a dynamic graph component.
- Supports practically all of the server features, including separate user spaces and file and task management.
- Support for runtime management and configuration of proxies to the knowledge bases is provided as well.
Odalic UnifiedViews Plugin
The plugin is built atop the server API, making it easy to run the processing within UnifiedViews. UnifiedViews is a mature ETL tool specialized in processing data of any kind into Linked Data and in processing Linked Data itself. Its capabilities make it possible to plan and schedule the use of Odalic Semantic Table Interpretation in many intricate scenarios, combining its power with that of the other available plugins. Within UnifiedViews, the plugin appears as a so-called Data Processing Unit (DPU), whose instances can become part of arbitrarily complex virtual pipelines. Otherwise, the plugin follows a pattern similar to defining the processing in the Odalic UI: the user specifies the input (which can now be the output of other DPUs) and the configuration, and connects the Odalic DPU outputs (exported table annotations or even RDF) to the inputs of other DPUs. More can be found in the UnifiedViews DPU user documentation.