Motivation
The result of the execution is determined by the following factors:
- content of the processed file
- parsing format assigned to the file
  - e.g. changing the delimiter can change how many columns the CSV file has
- available knowledge bases
  - e.g. when a resource is missing from a base, the algorithm cannot use it to annotate any part of the table
- task configuration, which consists of:
  - task description
  - the input file
  - provided feedback, which the server uses as constraints for the next algorithm run
  - set of bases that the user selected to run the processing against
  - chosen primary base
  - specified maximum number of rows that will be processed from the file
  - whether to treat the input as statistical data (which ultimately results in the export of an RDF data cube)
It is impractical to transfer whole knowledge bases, all the more so because they are usually accessed remotely. The definitions of their proxies, on the other hand, are small and easy to transfer. Files are also easy to move from one machine to another, since the remote file location can simply be shared (the parsing format is not shared with it, but it hardly ever changes, so setting it up again on another machine is not a burden). What remains to be solved is the export and import of the task configuration.
In the current state, the task configuration also includes the definitions of the base proxies it uses. When a base proxy of the same name is already present, it is used as it is; if not, the serialized definition is used to create the base proxy first. Base proxies can also be exported and imported independently of the tasks.
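The proxy resolution rule above can be illustrated with a minimal sketch. The class, field and method names below are hypothetical and do not come from the Odalic code base; a proxy is reduced to a name and an endpoint purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: all names here are illustrative, not Odalic's actual classes.
final class ProxyImportSketch {

  /** Stand-in for a knowledge base proxy definition: a name plus an endpoint. */
  static final class ProxyDefinition {
    final String name;
    final String endpoint;

    ProxyDefinition(String name, String endpoint) {
      this.name = name;
      this.endpoint = endpoint;
    }
  }

  /** Proxies already registered on the importing machine, keyed by name. */
  private final Map<String, ProxyDefinition> registered = new LinkedHashMap<>();

  void register(ProxyDefinition proxy) {
    registered.put(proxy.name, proxy);
  }

  /**
   * Resolves a proxy referenced by an imported task: an existing proxy of the
   * same name is returned unchanged; otherwise the serialized definition is
   * registered first and then returned.
   */
  ProxyDefinition resolve(ProxyDefinition serialized) {
    return registered.computeIfAbsent(serialized.name, name -> serialized);
  }
}
```

Matching by name alone means that a locally defined proxy silently takes precedence over the imported definition, even when the two differ in other settings.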
Tasks also have other properties: owning user, task ID, time of creation and modification. These are ephemeral, however, and not worth keeping in a configuration meant to be transferred from machine to machine.
Format choice
There are almost too many ways to encode the configurations. Serialized RDF is an appealing option, not only because of the subject matter of the project, but also because it makes it easy to accompany the processing results with their provenance. What makes the conversion to RDF a challenge is the relatively large and diverse class hierarchy of the task configuration, all of which must be turned into RDF statements. It would be possible to write code that constructs the RDF model manually, but annotating the involved classes and employing a framework to construct the statements automatically (as is done when mapping the domain classes to JSON through JAXB annotations for the REST API) appeared to be the better choice.
There are a few options, such as Alibaba and Empire, but these focus on storing Java objects in RDF stores and are therefore too heavy for a simple round-trip conversion. There is also an older library, http://rdfbeans.sourceforge.net/, which appears to be abandoned. Ultimately, the Pinto library was chosen. Following the practice established for converting domain objects to JSON for the REST API, a separate package cz.cuni.mff.xrg.odalic.api.rdf.values was created to hold the mapped versions of the objects, and these versions were annotated with Pinto annotations. Because Pinto lacks a counterpart of JAXB's XmlJavaTypeAdapter, the mapped versions of objects must refer to other mapped versions, and methods converting the value objects back to the domain ones have to be provided.
A much larger complication appeared when converting Java Maps whose types are more complex than Strings. Pinto did not handle these cases well, so the maps had to be converted to collections of key-value entries first. Apart from that, the solution implemented in cz.cuni.mff.xrg.odalic.api.rdf.TurtleRdfMappingTaskSerializationService and in cz.cuni.mff.xrg.odalic.api.rdf.TurtleRdfMappingknowledgeBaseSerializationService proved to be reliable, even for complicated configurations involving extensive user feedback.
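The map workaround described above can be sketched in plain Java. This is only an illustration of the idea, not the code used in Odalic: the class MapEntriesSketch, its Entry type and both method names are hypothetical, and the Pinto annotations that the real mapped classes carry are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: class and field names are illustrative, not Odalic's own.
final class MapEntriesSketch {

  /** A flat key-value pair that a bean-style mapper can treat as a plain object. */
  static final class Entry<K, V> {
    final K key;
    final V value;

    Entry(K key, V value) {
      this.key = key;
      this.value = value;
    }
  }

  /** Flattens a map into a list of entries, following the map's iteration order. */
  static <K, V> List<Entry<K, V>> toEntries(Map<K, V> map) {
    List<Entry<K, V>> entries = new ArrayList<>();
    for (Map.Entry<K, V> e : map.entrySet()) {
      entries.add(new Entry<>(e.getKey(), e.getValue()));
    }
    return entries;
  }

  /** Rebuilds the map on import; a repeated key is rejected as invalid input. */
  static <K, V> Map<K, V> toMap(List<Entry<K, V>> entries) {
    Map<K, V> map = new LinkedHashMap<>();
    for (Entry<K, V> e : entries) {
      if (map.containsKey(e.key)) {
        throw new IllegalArgumentException("Duplicate key: " + e.key);
      }
      map.put(e.key, e.value);
    }
    return map;
  }
}
```

Because each entry is an ordinary object with two fields, it can be serialized like any other mapped value, and the original map is recovered on import by calling toMap on the deserialized entries.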
Format specification
Every exported task gets a generated unique identifier, which appears in the serialized configuration in a triple of the form:
<http://odalic.eu/odalic/SerializedTask/V5/246c095d-f89b-4151-962b-34bd25b02843> a <http://odalic.eu/internal/Task>
The subject URI consists of three main parts: the web address of the application instance (odalic.eu in this case), the version identifier (currently the fifth version) and a UUID. The subject is of type http://odalic.eu/internal/Task, and all the other exported properties are linked to it through further RDF statements. Properties such as http://odalic.eu/internal/Task/configuration follow a common pattern: every property starts with the prefix http://odalic.eu/internal, followed by the name of the class the property belongs to and the name of the property, as derived from the properties of the objects exchanged through the REST API.
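The naming pattern can be made concrete with a small sketch. The helper class and its method names are hypothetical, and the fixed path segment /odalic/SerializedTask/ is an assumption read off the example triple above rather than a documented rule.

```java
import java.util.UUID;

// Hypothetical helper: the URI layout follows the text, the names do not come from Odalic.
final class SerializedUriSketch {

  /** Prefix shared by all type and property URIs. */
  static final String INTERNAL_PREFIX = "http://odalic.eu/internal/";

  /** Subject URI: instance address, format version and a UUID (path segment assumed fixed). */
  static String taskSubject(String instanceHost, int version, UUID id) {
    return "http://" + instanceHost + "/odalic/SerializedTask/V" + version + "/" + id;
  }

  /** Property URI: shared prefix, name of the owning class, name of the property. */
  static String property(String className, String propertyName) {
    return INTERNAL_PREFIX + className + "/" + propertyName;
  }
}
```

Under these assumptions, taskSubject("odalic.eu", 5, UUID.fromString("246c095d-f89b-4151-962b-34bd25b02843")) reproduces the subject of the example triple, and property("Task", "configuration") yields http://odalic.eu/internal/Task/configuration.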
Knowledge base proxies follow the same schema, with Task replaced by KnowledgeBase. The underlying library does not map Java collections to RDF collections, but instead uses one-off (but not anonymous) nodes that form the connecting links. These nodes share the prefix http://odalic.eu/odalic/SerializedTask/Node/, followed by a UUID. They are also used to represent contained entities (which removes the need to manually create a unique identifier for each "pointer"). The only exception is maps, which are converted to a set of entries before mapping (for example in the case of a map from base name to annotation candidates).
A typical fragment of an exported configuration, which illustrates the above-mentioned peculiarities of the format, looks like this:
<http://odalic.eu/odalic/SerializedTask/V5/29eac34b-de44-4328-8d54-ad97799384f7> a <http://odalic.eu/internal/Task> ;
  <http://odalic.eu/internal/Task/configuration> <http://odalic.eu/odalic/SerializedTask/Node/8a0734da21ccdc62ad71f2483c8519d8> .
<http://odalic.eu/odalic/SerializedTask/Node/f1188357b6a2c82746d36af42dd04cb3> a <http://odalic.eu/internal/Entity> .
<http://odalic.eu/odalic/SerializedTask/Node/6ea67dc770af060bc491dbb8385bc867> <http://odalic.eu/internal/Entity/resource> "http://dbpedia.org/dbtax/Surname" .
<http://odalic.eu/odalic/SerializedTask/Node/6532423b372f99780ed96f118409c1d1> <http://odalic.eu/internal/EntityCandidate/score> <http://odalic.eu/odalic/SerializedTask/Node/928ec89c7f9e3d7a72590c219e1cb611> .
<http://odalic.eu/odalic/SerializedTask/Node/f501ceaf6326453502a757e2a22ddcfa> <http://odalic.eu/internal/KnowledgeBase/insertGraph> "http://odalic.eu" .
<http://odalic.eu/odalic/SerializedTask/Node/59d958855ae15086db58fea03a825b6f> a <http://odalic.eu/internal/Entity> .
<http://odalic.eu/odalic/SerializedTask/Node/4b072229a26e9e103703e060b1133020> <http://odalic.eu/internal/EntityCandidate/entity> <http://odalic.eu/odalic/SerializedTask/Node/156ae04d9c18edca709e51a644076014> .
<http://odalic.eu/odalic/SerializedTask/Node/156ae04d9c18edca709e51a644076014> <http://odalic.eu/internal/Entity/label> "Person" .
<http://odalic.eu/odalic/SerializedTask/Node/fdcd19e5f18a7469262778efed4bc633> <http://odalic.eu/internal/EntityCandidateNavigableSetWrapper/value> <http://odalic.eu/odalic/SerializedTask/Node/239164724ee31eea35337b2d0a83a21b> .
<http://odalic.eu/odalic/SerializedTask/Node/75734a569feba5ad18339a2dbfa7cab7> <http://odalic.eu/internal/KnowledgeBaseEntityCandidateNavigableSetEntry/base> "DBpedia" .
<http://odalic.eu/odalic/SerializedTask/Node/6888029707e6588411dbbe33f195597a> <http://odalic.eu/odalic/SerializedTask/Node/value> 3.0438887148351625E0 .
...