Phases
The main goal of the core algorithm is to take the tabular input data, interpret it as well as it can using the selected knowledge bases, and return an annotated table. This annotation process has several distinct phases:
- Subject Column Detection
The subject column candidate is determined from both the column headers and the cell values. The values in the subject column then represent the main entities; values from other columns are interpreted as properties of these main entities.
- Cells Disambiguation
Individual cell values are searched for in the available knowledge bases. The search results are then scored, and their types and properties are loaded for further processing. The result of this phase is a list of candidates for every table cell.
- Columns Classification
Classification determines candidates for columns based on the types of the disambiguated column cells. The desired result is a class that is a type of the majority of the column cells but is not too broad (e.g. when classifying a column of writers, we want to receive dbpedia:Writer and not dbpedia:Thing).
- Relations Enumeration
Relations between cells and columns are determined from the properties of the disambiguated cells.
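The trade-off described under Columns Classification (cover the majority of cells, but avoid overly broad classes) can be illustrated with a minimal sketch. This is not the scoring used by TableMiner+ or Odalic: the class ColumnClassSketch, the method pickClass and the numeric specificity scores are hypothetical, and the sketch only assumes that every disambiguated cell comes with a set of candidate types.

```java
import java.util.*;

// Hypothetical illustration only; not part of the Odalic code base.
class ColumnClassSketch {

    /**
     * Returns the most specific type that is still a type of more than half of the cells.
     *
     * @param cellTypes   one set of candidate types per disambiguated cell
     * @param specificity assumed score per type; a higher value means a narrower class
     */
    static Optional<String> pickClass(List<Set<String>> cellTypes,
                                      Map<String, Double> specificity) {
        // Count how many cells each type covers.
        Map<String, Integer> coverage = new HashMap<>();
        for (Set<String> types : cellTypes) {
            for (String type : types) {
                coverage.merge(type, 1, Integer::sum);
            }
        }

        String best = null;
        double bestSpecificity = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Integer> entry : coverage.entrySet()) {
            boolean coversMajority = entry.getValue() * 2 > cellTypes.size();
            double score = specificity.getOrDefault(entry.getKey(), 0.0);
            // Among the types covering a majority, prefer the narrowest one.
            if (coversMajority && score > bestSpecificity) {
                best = entry.getKey();
                bestSpecificity = score;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

With three cells typed {dbpedia:Writer, dbpedia:Person, dbpedia:Thing}, {dbpedia:Writer, dbpedia:Thing} and {dbpedia:Thing}, and dbpedia:Writer given the highest specificity, the sketch prefers dbpedia:Writer over dbpedia:Thing even though dbpedia:Thing covers every cell.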
A more in-depth description of the original algorithm, on which the core algorithm is based, can be found in the paper http://www.semantic-web-journal.net/content/effective-and-efficient-semantic-table-interpretation-using-tableminer-0 . The implementation of the algorithm is split into several modules.
- sti-main
The main part of the algorithm. Contains the implementation of the above-mentioned steps. The original TableMiner+ contained many experimental interpreter implementations. In Odalic, most of them were removed in favor of the TMP implementation, because each of them would have to be extended with new features such as user feedback, which would go beyond the scope of our project. Most of them are also shown to be inferior in the paper.
- sti-kbproxy
Provides an abstracted interface for communication with various KBs. Used by both the core algorithm and the Odalic server.
- sti-websearch
Handles scoring of results based on web search.
Main Module
The entry point of the algorithm is the uk.ac.shef.dcs.sti.core.algorithm.tmp.TMPOdalicInterpreter class. To initialize the interpreter successfully, it is necessary to obtain instances of the following classes.
- uk.ac.shef.dcs.sti.core.algorithm.tmp.TCellDisambiguator
Handles cell disambiguation.
- uk.ac.shef.dcs.sti.core.algorithm.tmp.TColumnClassifier
Handles column classification.
- uk.ac.shef.dcs.sti.core.algorithm.tmp.sampler.TContentCellRanker
Provides ranking of rows based on the number of non-empty cells. The currently used implementation is uk.ac.shef.dcs.sti.core.algorithm.tmp.sampler.OSPD_nonEmpty.
- uk.ac.shef.dcs.sti.core.algorithm.tmp.LEARNING
Performs preliminary disambiguation and classification on a sample of rows.
- uk.ac.shef.dcs.sti.core.algorithm.tmp.UPDATE
Updates result scores at the end of phase 3, after the classification is done.
- uk.ac.shef.dcs.sti.core.algorithm.tmp.TColumnColumnRelationEnumerator
Discovers relations between columns and cells.
- uk.ac.shef.dcs.sti.core.algorithm.tmp.LiteralColumnTagger
Used at the end of phase 4. Annotates any not yet annotated columns as data properties and tries to find relations with the subject column.
- uk.ac.shef.dcs.kbproxy
Provides access to the underlying knowledge bases.
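The idea behind ranking rows by their number of non-empty cells, as attributed to TContentCellRanker above, can be sketched as follows. This is an assumption-laden illustration and not the actual OSPD_nonEmpty code: the class NonEmptyRowRankerSketch and its methods are hypothetical, and rows are modelled as plain String arrays.

```java
import java.util.*;

// Hypothetical illustration only; not the actual OSPD_nonEmpty implementation.
class NonEmptyRowRankerSketch {

    /** Counts the cells that are neither null nor blank. */
    static int nonEmptyCount(String[] row) {
        int count = 0;
        for (String cell : row) {
            if (cell != null && !cell.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    /**
     * Returns row indices ordered from the most to the least populated row.
     * The sort is stable, so rows with equal counts keep their table order.
     */
    static List<Integer> rankRows(List<String[]> rows) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            indices.add(i);
        }
        indices.sort(
            Comparator.comparingInt((Integer i) -> nonEmptyCount(rows.get(i))).reversed());
        return indices;
    }
}
```

A sample of rows for the preliminary phase could then be taken from the front of the returned list, so that sparsely filled rows are processed last.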
KB Proxy Module
The KB Proxy module is used by both the Main module and the Odalic server module to search the configured KBs. The uk.ac.shef.dcs.kbproxy.KBProxy instances are created from Odalic configuration files using the uk.ac.shef.dcs.kbproxy.KBProxyFactory. The base class has a built-in Solr cache and provides methods for saving search results to the cache and retrieving them from it. Each KBProxy has its own Solr cache, identified by the KB name. There are currently two implementations of the uk.ac.shef.dcs.kbproxy.KBProxy.
- uk.ac.shef.dcs.kbproxy.sparql.SPARQLProxy
Generic implementation of KB search for SPARQL KBs.
- uk.ac.shef.dcs.kbproxy.sparql.DBpediaProxy
Specific implementation of KB search for DBpedia-type KBs. Extends the SPARQLProxy and currently only modifies the label retrieval methods.
The original TableMiner+ used a proxy class for Freebase. This was replaced by a more generic SPARQL proxy. Freebase is no longer supported, because its original public API is no longer available.
The uk.ac.shef.dcs.kbproxy.KBProxy has four main groups of public methods.
Core algorithm search
These methods are used by the core algorithm and do not throw any exceptions. To implement them, it is necessary to override the "*Internal" methods with the same names. Any potential errors are caught and returned as warnings for the user.
- findAttributesOfClazz
Returns a collection of attributes of the selected class.
- findAttributesOfEntities
Returns a collection of attributes of the selected entity.
- findAttributesOfProperty
Returns a collection of attributes of the selected property.
- findEntityCandidates
Used for the preliminary disambiguation, or for the main disambiguation when the preliminary disambiguation returned no types. Searches for candidates in the KB based on their label. The entities are returned with complete information about attributes and types.
- findEntityCandidatesOfTypes
Same as findEntityCandidates, except that the results are only of certain types. Used in disambiguation when the preliminary disambiguation returned some candidate types.
- findEntityClazzSimilarity
Evaluates the similarity between two classes.
- findGranularityOfClazz
Evaluates the granularity of a class.
- loadEntity
Loads a single entity from the KB with complete information about attributes and types.
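The split between non-throwing public methods and overridable "*Internal" methods can be sketched roughly as below. This is only an illustration of the pattern, not the real KBProxy code: the class SafeSearchSketch, the String-based result type and the getWarnings accessor are assumptions made for brevity, and only the names findEntityCandidates and findEntityCandidatesInternal follow the naming described above.

```java
import java.util.*;

// Hypothetical illustration of the pattern; the real KBProxy differs in types and details.
abstract class SafeSearchSketch {

    private final List<String> warnings = new ArrayList<>();

    /** Public entry point used by the algorithm: never throws. */
    public final List<String> findEntityCandidates(String label) {
        try {
            return findEntityCandidatesInternal(label);
        } catch (Exception e) {
            // The error is recorded as a warning instead of being propagated.
            warnings.add("findEntityCandidates failed for '" + label + "': " + e.getMessage());
            return Collections.emptyList();
        }
    }

    /** KB-specific implementation; may throw freely. */
    protected abstract List<String> findEntityCandidatesInternal(String label) throws Exception;

    /** Assumed accessor for the warnings collected so far. */
    public List<String> getWarnings() {
        return Collections.unmodifiableList(warnings);
    }
}
```

A concrete proxy then only implements the "Internal" variant, while callers can rely on always getting a (possibly empty) result and can surface the collected warnings to the user.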
User initiated Search
These methods are used by the Odalic server in the user search dialog.
- findPredicateByFulltext
Returns candidate entities (predicates) from the KB based on the supplied string value, domain and range.
- findResourceByFulltext
Returns candidate entities (resources) from the KB based on the supplied string value.
- findClassByFulltext
Returns candidate entities (classes) from the KB based on the supplied string value.
Proposals
- isInsertSupported
Indicates whether the knowledge base supports inserting new concepts.
- insertClass
Inserts a new class into the knowledge base.
- insertConcept
Inserts a new concept into the knowledge base.
- insertProperty
Inserts a new property type into the knowledge base.
Export
- getPropertyDomains
Returns the domains of the given resource.
- getPropertyRanges
Returns the ranges of the given resource.
SPARQL Proxy
The SPARQL Proxy implements the above-mentioned methods using Jena to generate the required SPARQL SELECT, INSERT and ASK requests. The original TableMiner+ created SPARQL queries by string concatenation; the current approach makes use of Jena "builder" classes and is both more readable and less error-prone. The fulltext search is implemented by querying the DBpedia fulltext catalogue through the "bif:contains" predicate. If the fulltext catalogue is not available, the proxy falls back on regex-based filters, which are somewhat slower.
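How a cell label turns into the argument of "bif:contains" can be sketched in plain Java. The sketch deliberately avoids any Jena dependency and builds only the search expression and its SPARQL string literal, whereas the actual proxy assembles whole queries with Jena builders. The class FulltextExpressionSketch and its methods are hypothetical, and the word-by-word AND form is inferred from the fulltext example query in this document.

```java
import java.util.StringJoiner;

// Hypothetical helper; the real proxy builds complete queries with Jena builder classes.
class FulltextExpressionSketch {

    /** Turns "Gardens of the Moon" into "Gardens" AND "of" AND "the" AND "Moon". */
    static String containsExpression(String label) {
        StringJoiner joiner = new StringJoiner(" AND ");
        for (String word : label.trim().split("\\s+")) {
            // Assumption: embedded double quotes are dropped to keep the expression valid.
            String cleaned = word.replace("\"", "");
            if (!cleaned.isEmpty()) {
                joiner.add("\"" + cleaned + "\"");
            }
        }
        return joiner.toString();
    }

    /** Wraps a value as a SPARQL string literal, escaping backslashes and double quotes. */
    static String toSparqlLiteral(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Builds a triple pattern such as: ?object <bif:contains> "..." */
    static String fulltextPattern(String variable, String label) {
        return variable + " <bif:contains> " + toSparqlLiteral(containsExpression(label));
    }
}
```

Calling fulltextPattern("?object", "Gardens of the Moon") yields a pattern of the same shape as the ones in the fulltext example query. A regex-based fallback would replace this pattern with a FILTER over the same variable; its exact form is not shown in this document and is therefore left out here.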
In user search, it is important to be able to return any resources found during the disambiguation. This is not always possible, because some results, such as types from column classification, may not have labels. It is also important to be able to find any recently proposed resources. For this reason, the user-initiated search always performs both an exact match query and a fulltext query. The exact match query usually returns fewer results, but it can find recently proposed resources that have not yet been added to the fulltext catalogue of the knowledge base.
The disambiguation of cells usually creates the following queries.
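Combining the two queries of the user-initiated search can be sketched as a simple ordered merge. The class UserSearchSketch, the functional parameters and the String-based results are assumptions for illustration; the real server works with richer entity objects and real KB queries.

```java
import java.util.*;
import java.util.function.Function;

// Hypothetical illustration of merging exact match and fulltext results.
class UserSearchSketch {

    /**
     * Runs both searches and merges the results without duplicates.
     * Exact matches come first, so recently proposed resources that are
     * missing from the fulltext catalogue are not lost.
     */
    static List<String> search(String text,
                               Function<String, List<String>> exactMatch,
                               Function<String, List<String>> fulltext) {
        // LinkedHashSet keeps insertion order and removes duplicates.
        Set<String> merged = new LinkedHashSet<>(exactMatch.apply(text));
        merged.addAll(fulltext.apply(text));
        return new ArrayList<>(merged);
    }
}
```

Placing exact matches first is a design choice of this sketch, not something the source prescribes; the essential point is that a resource found by either query ends up in the result exactly once.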
Exact match query by label
Example
PREFIX geonames: <http://www.geonames.org/ontology#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
... # Other prefixes left out.
SELECT DISTINCT ?subject WHERE {
  {
    SELECT DISTINCT ?subject WHERE {
      { ?subject foaf:name "Gardens of the Moon"@en }
      UNION
      { ?subject dbpprop:fullname "Gardens of the Moon"@en }
      UNION
      { ?subject rdfs:label "Gardens of the Moon"@en }
      UNION
      { ?subject dbpprop:name "Gardens of the Moon"@en }
      UNION
      { ?subject dbpedia-owl:originalTitle "Gardens of the Moon"@en }
    }
  }
  ?subject rdf:type ?class
}
Fulltext query by label
Example
# Prefixes left out.
SELECT DISTINCT ?subject ?object WHERE {
  {
    SELECT DISTINCT ?subject ?object WHERE {
      { ?subject foaf:name ?object .
        ?object <bif:contains> "\"Gardens\" AND \"of\" AND \"the\" AND \"Moon\"" }
      UNION
      { ?subject dbpprop:fullname ?object .
        ?object <bif:contains> "\"Gardens\" AND \"of\" AND \"the\" AND \"Moon\"" }
      UNION
      { ?subject rdfs:label ?object .
        ?object <bif:contains> "\"Gardens\" AND \"of\" AND \"the\" AND \"Moon\"" }
      UNION
      { ?subject dbpprop:name ?object .
        ?object <bif:contains> "\"Gardens\" AND \"of\" AND \"the\" AND \"Moon\"" }
      UNION
      { ?subject dbpedia-owl:originalTitle ?object .
        ?object <bif:contains> "\"Gardens\" AND \"of\" AND \"the\" AND \"Moon\"" }
    }
  }
  ?subject rdf:type ?class
}
Query attributes for every result
Example
# Prefixes left out.
SELECT DISTINCT ?predicate ?object WHERE {
  dbpedia:Gardens_of_the_Moon ?predicate ?object
}
Query label for every attribute
Example
# Prefixes left out.
SELECT DISTINCT ?object WHERE {
  { dbpedia-owl:Book foaf:name ?object }
  UNION
  { dbpedia-owl:Book dbpprop:fullname ?object }
  UNION
  { dbpedia-owl:Book rdfs:label ?object }
  UNION
  { dbpedia-owl:Book dbpprop:name ?object }
  UNION
  { dbpedia-owl:Book dbpedia-owl:originalTitle ?object }
}
Websearch Module
Provides functionality for searching for concepts on the web. Used for scoring results. Formerly implemented using the Bing Web Search API, now renamed to Microsoft Cognitive Services.