ADEQUATE : Core algorithm description

Phases

The main goal of the core algorithm is to take the tabular input data, interpret it as well as it can using the selected knowledge bases, and return an annotated table. This annotation process has several distinct phases:

  1. Subject Column Detection
    The subject column candidate is determined from both the column headers and the cell values. The values in the subject column then represent the main entities, while values from other columns are interpreted as properties of these main entities.
  2. Cells Disambiguation
    Individual cell values are searched in available knowledge bases. Search results are then scored and their types and properties are loaded for further processing. The result of this phase is a list of candidates for every table cell.
  3. Columns Classification
    Classification determines candidates for columns based on the types of the disambiguated column cells. The desired result is a class that is a type of the majority of the column cells but is not too broad (e.g. when classifying a column of writers, we want to receive dbpedia:Writer and not dbpedia:Thing).
  4. Relations Enumeration
    Relations between cells and columns are determined from properties of disambiguated cells.
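
The trade-off described for phase 3 can be illustrated with a minimal sketch. It is not the TableMiner+ scoring: the simple majority threshold, the depth map (hierarchy depth used as a stand-in for specificity) and all names below are assumptions made for illustration only.

```java
import java.util.*;

class ColumnClassSketch {
    // Picks the most specific class that is a type of more than half of the cells.
    // cellTypes: one set of type names per disambiguated cell.
    // depth: hypothetical depth of each class in the hierarchy (larger = more specific).
    static Optional<String> pick(List<Set<String>> cellTypes, Map<String, Integer> depth) {
        // Count how many cells carry each type.
        Map<String, Integer> support = new HashMap<>();
        for (Set<String> types : cellTypes) {
            for (String type : types) {
                support.merge(type, 1, Integer::sum);
            }
        }
        String best = null;
        for (Map.Entry<String, Integer> e : support.entrySet()) {
            boolean majority = e.getValue() * 2 > cellTypes.size();
            if (!majority) {
                continue; // too few cells share this type
            }
            if (best == null
                    || depth.getOrDefault(e.getKey(), 0) > depth.getOrDefault(best, 0)) {
                best = e.getKey();
            }
        }
        return Optional.ofNullable(best);
    }
}
```

With three cells typed as Thing and Person, two of which are also typed as Writer, the sketch prefers dbpedia:Writer, because it still covers the majority while being deeper in the assumed hierarchy.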

A more in-depth description of the original algorithm, on which the core algorithm is based, can be found in the paper http://www.semantic-web-journal.net/content/effective-and-efficient-semantic-table-interpretation-using-tableminer-0 . The implementation of the algorithm is split into several modules.

  • sti-main
    The main part of the algorithm. Contains the implementation of the steps described above. The original TableMiner+ contained many experimental interpreter implementations. In Odalic, most of them were removed in favor of the TMP implementation, because each of them would have to be extended with new features such as user feedback, which would go beyond the scope of our project. The paper also shows most of them to be inferior.
  • sti-kbproxy
    Provides an abstracted interface for communication with various KBs. Used both by the core algorithm and Odalic server.
  • sti-websearch
    Handles scoring of results based on web search.

Main Module

The entry point for the algorithm is the uk.ac.shef.dcs.sti.core.algorithm.tmp.TMPOdalicInterpreter class. To initialize the interpreter successfully, it is necessary to obtain instances of the following classes.

  • uk.ac.shef.dcs.sti.core.algorithm.tmp.TCellDisambiguator
    Handles cell disambiguation.
  • uk.ac.shef.dcs.sti.core.algorithm.tmp.TColumnClassifier
    Handles columns classification.
  • uk.ac.shef.dcs.sti.core.algorithm.tmp.sampler.TContentCellRanker
    Provides ranking of rows based on the number of non-empty cells. The currently used implementation is uk.ac.shef.dcs.sti.core.algorithm.tmp.sampler.OSPD_nonEmpty.
  • uk.ac.shef.dcs.sti.core.algorithm.tmp.LEARNING
    Performs preliminary disambiguation and classification on a sample of rows.
  • uk.ac.shef.dcs.sti.core.algorithm.tmp.UPDATE
    Updates result scores at the end of phase 3, after the classification is done.
  • uk.ac.shef.dcs.sti.core.algorithm.tmp.TColumnColumnRelationEnumerator
    Discovers relations between columns and cells.
  • uk.ac.shef.dcs.sti.core.algorithm.tmp.LiteralColumnTagger
    Used at the end of phase 4. Annotates any not yet annotated columns as data properties and tries to find relations with the subject column.
  • uk.ac.shef.dcs.kbproxy
    Provides access to the underlying knowledge bases.
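
The row ranking performed by the TContentCellRanker can be pictured with the following sketch. It is an illustration of ranking by non-empty cell count under simple assumptions (rows as lists of strings, blank or null cells counted as empty), not the code of OSPD_nonEmpty; the class and method names are hypothetical.

```java
import java.util.*;

class RowRankSketch {
    // Returns row indexes ordered by the number of non-empty cells, highest first.
    static List<Integer> rank(List<List<String>> rows) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            order.add(i);
        }
        // List.sort is stable, so rows with equal counts keep their original order.
        order.sort(Comparator.comparingInt((Integer i) -> -nonEmpty(rows.get(i))));
        return order;
    }

    // Counts cells that are neither null nor blank.
    static int nonEmpty(List<String> row) {
        int count = 0;
        for (String cell : row) {
            if (cell != null && !cell.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }
}
```

Rows with more filled cells come first, which makes them natural candidates for the sample used in preliminary disambiguation and classification.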

KB Proxy Module

The KB Proxy module is used by both the Main module and the Odalic server module to search the configured KBs. The uk.ac.shef.dcs.kbproxy.KBProxy instances are created from Odalic configuration files using the uk.ac.shef.dcs.kbproxy.KBProxyFactory. The base class has a built-in Solr cache and provides methods for saving search results to the cache and retrieving them from it. Each KBProxy has its own Solr cache, identified by the KB name. There are currently two implementations of the uk.ac.shef.dcs.kbproxy.KBProxy.

  • uk.ac.shef.dcs.kbproxy.sparql.SPARQLProxy
    Generic implementation of KB search for SPARQL KBs.
  • uk.ac.shef.dcs.kbproxy.sparql.DBpediaProxy
    Specific implementation of KB Search for DBpedia type of KBs. Extends the SPARQLProxy and currently has only modified label retrieval methods.

The original TableMiner+ used a proxy class for Freebase. This was replaced by a more generic SPARQL proxy. Freebase is no longer supported, because its original public API is no longer available.

The uk.ac.shef.dcs.kbproxy.KBProxy has four main groups of public methods.

Core algorithm search

These methods are used by the core algorithm and do not throw any exceptions. To implement them, it is necessary to override the "*Internal" methods with the same names. Any potential errors are caught and returned as warnings for the user.

  • findAttributesOfClazz
    Returns a collection of attributes of the selected class.
  • findAttributesOfEntities
    Returns a collection of attributes of the selected entity.
  • findAttributesOfProperty
    Returns a collection of attributes of the selected property.
  • findEntityCandidates
    Method used for the preliminary disambiguation or for main disambiguation when the preliminary disambiguation returned no types. Searches for candidates in the KB based on their label. The entities are returned with complete information about attributes and types.
  • findEntityCandidatesOfTypes
    Same as findEntityCandidates with the difference that results are only of certain types. Used in disambiguation when preliminary disambiguation returned some candidate types.
  • findEntityClazzSimilarity
    Evaluates similarity between two classes.
  • findGranularityOfClazz
    Evaluates granularity of a class.
  • loadEntity
    Loads single entity from the KB with complete information about attributes and types.
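
The convention described for the core algorithm search methods (a public method that never throws, delegating to an overridable "*Internal" method) can be sketched as follows. The simplified signature returning a list of strings, the warnings list and the class name are assumptions for illustration and do not reproduce the actual KBProxy API.

```java
import java.util.*;

abstract class ProxySketch {
    // Collected problems, to be shown to the user as warnings.
    final List<String> warnings = new ArrayList<>();

    // Public method used by the core algorithm: it never throws.
    final List<String> findEntityCandidates(String label) {
        try {
            return findEntityCandidatesInternal(label);
        } catch (Exception e) {
            // Any error becomes a warning and an empty result.
            warnings.add("findEntityCandidates failed: " + e.getMessage());
            return Collections.emptyList();
        }
    }

    // Concrete proxies override this method and are free to throw.
    protected abstract List<String> findEntityCandidatesInternal(String label)
            throws Exception;
}
```

A concrete proxy then only needs to implement the "*Internal" variant, while callers in the algorithm can rely on always receiving a result.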

User initiated Search

These methods are used by the Odalic server in the user search dialog.

  • findPredicateByFulltext
    Returns candidate entities (predicates) from the KB based on supplied string value, domain and range.
  • findResourceByFulltext
    Returns candidate entities (resources) from the KB based on supplied string value.
  • findClassByFulltext
    Returns candidate entities (classes) from the KB based on supplied string value.

Proposals

  • isInsertSupported
    Information about whether the knowledge base supports inserting new concepts.
  • insertClass
    Inserts a new class into the knowledge base.
  • insertConcept
    Inserts a new concept into the knowledge base.
  • insertProperty
    Inserts a new property type into the knowledge base.

Export

  • getPropertyDomains
    Returns domain of the given resource.
  • getPropertyRanges
    Returns range properties of the given resource.

SPARQL Proxy

The SPARQL Proxy implements the methods described above using Jena to generate the required SPARQL SELECT, INSERT and ASK requests. The original TableMiner+ created SPARQL queries by string concatenation; the current approach uses the Jena "builder" classes and is both more readable and less error-prone. The fulltext search is implemented by querying the DBpedia fulltext catalogue through the "bif:contains" predicate. If the fulltext catalogue is not available, the proxy falls back on regex-based filters, which are somewhat slower.
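
As an illustration of the fulltext search, the sketch below builds the kind of expression that is passed to "bif:contains" for a label: every word is quoted and the words are joined with AND. It is a plain-Java sketch without Jena, and dropping embedded double quotes is an assumption, not necessarily what the proxy does; the class and method names are hypothetical.

```java
class FulltextSketch {
    // Builds a full-text expression in which each word of the label is
    // quoted and all words are joined with AND.
    static String containsExpression(String label) {
        StringBuilder sb = new StringBuilder();
        for (String word : label.trim().split("\\s+")) {
            // Assumption: embedded double quotes are simply dropped.
            String cleaned = word.replace("\"", "");
            if (cleaned.isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(" AND ");
            }
            sb.append('"').append(cleaned).append('"');
        }
        return sb.toString();
    }
}
```

For the label Gardens of the Moon this yields "Gardens" AND "of" AND "the" AND "Moon". Inside a SPARQL string literal the inner quotes then need to be escaped with backslashes.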

In user search, it is important to be able to return any resources found during the disambiguation. This is not always possible, because some results, like types from column classification, may not have labels. It is also important to be able to find any recently proposed resources. For this reason, the user-initiated search always performs both an exact match query and a fulltext query. The exact match query usually returns fewer results, but can find recently proposed resources that have not yet been added to the fulltext catalogue of the knowledge base.
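
Combining the two result sets of a user-initiated search could look like the sketch below. The text above only states that both queries are performed; listing exact matches first, removing duplicates and representing results as plain strings are assumptions for illustration, and the names are hypothetical.

```java
import java.util.*;

class UserSearchSketch {
    // Merges exact-match and fulltext results, keeping exact matches first
    // and dropping resources that appear in both lists.
    static List<String> merge(List<String> exact, List<String> fulltext) {
        // LinkedHashSet keeps insertion order and ignores repeated entries.
        Set<String> merged = new LinkedHashSet<>(exact);
        merged.addAll(fulltext);
        return new ArrayList<>(merged);
    }
}
```

In this form, a recently proposed resource that only the exact match query finds still appears in the merged list.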

The disambiguation of cells usually creates the following queries.

  1. Exact match query by label

    Example
    PREFIX  geonames: <http://www.geonames.org/ontology#>
    PREFIX  owl:  <http://www.w3.org/2002/07/owl#>
    PREFIX  skos: <http://www.w3.org/2004/02/skos/core#>
    ... # Other prefixes left out.
    
    SELECT DISTINCT  ?subject
    WHERE
      { { SELECT DISTINCT  ?subject
          WHERE
            {   { ?subject  foaf:name  "Gardens of the Moon"@en }
              UNION
                { ?subject  dbpprop:fullname  "Gardens of the Moon"@en }
              UNION
                { ?subject  rdfs:label  "Gardens of the Moon"@en }
              UNION
                { ?subject  dbpprop:name  "Gardens of the Moon"@en }
              UNION
                { ?subject  dbpedia-owl:originalTitle  "Gardens of the Moon"@en }
            }
        }
        ?subject  rdf:type  ?class
      }
  2. Fulltext query by label

    Example
    # Prefixes left out.
    
    SELECT DISTINCT  ?subject ?object
    WHERE
      { { SELECT DISTINCT  ?subject ?object
          WHERE
            {   { ?subject  foaf:name       ?object .
                  ?object   <bif:contains>  "\"Gardens\" AND \"of\" AND \"the\" AND \"Moon\""
                }
              UNION
                { ?subject  dbpprop:fullname  ?object .
                  ?object   <bif:contains>    "\"Gardens\" AND \"of\" AND \"the\" AND \"Moon\""
                }
              UNION
                { ?subject  rdfs:label      ?object .
                  ?object   <bif:contains>  "\"Gardens\" AND \"of\" AND \"the\" AND \"Moon\""
                }
              UNION
                { ?subject  dbpprop:name    ?object .
                  ?object   <bif:contains>  "\"Gardens\" AND \"of\" AND \"the\" AND \"Moon\""
                }
              UNION
                { ?subject  dbpedia-owl:originalTitle  ?object .
                  ?object   <bif:contains>        "\"Gardens\" AND \"of\" AND \"the\" AND \"Moon\""
                }
            }
        }
        ?subject  rdf:type  ?class
      }
  3. Query attributes for every result

    Example
    # Prefixes left out.
    
    SELECT DISTINCT  ?predicate ?object
    WHERE
      { dbpedia:Gardens_of_the_Moon
                  ?predicate  ?object
      }
  4. Query label for every attribute

    Example
    # Prefixes left out.
    
    SELECT DISTINCT  ?object
    WHERE
      {   { dbpedia-owl:Book
                      foaf:name  ?object
          }
        UNION
          { dbpedia-owl:Book
                      dbpprop:fullname  ?object
          }
        UNION
          { dbpedia-owl:Book
                      rdfs:label  ?object
          }
        UNION
          { dbpedia-owl:Book
                      dbpprop:name  ?object
          }
        UNION
          { dbpedia-owl:Book
                      dbpedia-owl:originalTitle  ?object
          }
      }

Websearch Module

Provides functionality for searching for concepts on the web. Used for scoring results. Implemented using the Bing Web Search API, which is now offered as part of Microsoft Cognitive Services.