ADEQUATE: Possible extensions and improvements

Algorithm

Learning from the user feedback

In general, the algorithm should be able to learn from the feedback provided by users:

  • When a certain classification is marked as wrong for file X, column C, the algorithm can learn from that and in the future penalize such a classification for documents with the same or a similar structure (see the sketch after this list).
  • The same applies to a chosen alternative, which could be prioritized under similar conditions.
  • Use the already executed classifications/disambiguations/relation discoveries as a training set for supervised learning and try to deduce classifications/disambiguations/relation discoveries for the remaining cases based on that.
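
A minimal sketch of how such feedback could be stored and re-applied to structurally similar tables; the FeedbackStore class, the header-tuple "fingerprint" and the concrete score adjustments are illustrative assumptions, not the current implementation:

```python
from collections import defaultdict

class FeedbackStore:
    """Hypothetical store of user feedback keyed by table structure.

    A 'fingerprint' here is simply the tuple of column headers; a real
    implementation would use a more robust structural similarity measure.
    """

    def __init__(self):
        # (fingerprint, column index) -> {class URI: score adjustment}
        self.adjustments = defaultdict(lambda: defaultdict(float))

    def mark_wrong(self, fingerprint, column, class_uri, penalty=0.5):
        # Penalize a classification the user marked as wrong.
        self.adjustments[(fingerprint, column)][class_uri] -= penalty

    def mark_preferred(self, fingerprint, column, class_uri, bonus=0.5):
        # Boost an alternative the user chose instead.
        self.adjustments[(fingerprint, column)][class_uri] += bonus

    def rerank(self, fingerprint, column, candidates):
        # Re-rank candidate (class URI, score) pairs for a similar table.
        adj = self.adjustments[(fingerprint, column)]
        return sorted(((uri, score + adj[uri]) for uri, score in candidates),
                      key=lambda pair: pair[1], reverse=True)

store = FeedbackStore()
fp = ("Country", "Population", "Area")
store.mark_wrong(fp, 0, "http://dbpedia.org/ontology/Place")
store.mark_preferred(fp, 0, "http://dbpedia.org/ontology/Country")
print(store.rerank(fp, 0, [("http://dbpedia.org/ontology/Place", 0.9),
                           ("http://dbpedia.org/ontology/Country", 0.8)]))
```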

Different kinds of feedback

Apart from the feedback where the user essentially overrules the algorithm, allow the user to provide negative feedback, marking some resources as undesirable while letting the algorithm try alternatives.
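
A sketch of how the two kinds of feedback could be told apart; the FeedbackKind values and the flat candidate list are assumptions made for illustration:

```python
from enum import Enum

class FeedbackKind(Enum):
    OVERRULE = "overrule"  # the user fixes the result; the algorithm must comply
    AVOID = "avoid"        # the user vetoes a resource; the algorithm retries

def apply_feedback(candidates, feedback):
    """Pick a resource from the ranked candidates according to user feedback.

    `candidates` is a list of resource URIs ranked by the algorithm;
    `feedback` maps a URI to a FeedbackKind.
    """
    for uri, kind in feedback.items():
        if kind is FeedbackKind.OVERRULE:
            return uri  # the user's explicit choice wins
    vetoed = {uri for uri, kind in feedback.items() if kind is FeedbackKind.AVOID}
    for uri in candidates:
        if uri not in vetoed:
            return uri  # best-ranked alternative the user has not rejected
    return None  # everything vetoed: report no result
```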

General performance

  • Find a better balance between the amount of context taken into account and the number of queries.
  • Reduce the usage of web search API.
  • Enable parallel processing of interpreters and cooperative interruption of running tasks (see the sketch after this list).
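
A sketch of cooperative interruption using Python's standard concurrency primitives; the chunked loop is an assumption about where an interpreter could expose safe interruption points:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, wait

def run_interpreter(name, cancel):
    # A long-running interpreter that checks the flag between work units.
    for chunk in range(1000):
        if cancel.is_set():
            return f"{name}: cancelled at chunk {chunk}"
        time.sleep(0.001)  # stands in for processing one chunk of the table
    return f"{name}: finished"

cancel = threading.Event()
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_interpreter, f"interpreter-{i}", cancel)
               for i in range(4)]
    time.sleep(0.1)
    cancel.set()  # a user action (e.g. closing the task) would trigger this
    done, _ = wait(futures)
    for future in done:
        print(future.result())
```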

Relations discovery

  • Try to suggest properties whose domain equals the concept classifying the subject column (see the domain query sketch after this list).
    • It should also look into the relations already associated with entities classified by the same class, e.g. if the algorithm knew that a column is classified as Country, it should try to map existing properties with the domain Country to the columns with relation values; in other words, it should try to find a column containing population or area (properties which have Country as their domain).
    • The same for ranges.
  • Integrate relations discovery in a better way, e.g. so that the found relations (and their domains/ranges) influence the classification and may cause the selection of a different class and consequently different disambiguations.
  • For statistical data, the algorithm does not run relation discovery.
    • A modified version of relation discovery could nevertheless be executed: it could search the range types and use the fact that every such predicate is a qb:DimensionProperty or a qb:MeasureProperty.
  • Take into account the distribution of the values when looking for the property (see the value-range sketch after this list).
    • E.g. if there is a column containing values such as "1.8", "2.2" and "2.0", it is probably not the weight of a person, but rather his or her height.
  • Take into account recommendations for relations based on the structural similarity between processed files.
    • For example, if file A contains relations X, Y and Z and is similar (in terms of its structure) to file B, which contains relations X and Y, it is probable that file B also contains relation Z, and such a relation should be suggested.
  • Take into account recommendations for relations based on the fact that certain relations typically occur next to each other (see the co-occurrence sketch after this list). E.g. when there are properties foaf:firstName and foaf:age, then there is probably also a property foaf:surname.
  • The algorithm may also take into account that the subject column in the CSV file may be an object of some triple, not just the subject, so it makes sense to also look for inverse relations (see the inverse query sketch after this list).
  • The algorithm selects the best matching relation not just by comparing the cell values to the objects of triples in the knowledge bases, but also by comparing the CSV column title to the name/URI of the candidate predicate in the given knowledge base. Nevertheless, when two knowledge bases give evidence for a given relation, the predicate should not be selected just by comparing the similarity of the property name and the column title (which may be misleading), but rather by consulting the Linked Open Data cloud and selecting the more widely used predicate.
  • Detect common violations of the established vocabularies, such as not respecting the domains and ranges of properties, when changing the classifications of related columns.
  • Relations where the primary key is split into two or more columns (in other words, the subject column does not represent the whole "primary key") are currently not supported.
  • Enable processing of sets of tables and detect relations across two or more tables (foreign keys, M:N tables).
  • Explore the algorithm's behaviour in certain corner cases:
    • What if two columns are classified by the same class? How to handle self-relations, e.g. one person being another person's boss?
  • Use the vocabularies to infer other relations.
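
The domain-based suggestion above could start as a plain SPARQL lookup. A sketch assuming the public DBpedia endpoint and the SPARQLWrapper library, with dbo:Country as in the example:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def properties_with_domain(class_uri, endpoint="https://dbpedia.org/sparql"):
    """Fetch properties whose rdfs:domain is the subject column's class."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?property WHERE {{
            ?property rdfs:domain <{class_uri}> .
        }} LIMIT 100
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["property"]["value"] for b in results["results"]["bindings"]]

# Columns next to a column classified as Country (population, area, ...)
# could then be matched against these candidate properties:
for prop in properties_with_domain("http://dbpedia.org/ontology/Country"):
    print(prop)
```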
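
A sketch of the value-range heuristic from the height/weight example; the EXPECTED_RANGES table is made up for illustration and would in practice be derived from the distributions of property values observed in the knowledge base:

```python
EXPECTED_RANGES = {
    "http://dbpedia.org/ontology/height": (0.5, 2.8),    # metres
    "http://dbpedia.org/ontology/weight": (30.0, 250.0), # kilograms
}

def plausible_properties(values):
    """Keep only candidate properties whose expected range covers the column."""
    numbers = [float(v) for v in values]
    low, high = min(numbers), max(numbers)
    return [prop for prop, (p_low, p_high) in EXPECTED_RANGES.items()
            if p_low <= low and high <= p_high]

print(plausible_properties(["1.8", "2.2", "2.0"]))  # height, not weight
```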
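
A sketch of the co-occurrence recommendation; the rule format (a set of relations implying further relations) is an assumption, and the rules themselves would have to be mined from previously processed files:

```python
def recommend_by_cooccurrence(present, rules):
    """Suggest relations that typically appear together with the present ones.

    `rules` maps a frozenset of relations to the relations frequently
    observed alongside them in previously processed files.
    """
    suggestions = set()
    for pattern, implied in rules.items():
        if pattern <= present:  # all relations of the pattern are present
            suggestions |= implied - present
    return suggestions

# Mined rule: files with foaf:firstName and foaf:age usually have foaf:surname.
rules = {frozenset({"foaf:firstName", "foaf:age"}): {"foaf:surname"}}
print(recommend_by_cooccurrence({"foaf:firstName", "foaf:age"}, rules))
```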
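
Inverse relations could be probed by placing the disambiguated subject-column entities in the object position; a sketch, again assuming the public DBpedia endpoint:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def inverse_predicates(entity_uri, endpoint="https://dbpedia.org/sparql"):
    """List predicates whose triples have the given entity as their object."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(f"""
        SELECT DISTINCT ?predicate WHERE {{
            ?subject ?predicate <{entity_uri}> .
        }} LIMIT 50
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["predicate"]["value"] for b in results["results"]["bindings"]]

# Predicates such as dbo:birthPlace point *to* a country, so they are
# candidates for inverse relations of a subject column classified as Country:
print(inverse_predicates("http://dbpedia.org/resource/Austria"))
```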

Classification/Disambiguation

  • There are still some issues with the user feedback on the classification/disambiguation results of the algorithm:
    • Cells with the same literal value share the same disambiguation, which is the first one resolved.
  • Performance issues with respect to classification/disambiguation:
    • Too many queries are issued to the knowledge bases during disambiguation, caused by ineffective restriction of the searched entities. For example, the disambiguation typically takes into account the context of the disambiguated cell, such as the row and column the cell is part of. Nevertheless, there is no differentiation among the meanings of the other columns' cells forming that context. For example, when disambiguating the name of a school, the information about the locality (state, country) is really important: it reduces the number of entities probed in the knowledge bases and also increases precision (see the sketch after this list).
  • Too many false positives when there is low evidence for the disambiguated cells/classified columns, or when the CSV file provides little context for them.
  • Take into account the distribution of values.
  • When there is not enough evidence, do not produce an almost arbitrary classification or disambiguation, but rather produce no result.
  • If a file contains both a name and its abbreviation, the algorithm tries to associate the same class with the name and the abbreviation at the same time. In this case, one of the columns should be chosen as the preferred one, and the other should become a property of the first.
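
A sketch of how a classified context column could restrict the candidate query in the school/locality example above; the dbo:country linking property and the exact label match are assumptions that depend on the knowledge base and the classified context columns:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def school_candidates(label, country_uri, endpoint="https://dbpedia.org/sparql"):
    """Probe candidates for a school name, restricted by the row's country."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?school WHERE {{
            ?school rdfs:label "{label}"@en ;
                    dbo:country <{country_uri}> .
        }} LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["school"]["value"] for b in results["results"]["bindings"]]

# Far fewer candidates than querying by the label alone:
print(school_candidates("Charles University",
                        "http://dbpedia.org/resource/Czech_Republic"))
```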

Knowledge Bases

  • Proper use of the hierarchy of concepts.
  • Improve the behaviour to always select the most specific concepts.
  • Support the GeoNames knowledge base and Wikidata.
  • Do a rigorous evaluation:
    • Evaluate precision/recall/performance gain.
    • Evaluate how the precision/recall changed after the proper use of hierarchies and other tweaks to KBProxy.
    • Evaluate how the performance improved after adjustments.
  • Alternative labels are not searched when querying the knowledge bases.
  • Provide more than one winning concept for disambiguation.
    • Currently there is just one winning concept for disambiguation when e.g. DBpedia is used. This is not the desired behaviour, but it follows from the way the queries for candidates are made:
      • We use an exact string match first, and only if no matching resource is found do we fall back to a regular expression. So in most cases there will be a single exact match.
      • With the now deprecated Freebase, the search API was more similar to a free-text engine, hence we got many candidates. DBpedia, however, is a SPARQL database, and its text matching is therefore comparatively limited.
      • Ideally, we should first build an inverted index of the labels of all URIs in DBpedia and use that instead of doing string matching on labels via SPARQL, or we could use an existing full-text index when available (see the index sketch after this list).
  • Take into account the hierarchy of the KB.
    • Follow the "subclassOf" relations and automatically assume that more generic classes are also candidates (see the hierarchy query sketch after this list).
      • Unfortunately, these relations have to be materialized first.
  • Currently, multiple KBs are handled as separate runs of the algorithm. It could be interesting to run the algorithm only once and handle the multiple KBs in the KBProxy (returning results from all of the configured KBs and merging results with the same URL).
    • At least allow parallel execution, with respect to the shared resources (hard drive and network access).
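
A minimal sketch of the inverted label index mentioned above; a production system would rather reuse an existing full-text engine (e.g. Lucene or the endpoint's own full-text index), but the principle is the same:

```python
from collections import defaultdict

def build_label_index(labelled_uris):
    """Build a token -> URIs inverted index from (URI, label) pairs.

    The pairs would come from a one-time dump of rdfs:label triples.
    """
    index = defaultdict(set)
    for uri, label in labelled_uris:
        for token in label.lower().split():
            index[token].add(uri)
    return index

def candidates(index, query):
    # URIs matching every token of the query, so unlike exact string
    # matching over SPARQL, more than one candidate is returned.
    token_sets = [index[t] for t in query.lower().split() if t in index]
    return set.intersection(*token_sets) if token_sets else set()

index = build_label_index([
    ("http://dbpedia.org/resource/Vienna", "Vienna"),
    ("http://dbpedia.org/resource/University_of_Vienna", "University of Vienna"),
])
print(candidates(index, "vienna"))  # both URIs; ranking happens afterwards
```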
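
A sketch of following the hierarchy with a SPARQL 1.1 property path instead of materializing it; this trades the one-time materialization for traversal cost in every query, which is why materializing the relations first is suggested above:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def generalizations(class_uri, endpoint="https://dbpedia.org/sparql"):
    """Collect the more generic classes reachable via rdfs:subClassOf."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?super WHERE {{
            <{class_uri}> rdfs:subClassOf+ ?super .
        }}
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["super"]["value"] for b in results["results"]["bindings"]]

# dbo:City generalizes to dbo:Settlement, dbo:PopulatedPlace, dbo:Place, ...
print(generalizations("http://dbpedia.org/ontology/City"))
```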

UI Improvements

  • Introduce graph visualisation for the data cube export.
  • Prevent the roll-on effect in some less-used browsers when the screens change.
  • Add user administration module to the UI.
  • Improve token generation and management:
    • Allow the user to overview the issued tokens.
  • Use OAuth and external services to log users in.
  • Adapt LodView or other means to view the details of resources.