Statistical data can be published as an RDF Data Cube. General documentation of the RDF Data Cube vocabulary is available at http://www.w3.org/TR/vocab-data-cube/. The rest of this section describes the design and proposals for Odalic's RDF Data Cube export, together with the issues and problems encountered along the way.
Input file structure
First we had to decide which input file structure Odalic would support for processing statistical data. The documentation gives an example input table at http://www.w3.org/TR/vocab-data-cube/#example:
| | 2004-2006 | | 2005-2007 | | 2006-2008 | |
---|---|---|---|---|---|---|
| | Male | Female | Male | Female | Male | Female |
Newport | 76.7 | 80.7 | 77.1 | 80.9 | 77.0 | 81.5 |
Cardiff | 78.7 | 83.3 | 78.6 | 83.7 | 78.7 | 83.4 |
Monmouthshire | 76.6 | 81.3 | 76.5 | 81.5 | 76.6 | 81.7 |
Merthyr Tydfil | 75.5 | 79.1 | 75.5 | 79.4 | 74.9 | 79.6 |
The table has three dimensions: time period (rolling averages over three-year time spans), region, and sex. Each observation represents the life expectancy for that population (the measure), and an attribute is needed to define the unit (years) of the measured values. The table has a multi-row header and a header column, and every data cell represents one observation. Odalic does not support this structure, however: it supports only tables with exactly one header row and no header columns. The data above can therefore be transformed into the following table structure (slightly extended with a Country column):
Country | Region | Time period | Sex | Life expectancy |
---|---|---|---|---|
Count1 | Newport | 2004-2006 | Male | 76.7 |
Count1 | Newport | 2004-2006 | Female | 80.7 |
Count1 | Newport | 2005-2007 | Male | 77.1 |
Count1 | Newport | 2005-2007 | Female | 80.9 |
Count1 | Newport | 2006-2008 | Male | 77.0 |
Count1 | Newport | 2006-2008 | Female | 81.5 |
Count1 | Cardiff | 2004-2006 | Male | 78.7 |
Count1 | Cardiff | 2004-2006 | Female | 83.3 |
Count1 | Cardiff | 2005-2007 | Male | 78.6 |
Count1 | Cardiff | 2005-2007 | Female | 83.7 |
Count1 | Cardiff | 2006-2008 | Male | 78.7 |
Count1 | Cardiff | 2006-2008 | Female | 83.4 |
Count2 | Monmouthshire | 2004-2006 | Male | 76.6 |
Count2 | Monmouthshire | 2004-2006 | Female | 81.3 |
Count2 | Monmouthshire | 2005-2007 | Male | 76.5 |
Count2 | Monmouthshire | 2005-2007 | Female | 81.5 |
Count2 | Monmouthshire | 2006-2008 | Male | 76.6 |
Count2 | Monmouthshire | 2006-2008 | Female | 81.7 |
Count2 | Merthyr Tydfil | 2004-2006 | Male | 75.5 |
Count2 | Merthyr Tydfil | 2004-2006 | Female | 79.1 |
Count2 | Merthyr Tydfil | 2005-2007 | Male | 75.5 |
Count2 | Merthyr Tydfil | 2005-2007 | Female | 79.4 |
Count2 | Merthyr Tydfil | 2006-2008 | Male | 74.9 |
Count2 | Merthyr Tydfil | 2006-2008 | Female | 79.6 |
In this structure every row represents one observation. One column represents the measure (Life expectancy) and three columns represent dimensions (Region, Time period, Sex). The first column (Country) is neither a measure nor a dimension, because there is a relation between Country and Region: in the table each region belongs to exactly one country, so Country does not help to identify an observation. There are no relations among the other columns. In theory, there could be more than one column representing a measure.
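The reshaping from the multi-header table into one observation per row can be sketched in Python. This is only an illustration and not part of Odalic: the function name `unpivot` and the in-memory representation of the two header rows are assumptions, and the Country column is left out because it does not appear in the original table.

```python
from typing import List, Tuple


def unpivot(periods: List[str], sexes: List[str],
            rows: List[Tuple[str, List[float]]]) -> List[Tuple[str, str, str, float]]:
    """Turn a wide table (two header rows, one header column) into long rows.

    periods[i] and sexes[i] are the two header cells that describe value
    column i, with the spanned period already repeated for each column.
    """
    long_rows = []
    for region, values in rows:
        if len(values) != len(periods) or len(values) != len(sexes):
            raise ValueError(f"row {region!r} does not match the header width")
        # One output row per data cell: (Region, Time period, Sex, value).
        for period, sex, value in zip(periods, sexes, values):
            long_rows.append((region, period, sex, value))
    return long_rows


periods = ["2004-2006", "2004-2006", "2005-2007",
           "2005-2007", "2006-2008", "2006-2008"]
sexes = ["Male", "Female"] * 3
wide = [
    ("Newport", [76.7, 80.7, 77.1, 80.9, 77.0, 81.5]),
    ("Cardiff", [78.7, 83.3, 78.6, 83.7, 78.7, 83.4]),
]

observations = unpivot(periods, sexes, wide)
for row in observations[:3]:
    print(row)
```

Each region row with six values yields six observation rows, which matches the row counts in the transformed table above.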
Resulting RDF Data cube and the generated patterns
The documentation contains the complete RDF Data Cube resulting from the example above at http://www.w3.org/TR/vocab-data-cube/#full-example. According to that example, the RDF Data Cube consists of these parts:
- Data Set
- Data structure definition
- Dimensions and measures
- Observations
For every part there is a "pattern" showing how Odalic produces the RDF. Producing the RDF Data Cube requires the Result provided by the Odalic core algorithm and also the Data Cube definition ("CubeDef") provided by the user. Each pattern lists the information needed from the Result and from the CubeDef to produce the RDF.
Data Set pattern
Input from Odalic Result:
- (none)
Input from user's CubeDef:
Parameter | Value |
---|---|
Title | Life expectancy title |
Label | Life expectancy desc |
Comment | Life expectancy within Welsh Unitary authorities comment |
Description | Life expectancy within Welsh Unitary authorities - extracted from Stats Wales |
Subject | http://purl.org/linked-data/sdmx/2009/subject/3.2 |
Organization | Example org |
- Note: The date for "issued" can be computed by the program when the RDF is produced.
RDF output pattern:
    # -- Data Set --------------------------------------------
    eg:dataset a qb:DataSet;
        dct:title "Life expectancy title";
        rdfs:label "Life expectancy desc";
        rdfs:comment "Life expectancy within Welsh Unitary authorities comment";
        dct:description "Life expectancy within Welsh Unitary authorities - extracted from Stats Wales";
        dct:publisher eg:organization;
        dct:issued "2016-09-22";
        dct:subject <http://purl.org/linked-data/sdmx/2009/subject/3.2>;
        qb:structure eg:dsd;
        .

    eg:organization a org:Organization, foaf:Agent;
        rdfs:label "Example org";
        .
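As a rough sketch of how the Data Set pattern could be filled in from the CubeDef parameters, the following Python builds the Turtle text with plain string formatting. The dictionary keys, the function names `turtle_literal` and `dataset_turtle`, and the use of a Python `date` for "issued" are all assumptions made for illustration; they do not describe Odalic's actual implementation.

```python
from datetime import date


def turtle_literal(text: str) -> str:
    """Quote a string as a Turtle literal, escaping backslash, quote and newlines."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n")
                   .replace("\r", "\\r"))
    return f'"{escaped}"'


def dataset_turtle(cube_def: dict, issued: date) -> str:
    """Render the Data Set pattern from a CubeDef-like dict (hypothetical keys)."""
    lines = [
        "eg:dataset a qb:DataSet;",
        f"    dct:title {turtle_literal(cube_def['title'])};",
        f"    rdfs:label {turtle_literal(cube_def['label'])};",
        f"    rdfs:comment {turtle_literal(cube_def['comment'])};",
        f"    dct:description {turtle_literal(cube_def['description'])};",
        "    dct:publisher eg:organization;",
        # The "issued" date is computed by the caller, not taken from the CubeDef.
        f"    dct:issued {turtle_literal(issued.isoformat())};",
        f"    dct:subject <{cube_def['subject']}>;",
        "    qb:structure eg:dsd;",
        "    .",
        "",
        "eg:organization a org:Organization, foaf:Agent;",
        f"    rdfs:label {turtle_literal(cube_def['organization'])};",
        "    .",
    ]
    return "\n".join(lines)


cube_def = {
    "title": "Life expectancy title",
    "label": "Life expectancy desc",
    "comment": "Life expectancy within Welsh Unitary authorities comment",
    "description": "Life expectancy within Welsh Unitary authorities - extracted from Stats Wales",
    "subject": "http://purl.org/linked-data/sdmx/2009/subject/3.2",
    "organization": "Example org",
}

output = dataset_turtle(cube_def, date(2016, 9, 22))
print(output)
```

In real use the second argument would typically be `date.today()`, in line with the note that the date can be computed during export.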
Data structure definition pattern
Input from Odalic Result:
- (none)
Input from user's CubeDef:
- Which columns are dimensions and which are measures; the column numbers (their order in the Input of the task) are enough.
RDF output pattern:
    # -- Data structure definition ----------------------------
    eg:dsd a qb:DataStructureDefinition;
        # The dimensions
        qb:component [ qb:dimension eg:refArea; qb:order 1 ];
        qb:component [ qb:dimension eg:refPeriod; qb:order 2 ];
        # The measure(s)
        qb:component [ qb:measure eg:lifeExpectancy ];
        # The attributes
        qb:component [ qb:attribute sdmx-attribute:unitMeasure ];
        .
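The data structure definition depends only on which column numbers the user marks as dimensions and measures. A minimal Python sketch of that step is shown below. The function name `dsd_turtle` and the `column_properties` mapping from column number to property name are hypothetical, and the zero-based column numbers refer to the transformed table (Country is column 0).

```python
from typing import Dict, List


def dsd_turtle(column_properties: Dict[int, str],
               dimension_columns: List[int],
               measure_columns: List[int]) -> str:
    """Render the data structure definition pattern from column numbers."""
    lines = ["eg:dsd a qb:DataStructureDefinition;", "    # The dimensions"]
    # qb:order follows the order in which the dimension columns are listed.
    for order, column in enumerate(dimension_columns, start=1):
        prop = column_properties[column]
        lines.append(f"    qb:component [ qb:dimension {prop}; qb:order {order} ];")
    lines.append("    # The measure(s)")
    for column in measure_columns:
        lines.append(f"    qb:component [ qb:measure {column_properties[column]} ];")
    lines.append("    # The attributes")
    lines.append("    qb:component [ qb:attribute sdmx-attribute:unitMeasure ];")
    lines.append("    .")
    return "\n".join(lines)


# Hypothetical mapping: column number in the Input -> generated property name.
column_properties = {1: "eg:refArea", 2: "eg:refPeriod", 4: "eg:lifeExpectancy"}

dsd = dsd_turtle(column_properties, dimension_columns=[1, 2], measure_columns=[4])
print(dsd)
```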
Dimensions and measures pattern
Input from Odalic Result:
- Classification of the columns marked by the user as dimensions and measures (Label and Resource)
Input from user's CubeDef:
- Which columns are dimensions and which are measures (for example, the column number in the Input is enough).
RDF output pattern:
    # -- Dimensions and measures ----------------------------
    eg:refPeriod a rdf:Property, qb:DimensionProperty;
        rdfs:label "reference period";
        qb:concept <http://dbpedia.org/resource/Reference_period>;
        .

    eg:refArea a rdf:Property, qb:DimensionProperty;
        rdfs:label "reference area";
        qb:concept <http://dbpedia.org/resource/Region>;
        .

    eg:lifeExpectancy a rdf:Property, qb:MeasureProperty;
        rdfs:label "life expectancy";
        rdfs:subPropertyOf sdmx-measure:obsValue;
        qb:concept <http://dbpedia.org/resource/Life_expectancy>;
        .
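The pattern differs between dimensions and measures only in the class and in the extra `rdfs:subPropertyOf` line. The sketch below shows one way to express that in Python. The function name `property_turtle`, its parameters, and the idea of passing the local name in explicitly are assumptions; how Odalic derives names such as `refArea` is not specified here.

```python
def property_turtle(local_name: str, label: str, concept_uri: str, kind: str) -> str:
    """Render one dimension or measure property from a column classification."""
    if kind == "dimension":
        rdf_class = "qb:DimensionProperty"
    elif kind == "measure":
        rdf_class = "qb:MeasureProperty"
    else:
        raise ValueError(f"unknown kind: {kind!r}")

    # Escape backslashes and quotes so the label stays a valid Turtle literal.
    safe_label = label.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        f"eg:{local_name} a rdf:Property, {rdf_class};",
        f'    rdfs:label "{safe_label}";',
    ]
    if kind == "measure":
        # Only measures are declared as sub-properties of the SDMX observed value.
        lines.append("    rdfs:subPropertyOf sdmx-measure:obsValue;")
    lines.append(f"    qb:concept <{concept_uri}>;")
    lines.append("    .")
    return "\n".join(lines)


area = property_turtle("refArea", "reference area",
                       "http://dbpedia.org/resource/Region", "dimension")
measure = property_turtle("lifeExpectancy", "life expectancy",
                          "http://dbpedia.org/resource/Life_expectancy", "measure")
print(area)
print()
print(measure)
```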
Observations pattern
Input from Odalic Result:
- Disambiguation of the cells in the columns marked by the user as dimensions (Resource)
Input from user's CubeDef:
- Unit of measure (Resource)
Note: The values of cells in the column marked by the user as the measure are taken from the Input of the task.
RDF output pattern:
    # -- Observations -----------------------------------------
    eg:o1 a qb:Observation;
        qb:dataSet eg:dataset ;
        eg:refArea <http://dbpedia.org/page/Newport,_New_South_Wales> ;
        eg:refPeriod <http://reference.data.gov.uk/id/gregorian-interval/2004-01-01T00:00:00/P3Y> ;
        sdmx-attribute:unitMeasure <http://dbpedia.org/resource/Year> ;
        eg:lifeExpectancy 76.7 ;
        .

    eg:o2 a qb:Observation;
        qb:dataSet eg:dataset ;
        eg:refArea <http://dbpedia.org/resource/Cardiff> ;
        eg:refPeriod <http://reference.data.gov.uk/id/gregorian-interval/2004-01-01T00:00:00/P3Y> ;
        sdmx-attribute:unitMeasure <http://dbpedia.org/resource/Year> ;
        eg:lifeExpectancy 78.7 ;
        .
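The observation pattern combines three sources: the disambiguated resources from the Result, the unit from the CubeDef, and the raw measure values from the Input. The Python below is a minimal sketch of that combination. The function name `observations_turtle` and the tuple layout of `rows` are hypothetical, and the measure values are kept as strings so that they appear exactly as in the Input.

```python
from typing import Iterable, Tuple


def observations_turtle(rows: Iterable[Tuple[str, str, str]], unit_uri: str) -> str:
    """Render one qb:Observation per row.

    Each row is (area resource URI, period resource URI, measure value as text).
    """
    blocks = []
    # Observations are numbered from 1 in row order: eg:o1, eg:o2, ...
    for number, (area_uri, period_uri, value) in enumerate(rows, start=1):
        blocks.append("\n".join([
            f"eg:o{number} a qb:Observation;",
            "    qb:dataSet eg:dataset ;",
            f"    eg:refArea <{area_uri}> ;",
            f"    eg:refPeriod <{period_uri}> ;",
            f"    sdmx-attribute:unitMeasure <{unit_uri}> ;",
            f"    eg:lifeExpectancy {value} ;",
            "    .",
        ]))
    return "\n\n".join(blocks)


period = "http://reference.data.gov.uk/id/gregorian-interval/2004-01-01T00:00:00/P3Y"
rows = [
    ("http://dbpedia.org/page/Newport,_New_South_Wales", period, "76.7"),
    ("http://dbpedia.org/resource/Cardiff", period, "78.7"),
]

turtle = observations_turtle(rows, "http://dbpedia.org/resource/Year")
print(turtle)
```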