
Some time ago, I did a bit of work on the RIESE project, which aims to publish the
Eurostat dataset on the Semantic
Web and to interlink it with further datasets (e.g. Geonames and DBpedia). This may seem a long way from my interests in
music data, but there is a connexion which illustrates the power of linked data, as explained at the end of this
post.
Original data
There are three distinct things we consider in the Eurostat dataset:
- A table of contents in HTML defining the hierarchical structure of the
Eurostat datasets;
- Tab-separated values dictionary files defining the ~80 000 data codes used
in the dataset (e.g. "eu27" for the European Union of 27 countries);
- The actual statistical data, in tab-separated values files. Around 4000
datasets for roughly 350 million statistical items.
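To make the dictionary files a bit more concrete, here is a minimal sketch of reading one into a code-to-label lookup. It is written in Python for illustration only, and it assumes each row is simply a code followed by a tab and a label; the real files may carry extra columns or headers. The `load_dictionary` name and the sample rows are made up.

```python
import io


def load_dictionary(lines):
    """Build a {code: label} mapping from tab-separated 'code<TAB>label' rows.

    Assumption: two columns per row, no header. Adjust for the real layout.
    """
    codes = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        codes[fields[0].strip()] = fields[1].strip()
    return codes


# Made-up sample standing in for a dictionary file on disk.
sample = io.StringIO("eu27\tEuropean Union (27 countries)\nfr\tFrance\n\n")
geo_codes = load_dictionary(sample)
print(geo_codes["eu27"])
```

With a real file, the `io.StringIO` object would be replaced by an opened file handle.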
Ontology
The first thing we need to figure out when exposing data on the Semantic Web
is the model we'll link to. This led to the design of SCOVO (Statistical Core Vocabulary). The
concepts in this ontology can be depicted as follows:

The interesting thing about this model is that the statistical item is
considered as a primary entity. We used the Event ontology as a basis: a statistical item is
a particular classification of a space/time region. This allows us to be really
flexible and extensible. We can for example attach multiple dimensions to a
particular item, resources pertaining to its creation, etc.
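As a rough illustration of what "the item as a primary entity" means in practice, the sketch below builds the triples for one statistical item and attaches several dimensions to it. It is plain Python with triples as tuples, not the actual publishing code. The `example.org` URIs and the identifiers are made up, and the `scovo:` term names (`Item`, `dataset`, `dimension`) are assumed to match the published vocabulary, so check them against the SCOVO specification before relying on them.

```python
SCOVO = "http://purl.org/NET/scovo#"  # assumed SCOVO namespace
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/eurostat/"  # hypothetical base URI


def item_triples(item_id, dataset_id, value, dimension_ids):
    """Describe one statistical item as a list of (subject, predicate, object)."""
    item = EX + "item/" + item_id
    triples = [
        (item, RDF + "type", SCOVO + "Item"),
        (item, SCOVO + "dataset", EX + "dataset/" + dataset_id),
        (item, RDF + "value", value),
    ]
    # Any number of dimensions (place, time, unit, ...) hang off the same item.
    for dim in dimension_ids:
        triples.append((item, SCOVO + "dimension", EX + "dimension/" + dim))
    return triples


for triple in item_triples("i1", "d1", "42", ["eu27", "2005"]):
    print(triple)
```

Because the item is a resource in its own right, adding another dimension or a provenance statement is just one more triple about it.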
RDF back-end
I wanted to see how to publish such large amounts of RDF data and how
my publication tools perform, so
I designed the Ceriese
software to handle that.
The first real concern when dealing with such large amounts of data is, of
course, scalability. The overall Eurostat dataset is over 3 billion triples.
Given that we don't have high-end machines with lots of memory, using the core
SWI RDF store was
out of the question (I wonder if any in-memory triple store can handle 1
billion triples, btw).
So there are three choices at this point:
- Use a database-backed triple store;
- Dump static RDF files to a file system served through Apache;
- Generate the RDF on-the-fly.
We don't have the sort of money it takes (for both the hardware and the
software) for the first choice to really scale, so we tried the second and the
third solution. I took my old Prolog-2-RDF software that I am using to
publish the Jamendo dataset and we
wrote some P2R mapping
files converting the tab-separated value files. Then, we made P2R dump
small RDF files in a file-system hierarchy, corresponding to the description of
the different web resources we wanted to publish. Then, some Apache tweaks and
Michael and Wolfgang's work on XHTML/RDFa publishing were enough
to make the dataset available in the web of data.
But this approach had two main problems. First, it took ages to run this RDF
dump, so we never actually managed to complete it even once. Also, it was
impossible to provide a SPARQL querying facility. No
aggregation of data was made available.
So we eventually settled on the third solution. I took my Prolog hacker hat,
and tried to optimise P2R to make it fast enough. I did it by using the same
trick I used in my small N3
reasoner, Henry. P2R mappings are compiled as native Prolog clauses
(rdf(S,P,O) :- ... ), which cuts down the search space a lot. TSV
files are accessed within those rules and parsed on-the-fly. The location of
the TSV file to access is derived from a small in-memory index. Parsed TSV
files are cached for a whole query, to avoid parsing the same file for
different triple patterns in the query.
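The index-plus-cache part of this can be sketched as follows. This is a Python illustration, not the P2R code: it does not reproduce the compilation of mappings into Prolog clauses, only the idea of locating a TSV source through a small in-memory index, parsing it on demand, and reusing the parsed rows for every triple pattern in the same query. The `Query` class, the `INDEX` layout, the subject naming scheme and the sample data are all made up.

```python
import io

# Made-up TSV content standing in for files on disk.
FAKE_FILES = {
    "crime": "geo\ttime\tvalue\nuk\t2005\t1.4\nfr\t2005\t1.6\n",
}

# Hypothetical in-memory index: dataset code -> function opening its TSV source.
INDEX = {
    code: (lambda text=text: io.StringIO(text)) for code, text in FAKE_FILES.items()
}


class Query:
    """Holds the parsed-TSV cache for the lifetime of one query."""

    def __init__(self, index):
        self.index = index
        self.cache = {}
        self.parse_count = 0  # only here to show the cache working

    def rows(self, dataset):
        if dataset not in self.cache:
            self.parse_count += 1
            with self.index[dataset]() as fh:
                header = fh.readline().rstrip("\n").split("\t")
                self.cache[dataset] = [
                    dict(zip(header, line.rstrip("\n").split("\t")))
                    for line in fh
                    if line.strip()
                ]
        return self.cache[dataset]

    def triples(self, dataset, s=None, p=None, o=None):
        """Generate matching triples on the fly; None acts as a wildcard."""
        for n, row in enumerate(self.rows(dataset)):
            subject = f"item/{dataset}/{n}"
            for predicate, obj in row.items():
                if s in (None, subject) and p in (None, predicate) and o in (None, obj):
                    yield (subject, predicate, obj)


q = Query(INDEX)
values = list(q.triples("crime", p="value"))
french = list(q.triples("crime", o="fr"))
print(values, french, q.parse_count)
```

Two triple patterns are evaluated against the same dataset here, but the source is parsed only once per `Query` object.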
The same mechanisms are applied to derive a SKOS hierarchy from the HTML table of
contents.
Now, a DESCRIBE query takes less than 0.5 seconds for any item, on my
laptop. Not perfect, but, still...
One way to improve the access time a lot would be to dump the TSV files into
a relational database, and access this database in our P2R mappings instead of
the raw TSV files.
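A minimal sketch of that idea, using Python's built-in `sqlite3` module purely as a stand-in for whatever relational database would actually be used. The table name, column names and values are made up, every column is stored as text, and the `load_tsv` helper is hypothetical. The point is only that an indexed lookup replaces re-parsing a whole file.

```python
import sqlite3


def load_tsv(conn, table, lines, index_column):
    """Load header + rows (tab-separated strings) into a table and index one column."""
    header = lines[0].split("\t")
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [line.split("\t") for line in lines[1:]],
    )
    # The index is what makes per-resource lookups cheap.
    conn.execute(f'CREATE INDEX "{table}_{index_column}" ON "{table}" ("{index_column}")')


conn = sqlite3.connect(":memory:")
load_tsv(conn, "crime", ["geo\ttime\tvalue", "uk\t2005\t1.4", "fr\t2005\t1.6"], "geo")
rows = conn.execute(
    'SELECT "time", "value" FROM "crime" WHERE "geo" = ?', ("fr",)
).fetchall()
print(rows)
```

A mapping rule would then issue a query like the last one instead of scanning the raw file.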
Trying it out and creating your own SPARQL end-point from Eurostat data is
really easy.
- Get SWI-Prolog;
- Get the software from there;
- Get the raw Eurostat
data (or get it from the Eurostat web-site, as this one can be
slightly out-dated);
- Put it in
data/, in your Ceriese directory;
- Launch start.pl;
- Go to http://localhost:3020/
Now what?
Michael and Wolfgang did amazing work putting together a really nice web interface, publishing the data
in XHTML+RDFa. They also included some interlinking, especially for
geographical locations, which are now linked to Geonames.
So what's the point from my music geek point of view? Well, now, after
aggregating Semantic Web data about my music collection (using these tools), I can sort hip-hop
artists by murder rates in their city :-) This is quite fun as it is
(especially as the Eurostat dataset holds a really diverse range of
statistics), but it would be really interesting to mine that to get some
associations between statistical data and musical facts. This would surely lead
to interesting sociological results (e.g. how does musical "genre" associate
with particular socio-economic indicators?).