Some time ago, I did a bit of work on the RIESE project, which aims at publishing the Eurostat dataset on the Semantic Web and interlinking it with further datasets (e.g. Geonames and DBpedia). This may seem far removed from my interest in music data, but there is a connection which illustrates the power of linked data, as explained at the end of this post.
There are three distinct parts to the Eurostat dataset:
- A table of contents in HTML, defining the hierarchical structure of the Eurostat datasets;
- Tab-separated values dictionary files, defining the ~80 000 data codes used in the dataset (e.g. "eu27" for the European Union of 27 countries);
- The actual statistical data, in tab-separated values files: around 4000 datasets, for roughly 350 million statistical items.
The first thing we need to figure out when exposing data on the Semantic Web is the model we'll map the data to. This led to the design of SCOVO (the Statistical Core Vocabulary). The concepts in this ontology can be depicted as follows:
The interesting thing about this model is that the statistical item is treated as a first-class entity. We used the Event ontology as a basis: a statistical item is a particular classification of a space/time region. This makes the model really flexible and extensible. We can, for example, attach multiple dimensions to a particular item, resources pertaining to its creation, etc.
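To make this concrete, here is a small, dependency-free sketch of what a single observation looks like under this model: the item is a resource in its own right, carrying its value plus one triple per dimension. The URIs, the item value, and the exact term names (based on my reading of the SCOVO vocabulary) are illustrative, not the actual RIESE URIs.

```python
# Illustrative triples for one SCOVO-style statistical item.
# All URIs and the value are made up for the example.
SCV = "http://purl.org/NET/scovo#"
EX = "http://example.org/eurostat/"   # hypothetical base URI

item = EX + "item/tps00001-eu27-2007"  # hypothetical item URI
triples = {
    (item, "rdf:type", SCV + "Item"),
    (item, "rdf:value", "4.2"),                        # placeholder value
    (item, SCV + "dataset", EX + "dataset/tps00001"),  # owning dataset
    (item, SCV + "dimension", EX + "code/eu27"),       # spatial dimension
    (item, SCV + "dimension", EX + "code/2007"),       # temporal dimension
}
for s, p, o in sorted(triples):
    print(s, p, o)
```

Because dimensions are plain triples attached to the item, adding a new dimension (say, an age group) is just one more `scovo:dimension` statement, with no change to the model.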
The first real concern when dealing with such a large amount of data is, of course, scalability. The overall Eurostat dataset amounts to over 3 billion triples. Given that we don't have high-end machines with lots of memory, using the core SWI-Prolog in-memory RDF store was out of the question (I wonder if any in-memory triple store can handle 1 billion triples, by the way).
So there are three choices at this point:
- Use a database-backed triple store;
- Dump static RDF files to a file system served through Apache;
- Generate the RDF on-the-fly.
We don't have the sort of money it takes (for both the hardware and the software) to make the first choice really scale, so we tried the second and third solutions. I took my old Prolog-2-RDF (P2R) software, which I use to publish the Jamendo dataset, and we wrote some P2R mapping files to convert the tab-separated values files. We then made P2R dump small RDF files into a file-system hierarchy, corresponding to the descriptions of the different web resources we wanted to publish. Some Apache tweaks and Michael and Wolfgang's work on XHTML/RDFa publishing were then enough to make the dataset available on the Web of Data.
But this approach had two main problems. First, the RDF dump took ages to run, so we never actually managed to complete it even once. Second, it made it impossible to provide a SPARQL querying facility, so no aggregation of the data was available.
So we eventually settled on the third solution. I put on my Prolog hacker hat and tried to optimise P2R to make it fast enough. I did it using the same trick I used in my small N3 reasoner, Henry: P2R mappings are compiled into native Prolog clauses (rdf(S,P,O) :- ...), which cuts down the search space a lot. TSV files are accessed within those rules and parsed on-the-fly. The location of the TSV file to access is derived from a small in-memory index. Parsed TSV files are cached for the duration of a query, to avoid parsing the same file for different triple patterns in the query.
The same mechanism is applied to derive a SKOS hierarchy from the HTML table of contents.
Now, a DESCRIBE query for any item takes less than 0.5 seconds on my laptop. Not perfect, but still...
One way to improve access times significantly would be to dump the TSV files into a relational database, and to access that database from our P2R mappings instead of the raw TSV files.
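A rough sketch of that improvement, using SQLite for concreteness (the actual choice of database, and the table and column names below, are made up for the example): load each TSV file into a table once, and then each triple pattern becomes an indexed lookup instead of a file parse.

```python
# Sketch: load a TSV file into SQLite so lookups replace re-parsing.
import csv
import io
import sqlite3

def load_tsv(conn, table, tsv_text):
    """Create a table from a TSV file's header row and insert its data rows."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, data = rows[0], rows[1:]
    cols = ", ".join(f'"{c}"' for c in header)
    conn.execute(f'CREATE TABLE {table} ({cols})')
    marks = ", ".join("?" * len(header))
    conn.executemany(f'INSERT INTO {table} VALUES ({marks})', data)

conn = sqlite3.connect(":memory:")
load_tsv(conn, "tps00001",  # hypothetical dataset code and contents
         "geo\ttime\tvalue\neu27\t2007\t4.2\nfr\t2007\t3.9\n")

# One triple pattern now costs one indexed query, not one file parse:
print(conn.execute(
    "SELECT value FROM tps00001 WHERE geo = ?", ("eu27",)).fetchone()[0])
```

The P2R rules would then call into such queries from the body of the compiled rdf(S,P,O) clauses, keeping the mapping layer unchanged.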
Trying it out and creating your own SPARQL end-point from the Eurostat data is really easy:
- Get SWI-Prolog;
- Get the software from there;
- Get the raw Eurostat data (or get it from the Eurostat web site, as this copy can be slightly out of date);
- Put it in data/, in your Ceriese directory;
- Launch start.pl;
- Go to http://localhost:3020/
Michael and Wolfgang did an amazing job putting together a really nice web interface, publishing the data as XHTML+RDFa. They also included some interlinking, especially for geographical locations, which are now linked to Geonames.
So what's the point, from my music-geek point of view? Well, after aggregating Semantic Web data about my music collection (using these tools), I can now sort hip-hop artists by the murder rate in their home city :-) This is quite fun as it is (especially as the Eurostat dataset holds a really diverse range of statistics), but it would be really interesting to mine this data for associations between statistics and musical facts. That could surely lead to interesting sociological results (e.g. how does musical "genre" correlate with particular socio-economic indicators?)
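The join that makes this possible is worth spelling out: the artist data and the statistical data never mention each other, but both link to the same Geonames city URIs, so sorting artists by a city-level statistic is a simple join on those shared URIs. Here is a toy illustration with entirely made-up artists and statistic values (only the Geonames URIs for Paris and Berlin are real):

```python
# Toy linked-data join: artists -> city URI, city URI -> statistic.
artist_city = {  # made-up artists linked to real Geonames city URIs
    "http://example.org/artist/A": "http://sws.geonames.org/2988507/",  # Paris
    "http://example.org/artist/B": "http://sws.geonames.org/2950159/",  # Berlin
}
city_stat = {  # hypothetical per-city statistic, keyed by the same URIs
    "http://sws.geonames.org/2988507/": 1.2,
    "http://sws.geonames.org/2950159/": 0.9,
}

# The shared URI is the join key: rank artists by their city's statistic.
ranked = sorted(artist_city, key=lambda a: city_stat[artist_city[a]],
                reverse=True)
for a in ranked:
    print(a, city_stat[artist_city[a]])
```

No schema coordination between the two datasets was needed; agreeing on the Geonames URIs is what makes the join possible.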