DBTune blog


Tag - ontology


Tuesday 20 May 2008

Ceriese: RDF translator for Eurostat data

Riese logo

Some time ago, I did a bit of work on the RIESE project, which aims to publish the Eurostat dataset on the Semantic Web and interlink it with further datasets (e.g. Geonames and DBpedia). This may seem a bit far from my interests in music data, but there is a connection that illustrates the power of linked data, as explained at the end of this post.

Original data

There are three distinct things we consider in the Eurostat dataset:

  • A table of contents in HTML defining the hierarchical structure of the Eurostat datasets;
  • Tab-separated values dictionary files defining the ~80 000 data codes used in the dataset (e.g. "eu27" for the European Union of 27 countries);
  • The actual statistical data, in tab-separated values files: around 4000 datasets holding roughly 350 million statistical items.

Ontology

The first thing we need to figure out when exposing data on the Semantic Web is the model we'll link to. This led to the design of SCOVO (the Statistical Core Vocabulary). The concepts in this ontology can be depicted as follows:

SCOVO ontology

The interesting thing about this model is that the statistical item is treated as a primary entity. We used the Event ontology as a basis: a statistical item is a particular classification of a space/time region. This makes the model really flexible and extensible. We can, for example, attach multiple dimensions to a particular item, resources pertaining to its creation, etc.
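As a sketch, a single statistical item might look like this in Turtle (the scv: terms follow the SCOVO vocabulary as described above, but the identifiers and the population figure are made up for illustration):

```turtle
@prefix scv:  <http://purl.org/NET/scovo#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/eurostat/> .

# One statistical item: a (fictional) population figure for the EU-27 in 2007
ex:item42 a scv:Item ;
    rdf:value "495000000" ;
    scv:dataset ex:population ;   # the dataset it belongs to
    scv:dimension ex:eu27 ;       # spatial dimension
    scv:dimension ex:year2007 .   # temporal dimension

ex:eu27 a scv:Dimension ; rdfs:label "European Union (27 countries)" .
ex:year2007 a scv:Dimension ; rdfs:label "2007" .
```

Because the item is a first-class resource, adding a further dimension is just one more triple on ex:item42.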

RDF back-end

I wanted to see how to publish such large amounts of RDF data and how my publication tools perform, so I designed the Ceriese software to handle that.

The first real concern when dealing with such large amounts of data is, of course, scalability. The overall Eurostat dataset amounts to over 3 billion triples. Given that we don't have high-end machines with lots of memory, using the core SWI-Prolog RDF store was out of the question (I wonder whether any in-memory triple store can handle a billion triples, btw).

So there are three choices at this point:

  • Use a database-backed triple store;
  • Dump static RDF files to a file system served through Apache;
  • Generate the RDF on-the-fly.

We don't have the sort of money it takes (for both the hardware and the software) for the first choice to really scale, so we tried the second and third solutions. I took my old Prolog-2-RDF (P2R) software, which I use to publish the Jamendo dataset, and we wrote some P2R mapping files converting the tab-separated values files. We then made P2R dump small RDF files into a file-system hierarchy, corresponding to the descriptions of the different web resources we wanted to publish. Some Apache tweaks and Michael and Wolfgang's work on XHTML/RDFa publishing were then enough to make the dataset available in the web of data.

But this approach had two main problems. First, the RDF dump took ages to run, so we never actually managed to complete it. Second, it was impossible to provide a SPARQL querying facility, so no aggregation of the data was available.

So we eventually settled on the third solution. I put on my Prolog hacker hat and tried to optimise P2R to make it fast enough. I did so using the same trick as in my small N3 reasoner, Henry: P2R mappings are compiled into native Prolog clauses (rdf(S,P,O) :- ...), which cuts down the search space a lot. TSV files are accessed within those rules and parsed on the fly. The location of the TSV file to access is derived from a small in-memory index, and parsed TSV files are cached for the duration of a query, to avoid re-parsing the same file for different triple patterns.

The same mechanism is used to derive a SKOS hierarchy from the HTML table of contents.
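For illustration, the derived hierarchy looks roughly like this in SKOS (the theme identifiers and labels here are hypothetical):

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/eurostat/themes/> .

# A fragment of the table of contents as a SKOS concept hierarchy
ex:population a skos:Concept ;
    skos:prefLabel "Population and social conditions" ;
    skos:narrower ex:crime .

ex:crime a skos:Concept ;
    skos:prefLabel "Crime and criminal justice" ;
    skos:broader ex:population .
```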

Now, a DESCRIBE query takes less than 0.5 seconds for any item, on my laptop. Not perfect, but still...

One way to improve the access time a lot would be to load the TSV files into a relational database, and access that database from our P2R mappings instead of the raw TSV files.

Trying it out and creating your own SPARQL end-point from Eurostat data is really easy:

  • Get SWI-Prolog;
  • Get the software from there;
  • Get the raw Eurostat data (or get it from the Eurostat web-site, as this copy can be slightly outdated);
  • Put it in data/, in your Ceriese directory;
  • Launch start.pl;
  • Go to http://localhost:3020/

Now what?

Michael and Wolfgang did an amazing job putting together a really nice web interface, publishing the data in XHTML+RDFa. They also included some interlinking, especially for geographical locations, which are now linked to Geonames.

So what's the point, from my music geek point of view? Well, now, after aggregating Semantic Web data about my music collection (using these tools), I can sort hip-hop artists by the murder rates in their cities :-) This is quite fun as it is (especially as the Eurostat dataset holds a really diverse range of statistics), but it would be really interesting to mine that data for associations between statistics and musical facts. This could lead to interesting sociological results (e.g. how does musical "genre" associate with particular socio-economic indicators?).

Tuesday 18 March 2008

Describing a recording session in RDF

Danny Ayers just posted a new Talis Platform application idea dealing with music/audio equipment. As I was thinking it would actually be nice to have a set of web identifiers and corresponding RDF representations for audio equipment, I remembered a small Music Ontology example I wrote about a year ago. In fact, the Music Ontology (along with the Event ontology) is expressive enough to handle the description of recording sessions. Here is a small excerpt of such a description:

@prefix mo: <http://purl.org/ontology/mo/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix event: <http://purl.org/NET/c4dm/event.owl#>.
@prefix rd: <http://example.org/audioequipment/>.
@prefix : <#>.

:rec a mo:Recording;
   rdfs:label "live recording of my band in studio";
   event:sub_event :guitar1, :guitar2, :drums1, :kick1, :sing.

:sing a mo:Recording;
   rdfs:label "Voice recorded with an SM57";
   event:factor rd:sm57;
   event:place [rdfs:label "Middle of the room - I could be more precise here"].

:kick1 a mo:Recording;
   rdfs:label "Kick drum using a Shure PG52";
   event:factor rd:pg52;
   event:place [rdfs:label "Kick drum microphone location"].

Well, it would indeed be nice if the rd namespace could point to something real! Who would fancy RDFising Harmony Central? :-)

Tuesday 30 October 2007

Specifications for the Event and the Timeline ontologies

It has been a long time since my last post, but I was busy traveling (ISMIR 2007, ACM Multimedia, and AES), and also took some holidays afterwards (first ones since last Xmas... it was great :-) ).

Anyway, in my slow process of getting back to work, I finally wrote specification documents for the Timeline ontology and the Event ontology, which Samer Abdallah and I worked on three years ago. These are really early documentation drafts, though, and might be a bit unclear, so don't hesitate to send me comments about them!

The Timeline ontology, extending some OWL-Time concepts, makes it possible to address time points and intervals on multiple timelines, backing signals, videos, performances, works, scores, etc. For example, using this ontology, you can express "from 1 minute and 21 seconds to 1 minute and 55 seconds on this signal".
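As a sketch of what such a statement looks like in Turtle (the property names below are from memory and may differ slightly from the published spec; the ex: identifiers are made up):

```turtle
@prefix tl:  <http://purl.org/NET/c4dm/timeline.owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/audio/> .

# "From 1 minute 21 seconds to 1 minute 55 seconds on this signal's timeline"
ex:solo a tl:Interval ;
    tl:onTimeLine ex:signalTimeline ;            # the timeline backing the signal
    tl:beginsAtDuration "PT1M21S"^^xsd:duration ;
    tl:durationXSDDuration "PT34S"^^xsd:duration .
```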

Timeline ontology

The Event ontology allows us to deal with, well, events. In it, events are seen as arbitrary classifications of space/time regions. This definition makes it extremely flexible: it covers everything from music festivals to conferences, meeting notes or even annotations of a signal. It is also extremely simple, defining one single concept (event) and five properties (agent, factor, product, place and time).
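A minimal sketch of an event using all five properties (the ex: identifiers are invented for illustration):

```turtle
@prefix event: <http://purl.org/NET/c4dm/event.owl#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:    <http://example.org/gigs/> .

ex:gig a event:Event ;
    rdfs:label "Band performance" ;
    event:agent ex:theBand ;       # who was actively involved
    event:factor ex:sm58 ;         # a passive factor (here, a microphone)
    event:product ex:liveSignal ;  # what came out of the event
    event:place ex:smallClub ;     # where it happened
    event:time ex:thatNight .      # when it happened
```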

Event ontology

The following representations are available for these ontology resources:

  • RDF/XML

$ curl -L -H "Accept: application/rdf+xml" http://purl.org/NET/c4dm/event.owl

  • RDF/Turtle

$ curl -L -H "Accept: text/rdf+n3" http://purl.org/NET/c4dm/event.owl

  • Default (XHTML)

$ curl -L http://purl.org/NET/c4dm/event.owl

And also, make sure you check out the Chord ontology designed by Chris Sutton, and the associated URI service (e.g. A major with an added 7th). All the code (RDF, specification, specification generation scripts, URI parsing, 303 stuff, etc.) is available in the motools SourceForge project.

Tuesday 14 August 2007

New revision of the Music Ontology

The latest revision of the Music Ontology (1.12) is finally out - it did indeed take some time to get through all the suggested changes on the TODO list! So, what's new in this release?

  • The Instrument concept is now linked to Ivan's taxonomy of musical instruments, expressed in SKOS and extracted from the Musicbrainz instrument taxonomy;
  • Some peer-2-peer related concepts (Bittorrent and ED2K, two new subconcepts of MusicalItem);
  • A large amount of URI refactoring for predicates: camelCase becomes camel_case, and nouns are used instead of verbs, to be more N3-compliant. The older predicates are still in the ontology, but are marked as deprecated and declared as owl:sameAs the newer predicates - so datasets still using the old ones won't hold dead links;
  • A large number of term descriptions have been rewritten to clearly state in which cases the terms should be used, where this can be a bit ambiguous. For example, available_as and free_download: one links to an item (something like ed2k://...), and the other links to a web page giving access to the song (perhaps through a Flash player);
  • Terms are annotated with a mo:level property, specifying the level (1, 2 or 3) they belong to. Terms in level 1 describe simple editorial information (ID3v1-like), terms in level 2 describe workflow information (this work was composed by Schubert, performed 10 times, but only 2 of these performances have been recorded), and terms in level 3 describe the inner structure of the different events composing this workflow (at this time, this performer was playing in this particular key);
  • But surely this release's main improvement lies in the infrastructure for maintaining the code and the documentation. MO now has a dedicated SourceForge project, with a Subversion repository holding the up-to-date RDF code, the whole tool chain used to generate the specification, and a couple of related projects (which I will describe in more detail in later posts). Drop me a line if you want to be registered as a developer on the project!
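The deprecation scheme for renamed predicates can be sketched like this in Turtle (mo:availableAs as the old camelCase form is my guess at an example; the vocab-status annotation is one plausible way to mark deprecation):

```turtle
@prefix mo:  <http://purl.org/ontology/mo/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix vs:  <http://www.w3.org/2003/06/sw-vocab-status/ns#> .

# Hypothetical: the old camelCase predicate kept for compatibility,
# mapped onto its snake_case replacement
mo:availableAs a owl:ObjectProperty ;
    vs:term_status "deprecated" ;
    owl:sameAs mo:available_as .
```

Datasets still using the old URI thus remain interpretable against the new terms.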

Still, there are a couple of things I'd like to do before the end of the week, like replacing the examples (some of which are pretty outdated, or just wrong) with real-world MO data (as there begins to be quite a lot available out there :-) ).

Anyway, thanks to everyone who contributed to this release (especially Fred and Chris, and all the people on the mailing list who suggested changes)!!

Wednesday 18 July 2007

Music Ontology: Some thoughts about ontology design

Today, I came across this blog post by Seth Ladd, which actually has nothing to do with ontology design, but with a RESTful way of designing an account activation system. Anyway, its last paragraph says:

In summary, I love working with REST because it forces me to think in nouns, which are classes. I find it easier to model the world with nouns than with verbs. Plus, the problem with verbs is you can’t say anything about them, so you lose the ability to add metadata to the events in the system.

This particular sentence reminded me of a lot of discussions on the MO mailing list, which happened when we started looking at describing the music production workflow (an idea coming from the older Music Production Ontology), with the Event ontology as a foundation for it. Indeed, the ontology started with only a small number of concepts (basically, only the 4 standard FRBR terms), but with many relationships trying to cover a wide range: from this expression is in fact an arrangement of this work to this person is playing this instrument. But once you want to be more expressive, you are stuck. For example, you can't express things such as this person is playing this instrument in this particular performance anymore---you can't say anything about verbs (unless you go into RDF reification, but, well, who really wants to go into that? :-) ).
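For contrast, here is how the event-based approach recovers that statement: instead of a bare plays verb, the performance itself becomes a resource we can describe (the ex: identifiers are made up for illustration):

```turtle
@prefix mo:    <http://purl.org/ontology/mo/> .
@prefix event: <http://purl.org/NET/c4dm/event.owl#> .
@prefix ex:    <http://example.org/people/> .

# "This person is playing this instrument in this particular performance"
ex:perf1 a mo:Performance ;
    event:agent ex:alice ;    # the person
    event:factor ex:violin .  # the instrument, in this specific performance
```

Any further metadata (time, place, recording equipment) can now be attached directly to ex:perf1.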

Workflow diagram

When you start talking about a workflow of interconnected events (composition/arrangement/performance/recording, for example), you limit the number of relationships you have to provide (ultimately, relations between things are all held by an event, so you just need the five properties defined in the Event ontology) in favor of some event concepts and some concepts covering your objects (musical work, score, signal, etc.). Now you can attach any information you want to any of these events, allowing a large number of extensions to be built on top of your ontology. For example, we can refer to a custom recording-device taxonomy by just stating something like ex:myrecording event:factor ex:shureSM58.

Moreover, the Event ontology also provides a way to break down events, so you can even break complex events (such as a group performance) into simpler events (a particular performer playing a particular instrument at a particular time).

(Actually, there are lots of papers on this sort of subject, like these ones on the ABC/Harmony project, this one on token reification in temporal reasoning or this one on modularisation of domain ontologies.)

Thursday 5 July 2007

Specification generation script

I just put online a small Prolog script that generates an XHTML specification from an RDF vocabulary (it should work for both RDFS and OWL). Its external behavior is really similar to that of the Python script the SIOC community uses (thanks Uldis for the code :-) ). It provides a few enhancements, though, like support for term status, OWL constructs, and a few other things. You can check the Music Ontology specification to see an example output, generated from an RDFS/OWL file.

Here is the script.