DBTune blog

Tag - rdf

Tuesday 13 July 2010

First BBC microsite powered by a triple-store

Jem Rayfield wrote a very interesting post on the technologies behind the BBC World Cup web site, which also got covered by ReadWriteWeb.

All this is very exciting: the World Cup web site proves that triple-store technologies can be used to drive a production web site with significant traffic. I am expecting lots more parts of the BBC web infrastructure to evolve in the same way :-)

There are two issues we are still trying to solve, though:

  • We need to be able to cluster our triples in several dimensions. For example, we may want to have a graph for a particular programme, and a much larger graph for a particular dataset (e.g. programme data, wildlife finder data, world cup data). The smaller graph is used to make our updates relatively cheap (we replace the whole graph whenever we receive an update). The bigger graph is used to give some degree of isolation between the different sources of data. For that, we need graphs within graphs. This can be done with N3-type graph literals (see the sketch after this list), but is impossible to achieve in a standard quad-store setup, where a single triple can't be part of several graphs.
  • With regards to programme data, the main bottleneck we are facing is the number of updates per second we need to process, which most available triple stores struggle to keep up with. The 4store instance on DBTune does keep up, but it has a negative impact on query performance, as write operations block reads. We were quite surprised to see that the available triple-store benchmarks do not take write throughput into account!
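
To make the first point a bit more concrete, here is a minimal N3 sketch of what graphs within graphs could look like using graph literals (the :holds property and all URIs are hypothetical, purely for illustration):

@prefix : <http://example.org/graphs#> .                 # hypothetical namespace
@prefix po: <http://purl.org/ontology/po/> .

# The dataset-level graph groups the small per-programme graphs:
# each inner graph literal is what gets replaced wholesale on an update,
# while the outer resource keeps the different data sources isolated.
:programmeData :holds
    { <http://example.org/programmes/p1#programme> a po:Episode } ,
    { <http://example.org/programmes/p2#programme> a po:Episode } .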

Wednesday 29 October 2008

Freebase does linked data!

Just a small post, live from ISWC: Freebase does linked data!

You can try it there, and you can try this instance, for example.

Freebase linked data

Added to David Huynh's wonderful Parallax, that's a lot of great news coming from the other side of the Atlantic :-)

Now, to see whether their linked data actually uses the Web! Do they link to other web identifiers, available outside Freebase?

I also just noticed something weird: the read/write permissions are attached to the tracks/films/whatever resources, instead of being attached to the RDF document itself.

Tuesday 1 July 2008

Echonest Analyze XML to Music Ontology RDF

I wrote a small XSL stylesheet to transform the XML results of the Echonest Analyze API into Music Ontology RDF. The Echonest Analyze API is a really great (and simple) web service to process audio files and get back an XML document describing some of their features (rhythm, structure, pitch, timbre, etc.). A lot of people have already done really great things with it, from collection management to visualisation.

The XSL is available on that page. The resulting RDF can be queried using SPARQL. For example, the following query selects the boundaries of structural segments (chorus, verse, etc.):

PREFIX af: <http://purl.org/ontology/af/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>

SELECT ?start ?duration
FROM <http://dbtune.org/echonest/analyze-example.rdf>
WHERE
{
?e      a af:StructuralSegment;
        event:time ?time.
?time   tl:start ?start;
        tl:duration ?duration.
}
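
The data matched by this query looks roughly like the following sketch (the URIs and literal values are illustrative; the exact output of the stylesheet may differ in detail):

@prefix af: <http://purl.org/ontology/af/> .
@prefix event: <http://purl.org/NET/c4dm/event.owl#> .
@prefix tl: <http://purl.org/NET/c4dm/timeline.owl#> .
@prefix : <http://example.org/analyze-sketch#> .         # hypothetical namespace

# One structural segment (e.g. a chorus), with its time extent
:segment1 a af:StructuralSegment ;
    event:time [
        tl:start "21.5" ;          # illustrative values
        tl:duration "32.1"
    ] .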

I also added to that page the small bit you need to add to the Echonest Analyze XML to make it GRDDL-ready. That means the XML document can be automatically translated to actual RDF data (which can then be aggregated, stored, linked to, queried, etc.).

<Analysis    xmlns:grddl="http://www.w3.org/2003/g/data-view#" 
                grddl:transformation="http://dbtune.org/echonest/echonest.xsl">
...
</Analysis>

This provides a lot more data to aggregate for describing my music collection!

If there is one thing I really wish could be integrated in the Echonest API, it would be a Musicbrainz lookup... Right now, I have to manually link the data I get from it to the rest of my aggregated data. If the Echonest results could include a link to the corresponding Musicbrainz resource, it would really simplify this step :-)

Wednesday 25 June 2008

Linking Open Data: BBC playcount data as linked data

For the Mashed event this weekend, the BBC released some really interesting data. This includes playcount data, stating how often an artist is featured within particular BBC programmes (at the brand or episode level).

During the event, I wrote some RDF translators for this data, linking web identifiers in the DBTune Musicbrainz linked data to web identifiers in the BBC Programmes linked data. Kurt, Ben and I used it in our hack, and Ben made a nice write-up about it. By finding web identifiers for tracks in a collection, following links to the BBC Programmes data, and finally connecting this Programmes data to the box holding a year's worth of recorded BBC radio programmes that was available at the event, we can quite easily generate playlists from an audio collection. Two Python scripts implementing this mechanism are available there. The first one uses only brands data, whereas the second one uses episodes data (and therefore gets fewer, more accurate items in the resulting playlist). Finally, the thing we spent the most time on was the SQLite storage for our RDF cache :-)

This morning, I published the playcount data as linked data. I wrote a new DBTune service for that. It publishes a set of web identifiers for playcount data, interlinking Musicbrainz and BBC Programmes. I also put online a SPARQL end-point holding all this playcount data along with aggregated data from Musicbrainz and the BBC Programmes linked data (around 2 million triples overall).
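
To give an idea of the shape of this data, here is a small sketch of a playcount resource, following the terms used in the query below (the pc: namespace here is a stand-in for the one actually served by the end-point, and the URIs and figures are made up):

@prefix mo: <http://purl.org/ontology/mo/> .
@prefix po: <http://purl.org/ontology/po/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix pc: <http://example.org/playcount/> .            # stand-in for the playcount vocabulary
@prefix : <http://example.org/playcount-sketch#> .       # hypothetical namespace

# A brand points at a playcount resource, which links an artist to a count
:someBrand a po:Brand ;
    dc:title "Some BBC brand" ;
    pc:playcount :pc1 .

:pc1 pc:object :someArtist ;
    pc:count 18 .

:someArtist a mo:MusicArtist ;
    foaf:name "The Beatles" .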

For example, you can try the following SPARQL query:

SELECT ?brand ?title ?count
WHERE {
   ?artist a mo:MusicArtist;
      foaf:name "The Beatles". 
   ?pc pc:object ?artist;
       pc:count ?count.
   ?brand a po:Brand;
       pc:playcount ?pc;
       dc:title ?title 
    FILTER (?count>10)}

This will return every BBC brand that has featured The Beatles more than 10 times.

Thanks to Nicholas and Patrick for their help!

Thursday 12 June 2008

Describing the content of RDF datasets

There seems to be an overall consensus in the Linking Open Data community that we need a way to describe in RDF the different datasets published and interlinked within the project. There is already a Wiki page detailing some aspects of the corresponding vocabulary, called voiD (vocabulary of interlinked datasets).

One thing I would really like this vocabulary to do is to describe exactly the inner content of a dataset - what can we find in this SPARQL end-point, or in this RDF document? I have thought quite a lot about this recently, as I am beginning to really need it. Indeed, when you have RDF documents describing lots of audio annotations, whose generation is really computationally intensive, you want to pick just the one that fits your request. There have been quite a lot of similar efforts in the past. However, most of them rely on one or another sort of reification, which makes them quite hard to actually use.

After some failed tries, I came up with the following, which I hope is easy and expressive enough :-)

It relies on a single property void:example, which links a resource identifying a particular dataset to a small RDF document holding an example of what you could find in that dataset. Then, with just a bit of SPARQL magic, you can easily query for datasets having a particular capability. Easy, isn't it? :-)

Here is a real-world example of that. A first RDF document describes one of the DBTune datasets:

:ds1
        a void:Dataset;
        rdfs:label "Jamendo end-point on DBtune";
        dc:source <http://jamendo.com/>;
        foaf:maker <http://moustaki.org/foaf.rdf#moustaki>;
        void:sparql_end_point <http://dbtune.org:2105/sparql/>;
        void:example <http://moustaki.org/void/jamendo_example.n3>;
        .

The void:example property points towards a small RDF file, giving an example of what you can find within this dataset.

Then, the following SPARQL query asks whether this dataset has a SPARQL end-point and holds information about music records, associated tags, and places to download them.

PREFIX void: <http://purl.org/ontology/void#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX tags: <http://www.holygoat.co.uk/owl/redwood/0.1/tags/>

ASK
FROM NAMED <http://moustaki.org/void/void.n3>
FROM NAMED <http://moustaki.org/void/jamendo_example.n3>
{
        GRAPH <http://moustaki.org/void/void.n3> {
                ?ds a void:Dataset;
                        void:sparql_end_point ?sparql;
                        void:example ?ex.
        }
        GRAPH ?ex {
                ?r a mo:Record;
                        mo:available_as ?l;
                        tags:taggedWithTag ?t.
        }
}

I tried this query with ARQ, and it works perfectly :-)

$ sparql --query=void.sparql
Ask => Yes

Update: It also works with ARC2, although it does not automatically load the documents given in the SPARQL FROM clauses. You can try the same query on this SPARQL end-point, which has already loaded the two documents (the voiD description and the example).

Update 2: A nice blog post about automatically generating the data you need for describing an end-point - thanks shellac for the pointer!

Update 3: Following discussion on the #swig IRC channel.

Tuesday 20 May 2008

Ceriese: RDF translator for Eurostat data

Riese logo

Some time ago, I did a bit of work on the RIESE project, which aims at publishing the Eurostat dataset on the Semantic Web and interlinking it with further datasets (e.g. Geonames and DBpedia). This can look a bit far from my interests in music data, but there is a connection which illustrates the power of linked data, as explained at the end of this post.

Original data

There are three distinct things we consider in the Eurostat dataset:

  • A table of contents in HTML, defining the hierarchical structure of the Eurostat datasets;
  • Tab-separated-values dictionary files, defining the ~80 000 data codes used in the dataset (e.g. "eu27" for the European Union of 27 countries);
  • The actual statistical data, in tab-separated-values files: around 4000 datasets for roughly 350 million statistical items.

Ontology

The first thing we need to figure out when exposing data on the Semantic Web is the model we'll link to. This led to the design of SCOVO (the Statistical Core Vocabulary). The concepts in this ontology can be depicted as follows:

SCOVO ontology

The interesting thing about this model is that the statistical item is considered a primary entity. We used the Event ontology as a basis - a statistical item is a particular classification of a space/time region. This makes the model really flexible and extensible. We can, for example, attach multiple dimensions to a particular item, resources pertaining to its creation, etc.
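
As a rough illustration (the scv: namespace is quoted from memory, and the other terms and figures are made up), a single statistical item could be described along these lines:

@prefix scv: <http://purl.org/NET/scovo#> .              # SCOVO namespace, from memory
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix : <http://example.org/eurostat-sketch#> .        # hypothetical namespace

# The statistical item is the primary entity: it carries the value,
# and its dimensions (time period, area, ...) are attached to it.
:item a scv:Item ;
    rdf:value "495.1" ;
    scv:dataset :population ;
    scv:dimension :year2007 , :eu27 .

:year2007 rdfs:label "2007" .
:eu27 rdfs:label "European Union (27 countries)" .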

RDF back-end

I wanted to see how to publish such large amounts of RDF data and how my publication tools perform, so I designed the Ceriese software to handle that.

The first real concern when dealing with such large amounts of data is, of course, scalability. The overall Eurostat dataset is over 3 billion triples. Given that we don't have high-end machines with lots of memory, using the core SWI RDF store was out of the question (I wonder if any in-memory triple store can handle 1 billion triples, btw).

So there are three choices at this point:

  • Use a database-backed triple store;
  • Dump static RDF files in a file system served through Apache;
  • Generate the RDF on-the-fly.

We don't have the sort of money it takes (for both the hardware and the software) for the first choice to really scale, so we tried the second and the third solutions. I took my old Prolog-2-RDF software, which I use to publish the Jamendo dataset, and we wrote some P2R mapping files converting the tab-separated-values files. We made P2R dump small RDF files in a file-system hierarchy, corresponding to the descriptions of the different web resources we wanted to publish. Then, some Apache tweaks and Michael and Wolfgang's work on XHTML/RDFa publishing were enough to make the dataset available on the web of data.

But this approach had two main problems. First, it took ages to run this RDF dump, so we never actually managed to complete it. Second, it was impossible to provide a SPARQL querying facility: no aggregation of the data was made available.

So we eventually settled on the third solution. I put on my Prolog hacker hat and tried to optimise P2R to make it fast enough. I did it by using the same trick I used in my small N3 reasoner, Henry. P2R mappings are compiled as native Prolog clauses (rdf(S,P,O) :- ...), which cuts down the search space a lot. TSV files are accessed within those rules and parsed on-the-fly. The location of the TSV file to access is derived from a small in-memory index. Parsed TSV files are cached for a whole query, to avoid parsing the same file for different triple patterns in the query.

The same mechanism is applied to derive a SKOS hierarchy from the HTML table of contents.

Now, a DESCRIBE query takes less than 0.5 seconds for any item, on my laptop. Not perfect, but still...

A solution to improve the access time a lot would be to dump the TSV files into a relational database, and access this database in our P2R mappings instead of the raw TSV files.

Trying it out and creating your own SPARQL end-point from Eurostat data is really easy.

  • Get SWI-Prolog;
  • Get the software from there;
  • Get the raw Eurostat data (or get it from the Eurostat web-site, as this one can be slightly out-dated);
  • Put it in data/, in your Ceriese directory;
  • Launch start.pl;
  • Go to http://localhost:3020/

Now what?

Michael and Wolfgang did an amazing job putting together a really nice web interface, publishing the data in XHTML+RDFa. They also included some interlinking, especially for geographical locations, which are now linked to Geonames.

So what's the point from my music-geek point of view? Well, now, after aggregating Semantic Web data about my music collection (using these tools), I can sort hip-hop artists by the murder rate in their city :-) This is quite fun as it is (especially as the Eurostat dataset holds a really diverse range of statistics), but it would be really interesting to mine that data for associations between statistical data and musical facts. This would surely lead to interesting sociological results (e.g. how does a musical "genre" associate with particular socio-economic indicators?)

Monday 7 April 2008

D2RQ mapping for Musicbrainz

I just started a D2R mapping for Musicbrainz, which makes it possible to create a SPARQL end-point and to provide linked data access to Musicbrainz fairly easily. A D2R instance loaded with the mapping as it currently stands is also available (be gentle, it is running on a cheap computer :-) ).
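
To give an idea of what such a mapping looks like, here is a rough fragment in D2RQ's Turtle-based mapping language (a sketch from memory: the d2rq: namespace, table and column names, and connection details are illustrative and may not match the actual mapping):

@prefix d2rq: <http://www.wbsg.de/bizer/D2RQ/0.1#> .     # D2RQ namespace, from memory
@prefix mo: <http://purl.org/ontology/mo/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix map: <#> .

map:database a d2rq:Database ;
    d2rq:jdbcDSN "jdbc:postgresql://localhost/musicbrainz" ;    # illustrative connection details
    d2rq:jdbcDriver "org.postgresql.Driver" .

# Map rows of the Musicbrainz artist table to mo:MusicArtist resources
map:Artist a d2rq:ClassMap ;
    d2rq:dataStorage map:database ;
    d2rq:uriPattern "artist/@@artist.gid@@" ;
    d2rq:class mo:MusicArtist .

# Expose the artist name as foaf:name
map:artistName a d2rq:PropertyBridge ;
    d2rq:belongsToClassMap map:Artist ;
    d2rq:property foaf:name ;
    d2rq:column "artist.name" .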

On top of the things that are available within the Zitgist mapping, it adds:

  • A SPARQL end-point;
  • Support for tags;
  • Support for a couple of advanced relationships (still working my way through them, though);
  • An instrument taxonomy directly generated from the db, and related to performance events;
  • Support for orchestras;
  • Links to DBpedia for places and to Lingvoj for languages.

There is still a lot to do, though: it is really a start. The mapping is available on the motools sourceforge project. I hope to post a follow-up soon! (including examples of funny SPARQL queries :-) ).

Update: For some obscure port-forwarding reasons, the SNORQL interface to the SPARQL end point does not work on the test server.

Update 2: This is fixed. (thanks to the anonymous SPARQL crash tester which helped me find the bug, by the way :-) )

Wednesday 2 April 2008

13.1 billion triples

After a rough estimation, it looks like the services hosted on DBTune provide access to 13.1 billion triples, making them a significant addition to the data web!

Here is the break-down of this estimation:

  • MySpace: 250 million people * 50 triples (on average) = 12.5 billion triples;
  • AudioScrobbler: 1.5 million users (only Europe?) * 400 triples = 600 million triples;
  • Jamendo: 1.1 million triples + 5000 links to other data sources;
  • Magnatune: 322 000 triples + 233 links;
  • BBC John Peel sessions: 277 000 triples + 2100 links;
  • Chord URI service: I don't count it, as it is potentially infinite (the RDF descriptions are generated from the chord symbol in the URI).

However, SPARQL end-points are not available for AudioScrobbler and MySpace, as the RDF is generated on-the-fly - from the XML feeds for the former, and from scraping for the latter.

Now, I wish linked data could be provided directly by the data sources themselves :-) (Again, all the code used to run the DBTune services is available in the motools project on Sourceforge).

Tuesday 18 March 2008

Describing a recording session in RDF

Danny Ayers just posted a new Talis Platform application idea dealing with music/audio equipment. As I was thinking it would actually be nice to have a set of web identifiers and corresponding RDF representations for audio equipment, I remembered a small Music Ontology example I wrote about a year ago. In fact, the Music Ontology (along with the Event ontology) is expressive enough to handle the description of recording sessions. Here is a small excerpt of such a description:

@prefix mo: <http://purl.org/ontology/mo/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix event: <http://purl.org/NET/c4dm/event.owl#>.
@prefix rd: <http://example.org/audioequipment/>.
@prefix : <#>.

:rec a mo:Recording;
   rdfs:label "live recording of my band in studio";
   event:sub_event :guitar1, :guitar2, :drums1, :kick1, :sing.

:sing a mo:Recording;
   rdfs:label "Voice recorded with a SM57";
   event:factor rd:sm57;
   event:place [rdfs:label "Middle of the room - I could be more precise here"].

:kick1 a mo:Recording;
   rdfs:label "Kick drum using a Shure PG52";
   event:factor rd:pg52;
   event:place [rdfs:label "Kick drum microphone location"].

Well, it would indeed be nice if the rd namespace could point to something real! Who would fancy RDFising Harmony Central? :-)
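
A description served from such a namespace would only need to be something along these lines (a hypothetical sketch - the rd:Microphone class and the DBpedia link are made up for illustration):

@prefix rd: <http://example.org/audioequipment/> .       # same hypothetical namespace as above
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# rd:Microphone and the maker link are made up, just to show the idea
rd:sm57 a rd:Microphone ;
    rdfs:label "Shure SM57" ;
    foaf:maker <http://dbpedia.org/resource/Shure> .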

Wednesday 12 March 2008

MySpace RDF service

Thanks to the amazing work of Kurt and Ben on the MyPySpace project, members of the MySpace social network can have a Semantic Web URI!

This small service provides such URIs and corresponding FOAF (top friends, depiction, name) and Music Ontology (URIs of available tracks in the streaming audio cache) RDF representations.

That means I can add such statements to my FOAF profile:

<http://moustaki.org/foaf.rdf#moustaki> foaf:knows <http://dbtune.org/myspace/lesversaillaisesamoustache>.

And then, using the Tabulator Firefox extension:

MySpace friends

PS: The service is still a bit slow and can be highly unstable, though - it is slightly faster with URIs using MySpace UIDs.

PS2: We don't host any data - everything is scraped on the fly using the MyPySpace tools.

Monday 10 March 2008

Finding a new flat using Exhibit

You can't play around with music data and RDF all day. Sometimes, you also need to find accommodation in the real world :-) But, as it was kind of difficult to motivate myself for the task, I thought it would be easier if there was something RDFy about it.

So I created a new project on GitHub (thanks for the invite, Tom, GitHub is great!!) to scrape RDF data out of Gum Tree. This Python hack can scrape single advertisements for flats on Gum Tree, geo-code the corresponding location to find the lat/long coordinates, and output the corresponding data as RDF. It can also process a Gum Tree RSS feed to scrape multiple ads, and produce an Exhibit 2.0 JSON file, so it becomes really easy to create an Exhibit out of Gum Tree.
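
The RDF emitted for a single advertisement looks roughly like this (a sketch: the ad-specific terms are hypothetical, and the title and coordinates are made up):

@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix : <http://example.org/gumtree#> .                # hypothetical ad vocabulary

# One scraped advertisement, geo-coded so it can be plotted on a map
:ad123 a :FlatAdvertisement ;
    dc:title "One bedroom flat, Islington" ;
    dc:description "Lovely one-bedroom flat close to the tube..." ;
    geo:lat "51.536" ;
    geo:long "-0.103" .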

For example, this Exhibit shows the last 40 Gum Tree ads posted for one-bedroom flats north of the river.

Gum Tree Exhibit