DBTune blog


Monday 28 July 2008

Music Ontology linked data on BBC.co.uk/music

Just a couple of minutes ago on the Music Ontology mailing list, Nicholas Humfrey from the BBC announced the availability of linked data on BBC Music.

$ rapper -o turtle \
   http://www.bbc.co.uk/music/artists/cc197bad-dc9c-440d-a5b5-d52ba2e14234

[...]
<http://www.bbc.co.uk/music/artists/cc197bad-dc9c-440d-a5b5-d52ba2e14234#artist>
   a mo:MusicGroup;
   foaf:name "Coldplay";
   owl:sameAs <http://dbpedia.org/resource/Coldplay>;
   mo:member
<http://www.bbc.co.uk/music/artists/18690715-59fa-4e4d-bcf3-8025cf1c23e0#artist>,
<http://www.bbc.co.uk/music/artists/d156ceb2-fd90-4e82-baea-829bbdf1c127#artist>,
<http://www.bbc.co.uk/music/artists/6953c4db-7214-4724-a140-e87550bde420#artist>,
<http://www.bbc.co.uk/music/artists/98d1ec5a-dd97-4c0b-9c83-7928aac89bca#artist>
[...]

This is just really, really, really great... Congratulations to the /music team!

Update: Tom Scott just wrote a really nice post about the new BBC music site, explaining what the BBC is trying to achieve by going down the linked data path.

Wednesday 9 July 2008

Nominated!

We learned yesterday that DBTune was nominated for the Triplify Challenge! The other seven projects are really interesting as well, so I guess the competition will be really tough! The final results will be announced at the I-Semantics conference in early September.

Also, Tim Berners-Lee gave a great talk about linked data and the semantic web on Radio 4 earlier today. The first use-case he mentions sounds quite familiar: finding bands based on geo-location data. He had already mentioned that in one of his blog posts, linking to this screencast.

An interesting discussion took place on the Linking Open Data mailing list just afterwards, to gather use-cases for explaining to the general public what linked data can be useful for.

Wednesday 25 June 2008

Linking Open Data: BBC playcount data as linked data

For the Mashed event this weekend, the BBC released some really interesting data. This includes playcount data, stating how often an artist is featured within a particular BBC programme (at the brand or episode level).

During the event, I wrote some RDF translators for this data, linking web identifiers in the DBTune Musicbrainz linked data to web identifiers in the BBC Programmes linked data. Kurt, Ben and I used them in our hack, and Ben made a nice write-up about it. We find web identifiers for the tracks in a collection, follow links to the BBC Programmes data, and finally connect this Programmes data to the box holding all BBC radio programmes recorded over a year, which was available at the event. This way, we can quite easily generate playlists from an audio collection. Two Python scripts implementing this mechanism are available there. The first one uses solely brands data, whereas the second one uses episodes data (and therefore yields fewer and more accurate items in the resulting playlist). Finally, the thing we spent the most time on was the SQLite storage for our RDF cache :-)
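The chain of lookups described above can be sketched roughly as follows. This is not the code of the hack: the three dictionaries are toy stand-ins for the Musicbrainz, playcount and recorded-programme data, and every identifier and file name in them is made up for illustration.

```python
# Toy stand-ins for the three data sources; all identifiers are hypothetical.
TRACK_TO_ARTIST = {
    "track:1": "artist:beatles",
    "track:2": "artist:coldplay",
}
# artist -> {programme identifier: playcount}
ARTIST_PLAYCOUNTS = {
    "artist:beatles": {"brand:a": 12, "brand:b": 3},
    "artist:coldplay": {"brand:b": 7},
}
# programme identifier -> recorded audio files available in the box
RECORDINGS = {
    "brand:a": ["a-2008-06-01.mp3"],
    "brand:b": ["b-2008-06-02.mp3", "b-2008-06-09.mp3"],
}


def build_playlist(tracks, min_count=1):
    """Follow track -> artist -> programme -> recording links."""
    scores = {}
    for track in tracks:
        artist = TRACK_TO_ARTIST.get(track)
        if artist is None:
            continue  # no web identifier was found for this track
        for programme, count in ARTIST_PLAYCOUNTS.get(artist, {}).items():
            if count >= min_count:
                scores[programme] = scores.get(programme, 0) + count
    playlist = []
    # Programmes that played the collection's artists most often come first.
    for programme in sorted(scores, key=lambda p: (-scores[p], p)):
        playlist.extend(RECORDINGS.get(programme, []))
    return playlist


playlist = build_playlist(["track:1", "track:2", "track:3"])
print(playlist)
```

Swapping brand identifiers for episode identifiers in the two last dictionaries gives the episode-level variant, which is where the fewer and more accurate results come from.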

This morning, I published the playcount data as linked data. I wrote a new DBTune service for that. It publishes a set of web identifiers for playcount data, interlinking Musicbrainz and BBC Programmes. I also put online a SPARQL end-point holding all this playcount data along with aggregated data from Musicbrainz and the BBC Programmes linked data (around 2 million triples overall).

For example, you can try the following SPARQL query:

SELECT ?brand ?title ?count
WHERE {
   ?artist a mo:MusicArtist;
      foaf:name "The Beatles".
   ?pc pc:object ?artist;
      pc:count ?count.
   ?brand a po:Brand;
      pc:playcount ?pc;
      dc:title ?title.
   FILTER (?count > 10)
}

This will return every BBC brand that has featured The Beatles more than 10 times.
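If you would rather send this query from a script, here is a minimal sketch that builds the request URL for other artists and thresholds. The endpoint address is a placeholder, and the prefix IRIs (in particular the one for pc:) are my assumptions, since the query above relies on prefixes being declared elsewhere. The only thing taken for granted is the standard SPARQL protocol convention of passing the query in a `query` parameter.

```python
from urllib.parse import urlencode

# Assumed prefix declarations; the query in the post does not spell them out.
PREFIXES = """\
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX po: <http://purl.org/ontology/po/>
PREFIX pc: <http://purl.org/ontology/playcount/>
"""


def playcount_query(artist_name, min_count):
    """Return the query text for brands featuring an artist often enough."""
    # Escape backslashes and double quotes so the name stays a valid literal.
    safe = artist_name.replace("\\", "\\\\").replace('"', '\\"')
    return PREFIXES + f"""\
SELECT ?brand ?title ?count
WHERE {{
   ?artist a mo:MusicArtist;
      foaf:name "{safe}".
   ?pc pc:object ?artist;
      pc:count ?count.
   ?brand a po:Brand;
      pc:playcount ?pc;
      dc:title ?title.
   FILTER (?count > {int(min_count)})
}}
"""


def request_url(endpoint, query):
    """Encode the query as a SPARQL protocol GET request."""
    return endpoint + "?" + urlencode({"query": query})


# Placeholder endpoint, not the real service address.
url = request_url("http://example.org/sparql", playcount_query("The Beatles", 10))
print(url)
```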

Thanks to Nicholas and Patrick for their help!

Mashed!

I was at Mashed (the former Hack Day) this weekend - a really good and geeky event, organised by the BBC at Alexandra Palace. We arrived on the Saturday morning for some talks, detailing the different things we'd be able to play with over the weekend. Amongst these, a full DVB-T multiplex (apparently, it was the first time since 1956 that a TV signal was broadcast from Alexandra Palace), lots of data from the BBC Programmes team and a box full of radio content recorded over the last year.

After these presentations, the 24-hour hacking session began. We sat down with Kurt and Ben and wrote a small hack which basically starts from a personal music collection and creates a playlist of recorded BBC programmes for you. I will write a bit more about this later today.

During the 24-hour hack, we had a Rock Band session on a big screen, a real-world Tron game (basically, two guys running with GPS phones, guided by two people watching their trail on a Google satellite map :-) ), a rocket launch...

Finally, at 2pm on the Sunday, people presented their hacks. Almost 50 hacks were presented, all extremely interesting. Take a look at the complete list of hacks! On the music side, Patrick's recommender was particularly interesting. It used Latent Semantic Analysis on playcount data for artists in BBC brands and episodes to recommend brands from artists or artists from artists. It gave some surprising results :-) Jamie Munroe resurrected the FPFF Musicbrainz fingerprinting algorithm (which was apparently due to replace the old TRM one before MusicIP offered their services to Musicbrainz) to identify tracks played several times in BBC programmes. The WeDoID3 team talked about creating RSS feeds from embedded metadata in audio and video, but the demo didn't work.

My personal highlight was the hack (which actually won a prize) from Team Bob. Here is a screencast of it:


BBC Dylan - News 24 Revisited (Clip) from James Adam on Vimeo.

Thanks to Matthew Cashmore and the rest of the BBC backstage team for this great event! (and thanks to the sponsors for all the free stuff - I think I have enough T-shirts for about a year now :-))

Tuesday 4 December 2007

Linking open data: interlinking the BBC John Peel sessions and the DBPedia datasets

For the last Hackday, the BBC released data about the John Peel sessions. In June, I published them as Linked Data, using a SWI-Prolog representation of this data, bundled with a custom P2R mapping. But so far, it was not interlinked with any other dataset, making it a small island :-)

In order to enrich my interlinking experience, I wanted to tackle a dataset I had never really tried to link to: DBPedia, holding structured data extracted from Wikipedia (well, the Magnatune RDF dataset is linked to it, but just for geographical locations, which was fairly easy).

And my conclusion is... it is not that easy :-)

Try 1 - Matching on labels:

The first thing I tried was matching directly on labels, using an algorithm which might be summarized by:

1 - For all items I_i in the John Peel sessions dataset we want to link (instances of foaf:Agent or mo:MusicalWork), take the label L_i ({I_i rdfs:label L_i})

2 - Issue the following SPARQL query to the DBPedia dataset:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
SELECT ?u
WHERE 
{       
?u rdfs:label L_i
}

3 - For all results R_i_j of this query, assert I_i owl:sameAs R_i_j

Just for fun, here are the results of such a query. You can't imagine how many people have exactly the same name... For example Jules Verne in the BBC John Peel sessions dataset is quite different from Jules Verne... Also, Jules Verne is quite different from the Jules Verne category, though they share the same label.
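To make the problem concrete, here is a small sketch of the three steps above, run against an in-memory label index instead of the DBPedia SPARQL end-point. The DBPedia identifiers and the Jules Verne item identifier are made up for illustration; only the matching logic mirrors the algorithm.

```python
from collections import defaultdict

# Hypothetical (resource, label) pairs standing in for the DBPedia dataset.
DBPEDIA_LABELS = [
    ("dbpedia:Jules_Verne", "Jules Verne"),
    ("dbpedia:Category:Jules_Verne", "Jules Verne"),
    ("dbpedia:King_Crimson", "King Crimson"),
]

# Step 1: items to link, with their labels (identifiers are illustrative).
PEEL_ITEMS = [
    ("peel:artist/jules-verne", "Jules Verne"),
    ("peel:artist/1036", "King Crimson"),
]


def index_by_label(pairs):
    """Plays the role of the '?u rdfs:label L_i' query in step 2."""
    index = defaultdict(list)
    for resource, label in pairs:
        index[label].append(resource)
    return index


def naive_links(items, index):
    """Step 3: assert owl:sameAs for every resource sharing the label."""
    return [
        (item, "owl:sameAs", resource)
        for item, label in items
        for resource in index.get(label, [])
    ]


links = naive_links(PEEL_ITEMS, index_by_label(DBPEDIA_LABELS))
for link in links:
    print(link)
```

The Jules Verne item ends up with two owl:sameAs links, one of them to a category, which is exactly the kind of wrong link exact label matching produces.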

Try 2 - Still matching on labels, but with restrictions:

In the agent case, disambiguation appears easy to achieve, by just expressing "I am actually looking for someone who could be somehow related to the John Peel sessions". But, err, Wikipedia (hence, DBPedia) is a bit messy sometimes, and it is quite difficult to find a reliable and consistent way of expressing this criterion. So I had to sample the John Peel data (taking some producers, some engineers, some artists, some bands) and work out manually how I could restrict the range of resources I was looking for in DBPedia and still be able to retrieve all my linked agents. This leads to the following SPARQL query (involving L_i as defined earlier):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
SELECT ?u
WHERE 
{       
   {
      {?u <http://dbpedia.org/property/name> L_i} UNION 
      {?u rdfs:label L_i} UNION 
      {?u <http://dbpedia.org/property/bandName> L_i} 
    } 
    {
       {?u <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:infobox_musical_artist>} UNION 
       {?u a <http://dbpedia.org/class/yago/Group100031264>} UNION 
       {?u a ?mus. ?mus rdfs:subClassOf <http://dbpedia.org/class/yago/Musician110340312>} UNION 
       {?u a ?artist. ?artist rdfs:subClassOf <http://dbpedia.org/class/yago/Creator109614315>} 
    } 
}
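The shape of this query is "matches the label through any of three properties" AND "looks music-related through any of four tests". Here is a rough sketch of the same logic over toy resource descriptions. The property and class names are shortened, the class hierarchy is invented, and all resources are made up; it only illustrates how the two UNION groups combine.

```python
# Shortened stand-ins for the three naming properties used in the query.
NAME_PROPERTIES = ("dbpedia:name", "rdfs:label", "dbpedia:bandName")

# Invented class hierarchy: class -> direct superclass.
SUBCLASS_OF = {
    "yago:Guitarist": "yago:Musician",
    "yago:Painter": "yago:Creator",
}

# Made-up resource descriptions.
RESOURCES = {
    "dbpedia:Example_Band": {
        "dbpedia:bandName": ["Example"],
        "templates": ["infobox_musical_artist"],
    },
    "dbpedia:Example_(guitarist)": {
        "rdfs:label": ["Example"],
        "types": ["yago:Guitarist"],
    },
    "dbpedia:Category:Example": {"rdfs:label": ["Example"]},
    "dbpedia:Other_Band": {"dbpedia:name": ["Other"], "types": ["yago:Group"]},
}


def matches_label(resource, label):
    """First UNION group: the label appears under any naming property."""
    return any(label in resource.get(p, ()) for p in NAME_PROPERTIES)


def looks_musical(resource):
    """Second UNION group: any one of the four restrictions holds."""
    types = resource.get("types", ())
    return (
        "infobox_musical_artist" in resource.get("templates", ())
        or "yago:Group" in types
        or any(SUBCLASS_OF.get(t) == "yago:Musician" for t in types)
        or any(SUBCLASS_OF.get(t) == "yago:Creator" for t in types)
    )


def candidates(resources, label):
    """Both groups must hold, as in the conjunction of the two blocks."""
    return [
        uri
        for uri, description in resources.items()
        if matches_label(description, label) and looks_musical(description)
    ]


print(candidates(RESOURCES, "Example"))
```

The category resource shares the label but passes none of the four restrictions, so it is no longer returned.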

And for musical works?

Of course, this does not hold as soon as I broaden the range of resources I want to link. I first tried to use exactly the same methodology (basically restricting the resources I was looking for to be related to the Yago song concept). But, err, it did not work that well :-) You can't imagine how many songs have the same name! Just look at the results - this is enlightening! So far, the best I found was Walked Away, which appears to be the most popular title :-)

So what did I do to disambiguate? I took these results, got the RDF corresponding to the DBPedia resources, and made a literal search using the SWI literal index module on the abstract, looking for the name of the artist involved in the performance of this work. That's a bit hacky, but well, it did well even with cover songs (like Nirvana's Love Buzz).
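The idea can be sketched in a few lines. Note that this swaps the SWI literal index for a plain case-insensitive substring search, and that the candidate identifiers and abstracts below are invented for the example.

```python
def pick_by_artist(abstracts, artist_name):
    """Keep the candidates whose abstract mentions the performing artist."""
    needle = artist_name.casefold()
    return [
        uri
        for uri, abstract in abstracts.items()
        if needle in abstract.casefold()
    ]


# Invented candidates sharing the title "Love Buzz".
CANDIDATES = {
    "dbpedia:Love_Buzz": "A song by Shocking Blue, later covered by Nirvana.",
    "dbpedia:Love_Buzz_(other)": "An unrelated work that shares the title.",
}

print(pick_by_artist(CANDIDATES, "Nirvana"))
```

Because the abstract of the original song tends to mention well-known covers, the cover artist still selects the right resource. A substring search is cruder than a proper literal index: a short artist name can match inside a longer word.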

Results

Surprisingly, the results do not seem to be too bad! I still have to check them more, but nothing seems obviously wrong, at first glance (I went through all the links manually, looking for something that would not make sense). The links are available here.

The SWI-Prolog code doing the trick is available here. Sorry, the code is a bit messy, and got increasingly hacky...

Wednesday 11 July 2007

John Peel sessions available as RDF

Yesterday, I put the John Peel sessions online as linked data (dereferenceable identifiers, content negotiation, RDF, etc.).

It uses the data the BBC released for the Hackday, some weeks ago. I wrote a SWI-Prolog wrapper for this data, which is then made accessible through SPARQL using P2R (which I have updated to handle dynamic construction of literals, by the way) and this mapping. The URIs are then made dereferenceable through UriSpace.

Some documentation is available there.

Here are a bunch of URIs that you can try:

And then, for example

$ curl -L -H "Accept: application/rdf+xml" http://dbtune.org/bbc/peel/artist/1036
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE rdf:RDF [
    <!ENTITY foaf 'http://xmlns.com/foaf/0.1/'>
    <!ENTITY mo 'http://purl.org/ontology/mo/'>
    <!ENTITY rdf 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
    <!ENTITY rdfs 'http://www.w3.org/2000/01/rdf-schema#'>
    <!ENTITY xsd 'http://www.w3.org/2001/XMLSchema#'>
]>

<rdf:RDF
    xmlns:foaf="&foaf;"
    xmlns:mo="&mo;"
    xmlns:rdf="&rdf;"
    xmlns:rdfs="&rdfs;"
    xmlns:xsd="&xsd;"
>
<mo:MusicArtist rdf:about="http://dbtune.org/bbc/peel/artist/1036">
  <rdfs:label rdf:datatype="&xsd;string">King Crimson</rdfs:label>
  <foaf:img rdf:resource="http://bbc.co.uk/music/king_crimson.jpg"/>
  <foaf:name rdf:datatype="&xsd;string">King Crimson</foaf:name>
</mo:MusicArtist>

<rdf:Description rdf:about="http://dbtune.org/bbc/peel/session/1788">
  <mo:performer rdf:resource="http://dbtune.org/bbc/peel/artist/1036"/>
</rdf:Description>

<rdf:Description rdf:about="http://dbtune.org/bbc/peel/session/1789">
  <mo:performer rdf:resource="http://dbtune.org/bbc/peel/artist/1036"/>
</rdf:Description>

</rdf:RDF>
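The same request can be prepared from Python, and the response above is simple enough to pick apart with the standard library. This is only a sketch: the sample document is a trimmed copy of the output above with the entities expanded, and a real RDF parser would be the more robust choice for anything beyond this flat structure.

```python
import urllib.request
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MO = "http://purl.org/ontology/mo/"


def rdf_request(uri):
    """Build (but do not send) a request asking for RDF/XML."""
    return urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})


def sessions_for(rdf_xml, artist_uri):
    """Return the subjects that have a mo:performer link to artist_uri."""
    root = ET.fromstring(rdf_xml)
    found = []
    for node in root:
        subject = node.get(f"{{{RDF}}}about")
        for performer in node.findall(f"{{{MO}}}performer"):
            if performer.get(f"{{{RDF}}}resource") == artist_uri:
                found.append(subject)
    return found


# Trimmed copy of the response shown above, with entities expanded.
SAMPLE = """<rdf:RDF
    xmlns:mo="http://purl.org/ontology/mo/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="http://dbtune.org/bbc/peel/session/1788">
  <mo:performer rdf:resource="http://dbtune.org/bbc/peel/artist/1036"/>
</rdf:Description>
<rdf:Description rdf:about="http://dbtune.org/bbc/peel/session/1789">
  <mo:performer rdf:resource="http://dbtune.org/bbc/peel/artist/1036"/>
</rdf:Description>
</rdf:RDF>"""

artist = "http://dbtune.org/bbc/peel/artist/1036"
request = rdf_request(artist)
print(request.get_header("Accept"))
print(sessions_for(SAMPLE, artist))
```

Passing the request to `urllib.request.urlopen` would actually fetch the document; it follows redirects by default, much as `curl -L` does.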

So far, this dataset is not linked to anything external! But I plan to link it to Musicbrainz, Geonames, and Last.fm snippets soon.