DBTune blog


Monday 3 December 2012

World Service Archive prototype

During the summer we (the Internet Research & Future Services team in BBC R&D) built a prototype around the BBC World Service Archive, combining automated metadata extraction and crowdsourcing through various Semantic Web technologies and datasets (namely, 4store and DBpedia). We just finished three blog posts describing various aspects of it:

The World Service Archive prototype

Friday 19 August 2011

4Store stuff

Update: The repository below is no longer maintained, as official packages have been pushed into Debian. They are not yet available for Ubuntu 11.04 though. In order to install 4store on Natty, you'd have to install the following packages from the Oneiric repository, in this order:

  • libyajl1
  • libgmp10
  • libraptor2
  • librasqal3
  • lib4store0
  • 4store

And you should have a running 4store (1.1.3).

Old post, for reference: I've been playing a lot with Garlik's 4store recently, and I have been building a few things around it. I just finished building packages for Ubuntu Jaunty, which you can get by adding the following lines in your /etc/apt/sources.list:

deb http://moustaki.org/apt jaunty main
deb-src http://moustaki.org/apt jaunty main

And then, an apt-get update && apt-get install 4store should do the trick. The packages are available for i386 and amd64. It is also one of my first packages, so feedback is welcome (I may have gotten it completely wrong). Once it is installed, you can create a database and start a SPARQL server.

I've also been writing two client libraries for 4store, all available on Github:

  • 4store-php, a PHP library to interact with 4store over HTTP (unlike Alexandre's PHP library, which interacts with 4store through its command-line tools);
  • 4store-ruby, a Ruby library to interact with 4store over HTTP or HTTPS.
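To give a feel for what these client libraries do, here is a minimal sketch, in the spirit of 4store-ruby, of building a SPARQL query request against 4store's HTTP server. The endpoint path and the `query`/`output` form parameters are assumptions based on a default 4store HTTP setup - check your own instance before relying on them.

```ruby
require 'net/http'
require 'uri'

# Minimal sketch of a 4store HTTP client. The endpoint URL and the
# form parameters ('query', 'output') are assumptions about a default
# 4store SPARQL server setup, for illustration only.
class FourStoreClient
  def initialize(endpoint = 'http://localhost:8080/sparql/')
    @endpoint = URI(endpoint)
  end

  # Build (but do not send) a POST request carrying the SPARQL query.
  def build_query_request(sparql)
    request = Net::HTTP::Post.new(@endpoint.path)
    request.set_form_data('query' => sparql, 'output' => 'json')
    request
  end

  # Send the request and return the raw response body.
  def query(sparql)
    Net::HTTP.start(@endpoint.host, @endpoint.port) do |http|
      http.request(build_query_request(sparql)).body
    end
  end
end

client = FourStoreClient.new
req = client.build_query_request('SELECT * WHERE { ?s ?p ?o } LIMIT 10')
puts req.body # form-encoded query, ready to POST to the endpoint
```

Everything interesting happens over plain HTTP, which is what makes writing such clients in any language so straightforward.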

Tuesday 28 June 2011

Using RDFa for testing templates

I promised a bunch of people from the BBC that I would write about this, so here it is! This is my first Software Engineering-related post, bear with me :)

We recently released a new iteration of /programmes, built on top of a completely new technology stack. As part of that move, we decided we wanted to review our template testing strategy. In our old application, the process for a new feature would basically be:

  1. Software Engineer writes models, controllers and data views;
  2. Web Developer writes XHTML templates;
  3. Software Engineer writes controller tests, which would actually test the routes, the controllers and the templates (such controller 'unit tests' are actually fairly standard across MVC frameworks, for some reason - e.g. in Zend)

Those controller tests were based on CSS selectors or XPaths. Therefore, any time a small front-end tweak needed to be done, the controller tests would break, which is very annoying for everyone.

We had two problems:

  • Our controller 'unit tests' were not really unit tests - front-end developers shouldn't have to understand the whole routing, controller and model stack to make a front-end change.
  • Using CSS selectors or XPaths for template tests is brittle. We don't want our tests to break every time we change the name of a CSS class.

In order to solve the first problem, we split our controller tests into route tests (here is a route; assert that the application forwards it to the right controller/action with the right parameters), real controller unit tests (mock the models, call an action with some request parameters, check that the right data is sent to the view), and template tests.

In order to solve our second problem, we made those template tests rely on RDFa markup embedded within the page. To test a template, we create some mock data, generate the template using this data, and check that the right RDF triples can be extracted from the resulting page. This ensures that tests are actually based on data - front-end changes won't have an impact on them. We just want to make sure we present the right data to the user. As the tests don't rely on any other application code, it also means that someone writing the templates can maintain their own test suite.

A simple example of one of these tests is the following one:

    public function testLetterSearch()
    {
        $data = (object) array(
            "by"      => "by",
            "search"  => "b",
            "slice"   => "player",
            "letters" => array('@', 'a', 'b', 'c'),
        );

        $this->assertTriples($data, array(
            array('/programmes/a-z/by/b', 'rdfs:seeAlso', '/programmes/a-z/by/%40/player'),
            array('/programmes/a-z/by/b', 'rdfs:seeAlso', '/programmes/a-z/by/a/player'),
            array('/programmes/a-z/by/b', 'rdfs:seeAlso', '/programmes/a-z/by/c/player'),
        ));
    }
which can be read as: if you are displaying a list of letters on an A-Z page and you have selected one, you shouldn't link to that one.
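The mechanics of such an assertion can be sketched in a few lines of Ruby. Here both the "rendered page" and the triple extractor are stand-ins (real RDFa parsing is far more involved); only the comparison logic reflects what our assertTriples does.

```ruby
# Sketch of the data-driven template test idea: extract triples from
# rendered output, and compare against expectations. The extractor below
# is a stand-in, NOT a real RDFa parser: it reads made-up data-triple
# attributes, purely to keep the example self-contained.
def extract_triples(html)
  html.scan(/<span data-triple="([^"]+)"/).map { |(t)| t.split(' ', 3) }
end

def assert_triples(html, expected)
  actual = extract_triples(html)
  missing = expected - actual
  raise "missing triples: #{missing.inspect}" unless missing.empty?
  true
end

# Mock "rendered template" carrying its data as annotations.
page = <<~HTML
  <span data-triple="/programmes/a-z/by/b rdfs:seeAlso /programmes/a-z/by/a/player">a</span>
  <span data-triple="/programmes/a-z/by/b rdfs:seeAlso /programmes/a-z/by/c/player">c</span>
HTML

assert_triples(page, [
  ['/programmes/a-z/by/b', 'rdfs:seeAlso', '/programmes/a-z/by/a/player']
])
```

The key property: the assertion only cares about the triples, so renaming a CSS class or restructuring the markup leaves the test green as long as the same data comes out.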

Another very nice side-effect is that developers have a motivation to put lots of RDFa inside our pages! Compare the 168 triples extracted from a new /programmes page, including full tracklist information, programme metadata and broadcasts, to the 18 triples extracted from an old one. And as we add new components to this page, more RDFa will become available.

Also, the speed at which our developers picked up RDFa (1.0, not even 1.1, which is apparently simpler) defeats the eternal argument about RDFa being too complicated, but that's just my opinion :) The RDFa cheat sheet has proved immensely helpful.

Thursday 19 May 2011

Music and the Semantic Web workshop

Quite a big week for Semantic Web and music last week in London. On Thursday, there was the MusicNet workshop, with (among others) a talk from the BBC, given by Nick Humfrey. Sadly, I could not attend the workshop due to other BBC duties.

On the Friday, David De Roure and I organised a Music and the Semantic Web workshop at the AES, which David already blogged about. We had four panelists:

We started with a presentation from each of the four panelists. David began by presenting MusicNet, which aims at creating canonical Web identifiers for classical music composers. He demonstrated a new tool for merging identifiers for composers - finding common properties between groups of composers, and providing an interface to review those groups.

Alexandre then presented Seevl. He explained how it works, aggregating and consolidating structured data about music artists from a range of different places, and generating recommendations using this data, as well as explanations of these recommendations (those two artists played together, they had the same producer for their first album, etc.). I've written quite a lot about this kind of thing on this blog - it's really nice to finally see it taking shape!

Gregg presented his experience within the Connected Media Experience (CME) project. His presentation was supported by a position paper, which is extremely interesting. CME (a large consortium of key music industry players) worked for a couple of years on an RDF/Music Ontology format for online releases. Sadly, they recently abandoned this format for simple HTML5+CSS - structured data about those releases is not a priority anymore. Gregg gave us insights on what went wrong, and on the lessons to be learned by both the Semantic Web community and the music industry.

Evan then presented Decibel, giving us very interesting insights about music metadata, and a demonstration of their service. It was interesting to see semantic technologies used in a completely different model. The richness of the data they hold is truly amazing (Evan demoed their internationalisation feature as well - all their data is available in a variety of languages), but sadly not available under an open license.

After that, we had a number of questions from David and me, as well as from the audience, about ease of editing and ownership of music metadata (who should own it? third parties? artists? record labels? who should host the canonical URI for an artist?), and about the relationship of Semantic Web standards with industry standards like ISRCs, ISNIs, MPEG etc.

Overall, a very interesting workshop - I hope we can do it again next year!

Tuesday 21 December 2010

Named Graphs and quad stores

I was recently looking again at the Named Graphs paper, and was wondering why the consensus for implementing Named Graphs support in triple stores was to store quads.

Named Graphs are defined as follows. We consider U as the set of all web identifiers, L as the set of all literals, and B as the set of all blank nodes. U, L and B are pairwise disjoint. Let N = U ∪ L ∪ B; then the set T = N × U × N is the set of all RDF triples. The set of RDF graphs G is the power set of T, and a named graph is a pair ng = (u, g), where u ∈ U and g ∈ G.

As mentioned in one of my previous posts, one thing I am really keen on is the idea of triples that belong to multiple graphs. This situation already happens in the wild a fair bit. If you look at a /programmes aggregation feed (all available Radio 4 programmes starting with g), it mentions multiple programmes. All of the statements about each of these programmes also belong to the corresponding programme graph (e.g. Gardener's Question).

Nothing in the above definition forbids that from happening - each graph and sub-graph can be named. { (a,b,c), (d,f,g) } can be named x and { (a,b,c) } can be named y.

Now, most triple stores implement Named Graphs by, in fact, storing quads: tuples of the form (subject, predicate, object, graph). This means that when you try to store the two graphs above, they will in fact store (a,b,c,x), (d,f,g,x) and (a,b,c,y) --- the triples are replicated in each graph they appear in. For the above /programmes example, this is terribly inefficient. If you crawl /programmes, you will end up hitting lots of such aggregation URLs, leading to a tremendous amount of data duplication - lots of triples will be repeated across graphs. Perhaps it would be better to store something like (a,b,c,(x,y)) and (d,f,g,(x))?
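The difference between the two strategies is easy to see on the toy graphs above. This is a deliberately naive in-memory sketch, not how any real store lays out its indexes:

```ruby
require 'set'

# Strategy 1 (what most quad stores do): one (s, p, o, g) row per
# graph membership, so shared triples are replicated.
quads = Set.new
graphs = { x: [[:a, :b, :c], [:d, :f, :g]], y: [[:a, :b, :c]] }
graphs.each { |g, triples| triples.each { |t| quads << (t + [g]) } }

# Strategy 2: store each distinct triple once, with the set of graphs
# it belongs to -- roughly (a, b, c, {x, y}).
membership = Hash.new { |h, k| h[k] = Set.new }
graphs.each { |g, triples| triples.each { |t| membership[t] << g } }

puts quads.size      # 3 rows: (a,b,c) is stored twice
puts membership.size # 2 rows: each triple stored once
puts membership[[:a, :b, :c]].to_a.inspect
```

With heavily overlapping graphs, as in the /programmes crawl case, the gap between the two row counts grows with every aggregation page you fetch.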

I would be curious to know if anyone else is hitting this issue. I can think of multiple use-cases for triples belonging to multiple graphs: access control, versioning, attribution at different granularities...

Wednesday 20 October 2010

Linking Enterprise Data

Just a quick note to mention that the Linking Enterprise Data book is now available, for free, online. Together with Tom Scott, Silver Oliver, Patrick Sinclair and Michael Smethurst, I wrote a chapter on the use of Semantic Web technologies at the BBC, which expands on the W3C case study we wrote at the beginning of the year. If you're interested in how Semantic Web technologies were used to build BBC Programmes, BBC Music and Wildlife Finder, make sure you read it (I also noticed it was available for pre-order on Amazon).

Tuesday 7 September 2010

Geocaching for music wins the 7digital prize at London Music Hack Day!

We had a really good time last week-end at the London Music Hack Day. If you haven't yet seen the list of hacks that were built over the week-end, go take a look! I found Speakatron, BumbleTab and Earth Destroyers to be particularly funny :)

Andrew, Chris and I (as well as our beloved video producer Patrick) created an Android application for creating and finding 'musical treasures'. Think geocaching for music. You wander around and can drop tracks from your personal music collection, and you can see what tracks people have dropped nearby. If you're close enough, you can fetch and play the tracks. And we won the 7digital prize!

Here is a small video demonstration, and below are some screenshots:

On the data side, we use the recently linked 7digital and Echonest APIs. The back-end was written using Rails (check out the Cucumber tests!), and the Android application using the Android SDK. A list of clues for recently dropped tracks appears on the main website and on a twitter feed. And of course, here is the APK for the Android application.

Tuesday 13 July 2010

First BBC microsite powered by a triple-store

Jem Rayfield wrote a very interesting post on the technologies used by the World Cup BBC web site, which also got covered by Read Write Web.

All this is very exciting - the World Cup website proved that triple store technologies can be used to drive a production website with significant traffic. I am expecting lots more parts of the BBC web infrastructure to evolve in the same way :-)

There are two issues we are still currently trying to solve though:

  • We need to be able to cluster our triples along several dimensions. For example, we may want to have a graph for a particular programme, and a much larger graph for a particular dataset (e.g. programme data, wildlife finder data, world cup data). The smaller graph is used to make our updates relatively cheap (we replace the whole graph whenever we receive an update). The bigger graph is used to give some degree of isolation between the different sources of data. For that, we need graphs within graphs. It can be done with N3-type graph literals, but is impossible to achieve in a standard quad-store setup, where a single triple can't be part of several graphs.
  • With regards to programme data, the main bottleneck we're facing is the number of updates per second we need to be able to process, which most available triple stores struggle to keep up with. The 4store instance on DBTune does keep up, but it has a negative impact on query performance, as the write operations block the reads. We were quite surprised to see that the available triple store benchmarks do not take write throughput into account!

Wednesday 19 May 2010

DBpedia and BBC Programmes

We just put live a new exciting feature on BBC Programmes: programme aggregations powered by DBpedia. For example, you can look at:

Of course, the RDF representations are linked up to DBpedia. Try loading adolescence in the Tabulator, for example - you will get an immediate mashup of BBC data, DBpedia data, and Freebase data. Or if you're not afraid of getting overloaded with data, try the California one.

One of the most interesting things about using web identifiers as tags for our programmes (apart from being able to automatically generate those aggregation pages, of course) is that we can use ancillary information about those tags to create new sorts of aggregations, and new visualisations of our data. We could for example plot all our Radio 3 programmes on a map, based on the geolocation of the people associated with these programmes. Or we could create an aggregation of BBC programmes featuring artists living in the cities with the highest rainfall (why not?). And, of course, this will be a fantastic new source of data for the MusicBore! The possibilities are basically endless, and we are very excited about it!
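The "new aggregations from ancillary data" idea can be sketched concretely. The programme/tag assignments and birthplace facts below are made up for illustration (the real data would come from /programmes and DBpedia), but the inversion step is the whole trick:

```ruby
# Programmes tagged with DBpedia identifiers (illustrative data only).
PROGRAMME_TAGS = {
  'Composer of the Week' => ['dbpedia:Edward_Elgar'],
  'Night Waves'          => ['dbpedia:Benjamin_Britten'],
  'Discovering Music'    => ['dbpedia:Edward_Elgar', 'dbpedia:Benjamin_Britten']
}

# Ancillary facts about the tags, e.g. pulled from DBpedia.
TAG_BIRTHPLACE = {
  'dbpedia:Edward_Elgar'     => 'Worcestershire',
  'dbpedia:Benjamin_Britten' => 'Lowestoft'
}

# Invert the tagging through the ancillary data: place => programmes.
def programmes_by_place(programme_tags, birthplaces)
  result = Hash.new { |h, k| h[k] = [] }
  programme_tags.each do |programme, tags|
    tags.each do |tag|
      place = birthplaces[tag]
      result[place] << programme if place
    end
  end
  result
end

aggregation = programmes_by_place(PROGRAMME_TAGS, TAG_BIRTHPLACE)
puts aggregation['Worcestershire'].inspect
```

Swap the birthplace table for rainfall, genre, or anything else DBpedia knows about the tag, and you get a different aggregation for free - without touching the programme data itself.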

Friday 29 January 2010

BBC Semantic Web use-case

After a very long time writing it, we finally have a BBC Semantic Web use-case on the W3C website! It describes work we did around BBC Programmes, BBC Music, BBC Wildlife Finder and Search+. I hope it all makes a bit of sense :-) For a more detailed writeup about these issues, Patrick's Linked Data on the BBC posts are very good.

Thursday 14 January 2010

Live SPARQL end-point for BBC Programmes

Update: We seem to have an issue with the 4store instance hosting the dataset, so the data has been stale since the end of February. Update 2: All should be back to normal and in sync. Please comment on this post if you spot any issue, or general slowness.

Last year, we got OpenLink and Talis to crawl BBC Programmes and provide two SPARQL end-points on top of the aggregated data. However, getting the data by crawling it means that the end-points did not have all the data, and that the data can get quite outdated -- especially as our programme data changes a lot.

At the moment, our data comes from two sources: PIPs (the central programme database at the BBC) and PIT (our content management system for programme information). In order to populate the /programmes database, we monitor changes on these two sources and replicate them on our database. We have a small piece of Ruby/ActiveRecord software (which we call the Tapp) that handles this process.

I made a small experiment, converting our ActiveRecord objects to RDF and hooking an HTTP POST or HTTP DELETE request to a 4store instance for each change we receive. This means that the 4store instance is kept in sync with the upstream data sources.

It took a while to backfill, but it is now up-to-date. Check out the SPARQL end-point, a test SPARQL query form and the size of the endpoint (currently about 44 million triples).

The end-point holds all information about services, programmes, categories, versions, broadcasts, ondemands, time intervals and segments, as defined within the Programme Ontology. Each of these resources is held within its own named graph, which means we have a very large number of graphs (about 5 million). This makes it far easier to update the endpoint, as we can just replace the whole graph whenever something changes for a resource.
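The per-resource graph strategy boils down to one cheap operation: swap a small graph wholesale instead of diffing triples. A minimal in-memory sketch (the real thing does this over 4store's HTTP interface):

```ruby
# Sketch of the per-resource named graph update strategy: every resource's
# statements live in their own graph, so an update is a whole-graph
# replace rather than a triple-by-triple diff.
class GraphStore
  def initialize
    @graphs = {} # graph name => array of [s, p, o] triples
  end

  # Replace the entire contents of a named graph in one operation.
  def replace_graph(name, triples)
    @graphs[name] = triples
  end

  def delete_graph(name)
    @graphs.delete(name)
  end

  def triples(name)
    @graphs.fetch(name, [])
  end

  def triple_count
    @graphs.values.sum(&:size)
  end
end

store = GraphStore.new
g = 'http://www.bbc.co.uk/programmes/b006m86d'
store.replace_graph(g, [[g, 'rdfs:label', 'EastEnders'],
                        [g, 'po:episode', "#{g}/episodes"]])
# An upstream change arrives: just swap the whole graph.
store.replace_graph(g, [[g, 'rdfs:label', 'EastEnders (updated)']])
puts store.triple_count # 1
```

The cost of an update is bounded by the size of one resource's graph, never by the size of the whole 44-million-triple store.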

This is still highly experimental though, and I have already found a few bugs: some episodes seem to be missing (for example, some Strictly Come Dancing episodes are missing, for some reason). I've also encountered some really weird crashes of the machine hosting the end-point when concurrently pushing a large number of RDF documents at it - I still haven't managed to identify the cause. To summarise: it might die without notice :-)

Here are some example SPARQL queries:

  • All programmes related to James Bond:
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?uri ?label
WHERE {
  ?uri po:category 
    <http://www.bbc.co.uk/programmes/people/bmFtZS9ib25kLCBqYW1lcyAobm8gcXVhbGlmaWVyKQ#person> ;
    rdfs:label ?label
}
  • Find all EastEnders broadcast dates after 2009-01-01, along with the type of the version that was broadcast:
PREFIX event: <http://purl.org/NET/c4dm/event.owl#> 
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#> 
PREFIX po: <http://purl.org/ontology/po/> 
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?version_type ?broadcast_start
{ <http://www.bbc.co.uk/programmes/b006m86d#programme> po:episode ?episode .
  ?episode po:version ?version .
  ?version a ?version_type .
  ?broadcast po:broadcast_of ?version .
  ?broadcast event:time ?time .
  ?time tl:start ?broadcast_start .
  FILTER ((?version_type != <http://purl.org/ontology/po/Version>) && (?broadcast_start > "2009-01-01T00:00:00Z"^^xsd:dateTime))}
  • Find all programmes featuring tracks by two given artists (here identified by their BBC Music URIs):
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
PREFIX owl: <http://www.w3.org/2002/07/owl#> 
SELECT DISTINCT ?programme ?label
WHERE {
  ?event1 po:track ?track1 .
  ?track1 foaf:maker ?maker1 .
  ?maker1 owl:sameAs <http://www.bbc.co.uk/music/artists/67f66c07-6e61-4026-ade5-7e782fad3a5d#artist> .
  ?event2 po:track ?track2 .
  ?track2 foaf:maker ?maker2 .
  ?maker2 owl:sameAs <http://www.bbc.co.uk/music/artists/fb7272ba-f130-4f0a-934d-6eeea4c18c9a#artist> .
  ?event1 event:time ?t1 .
  ?event2 event:time ?t2 .
  ?t1 tl:timeline ?tl .
  ?t2 tl:timeline ?tl .
  ?version po:time ?t .
  ?t tl:timeline ?tl .
  ?programme po:version ?version .
  ?programme rdfs:label ?label .
}

Tuesday 27 October 2009

Music recommendation and Linked Data

We just presented yesterday at ISMIR a tutorial about Linked Data for music-related information. More information on the tutorial is available on the tutorial website, and the slides are also available.

In particular, we had two sets of slides dealing with the relationship between music recommendation and linked data. As this is something we're investigating within the NoTube project, I thought I would write up a bit more about it.

Let's focus on artist to artist recommendation for now. If we look at last.fm for recommendations for New Order, here is what we get.

Artists similar to New Order, from last.fm

Similarly, using the Echonest API for similar artists, we get back an ordered list of artists similar to New Order, including Orchestral Manoeuvres in the Dark, Depeche Mode, etc.

Now, let's play word association for a few bands and musical genres. My colleague Michael Smethurst took the Sex Pistols, Acid House and Public Enemy, and drew the following associations:

Sex Pistols associated words

Acid House word associations

Public Enemy word associations

We can see that among the different terms in these diagrams, some refer to people, to TV programmes, to fashion styles, to drugs, to music hardware, to places, to laws, to political groups, to record labels, etc. Just a couple of these terms are actually other bands or tracks. If you were to describe these artists in purely musical terms, you'd probably be missing the point. And all these things are also linked to each other: you could play word association for any of them and see what the connections are between Public Enemy and the Sex Pistols. So how does that relate to recommendations? When recommending an artist from another artist, context is key. You need to provide an explanation of why they actually relate to each other, whether it's through common members, drugs, belonging to the same independent record label, acoustic similarity (if so, how exactly), etc. The main hypothesis here is that users are much more likely to accept a recommendation that is explicitly backed by some contextual information.

On the BBC website, we cover quite a few domains, and we try to create as many links as possible between these domains, by following the Linked Data principles. From our BBC Music site, we can explore much more information, from other BBC content (programmes, news etc.) to other Linked Data sources, e.g. DBpedia, Freebase and Musicbrainz. This provides us with a wealth of structured information that we would ultimately want to use for driving and backing up our recommendations.

The MusicBore I've described earlier on this blog uses much the same approach. Playlists are generated by following paths in Linked Data. Each artist is introduced by generating a sentence from the path leading from the seed artist to the target artist. The prototype described in that paper from the SDOW workshop last year also illustrates this approach.

So we developed a small prototype of this kind of idea, rqommend (and when I say small, it is very small :) ). Basically, we define "relatedness rules" in the form of SPARQL queries, like "two artists born in Detroit in the 60s are related". We could go for very general rules, e.g. "any path between two artists makes them related", but it would be very hard to generate an accurate textual explanation for such a rule, and it might give some, ahem, not very interesting connections. Then, we just run these rules over an aggregation of Linked Data, and generate recommendations from them. Here is a greasemonkey script injecting such recommendations into BBC Music (see for example the Fugazi page). It injects Linked Data based recommendations, along with the associated explanation, within BBC artist pages. For example, for New Order:

BBC Music recs for New Order
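The rule-plus-explanation mechanism behind rqommend can be sketched quite simply. In rqommend the rules are SPARQL queries over aggregated Linked Data; here, as an illustrative assumption, they are plain predicates over toy artist records:

```ruby
# Sketch of "relatedness rules" producing recommendations together with
# a human-readable explanation. The Artist records and rule bodies are
# illustrative stand-ins for SPARQL queries over Linked Data.
Artist = Struct.new(:name, :birthplace, :born, :label)

RULES = [
  {
    explain: ->(a, b) { "#{a.name} and #{b.name} were both born in #{a.birthplace} in the 60s" },
    match:   ->(a, b) { a.birthplace == b.birthplace && [a, b].all? { |x| (1960..1969).cover?(x.born) } }
  },
  {
    explain: ->(a, b) { "#{a.name} and #{b.name} are both signed to #{a.label}" },
    match:   ->(a, b) { a.label == b.label }
  }
]

# Every matching rule yields a (recommendation, explanation) pair.
def recommendations(seed, artists)
  artists.reject { |a| a == seed }.flat_map do |other|
    RULES.select { |r| r[:match].call(seed, other) }
         .map    { |r| [other.name, r[:explain].call(seed, other)] }
  end
end

artists = [
  Artist.new('Derrick May',  'Detroit', 1963, 'Transmat'),
  Artist.new('Carl Craig',   'Detroit', 1969, 'Planet E'),
  Artist.new('Juan Atkins',  'Detroit', 1962, 'Metroplex')
]

recommendations(artists[0], artists).each { |name, why| puts "#{name}: #{why}" }
```

Because each recommendation carries the rule that produced it, the explanation comes for free - exactly what a path-agnostic similarity score can't give you.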

To conclude, I think there is a really strong influence of traditional information retrieval systems on the music information retrieval community. But what makes Google, for example, particularly successful is that it exploits links, not just the documents themselves. We definitely need to move towards the same sort of model: exploiting the links surrounding music, and all the cross-domain information that makes it so rich, to create better music recommendation systems which combine the what is recommended with the why it is recommended.

Thursday 10 September 2009

Linked Data London event screencasts and London Web Standards meetup

With Tom Scott, we presented a talk on contextualising BBC programmes using linked data for the Linked Data London event. For the occasion, I made a couple of screencasts.

The first one shows some browsing of the linked data we expose on the BBC website, using the Tabulator Firefox extension. I start from a Radio 2 programme, move to its segmentation into musical tracks, then to another programme featuring one of those tracks, and finally to another artist featured in that programme. The Tabulator ends up displaying data aggregated from BBC Programmes, BBC Music and DBpedia.

Exploring BBC programmes and music data using the Tabulator

The second one shows what you can do with these programmes/artists and artists/programmes links. We built some very straightforward programme-to-programme recommendations using them. On the right-hand side of the programme page, there are recommendations, based on artists played in common. The recommendations are scoped by the availability of the programme on iPlayer or by the fact that it has an upcoming broadcast. If you hover over a recommendation, it displays what allowed us to derive it: here, a list of artists played in both programmes. This work is part of our investigations within the NoTube European project.

Artist-based programme to programme recommendations
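The artist-overlap recommendation shown in the screencast reduces to a set intersection over tracklists. A minimal sketch, with made-up tracklists standing in for the real segment data:

```ruby
require 'set'

# Two programmes are related when their tracklists share artists; the
# shared artists double as the explanation shown on hover.
# The tracklists below are made up for illustration.
TRACKLISTS = {
  'prog_a' => Set['New Order', 'Depeche Mode', 'OMD'],
  'prog_b' => Set['Depeche Mode', 'OMD', 'Erasure'],
  'prog_c' => Set['Miles Davis']
}

def recommend(seed, tracklists)
  tracklists.reject { |p, _| p == seed }
            .map    { |p, artists| [p, (tracklists[seed] & artists).to_a] }
            .reject { |_, common| common.empty? }
            .sort_by { |_, common| -common.size } # most artists in common first
end

recommend('prog_a', TRACKLISTS).each do |prog, common|
  puts "#{prog}: artists in common: #{common.join(', ')}"
end
```

The intersection that ranks the recommendation is the very thing displayed as its justification, so the recommendation and its explanation can never drift apart.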

Also, as Michael already posted on Radio Labs, we gave a presentation to the London Web Standards group on Linked Data. It was a very nice event, especially as mainly web developers turned up. Linked data events tend to be mostly about linked data evangelists talking to other linked data evangelists (which is great too!), so this was quite different :-) Lots of interesting questions about provenance and trustworthiness of data were asked, which are always a bit difficult to answer, apart from the usual "it's just the Web: you can deal with it as you do (or don't) currently with Web data, e.g. by keeping track of provenance information and filtering based on that". Somebody suggested that you could compute statistics on how many times a particular statement is repeated in order to derive its trustworthiness, but this sounds a bit dangerous... Currently on the Linked Data cloud, lots of information gets repeated. For example, if a statement about an artist is available on DBpedia, there is a fair chance it will get repeated in BBC Music, just because we also use Wikipedia as an information source. The fact that this statement gets repeated doesn't make it any more valid.

Skim-read introduction to linked data slides

Monday 13 July 2009

Music Hack Day and the MusicBore

This week-end was the Music Hack Day in London. The event was great, with everything a hack day needs: pizza, beer, and technical glitches during the demos :-) And, of course, lots of awesome hacks (including that amazing visualisation, which didn't make it onto the list for some reason).

With Christopher, Patrick and Nick, we created the MusicBore. The MusicBore is a completely automated radio DJ, which, well, tends to be really boring :-) We actually won two prizes with it! The Last.fm one and the best hack one! We were really, really happy :-) But let the MusicBore introduce itself:

Hello. I am the Music Bore. I play music and I like to tell you ALL about the music I play. I live on IRC. I get my information from BBC Music, BBC Programmes, last fm, the Echo Nest, Yahoo Weather and the web of Linked Data. To find out more, please visit bit.ly/musicbore. You can dissect my disgusting innards on github. Now let me play you some music.

Here it is in action, the first one walks through Soul-ish tunes, whereas the second one goes into French punk-rock:

The Music Bore - Video 2 from Nicholas Humfrey on Vimeo.

The Music Bore - Video 1 from Nicholas Humfrey on Vimeo.

The MusicBore is powered by a new and exciting messaging technology: IRC :-) Lots of bots sit in an IRC channel and talk to each other to create a radio show. The show is created entirely live, each bot contributing a specific ability. Just before the hack presentation, we had 10 bots in the same channel (all the bot sources are on github):
  • controller: In charge of starting the show by playing an introduction and choosing a new song drawn from the BBC Music charts. Also in charge of re-drawing a new song in case the other bots get stuck.
  • thebore: Renders information about a particular artist from BBC Music, BBC Programmes, Wikipedia, Last.fm and other Linked Data.
  • connectionfinder: Given a seed artist, gives the next one in the playlist, along with an explanation of how it was chosen. Basically walks through Linked Data to discover new artists.
  • placefinder: Given a seed artist, gives the next one in the playlist, along with an explanation of how it was chosen. This bot is constrained to go through places, so will give connections like Did you know that David Guetta is born in Paris, and that Georges Garvarentz died in the same place?
  • musicfinder: Finds music content for an artist, using BBC Programmes segment data and the Surge Radio RDF.
  • trackfinder: Finds music content for an artist, using the Echonest API.
  • irc2play: Says sentences and plays tracks mentioned on IRC, mixing them using Jack, Madjack and JackMiniMix.
  • weatherbot: Renders weather information from Yahoo.
  • imagebot: Finds an image from BBC Music for a particular artist.
  • hotnessfinder: Finds the hotness for an artist using the Echonest API, and constructs a sentence from it.
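The core trick behind connectionfinder and placefinder - walk the data to the next artist and turn the path into a sentence - can be sketched like this. The little fact graph is hand-made illustration data, not what the bots actually queried:

```ruby
# Toy version of the placefinder idea: pick the next artist by finding a
# shared place, and build the explanation sentence from the connecting path.
# The facts below are hand-made illustration data.
GRAPH = {
  'David Guetta'       => { 'was born in' => 'Paris' },
  'Georges Garvarentz' => { 'died in' => 'Paris' },
  'Barry White'        => { 'died in' => 'Los Angeles' },
  'Chico Hamilton'     => { 'died in' => 'Los Angeles' }
}

# Find another artist sharing a place with the seed, plus an explanation.
def next_artist(seed)
  seed_places = GRAPH.fetch(seed).values
  GRAPH.each do |artist, facts|
    next if artist == seed
    facts.each do |rel, place|
      if seed_places.include?(place)
        seed_rel = GRAPH[seed].key(place)
        return [artist,
                "Did you know that #{seed} #{seed_rel} #{place}, " \
                "and that #{artist} #{rel} the same place?"]
      end
    end
  end
  nil
end

artist, sentence = next_artist('David Guetta')
puts sentence
```

Each hop yields both the next playlist entry and the sentence thebore reads out, which is why the show can run unattended.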

Here is a small diagram of how the different bots interact with each other (if you don't understand that diagram, that's fine, we don't either, but it does seem to work :-) ).


As I thought they were quite fun, here are the logs of the bbcmusicbore channel over the week-end. Here is a small best-of:

14:05 < yvesr> trackfinder: weezer
14:05 -!- trackfinder [n=trackfin@] has quit
[Remote closed the connection]
This happens a lot at the beginning of the log, bots just throwing exceptions and quitting irc.
11:25 < thebore> connectionfinder:http://dbpedia.org/resource/David_Guetta
11:25 < connectionfinder> say:Did you know that David Guetta was born in
France which has leader name Nicolas Sarkozy, which has spouse Carla
11:25 < thebore> say:Carla Bruni has released 3 records. No Promises was
just terrible.
God. Nicolas Sarkozy is really in every media, even the randomly generated ones! thebore got enlightened on that one.
10:41 < yvesr> connectionfinder:http://dbpedia.org/resource/Barry_White
10:42 < connectionfinder> say:Did you know that Barry White died in the
same city as Chico Hamilton, Los Angeles?
A morbid playlist generation :-)
11:07 < thebore> say:Colin Murray on BBC Radio One is a massive fan. They
have a myspace at myspace.com/metallica. They sound a bit like Megadeth.
They have released 524 records. I really liked 2008-11-01: Rose Garden,
Portland, OR, USA.
Bootlegs FTW!
11:43 < yvesr> hotnessfinder: michael jackson
11:43 < hotnessfinder> say: Wow, michael jackson is really, really hot
right now! Did he die recently?
No comments
12:50 < thebore> connectionfinder:http://dbpedia.org/resource/Snoop_Dogg
12:50 < connectionfinder> say:Did you know that Dogg, Snoop is signed on
Death Row Records which was founded by Dr. Dre, which used to be quite
close to World Class Wreckin' Cru?
No, I didn't!
13:56 < thebore> say:Beverley Knight sounds a bit like Eternal.
In September 2005 Knight joined the likes of Bob Dylan and Joni
Mitchell by being made a Doctor of Music after receiving the honorary
degree from the University of Wolverhampton.
No, I didn't know either!

Right now, all the bots are offline (they were running on our laptops during the week-end), but we plan to use the Amazon Web Services vouchers we won to keep them running for a while :-)

Monday 15 June 2009

And another fun BBC SPARQL query

This query returns BBC programmes featuring artists originating from France (this is just a straight adaptation of the last query in my previous post).

The results are quite fun! Apparently, the big French hits on the BBC are from Jean-Michel Jarre, Air, Modjo, Phoenix (are they known in France? I've only heard of them in the UK) and Vanessa Paradis.

Note that the tracklisting data we expose in our RDF just goes back a couple of months, so that might explain why the list is not bigger.
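Since the adaptation is a one-line change, the query presumably looks something like this: a sketch based on the Northern Ireland query from the previous post, with the dbprop:origin object swapped for the DBpedia France resource.

```sparql
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
PREFIX dbprop: <http://dbpedia.org/property/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?programme ?label ?artistlabel ?dbpmaker
WHERE {
  ?event1 po:track ?track1 .
  ?track1 foaf:maker ?maker .
  ?maker rdfs:label ?artistlabel .
  ?maker owl:sameAs ?dbpmaker .
  ?dbpmaker dbprop:origin <http://dbpedia.org/resource/France> .
  ?event1 po:time ?t1 .
  ?t1 tl:timeline ?tl .
  ?version po:time ?t .
  ?t tl:timeline ?tl .
  ?programme po:version ?version .
  ?programme rdfs:label ?label .
}
```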

Thursday 11 June 2009

BBC SPARQL end-points

We recently announced on the BBC Backstage blog the availability of two SPARQL end-points, one hosted by Talis and one by OpenLink. These two companies have aggregated the RDF data we publish at http://www.bbc.co.uk/programmes and http://www.bbc.co.uk/music. This opens up a lot of possibilities for fascinating SPARQL queries. Talis has already compiled a small list, and here are a couple I just designed:

  • Give me programmes that deal with the fictional character James Bond - results
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?uri ?label
WHERE {
  ?uri po:person
    <http://www.bbc.co.uk/programmes/people/bmFtZS9ib25kLCBqYW1lcyAobm8gcXVhbGlmaWVyKQ#person> ;
    rdfs:label ?label .
}
  • Give me artists that were featured in the same programme as the Foo Fighters - results
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
SELECT DISTINCT ?artist2 ?label2
WHERE {
  ?event1 po:track ?track1 .
  ?track1 foaf:maker <http://www.bbc.co.uk/music/artists/67f66c07-6e61-4026-ade5-7e782fad3a5d#artist> .
  ?event2 po:track ?track2 .
  ?track2 foaf:maker ?artist2 .
  ?artist2 rdfs:label ?label2 .
  ?event1 po:time ?t1 .
  ?event2 po:time ?t2 .
  ?t1 tl:timeline ?tl .
  ?t2 tl:timeline ?tl .
  FILTER (?t1 != ?t2)
}
  • Give me programmes that featured both Al Green and the Foo Fighters (yes! there is one result!!) - results
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
SELECT DISTINCT ?programme ?label
WHERE {
  ?event1 po:track ?track1 .
  ?track1 foaf:maker <http://www.bbc.co.uk/music/artists/67f66c07-6e61-4026-ade5-7e782fad3a5d#artist> .
  ?event2 po:track ?track2 .
  ?track2 foaf:maker <http://www.bbc.co.uk/music/artists/fb7272ba-f130-4f0a-934d-6eeea4c18c9a#artist> .
  ?event1 po:time ?t1 .
  ?event2 po:time ?t2 .
  ?t1 tl:timeline ?tl .
  ?t2 tl:timeline ?tl .
  ?version po:time ?t .
  ?t tl:timeline ?tl .
  ?programme po:version ?version .
  ?programme rdfs:label ?label .
}
  • All programmes that featured an artist originating from Northern Ireland - results
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
PREFIX dbprop: <http://dbpedia.org/property/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?programme ?label ?artistlabel ?dbpmaker
WHERE {
  ?event1 po:track ?track1 .
  ?track1 foaf:maker ?maker .
  ?maker rdfs:label ?artistlabel .
  ?maker owl:sameAs ?dbpmaker .
  ?dbpmaker dbprop:origin <http://dbpedia.org/resource/Northern_Ireland> .
  ?event1 po:time ?t1 .
  ?t1 tl:timeline ?tl .
  ?version po:time ?t .
  ?t tl:timeline ?tl .
  ?programme po:version ?version .
  ?programme rdfs:label ?label .
}

(Note that we only need the owl:sameAs pattern in the above query because the Talis end-point doesn't support owl:sameAs inference.)

Let us know what kind of queries you can come up with using this data! :-)
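If you want to play with variations on these queries programmatically, here is a small Python sketch that parameterises the "featured in the same programme" query by artist URI. The featured_with helper is my own illustration, not part of the BBC services; the query text itself is the one from the post above.

```python
# Parameterise the "artists featured in the same programme" query
# for an arbitrary BBC Music artist URI. Only the query text comes
# from the post above; the helper is illustrative.

PREFIXES = """\
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
"""

def featured_with(artist_uri):
    """Build a query for artists whose tracks were broadcast on the
    same timeline (i.e. in the same programme) as artist_uri's."""
    return PREFIXES + f"""
SELECT DISTINCT ?artist2 ?label2
WHERE {{
  ?event1 po:track ?track1 .
  ?track1 foaf:maker <{artist_uri}> .
  ?event2 po:track ?track2 .
  ?track2 foaf:maker ?artist2 .
  ?artist2 rdfs:label ?label2 .
  ?event1 po:time ?t1 .
  ?event2 po:time ?t2 .
  ?t1 tl:timeline ?tl .
  ?t2 tl:timeline ?tl .
  FILTER (?t1 != ?t2)
}}
"""

# The Foo Fighters URI from the examples above:
q = featured_with("http://www.bbc.co.uk/music/artists/"
                  "67f66c07-6e61-4026-ade5-7e782fad3a5d#artist")
```

The resulting string can then be posted to either of the two end-points.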

Tuesday 12 May 2009

Yahoo Hackday 2009


We went to the Yahoo Hackday this weekend, with a couple of people from the C4DM and the BBC. Apart from a flaky wireless connection on the Saturday, it was a really great event, with lots of interesting talks and interesting hacks.

On the Saturday, we learned about Searchmonkey. I tried to create a small Searchmonkey application during the talk, but eventually got frustrated. Apparently, Searchmonkey indexes RDFa and eRDF, but doesn't follow <link rel="alternate"/> links towards RDF representations (nor does it try content negotiation). So in order to create a Searchmonkey application for BBC Programmes, I would have needed either to include RDFa in all the pages (which, ahem, was difficult to do in an hour :-) ) or to write an XSLT against our RDF/XML representations, which would just be Wrong, as there are lots of different ways to serialise the same RDF in an RDF/XML document.

We also learned about the Guardian Open Platform and Data Store, which holds a huge amount of interesting information. The license terms are also really permissive, even allowing commercial use of the data. I can't even imagine how useful this data would be if it were linked to other open datasets, e.g. DBpedia, Geonames or Eurostat.

I also got a bit confused by YQL, which seems really similar to SPARQL, at least in the underlying concept ("a query language for the web"). However, it is backed by lots of interesting data: almost all of Yahoo's services, and a few third-party wrappers, e.g. for Last.fm. I wonder how hard it would be to write a SPARQL end-point that wraps YQL queries?

Finally, on Saturday evening and Sunday morning, we got some time to actually hack :-) Kurt made a nice MySpace hack, which does an artist lookup on MySpace using BOSS and exposes relevant information extracted using the DBTune RDF wrapper, without having to look at an overloaded MySpace page. It uses the Yahoo Media Player to play the audio files this page links to.

At the same time, we got around to trying out some of the things that can be built using the linked data we publish at the BBC, especially the segment RDF I announced on the linked data mailing list a couple of weeks ago. We built a small application which, from a place, gives you BBC programmes featuring an artist that is related in some way to that place. For example, Cardiff, Bristol, London or Lancashire. It might be a bit slow (and the number of results is limited) as I didn't have time to implement any sort of caching: the application crawls from DBpedia to BBC Music to BBC Programmes at each request. I just put the (really hacky) code online.

And we actually won the Backstage prize with these hacks! :-)

This last hack illustrates to some extent the things we are investigating as part of the BBC use-cases of the NoTube project. Using these rich connections between things (programmes, artists, events, locations, etc.), it becomes possible to provide data-rich recommendations backed by real stories (and not just "if you like this, you may like that"). I mentioned these issues in the last chapter of my thesis, and will try to follow up on that here!

Friday 17 April 2009

Brands, series, categories and tracklists on the new BBC Programmes

I just posted a small article on the BBC Radio Labs blog about the new features of the BBC Programmes website. Hopefully that makes some sense and highlights some of the things we've been working on over the last six months! Spoiler: lots of nice RDF :-)

Tuesday 24 March 2009

A sneak peek at the BBC Music RDF

The new BBC Music website was launched yesterday, with a lot of Linked Data and RDF goodness. BBC Music provides a truly RESTful API. Congratulations to the whole team, they did an amazing job! In short, this means that you can easily build applications on top of BBC Music data.

Each artist in BBC Music has an RDF representation. For example, Nirvana has one, which exposes the aggregated BBC data about the band. The site also supports content negotiation, so doing

$ curl -L -H "Accept: application/rdf+xml" http://www.bbc.co.uk/music/artists/5b11f4ce-a62d-471e-81fc-a69a8278c7da

will lead you to the RDF representation.
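The same request can be made from Python with just the standard library. This sketch only builds the request; calling urllib.request.urlopen(req) would then follow the redirect to the RDF document, like curl -L does.

```python
import urllib.request

# Ask for RDF/XML via the Accept header, mirroring the curl call above.
artist = ("http://www.bbc.co.uk/music/artists/"
          "5b11f4ce-a62d-471e-81fc-a69a8278c7da")
req = urllib.request.Request(artist,
                             headers={"Accept": "application/rdf+xml"})
```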

Note that this representation includes links to further URIs, allowing you to discover more data, e.g. about members of that band. It also includes an owl:sameAs link to the corresponding DBpedia resource, allowing you to aggregate more data about that band, extracted from Wikipedia's infoboxes.

As an example of a "linked data journey", you can get from Nirvana to Krist Novoselic to the corresponding Krist Novoselic in DBpedia to Compton, California to N.W.A. Lots of really rich data to do interesting things with, like, say, a music recommender :-)

BBC Music also includes RDF representations of reviews, e.g. that one. It also includes an RDF representation of the A to Z, and a search interface returning RDF links to matching artists. For example, here are the results of a search for "Bad Religion", which include a link to an RDF document about the band on BBC Music.

Congrats again to Patrick and Nicholas, who did this work on the RDF side of BBC Music!

Tuesday 10 February 2009

Thesis uploaded!

I just uploaded my PhD thesis entitled A Distributed Music Information System, which I defended on the 22nd of January. My examiners were David de Roure from University of Southampton and Nicolas Gold from King's College. My PhD supervisor was Mark Sandler.

Here is the abstract:

Information management is an important part of music technologies today, covering the management of public and personal collections, the construction of large editorial databases and the storage of music analysis results. The information management solutions that have emerged for these use-cases are still isolated from each other. The information one of these solutions manages does not benefit from the information another holds.

In this thesis, we develop a distributed music information system that aims at gathering music-related information held by multiple databases or applications. To this end, we use Semantic Web technologies to create a unified information environment. Web identifiers correspond to any items in the music domain: performance, artist, musical work, etc. These web identifiers have structured representations permitting sophisticated reuse by applications, and these representations can quote other web identifiers leading to more information.

We develop a formal ontology for the music domain. This ontology allows us to publish and interlink a wide range of structured music-related data on the Web. We develop an ontology evaluation methodology and use it to evaluate our music ontology. We develop a knowledge representation framework for combining structured web data and analysis tools to derive more information. We apply these different technologies to publish a large amount of pre-existing music-related datasets on the Web. We develop an algorithm to automatically relate such datasets among each other. We create automated music-related Semantic Web agents, able to aggregate musical resources, structured web data and music processing tools to derive and publish new information. Finally, we describe three of our applications using this distributed information environment. These applications deal with personal collection management, enhanced access to large audio streams available on the Web and music recommendation.

So far, just a PDF is available, as I am still fighting with LaTeX2HTML, but there will be an HTML version some time soon :-) I am also planning to upload, at the same place, some extra annexes and extra results I didn't include in the main document. I think I will also blog here about some of the things included in this thesis.

In case you just want to jump to a particular chapter, here are some keywords for the different thesis chapters:

  1. Introduction
  2. Knowledge Representation and Semantic Web technologies: FOL, Description Logics, RDF, Linked Data, OWL, N3.
  3. Conceptualisation of music-related information: web ontologies, music ontology, time ontology, event ontology, workflow-based modelling
  4. Evaluation of the Music Ontology framework: ontology evaluation, data-driven evaluation, task-based evaluation, Latent Dirichlet Allocation
  5. Music processing workflows on the Web: workflows, concurrent transaction logic, N3, N3-Tr, DLP, publication of dynamically generated results, Semantic Web Services
  6. A web of music-related data: linking open data, dbtune, automated interlinking, quantification of structured web data
  7. Automated music processing agents: N3-Tr, Henry, music analysis, workflows, prolog
  8. Case studies: gnat, gnarql, personal music collection management, zempod, music recommendation
  9. Conclusion
