During the summer we (the Internet Research & Future Services team in BBC R&D) built a prototype around the BBC World Service Archive, combining automated metadata extraction and crowdsourcing through various Semantic Web technologies and datasets (namely, 4store and DBpedia). We just finished three blog posts describing various aspects of it:
Friday 19 August 2011
4Store stuff
By Yves on Friday 19 August 2011, 10:29
Update: The repository below is not maintained anymore, as official packages have been pushed into Debian. They are not yet available for Ubuntu 11.04 though. In order to install 4store on Natty you'd have to install the following packages from the Oneiric repository, in order:
- libyajl1
- libgmp10
- libraptor2
- librasqal3
- lib4store0
- 4store
And you should have a running 4store (1.1.3).
Old post, for reference: I've been playing a lot with Garlik's 4store recently, and I have been building a few things around it. I just finished building packages for Ubuntu Jaunty, which you can get by adding the following lines in your /etc/apt/sources.list:
deb http://moustaki.org/apt jaunty main
deb-src http://moustaki.org/apt jaunty main
And then, an apt-get update && apt-get install 4store should do the trick. The packages are available for i386 and amd64. It is also one of my first packages, so feedback is welcomed (I may have gotten it completely wrong). After being installed, you can create a database and start a SPARQL server.
I've also been writing two client libraries for 4store, both available on Github:
- 4store-php, a PHP library to interact with 4store over HTTP (so not exactly similar to Alexandre's PHP library, which interacts with 4store through the command-line tools);
- 4store-ruby, a Ruby library to interact with 4store over HTTP or HTTPS.
Tuesday 28 June 2011
Using RDFa for testing templates
By Yves on Tuesday 28 June 2011, 11:52
I promised a bunch of people from the BBC that I would write about this, so here it is! This is my first Software Engineering-related post, bear with me :)
We recently released a new iteration of /programmes, built on top of a completely new technology stack. As part of that move, we decided we wanted to review our template testing strategy. In our old application, the process for a new feature would basically be:
- Software Engineer writes models, controllers and data views;
- Web Developer writes XHTML templates;
- Software Engineer writes controller tests, which would actually test the routes, the controllers and the templates (such controller 'unit tests' are actually fairly standard across MVC frameworks, for some reason - e.g. in Zend)
Those controller tests were based on CSS selectors or XPaths. Therefore, any time a small front-end tweak needed to be done, the controller tests would break, which is very annoying for everyone.
We had two problems:
- Our controller 'unit tests' were not really unit tests - front-end developers shouldn't have to understand the whole routing, controllers, models for making a front end change.
- Using CSS selectors or XPaths for template tests is brittle. We don't want our tests to break every time we change the name of a CSS class.
In order to solve the first problem, we divided our controller tests into route tests (here is a route, assert that the application forwards it to the right controller/action with the right parameters), real unit controller tests (mock the models, call an action with some request parameters, check that the right data is sent to the view), and template tests.
In order to solve our second problem, we made those template tests rely on RDFa markup embedded within the page. In order to test a template, we create some mock data, generate the template using this data, and check the right RDF triples can be extracted from the resulting page. It ensures that tests are actually based on data - front-end changes won't have an impact on them. We just want to make sure we present the right data to the user. As the tests are not relying on other application code, it also means that someone writing the templates can maintain his own test suite.
A simple example of one of these tests is the following one:
public function testLetterSearch()
{
    $this->setDefaultComponent('/components/atoz/letters');
    $data = (object) array(
        "by" => "by",
        "search" => "b",
        "slice" => "player",
        "letters" => array('@', 'a', 'b', 'c'),
    );
    $this->assertTriples($data, array(
        array('/programmes/a-z/by/b', 'rdfs:seeAlso', '/programmes/a-z/by/%40/player'),
        array('/programmes/a-z/by/b', 'rdfs:seeAlso', '/programmes/a-z/by/a/player'),
        array('/programmes/a-z/by/b', 'rdfs:seeAlso', '/programmes/a-z/by/c/player'),
    ));
}
which can be read as: "if you are displaying a list of letters on an A-Z page and you have selected one, you shouldn't link to that one".
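The render-then-extract idea behind such a test can be sketched outside of our PHP stack. The following is a minimal Python illustration, not our actual test harness: `render_letters` is a made-up stand-in for the template, and `TinyRdfaExtractor` only understands a tiny subset of RDFa (an `about` attribute sets the subject, `rel` plus `href` emits a triple). A real RDFa parser handles far more than this.

```python
from html.parser import HTMLParser
from urllib.parse import quote


class TinyRdfaExtractor(HTMLParser):
    """Hypothetical, deliberately tiny RDFa subset: @about, @rel and @href only."""

    VOID_TAGS = {"br", "img", "hr", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.subjects = [None]  # stack of subjects currently in scope
        self.triples = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # @about changes the subject for this element and its children
        subject = attrs.get("about", self.subjects[-1])
        if subject is not None and "rel" in attrs and "href" in attrs:
            self.triples.add((subject, attrs["rel"], attrs["href"]))
        if tag not in self.VOID_TAGS:
            self.subjects.append(subject)

    def handle_endtag(self, tag):
        if tag not in self.VOID_TAGS and len(self.subjects) > 1:
            self.subjects.pop()


def render_letters(data):
    """Made-up stand-in for the A-Z letters template."""
    page = "/programmes/a-z/%s/%s" % (data["by"], data["search"])
    items = []
    for letter in data["letters"]:
        if letter == data["search"]:
            # The selected letter is displayed but not linked
            items.append("<li class='selected'>%s</li>" % letter)
        else:
            href = "/programmes/a-z/%s/%s/%s" % (
                data["by"], quote(letter, safe=""), data["slice"])
            items.append('<li><a rel="rdfs:seeAlso" href="%s">%s</a></li>'
                         % (href, letter))
    return '<ul about="%s">%s</ul>' % (page, "".join(items))


def extract_triples(html):
    parser = TinyRdfaExtractor()
    parser.feed(html)
    return parser.triples


data = {"by": "by", "search": "b", "slice": "player",
        "letters": ["@", "a", "b", "c"]}
triples = extract_triples(render_letters(data))
```

Changing the CSS class or swapping the list for another element leaves `triples` untouched, which is the whole point: the test only breaks when the data presented to the user changes.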
Another very nice side-effect is that developers have a motivation to put lots of RDFa inside our pages! Compare the 168 triples extracted from a new /programmes page, including full tracklist information, programme metadata and broadcasts, to the 18 triples extracted from an old one. And as we add new components to this page, more RDFa will become available.
Also, the speed at which our developers picked up RDFa (1.0, not even 1.1, which is apparently simpler) defeats the eternal argument about RDFa being too complicated, but that's just my opinion :) The RDFa cheat sheet has proved immensely helpful.
Thursday 19 May 2011
Music and the Semantic Web workshop
By Yves on Thursday 19 May 2011, 17:30
Quite a big week for Semantic Web and music last week in London. On Thursday, there was the MusicNet workshop, with (among others) a talk from the BBC, given by Nick Humfrey. Sadly, I could not attend the workshop due to other BBC duties.
On the Friday, David De Roure and I organised a Music and the Semantic Web workshop at the AES, which David already blogged about. We had four panelists:
- David Bretherton, from Southampton University, representing the MusicNet project;
- Gregg Kellogg, representing the Connected Media Experience;
- Evan Stein, from Decibel
- Alexandre Passant, from Seevl (which I am also involved in - more details on this blog soon!)

We started with four presentations, one from each of the panelists. David started by presenting MusicNet, aiming at creating canonical Web identifiers for classical music composers. He demonstrated a new tool for merging identifiers for music composers - finding common properties between groups of composers, and providing an interface to review those groups.
Alexandre then presented Seevl. He explained how it worked, aggregating and consolidating structured data about music artists from a range of different places, and generating recommendations using this data, as well as explanations of these recommendations (those two artists played together, they had the same producer for their first album, etc.). I wrote quite a lot about this kind of thing on this blog - it's really nice to finally see it taking shape!
Gregg presented his experience within the Connected Media Experience (CME) project. His presentation was supported by a position paper, which is extremely interesting. CME (a large consortium of key music industry players) worked for a couple of years on a RDF/Music Ontology format for online releases. Sadly, they recently abandoned this format for simple HTML5+CSS - structured data about those releases is not a priority anymore. Gregg gave us insights on what went wrong, and what were the lessons to be learned by both the Semantic Web community and the Music Industry.
Evan then presented Decibel, giving us very interesting insights about music metadata, and a demonstration of their service. It was interesting to see semantic technologies used in a completely different model. The richness of the data they hold is truly amazing (Evan demoed their internationalisation feature as well - all their data is available in a variety of languages), but sadly not available under an open license.
After that, we had a number of questions from David and me, as well as from the audience, about ease of editing and ownership of music metadata (who should own it? third parties? artists? record labels? who should host the canonical URI for an artist?), and about relationships of Semantic Web standards with industry standards like ISRCs, ISNIs, MPEG etc.
Overall, a very interesting workshop - I hope we can do it again next year!
Tuesday 21 December 2010
Named Graphs and quad stores
By Yves on Tuesday 21 December 2010, 10:53
I was recently looking again at the Named Graphs paper, and was wondering why the consensus for implementing Named Graphs support in triple stores was to store quads.
Named Graphs are defined as follows. We consider U as being the set of all web identifiers, L as being the set of all literals and B the set of all blank nodes. U, L and B are pairwise disjoint. Let N = U ∪ L ∪ B, then the set T = N × U × N is the set of all RDF triples. The set of RDF graphs G is the power set of T, and a named graph is a pair ng = (u, g), where u ∈ U and g ∈ G.
As mentioned in one of my previous posts, one thing that I am really keen on is the idea of having triples that belong to multiple graphs. This situation is already happening in the wild a fair bit. If you look at a /programmes aggregation feed (all available Radio 4 programmes starting with g), it mentions multiple programmes. All of the statements about each of these programmes also belong to the corresponding programme graph (e.g. Gardener's Question).
Nothing in the above definition forbids that from happening - each graph and sub-graph can be named. ( (a,b,c), (d,f,g) ) can be named x and ( (a,b,c) ) can be named y.
Now, most triple stores implement Named Graphs by, in fact, storing quads. They store tuples of the form (subject, predicate, object, graph). Which means that when you try to store the two graphs above, they will in fact store (a,b,c,x), (d,f,g,x) and (a,b,c,y) - the triples will be replicated in each graph they are mentioned in. For the above /programmes example, it is terribly inefficient. If you crawl /programmes, you will end up hitting lots of such aggregation URLs, leading to a tremendous amount of data duplication - lots of triples will be repeated across graphs. Perhaps it would be better to store something like (a,b,c,(x,y)) and (d,f,g,(x))?
I would be curious to know if anyone else is hitting this issue. I can think of multiple use-cases for triples belonging to multiple graphs: access control, versioning, attribution at different granularities...
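To make the difference between the two layouts concrete, here is a small Python sketch using the x and y graphs from above. It is only an illustration of the two data shapes, not a description of how any particular store is implemented; the function names are made up.

```python
from collections import defaultdict

# The two named graphs used as an example in this post
named_graphs = {
    "x": {("a", "b", "c"), ("d", "f", "g")},
    "y": {("a", "b", "c")},
}


def as_quads(graphs):
    """Quad layout: one (s, p, o, graph) row per triple per graph."""
    return {(s, p, o, name)
            for name, triples in graphs.items()
            for (s, p, o) in triples}


def as_triple_index(graphs):
    """Alternative layout: each triple stored once, with the set of graphs it is in."""
    index = defaultdict(set)
    for name, triples in graphs.items():
        for triple in triples:
            index[triple].add(name)
    return dict(index)


quads = as_quads(named_graphs)
index = as_triple_index(named_graphs)
```

Here `quads` holds three rows, with (a,b,c) stored twice, while `index` holds two entries, with (a,b,c) mapped to both x and y. The gap grows with every aggregation feed that repeats statements about the same programmes.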
Wednesday 20 October 2010
Linking Enterprise Data
By Yves on Wednesday 20 October 2010, 10:23
Just a quick note to mention that the Linking Enterprise Data book is now available, for free, online. Along with Tom Scott, Silver Oliver, Patrick Sinclair and Michael Smethurst, we wrote a chapter on the use of Semantic Web technologies at the BBC, which expands on the W3C Case Study we wrote at the beginning of the year. If you're interested in how Semantic Web technologies were used to build BBC Programmes, BBC Music and Wildlife Finder, make sure you read it (I also noticed it was available for pre-order on Amazon).

Tuesday 7 September 2010
Geocaching for music wins the 7digital prize at London Music Hack Day!
By Yves on Tuesday 7 September 2010, 09:34
We had a really good time last week-end at the London Music Hack Day. If you haven't seen the list of hacks that were done over the week-end yet, go take a look! I found Speakatron, BumbleTab and Earth Destroyers to be particularly funny :)
Andrew, Chris and I (as well as our beloved video producer Patrick) created an Android application to create and find 'musical treasures'. Think geocaching for music. You wander around and can drop tracks from your personal music collection, and you see what tracks people have dropped near you. If you're close enough, you can fetch and play the tracks. And we won the 7digital prize!
Here is a small video demonstration, and below are some screenshots:


On the data side, we use the recently linked 7digital and Echonest APIs. The back-end was written using Rails (check out the Cucumber tests!), and the Android application using the Android SDK. A list of clues for recently dropped tracks appears on the main website and on a twitter feed. And of course, here is the APK for the Android application.
Tuesday 13 July 2010
First BBC microsite powered by a triple-store
By Yves on Tuesday 13 July 2010, 15:46
Jem Rayfield wrote a very interesting post on the technologies used by the World Cup BBC web site, which also got covered by Read Write Web.
All this is very exciting: the World Cup website proved that triple store technologies can be used to drive a production website with significant traffic. I am expecting lots more parts of the BBC web infrastructure to evolve in the same way :-)
There are two issues we are still currently trying to solve though:
- We need to be able to cluster our triples in several dimensions. For example, we may want to have a graph for a particular programme, and a much larger graph for a particular dataset (e.g. programme data, wildlife finder data, world cup data). The smaller graph is used to make our updates relatively cheap (we replace the whole graph whenever we receive an update). The bigger graph is used to give some degree of isolation between the different sources of data. For that, we need "graphs within graphs". It can be done with N3-type graph literals, but is impossible to achieve in a standard quad-store setup, where one single triple can't be part of several graphs.
- With regards to programme data, the main bottleneck we're facing is the number of updates per second we need to be able to process, which most of the available triple stores struggle to keep up with. The 4store instance on DBTune does keep up, but it has a negative impact on querying performance, as the write operations block the reads. We were quite surprised to see that the available triple store benchmarks do not take write throughput into account!
Wednesday 19 May 2010
DBpedia and BBC Programmes
By Yves on Wednesday 19 May 2010, 12:18
We just put live a new exciting feature on BBC Programmes: programme aggregations powered by DBpedia. For example, you can look at:
- Programmes about adolescence
- Programmes about California
- Programmes about personal computers
- Programmes about collectables
- Programmes about Winnie Mandela
Of course, the RDF representations are linked up to DBpedia. Try loading adolescence in the Tabulator, for example - you will get an immediate mashup of BBC data, DBpedia data, and Freebase data. Or if you're not afraid of getting overloaded with data, try the California one.
One of the most interesting things about using web identifiers as tags for our programmes (apart from being able to automatically generate those aggregation pages, of course), is that we can use ancillary information about those tags to create new sorts of aggregations, and new visualisations of our data. We could for example plot all our Radio 3 programmes on a map, depending on the geolocation of the people associated to these programmes. Or we could create an aggregation of BBC programmes featuring artists living in the cities with the highest rainfall (why not?). And, of course, this will be a fantastic new source of data for the MusicBore! The possibilities are basically endless, and we are very excited about it!
Friday 29 January 2010
BBC Semantic Web use-case
By Yves on Friday 29 January 2010, 15:37
After a very long time writing it, we finally have a BBC Semantic Web use-case on the W3C website! It describes work we did around BBC Programmes, BBC Music, BBC Wildlife Finder and Search+. I hope it all makes a bit of sense :-) For a more detailed writeup about these issues, Patrick's Linked Data on the BBC are very good.
Thursday 14 January 2010
Live SPARQL end-point for BBC Programmes
By Yves on Thursday 14 January 2010, 12:30
Update: We seem to have an issue with the 4store hosting the dataset, so the data has been stale since the end of February. Update 2: All should be back to normal and in sync. Please comment on this post if you spot any issue, or general slowness.
Last year, we got OpenLink and Talis to crawl BBC Programmes and provide two SPARQL end-points on top of the aggregated data. However, getting the data by crawling it means that the end-points did not have all the data, and that the data can get quite outdated -- especially as our programme data changes a lot.
At the moment, our data comes from two sources: PIPs (the central programme database at the BBC) and PIT (our content management system for programme information). In order to populate the /programmes database, we monitor changes on these two sources and replicate them on our database. We have a small piece of Ruby/ActiveRecord software (that we call the Tapp) which handles this process.
I made a small experiment, converting our ActiveRecord objects to RDF and hooking an HTTP POST or an HTTP DELETE request to a 4store instance for each change we receive. This means that this 4store instance is kept in sync with upstream data sources.
It took a while to backfill, but it is now up-to-date. Check out the SPARQL end-point, a test SPARQL query form and the size of the endpoint (currently about 44 million triples).
The end-point holds all information about services, programmes, categories, versions, broadcasts, ondemands, time intervals and segments, as defined within the Programme Ontology. All of these resources are held within their own named graph, which means we have a very large number of graphs (about 5 million). It makes it far easier to update the endpoint, as we can just replace the whole graph whenever something changes for a resource.
This is still highly experimental though, and I already found a few bugs: some episodes seem to be missing (for example, some Strictly Come Dancing episodes are missing, for some reason). I've also encountered some really weird crashes of the machine hosting the end-point when concurrently pushing a large number of RDF documents at it - I have not yet managed to identify the cause. To summarise: it might die without notice :-)
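The "one named graph per resource, replaced on every change" approach can be sketched roughly as follows. This is not the Tapp code (which is Ruby). It is a Python illustration that follows the conventions of the later SPARQL 1.1 Graph Store HTTP Protocol (a `?graph=` parameter, PUT to replace a graph, DELETE to drop it) rather than the exact requests we send. The `STORE` URL and the `sync_request` helper are hypothetical, and the sketch only builds the requests without sending them.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical graph store endpoint; adjust to whatever your store exposes
STORE = "http://localhost:8000/data"


def sync_request(graph_uri, turtle=None):
    """Build the HTTP request mirroring one upstream change.

    With a Turtle payload, the named graph is replaced as a whole;
    without one, the named graph is deleted.
    """
    url = STORE + "?" + urlencode({"graph": graph_uri})
    if turtle is None:
        return Request(url, method="DELETE")
    return Request(url, data=turtle.encode("utf-8"), method="PUT",
                   headers={"Content-Type": "text/turtle"})


# One request per changed resource; pass it to urllib.request.urlopen() to send
update = sync_request("http://www.bbc.co.uk/programmes/b006m86d",
                      "<http://example.org/s> <http://example.org/p> <http://example.org/o> .")
removal = sync_request("http://www.bbc.co.uk/programmes/b006m86d")
```

Because each resource lives in its own graph, the updater never has to work out which individual triples changed: it regenerates the RDF for the resource and swaps the graph.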
Here are some example SPARQL queries:
- All programmes related to James Bond:
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?uri ?label
WHERE {
  ?uri po:category <http://www.bbc.co.uk/programmes/people/bmFtZS9ib25kLCBqYW1lcyAobm8gcXVhbGlmaWVyKQ#person> ;
       rdfs:label ?label
}
- Find all Eastenders broadcast dates after 2009-01-01, along with the type of the version that was broadcast:
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
PREFIX po: <http://purl.org/ontology/po/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?version_type ?broadcast_start
WHERE {
  <http://www.bbc.co.uk/programmes/b006m86d#programme> po:episode ?episode .
  ?episode po:version ?version .
  ?version a ?version_type .
  ?broadcast po:broadcast_of ?version .
  ?broadcast event:time ?time .
  ?time tl:start ?broadcast_start .
  FILTER ((?version_type != <http://purl.org/ontology/po/Version>) && (?broadcast_start > "2009-01-01T00:00:00Z"^^xsd:dateTime))
}
- Find all programmes that featured both the Foo Fighters and Al Green:
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?programme ?label
WHERE {
  ?event1 po:track ?track1 .
  ?track1 foaf:maker ?maker1 .
  ?maker1 owl:sameAs <http://www.bbc.co.uk/music/artists/67f66c07-6e61-4026-ade5-7e782fad3a5d#artist> .
  ?event2 po:track ?track2 .
  ?track2 foaf:maker ?maker2 .
  ?maker2 owl:sameAs <http://www.bbc.co.uk/music/artists/fb7272ba-f130-4f0a-934d-6eeea4c18c9a#artist> .
  ?event1 event:time ?t1 .
  ?event2 event:time ?t2 .
  ?t1 tl:timeline ?tl .
  ?t2 tl:timeline ?tl .
  ?version po:time ?t .
  ?t tl:timeline ?tl .
  ?programme po:version ?version .
  ?programme rdfs:label ?label .
}
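If you would rather run such queries from a script than from the query form, the SPARQL protocol lets you pass the query text in a `query` parameter of a GET request. Here is a minimal Python sketch that builds such a URL for the James Bond query above. The `ENDPOINT` value is a placeholder, not the address of the end-point described in this post, and nothing is actually fetched.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder address: substitute the SPARQL end-point you want to query
ENDPOINT = "http://example.org/sparql"

QUERY = """PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?uri ?label
WHERE {
  ?uri po:category <http://www.bbc.co.uk/programmes/people/bmFtZS9ib25kLCBqYW1lcyAobm8gcXVhbGlmaWVyKQ#person> ;
       rdfs:label ?label
}"""


def sparql_get_url(endpoint, query):
    """Return the GET URL carrying the query, as per the SPARQL protocol."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)

# Decoding the URL gives back the exact query text
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

Fetching `url` with any HTTP client should then return the result set, in whichever format the end-point negotiates.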
Tuesday 27 October 2009
Music recommendation and Linked Data
By Yves on Tuesday 27 October 2009, 02:43
We just presented yesterday at ISMIR a tutorial about Linked Data for music-related information. More information on the tutorial is available on the tutorial website, and the slides are also available.
In particular, we had two sets of slides dealing with the relationship between music recommendation and linked data. As this is something we're investigating within the NoTube project, I thought I would write up a bit more about it.
Let's focus on artist to artist recommendation for now. If we look at last.fm for recommendations for New Order, here is what we get.
Similarly, using the Echonest API for similar artists, we get back an ordered list of artists similar to New Order, including Orchestral Manoeuvres in the Dark, Depeche Mode, etc.
Now, let's play word associations for a few bands and musical genres. My colleague Michael Smethurst took the Sex Pistols, Acid House and Public Enemy, and drew the following associations:
We can see that among the different terms in these diagrams, some refer to people, to TV programmes, to fashion styles, to drugs, to music hardware, to places, to laws, to political groups, to record labels, etc. Just a couple of these terms are actually other bands or tracks. If you were to describe these artists just in musical terms, you'd probably be missing the point. And all these things are also linked to each other: you could play word associations for any of them and see what are the connections between Public Enemy and the Sex Pistols. So how does that relate to recommendations? When recommending an artist from another artist, the context is key. You need to provide an explanation of why they actually relate to each other, whether it's through common members, drugs, belonging to the same independent record label, acoustically similar (if so, how exactly), etc. The main hypothesis here being that users are much more likely to be accepting a recommendation that is explicitly backed by some contextual information.
On the BBC website, we cover quite a few domains, and we try to create as many links as possible between these domains, by following the Linked Data principles. From our BBC Music site, we can explore much more information, from other BBC content (programmes, news etc.) to other Linked Data sources, e.g. DBpedia, Freebase and Musicbrainz. This provides us with a wealth of structured information that we would ultimately want to use for driving and backing up our recommendations.
The MusicBore I've described earlier on this blog kind of uses the same approach. Playlists are generated by following paths in Linked Data. Each artist is introduced by generating a sentence from the path leading from the seed artist to the target artist. The prototype described in that paper from the SDOW workshop last year also illustrates that approach.
So we developed a small prototype of this kind of idea, rqommend (and when I say small, it is very small :) ). Basically, we define "relatedness rules" in the form of SPARQL queries, like "Two artists born in Detroit in the 60s are related". We could go for very general rules, e.g. "Any path between two artists makes them related", but it would be very hard to generate an accurate textual explanation for it, and might give some, hem, not very interesting connections. Then, we just go through these rules on an aggregation of Linked Data, and generate recommendations from them. Here is a greasemonkey script injecting such recommendations into BBC Music (see for example the Fugazi page). It injects Linked Data based recommendations, along with the associated explanation, within BBC artist pages. For example, for New Order:
To conclude, I think there is a really strong influence of traditional information retrieval systems on the music information retrieval community. But what makes Google, for example, particularly successful is to exploit links, not the documents themselves. We definitely need to go towards the same sort of model. Exploiting links surrounding music, and all the cross-domain information that makes it so rich, to create better music recommendation systems which combine the "what" is recommended with the "why" it is recommended.
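The "relatedness rules" idea from this post, where each rule yields both a recommendation and its explanation, can be sketched in a few lines. This is not the rqommend code: rqommend expresses its rules as SPARQL queries over aggregated Linked Data, whereas this Python illustration uses plain functions over a toy in-memory dataset. The artist identifiers and facts below are placeholders, not real data.

```python
# Toy data standing in for an aggregation of Linked Data (placeholder values)
facts = {
    "artist:A": {"birthplace": "Detroit", "born": 1962},
    "artist:B": {"birthplace": "Detroit", "born": 1967},
    "artist:C": {"birthplace": "Paris", "born": 1965},
}


def born_in_same_city_and_decade(a, b):
    """Rule: two artists born in the same city in the same decade are related.

    Returns a textual explanation when the rule fires, None otherwise.
    """
    fa, fb = facts[a], facts[b]
    same_city = fa["birthplace"] == fb["birthplace"]
    same_decade = fa["born"] // 10 == fb["born"] // 10
    if same_city and same_decade:
        decade = fa["born"] // 10 * 10
        return "%s and %s were both born in %s in the %ds" % (
            a, b, fa["birthplace"], decade)
    return None


RULES = [born_in_same_city_and_decade]


def recommend(seed):
    """Apply every rule to every other artist; keep (artist, explanation) pairs."""
    results = []
    for other in sorted(facts):
        if other == seed:
            continue
        for rule in RULES:
            explanation = rule(seed, other)
            if explanation:
                results.append((other, explanation))
    return results


recommendations = recommend("artist:A")
```

Keeping the rules narrow is what makes the explanation easy to generate: each rule knows exactly which connection it found, so it can state the "why" alongside the "what".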
Thursday 10 September 2009
Linked Data London event screencasts and London Web Standards meetup
By Yves on Thursday 10 September 2009, 12:03
With Tom Scott, we presented a talk on contextualising BBC programmes using linked data for the Linked Data London event. For the occasion, I made a couple of screencasts.
The first one shows some browsing of the linked data we expose on the BBC website, using the Tabulator Firefox extension. I start by getting to a Radio 2 programme, to get to its segmentation in musical tracks, to get to another programme featuring one of the tracks, to get to another artist featured in that programme. The Tabulator ends up displaying data aggregated from BBC Programmes, BBC Music and DBpedia.
Exploring BBC programmes and music data using the Tabulator
The second one shows what you can do by using these programmes/artists and artists/programmes links. We built some very straight-forward programme to programme recommendation using them. On the right-hand side of the programme page, there are recommendations, based on artists played in common. The recommendations are scoped by the availability of the programme on iPlayer or by the fact it has an upcoming broadcast. If you hover over those recommendations, it will display what allowed us to derive it: here, a list of common artists played in the two programmes. This work is part of our investigations within the NoTube European project.
Artist-based programme to programme recommendations
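The recommendation logic shown in this second screencast boils down to set intersections over tracklists. Here is a minimal Python sketch of it, with made-up programme and artist identifiers, and a made-up `available` set standing in for "on iPlayer or with an upcoming broadcast". The real implementation works on the programmes/artists links in our data rather than on in-memory dictionaries.

```python
# Placeholder data: which artists were played in which programme
tracklists = {
    "programme:p1": {"artist:a", "artist:b", "artist:c"},
    "programme:p2": {"artist:b", "artist:c"},
    "programme:p3": {"artist:d"},
    "programme:p4": {"artist:a"},
}

# Programmes that can currently be recommended (hypothetical availability scope)
available = {"programme:p2", "programme:p3"}


def recommend_programmes(seed):
    """Rank available programmes by the number of artists shared with the seed.

    Each result carries the common artists, which is what gets displayed
    as the explanation when hovering over a recommendation.
    """
    seed_artists = tracklists[seed]
    results = []
    for programme, artists in tracklists.items():
        if programme == seed or programme not in available:
            continue
        common = seed_artists & artists
        if common:
            results.append((programme, sorted(common)))
    # Most shared artists first, then by identifier for a stable order
    return sorted(results, key=lambda r: (-len(r[1]), r[0]))


recommendations = recommend_programmes("programme:p1")
```

With this data, p4 shares an artist with p1 but is filtered out because it is not available, and p3 is available but shares nothing, so only p2 is recommended.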
Also, as Michael already posted on Radio Labs, we gave a presentation to the London Web Standards group on Linked Data. It was a very nice event, especially as mainly web developers turned up. Linked data events tend to be mostly about linked data evangelists talking to other linked data evangelists (which is great too!), so this was quite different :-) Lots of interesting questions about provenance and trustworthiness of data were asked, which are always a bit difficult to answer, apart from the usual "it's just the Web, you can deal with it as you do (or don't) currently with Web data, e.g. by keeping track of provenance information and filtering based on that". Somebody suggested that you could make some statistics on how many times a particular statement is repeated in order to derive its trustworthiness, but this sounds a bit harmful... Currently on the Linked Data cloud, lots of information gets repeated. For example, if a statement about an artist is available on DBpedia, there is a fair chance it will get repeated in BBC Music, just because we also use Wikipedia as an information source. The fact that this statement gets repeated doesn't make it more valid.
Monday 13 July 2009
Music Hack Day and the MusicBore
By Yves on Monday 13 July 2009, 10:58
This week end was the Music Hackday in London. The event was great, with everything a hack day needs: pizza, beer, and technical glitches during the demos :-) And, of course, lots of awesome hacks (including that amazing visualisation which didn't make it into the list for some reason).
With Christopher, Patrick and Nick, we created the MusicBore. The MusicBore is a completely automated radio DJ, which, well, tends to be really boring :-) We actually won two prizes with it! The Last.fm one and the best hack one! We were really, really happy :-) But let the musicbore introduce itself:
Hello. I am the Music Bore. I play music and I like to tell you ALL about
the music I play. I live on IRC. I get my information from BBC Music, BBC
Programmes, last fm, the Echo Nest, Yahoo Weather and the web of Linked Data.
To find out more, please visit bit.ly/musicbore. You can dissect my disgusting
innards on github.
Now let me play you some music.
Here it is in action, the first one walks through Soul-ish tunes, whereas the second one goes into French punk-rock:
The Music Bore - Video 2 from Nicholas Humfrey on Vimeo.
The Music Bore - Video 1 from Nicholas Humfrey on Vimeo.
The MusicBore is powered by a new and exciting messaging technology: IRC :-) Lots of bots sit around in an IRC channel and talk to each other to create a radio show. The show is entirely created live, each bot contributing a specific ability to it. Just before the hack presentation, we had 10 bots in the same channel (all the bots' sources are on github):
- controller: In charge of starting the show by playing an introduction and choosing a new song drawn from the BBC Music charts. Also in charge of drawing a new song in case the other bots get stuck.
- thebore: Renders information about a particular artist from BBC Music, BBC Programmes, Wikipedia, Last.fm and other Linked Data.
- connectionfinder: Given a seed artist, gives the next one in the playlist, along with an explanation of how it was chosen. Basically walks through Linked Data to discover new artists.
- placefinder: Given a seed artist, gives the next one in the playlist, along with an explanation of how it was chosen. This bot is constrained to go through places, so will give connections like "Did you know that David Guetta is born in Paris, and that Georges Garvarentz died in the same place?"
- musicfinder: Finds music content for an artist, using BBC Programmes segment data and the Surge Radio RDF.
- trackfinder: Finds music content for an artist, using the Echonest API
- irc2play: Says sentences and plays tracks mentioned on IRC, mixing them using Jack, Madjack and JackMiniMix
- weatherbot: Renders weather information from Yahoo
- imagebot: Finds an image from BBC Music for a particular artist.
- hotnessfinder: Finds the hotness for an artist using the Echonest API, and constructs a sentence from it.
Here is a small diagram of how the different bots interact with each other (if you don't understand that diagram, that's fine, we don't either, but it does seem to work :-) ).

As I thought they were quite fun, here are the logs of the bbcmusicbore channel over the week-end. Here is a small best-of:
14:05 < yvesr> trackfinder: weezer
14:05 -!- trackfinder [n=trackfin@77.91.248.124] has quit
[Remote closed the connection]
This happens a lot at the beginning of the log, bots just throwing exceptions and quitting irc.
11:25 < thebore> connectionfinder:http://dbpedia.org/resource/David_Guetta
11:25 < connectionfinder> say:Did you know that David Guetta was born in
France which has leader name Nicolas Sarkozy, which has spouse Carla
Bruni?
11:25 < thebore> say:Carla Bruni has released 3 records. No Promises was
just terrible.
God. Nicolas Sarkozy is really in every media, even the randomly generated ones! thebore got enlightened on that one.
10:41 < yvesr> connectionfinder:http://dbpedia.org/resource/Barry_White
10:42 < connectionfinder> say:Did you know that Barry White died in the
same city as Chico Hamilton, Los Angeles?
A morbid playlist generation :-)
11:07 < thebore> say:Colin Murray on BBC Radio One is a massive fan. They
have a myspace at myspace.com/metallica. They sound a bit like Megadeth.
They have released 524 records. I really liked 2008-11-01: Rose Garden,
Portland, OR, USA.
Bootlegs FTW!
11:43 < yvesr> hotnessfinder: michael jackson
11:43 < hotnessfinder> say: Wow, michael jackson is really, really hot
right now! Did he die recently?
No comments
12:50 < thebore> connectionfinder:http://dbpedia.org/resource/Snoop_Dogg
12:50 < connectionfinder> say:Did you know that Dogg, Snoop is signed on
Death Row Records which was founded by Dr. Dre, which used to be quite
close to World Class Wreckin' Cru?
No, I didn't!
13:56 < thebore> say:Beverley Knight sounds a bit like Eternal.
In September 2005 Knight joined the likes of Bob Dylan and Joni
Mitchell by being made a Doctor of Music after receiving the honorary
degree from the University of Wolverhampton.
No, I didn't know either!
Right now, all the bots are offline (they were running on our laptops during the weekend), but we plan to use the Amazon Web Services vouchers we won to make them run for a while :-)
Monday 15 June 2009
And another fun BBC SPARQL query
By Yves on Monday 15 June 2009, 13:53
This query returns BBC programmes featuring artists originating from France (this is just a straight adaptation of the last query in my previous post).
The results are quite fun! Apparently, the big French hits on the BBC are from Jean-Michel Jarre, Air, Modjo, Phoenix (are they known in France? I've only heard of them in the UK) and Vanessa Paradis.
Note that the tracklisting data we expose in our RDF just goes back a couple of months, so that might explain why the list is not bigger.
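The adaptation mentioned above amounts to swapping the DBpedia resource in the dbprop:origin pattern of the "Northern Ireland" query from the previous post. Here is a minimal Python sketch of that substitution. The helper name origin_query is hypothetical, and using http://dbpedia.org/resource/France as the identifier for France is an assumption.

```python
from string import Template

# Template copied from the "Northern Ireland" query in the previous post;
# only the DBpedia resource in the dbprop:origin pattern is parameterised.
QUERY_TEMPLATE = Template("""\
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
PREFIX dbprop: <http://dbpedia.org/property/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?programme ?label ?artistlabel ?dbpmaker
WHERE {
  ?event1 po:track ?track1 .
  ?track1 foaf:maker ?maker .
  ?maker rdfs:label ?artistlabel .
  ?maker owl:sameAs ?dbpmaker .
  ?dbpmaker dbprop:origin <$origin> .
  ?event1 po:time ?t1 .
  ?t1 tl:timeline ?tl .
  ?version po:time ?t .
  ?t tl:timeline ?tl .
  ?programme po:version ?version .
  ?programme rdfs:label ?label .
}
""")


def origin_query(origin_uri):
    """Return the programme query for artists originating from origin_uri.

    Hypothetical helper, for illustration only.
    """
    return QUERY_TEMPLATE.substitute(origin=origin_uri)


# Assumed DBpedia identifier for France.
france_query = origin_query("http://dbpedia.org/resource/France")
print(france_query)
```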
Thursday 11 June 2009
BBC SPARQL end-points
By Yves on Thursday 11 June 2009, 00:15
We recently announced on the BBC backstage blog the availability of two SPARQL end-points, one hosted by Talis and one by OpenLink. These two companies aggregated the RDF data we publish at http://www.bbc.co.uk/programmes and http://www.bbc.co.uk/music. This opens up quite a lot of fascinating SPARQL queries. Talis already compiled a small list, and here are a couple I just designed:
- Give me programmes that deal with the fictional character James Bond - results
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?uri ?label
WHERE {
?uri po:person
<http://www.bbc.co.uk/programmes/people/bmFtZS9ib25kLCBqYW1lcyAobm8gcXVhbGlmaWVyKQ#person> ; rdfs:label ?label
}
- Give me artists that were featured in the same programme as the Foo Fighters - results
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
SELECT DISTINCT ?artist2 ?label2
WHERE {
?event1 po:track ?track1 .
?track1 foaf:maker <http://www.bbc.co.uk/music/artists/67f66c07-6e61-4026-ade5-7e782fad3a5d#artist> .
?event2 po:track ?track2 .
?track2 foaf:maker ?artist2 .
?artist2 rdfs:label ?label2 .
?event1 po:time ?t1 .
?event2 po:time ?t2 .
?t1 tl:timeline ?tl .
?t2 tl:timeline ?tl .
FILTER (?t1 != ?t2)
}
- Give me programmes that featured both Al Green and the Foo Fighters (yes! there is one result!!) - results
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
SELECT DISTINCT ?programme ?label
WHERE {
?event1 po:track ?track1 .
?track1 foaf:maker <http://www.bbc.co.uk/music/artists/67f66c07-6e61-4026-ade5-7e782fad3a5d#artist> .
?event2 po:track ?track2 .
?track2 foaf:maker <http://www.bbc.co.uk/music/artists/fb7272ba-f130-4f0a-934d-6eeea4c18c9a#artist> .
?event1 po:time ?t1 .
?event2 po:time ?t2 .
?t1 tl:timeline ?tl .
?t2 tl:timeline ?tl .
?version po:time ?t .
?t tl:timeline ?tl .
?programme po:version ?version .
?programme rdfs:label ?label .
}
- All programmes that featured an artist originating from Northern Ireland - results
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
PREFIX dbprop: <http://dbpedia.org/property/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?programme ?label ?artistlabel ?dbpmaker
WHERE {
?event1 po:track ?track1 .
?track1 foaf:maker ?maker .
?maker rdfs:label ?artistlabel .
?maker owl:sameAs ?dbpmaker .
?dbpmaker dbprop:origin <http://dbpedia.org/resource/Northern_Ireland> .
?event1 po:time ?t1 .
?t1 tl:timeline ?tl .
?version po:time ?t .
?t tl:timeline ?tl .
?programme po:version ?version .
?programme rdfs:label ?label .
}
(Note that we just need the owl:sameAs in the above query as
the Talis end-point doesn't support inference)
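If you would rather send these queries from a script than from a web form, the standard SPARQL protocol passes the query text in a "query" parameter over HTTP. Here is a minimal Python sketch that builds such a request without sending it. The end-point URL is a placeholder rather than the real Talis or OpenLink address, and the helper name build_sparql_request is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder end-point URL: an assumption, not one of the real end-points.
ENDPOINT = "http://example.org/sparql"

# The James Bond query from the list above.
JAMES_BOND_QUERY = """\
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?uri ?label
WHERE {
  ?uri po:person
    <http://www.bbc.co.uk/programmes/people/bmFtZS9ib25kLCBqYW1lcyAobm8gcXVhbGlmaWVyKQ#person> ; rdfs:label ?label
}
"""


def build_sparql_request(endpoint, query):
    """Build (but do not send) a SPARQL protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL XML results format.
    return Request(url, headers={"Accept": "application/sparql-results+xml"})


request = build_sparql_request(ENDPOINT, JAMES_BOND_QUERY)
print(request.full_url)

# To run it against a live end-point, something like this should work:
# from urllib.request import urlopen
# print(urlopen(request).read().decode("utf-8"))
```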
Let us know what kind of queries you can come up with using this data! :-)
Tuesday 12 May 2009
Yahoo Hackday 2009
By Yves on Tuesday 12 May 2009, 16:56

We went to the Yahoo Hackday this weekend, with a couple of people from the C4DM and the BBC. Apart from a flaky wireless connection on the Saturday, it was a really great event, with lots of interesting talks and interesting hacks.
On the Saturday, we learned about Searchmonkey. I tried to create
a small searchmonkey application during the talk, but eventually got
frustrated. Apparently, Searchmonkey indexes RDFa and eRDF, but
doesn't follow <link rel="alternate"/> links towards RDF
representations (neither does it try to do content negotiation). So
in order to create a searchmonkey application for BBC Programmes, I needed to either
include RDFa in all the pages (which, hem, was difficult to do in an hour :-) )
or write an XSLT against our RDF/XML representations, which would just be
Wrong, as there are lots of different
ways to serialise the same RDF in an RDF/XML document.
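To make the serialisation problem concrete, here is a small Python sketch with two made-up RDF/XML documents that carry the same triple, once as a property element and once as a property attribute. The example URI, the title and both helper functions are assumptions for illustration, and the tiny extractor only understands these two forms (it is not an RDF/XML parser).

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Same (made-up) statement, written as a property element...
AS_ELEMENT = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/programme">
    <dc:title>Example programme</dc:title>
  </rdf:Description>
</rdf:RDF>
"""

# ...and as a property attribute.
AS_ATTRIBUTE = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/programme"
                   dc:title="Example programme"/>
</rdf:RDF>
"""


def literal_triples(rdfxml):
    """Extract literal-valued triples from the two forms above only."""
    triples = set()
    for node in ET.fromstring(rdfxml):
        subject = node.get("{%s}about" % RDF_NS)
        # Property attributes: every non-rdf attribute is a predicate.
        for name, value in node.attrib.items():
            if not name.startswith("{%s}" % RDF_NS):
                triples.add((subject, name, value))
        # Property elements: every child element is a predicate.
        for child in node:
            triples.add((subject, child.tag, (child.text or "").strip()))
    return triples


def property_elements(rdfxml):
    """Child element names, roughly what a path-based XSLT would match on."""
    return sorted(child.tag for node in ET.fromstring(rdfxml) for child in node)


same_triples = literal_triples(AS_ELEMENT) == literal_triples(AS_ATTRIBUTE)
same_structure = property_elements(AS_ELEMENT) == property_elements(AS_ATTRIBUTE)
print(same_triples, same_structure)
```

An XSLT template matching the dc:title element would find the title in the first document and miss it in the second, even though both describe the same RDF.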
We also learned about the Guardian Open Platform and Data Store, which holds a huge amount of interesting information. The license terms are also really permissive, even allowing commercial uses of this data. I can't even imagine how useful this data would be if it were linked to other open datasets, e.g. DBpedia, Geonames or Eurostat.
I also got a bit confused by YQL, which seems to be really similar to SPARQL, at least in the underlying concept ("a query language for the web"). However, it seems to be backed by lots of interesting data: almost all of Yahoo's services, and a few third-party wrappers, e.g. for Last.fm. I wonder how hard it would be to write a SPARQL end-point that would wrap YQL queries?
Finally, on Saturday evening and Sunday morning, we got some time to actually hack :-) Kurt made a nice MySpace hack, which does an artist lookup on MySpace using BOSS and exposes relevant information extracted using the DBTune RDF wrapper, without having to look at an overloaded MySpace page. It uses the Yahoo Media Player to play the audio files this page links to.
At the same time, we got around to trying out some of the things that can be built using the linked data we publish at the BBC, especially the segment RDF I announced on the linked data mailing list a couple of weeks ago. We built a small application which, from a place, gives you BBC programmes that feature an artist that is related in some way to that place. For example, Cardiff, Bristol, London or Lancashire. It might be a bit slow (and the number of results is limited) as I didn't have time to implement any sort of caching. The application is crawling from DBpedia to BBC Music to BBC Programmes at each request. I just put the (really hacky) code online.
And we actually won the Backstage prize with these hacks! :-)
This last hack illustrates to some extent the things we are investigating as part of the BBC use-cases of the NoTube project. Using these rich connections between things (programmes, artists, events, locations, etc.), it begins to be possible to provide data-rich recommendations backed by real stories (and not only "if you like this, you may like that"). I mentioned these issues in the last chapter of my thesis, and will try to follow up on that here!
Friday 17 April 2009
Brands, series, categories and tracklists on the new BBC Programmes
By Yves on Friday 17 April 2009, 17:04
I just posted a small article on the BBC Radio Labs blog about the new features of the BBC Programmes website. Hopefully that makes some sense and highlights some of the things we've been working on over the last six months! Spoiler: lots of nice nice RDF :-)
Tuesday 24 March 2009
A sneak peek at the BBC Music RDF
By Yves on Tuesday 24 March 2009, 10:15
The new BBC Music website was launched yesterday, with a lot of Linked Data and RDF goodness. BBC Music provides a truly RESTful API. Congratulations to the whole team, they did amazing work! In short, that means that you can quite easily build applications on top of BBC Music data.
Each artist in BBC Music has an RDF representation. For example, Nirvana has an RDF representation, which exposes the aggregated BBC data about this band. The site also supports content negotiation, so doing
$ curl -L -H "Accept: application/rdf+xml" http://www.bbc.co.uk/music/artists/5b11f4ce-a62d-471e-81fc-a69a8278c7da
will lead you to the RDF representation.
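The same content negotiation can be done from a script. Here is a minimal Python sketch that builds the equivalent request without sending it. The helper name rdf_request is hypothetical, and whether the site still answers this way today is an assumption.

```python
from urllib.request import Request

# The Nirvana artist URI used in the curl command above.
ARTIST_URI = "http://www.bbc.co.uk/music/artists/5b11f4ce-a62d-471e-81fc-a69a8278c7da"


def rdf_request(uri):
    """Build (but do not send) a request asking for RDF/XML via the Accept header."""
    return Request(uri, headers={"Accept": "application/rdf+xml"})


request = rdf_request(ARTIST_URI)
print(request.get_header("Accept"))

# Sending it would look like this; urlopen follows redirects by default,
# much like the -L flag of curl:
# from urllib.request import urlopen
# print(urlopen(request).read().decode("utf-8"))
```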
Note that this representation includes links to further URIs, allowing you
to discover more data, e.g. about members of that band. It also includes an
owl:sameAs link to the corresponding DBpedia resource, allowing you to aggregate more data
about that band, extracted from Wikipedia's infoboxes.
As an example of a "linked data journey", you can get from Nirvana to Krist Novoselic to the corresponding Krist Novoselic in DBpedia to Compton, California to N.W.A. Lots of really rich data to do interesting things, like, say, a music recommender :-)
BBC Music also includes RDF representations of reviews, e.g. that one. It also includes an RDF representation of the A to Z, and a search interface returning RDF links to matched artists. For example, here are the results of a search for "Bad Religion", which include a link to an RDF document about it on BBC Music.
Congrats again to Patrick and Nicholas, who did this work on the RDF side of BBC Music!
Tuesday 10 February 2009
Thesis uploaded!
By Yves on Tuesday 10 February 2009, 11:30
I just uploaded my PhD thesis, entitled A Distributed Music Information System, which I defended on the 22nd of January. My examiners were David de Roure from University of Southampton and Nicolas Gold from King's College. My PhD supervisor was Mark Sandler.
Here is the abstract:
Information management is an important part of music technologies today, covering the management of public and personal collections, the construction of large editorial databases and the storage of music analysis results. The information management solutions that have emerged for these use-cases are still isolated from each other. The information one of these solutions manages does not benefit from the information another holds.
In this thesis, we develop a distributed music information system that aims at gathering music-related information held by multiple databases or applications. To this end, we use Semantic Web technologies to create a unified information environment. Web identifiers correspond to any items in the music domain: performance, artist, musical work, etc. These web identifiers have structured representations permitting sophisticated reuse by applications, and these representations can quote other web identifiers leading to more information.
We develop a formal ontology for the music domain. This ontology allows us to publish and interlink a wide range of structured music-related data on the Web. We develop an ontology evaluation methodology and use it to evaluate our music ontology. We develop a knowledge representation framework for combining structured web data and analysis tools to derive more information. We apply these different technologies to publish a large amount of pre-existing music-related datasets on the Web. We develop an algorithm to automatically relate such datasets among each other. We create automated music-related Semantic Web agents, able to aggregate musical resources, structured web data and music processing tools to derive and publish new information. Finally, we describe three of our applications using this distributed information environment. These applications deal with personal collection management, enhanced access to large audio streams available on the Web and music recommendation.
So far, just a PDF is available, as I am still fighting with LaTeX2HTML, but there will be an HTML version some time soon :-) I am also planning to upload, at the same place, some extra annexes and extra results I didn't include in the main document. I think I will also blog here about some of the things included in this thesis.
In case you just want to jump to a particular chapter, I will just give some keywords to the different thesis chapters below:
- Introduction
- Knowledge Representation and Semantic Web technologies: FOL, Description Logics, RDF, Linked Data, OWL, N3.
- Conceptualisation of music-related information: web ontologies, music ontology, time ontology, event ontology, workflow-based modelling
- Evaluation of the Music Ontology framework: ontology evaluation, data-driven evaluation, task-based evaluation, latent dirichlet allocation
- Music processing workflows on the Web: workflows, concurrent transaction logic, N3, N3-Tr, DLP, publication of dynamically generated results, Semantic Web Services
- A web of music-related data: linking open data, dbtune, automated interlinking, quantification of structured web data
- Automated music processing agents: N3-Tr, Henry, music analysis, workflows, prolog
- Case studies: gnat, gnarql, personal music collection management, zempod, music recommendation
- Conclusion