DBTune blog


Tag - linking-open-data


Wednesday 19 May 2010

DBpedia and BBC Programmes

We just put live an exciting new feature on BBC Programmes: programme aggregations powered by DBpedia. For example, you can look at:

Of course, the RDF representations are linked up to DBpedia. Try loading adolescence in the Tabulator, for example - you will get an immediate mashup of BBC data, DBpedia data, and Freebase data. Or if you're not afraid of getting overloaded with data, try the California one.

One of the most interesting things about using web identifiers as tags for our programmes (apart from being able to automatically generate those aggregation pages, of course) is that we can use ancillary information about those tags to create new sorts of aggregations, and new visualisations of our data. We could, for example, plot all our Radio 3 programmes on a map, based on the geolocation of the people associated with these programmes. Or we could create an aggregation of BBC programmes featuring artists living in the cities with the highest rainfall (why not?). And, of course, this will be a fantastic new source of data for the MusicBore! The possibilities are basically endless, and we are very excited about it!

Thursday 14 January 2010

Live SPARQL end-point for BBC Programmes

Update: We seem to have an issue with the 4store instance hosting the dataset, so the data has been stale since the end of February.

Update 2: Everything should now be back to normal and in sync. Please comment on this post if you spot any issues, or general slowness.

Last year, we got OpenLink and Talis to crawl BBC Programmes and provide two SPARQL end-points on top of the aggregated data. However, getting the data by crawling means that the end-points did not hold all of it, and that it could get quite outdated -- especially as our programme data changes a lot.

At the moment, our data comes from two sources: PIPs (the central programme database at the BBC) and PIT (our content management system for programme information). In order to populate the /programmes database, we monitor changes on these two sources and replicate them in our database. We have a small piece of Ruby/ActiveRecord software (which we call the Tapp) that handles this process.

I ran a small experiment, converting our ActiveRecord objects to RDF and issuing an HTTP POST or HTTP DELETE request to a 4store instance for each change we receive. This keeps the 4store instance in sync with the upstream data sources.

It took a while to backfill, but it is now up-to-date. Check out the SPARQL end-point, a test SPARQL query form and the size of the endpoint (currently about 44 million triples).

The end-point holds all information about services, programmes, categories, versions, broadcasts, ondemands, time intervals and segments, as defined within the Programme Ontology. Each of these resources is held within its own named graph, which means we have a very large number of graphs (about 5 million). This makes it far easier to update the end-point, as we can simply replace the whole graph whenever something changes for a resource.
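
To give an idea of how this per-resource graph replacement can work, here is a rough Python sketch. The real hook lives in the Ruby Tapp, and I am assuming a 4store-style HTTP interface where a PUT on /data/<graph-uri> replaces that graph and a DELETE removes it - the store URL and layout below are purely illustrative:

# Rough sketch: one named graph per resource, replaced or dropped whenever
# the resource changes upstream. Store URL and graph layout are illustrative.
import urllib.parse
import urllib.request

STORE = "http://localhost:8000/data/"   # hypothetical 4store httpd data endpoint

def graph_url(resource_uri):
    """Each resource gets its own named graph, keyed by its URI."""
    return STORE + urllib.parse.quote(resource_uri, safe="")

def replace_graph(resource_uri, rdfxml):
    """Replace the whole graph for a resource whenever it changes upstream."""
    req = urllib.request.Request(
        graph_url(resource_uri),
        data=rdfxml.encode("utf-8"),
        headers={"Content-Type": "application/rdf+xml"},
        method="PUT",
    )
    return urllib.request.urlopen(req).status

def delete_graph(resource_uri):
    """Drop the graph entirely when the resource disappears upstream."""
    req = urllib.request.Request(graph_url(resource_uri), method="DELETE")
    return urllib.request.urlopen(req).status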

This is still highly experimental though, and I have already found a few bugs: some episodes seem to be missing (some Strictly Come Dancing episodes, for example, for some reason). I've also encountered some really weird crashes of the machine hosting the end-point when concurrently pushing a large number of RDF documents at it - I still haven't managed to identify the cause. To summarise: it might die without notice :-)

Here are some example SPARQL queries:

  • All programmes related to James Bond:
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?uri ?label
WHERE {
  ?uri po:category 
    <http://www.bbc.co.uk/programmes/people/bmFtZS9ib25kLCBqYW1lcyAobm8gcXVhbGlmaWVyKQ#person> ; rdfs:label ?label
}
  • Find all EastEnders broadcast dates after 2009-01-01, along with the type of the version that was broadcast:
PREFIX event: <http://purl.org/NET/c4dm/event.owl#> 
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#> 
PREFIX po: <http://purl.org/ontology/po/> 
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?version_type ?broadcast_start
WHERE
{ <http://www.bbc.co.uk/programmes/b006m86d#programme> po:episode ?episode .
  ?episode po:version ?version .
  ?version a ?version_type .
  ?broadcast po:broadcast_of ?version .
  ?broadcast event:time ?time .
  ?time tl:start ?broadcast_start .
  FILTER ((?version_type != <http://purl.org/ontology/po/Version>) && (?broadcast_start > "2009-01-01T00:00:00Z"^^xsd:dateTime))}
  • Find all programmes that have played tracks by both of two given artists (identified by their BBC Music web identifiers):
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
PREFIX owl: <http://www.w3.org/2002/07/owl#> 
SELECT DISTINCT ?programme ?label
WHERE {
  ?event1 po:track ?track1 .
  ?track1 foaf:maker ?maker1 . ?maker1 owl:sameAs <http://www.bbc.co.uk/music/artists/67f66c07-6e61-4026-ade5-7e782fad3a5d#artist> .
  ?event2 po:track ?track2 .
  ?track2 foaf:maker ?maker2 . ?maker2 owl:sameAs <http://www.bbc.co.uk/music/artists/fb7272ba-f130-4f0a-934d-6eeea4c18c9a#artist> .
  ?event1 event:time ?t1 .
  ?event2 event:time ?t2 .
  ?t1 tl:timeline ?tl .
  ?t2 tl:timeline ?tl .
  ?version po:time ?t .
  ?t tl:timeline ?tl .
  ?programme po:version ?version .
  ?programme rdfs:label ?label .
}
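
If you would rather run these queries from a script than from the query form, something along these lines should work against any end-point speaking the standard SPARQL protocol (the end-point URL below is a placeholder - use the one linked above):

# Send one of the example queries above over the standard SPARQL protocol
# and print the bindings. The end-point URL is a placeholder.
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://example.org/sparql"   # placeholder - use the end-point linked above

QUERY = """
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?uri ?label
WHERE {
  ?uri po:category
    <http://www.bbc.co.uk/programmes/people/bmFtZS9ib25kLCBqYW1lcyAobm8gcXVhbGlmaWVyKQ#person> ;
    rdfs:label ?label
}
"""

url = ENDPOINT + "?" + urllib.parse.urlencode({"query": QUERY})
req = urllib.request.Request(url, headers={"Accept": "application/sparql-results+json"})
results = json.load(urllib.request.urlopen(req))
for row in results["results"]["bindings"]:
    print(row["uri"]["value"], "-", row["label"]["value"])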

Thursday 10 September 2009

Linked Data London event screencasts and London Web Standards meetup

Tom Scott and I presented a talk on contextualising BBC programmes using linked data at the Linked Data London event. For the occasion, I made a couple of screencasts.

The first one shows some browsing of the linked data we expose on the BBC website, using the Tabulator Firefox extension. I start from a Radio 2 programme, move to its segmentation into musical tracks, then to another programme featuring one of those tracks, and finally to another artist featured in that programme. The Tabulator ends up displaying data aggregated from BBC Programmes, BBC Music and DBpedia.

Exploring BBC programmes and music data using the Tabulator

The second one shows what you can do with these programme/artist and artist/programme links. We built some very straightforward programme-to-programme recommendations using them. On the right-hand side of the programme page, there are recommendations based on artists played in common, scoped to programmes that are available on iPlayer or have an upcoming broadcast. If you hover over a recommendation, it displays what allowed us to derive it: here, a list of artists played in both programmes. This work is part of our investigations within the NoTube European project.

Artist-based programme to programme recommendations

Also, as Michael already posted on Radio Labs, we gave a presentation on Linked Data to the London Web Standards group. It was a very nice event, especially as mainly web developers turned up. Linked data events tend to be mostly linked data evangelists talking to other linked data evangelists (which is great too!), so this was quite different :-) Lots of interesting questions were asked about provenance and trustworthiness of data, which are always a bit difficult to answer, apart from the usual "it's just the Web, you can deal with it as you do (or don't) currently with Web data", e.g. by keeping track of provenance information and filtering based on it. Somebody suggested computing statistics on how many times a particular statement is repeated in order to derive its trustworthiness, but this sounds a bit harmful... Currently on the Linked Data cloud, lots of information gets repeated. For example, if a statement about an artist is available in DBpedia, there is a fair chance it will get repeated in BBC Music, just because we also use Wikipedia as an information source. The fact that this statement gets repeated doesn't make it any more valid.

Skim-read introduction to linked data slides

Friday 14 November 2008

Reuters OpenCalais joins the linked data cloud

Still more fancy linked data to play with - just a couple of weeks after Freebase announced that they publish linked data, OpenCalais announced that they are going to publish linked data as well, by linking the results of their entity extraction service to DBpedia URIs.

Wednesday 29 October 2008

Freebase does linked data!

Just a small post, live from ISWC: Freebase does linked data!

You can try it there, and you can try this instance, for example.

Freebase linked data

Added to David Huynh's wonderful Parallax, that's a lot of great news coming from the other side of the Atlantic :-)

Now, to see whether their linked data actually uses the Web! Do they link to other web identifiers, available outside Freebase?

I also just noticed something weird: the read/write permissions are attached to the tracks/films/whatever resources, instead of being attached to the RDF document itself.

Thursday 31 July 2008

Semantic search on aggregated music data

I just moved the semantic search demo to a faster server, so it should hopefully be a lot more reliable. This demo uses the amazing ClioPatria on top of an aggregation of music-related data. The aggregation was constructed simply by taking a bunch of Creative Commons MP3s, running GNAT on them, and crawling linked data starting from the web identifiers output by GNAT.

I also set up the search tab to work correctly. For example, when you search for "punk", you get the following results.

Punk search 1

Punk search 2

Note that the results are explained: "punk" might be related to the title, the biography, a tag, the lyrics, content-based similarity to something tagged as punk (although it looks like Henry crashed in the middle of the aggregation, so not a lot of such data is available yet), etc. Moreover, you get back different types of resources: artists, records, tracks, lyrics, performances etc.

For example, if you click on one of the records, you get the following.

Punk search 3

This record is available under a Creative Commons license, so you can get direct access to the corresponding XSPF playlist, BitTorrent items etc., by following the Music Ontology "available as" property. For example, you can click on an XSPF playlist and listen to the selected record.

Punk search 4

Of course, you can still do the previous things - plotting music artists (or search results, just take a look at the "view" drop-down box) on a map or a time-line, browsing using facets, etc.

Btw, if you like DBTune, please vote for it in the Triplify Challenge! :-)

Monday 28 July 2008

Music Ontology linked data on BBC.co.uk/music

Just a couple of minutes ago on the Music Ontology mailing list, Nicholas Humfrey from the BBC announced the availability of linked data on BBC Music.

$ rapper -o turtle \
   http://www.bbc.co.uk/music/artists/cc197bad-dc9c-440d-a5b5-d52ba2e14234

[...]
<http://www.bbc.co.uk/music/artists/cc197bad-dc9c-440d-a5b5-d52ba2e14234#artist>
   a mo:MusicGroup;
   foaf:name "Coldplay";
   owl:sameAs <http://dbpedia.org/resource/Coldplay>;
   mo:member
<http://www.bbc.co.uk/music/artists/18690715-59fa-4e4d-bcf3-8025cf1c23e0#artist>,
<http://www.bbc.co.uk/music/artists/d156ceb2-fd90-4e82-baea-829bbdf1c127#artist>,
<http://www.bbc.co.uk/music/artists/6953c4db-7214-4724-a140-e87550bde420#artist>,
<http://www.bbc.co.uk/music/artists/98d1ec5a-dd97-4c0b-9c83-7928aac89bca#artist>
[...]

This is just really, really, really great... Congratulations to the /music team!

Update: Tom Scott just wrote a really nice post about the new BBC music site, explaining what the BBC is trying to achieve by going down the linked data path.

Sunday 27 July 2008

Musicbrainz RDF updated

Well, I guess everything is in the title :-) The dump used now dates from 26 July. I also moved everything to a much faster server. The D2R mapping is still not 100% complete - I am getting through it really slowly, as PhD writing takes almost all my time these days. I recently added owl:sameAs links to the DBTune MySpace service, so you can easily get from Musicbrainz artists to the corresponding MP3s available on MySpace, and to their social networks. See for example Madonna, linked through owl:sameAs to the corresponding DBpedia artist and to the corresponding MySpace artist.

Wednesday 9 July 2008

Nominated!

We learned yesterday that DBTune has been nominated for the Triplify Challenge! The other seven projects are really interesting as well, so I guess the competition will be tough! The final results will be announced at the I-Semantics conference in early September.

Also, Tim Berners-Lee gave a great talk about linked data and the semantic web on Radio 4 earlier today. The first use case he mentions sounds quite familiar: finding bands based on geolocation data. He already mentioned that in one of his blog posts, linking to this screencast.

An interesting discussion took place on the Linking Open Data mailing list just afterwards, gathering use cases for explaining to a general audience what linked data can be useful for.

Wednesday 25 June 2008

Linking Open Data: BBC playcount data as linked data

For the Mashed event this weekend, the BBC released some really interesting data. This includes playcount data, stating how often an artist is featured within a particular BBC programme (at the brand or episode level).

During the event, I wrote some RDF translators for this data, linking web identifiers in the DBTune Musicbrainz linked data to web identifiers in the BBC Programmes linked data. Kurt, Ben and I used it in our hack, and Ben made a nice write-up about it. By finding web identifiers for tracks in a collection, following links to the BBC Programmes data, and finally connecting this Programmes data to the box available at the event holding a year's worth of recorded BBC radio programmes, we can quite easily generate playlists from an audio collection. Two Python scripts implementing this mechanism are available there. The first one uses brand data only, whereas the second one uses episode data (and therefore produces fewer, more accurate items in the resulting playlist). Finally, the thing we spent the most time on was the SQLite storage for our RDF cache :-)

This morning, I published the playcount data as linked data. I wrote a new DBTune service for that. It publishes a set of web identifiers for playcount data, interlinking Musicbrainz and BBC Programmes. I also put online a SPARQL end-point holding all this playcount data along with aggregated data from Musicbrainz and the BBC Programmes linked data (around 2 million triples overall).

For example, you can try the following SPARQL query:

SELECT ?brand ?title ?count
WHERE {
   ?artist a mo:MusicArtist;
      foaf:name "The Beatles". 
   ?pc pc:object ?artist;
       pc:count ?count.
   ?brand a po:Brand;
       pc:playcount ?pc;
       dc:title ?title 
    FILTER (?count>10)}

This will return every BBC brand that has featured The Beatles more than 10 times.

Thanks to Nicholas and Patrick for their help!

Thursday 12 June 2008

Describing the content of RDF datasets

There seems to be an overall consensus in the Linking Open Data community that we need a way to describe in RDF the different datasets published and interlinked within the project. There is already a Wiki page detailing some aspects of the corresponding vocabulary, called voiD (vocabulary of interlinked datasets).

One thing I would really like this vocabulary to do is to describe exactly the inner content of a dataset - what can we find in this SPARQL end-point or in this RDF document? I have thought quite a lot about this recently, as I am beginning to really need it. Indeed, when you have RDF documents describing lots of audio annotations, whose generation is really computationally intensive, you want to pick just the one that fits your request. There have been quite a lot of similar efforts in the past. However, most of them rely on one sort of reification or another, which makes them quite hard to actually use.

After a few failed attempts, I came up with the following, which I hope is easy and expressive enough :-)

It relies on a single property, void:example, which links a resource identifying a particular dataset to a small RDF document holding an example of what you could find in that dataset. Then, with just a bit of SPARQL magic, you can easily query for datasets with a particular capability. Easy, isn't it? :-)

Here is a real-world example. A first RDF document describes one of the DBtune datasets:

:ds1
        a void:Dataset;
        rdfs:label "Jamendo end-point on DBtune";
        dc:source <http://jamendo.com/>;
        foaf:maker <http://moustaki.org/foaf.rdf#moustaki>;
        void:sparql_end_point <http://dbtune.org:2105/sparql/>;
        void:example <http://moustaki.org/void/jamendo_example.n3>;
        .

The void:example property points towards a small RDF file, giving an example of what you can find within this dataset.

Then, the following SPARQL query asks whether this dataset has a SPARQL end-point and holds information about music records, associated tags, and places to download them.

PREFIX void: <http://purl.org/ontology/void#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX tags: <http://www.holygoat.co.uk/owl/redwood/0.1/tags/>

ASK
FROM NAMED <http://moustaki.org/void/void.n3>
FROM NAMED <http://moustaki.org/void/jamendo_example.n3>
{
        GRAPH <http://moustaki.org/void/void.n3> {
                ?ds a void:Dataset;
                        void:sparql_end_point ?sparql;
                        void:example ?ex.
        }
        GRAPH ?ex {
                ?r a mo:Record;
                        mo:available_as ?l;
                        tags:taggedWithTag ?t.
        }
}

I tried this query with ARQ, and it works perfectly :-)

$ sparql --query=void.sparql
Ask => Yes

Update: It also works with ARC2, although it does not automatically load the graphs named in the FROM clauses. You can try the same query on this SPARQL end-point, which has already loaded the two documents (the voiD description and the example).

Update 2: A nice blog post about automatically generating the data you need for describing an end-point - thanks shellac for the pointer!

Update 3: See the follow-up discussion on the #swig IRC channel.

Tuesday 20 May 2008

Ceriese: RDF translator for Eurostat data

Riese logo

Some time ago, I did a bit of work on the RIESE project, which aims at publishing the Eurostat dataset on the Semantic Web and interlinking it with further datasets (e.g. Geonames and DBpedia). This can look a bit far from my interests in music data, but there is a connection, which illustrates the power of linked data, as explained at the end of this post.

Original data

There are three distinct things we consider in the Eurostat dataset:

  • A table of contents in HTML defining the hierarchical structure of the Eurostat datasets;
  • Tab-separated value dictionary files defining the ~80,000 data codes used in the dataset (e.g. "eu27" for the European Union of 27 countries);
  • The actual statistical data, in tab-separated value files: around 4,000 datasets for roughly 350 million statistical items.

Ontology

The first thing we need to figure out when exposing data on the Semantic Web is the model we'll link to. This led to the design of SCOVO (the Statistical Core Vocabulary). The concepts in this ontology can be depicted as follows:

SCOVO ontology

The interesting thing about this model is that the statistical item is considered a primary entity. We used the Event ontology as a basis - a statistical item is a particular classification of a space/time region. This makes the model really flexible and extensible: we can, for example, attach multiple dimensions to a particular item, resources pertaining to its creation, etc.
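
To make the model a bit more concrete, here is a rough sketch of a single statistical item built with rdflib. The dataset, dimension and item URIs are invented, and I am assuming the usual SCOVO namespace and the pattern of attaching dimensions to an item and carrying the value in rdf:value:

# Rough sketch of one SCOVO statistical item. All example.org URIs are
# invented; the SCOVO namespace and item/dimension/rdf:value pattern are
# assumptions based on the model described above.
from rdflib import Graph, Literal, Namespace, RDF

SCOVO = Namespace("http://purl.org/NET/scovo#")   # assumed SCOVO namespace
EX = Namespace("http://example.org/eurostat/")    # invented URIs for the example

g = Graph()
g.bind("scovo", SCOVO)

item = EX["item/unemployment/eu27/2007"]
g.add((item, RDF.type, SCOVO.Item))
g.add((item, SCOVO.dataset, EX["dataset/unemployment"]))   # which dataset it belongs to
g.add((item, SCOVO.dimension, EX["dimension/geo/eu27"]))   # spatial dimension
g.add((item, SCOVO.dimension, EX["dimension/time/2007"]))  # temporal dimension
g.add((item, RDF.value, Literal("7.2")))                   # the measured value itself

print(g.serialize(format="turtle"))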

RDF back-end

I wanted to see how to publish such large amounts of RDF data and how my publication tools perform, so I designed the Ceriese software to handle it.

The first real concern when dealing with such large amounts of data is, of course, scalability. The overall Eurostat dataset amounts to over 3 billion triples. Given that we don't have high-end machines with lots of memory, using the core SWI-Prolog RDF store was out of the question (I wonder whether any in-memory triple store can handle a billion triples, by the way).

So there are three choices at this point:

  • Use a database-backed triple store;
  • Dump static RDF files into a file system served through Apache;
  • Generate the RDF on-the-fly.

We don't have the sort of money it takes (for both the hardware and the software) for the first option to really scale, so we tried the second and third solutions. I took my old Prolog-to-RDF software (P2R), which I use to publish the Jamendo dataset, and we wrote some P2R mapping files converting the tab-separated value files. We then made P2R dump small RDF files into a file-system hierarchy, corresponding to the descriptions of the different web resources we wanted to publish. Some Apache tweaks and Michael and Wolfgang's work on XHTML/RDFa publishing were then enough to make the dataset available on the web of data.

But this approach had two main problems. First, it took ages to run the RDF dump, so we never actually managed to complete it. Second, it was impossible to provide a SPARQL querying facility: no aggregation of the data was available.

So we eventually settled on the third solution. I put on my Prolog hacker hat and tried to optimise P2R to make it fast enough. I did this using the same trick I used in my small N3 reasoner, Henry: P2R mappings are compiled into native Prolog clauses (rdf(S,P,O) :- ...), which cuts down the search space a lot. TSV files are accessed within those rules and parsed on the fly. The location of the TSV file to access is derived from a small in-memory index, and parsed TSV files are cached for the duration of a query, to avoid parsing the same file for different triple patterns.
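
The real code is Prolog, but the idea can be sketched in Python: an in-memory index mapping dataset codes to TSV files, lazy parsing, and a cache so a file is only parsed once per query (the file layout and URI scheme below are invented):

# Python sketch of the on-the-fly approach described above (the real code
# is Prolog/P2R). File layout, URI scheme and column order are invented.
import csv
from functools import lru_cache

INDEX = {"unemployment": "data/unemployment.tsv"}   # dataset code -> TSV file (invented)
BASE = "http://example.org/eurostat/"                # invented URI scheme

@lru_cache(maxsize=None)        # crude stand-in for the per-query cache
def parse_tsv(path):
    with open(path, newline="") as f:
        return [row for row in csv.reader(f, delimiter="\t")]

def triples(dataset_code):
    """Yield (subject, predicate, object) tuples derived on the fly from one TSV file."""
    path = INDEX[dataset_code]                # the small index tells us which file to read
    for geo, year, value in parse_tsv(path):  # assumes three columns, just for the example
        item = f"{BASE}item/{dataset_code}/{geo}/{year}"
        yield (item, BASE + "geo", BASE + "region/" + geo)
        yield (item, BASE + "year", year)
        yield (item, BASE + "value", value)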

The same mechanism is applied to derive a SKOS hierarchy from the HTML table of contents.

Now, a DESCRIBE query takes less than 0.5 seconds for any item, on my laptop. Not perfect, but still...

One way to improve the access time a lot would be to dump the TSV files into a relational database, and access that database from our P2R mappings instead of the raw TSV files.

Trying it out and creating your own SPARQL end-point from Eurostat data is really easy.

  • Get SWI-Prolog;
  • Get the software from there;
  • Get the raw Eurostat data (or get it from the Eurostat web-site, as this copy can be slightly outdated);
  • Put it in data/, in your Ceriese directory;
  • Launch start.pl;
  • Go to http://localhost:3020/

Now what?

Michael and Wolfgang did an amazing job putting together a really nice web interface, publishing the data in XHTML+RDFa. They also included some interlinking, especially for geographical locations, which are now linked to Geonames.

So what's the point, from my music-geek point of view? Well, after aggregating Semantic Web data about my music collection (using these tools), I can now sort hip-hop artists by the murder rate in their city :-) This is quite fun as it is (especially as the Eurostat dataset holds a really diverse range of statistics), but it would be really interesting to mine it for associations between statistical data and musical facts. This would surely lead to interesting sociological results (e.g. how does musical "genre" associate with particular socio-economic indicators?).

Wednesday 14 May 2008

Data-rich music collection management

I just put up a live demo of something I showed earlier on this blog.

You can explore the Creative Commons-licensed part of my music collection (mainly coming from Jamendo) using aggregated Semantic Web data.

For example, here is what you get after clicking on "map" on the right-hand side and "MusicArtist" on the left-hand side:

Data-rich music collection management

The aggregation is done using the GNAT and GNARQL tools available in the motools SourceForge project. The data comes from datasets within the Linking Open Data project. The UI is provided by the amazing ClioPatria software, with a really small amount of configuration.

An interesting thing is to load this demo into Songbird, as it can aggregate and play the audio as you crawl around.

Check the demo!

Update: It looks like it doesn't work with IE, but it is fine with Opera and FF2 or FF3. If the map doesn't load at first, just try again and it should be ok.

Wednesday 2 April 2008

13.1 billion triples

After a rough estimation, it looks like the services hosted on DBTune provide access to 13.1 billion triples, therefore making a significant addition to the data web!

Here is the breakdown of that estimate (with a quick sanity check of the arithmetic after the list):

  • MySpace: 250 million people * 50 triples (on average) = 12.5 billion triples;
  • AudioScrobbler: 1.5 million users (only Europe?) * 400 triples = 600 million ;
  • Jamendo: 1.1 million triples + 5000 links to other data sources ;
  • Magnatune: 322 000 triples + 233 links ;
  • BBC John Peel sessions: 277 000 triples + 2100 links ;
  • Chord URI service: I don't count it, as it is potentially infinite (the RDF descriptions are generated from the chord symbol in the URI).
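
For what it's worth, here is that sum spelled out:

# Quick sanity check of the 13.1 billion estimate, using the figures above.
myspace = 250_000_000 * 50               # ~12.5 billion
audioscrobbler = 1_500_000 * 400         # ~600 million
others = 1_100_000 + 322_000 + 277_000   # Jamendo + Magnatune + John Peel sessions
print((myspace + audioscrobbler + others) / 1e9)   # ~13.1 (billions of triples)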

However, SPARQL end-points are not available for AudioScrobbler and MySpace, as their RDF is generated on the fly: from the XML feeds for the former, and by scraping for the latter.

Now, I wish linked data could be provided directly by the data sources themselves :-) (Again, all the code used to run the DBTune services is available in the motools project on Sourceforge).

Wednesday 12 March 2008

MySpace RDF service

Thanks to the amazing work of Kurt and Ben on the MyPySpace project, members of the MySpace social network can have a Semantic Web URI!

This small service provides such URIs and corresponding FOAF (top friends, depiction, name) and Music Ontology (URIs of available tracks in the streaming audio cache) RDF representations.

That means I can add such statements to my FOAF profile:

<http://moustaki.org/foaf.rdf#moustaki> foaf:knows <http://dbtune.org/myspace/lesversaillaisesamoustache>.

And then, using the Tabulator Firefox extension:

MySpace friends

PS: The service is still a bit slow and can be highly unstable, though - it is slightly faster with URIs using MySpace UIDs.

PS2: We don't host any data - everything is scraped on the fly using the MyPySpace tools.

Thursday 6 March 2008

Interlinking music datasets on the Web

The paper we wrote with Christopher Sutton and Mark Sandler has been accepted to the Linking Data on the Web workshop at WWW, and it is now available from the workshop website.

This paper explains in a bit more detail (including pseudocode and an evaluation of two implementations) the algorithm I already described briefly in previous posts (oh yeah, this one too).

The problem is: how can you automatically derive owl:sameAs links between two different, previously unconnected, web datasets? For example, how can I state that Both in Jamendo is the same as Both in Musicbrainz?

The paper studies three different approaches. The first one is a simple literal lookup (which, in the previous example, just fails, because there are two artists and a gazillion tracks/records holding "Both" in their titles in Musicbrainz). The second one is a constrained literal lookup (we specifically look for an artist, a record, a track, etc.). Our previous example also fails there, because there are two matching artists in Musicbrainz for Both.

The algorithm we describe in the paper can intuitively be described as: if two artists made albums with the same titles, they are more likely to be the same. It browses linked data to aggregate further clues until it is confident enough to disambiguate among several matching resources.
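
A very condensed Python sketch of that intuition (not the actual GNAT code - the lookup_candidates and album_titles helpers are hypothetical stand-ins for the real linked data lookups, and the scoring in the paper is more involved):

# Condensed sketch: when a name lookup is ambiguous, gather extra clues
# (here, shared album titles) for each candidate and keep the best match
# only if it is clearly ahead. The two helper callables are hypothetical.

def disambiguate(source_artist, lookup_candidates, album_titles, threshold=0.5):
    """Pick the candidate sharing the most album titles, if the match is clear enough."""
    candidates = lookup_candidates(source_artist["name"])
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]                 # nothing to disambiguate

    source_albums = set(source_artist["albums"])
    scored = []
    for candidate in candidates:
        shared = source_albums & set(album_titles(candidate))
        scored.append((len(shared) / max(len(source_albums), 1), candidate))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    best_score, best = scored[0]
    return best if best_score >= threshold else None   # not confident enough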

Although not specific to the music domain, we did evaluate it in two music-related contexts:

For the second one, we tried to cover a wide range of possible metadata mistakes, and checked how well our algorithm was coping with such bad metadata. A week ago, I compared the results with the Picard Musicbrainz tagger 0.9.0; here are the results (keeping in mind that our algorithm is quite a bit slower, as the Musicbrainz API is not really designed for the sort of things we do with it), for the track I Want to Hold Your Hand by the Beatles, on the Meet the Beatles! album:

  • Artist field missing:
    • GNAT: Correct
    • Picard: Matches the same track, but on the And Now: The Beatles compilation
  • Artist set to random string:
    • GNAT: Correct
    • Picard: Matches the same track, but on another release (track 1 of The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!))
  • Artist set to Beetles:
    • GNAT: Correct
    • Picard: Matches the same track, but on another release (track 1 of The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!))
  • Artist set to Al Green (who actually made a cover of that song):
    • GNAT: Mapped to Al Green's cover version on Soul Tribute to the Beatles
    • Picard: Same
  • Album field missing:
    • GNAT: Matches the same track, but on another release (track 1 of The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!))
    • Picard: Matches the same track, but on the single
  • Album set to random string:
    • GNAT: Matches the same track, but on another release (track 1 of The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!))
    • Picard: Matches the same track, but on the single
  • Album set to Meat the Beatles:
    • GNAT: Matches the same track, but on another release (track 1 of The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!))
    • Picard: Matches the same track, but on the compilation The Beatles Beat
  • Track set to random string:
    • GNAT: Correct
    • Picard: No results
  • Track set to I Wanna Hold Your Hand:
    • GNAT: Correct
    • Picard: Matches the same track, but on the compilation The Beatles Beat
  • Perfect metadata:
    • GNAT: Correct
    • Picard: Matches the same track, but on the compilation The Beatles Beat

Most of the compilation matches from Picard are actually not wrong, as the track length of our test file is closer to the track length on the compilation than to the track length on the Meet the Beatles! album.

Of course, this is not an extensive evaluation of how the Picard lookup mechanism compares with GNAT. And GNAT cannot really compete with Picard anyway, as it was clearly designed for a different purpose (GNAT is meant to interlink RDF datasets).

The Python implementation of our algorithm is available under the BSD license in the motools SourceForge project. The Prolog implementation (working on RDF datasets) is also available in the motools SourceForge project.

Tuesday 22 January 2008

Pushing your Last.FM friends in the FOAF-O-Sphere

I just committed some changes to the last.fm linked data service. As well as your last scrobbled tracks linked to the corresponding Musicbrainz URIs, it now spits out your list of last.fm friends (using their URIs on this service).

This makes it quite nice to explore the last scrobbles of the friends of your friends (hello Kurt and Ben!) :)

The friends of my friends on last.fm

Friday 11 January 2008

Your AudioScrobbler data as Linked Data

I just put online a small service which converts your AudioScrobbler data to RDF, modelled using the Music Ontology: it exposes your last 10 scrobbled tracks.

The funny thing is that it links the tracks, records and artists to corresponding dereferenceable URIs in the Musicbrainz dataset. So the tracks you were last listening to are part of the Data Web!

Just try it by getting this URI:

http://dbtune.org/last-fm/<last.fm username>

For example, mine is:

http://dbtune.org/last-fm/moustaki

Of course, by being linked to dereferenceable URIs in Musicbrainz, you can access the birth dates of the artists you last listened to, or use the links published by DBpedia to plot the artists you last played on a map, just by crawling the Data Web a little.

Then, you can link that to your FOAF URI. Mine now holds the following statement:

<http://moustaki.org/foaf.rdf#moustaki> owl:sameAs <http://dbtune.org/last-fm/moustaki>.

Now, my URI looks quite nice, in the Tabulator generic data browser!

Me and my scrobbles

Tuesday 4 December 2007

Linking open data: interlinking the BBC John Peel sessions and the DBPedia datasets

For the last Hackday, the BBC released data about the John Peel sessions. In June, I published it as Linked Data, using a SWI-Prolog representation of the data bundled with a custom P2R mapping. But so far, it was not interlinked with any other dataset, making it a small island :-)

In order to broaden my interlinking experience, I wanted to tackle a dataset I had never really tried to link to: DBpedia, holding structured data extracted from Wikipedia (well, the Magnatune RDF dataset is linked to it, but just for geographical locations, which was fairly easy).

And my conclusion is... it is not that easy :-)

Try 1 - Matching on labels:

The first thing I tried was matching directly on labels, using an algorithm which might be summarized by:

1 - For each item I_i in the John Peel sessions dataset that we want to link (instances of foaf:Agent or mo:MusicalWork), take its label L_i ({I_i rdfs:label L_i})

2 - Issue the following SPARQL query to the DBPedia dataset:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
SELECT ?u
WHERE 
{       
?u rdfs:label L_i
}

3 - For all results R_i_j of this query, assert I_i owl:sameAs R_i_j

Just for fun, here are the results of such a query. You can't imagine how many people share exactly the same name... For example, Jules Verne in the BBC John Peel sessions dataset is quite different from the Jules Verne in DBpedia... Also, Jules Verne is quite different from the Jules Verne category, though they share the same label.

Try 2 - Still matching on labels, but with restrictions:

In the agent case, the disambiguation appears easy to achieve, by just expressing "I am actually looking for someone who could be somehow related to the John Peel sessions". But, err, Wikipedia (hence DBpedia) is a bit messy at times, and it is quite difficult to find a reliable and consistent way of expressing this criterion. So I had to sample the John Peel data (taking some producers, some engineers, some artists, some bands) and work out manually how I could restrict the range of resources I was looking for in DBpedia while still being able to retrieve all my linked agents. This leads to the following SPARQL query (involving L_i as defined earlier):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
SELECT ?u
WHERE 
{       
   {
      {?u <http://dbpedia.org/property/name> L_i} UNION 
      {?u rdfs:label L_i} UNION 
      {?u <http://dbpedia.org/property/bandName> L_i} 
    } 
    {
       {?u <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:infobox_musical_artist>} UNION 
       {?u a <http://dbpedia.org/class/yago/Group100031264>} UNION 
       {?u a ?mus.?mus rdfs:subClassOf <http://dbpedia.org/class/yago/Musician110340312>} UNION 
       {?u a ?artist. ?artist rdfs:subClassOf <http://dbpedia.org/class/yago/Creator109614315>} 
    } 
}

And for musical works?

Of course, this does not hold as soon as I broaden the range of resources I want to link. I first tried exactly the same methodology (basically restricting the resources I was looking for to those related to the Yago song concept). But, err, it did not work that well :-) You can't imagine how many songs have the same name! Just look at the results - this is enlightening! So far, the best I found was Walked Away, which appears to be the most popular title :-)

So what did I do to disambiguate? I took these results, got the RDF corresponding to the DBpedia resources, and did a literal search (using the SWI-Prolog literal index module) on the abstract, looking for the name of the artist involved in the performance of the work. That's a bit hacky, but it did well even with cover songs (like Nirvana's Love Buzz).
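
Roughly, the hack boils down to something like this (the real version uses the SWI-Prolog literal index; fetch_abstract below is a hypothetical helper returning the DBpedia abstract text for a resource):

# Rough Python equivalent of the disambiguation hack described above: among
# DBpedia resources whose label matches the work's title, keep those whose
# abstract mentions the performing artist. fetch_abstract is hypothetical.

def match_work(candidate_uris, artist_name, fetch_abstract):
    """Keep only the DBpedia candidates whose abstract mentions the performing artist."""
    matches = []
    for uri in candidate_uris:
        abstract = fetch_abstract(uri) or ""
        if artist_name.lower() in abstract.lower():
            matches.append(uri)
    return matches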

Results

Surprisingly, the results do not seem to be too bad! I still have to check them more, but nothing seems obviously wrong at first glance (I went through all the links manually, looking for anything that did not make sense). The links are available here.

The SWI-Prolog code doing the trick is available here. Sorry, the code is a bit messy, and got increasingly hacky...

Thursday 30 August 2007

GNAT 0.1 released

Chris Sutton and I have done some work since the first release of GNAT, and it is now in a releasable state!

You can get it here.

What does it do?

As mentioned in my previous blog post, GNAT is a small piece of software able to link your personal music collection to the Semantic Web. It finds dereferenceable identifiers available somewhere on the web for the tracks in your collection. Basically, GNAT crawls through your collection and tries, by several means, to find the corresponding Musicbrainz identifier for each track, which is then used to find the corresponding dereferenceable URI in Zitgist. RDF/XML files are then put in the corresponding folders:

/music
/music/Artist1
/music/Artist1/AlbumA/info_metadata.rdf
/music/Artist1/AlbumA/info_fingerprint.rdf
/music/Artist1/AlbumB/info_metadata.rdf
/music/Artist1/AlbumB/info_fingerprint.rdf

What next?

These files hold a number of <http://zitgist.com/music/track/...> mo:available_as <local file> statements. They can then be used by a tool such as GNARQL (which will be properly released next week), which swallows them, exposes a SPARQL end-point, and provides some linked data crawling facilities (to gather more information about the artists in our collection, for example), therefore allowing us to use the links pictured here (yes, sorry, I didn't know how to properly introduce the new linking-open-data diagram - it looks good, and keeps on growing! :-) ):

Linking-Open-Data

Two identification strategies

GNAT can use two different identification strategies:

  • Metadata lookup: in this mode, only the available tags are used to identify the track. We chose an identification algorithm which is slower (although if you identify, for example, a collection with lots of releases, you won't notice it too much, as only the first track of a release is slower to identify), but works a bit better than Picard's metadata lookup. Basically, the algorithm used is the same as the one I used to link the Jamendo dataset to the Musicbrainz one.
  • Fingerprinting: in this mode, the MusicIP fingerprinting client is used to find a PUID for the track, which is then used to get back to the Musicbrainz ID. This mode is obviously better when the tags are crap :-)
The two strategies can be run in parallel, and most of the time the best identification results are obtained by combining the two...

Usage

  • To perform a metadata lookup for the music collection available at /music:

./AudioCollection.py metadata /music

  • To perform a fingerprint-based lookup for the music collection available at /music:

./AudioCollection.py fingerprint /music

  • To clean all previously performed identifications:

./AudioCollection.py clean /music

Dependencies

  • MOPY (included) - Music Ontology PYthon interface
  • genpuid (included) - MusicIP fingerprinting client
  • rdflib - easy_install rdflib
  • mutagen - easy_install mutagen
  • Musicbrainz2 - You need a version later than 02.08.2007 (sorry)
