For the last Hackday, the BBC released data about the John Peel sessions. In June, I published it as Linked Data, using a SWI-Prolog representation of the data bundled with a custom P2R mapping. But so far it was not interlinked with any other dataset, making it a small island :-)

To broaden my interlinking experience, I wanted to tackle a dataset I had never really tried to link to: DBPedia, which holds structured data extracted from Wikipedia (well, the Magnatune RDF dataset is linked to it, but only for geographical locations, which was fairly easy).

And my conclusion is... it is not that easy :-)

Try 1 - Matching on labels:

The first thing I tried was matching directly on labels, using an algorithm which might be summarized by:

1 - For each item I_i in the John Peel sessions dataset that we want to link (an instance of foaf:Agent or mo:MusicalWork), take its label L_i ({I_i rdfs:label L_i})

2 - Issue the following SPARQL query to the DBPedia dataset:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
SELECT ?u
WHERE 
{       
?u rdfs:label L_i
}

3 - For all results R_i_j of this query, assert I_i owl:sameAs R_i_j
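
As a rough sketch, the three steps above could look like this in Python. This is not the SWI-Prolog code actually used: the function names are made up for illustration, and run_query is a hypothetical callable standing in for whatever sends a query string to the DBPedia endpoint and returns the matching URIs. Whether labels need a language tag to match is left out here.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def sparql_literal(label: str) -> str:
    """Quote a label as a SPARQL string literal, escaping \\, quotes and newlines."""
    escaped = (
        label.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def label_query(label: str) -> str:
    """Step 2: build the exact-label query for one label L_i."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?u\n"
        "WHERE { ?u rdfs:label " + sparql_literal(label) + " }"
    )


def link_by_label(
    items: Iterable[Tuple[str, str]],
    run_query: Callable[[str], List[str]],
) -> Iterator[Tuple[str, str, str]]:
    """Steps 1 and 3: for each (I_i, L_i), yield (I_i, owl:sameAs, R_i_j).

    run_query is an assumption: any function that takes a SPARQL query
    string and returns the list of URIs bound to ?u.
    """
    for item_uri, label in items:
        for match in run_query(label_query(label)):
            yield (item_uri, OWL_SAME_AS, match)
```

Keeping the endpoint access behind run_query makes the matching loop easy to try out offline with a stub before pointing it at a real endpoint.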

Just for fun, here are the results of such a query. You can't imagine how many people share exactly the same name... For example, the Jules Verne in the BBC John Peel sessions dataset is quite different from Jules Verne... Also, Jules Verne is quite different from the Jules Verne category, though they share the same label.

Try 2 - Still matching on labels, but with restrictions:

In the agent case, the disambiguation appears easy to achieve, by just expressing ''I am actually looking for someone who could be somehow related to the John Peel sessions''. But, err, Wikipedia (hence, DBPedia) is a bit messy sometimes, and it is quite difficult to find a reliable and consistent way of expressing this criterion. So I had to sample the John Peel data (taking some producers, some engineers, some artists, some bands) and work out manually how I could restrict the range of resources I was looking for in DBPedia while still being able to retrieve all my linked agents. This leads to the following SPARQL query (involving L_i as defined earlier):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
SELECT ?u
WHERE 
{       
   {
      {?u <http://dbpedia.org/property/name> L_i} UNION
      {?u rdfs:label L_i} UNION
      {?u <http://dbpedia.org/property/bandName> L_i}
   }
   {
      {?u <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:infobox_musical_artist>} UNION
      {?u a <http://dbpedia.org/class/yago/Group100031264>} UNION
      {?u a ?mus. ?mus rdfs:subClassOf <http://dbpedia.org/class/yago/Musician110340312>} UNION
      {?u a ?artist. ?artist rdfs:subClassOf <http://dbpedia.org/class/yago/Creator109614315>}
   }
}
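
To make the shape of this query a bit more explicit, here is a small Python sketch that assembles it from two lists: the properties the label may appear under, and the patterns that restrict ?u to something music-related. The list and function names are mine, not part of any library, and the literal argument is assumed to be an already quoted SPARQL literal such as '"Nirvana"'.

```python
from typing import Iterable

PREFIX = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"

# Properties under which the label L_i may be found (taken from the query above).
NAME_PROPERTIES = [
    "<http://dbpedia.org/property/name>",
    "rdfs:label",
    "<http://dbpedia.org/property/bandName>",
]

# Patterns restricting ?u to music-related resources (taken from the query above).
TYPE_PATTERNS = [
    "?u <http://dbpedia.org/property/wikiPageUsesTemplate> "
    "<http://dbpedia.org/resource/Template:infobox_musical_artist>",
    "?u a <http://dbpedia.org/class/yago/Group100031264>",
    "?u a ?mus. ?mus rdfs:subClassOf "
    "<http://dbpedia.org/class/yago/Musician110340312>",
    "?u a ?artist. ?artist rdfs:subClassOf "
    "<http://dbpedia.org/class/yago/Creator109614315>",
]


def union(patterns: Iterable[str]) -> str:
    """Join graph patterns into one '{...} UNION {...}' alternative."""
    return " UNION ".join("{ " + p + " }" for p in patterns)


def restricted_query(literal: str) -> str:
    """Build the restricted query; `literal` must already be a quoted SPARQL literal."""
    names = union(f"?u {prop} {literal}" for prop in NAME_PROPERTIES)
    types = union(TYPE_PATTERNS)
    # The two groups sit side by side, so ?u has to satisfy both:
    # one of the name patterns AND one of the type patterns.
    return (
        PREFIX
        + "SELECT ?u\nWHERE {\n"
        + "  { " + names + " }\n"
        + "  { " + types + " }\n"
        + "}"
    )
```

One thing to keep in mind: a resource matching several branches (say, both the name property and rdfs:label) comes back once per matching combination, so the results may need deduplicating, or the query could use SELECT DISTINCT, which is not what the original query does.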

And for musical works?

Of course, this does not hold as soon as I broaden the range of resources I want to link. I first tried to use exactly the same methodology (basically restricting the resources I was looking for to be related to the Yago song concept). But, err, it did not work that well :-) You can't imagine how many songs have the same name! Just look at the results - this is enlightening! So far, the best I found was Walked Away, which appears to be the most popular title :-)

So what did I do to disambiguate? I took these results, fetched the RDF for the corresponding DBPedia resources, and ran a literal search on the abstract using the SWI literal index module, looking for the name of the artist involved in the performance of the work. That's a bit hacky, but it worked well, even with cover songs (like Nirvana's Love Buzz).
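
The idea behind this filter can be sketched in a few lines of Python. Note that this swaps the SWI literal index for a plain case-insensitive substring test, which is cruder than a real literal search, and that the function names and the dictionary-of-abstracts input are assumptions made for the example.

```python
from typing import Dict, List


def mentions_artist(abstract: str, artist_name: str) -> bool:
    """True if the abstract contains the artist name, ignoring case."""
    name = artist_name.strip()
    if not name:
        # An empty name would match every abstract, so reject it.
        return False
    return name.casefold() in abstract.casefold()


def disambiguate(candidates: Dict[str, str], artist_name: str) -> List[str]:
    """Keep the candidate URIs whose abstract mentions the performing artist.

    `candidates` maps each candidate DBPedia URI (from the label match)
    to the text of its abstract.
    """
    return [
        uri
        for uri, abstract in candidates.items()
        if mentions_artist(abstract, artist_name)
    ]
```

A cover can still get through this filter when the abstract of the original song happens to mention the covering artist, but a song whose abstract never names the performer will be dropped.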

Results

Surprisingly, the results do not seem to be too bad! I still have to check them more thoroughly, but nothing seems obviously wrong at first glance (I went through all the links manually, looking for anything that would not make sense). The links are available here.

The SWI-Prolog code doing the trick is available here. Sorry, the code is a bit messy, and got increasingly hacky...