Linking open data: interlinking the BBC John Peel sessions and the DBPedia datasets
By Yves on Tuesday 4 December 2007, 10:32 - Permalink
For the last Hackday, the BBC released data about the John Peel sessions. In June, I published it as Linked Data, using a SWI-Prolog representation of this data, bundled with a custom P2R mapping. But so far, it was not interlinked with any other dataset, making it a small island :-)
To broaden my interlinking experience, I wanted to tackle a dataset I had never really tried to link to: DBPedia, which holds structured data extracted from Wikipedia (well, the Magnatune RDF dataset is linked to it, but only for geographical locations, which was fairly easy).
And my conclusion is... it is not that easy :-)
Try 1 - Matching on labels:
The first thing I tried was matching directly on labels, using an algorithm which might be summarized by:
1 - For each item I_i in the John Peel sessions dataset that we want to link (instances of foaf:Agent or mo:MusicalWork), take its label L_i ({I_i rdfs:label L_i})
2 - Issue the following SPARQL query to the DBPedia dataset:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?u
WHERE
{
  ?u rdfs:label L_i
}
3 - For each result R_i_j of this query, assert I_i owl:sameAs R_i_j
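The three steps above can be sketched in Python as follows. This is only an illustration, not the SWI-Prolog code actually used: the names sparql_literal, label_query, link_by_label and run_query are made up for this sketch, and run_query stands for whatever function sends a query string to a SPARQL endpoint and returns the matching URIs. The sketch also assumes labels are plain literals without a language tag.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def sparql_literal(label):
    """Quote a label as a SPARQL string literal (escapes backslashes and double quotes only)."""
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def label_query(label):
    """Step 2: build the query for one label L_i."""
    return (
        f"PREFIX rdfs: <{RDFS}>\n"
        "SELECT ?u\n"
        "WHERE\n"
        "{\n"
        f"  ?u rdfs:label {sparql_literal(label)}\n"
        "}"
    )


def link_by_label(items, run_query):
    """Steps 1 and 3.

    items     -- iterable of (item URI, label) pairs, i.e. (I_i, L_i)
    run_query -- hypothetical callable: SPARQL query string -> list of result URIs
    Returns a list of (I_i, owl:sameAs, R_i_j) triples, one per query result.
    """
    links = []
    for item_uri, label in items:
        for result_uri in run_query(label_query(label)):
            links.append((item_uri, OWL_SAME_AS, result_uri))
    return links
```

Passing the endpoint call in as run_query keeps the sketch runnable without network access. Note that every result is asserted as owl:sameAs without any check, which is exactly why this first try over-links.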
Just for fun, here are the results of such a query. You can't imagine how many people share exactly the same name... For example, Jules Verne in the BBC John Peel sessions dataset is quite different from the Jules Verne in DBPedia... Also, Jules Verne is quite different from the Jules Verne category, though they share the same label.
Try 2 - Still matching on labels, but with restrictions:
In the agent case, disambiguation appears easy to achieve, by just expressing "I am actually looking for someone who could somehow be related to the John Peel sessions". But, err, Wikipedia (hence, DBPedia) is a bit messy sometimes, and it is quite difficult to find a reliable and consistent way of expressing this criterion. So I had to sample the John Peel data (taking some producers, some engineers, some artists, some bands) and work out manually how I could restrict the range of resources I was looking for in DBPedia and still be able to retrieve all my linked agents. This leads to the following SPARQL query (involving L_i as defined earlier):
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?u
WHERE
{
  {
    {?u <http://dbpedia.org/property/name> L_i} UNION
    {?u rdfs:label L_i} UNION
    {?u <http://dbpedia.org/property/bandName> L_i}
  }
  {
    {?u <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:infobox_musical_artist>} UNION
    {?u a <http://dbpedia.org/class/yago/Group100031264>} UNION
    {?u a ?mus. ?mus rdfs:subClassOf <http://dbpedia.org/class/yago/Musician110340312>} UNION
    {?u a ?artist. ?artist rdfs:subClassOf <http://dbpedia.org/class/yago/Creator109614315>}
  }
}
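To show how L_i gets plugged into this query, here is a minimal Python sketch that builds the query text for a given label. The function and constant names (sparql_literal, union, agent_query, NAME_PROPERTIES, TYPE_RESTRICTIONS) are hypothetical; the property and class URIs are the ones from the query above. It assumes the label is matched as a plain literal without a language tag.

```python
# Properties that may carry the name we are matching on.
NAME_PROPERTIES = [
    "<http://dbpedia.org/property/name>",
    "rdfs:label",
    "<http://dbpedia.org/property/bandName>",
]

# Alternative ways a DBPedia resource can look like a music-related agent.
TYPE_RESTRICTIONS = [
    "?u <http://dbpedia.org/property/wikiPageUsesTemplate> "
    "<http://dbpedia.org/resource/Template:infobox_musical_artist>",
    "?u a <http://dbpedia.org/class/yago/Group100031264>",
    "?u a ?mus. ?mus rdfs:subClassOf <http://dbpedia.org/class/yago/Musician110340312>",
    "?u a ?artist. ?artist rdfs:subClassOf <http://dbpedia.org/class/yago/Creator109614315>",
]


def sparql_literal(label):
    """Quote a label as a SPARQL string literal (escapes backslashes and double quotes only)."""
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def union(patterns):
    """Wrap each pattern in braces and join the alternatives with UNION."""
    return " UNION\n    ".join("{" + p + "}" for p in patterns)


def agent_query(label):
    """Build the restricted query: a name match AND one of the type restrictions."""
    literal = sparql_literal(label)
    names = union(f"?u {prop} {literal}" for prop in NAME_PROPERTIES)
    kinds = union(TYPE_RESTRICTIONS)
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?u\n"
        "WHERE\n"
        "{\n"
        f"  {{\n    {names}\n  }}\n"
        f"  {{\n    {kinds}\n  }}\n"
        "}"
    )
```

Keeping the two lists separate makes it easy to add another name property or type restriction when a sampled agent turns out not to be retrieved.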
And for musical works?
Of course, this does not hold as soon as I broaden the range of resources I want to link. I first tried to use exactly the same methodology (basically restricting the resources I was looking for to those related to the Yago song concept). But, err, it did not work that well :-) You can't imagine how many songs have the same name! Just look at the results - this is enlightening! So far, the best I found was Walked Away, which appears to be the most popular title :-)
So what did I do to disambiguate? I took these results, got the RDF corresponding to the DBPedia resources, and made a literal search using the SWI literal index module on the abstract, looking for the name of the artist involved in the performance of this work. That's a bit hacky, but well, it worked well even with cover songs (like Nirvana's Love Buzz).
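The idea of this filter can be sketched in a few lines of Python. This is a simplification under stated assumptions: it uses a plain case-insensitive substring test rather than the SWI literal index, and the names mentions_artist and disambiguate, as well as the dictionary-of-abstracts input, are invented for the illustration.

```python
def mentions_artist(abstract, artist_name):
    """True if the artist's name occurs in the abstract, ignoring case."""
    return artist_name.casefold() in abstract.casefold()


def disambiguate(candidates, artist_name):
    """Keep only the candidate works whose abstract mentions the artist.

    candidates  -- dict mapping a candidate DBPedia URI to its abstract text
    artist_name -- name of the artist performing the work in the Peel session
    """
    return [
        uri
        for uri, abstract in candidates.items()
        if mentions_artist(abstract, artist_name)
    ]
```

A cover still gets linked as long as the abstract of the song's page mentions the covering artist somewhere, which is why this trick can survive cases like Love Buzz. A song whose abstract never names the performer will simply be left unlinked.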
Results
Surprisingly, the results do not seem to be too bad! I still have to check them more thoroughly, but nothing seems obviously wrong at first glance (I went through all the links manually, looking for anything that would not make sense). The links are available here.
The SWI-Prolog code doing the trick is available here. Sorry, the code is a bit messy, and got increasingly hacky...