This post deals with further interlinking experiments based on the Jamendo dataset, in particular equivalence mining: that is, stating that a resource in the Jamendo dataset is the same as a resource in the Musicbrainz dataset.

For example, we want to derive automatically that http://dbtune.org/jamendo/artist/5 is the same as http://musicbrainz.org/artist/0781a... (I will use this example throughout this post, as it illustrates many of the problems I had to overcome).

Independent artists and the failure of literal lookup

In my previous post, I detailed a linking example which was basically a literal lookup: getting from a particular string (such as Paris, France) to a URI identifying this geographical location, through a web service (in this case, the Geonames one). This relies on the hypothesis that one literal can be associated with exactly one URI. For example, if the string is just Paris, the linking process fails: should we link to a URI identifying Paris, Texas or Paris, France?

For mainstream artists, having at most one URI in the Musicbrainz dataset associated with a given string seems like a fair assumption. There is no way I could start a band called Metallica, I think :-)

But for independent artists, this is not true. For example, the French band Both has exactly the same name as a US band. We therefore need a disambiguation process here.

Another problem arises when a band in the Jamendo dataset, like NoU, is not in the Musicbrainz dataset, but there is another band called Nou there. We need to throw away such wrong matches.
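To make these two failure modes concrete, here is a minimal Python sketch of a literal lookup. It is not part of the tooling described in this post: the index contents and the `example.org` URIs are hypothetical placeholders, and only the band names come from the examples above.

```python
from typing import Dict, List, Optional

# Hypothetical name index standing in for a lookup web service.
# The example.org URIs are placeholders, not real Musicbrainz identifiers.
NAME_INDEX: Dict[str, List[str]] = {
    "both": [
        "http://example.org/artist/both-fr",
        "http://example.org/artist/both-us",
    ],
    "nou": ["http://example.org/artist/nou"],
}


def literal_lookup(name: str) -> Optional[str]:
    """Return a URI only when the name maps to exactly one candidate."""
    candidates = NAME_INDEX.get(name.strip().lower(), [])
    if len(candidates) == 1:
        return candidates[0]
    # Zero or several candidates: the literal alone cannot decide.
    return None
```

With this index, looking up Both gives no answer because two candidates share the name, while looking up NoU returns the single Nou entry even though it may be a different band. Both cases need more evidence than the name itself.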

Disambiguation and propagation

Now, let's try to identify whether http://dbtune.org/jamendo/artist/5 is equivalent to http://zitgist.com/music/artist/078... or http://zitgist.com/music/artist/5f9..., and that http://dbtune.org/jamendo/artist/10... is not equivalent to http://zitgist.com/music/artist/7c4....

By GETting these URIs, we can access their RDF descriptions, which are designed according to the Music Ontology. We can use these descriptions to express that, if two artists have produced records with similar names, they are more likely to be the same. This also implies that the matched records are likely to be the same. So, at the same time, we disambiguate and we propagate the equivalence relationships.

Algorithm

This leads us to the following equivalence mining algorithm. We define a predicate similar_to(+Resource1,?Resource2,-Confidence), which captures the notion of similarity between two objects. In our Jamendo/Musicbrainz mapping example, we define this predicate as follows (we use a Prolog-like notation: variables start with an upper-case character, and the mode is given in the head: ?=in or out, +=in, -=out):

     similar_to(+Resource1, -Resource2, -Confidence) is true if
               Resource1 is a mo:MusicArtist
               Resource1 has a name Name
               The musicbrainz web service, when queried with Name, returns ID associated with Confidence
               Resource2 is the concatenation of 'http://zitgist.com/music/artist/' and ID

and

     similar_to(+Resource1, +Resource2, -Confidence) is true if
               Resource1 is a mo:Record or a mo:Track
               Resource2 is a mo:Record or a mo:Track
               Resource1 and Resource2 have a similar title, with a confidence Confidence

Moreover, in all other cases, similar_to is still true, but the confidence is then 0.
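The three cases of similar_to can be sketched in Python as follows. This is an illustration under assumptions, not the Prolog code of the actual implementation: the string measure (`difflib.SequenceMatcher`) is my choice for the sketch, the `search` callable is a hypothetical stand-in for the Musicbrainz web service, and resources are plain dictionaries rather than dereferenced RDF.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Iterable, Iterator, Tuple

Resource = Dict[str, str]  # e.g. {"type": "mo:Track", "title": "..."}
# Hypothetical search function: name -> iterable of (ID, Confidence).
Search = Callable[[str], Iterable[Tuple[str, float]]]

TITLED_TYPES = {"mo:Record", "mo:Track"}


def title_similarity(a: str, b: str) -> float:
    """Assumed measure: difflib ratio in [0, 1] on lower-cased titles."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similar_artists(artist: Resource, search: Search) -> Iterator[Tuple[str, float]]:
    """First clause: enumerate (Resource2, Confidence) candidates for an artist."""
    if artist.get("type") != "mo:MusicArtist":
        return
    for mb_id, confidence in search(artist["name"]):
        yield "http://zitgist.com/music/artist/" + mb_id, confidence


def similar_to(r1: Resource, r2: Resource) -> float:
    """Second clause plus the default: compare titles, else confidence 0."""
    if r1.get("type") in TITLED_TYPES and r2.get("type") in TITLED_TYPES:
        return title_similarity(r1["title"], r2["title"])
    return 0.0
```

Splitting the predicate into a generator (Resource2 unbound) and a scoring function (Resource2 bound) mirrors the two modes that the single Prolog predicate covers.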

Now, we define a path (a set of predicates), which will be used to propagate the equivalence. In our example, it is {foaf:made,mo:has_track}: we are starting from a MusicArtist resource, which made some records, and these records have tracks.
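As a rough illustration of what following this path means, here is a small Python sketch over an in-memory graph. The graph representation and the function name `follow_path` are assumptions made for the example; the real process dereferences URIs on the web instead of reading a dictionary.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory graph: subject -> list of (predicate, object).
Graph = Dict[str, List[Tuple[str, str]]]

# The path used in the Jamendo/Musicbrainz example.
PATH = ["foaf:made", "mo:has_track"]


def follow_path(graph: Graph, start: str, path: List[str]) -> List[List[str]]:
    """Return the resources reached at each step, starting with [start]."""
    levels = [[start]]
    for predicate in path:
        reached = [
            obj
            for subject in levels[-1]
            for (pred, obj) in graph.get(subject, [])
            if pred == predicate
        ]
        levels.append(reached)
    return levels
```

Starting from an artist, the first level holds the artist, the second its records, and the third the tracks of those records. These are the resources on which similarity is then evaluated.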

The equivalence mining algorithm is defined as follows. We first run the process depicted here:

[Figure: Equivalence Mining algorithm]

Every newly appearing resource is dereferenced, so the algorithm works in a linked data environment. It just uses one start URI as an input.

Then, we define a mapping as a set of tuples {Uri1,Uri2}, associated with a confidence C, which is the sum of the confidences associated with every tuple. The resulting mapping is the one with the highest confidence (provided it is higher than a threshold, in order to drop wrong matches such as the one mentioned earlier for NoU).
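The selection step can be sketched in a few lines of Python. The names `mapping_confidence` and `best_mapping` are hypothetical, and the threshold is left as a parameter because its actual value is not given here.

```python
from typing import List, Optional, Tuple

# One matched pair: (Uri1, Uri2, confidence of this pair).
Pair = Tuple[str, str, float]
Mapping = List[Pair]


def mapping_confidence(mapping: Mapping) -> float:
    """Confidence C of a mapping: the sum of its pair confidences."""
    return sum(confidence for _, _, confidence in mapping)


def best_mapping(candidates: List[Mapping], threshold: float) -> Optional[Mapping]:
    """Pick the highest-confidence mapping, or None if it is not above the threshold."""
    if not candidates:
        return None
    best = max(candidates, key=mapping_confidence)
    return best if mapping_confidence(best) > threshold else None
```

A candidate artist whose records and tracks also match accumulates confidence along the path, whereas a name-only match (as for NoU) stays low and is dropped by the threshold.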

Implementation

I wrote an implementation of this algorithm using SWI-Prolog (everything is GPL'd). To run it, you need the CVS version of SWI, compiled with the http, semweb and nlp packages. You can test it by loading ldmapper.pl in SWI and then running:

?- mapping('http://dbtune.org/jamendo/artist/5',Mapping).

To adapt it to other datasets, you just have to add some similar_to clauses and define which path you want to follow. Or, if you are not a Prolog geek, just give me a list of URIs you want to map, along with a path and the sort of similarity you want to introduce: I'll be happy to do it!

Results

I experimented with this implementation in order to automatically link the Jamendo and Musicbrainz datasets. As the current implementation is not multi-threaded (it runs the algorithm on one artist after another), it is a bit slow (one day to link the entire dataset). It derived 1702 equivalence statements (these statements are available there), distributed over tracks, artists and records, and it determined with good confidence that the other artists, tracks and records in Jamendo are not referenced within Musicbrainz.

Here are some examples: