The paper we wrote with Christopher Sutton and Mark Sandler has been accepted to the Linking Data on the Web workshop at WWW, and it is now available from the workshop website.
This paper explains in a bit more detail (including pseudocode and an evaluation of two implementations) the algorithm I already described briefly in previous posts (oh yeah, this one too).
The problem is: how can you automatically derive owl:sameAs links between two different, previously unconnected, web datasets? For example, how can I state that "Both" in Jamendo is the same as "Both" in Musicbrainz?
The paper studies three different approaches. The first one is just a simple literal lookup (so in the previous example, it just fails, because there are two artists and a gazillion tracks and records holding "Both" in their titles in Musicbrainz). The second one is a constrained literal lookup (we specifically look for an artist, a record, a track, etc.). Our previous example still fails, because there are two matching artists in Musicbrainz for "Both".
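To make the difference between the first two approaches concrete, here is a minimal sketch on a toy in-memory dataset. The identifiers, the track title and the function names (`literal_lookup`, `constrained_lookup`, `unique_match`) are all made up for illustration; this is not the code evaluated in the paper, and it does not query the real Musicbrainz service.

```python
# Toy stand-in for the target dataset. Every identifier and title is invented.
TARGET = [
    {"id": "mb:artist/1", "type": "artist", "label": "Both"},
    {"id": "mb:artist/2", "type": "artist", "label": "Both"},
    {"id": "mb:record/3", "type": "record", "label": "Both"},
    {"id": "mb:track/7", "type": "track", "label": "Both Sides Now"},
]


def literal_lookup(label, dataset):
    """Approach 1: every resource whose label contains the literal."""
    needle = label.lower()
    return [r["id"] for r in dataset if needle in r["label"].lower()]


def constrained_lookup(label, rtype, dataset):
    """Approach 2: resources of one given type whose label equals the literal."""
    needle = label.lower()
    return [
        r["id"]
        for r in dataset
        if r["type"] == rtype and r["label"].lower() == needle
    ]


def unique_match(candidates):
    """A link can only be asserted when exactly one candidate remains."""
    return candidates[0] if len(candidates) == 1 else None


everything = literal_lookup("Both", TARGET)
artists_only = constrained_lookup("Both", "artist", TARGET)
print(len(everything), unique_match(everything))
print(len(artists_only), unique_match(artists_only))
```

Constraining by type shrinks the candidate set, but with two artists sharing the same name neither lookup ends up with a single candidate, so neither can safely produce an owl:sameAs link.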
The third approach, the algorithm we describe in the paper, can intuitively be described as: if two artists made albums with the same title, they are more likely to be the same artist. It will browse linked data in order to aggregate further clues and be confident enough to disambiguate among several matching resources.
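The intuition above can be sketched in a few lines. This is a deliberately simplified illustration, not the algorithm from the paper: it only looks one step away (album titles), scores candidates by the number of shared titles, and refuses to decide on a tie. The names `normalise`, `shared_titles` and `disambiguate`, the `min_score` parameter, and all the data are assumptions made for the example.

```python
def normalise(title):
    """Lower-case and collapse whitespace so trivial variations still match."""
    return " ".join(title.lower().split())


def shared_titles(source_albums, candidate_albums):
    """Count album titles that appear on both sides after normalisation."""
    source = {normalise(t) for t in source_albums}
    candidate = {normalise(t) for t in candidate_albums}
    return len(source & candidate)


def disambiguate(source_albums, candidates, min_score=1):
    """Pick the candidate sharing the most album titles with the source.

    `candidates` maps a candidate identifier to its list of album titles.
    Returns None when there is no evidence or when the best score is tied.
    """
    scored = sorted(
        ((shared_titles(source_albums, albums), cid)
         for cid, albums in candidates.items()),
        reverse=True,
    )
    if not scored or scored[0][0] < min_score:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # still ambiguous: more clues would be needed
    return scored[0][1]


# Made-up data: two candidate artists with the same name, different albums.
source = ["First Album", "Second  album"]
candidates = {
    "mb:artist/1": ["First Album", "Second Album", "Live"],
    "mb:artist/2": ["Unrelated Record"],
}
print(disambiguate(source, candidates))
```

Returning None on a tie mirrors the idea of only linking once enough clues have been aggregated; a fuller version would keep following links (tracks, for instance) instead of giving up.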
Although not specific to the music domain, we did evaluate it in two music-related contexts:
- Interlinking of Jamendo and Musicbrainz (using a SWI-Prolog implementation);
- Interlinking of personal music collections and Musicbrainz (using the GNAT software, SVN version).
For the second one, we tried to cover a wide range of possible metadata mistakes, and checked how well our algorithm was coping with such bad metadata. A week ago, I compared the results with the Picard Musicbrainz tagger 0.9.0. Keep in mind that our algorithm is quite a bit slower, as the Musicbrainz API is not really designed for the sort of things we do with it. Here are the results for the track "I Want to Hold Your Hand" by the Beatles, on the "Meet the Beatles!" album:
- Artist field missing:
  - GNAT: Correct
  - Picard: Matches the same track, but on the "And Now: The Beatles" compilation
- Artist set to random string:
  - GNAT: Correct
  - Picard: Matches the same track, but on another release (track 1 of "The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!)")
- Artist set to "Beetles":
  - GNAT: Correct
  - Picard: Matches the same track, but on another release (track 1 of "The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!)")
- Artist set to "Al Green" (who actually made a cover of that song):
  - GNAT: Mapped to Al Green's cover version on "Soul Tribute to the Beatles"
  - Picard: Same
- Album field missing:
  - GNAT: Matches the same track, but on another release (track 1 of "The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!)")
  - Picard: Matches the same track, but on the single
- Album set to random string:
  - GNAT: Matches the same track, but on another release (track 1 of "The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!)")
  - Picard: Matches the same track, but on the single
- Album set to "Meat the Beatles":
  - GNAT: Matches the same track, but on another release (track 1 of "The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!)")
  - Picard: Matches the same track, but on the compilation "The Beatles Beat"
- Track set to random string:
  - GNAT: Correct
  - Picard: No results
- Track set to "I Wanna Hold Your Hand":
  - GNAT: Correct
  - Picard: Matches the same track, but on the compilation "The Beatles Beat"
- Perfect metadata:
  - GNAT: Correct
  - Picard: Matches the same track, but on the compilation "The Beatles Beat"
Most of the compilation results of Picard are actually not wrong, as the track length of our test file is closer to the track length on the compilation than to the track length on the "Meet the Beatles!" album.
Of course, this is not an extensive evaluation of how the Picard lookup mechanism compares with GNAT. And GNAT is not able to compete at all with Picard, as it was clearly not designed for the same purpose (GNAT is meant to interlink RDF datasets).
The Python implementation of our algorithm is under the BSD license, and available in the motools SourceForge project. The Prolog implementation (working on RDF datasets) is available in the same project.
