The Quest for Canonical Spelling in music metadata
By Yves on Thursday 27 March 2008, 11:18
Last.fm recently unveiled their new fingerprinting lookup mechanism. They have aggregated quite a lot of fingerprints (650 million) using their fingerprinting software, which is a nice basis for such a lookup, and perhaps a viable alternative to MusicDNS. I gave it a try (I just had to build a 64-bit Linux version of the lookup software), and was quite surprised by the results. The quality of the fingerprinting does look good, but here are the results for a particular song:
<?xml version="1.0"?>
<!DOCTYPE metadata SYSTEM "http://fingerprints.last.fm/xml/metadata.dtd">
<metadata fid="281948" lastmodified="1205776219">
<track confidence="0.622890">
<artist>Leftover Crack</artist>
<title>Operation: M.O.V.E.</title>
<url>http://www.last.fm/music/Leftover+Crack/_/Operation%3A+M.O.V.E.</url>
</track>
<track confidence="0.327927">
<artist>Leftöver Crack</artist>
<title>Operation: M.O.V.E.</title>
<url>http://www.last.fm/music/Left%C3%B6ver+Crack/_/Operation%3A+M.O.V.E.</url>
</track>
<track confidence="0.007860">
<artist>Leftover Crack</artist>
<title>Operation MOVE</title>
<url>http://www.last.fm/music/Leftover+Crack/_/Operation+MOVE</url>
</track>
<track confidence="0.006180">
<artist>Leftover Crack</artist>
<title>Operation M.O.V.E.</title>
<url>http://www.last.fm/music/Leftover+Crack/_/Operation+M.O.V.E.</url>
</track>
<track confidence="0.004883">
<artist>Leftover Crack</artist>
<title>Operation; M.O.V.E.</title>
<url>http://www.last.fm/music/Leftover+Crack/_/Operation%3B+M.O.V.E.</url>
</track>
<track confidence="0.004826">
<artist>Leftöver Crack</artist>
<title>Operation M.O.V.E.</title>
<url>http://www.last.fm/music/Left%C3%B6ver+Crack/_/Operation+M.O.V.E.</url>
</track>
<track confidence="0.004717">
<artist>Leftöver Crack</artist>
<title>13 - operation m.o.v.e</title>
<url>http://www.last.fm/music/Left%C3%B6ver+Crack/_/13+-+operation+m.o.v.e</url>
</track>
....
</metadata>
And it goes on and on... There are 21 results for this single track, which all actually correspond to this track.
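To make the duplication concrete, here is a minimal sketch that parses a response of this shape and counts the distinct spellings and URLs it contains. It only uses the first three tracks from the excerpt above (the XML declaration and DOCTYPE are left out for brevity), and the `summarize` helper is a name made up for this illustration, not part of any last.fm tool.

```python
import xml.etree.ElementTree as ET

# First three <track> entries from the response above, copied as-is.
RESPONSE = """<metadata fid="281948" lastmodified="1205776219">
<track confidence="0.622890">
<artist>Leftover Crack</artist>
<title>Operation: M.O.V.E.</title>
<url>http://www.last.fm/music/Leftover+Crack/_/Operation%3A+M.O.V.E.</url>
</track>
<track confidence="0.327927">
<artist>Leftöver Crack</artist>
<title>Operation: M.O.V.E.</title>
<url>http://www.last.fm/music/Left%C3%B6ver+Crack/_/Operation%3A+M.O.V.E.</url>
</track>
<track confidence="0.007860">
<artist>Leftover Crack</artist>
<title>Operation MOVE</title>
<url>http://www.last.fm/music/Leftover+Crack/_/Operation+MOVE</url>
</track>
</metadata>"""


def summarize(xml_text):
    """Collect the fingerprint id and the distinct labels/URLs of a response."""
    root = ET.fromstring(xml_text)
    tracks = root.findall("track")
    return {
        "fid": root.get("fid"),
        "urls": {t.findtext("url") for t in tracks},
        "artists": {t.findtext("artist") for t in tracks},
        "titles": {t.findtext("title") for t in tracks},
    }


summary = summarize(RESPONSE)
print(summary["fid"], len(summary["urls"]), "distinct URLs")
```

One fingerprint id, but as many URLs as there are spelling combinations: that is the mismatch the rest of this post is about.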
So, what is bothering me here? After all, the first result holds textual metadata that I could consider more or less correct (even if that's not the way I would spell this band's name, but they plan to put a voting system in place to solve this sort of issue).
The real problem is that there are 21 URIs in last.fm for the same thing. The emphasis of the last.fm metadata system is then probably on the textual metadata: two different ways of spelling the name of a band = two bands. But I do think this is wrong: for example, how would you handle the fact that the Russian band Ария is spelled Aria in English? Both spellings are correct, and they correspond to one unique band.
In my opinion, the important thing is the identifier. As long as you have one identifier for one single thing (an artist, an album, a track), you're saved. The relationship between a band, an artist, a track, etc. and its label is clearly a one-to-many one: the quest for a canonical spelling will never end... And what worries me even more is that it tends to kill the spellings in all languages but English (especially if a voting system is in place?).
Once you have a single identifier for a single thing within your system, you can start attaching labels to it, perhaps with a language tag. Then, it is up to the presentation layer to show you the label matching your preferences. And if you lean towards such a model, MusicBrainz (centralised and moderated) or RDF and the Music Ontology (decentralised and not moderated) are probably the way to go.
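As a rough sketch of that model (one identifier, many language-tagged labels, and a presentation layer that picks one), something like the following would do. The `Entity` class, its method names and the identifier string are all hypothetical, made up for this illustration; this is not how MusicBrainz or the Music Ontology actually store things.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One thing (artist, album, track) with one identifier and many labels."""
    identifier: str
    # Maps a language tag to the labels known in that language, in insertion order.
    labels: dict = field(default_factory=dict)

    def add_label(self, lang, text):
        self.labels.setdefault(lang, []).append(text)

    def label_for(self, preferred_langs):
        """Presentation-layer choice: first label in the first preferred language."""
        for lang in preferred_langs:
            if self.labels.get(lang):
                return self.labels[lang][0]
        # No preferred language available: fall back to any label, then to the id.
        for texts in self.labels.values():
            if texts:
                return texts[0]
        return self.identifier


aria = Entity("artist/hypothetical-id-for-aria")
aria.add_label("ru", "Ария")
aria.add_label("en", "Aria")

print(aria.label_for(["ru", "en"]))  # a Russian-speaking user's preferences
print(aria.label_for(["en"]))        # an English-speaking user's preferences
```

Nothing here needs a "canonical" spelling: both labels hang off the same identifier, and several labels per language fit in the same structure.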
I guess this emphasis on textual metadata is mainly due to the legacy of ID3 and other embedded metadata formats, which allow just one single title for the track, the album and the artist to be associated with an audio file?
I think that the real problem for last.fm will now be to match all the different identifiers they have for a single thing in their system, which is known as the record linkage problem in the database/Semantic Web community. But I also think this is not too far-fetched, as they have already begun to link their database to the MusicBrainz one?
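To give an idea of what the simplest form of such matching could look like, here is a naive sketch that groups the seven spellings shown in the response above under a normalised key. The `spelling_key` and `group_variants` helpers are invented for this illustration; this is certainly not what last.fm does, and real record linkage relies on much more than string normalisation (the fingerprint itself being the obvious extra signal).

```python
import re
import unicodedata
from collections import defaultdict


def spelling_key(text):
    """Naive key: drop diacritics, case, a leading track number and punctuation."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    lowered = no_marks.casefold()
    no_track_number = re.sub(r"^\d+\s*-\s*", "", lowered)
    return re.sub(r"[^\w]", "", no_track_number)


def group_variants(pairs):
    """Group (artist, title) pairs whose normalised keys are identical."""
    groups = defaultdict(list)
    for artist, title in pairs:
        groups[(spelling_key(artist), spelling_key(title))].append((artist, title))
    return dict(groups)


# The seven (artist, title) pairs visible in the response above.
VARIANTS = [
    ("Leftover Crack", "Operation: M.O.V.E."),
    ("Leftöver Crack", "Operation: M.O.V.E."),
    ("Leftover Crack", "Operation MOVE"),
    ("Leftover Crack", "Operation M.O.V.E."),
    ("Leftover Crack", "Operation; M.O.V.E."),
    ("Leftöver Crack", "Operation M.O.V.E."),
    ("Leftöver Crack", "13 - operation m.o.v.e"),
]

groups = group_variants(VARIANTS)
print(len(groups), "group(s) for", len(VARIANTS), "spellings")
```

Note that this kind of normalisation does nothing for Ария versus Aria: the two keys stay different, which is exactly why the link has to be made at the identifier level rather than by cleaning up strings.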
Comments
It is true that currently we have 21 different artist pages, one for each (mis)spelling, but our idea (thanks also to the fingerprint technology) is to merge them into a single one. The identifier you're talking about is clearly central. Indeed, you can get such an ID if you specify -nometadata as an option to our program! :)
As for different languages, our goal is to return whatever is specified in the preferences of the user. If he/she is Japanese, we will return the kanji spelling unless specified otherwise.
Hi Norman!
Oh, cool, it actually finds an ID with -nometadata: 281948 for the same example as above. Is there a way to associate this ID with a URI that I can dereference to get all the information about this ID within your database (including all alternate labels)? From a quick look, the AudioScrobbler API seems to key such requests on the artist name.
And that's really good news for the merging!! Also, do you have any plans to provide some structured data from the artist/track/album URIs (RDF, RDFa or microformats)?
Cheers, and thanks :-)
y
Actually, I just figured out that:
http://ws.audioscrobbler.com/finger...
gives access to the results corresponding to <id>.
It then looks like <id> identifies the actual fingerprint within your db. So I guess the merging you're planning will "propagate" this identification to further resources (artists, albums, etc.).
Cheers, and thanks again for the hint!
y
Hi Yves!
Or, alternative use the option -url instead of -nometadata.. ;-)
alternativelY
Doesn't xml:lang solve some of the problem here?
Hi Daniel!
It can, to handle a different label per language - but I don't think it can help for the merging of the different URIs, and also for cases where you need to handle several labels in one single language.
I guess my point was more in favour of merging all the different URIs for a single artist/track/album (the ID you get with the fingerprinting lookup software corresponds to the fingerprint). Then, all the different alternate spellings can be attached to it, and picking the one that you consider canonical is another problem (slightly less important, in my opinion).
On a side-note, it'd also be neat if the voting system handled voting for a label in a particular language :-)
Cheers!
y