Tag - musicbrainz
Sunday 27 July 2008
Musicbrainz RDF updated
By Yves on Sunday 27 July 2008, 20:02
Well, I guess everything is in the title :-) The dump used is now the one from the 26th of July. I also moved everything to a much faster server. The D2R mapping is still not 100% complete: I am getting through it really slowly, as PhD writing takes almost all my time these days. I recently added owl:sameAs links to the DBTune Myspace service, so you can easily get from Musicbrainz artists to the corresponding MP3s available on MySpace and to their social networks. See for example Madonna, linked through owl:sameAs to the corresponding DBpedia artist and to the corresponding Myspace artist.
Monday 7 April 2008
D2RQ mapping for Musicbrainz
By Yves on Monday 7 April 2008, 16:27
I just started a D2R mapping for Musicbrainz, which makes it possible to create a SPARQL end-point and to provide linked data access out of Musicbrainz fairly easily. A D2R instance loaded with the mapping as it is now is also available (be gentle, it is running on a cheap computer :-) ).
In addition to the things that are available within the Zitgist mapping:
- SPARQL end point ;
- Support for tags ;
- Support for a couple of advanced relationships (still working my way through it, though) ;
- Instrument taxonomy directly generated from the db, and related to performance events ;
- Support for orchestras ;
- Linked with DBpedia for places and Lingvoj for languages
There is still a lot to do, though: it is really a start. The mapping is available on the motools sourceforge project. I hope to post a follow-up soon! (including examples of funny SPARQL queries :-) ).
Update: For some obscure port-forwarding reasons, the SNORQL interface to the SPARQL end point does not work on the test server.
Update 2: This is fixed. (thanks to the anonymous SPARQL crash tester who helped me find the bug, by the way :-) )
Thursday 27 March 2008
The Quest for Canonical Spelling in music metadata
By Yves on Thursday 27 March 2008, 11:18
Last.fm recently unveiled their new fingerprinting lookup mechanism. They did aggregate quite a lot of fingerprints (650 million) using their fingerprinting software, which is a nice basis for such a lookup, perhaps bringing a viable alternative to Music DNS. I gave it a try (I just had to build a Linux 64 version of the lookup software), and was quite surprised by the results. The quality of the fingerprinting looks indeed good, but here are the results for a particular song:
<?xml version="1.0"?>
<!DOCTYPE metadata SYSTEM "http://fingerprints.last.fm/xml/metadata.dtd">
<metadata fid="281948" lastmodified="1205776219">
<track confidence="0.622890">
<artist>Leftover Crack</artist>
<title>Operation: M.O.V.E.</title>
<url>http://www.last.fm/music/Leftover+Crack/_/Operation%3A+M.O.V.E.</url>
</track>
<track confidence="0.327927">
<artist>Leftöver Crack</artist>
<title>Operation: M.O.V.E.</title>
<url>http://www.last.fm/music/Left%C3%B6ver+Crack/_/Operation%3A+M.O.V.E.</url>
</track>
<track confidence="0.007860">
<artist>Leftover Crack</artist>
<title>Operation MOVE</title>
<url>http://www.last.fm/music/Leftover+Crack/_/Operation+MOVE</url>
</track>
<track confidence="0.006180">
<artist>Leftover Crack</artist>
<title>Operation M.O.V.E.</title>
<url>http://www.last.fm/music/Leftover+Crack/_/Operation+M.O.V.E.</url>
</track>
<track confidence="0.004883">
<artist>Leftover Crack</artist>
<title>Operation; M.O.V.E.</title>
<url>http://www.last.fm/music/Leftover+Crack/_/Operation%3B+M.O.V.E.</url>
</track>
<track confidence="0.004826">
<artist>Leftöver Crack</artist>
<title>Operation M.O.V.E.</title>
<url>http://www.last.fm/music/Left%C3%B6ver+Crack/_/Operation+M.O.V.E.</url>
</track>
<track confidence="0.004717">
<artist>Leftöver Crack</artist>
<title>13 - operation m.o.v.e</title>
<url>http://www.last.fm/music/Left%C3%B6ver+Crack/_/13+-+operation+m.o.v.e</url>
</track>
....
</metadata>
And it goes on and on... There are 21 results for this single track, which all actually correspond to this track.
So, what is disturbing me here? After all, the first result holds textual metadata that I could consider as somehow correct (even if that's not the way I would spell this band's name, but they plan to put a voting system in place to solve this sort of issue).
The real problem is that there are 21 URIs in last.fm for the same thing. The emphasis of the last.fm metadata system is then probably on the textual metadata: two different ways of spelling the name of a band = two bands. But I do think this is wrong: for example, how would you handle the fact that the Russian band Ария is spelled Aria in English? The two spellings are correct, and they correspond to one unique band.
In my opinion, the important thing is the identifier. As long as you have one identifier for one single thing (an artist, an album, a track), you're saved. The relationship between a band, an artist, a track, etc. and its label is clearly a one-to-many one: the quest for a canonical spelling will never end... And what worries me even more is that it tends to kill the spellings in all languages but English (especially if a voting system is in place?).
Once you have a single identifier for a single thing within your system, you can start attaching labels to it, perhaps with a language tag. Then, it is up to the presentation layer to show you the label matching your preferences. And if you lean towards such a model, Musicbrainz (centralised and moderated) or RDF and the Music Ontology (decentralised and not moderated) are probably the way to go.
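To make the "one identifier, many labels" idea concrete, here is a minimal sketch in Python. The identifier `artist:aria`, the `LABELS` store and the `pick_label` function are all made up for illustration; a real system would keep the labels in Musicbrainz or in RDF rather than in a dictionary.

```python
# Hypothetical in-memory store: one identifier, many language-tagged labels.
LABELS = {
    "artist:aria": {"ru": "Ария", "en": "Aria"},
}


def pick_label(identifier, preferred_languages, labels=LABELS):
    """Return the label for the first preferred language that is available."""
    available = labels[identifier]
    for lang in preferred_languages:
        if lang in available:
            return available[lang]
    # No preference matched: fall back to the first language in sorted order
    # rather than failing, since any label is better than none.
    return available[sorted(available)[0]]


print(pick_label("artist:aria", ["ru", "en"]))
print(pick_label("artist:aria", ["fr", "en"]))
```

The presentation layer only ever asks for a label; nothing in the store has to decide which spelling is the canonical one.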
I guess this emphasis on textual metadata is mainly due to the legacy of ID3 and other embedded metadata formats, which allowed just one single title for the track, the album and the artist to be associated with an audio file?
I think that the real problem for last.fm will now be to match all the different identifiers they have for a single thing in their system, which is known as the record linkage problem in the database/Semantic Web community. But I also think this is not too far-fetched, as they already began to link their database to the Musicbrainz one?
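As a rough illustration of what a first record linkage pass could look like, the sketch below groups some of the spellings from the lookup result above under a crude normalised key. This is not how last.fm or Musicbrainz do it; `normalise` and `group_variants` are hypothetical helpers, and the key is deliberately naive.

```python
import re
import unicodedata
from collections import defaultdict


def normalise(text):
    """Crude matching key: drop diacritics, case and punctuation."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", "", no_marks.lower())


def group_variants(tracks):
    """Group (artist, title) pairs that share the same normalised key."""
    groups = defaultdict(list)
    for artist, title in tracks:
        groups[(normalise(artist), normalise(title))].append((artist, title))
    return dict(groups)


tracks = [
    ("Leftover Crack", "Operation: M.O.V.E."),
    ("Leftöver Crack", "Operation: M.O.V.E."),
    ("Leftover Crack", "Operation MOVE"),
    ("Leftover Crack", "Operation; M.O.V.E."),
    ("Leftöver Crack", "13 - operation m.o.v.e"),
]
groups = group_variants(tracks)
for key, variants in groups.items():
    print(key, len(variants))
```

The first four spellings collapse onto one key, but the one with the leading track number does not, and an ASCII-only key like this one would reduce Ария to an empty string. String normalisation alone only gets you part of the way, which is why the identifier matters more than the spelling.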
Thursday 6 March 2008
Interlinking music datasets on the Web
By Yves on Thursday 6 March 2008, 11:21
The paper we wrote with Christopher Sutton and Mark Sandler has been accepted to the Linking Data on the Web workshop at WWW, and it is now available from the workshop website.
This paper explains in a bit more detail (including pseudocode and the evaluation of two implementations of it) the algorithm I already described briefly in previous posts (oh yeah, this one too).
The problem is: how can you automatically derive owl:sameAs links between two different, previously unconnected, web datasets? For example, how can I state that Both in Jamendo is the same as Both in Musicbrainz?
The paper studies three different approaches. The first one is just a simple literal lookup (so in the previous example, it just fails, because there are two artists and a gazillion tracks/records holding Both in their titles, in Musicbrainz). The second one is a constrained literal lookup (we specifically look for an artist, a record, a track, etc.). Our previous example also fails, because there are two matching artists in Musicbrainz for Both.
The algorithm we describe in the paper can intuitively be described as: if two artists made albums with the same title, they have better chances to be similar. It will browse linked data in order to aggregate further clues and be confident enough to disambiguate among several matching resources.
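The failure of the first two approaches can be reproduced on toy data. In the sketch below, the `RESOURCES` list, its identifiers and its titles are all invented; only the shape of the problem (two artists called Both, plus other resources holding the word in their titles) comes from the example above.

```python
# Toy stand-in for a dataset: (identifier, type, label). All values are made up.
RESOURCES = [
    ("artist/1", "artist", "Both"),
    ("artist/2", "artist", "Both"),
    ("record/1", "record", "Both Sides"),
    ("track/1", "track", "Both of Us"),
]


def literal_lookup(name, resources=RESOURCES):
    """Approach 1: any resource whose label contains the string."""
    return [rid for rid, _, label in resources if name.lower() in label.lower()]


def constrained_lookup(name, wanted_type, resources=RESOURCES):
    """Approach 2: exact label match, restricted to one type of resource."""
    return [
        rid
        for rid, rtype, label in resources
        if rtype == wanted_type and label.lower() == name.lower()
    ]


print(literal_lookup("Both"))
print(constrained_lookup("Both", "artist"))
```

The literal lookup returns every resource, and the constrained lookup still returns two artists, so neither gives a single target for an owl:sameAs link. Further clues have to come from the data linked to each candidate.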
Although not specific to the music domain, we did evaluate it in two music-related contexts:
- Interlinking of Jamendo and Musicbrainz, using a SWI-Prolog implementation ;
- Interlinking of personal music collections and Musicbrainz (using the GNAT software, SVN version).
For the second one, we tried to cover a wide range of possible metadata mistakes, and checked how well our algorithm was coping with such bad metadata. A week ago, I compared the results with the Picard Musicbrainz tagger 0.9.0, and here are the results for the track I Want to Hold Your Hand by the Beatles, in the Meet the Beatles! album (you also have to keep in mind that our algorithm is quite a bit slower, as the Musicbrainz API is not really designed for the sort of things we do with it):
- Artist field missing:
  - GNAT: Correct
  - Picard: Matches the same track, but on the And Now: The Beatles compilation
- Artist set to random string:
  - GNAT: Correct
  - Picard: Matches the same track, but on another release (track 1 of The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!))
- Artist set to Beetles:
  - GNAT: Correct
  - Picard: Matches the same track, but on another release (track 1 of The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!))
- Artist set to Al Green (who actually made a cover of that song):
  - GNAT: Mapped to Al Green's cover version on Soul Tribute to the Beatles
  - Picard: Same
- Album field missing:
  - GNAT: Matches the same track, but on another release (track 1 of The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!))
  - Picard: Matches the same track, but on the single
- Album set to random string:
  - GNAT: Matches the same track, but on another release (track 1 of The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!))
  - Picard: Matches the same track, but on the single
- Album set to Meat the Beatles:
  - GNAT: Matches the same track, but on another release (track 1 of The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!))
  - Picard: Matches the same track, but on the compilation The Beatles Beat
- Track set to random string:
  - GNAT: Correct
  - Picard: No results
- Track set to I Wanna Hold Your Hand:
  - GNAT: Correct
  - Picard: Matches the same track, but on the compilation The Beatles Beat
- Perfect metadata:
  - GNAT: Correct
  - Picard: Matches the same track, but on the compilation The Beatles Beat
Most of the compilation results of Picard are actually not wrong, as the track length of our test file is closer to the track length on the compilation than to the track length on the Meet the Beatles! album.
Of course, this is not an extensive evaluation of how the Picard lookup mechanism compares with GNAT. And GNAT is not able to compete at all with Picard, as it was clearly not designed for the same purpose (GNAT is meant to interlink RDF datasets).
The python implementation of our algorithm is under the BSD license, and available in the motools sourceforge project. The Prolog implementation (working on RDF datasets) is also available in the motools sourceforge project.
Thursday 30 August 2007
GNAT 0.1 released
By Yves on Thursday 30 August 2007, 13:39
Chris Sutton and I did some work since the first release of GNAT, and it is now in a releasable state!
You can get it here.
What does it do?
As mentioned in my previous blog post, GNAT is a small piece of software able to link your personal music collection to the Semantic Web. It will find dereferencable identifiers available somewhere on the web for tracks in your collection. Basically, GNAT crawls through your collection, and tries by several means to find the corresponding Musicbrainz identifier, which is then used to find the corresponding dereferencable URI in Zitgist. Then, RDF/XML files are put in the corresponding folder:
/music
/music/Artist1
/music/Artist1/AlbumA/info_metadata.rdf
/music/Artist1/AlbumA/info_fingerprint.rdf
/music/Artist1/AlbumB/info_metadata.rdf
/music/Artist1/AlbumB/info_fingerprint.rdf
What next?
These files hold a number of <http://zitgist.com/music/track/...> mo:available_as <local file> statements. They can then be used by a tool such as GNARQL (which will be properly released next week), which swallows them, exposes a SPARQL end point, and provides some linked data crawling facilities (to gather more information about the artists in our collection, for example), therefore making it possible to use the links pictured here (yes, sorry, I didn't know how to properly introduce the new linking-open-data schema - it looks good! and keeps on growing! :-) ):
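For readers who want to see what one of these mo:available_as statements amounts to, here is a small sketch that builds one as an N-Triples line. GNAT itself writes RDF/XML; N-Triples is used here only because it fits on one line. The `available_as_triple` function and the local file path are made up, and the path is assumed to be an absolute POSIX path.

```python
from urllib.parse import quote

MO = "http://purl.org/ontology/mo/"


def available_as_triple(track_uri, local_path):
    """Build one N-Triples line linking a track URI to a local file.

    Assumes local_path is an absolute POSIX path such as /music/a/b.mp3.
    """
    # Percent-encode spaces and other unsafe characters, keeping the slashes.
    file_uri = "file://" + quote(local_path)
    return f"<{track_uri}> <{MO}available_as> <{file_uri}> ."


print(available_as_triple(
    "http://zitgist.com/music/track/2b78923b-c260-44c1-b333-2caa020df172",
    "/music/Artist1/AlbumA/07 track.mp3",  # hypothetical file name
))
```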

Two identification strategies
GNAT can use two different identification strategies:
- Metadata lookup: in this mode, only available tags are used to identify the track. We chose an identification algorithm which is slower (although if you try to identify, for example, a collection with lots of releases, you won't notice it too much, as only the first track of a release will be slower to identify), but works a bit better than Picard's metadata lookup. Basically, the algorithm used is the same as the one I used to link the Jamendo dataset to the Musicbrainz one.
- Fingerprinting: in this mode, the Music IP fingerprinting client is used in order to find a PUID for the track, which is then used to get back to the Musicbrainz ID. This mode is obviously better when the tags are crap :-)
- The two strategies can be run in parallel, and most of the time, the best identification results are obtained when combining the two...
Usage
- To perform a metadata lookup for the music collection available at /music:
./AudioCollection.py metadata /music
- To perform a fingerprint-based lookup for the music collection available at /music:
./AudioCollection.py fingerprint /music
- To clean all previously performed identifications:
./AudioCollection.py clean /music
Dependencies
- MOPY (included) - Music Ontology PYthon interface
- genpuid (included) - MusicIP fingerprinting client
- rdflib - easy_install rdflib
- mutagen - easy_install mutagen
- Musicbrainz2 - You need a version later than 02.08.2007 (sorry)
Thursday 23 August 2007
Small Musicbrainz library for SWI-Prolog
By Yves on Thursday 23 August 2007, 11:36
I just put online a really small SWI-Prolog module for querying the Musicbrainz web service. It provides the following predicates:
- find_artist_id(+Name,-ID,-Score), which finds artist ids given a name, along with a Lucene score
- find_release_id(+Name,-ID,-Score), which provides the same thing for a release
- find_track_id(+Name,-ID,-Score), same thing for a track
I wrote only three predicates, because to identify a track, I often found the best way was not to do one single Musicbrainz query with the track name, the artist name, and the release name if it is available, but to do the following:
- Try to identify the artist
- For each artist found, try to identify the release (if it's available)
- For each release, try to identify the track
(Which is in fact really similar to what I have done for linking the Jamendo dataset to the Musicbrainz one).
Indeed, when you do a single query, it seems like the Musicbrainz web service does an exact match on the extra arguments, which fails if the album or the artist is badly spelled. And I did not manage to write a good Lucene query that was doing the identification with such accuracy... I will detail that a bit when the next generation GNAT is in a releasable state :) But well, take care you do not flood the Musicbrainz web service! No more than one query per second!
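The stepwise identification and the one-query-per-second rule can be sketched in Python as follows. This is not the Prolog module: the `lookup(kind, name, parent_id)` callable is a hypothetical stand-in for the web service queries (the real predicates take only a name), the release is assumed to be known, and summing the three scores is just one plausible way to rank the chains.

```python
import time


class Throttle:
    """Allow at most one call per `interval` seconds."""

    def __init__(self, interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.interval, self.clock, self.sleep = interval, clock, sleep
        self.last = None

    def wait(self):
        now = self.clock()
        if self.last is not None and now - self.last < self.interval:
            self.sleep(self.interval - (now - self.last))
        self.last = self.clock()


def identify(artist, release, track, lookup, throttle):
    """Artist first, then release, then track; keep the best-scoring chain.

    `lookup(kind, name, parent_id)` is hypothetical and must return a list
    of (id, score) pairs. Returns (track_id, total_score) or None.
    """
    best = None
    throttle.wait()  # every query to the service is preceded by a wait
    for artist_id, a_score in lookup("artist", artist, None):
        throttle.wait()
        for release_id, r_score in lookup("release", release, artist_id):
            throttle.wait()
            for track_id, t_score in lookup("track", track, release_id):
                total = a_score + r_score + t_score
                if best is None or total > best[1]:
                    best = (track_id, total)
    return best
```

Passing the clock and the sleep function in as parameters keeps the throttle testable without actually waiting; with the defaults it simply sleeps until a second has passed since the previous query.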
Monday 11 June 2007
Linking open data: interlinking the Jamendo and the Musicbrainz datasets
By Yves on Monday 11 June 2007, 14:33
This post deals with further interlinking experiments based on the Jamendo dataset, in particular equivalence mining - that is, stating that a resource in the Jamendo dataset is the same as a resource in the Musicbrainz dataset.
For example, we want to derive automatically that http://dbtune.org/jamendo/artist/5 is the same as http://musicbrainz.org/artist/0781a... (I will use this example throughout this post, as it illustrates many of the problems I had to overcome).
Independent artists and the failure of literal lookup
In my previous post, I detailed a linking example which was basically a literal lookup, to get back from a particular string (such as Paris, France) to a URI identifying this geographical location, through a web service (in this case, the Geonames one). This relies on the hypothesis that one literal can be associated with exactly one URI. For example, if the string is just Paris, the linking process fails: should we link to a URI identifying Paris, Texas or Paris, France?
For mainstream artists, having at most one URI in the Musicbrainz dataset associated with a given string seems like a fair assumption. There is no way I could start a band called Metallica, I think :-)
But, for independent artists, this is not true... For example, the French band Both has exactly the same name as a US band. We therefore need a disambiguation process here.
Another problem arises when a band in the Jamendo dataset, like NoU, is not in the Musicbrainz dataset, but there is another band called Nou there. We need to throw away such wrong matchings.
Disambiguation and propagation
Now, let's try to identify whether http://dbtune.org/jamendo/artist/5 is equivalent to http://zitgist.com/music/artist/078... or http://zitgist.com/music/artist/5f9..., and that http://dbtune.org/jamendo/artist/10... is not equivalent to http://zitgist.com/music/artist/7c4....
By GETting these URIs, we can access their RDF descriptions, which are designed according to the Music Ontology. We can use these descriptions in order to express that, if two artists have produced records with similar names, they are more likely to be the same. This also implies that the matched records are likely to be the same. So, at the same time, we disambiguate and we propagate the equivalence relationships.
Algorithm
This leads us to the following equivalence mining algorithm. We define a predicate similar_to(+Resource1,?Resource2,-Confidence), which captures the notion of similarity between two objects. In our Jamendo/Musicbrainz mapping example, we define this predicate as follows (we use a Prolog-like notation: variables start with an upper case character, and the mode is given in the head: ?=in or out, +=in, -=out):
similar_to(+Resource1, -Resource2, -Confidence) is true if
Resource1 is a mo:MusicArtist
Resource1 has a name Name
The musicbrainz web service, when queried with Name, returns ID associated with Confidence
Resource2 is the concatenation of 'http://zitgist.com/music/artist/' and ID
and
similar_to(+Resource1, +Resource2, -Confidence) is true if
Resource1 is a mo:Record or a mo:Track
Resource2 is a mo:Record or a mo:Track
Resource1 and Resource2 have a similar title, with a confidence Confidence
Moreover, in all other cases, similar_to is always true, but the confidence is then 0.
Now, we define a path (a set of predicates), which will be used to propagate the equivalence. In our example, it is {foaf:made,mo:has_track}: we are starting from a MusicArtist resource, which made some records, and these records have tracks.
The equivalence mining algorithm is defined as follows. We first run the process depicted here:
Every newly appearing resource is dereferenced, so the algorithm works in a linked data environment. It just uses one start URI as an input.
Then, we define a mapping as a set of tuples {Uri1,Uri2}, associated with a confidence C, which is the sum of the confidences associated to every tuple. The result mapping is the one with the highest confidence (and higher than a threshold in order to drop wrong matchings, such as the one mentioned earlier, for NoU).
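A much simplified sketch of this selection step, in Python rather than Prolog, could look like the following. It assumes the descriptions have already been fetched into plain dictionaries, follows only one step of the path (artist to records, no tracks), uses `difflib` as a stand-in for the title similarity, and picks arbitrary values for the two thresholds. None of the names below come from the actual implementation.

```python
from difflib import SequenceMatcher


def title_similarity(a, b):
    """Confidence in [0, 1] that two titles are the same."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_mapping(source, candidates, threshold=1.0, min_title=0.8):
    """Pick the candidate whose mapping has the highest summed confidence.

    `source` is {"uri": ..., "records": [(uri, title), ...]}; each candidate
    has the same keys plus "confidence", the score of the name lookup.
    Both threshold values are assumptions made for this sketch.
    """
    best = None
    for cand in candidates:
        mapping = [(source["uri"], cand["uri"], cand["confidence"])]
        for s_uri, s_title in source["records"]:
            # Propagate along the path: pair each record with its closest title.
            scored = [
                (title_similarity(s_title, c_title), c_uri)
                for c_uri, c_title in cand["records"]
            ]
            if scored:
                score, c_uri = max(scored)
                if score >= min_title:
                    mapping.append((s_uri, c_uri, score))
        total = sum(conf for _, _, conf in mapping)
        if best is None or total > best[0]:
            best = (total, mapping)
    # Below the threshold, no mapping is better than a wrong one.
    return best[1] if best and best[0] > threshold else []
```

With two candidates sharing the artist name, the one that also shares a record title ends up with the higher total and wins. A candidate that matches on the name alone, as in the NoU case, stays at or below the threshold and is dropped.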
Implementation
I wrote an implementation of such an algorithm, using SWI-Prolog (everything is GPL'd). In order to make it run, you need the CVS version of SWI, compiled with the http, the semweb and the nlp packages. You can test it by loading ldmapper.pl in SWI, and then, run:
?- mapping('http://dbtune.org/jamendo/artist/5',Mapping).
To adapt it to other datasets, you just have to add some similar_to clauses, and define which path you want to follow. Or, if you are not a Prolog geek, just give me a list of URIs you want to map, along with a path, and the sort of similarity you want to introduce: I'll be happy to do it!
Results
I experimented with this implementation, in order to automatically link together the Jamendo and the Musicbrainz datasets. As the current implementation is not multi-threaded (it runs the algorithm on one artist after another), it is a bit slow (one day to link the entire dataset). It derived 1702 equivalence statements (these statements are available there), distributed over tracks, artists and records, and it spotted with good confidence that the other artists/tracks/records in Jamendo are not referenced within Musicbrainz.
Here are some examples:
- http://dbtune.org/jamendo/artist/5 (here, the different members come from the Musicbrainz dataset, whereas the bittorrent links come from the Jamendo dataset)
- http://dbtune.org/jamendo/artist/2683
- http://dbtune.org/jamendo/record/3365 (here, the tags come from the Jamendo dataset)
- http://dbtune.org/jamendo/track/22663
Wednesday 23 May 2007
Find dereferencable URIs for tracks in your personal music collection
By Yves on Wednesday 23 May 2007, 18:27
Things are moving fast, since my last post. Indeed, Frederick just put online the Musicbrainz RDF dump, with dereferencable URIs, SPARQL end-point, everything. Great job Fred!!
This data set will surely be a sort of hub for music-related data on the Semantic Web, as it gives URIs for a large number of artists, tracks, albums, but also timelines, performances, recordings, etc. Well, almost everything defined in the Music Ontology.
I am happy to announce the first hack using this dataset :-) This is called GNAT (for GNAT is not a tagger). It is just a few lines of Python code which, from an audio file in your music collection, give you the corresponding dereferencable URI.
It also puts this URI into the ID3v2 Universal File Identifier (UFID) frame. I am not sure it is the right place to put such information though, as it is an identifier of the manifestation, not the item itself. Maybe I should use the user-defined link frames in the ID3v2 header...
So it is actually the first step of the application mentioned here!
It is quite easy to use:
$ python trackuri.py 7-don\'t_look_back.mp3
- ID3 tags
Artist: Artemis
Title: Don't Look Back
Album: Undone
- Zitgist URI
http://zitgist.com/music/track/2b78923b-c260-44c1-b333-2caa020df172
Then:
$ eyeD3 7-don\'t_look_back.mp3
7-don't_look_back.mp3 [ 3.23 MB ]
--------------------------------------------------------------------------------
Time: 3:31 MPEG1, Layer III [ 128 kb/s @ 44100 Hz - Stereo ]
--------------------------------------------------------------------------------
ID3 v2.4:
title: Don't Look Back artist: Artemis
album: Undone year: 2000
track: 7 genre: Trip-Hop (id 27)
Unique File ID: [http://zitgist.com/music/] http://zitgist.com/music/track/2b78923b-c260-44c1-b333-2caa020df172
Comment: [Description: http] [Lang: ]
//www.magnatune.com/artists/artemis
Comment: [Description: ID3v1 Comment] [Lang: XXX]
From www.magnatune.com
You can also output the corresponding RDF, in RDF/XML or N3:
$ python trackuri.py 1-i\'m_alive.mp3 xml
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
xmlns:_3="http://purl.org/ontology/mo/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>
<rdf:Description
rdf:about=
"http://zitgist.com/music/track/67a1fab6-aea4-47f4-891d-6d42bb856a40">
<_3:availableAs rdf:resource=""/>
</rdf:Description>
</rdf:RDF>
$ python trackuri.py 1-i\'m_alive.mp3 n3
@prefix _3: <http://zitgist.com/music/track/67>.
@prefix _4: <http://purl.org/ontology/mo/>.
_3:a1fab6-aea4-47f4-891d-6d42bb856a40 _4:availableAs <>.
... even though I still have to put the right Item URI, instead of <>.
Get it!
You can download the code here, and it is GPL licensed.
The dependencies are:
- python-id3
- python-musicbrainz2
- RDFLib (easy_install -U rdflib)
- mutagen (easy_install -U mutagen)
