DBTune blog

Tag - jamendo

Thursday 17 July 2008

Literal search using the Jamendo SPARQL end-point

I just wrote a small SWI-Prolog module for literal search using the ClioPatria SPARQL end-point. It uses the rdf_litindex module, and performs a metaphone search on existing literals in the database. All of that is triggered through a built-in RDF predicate.

Here is an example query you can perform on the Jamendo SPARQL end-point (make sure you select lit as the entailment - it will be the default one soon):

SELECT ?o
WHERE
{"punk jazz" <http://purl.org/ontology/swi#soundslike> ?o}

This query binds ?o to all resources within the end-point that are associated with matching literals. For example, you would get back:

The module is available there.
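To give a rough idea of what such a "sounds like" literal search does, here is a minimal Python sketch. It is not the module's code: it uses Soundex as a simpler stand-in for metaphone, and the resource identifiers and literals in the toy index are made up for illustration.

```python
from collections import defaultdict

# Soundex digit for each coded consonant (vowels, h, w and y carry no code).
_CODES = {ch: digit
          for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                 ("l", "4"), ("mn", "5"), ("r", "6"))
          for ch in letters}


def soundex(word):
    """Return the 4-character Soundex key of a word ('' if it has no letters)."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":  # h and w do not separate two equal codes
            continue
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            key += digit
        prev = digit
    return (key + "000")[:4]


def build_index(literals):
    """Map each phonetic key to the resources whose literal contains such a word."""
    index = defaultdict(set)
    for resource, text in literals:
        for token in text.split():
            key = soundex(token)
            if key:
                index[key].add(resource)
    return index


def sounds_like(index, query):
    """Resources whose literal phonetically matches every word of the query."""
    keys = [k for k in (soundex(token) for token in query.split()) if k]
    if not keys:
        return set()
    return set.intersection(*(index.get(k, set()) for k in keys))


# Hypothetical (resource, literal) pairs standing in for the triple store.
LITERALS = [
    ("ex:tag/punkjazz", "Punk Jazz"),
    ("ex:record/1", "Ponk Jaz Sessions"),
    ("ex:record/2", "Smooth Jazz"),
]
index = build_index(LITERALS)
print(sorted(sounds_like(index, "punk jazz")))
```

The point of indexing by phonetic key rather than by exact token is that misspelled or approximate literals still land in the same bucket as the query words.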

Thursday 6 March 2008

Interlinking music datasets on the Web

The paper we wrote with Christopher Sutton and Mark Sandler has been accepted to the Linking Data on the Web workshop at WWW, and it is now available from the workshop website.

This paper explains in a bit more detail (including pseudocode and the evaluation of two implementations of it) the algorithm I already described briefly in previous posts (oh yeah, this one too).

The problem is: how can you automatically derive owl:sameAs links between two different, previously unconnected, web datasets? For example, how can I state that Both in Jamendo is the same as Both in Musicbrainz?

The paper studies three different approaches. The first one is just a simple literal lookup (so in the previous example, it just fails, because there are two artists and a gazillion tracks/records holding Both in their titles in Musicbrainz). The second one is a constrained literal lookup (we specifically look for an artist, a record, a track, etc.). Our previous example also fails, because there are two matching artists in Musicbrainz for Both.
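As a small illustration of why both lookups fail on this example, here is a Python sketch over a made-up miniature of the target dataset. The URIs, types and labels are hypothetical and only mirror the situation described above (two artists called Both, plus other resources with the same label).

```python
# Hypothetical miniature of the target dataset: (URI, type, label).
TARGET = [
    ("mb:artist/both-fr", "artist", "Both"),
    ("mb:artist/both-us", "artist", "Both"),
    ("mb:track/1", "track", "Both"),
    ("mb:record/1", "record", "Both Sides"),
]


def literal_lookup(label, kind=None):
    """Return URIs whose label equals `label`, optionally restricted to one type."""
    wanted = label.lower()
    return [uri for uri, res_type, res_label in TARGET
            if res_label.lower() == wanted and (kind is None or res_type == kind)]


def link(label, kind=None):
    """Return the single matching URI, or None if the lookup is empty or ambiguous."""
    matches = literal_lookup(label, kind)
    return matches[0] if len(matches) == 1 else None


print(literal_lookup("Both"))            # simple lookup: several candidates
print(literal_lookup("Both", "artist"))  # constrained lookup: still two artists
print(link("Both", "artist"))            # so no link can be derived
```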

The algorithm we describe in the paper can intuitively be described as: if two artists made albums with the same title, they are more likely to be the same artist. It browses linked data in order to aggregate further clues, until it is confident enough to disambiguate among several matching resources.

Although not specific to the music domain, we did evaluate it in two music-related contexts:

For the second one, we tried to cover a wide range of possible metadata mistakes, and checked how well our algorithm was coping with such bad metadata. A week ago, I compared the results with the Picard Musicbrainz tagger 0.9.0, and here are the results (you also have to keep in mind that our algorithm is quite a bit slower, as the Musicbrainz API is not really designed for the sort of things we do with it), for the track I Want to Hold Your Hand by the Beatles, in the Meet the Beatles! album:

  • Artist field missing:
    • GNAT: Correct
    • Picard: Matches the same track, but on the And Now: The Beatles compilation
  • Artist set to random string:
    • GNAT: Correct
    • Picard: Matches the same track, but on another release (track 1 of The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!))
  • Artist set to Beetles:
    • GNAT: Correct
    • Picard: Matches the same track, but on another release (track 1 of The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!))
  • Artist set to Al Green (who actually made a cover of that song):
    • GNAT: Mapped to Al Green's cover version on Soul Tribute to the Beatles
    • Picard: Same
  • Album field missing:
    • GNAT: Matches the same track, but on another release (track 1 of The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!))
    • Picard: Matches the same track, but on the single
  • Album set to random string:
    • GNAT: Matches the same track, but on another release (track 1 of The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!))
    • Picard: Matches the same track, but on the single
  • Album set to Meat the Beatles:
    • GNAT: Matches the same track, but on another release (track 1 of The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!))
    • Picard: Matches the same track, but on the compilation The Beatles Beat
  • Track set to random string:
    • GNAT: Correct
    • Picard: No results
  • Track set to I Wanna Hold Your Hand:
    • GNAT: Correct
    • Picard: Matches the same track, but on the compilation The Beatles Beat
  • Perfect metadata:
    • GNAT: Correct
    • Picard: Matches the same track, but on the compilation The Beatles Beat

Most of the compilation results of Picard are actually not wrong, as the track length of our test file is closer to the track length on the compilation than the track length on the Meet the Beatles album.

Of course, this is not an extensive evaluation of how the Picard lookup mechanism compares with GNAT. And GNAT is not able to compete at all with Picard, as it was clearly not designed for the same purpose (GNAT is meant to interlink RDF datasets).

The Python implementation of our algorithm is under the BSD license, and available in the motools sourceforge project. The Prolog implementation (working on RDF datasets) is also available in the motools sourceforge project.

Monday 25 February 2008

Playing with SPARQL and XMPP

Chatting with Dan Brickley at the Semantic Camp last week got me quite curious about mixing SPARQL and XMPP, so I decided to give it a try :-)

I first tried sparqlxmpp by Chris Schmidt, a Python implementation using redland as a back-end. Unfortunately, I ran into some trouble (a weird error attribute being inserted in the XML between the sending and the receiving, which made the whole thing crash).

So I decided to try xOperator, whose version 0.1 was announced last week. It is really easy to use and flexible (notably because you can specify third-party SPARQL end-points as new datasets, and because you can template SPARQL queries, a bit like SparqlBot).

Here is the result. You can add jamendo@jabber.org to your Jabber roster, which provides SPARQL/XMPP on top of the Jamendo SPARQL end-point. Then, here is an example session (moustaki is me, and jamendo is the xoperator instance):

<moustaki> list templates
<jamendo> These are the defined templates:
<jamendo> LIST ME * ARTISTS  select ?n where {?a rdf:type mo:MusicArtist; foaf:name ?n} LIMIT %%1%%
<jamendo> GIVE ME THE LOCATION OF *  select ?l where {?a a mo:MusicArtist;foaf:name '%%1%%'; foaf:based_near ?g. ?g 
                <http://www.geonames.org/ontology#name> ?l}
<jamendo> GIVE ME ALBUMS MADE BY *  select ?l where {?a a mo:MusicArtist; foaf:name '%%1%%'; foaf:made ?r. ?r a mo:Record; dc:title ?l}
<jamendo> WHERE CAN I GET *  select ?d where {?a dc:title '%%1%%'; mo:available_as ?d}
<jamendo> GIVE ME THE LOCATION OF *  select ?l where {?a a mo:MusicArtist;foaf:name '%%1%%'; foaf:based_near ?g. ?g 
                <http://www.geonames.org/ontology#name> ?l}

<moustaki> list me 5 artists
<jamendo> Store jamendo answered: 
<jamendo> "Cicada"
<jamendo> "Hace Soul"
<jamendo> "vincent j"
<jamendo> "NoU"
<jamendo> "Margin of Safety"

<moustaki> give me the location of Cicada
<jamendo> Store jamendo answered: 
<jamendo> "Département des Bouches-du-Rhône"

<moustaki> give me albums made by Cicada
<jamendo> Store jamendo answered: 
<jamendo> "Rebirth"   
<jamendo> "AuthentiK Yogourt"

<moustaki> where can I get AuthentiK Yogourt
<jamendo> Store jamendo answered: 
<jamendo> <http://www.jamendo.com/get/album/id/album/p2p/redirect/8309/?p2pnet=bittorrent&are=mp32>
<jamendo> <http://www.jamendo.com/get/album/id/album/p2p/redirect/8309/?p2pnet=bittorrent&are=ogg3>
<jamendo> <http://www.jamendo.com/get/album/id/album/p2p/redirect/8309/?p2pnet=ed2k&are=mp32>
<jamendo> <http://www.jamendo.com/get/album/id/album/p2p/redirect/8309/?p2pnet=ed2k&are=ogg3>
<jamendo> <http://www.jamendo.com/get/track/id/album/audio/play/8309/?item_o=track_no_asc&aue=ogg2&n=all>
<jamendo> <http://www.jamendo.com/get/track/id/album/audio/xspf/8309/?item_o=track_no_asc&aue=ogg2&n=all>

Now, making it interact with GNAT and GNARQL, two tools able to create a SPARQL end point holding information about your personal music collection, is the next step :)
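To make the template mechanism in the session above a bit more concrete, here is a minimal Python sketch of how a chat message could be turned into a SPARQL query. This is only my reading of the behaviour visible in the session, not xOperator's actual code: the `expand` function and the regular-expression matching are assumptions, while the template strings are copied from the listing above.

```python
import re

# Templates as listed in the session: "*" is a wildcard, "%%1%%" its value.
TEMPLATES = {
    "LIST ME * ARTISTS":
        "select ?n where {?a rdf:type mo:MusicArtist; foaf:name ?n} LIMIT %%1%%",
    "GIVE ME ALBUMS MADE BY *":
        "select ?l where {?a a mo:MusicArtist; foaf:name '%%1%%'; "
        "foaf:made ?r. ?r a mo:Record; dc:title ?l}",
    "WHERE CAN I GET *":
        "select ?d where {?a dc:title '%%1%%'; mo:available_as ?d}",
}


def expand(message):
    """Return the SPARQL query for the first matching template, or None."""
    for pattern, query in TEMPLATES.items():
        # Each "*" becomes a capturing group; matching ignores case.
        regex = "^" + "(.+?)".join(re.escape(p) for p in pattern.split("*")) + "$"
        match = re.match(regex, message.strip(), re.IGNORECASE)
        if match:
            for i, value in enumerate(match.groups(), start=1):
                # Note: a real bot should escape quotes in `value` first.
                query = query.replace("%%{}%%".format(i), value)
            return query
    return None


print(expand("list me 5 artists"))
print(expand("where can I get AuthentiK Yogourt"))
```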

Wednesday 6 February 2008

Playing with Linked Data, Jamendo, Geonames, Slashfacet and Songbird

Today, I made a small screencast about mixing the following ingredients:

All of that was extremely easy to set up (it actually took me more time to figure out how to make a screencast on a Linux box :-) which I finally did using vnc2swf). Basically, just some tweaked configuration files for ClioPatria, and a small CSS hack, and that was it...

The result is there:

Songbird, Linked Data, Mazzle and Jamendo

(Note that only a few Jamendo artists are displayed now... Otherwise, Google Maps would just crash my laptop :-) ).

Thursday 29 November 2007

Jamendo RDF updated

I finally updated the Jamendo dataset. Indeed, the previous version was based on a dump from about 5 months ago.

During these few months, their dataset increased a lot (Jamendo rocks... It is clearly my favorite music source)! The corresponding RDF is now just a bit more than one million triples (the whole RDF dump is available).

While updating the dataset, I also fixed a number of issues:

  • Added mo:available_as links towards playlists in XSPF and M3U formats - this is a really cool feature, and fixed the Bittorrent and ED2K links;
  • Fixed some bugs in the Geonames linking - now, almost every artist is linked to the corresponding Geonames URI;
  • Fixed some Musicbrainz links, but there is still some work to do on that side (I would need to relaunch my record linkage algorithm, but it is a bit slow, and it is a bit late :) ).

Monday 11 June 2007

Linking open data: interlinking the Jamendo and the Musicbrainz datasets

This post deals with further interlinking experiments based on the Jamendo dataset, in particular equivalence mining - that is, stating that a resource in the Jamendo dataset is the same as a resource in the Musicbrainz dataset.

For example, we want to derive automatically that http://dbtune.org/jamendo/artist/5 is the same as http://musicbrainz.org/artist/0781a... (I will use this example throughout this post, as it illustrates many of the problems I had to overcome).

Independent artists and the failure of literal lookup

In my previous post, I detailed a linking example which was basically a literal lookup, to get back from a particular string (such as Paris, France) to a URI identifying this geographical location, through a web service (in this case, the Geonames one). This relies on the hypothesis that one literal can be associated to exactly one URI. For example, if the string is just Paris, the linking process fails: should we link to a URI identifying Paris, Texas or Paris, France?

For mainstream artists, having at most one URI in the Musicbrainz dataset associated to a given string seems like a fair assumption. There is no way I could start a band called Metallica, I think :-)

But for independent artists, this is not true... For example, the French band Both has exactly the same name as a US band. We therefore need a disambiguation process here.

Another problem arises when a band in the Jamendo dataset, like NoU, is not in the Musicbrainz dataset, but there is another band called Nou there. We need to throw away such wrong matchings.

Disambiguation and propagation

Now, let's try to identify whether http://dbtune.org/jamendo/artist/5 is equivalent to http://zitgist.com/music/artist/078... or http://zitgist.com/music/artist/5f9..., and that http://dbtune.org/jamendo/artist/10... is not equivalent to http://zitgist.com/music/artist/7c4....

By GETting these URIs, we can access their RDF descriptions, which are structured according to the Music Ontology. We can use these descriptions in order to express that, if two artists have produced records with similar names, they are more likely to be the same. This also implies that the matched records are likely to be the same. So, at the same time, we disambiguate and we propagate the equivalence relationships.

Algorithm

This leads us to the following equivalence mining algorithm. We define a predicate similar_to(+Resource1,?Resource2,-Confidence), which captures the notion of similarity between two objects. In our Jamendo/Musicbrainz mapping example, we define this predicate as follows (we use a Prolog-like notation: variables start with an upper-case character, and the mode is given in the head: ?=in or out, +=in, -=out):

     similar_to(+Resource1, -Resource2, -Confidence) is true if
               Resource1 is a mo:MusicArtist
               Resource1 has a name Name
               The musicbrainz web service, when queried with Name, returns ID associated with Confidence
               Resource2 is the concatenation of 'http://zitgist.com/music/artist/' and ID

and

     similar_to(+Resource1, +Resource2, -Confidence) is true if
               Resource1 is a mo:Record or a mo:Track
               Resource2 is a mo:Record or a mo:Track
               Resource1 and Resource2 have a similar title, with a confidence Confidence

Moreover, in the other cases, similar_to is always true, but the confidence is then 0.
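For readers who do not speak Prolog, the second clause and the fallback case can be sketched in Python as follows. This is an illustration only: resources are represented as plain dictionaries (an assumption of this sketch), and the title similarity is computed with the standard library's difflib ratio, which is not necessarily the measure used in my implementation. The first clause, which queries the Musicbrainz web service, is left out.

```python
from difflib import SequenceMatcher

COMPARABLE = {"mo:Record", "mo:Track"}


def title_confidence(title1, title2):
    """Similarity of two titles in [0, 1], ignoring case and extra whitespace."""
    a = " ".join(title1.lower().split())
    b = " ".join(title2.lower().split())
    return SequenceMatcher(None, a, b).ratio()


def similar_to(resource1, resource2):
    """Confidence that two resources are similar; 0.0 in all other cases."""
    if resource1["type"] in COMPARABLE and resource2["type"] in COMPARABLE:
        return title_confidence(resource1["title"], resource2["title"])
    return 0.0


# Hypothetical resource descriptions.
record_a = {"type": "mo:Record", "title": "Meet the Beatles"}
record_b = {"type": "mo:Record", "title": "Meat the Beatles"}
artist = {"type": "mo:MusicArtist", "name": "The Beatles"}

print(similar_to(record_a, record_b))  # high, the titles nearly match
print(similar_to(record_a, artist))    # 0.0, not a record/track pair
```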

Now, we define a path (a set of predicates), which will be used to propagate the equivalence. In our example, it is {foaf:made,mo:has_track}: we are starting from a MusicArtist resource, which made some records, and these records have tracks.

The equivalence mining algorithm is defined as follows. We first run the process depicted here:

Equivalence Mining algorithm

Every newly appearing resource is dereferenced, so the algorithm works in a linked data environment. It just uses one start URI as an input.

Then, we define a mapping as a set of tuples {Uri1,Uri2}, associated with a confidence C, which is the sum of the confidences associated to every tuple. The result mapping is the one with the highest confidence (and higher than a threshold in order to drop wrong matchings, such as the one mentioned earlier, for NoU).
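The selection step described above can be sketched in a few lines of Python. The candidate mappings, their confidences and the threshold value below are invented for the example; only the rule itself (sum the confidences, keep the best mapping if it is above the threshold) comes from the description.

```python
def mapping_confidence(mapping):
    """Sum of the confidences of every (uri1, uri2, confidence) tuple."""
    return sum(confidence for _, _, confidence in mapping)


def best_mapping(candidates, threshold):
    """Return the candidate with the highest total confidence above threshold."""
    best, best_score = None, threshold
    for mapping in candidates:
        score = mapping_confidence(mapping)
        if score > best_score:
            best, best_score = mapping, score
    return best  # None when no candidate beats the threshold


# Hypothetical candidates for one start artist: each pairs the artist with a
# different target artist, plus whatever records could be matched along the path.
candidate_1 = [("ex:a/artist", "ex:b/artist-1", 0.9),
               ("ex:a/record-1", "ex:b/record-7", 0.8)]
candidate_2 = [("ex:a/artist", "ex:b/artist-2", 0.9),
               ("ex:a/record-1", "ex:b/record-9", 0.1)]

print(best_mapping([candidate_1, candidate_2], threshold=1.2))
```

With these numbers, the first candidate wins because its records agree; a candidate that only matches on the artist name stays below the threshold and is dropped.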

Implementation

I wrote an implementation of such an algorithm, using SWI-Prolog (everything is GPL'd). In order to make it run, you need the CVS version of SWI, compiled with the http, the semweb and the nlp packages. You can test it by loading ldmapper.pl in SWI, and then, run:

?- mapping('http://dbtune.org/jamendo/artist/5',Mapping).

To adapt it to other datasets, you just have to add some similar_to clauses, and define which path you want to follow. Or, if you are not a Prolog geek, just give me a list of URIs you want to map, along with a path, and the sort of similarity you want to introduce: I'll be happy to do it!

Results

I experimented with this implementation, in order to automatically link together the Jamendo and the Musicbrainz datasets. As the current implementation is not multi-threaded (it runs the algorithm on one artist after another), it is a bit slow (one day to link the entire dataset). It derived 1702 equivalence statements (these statements are available there), distributed over tracks, artists and records, and it spotted with a good confidence that every other artist/track/record in Jamendo is not referenced within Musicbrainz.

Here are some examples:

Saturday 26 May 2007

Linking open data: publishing and linking the Jamendo dataset

Some weeks ago, I released a linked data representation of the Jamendo dataset, a large collection of Creative Commons licensed songs, according to the Music Ontology.

I had some experience with publishing such datasets, through the dump of the Magnatune collection, which I have done through D2R Server, and this D2RQ mapping. The Magnatune dump, through the publishingLocation property, is linked to the dbpedia dataset. Well, it was in fact really easy: the geographical location in the Magnatune database is just a string: France, USA, etc. And the dbpedia URIs I am linking to are just a plain concatenation of such strings and http://dbpedia.org/resource/. All of that (pointing towards custom URI patterns) can be done quite easily through D2R.
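In Python terms, that kind of URI construction amounts to something like the sketch below. The function name is made up, and the underscore replacement and percent-encoding are additions of this sketch (for multi-word names), not something the D2RQ mapping is claimed to do.

```python
from urllib.parse import quote

DBPEDIA = "http://dbpedia.org/resource/"


def dbpedia_uri(location):
    """Build a dbpedia URI by appending the location string to the base URI."""
    # Assumption of this sketch: spaces become underscores, the rest is escaped.
    return DBPEDIA + quote(location.strip().replace(" ", "_"))


print(dbpedia_uri("France"))
print(dbpedia_uri("USA"))
```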

However, it was a bit more difficult for the Jamendo dataset...

  • They release their dump in some custom XML schema, and their database is evolving quite fast, so in order to be up-to-date, you have to query their API, which makes it difficult to use a relational database publishing approach.
  • Geographical information is also represented as a string, but it could be France (75) (for Paris, France), Madrid, Spain, etc., which makes it difficult to find a canonical way of constructing dbpedia or Geonames URIs.

Therefore, I released a small program, P2R, making use of a declarative mapping to export a SWI-Prolog knowledge base on the Semantic Web.

With Prolog as a back-end, you can do a lot more than with a plain relational database. I'll try to give an example of this, by describing how I linked the Jamendo dataset to the Geonames one.

Prolog-to-RDF

P2R handles declarative mappings associating a Prolog term (just a plain predicate, or a logical formula combining some predicates) to a set of RDF triples. The resulting RDF is made available through a SPARQL end-point.

For example, the following example maps the predicate artist_dispname to {<artist uri> foaf:name "name"^^xsd:string.}:

match:
        (artist_dispname(Id,Name))
                eq
        [
                rdf(pattern(['http://dbtune.org/jamendo/resource/artist/',Id]),foaf:name,literal(type('http://www.w3.org/2001/XMLSchema#string',Name)))
        ].

Then, when the SPARQL end-point processes a triple pattern such as:

<http://dbtune.org/jamendo/resource/artist/5> foaf:name ?name.

It will bind the variable Id to 5, and try to prove artist_dispname(5,Name). This predicate will in fact be defined by the following:

artist_dispname(Id,Name) IF 
        query Jamendo API for names associated to Id AND
        Name is one of these names

(or, instead of querying Jamendo API, it can just parse the XML dump).

Therefore, it will query the Jamendo API, bind Name to the name of the artist, and send back a binding between ?name and "both"^^xsd:string. If the subject was ?artist in our query, we would have retrieved every pair of artist URI / name.

You then have a SPARQL end-point able to answer such queries by asking the Jamendo API.
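The way a triple pattern is answered through the mapping can be mimicked in Python. The sketch below is not how P2R is implemented (Prolog unification does all of this for free): the `NAMES` table is a hypothetical stand-in for the Jamendo API or the XML dump, artist 999 is invented, and an unbound variable is represented by None.

```python
PREFIX = "http://dbtune.org/jamendo/resource/artist/"

# Hypothetical stand-in for the Jamendo API: artist id -> display names.
NAMES = {"5": ["Both"], "999": ["Example Artist"]}


def artist_dispname(artist_id=None):
    """Yield (Id, Name) pairs; enumerate every artist when Id is unbound."""
    ids = [artist_id] if artist_id is not None else sorted(NAMES)
    for current in ids:
        for name in NAMES.get(current, []):
            yield current, name


def match_foaf_name(subject=None):
    """Answer `<subject> foaf:name ?name`; subject=None plays the role of ?artist."""
    if subject is None:
        artist_id = None
    elif subject.startswith(PREFIX):
        artist_id = subject[len(PREFIX):]  # bind Id from the URI pattern
    else:
        return []  # the subject is outside the mapped URI pattern
    return [(PREFIX + i, name) for i, name in artist_dispname(artist_id)]


print(match_foaf_name(PREFIX + "5"))  # bound subject: one binding for ?name
print(match_foaf_name())              # unbound subject: every artist/name pair
```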

UriSpace

Then, all you have to do is to redirect every URI in your URI space (here, http://dbtune.org/jamendo/resource/) to DESCRIBE queries on the SPARQL end-point that P2R exposes.

I published another piece of code that does the trick, UriSpace, also through a declarative mapping.
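The redirection idea itself is simple enough to sketch in Python. This is not UriSpace's code: the end-point address below is a placeholder, and the function name is made up. The only things taken for granted are the URI space given above and the standard `query` parameter of the SPARQL protocol.

```python
from urllib.parse import urlencode

URI_SPACE = "http://dbtune.org/jamendo/resource/"
ENDPOINT = "http://example.org/sparql"  # hypothetical end-point address


def describe_redirect(requested_uri):
    """Return the URL to redirect to, or None if the URI is outside the space."""
    if not requested_uri.startswith(URI_SPACE):
        return None
    query = "DESCRIBE <{}>".format(requested_uri)
    return ENDPOINT + "?" + urlencode({"query": query})


print(describe_redirect("http://dbtune.org/jamendo/resource/artist/5"))
```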

Linking the Jamendo data set to the Geonames one

As we saw earlier, it is not possible to directly construct a URI from a string denoting a geographical location in the Jamendo dataset. But well, we are not limited in what we can do inside our mappings! Here is the part of the P2R mapping that exposes the foaf:based_near property:

match:
        (artist_geo(Id,GeoString),geonames(GeoString,URI))
                eq
        [
                rdf(pattern(['http://dbtune.org/jamendo/resource/artist/',Id]),foaf:based_near,URI)
        ].

Where, in fact, the geonames(GeoString,URI) predicate is defined as:

geonames(GeoString,URI) IF
        clean GeoString (remove "(" and ")", basically) AND
        query Geonames web service to retrieve the first matching URI with GeoString
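The same two steps can be sketched in Python. The web service call is passed in as a function so that the sketch stays runnable offline; `fake_search` and the URI it returns are placeholders, not real Geonames results.

```python
def clean_geostring(geo_string):
    """Drop parentheses and collapse whitespace, e.g. 'France (75)' -> 'France 75'."""
    without_parens = geo_string.replace("(", " ").replace(")", " ")
    return " ".join(without_parens.split())


def geonames(geo_string, search):
    """Return the first URI that `search` yields for the cleaned string, or None."""
    results = search(clean_geostring(geo_string))
    return results[0] if results else None


def fake_search(text):
    """Hypothetical stand-in for the Geonames web service."""
    known = {"France 75": ["http://example.org/geo/paris",
                           "http://example.org/geo/other"]}
    return known.get(text, [])


print(clean_geostring("France (75)"))
print(geonames("France (75)", fake_search))
```

Taking the first match is of course a heuristic: it works well when the cleaned string is specific enough, and can pick the wrong place otherwise.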

And it is done! Now, you can see the link to the Geonames dataset, when getting a Jamendo artist URI:

$ curl -L -H "Accept: application/rdf+xml" http://dbtune.org/jamendo/resource/artist/5
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE rdf:RDF [
    <!ENTITY foaf 'http://xmlns.com/foaf/0.1/'>
    <!ENTITY mo 'http://purl.org/ontology/mo/'>
    <!ENTITY rdf 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
    <!ENTITY xsd 'http://www.w3.org/2001/XMLSchema#'>
]>
<rdf:RDF
    xmlns:foaf="&foaf;"
    xmlns:mo="&mo;"
    xmlns:rdf="&rdf;"
    xmlns:xsd="&xsd;"
>
<mo:MusicArtist rdf:about="http://dbtune.org/jamendo/resource/artist/5">
  <foaf:made rdf:resource="http://dbtune.org/jamendo/resource/record/174"/>
  <foaf:made rdf:resource="http://dbtune.org/jamendo/resource/record/33"/>
  <foaf:based_near rdf:resource="http://sws.geonames.org/2991627/"/>
  <foaf:homepage rdf:resource="http://www.both-world.com"/>
  <foaf:img rdf:resource="http://img.jamendo.com/artists/b/both.jpg"/>
  <foaf:name rdf:datatype="&xsd;string">Both</foaf:name>
</mo:MusicArtist>

<rdf:Description rdf:about="http://dbtune.org/jamendo/resource/record/174">
  <foaf:maker rdf:resource="http://dbtune.org/jamendo/resource/artist/5"/>
</rdf:Description>

<rdf:Description rdf:about="http://dbtune.org/jamendo/resource/record/33">
  <foaf:maker rdf:resource="http://dbtune.org/jamendo/resource/artist/5"/>
</rdf:Description>

</rdf:RDF>

And you can plot some Jamendo artists on a map, using the Tabulator generic data browser.

Some Jamendo artists on a map, using the Tabulator