DBTune blog


Wednesday 25 June 2008

Mashed!

I was at Mashed (the former Hack Day) this week-end - a really good and geeky event, organised by the BBC at Alexandra Palace. We arrived on the Saturday morning for some talks, detailing the different things we'd be able to play with over the week-end. Amongst these were a full DVB-T multiplex (apparently, it was the first time since 1956 that a TV signal was broadcast from Alexandra Palace), lots of data from the BBC Programmes team and a box full of radio content recorded over the last year.

After these presentations, the 24-hour hacking session began. We sat down with Kurt and Ben and wrote a small hack which starts from a personal music collection and creates a playlist of recorded BBC programmes for you. I will write a bit more about this later today.

During the 24-hour hack, we had a Rock Band session on a big screen, a real-world Tron game (basically, two guys running with GPS phones, guided by two people watching their trail on a Google satellite map :-) ), a rocket launch...

Finally, at 2pm on the Sunday, people presented their hacks. Almost 50 hacks were presented, all extremely interesting. Take a look at the complete list of hacks! On the music side, Patrick's recommender was particularly interesting. It used Latent Semantic Analysis on playcount data for artists in BBC brands and episodes to recommend brands from artists or artists from artists. It gave some surprising results :-) Jamie Munroe resurrected the FPFF Musicbrainz fingerprinting algorithm (which was apparently due to replace the old TRM one before MusicIP offered their services to Musicbrainz) to identify tracks played several times in BBC programmes. The WeDoID3 team talked about creating RSS feeds from embedded metadata in audio and video, but the demo didn't work.

My personal highlight was the hack (which actually won a prize) from Team Bob. Here is a screencast of it:


BBC Dylan - News 24 Revisited (Clip) from James Adam on Vimeo.

Thanks to Matthew Cashmore and the rest of the BBC backstage team for this great event! (and thanks to the sponsors for all the free stuff - I think I have enough T-shirts for about a year now :-))

Thursday 12 June 2008

Describing the content of RDF datasets

There seems to be an overall consensus in the Linking Open Data community that we need a way to describe in RDF the different datasets published and interlinked within the project. There is already a Wiki page detailing some aspects of the corresponding vocabulary, called voiD (vocabulary of interlinked datasets).

One thing I would really like this vocabulary to do would be to describe exactly the inner content of a dataset - what could we find in this SPARQL end-point or in this RDF document? I thought quite a lot about this recently, as I am beginning to really need it. Indeed, when you have RDF documents describing lots of audio annotations, whose generation is really computationally intensive, you want to pick just the one that fits your request. There have been quite a lot of similar efforts in the past. However, most of them rely on one sort of reification or another, which makes them quite hard to actually use.

After some failed tries, I came up with the following, which I hope is easy and expressive enough :-)

It relies on a single property void:example, which links a resource identifying a particular dataset to a small RDF document holding an example of what you could find in that dataset. Then, with just a bit of SPARQL magic, you can easily query for datasets having a particular capability. Easy, isn't it? :-)

Here is a real-world example of that. A first RDF document describes one of the DBtune datasets:

:ds1
        a void:Dataset;
        rdfs:label "Jamendo end-point on DBtune";
        dc:source <http://jamendo.com/>;
        foaf:maker <http://moustaki.org/foaf.rdf#moustaki>;
        void:sparql_end_point <http://dbtune.org:2105/sparql/>;
        void:example <http://moustaki.org/void/jamendo_example.n3>;
        .

The void:example property points towards a small RDF file, giving an example of what you can find within this dataset.

Then, the following SPARQL query asks whether this dataset has a SPARQL end-point and holds information about music records, associated tags, and places to download them.

PREFIX void: <http://purl.org/ontology/void#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX tags: <http://www.holygoat.co.uk/owl/redwood/0.1/tags/>

ASK
FROM NAMED <http://moustaki.org/void/void.n3>
FROM NAMED <http://moustaki.org/void/jamendo_example.n3>
{
        GRAPH <http://moustaki.org/void/void.n3> {
                ?ds a void:Dataset;
                        void:sparql_end_point ?sparql;
                        void:example ?ex.
        }
        GRAPH ?ex {
                ?r a mo:Record;
                        mo:available_as ?l;
                        tags:taggedWithTag ?t.
        }
}

I tried this query with ARQ, and it works perfectly :-)

$ sparql --query=void.sparql
Ask => Yes
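
To make the idea concrete outside a SPARQL engine, here is a minimal Python sketch of what the inner GRAPH ?ex { ... } pattern checks: whether the example document holds at least one resource satisfying every triple pattern at once. The triples below are made-up stand-ins for the content of an example file, and matches is a hypothetical helper written for this illustration, not part of ARQ or any RDF library.

```python
# Toy basic-graph-pattern matcher over a list of (s, p, o) tuples.
# Terms starting with "?" are variables. All data here is hypothetical.

def matches(pattern, triples, bindings=None):
    """Return True if every triple pattern can be satisfied consistently."""
    bindings = bindings or {}
    if not pattern:
        return True
    first, rest = pattern[0], pattern[1:]
    for triple in triples:
        new = dict(bindings)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok and matches(rest, triples, new):
            return True
    return False

# Stand-in for the small example document a dataset points to.
example = [
    ("ex:record1", "rdf:type", "mo:Record"),
    ("ex:record1", "mo:available_as", "ex:record1.torrent"),
    ("ex:record1", "tags:taggedWithTag", "ex:tag/punk"),
]

# The capability we are looking for, as in the ASK query.
wanted = [
    ("?r", "rdf:type", "mo:Record"),
    ("?r", "mo:available_as", "?l"),
    ("?r", "tags:taggedWithTag", "?t"),
]

has_capability = matches(wanted, example)
```

A dataset whose example lacks any one of the three statements fails the check, which is exactly what makes the example document usable as a capability description.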

Update: It also works with ARC2, although ARC2 does not automatically load the documents named in the SPARQL FROM clauses. You can try the same query on this SPARQL end-point, which has previously loaded the two documents (the voiD description and the example).

Update 2: A nice blog post about automatically generating the data you need for describing an end-point - thanks shellac for the pointer!

Update 3: Following discussion on the #swig IRC channel.

Tuesday 3 June 2008

Sorted Sound at the Dana Centre

If you're around London on Thursday, a couple of people from the Centre for Digital Music at Queen Mary, University of London (including myself) will talk about the research we do in music technologies, at the Dana Centre in South Kensington.

The event description is mainly focused on search. Kurt will indeed demo Soundbite, and Ben and Michela from Goldsmiths College will demo fast content-based search on large music databases. However, Chris, Katy and Matthew will demo the Sonic Visualiser, a great piece of open-source software to analyse and visualise audio data. I will talk about Semantic Web technologies, in particular the Music Ontology and Linked Data. I will also be demoing some things related to organising music collections using Semantic Web data, and user interfaces to interact with them in unusual ways. As Tom puts it, the Semantic Web is not all about search :-)

Tuesday 20 May 2008

Ceriese: RDF translator for Eurostat data


Some time ago, I did a bit of work on the RIESE project, which aims at publishing the Eurostat dataset on the Semantic Web and interlinking it with further datasets (e.g. Geonames and DBpedia). This may look a bit far from my interests in music data, but there is a connection which illustrates the power of linked data, as explained at the end of this post.

Original data

There are three distinct things we consider in the Eurostat dataset:

  • A table of contents in HTML defining the hierarchical structure of the Eurostat datasets;
  • Tab-separated values dictionary files defining the ~80 000 data codes used in the dataset (eg. "eu27" for the European Union of 27 countries);
  • The actual statistical data, in tab-separated values files. Around 4000 datasets for roughly 350 million statistical items.

Ontology

The first thing we need to figure out when exposing data on the Semantic Web is the model we'll link to. This led to the design of SCOVO (Statistical Core Vocabulary). The concepts in this ontology can be depicted as follows:

SCOVO ontology

The interesting thing about this model is that the statistical item is considered as a primary entity. We used the Event ontology as a basis - a statistical item is a particular classification of a space/time region. This allows us to be really flexible and extensible. We can for example attach multiple dimensions to a particular item, resources pertaining to its creation, etc.

RDF back-end

I wanted to see how to publish such large amounts of RDF data and how my publication tools perform, so I designed the Ceriese software to handle that.

The first real concern when dealing with such large amounts of data is, of course, scalability. The overall Eurostat dataset is over 3 billion triples. Given that we don't have high-end machines with lots of memory, using the core SWI RDF store was out of the question (I wonder if any in-memory triple store can handle 1 billion triples, btw).

So there are three choices at this point:

  • Use a database-backed triple store;
  • Dump static RDF files in a file-system served through Apache;
  • Generate the RDF on-the-fly.

We don't have the sort of money it takes (for both the hardware and the software) for the first choice to really scale, so we tried the second and the third solution. I took my old Prolog-2-RDF software that I am using to publish the Jamendo dataset and we wrote some P2R mapping files converting the tab-separated value files. Then, we made P2R dump small RDF files in a file-system hierarchy, corresponding to the description of the different web resources we wanted to publish. Then, some Apache tweaks and Michael and Wolfgang's work on XHTML/RDFa publishing were enough to make the dataset available in the web of data.

But this approach had two main problems. First, it took ages to run this RDF dump, so we never actually managed to complete it. Also, it was impossible to provide a SPARQL querying facility, so no aggregation of data was available.

So we eventually settled on the third solution. I put on my Prolog hacker hat, and tried to optimise P2R to make it fast enough. I did it by using the same trick I used in my small N3 reasoner, Henry. P2R mappings are compiled as native Prolog clauses (rdf(S,P,O) :- ... ), which cuts down the search space a lot. TSV files are accessed within those rules and parsed on-the-fly. The location of the TSV file to access is derived from a small in-memory index. Parsed TSV files are cached for a whole query, to avoid parsing the same file for different triple patterns in the query.
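
The caching part of this design is easy to picture in a few lines. The following Python sketch is only an illustration of the idea (an index giving the TSV location, on-the-fly parsing, and a cache living for one query); it is not how P2R is written, which is Prolog. INDEX, QueryContext, triples and the column names used as predicates are all hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical in-memory index: dataset code -> TSV file location.
INDEX = {}

class QueryContext:
    """Caches parsed TSV files for the lifetime of one query."""

    def __init__(self):
        self.cache = {}
        self.parse_count = 0  # only here to show the cache at work

    def rows(self, dataset):
        if dataset not in self.cache:
            with open(INDEX[dataset], newline="") as handle:
                self.cache[dataset] = list(csv.reader(handle, delimiter="\t"))
            self.parse_count += 1
        return self.cache[dataset]

def triples(ctx, dataset):
    """Generate (subject, predicate, object) tuples on the fly from one file."""
    header, *body = ctx.rows(dataset)
    for line_no, row in enumerate(body):
        item = f"{dataset}/item{line_no}"
        for column, value in zip(header, row):
            yield (item, column, value)

# Tiny made-up dataset so the sketch runs end to end.
path = os.path.join(tempfile.mkdtemp(), "demo.tsv")
with open(path, "w", newline="") as handle:
    handle.write("geo\ttime\tvalue\neu27\t2006\t1.5\n")
INDEX["demo"] = path

ctx = QueryContext()
first_pass = list(triples(ctx, "demo"))
second_pass = list(triples(ctx, "demo"))  # served from the cache
```

Two triple patterns hitting the same file within one query therefore cost a single parse, and dropping the context at the end of the query frees the memory.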

The same mechanisms are applied to derive a SKOS hierarchy from the HTML table of contents.

Now, a DESCRIBE query takes less than 0.5 seconds for any item, on my laptop. Not perfect, but, still...

A solution to improve the access time a lot would be to dump the TSV files in a relational database, and access this database in our P2R mappings instead of the raw TSV files.

Trying it out and creating your own SPARQL end-point from Eurostat data is really easy.

  • Get SWI-Prolog;
  • Get the software from there;
  • Get the raw Eurostat data (or get it from the Eurostat web-site, as this one can be slightly out-dated);
  • Put it in data/, in your Ceriese directory;
  • Launch start.pl;
  • Go to http://localhost:3020/

Now what?

Michael and Wolfgang did amazing work putting together a really nice web interface, publishing the data in XHTML+RDFa. They also included some interlinking, especially for geographical locations, which are now linked to Geonames.

So what's the point from my music geek point of view?? Well, now, after aggregating Semantic Web data about my music collection (using these tools), I can sort hip-hop artists by murder rates in their city :-) This is quite fun as it is (especially as the Eurostat dataset holds a really diverse range of statistics), but it would be really interesting to mine that to get some associations between statistical data and musical facts. This would surely lead to interesting sociological results (eg. how does musical "genre" associate with particular socio-economic indicators?)

Wednesday 14 May 2008

Data-rich music collection management

I just put a live demo of something I showed earlier on this blog.

You can explore the Creative Commons-licensed part of my music collection (mainly coming from Jamendo) using aggregated Semantic Web data.

For example, here is what you get after clicking on "map" on the right-hand side and "MusicArtist" on the left-hand side:

Data-rich music collection management

The aggregation is done using the GNAT and GNARQL tools available in the motools sourceforge project. The data comes from datasets within the Linking Open Data project. The UI is done by the amazing ClioPatria software, with a really low amount of configuration.

An interesting thing is to load this demo into Songbird, as it can aggregate and play the audio as you crawl around.

Check the demo!

Update: It looks like it doesn't work with IE, but it is fine with Opera and FF2 or FF3. If the map doesn't load at first, just try again and it should be ok.

Monday 12 May 2008

Linked Data on the Web 2008

I just got back from Beijing (I did a two-week trip around China after the actual conference), where I attended the Linked Data on the Web workshop and the WWW conference.

The workshop was really good, gathering lots of people from the Linking Open Data community (it was the first time I met most of these people, after more than one year working with them :-) ).

The attendance was much higher than expected, with around 100 people registered for the workshop.


It started well with this sentence by Tim Berners-Lee in the workshop introduction:

Linked Data is the Semantic Web done right, and the Web done right.

That's a pretty good way to start a day :-) Then, Chris Bizer did a good overview of what the community has achieved in one year, illustrated by the different versions of Richard's diagram:


All the talks and papers were of extremely high quality. I was particularly interested in some of them, including Tim's presentation on the new SPARQL/Update capabilities of the Tabulator data browser. This allows easy interaction with data wikis, where everyone can add or correct information.


I really liked Alexandre Passant's presentation on the Flickr exporter, which highlights a mechanism that I used for the Last.fm linked data exporter: linking several identities on several web-sites is just an owl:sameAs link away. Alexandre also did another presentation on MOAT (Meaning of a Tag), a really interesting project which makes it possible to relate tags to Semantic Web URIs. For example, it makes it easy to draw a link between my tag "paris texas" and the movie Paris, Texas in DBpedia.

I got a bit confused by Paul Miller's presentation about licensing open data. I have been aware of these efforts mainly through the work of the Open Knowledge Foundation and the Open Data Commons project, and I think these are truly crucial issues: we need open data, and explicit licensing. But perhaps the audience was not so well chosen: most (if not all) of us in the Linking Open Data community do not own the data we publish as RDF and interlink. DBpedia exports data extracted from Wikipedia, DBTune exports data from different music-related sources such as Jamendo or Last.fm, etc. The only data that we can possibly explicitly license are links (the only thing we actually own), and they do not have any value without data :-) So I guess the outreach should mainly be done to raw data publishers rather than Semantic Web translators? But hopefully, in the near future, the two communities will be the same!


One of my personal highlights was also Christian Becker's presentation about DBpedia mobile: a location-enabled linked data browser for mobile devices, giving you nearby sights and detailed descriptions, restaurants, hotels, etc. We chatted a bit after the workshop with Alexandre and Christian about adding Last.fm events to the DBtune exporter to also display nearby gigs (with optional filtering based on your foaf:interests, of course :-) ).

Jun Zhao's presentation about linked data and provenance for biological resources was extremely interesting: they are dealing with problems strongly similar to ours in a Music Information Retrieval context. How can we trust a particular statement (for example, a structural segmentation of a particular track) found on the web? We need to know whether it was written by a human, or derived through a set of algorithms, and in this case, we might want to choose timbre-based instead of chroma-based workflows in the case of Rock music, for example. This is the sort of thing we implemented within our Henry software (more to come on that later, including online demo as soon as I put it on better hardware, and (hopefully) a PhD :-D ).

Wolfgang Halb did a presentation about our Riese project, but more on that later as I wrote the back-end software powering it and I'd like to give it a full blog entry soon.

I did a presentation about automatic interlinking algorithms on the data web, with a focus on music-related datasets. I detailed an algorithm we developed for this purpose, propagating similarity measures around web data as long as we can't make an interlinking decision (creating a bunch of owl:sameAs links). This algorithm is good in the sense that it gives a really low rate of false positives. On the test-set detailed in the paper, it made no wrong decisions. I blogged about this algorithm earlier.


Some people expressed concerns about the proliferation of owl:sameAs links (highlighted in this presentation by Paolo Bouquet). But I truly think it is a necessary thing, as long as web identifiers are tied to their actual representation. I need to be able to have a web identifier for a song in Jamendo and a web identifier for the same song in Musicbrainz, and I need a way to link these together: owl:sameAs is perfect for that. I wouldn't trust a centralised "identity" system (what actually is identity anyway? :-) ), as it would break the nice decentralised information paradigm we're implementing within the Linking Open Data project.

Anyway, lots of great people, a great time, lots of interesting discussions and new ideas... I am really looking forward to WWW 2009 in Madrid and the next workshop!!!

Monday 7 April 2008

D2RQ mapping for Musicbrainz

I just started a D2R mapping for Musicbrainz, which makes it fairly easy to create a SPARQL end-point and to provide linked data access out of Musicbrainz. A D2R instance loaded with the mapping as it is now is also available (be gentle, it is running on a cheap computer :-) ).

Added to the things that are available within the Zitgist mapping:

  • SPARQL end point ;
  • Support for tags ;
  • Supports a couple of advanced relationships (still working my way through it, though) ;
  • Instrument taxonomy directly generated from the db, and related to performance events;
  • Support for orchestras ;
  • Linked with DBpedia for places and Lingvoj for languages

There is still a lot to do, though: it is really a start. The mapping is available on the motools sourceforge project. I hope to post a follow-up soon! (including examples of funny SPARQL queries :-) ).

Update: For some obscure port-forwarding reasons, the SNORQL interface to the SPARQL end point does not work on the test server.

Update 2: This is fixed. (thanks to the anonymous SPARQL crash tester who helped me find the bug, by the way :-) )

Wednesday 2 April 2008

13.1 billion triples

After a rough estimation, it looks like the services hosted on DBTune provide access to 13.1 billion triples, therefore making a significant addition to the data web!

Here is the break-down of such an estimation:

  • MySpace: 250 million people * 50 triples (on average) = 12.5 billion triples ;
  • AudioScrobbler: 1.5 million users (only Europe?) * 400 triples = 600 million ;
  • Jamendo: 1.1 million triples + 5000 links to other data sources ;
  • Magnatune: 322 000 triples + 233 links ;
  • BBC John Peel sessions: 277 000 triples + 2100 links ;
  • Chord URI service: I don't count it, as it is potentially infinite (the RDF descriptions are generated from the chord symbol in the URI).
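
As a quick sanity check of the arithmetic, the figures in the list above can be added up in Python. The numbers are exactly the ones quoted in the list; the outgoing links and the Chord URI service are left out of the sum, which is a choice made for this sketch.

```python
# Figures taken from the list above; links are not counted as triples here.
estimates = {
    "MySpace": 250_000_000 * 50,
    "AudioScrobbler": 1_500_000 * 400,
    "Jamendo": 1_100_000,
    "Magnatune": 322_000,
    "BBC John Peel sessions": 277_000,
}

total = sum(estimates.values())
print(f"{total:,} triples, about {total / 1e9:.1f} billion")
```

The two on-the-fly services account for almost all of the total: the three stored datasets together contribute under two million triples.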

However, SPARQL end-points are not available for AudioScrobbler and MySpace, as the RDF is generated on-the-fly, from the XML feeds for the former, and from scraping for the latter.

Now, I wish linked data could be provided directly by the data sources themselves :-) (Again, all the code used to run the DBTune services is available in the motools project on Sourceforge).

Thursday 27 March 2008

The Quest for Canonical Spelling in music metadata

Last.fm recently unveiled their new fingerprinting lookup mechanism. They did aggregate quite a lot of fingerprints (650 million) using their fingerprinting software, which is a nice basis for such a lookup, perhaps bringing a viable alternative to Music DNS. I gave it a try (I just had to build a Linux 64 version of the lookup software), and was quite surprised by the results. The quality of the fingerprinting looks indeed good, but here are the results for a particular song:

<?xml version="1.0"?>
<!DOCTYPE metadata SYSTEM "http://fingerprints.last.fm/xml/metadata.dtd">
<metadata fid="281948" lastmodified="1205776219">
<track confidence="0.622890">
    <artist>Leftover Crack</artist>
    <title>Operation: M.O.V.E.</title>
    <url>http://www.last.fm/music/Leftover+Crack/_/Operation%3A+M.O.V.E.</url>
</track>
<track confidence="0.327927">
    <artist>Left&ouml;ver Crack</artist>
    <title>Operation: M.O.V.E.</title>
    <url>http://www.last.fm/music/Left%C3%B6ver+Crack/_/Operation%3A+M.O.V.E.</url>
</track>
<track confidence="0.007860">
    <artist>Leftover Crack</artist>
    <title>Operation MOVE</title>
    <url>http://www.last.fm/music/Leftover+Crack/_/Operation+MOVE</url>
</track>
<track confidence="0.006180">
    <artist>Leftover Crack</artist>
    <title>Operation M.O.V.E.</title>
    <url>http://www.last.fm/music/Leftover+Crack/_/Operation+M.O.V.E.</url>
</track>
<track confidence="0.004883">
    <artist>Leftover Crack</artist>
    <title>Operation; M.O.V.E.</title>
    <url>http://www.last.fm/music/Leftover+Crack/_/Operation%3B+M.O.V.E.</url>
</track>
<track confidence="0.004826">
    <artist>Left&ouml;ver Crack</artist>
    <title>Operation M.O.V.E.</title>
    <url>http://www.last.fm/music/Left%C3%B6ver+Crack/_/Operation+M.O.V.E.</url>
</track>
<track confidence="0.004717">
    <artist>Left&ouml;ver Crack</artist>
    <title>13 - operation m.o.v.e</title>
    <url>http://www.last.fm/music/Left%C3%B6ver+Crack/_/13+-+operation+m.o.v.e</url>
</track>
....
</metadata>

And it goes on and on... There are 21 results for this single track, which all actually correspond to this track.

So, what is disturbing me here? After all, the first result holds textual metadata that I could consider as somehow correct (even if that's not the way I would spell this band's name, but they plan to put a voting system in place to solve this sort of issue).

The real problem is that there are 21 URIs in last.fm for the same thing. The emphasis of the last.fm metadata system is then probably on the textual metadata: two different ways of spelling the name of a band = two bands. But I do think it is wrong: for example, how would you handle the fact that the Russian band Ария is spelled Aria in English? The two spellings are correct, and they correspond to one unique band.

In my opinion, the important thing is the identifier. As long as you have one identifier for one single thing (an artist, an album, a track), you're saved. The relationship between a band, an artist, a track, etc. and its label is clearly a one-to-many one: the quest for a canonical spelling will never end... And what worries me even more is that it tends to kill the spellings in all languages but English (especially if a voting system is in place?).

Once you have a single identifier for a single thing within your system, you can start attaching labels to it, perhaps with a language tag. Then, it is up to the presentation layer to show you the label matching your preferences. And if you lean towards such a model, Musicbrainz (centralised and moderated) or RDF and the Music Ontology (decentralised and not moderated) are probably the way to go.
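
A minimal Python sketch of that model: one identifier, several language-tagged labels, and a presentation layer choosing among them. The identifier artist:aria, the labels table and display_label are made up for the illustration and are not real Musicbrainz or last.fm identifiers.

```python
# One identifier per artist, any number of language-tagged labels.
# The identifier and data below are illustrative only.
labels = {
    "artist:aria": {"ru": "Ария", "en": "Aria"},
}

def display_label(identifier, preferred_languages, fallback="en"):
    """Pick the first label matching the user's language preferences."""
    available = labels[identifier]
    for language in preferred_languages:
        if language in available:
            return available[language]
    # Fall back to a default language, then to any label at all.
    return available.get(fallback) or next(iter(available.values()))

russian_view = display_label("artist:aria", ["ru", "en"])
english_view = display_label("artist:aria", ["en"])
```

Both views refer to the same band because they share the identifier; no spelling has to win.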

I guess this emphasis on textual metadata is mainly due to the legacy of ID3 and other embedded metadata formats, which allowed just one single title for the track, the album and the artist to be associated with an audio file?

I think that the real problem for last.fm will now be to match all the different identifiers they have for a single thing in their system, which is known as the record linkage problem in the database/Semantic Web community. But I also think this is not too far-fetched, as they already began to link their database to the Musicbrainz one?

Tuesday 18 March 2008

Describing a recording session in RDF

Danny Ayers just posted a new Talis Platform application idea, dealing with music/audio equipment. As I was thinking that it would actually be nice to have a set of web identifiers and corresponding RDF representations for audio equipment, I remembered a small Music Ontology example I wrote about a year ago. In fact, the Music Ontology (along with the Event ontology) is expressive enough to handle the description of recording sessions. Here is a small excerpt of such a description:

@prefix mo: <http://purl.org/ontology/mo/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix event: <http://purl.org/NET/c4dm/event.owl#>.
@prefix rd: <http://example.org/audioequipment/>.
@prefix : <#>.

:rec a mo:Recording;
   rdfs:label "live recording of my band in studio";
   event:sub_event :guitar1, :guitar2, :drums1, :kick1, :sing.

:sing a mo:Recording;
   rdfs:label "Voice recorded with a SM57";
   event:factor rd:sm57;
   event:place [rdfs:label "Middle of the room-I could be more precise here"].

:kick1 a mo:Recording;
   rdfs:label "Kick drum using a Shure PG52";
   event:factor rd:pg52;
   event:place [rdfs:label "Kick drum microphone location"].

Well, it would indeed be nice if the rd namespace could point to something real! Who would fancy RDFising Harmony Central? :-)

Wednesday 12 March 2008

MySpace RDF service

Thanks to the amazing work of Kurt and Ben on the MyPySpace project, members of the MySpace social network can have a Semantic Web URI!

This small service provides such URIs and corresponding FOAF (top friends, depiction, name) and Music Ontology (URIs of available tracks in the streaming audio cache) RDF representations.

That means I can add such statements to my FOAF profile:

<http://moustaki.org/foaf.rdf#moustaki> foaf:knows <http://dbtune.org/myspace/lesversaillaisesamoustache>.

And then, using the Tabulator Firefox extension:

MySpace friends

PS: The service is still a bit slow and can be highly unstable, though - it is slightly faster with URIs using MySpace UIDs.

PS2: We don't host any data - everything is scraped on the fly using the MyPySpace tools.

Monday 10 March 2008

Finding a new flat using Exhibit

You can't play around with music data and RDF all day. Sometimes, you also need to find accommodation in the real world :-) But, as it was kind of difficult to motivate myself to the task, I thought it'd be easier if there was something RDFy about it.

So I created a new project on github (thanks for the invite, Tom, github is great!!), able to scrape RDF data out of Gum Tree. This python hack is able to scrape single advertisements for flats on Gum Tree, geo-code the corresponding location to find the lat/long coordinates, and output the corresponding data as RDF. It is also able to process a Gum Tree RSS feed to scrape multiple ads, and to produce an Exhibit 2.0 JSON file, so it becomes really easy to create an Exhibit out of Gum Tree.
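
The last step, going from scraped ads to a file Exhibit can load, looks roughly like the Python sketch below. It assumes the usual Exhibit JSON layout of a top-level "items" list whose entries carry a "label" and a "type"; the other property names (price, latlng), the to_exhibit function and the sample ad are made up here and are not taken from the actual scraper.

```python
import json

def to_exhibit(ads):
    """Wrap scraped ads in an {"items": [...]} structure for Exhibit."""
    items = []
    for ad in ads:
        items.append({
            "label": ad["title"],
            "type": "Flat",
            "price": ad["price"],
            # A single "lat,lng" string is convenient for a map view.
            "latlng": f'{ad["lat"]},{ad["lng"]}',
        })
    return json.dumps({"items": items}, indent=2)

# Made-up ad standing in for the scraper's output.
ads = [{"title": "One bedroom flat, N1", "price": 250, "lat": 51.54, "lng": -0.1}]
exhibit_json = to_exhibit(ads)
```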

For example, this Exhibit shows the last 40 posted Gum Tree ads for one bedroom flats, north of the river.

Gum Tree Exhibit

Thursday 6 March 2008

Interlinking music datasets on the Web

The paper we wrote with Christopher Sutton and Mark Sandler has been accepted to the Linked Data on the Web workshop at WWW, and it is now available from the workshop website.

This paper explains in a bit more detail (including pseudocode and the evaluation of two implementations of it) the algorithm I already described briefly in previous posts (oh yeah, this one too).

The problem is: how can you automatically derive owl:sameAs links between two different, previously unconnected, web datasets? For example, how can I state that Both in Jamendo is the same as Both in Musicbrainz?

The paper studies three different approaches. The first one is just a simple literal lookup (so in the previous example, it just fails, because there are two artists and a gazillion tracks/records holding Both in their titles, in Musicbrainz). The second one is a constrained literal lookup (we specifically look for an artist, a record, a track, etc.). Our previous example also fails, because there are two matching artists in Musicbrainz for Both.

The algorithm we describe in the paper can intuitively be described as: if two artists made albums with the same title, they are more likely to be the same. It will browse linked data in order to aggregate further clues and be confident enough to disambiguate among several matching resources.
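
To give a feel for that intuition, here is a deliberately simplified Python sketch: it scores candidate artists by the overlap (Jaccard similarity) between album titles and refuses to decide when the evidence is weak or tied. This is a one-step illustration only, not the algorithm from the paper, which propagates similarity across the graph; best_match, the threshold and all the data are hypothetical.

```python
def best_match(source_albums, candidates, threshold=0.5):
    """Pick the candidate artist whose album titles overlap most.

    Returns None when no candidate is convincing enough, mirroring the idea
    of postponing a decision rather than risking a false positive.
    """
    def normalise(title):
        return title.strip().lower()

    wanted = {normalise(t) for t in source_albums}
    scored = []
    for candidate_id, albums in candidates.items():
        found = {normalise(t) for t in albums}
        union = wanted | found
        score = len(wanted & found) / len(union) if union else 0.0
        scored.append((score, candidate_id))
    scored.sort(reverse=True)
    if not scored or scored[0][0] < threshold:
        return None
    # Refuse to decide on a tie between the two best candidates.
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None
    return scored[0][1]

# Two made-up candidates sharing the same artist name.
candidates = {
    "mb:both-1": ["Album A", "Album B"],
    "mb:both-2": ["Something Else"],
}
decision = best_match(["album a"], candidates)
```

When the albums do not settle it, a fuller version would move on to the next layer of linked data (tracks, durations and so on) instead of returning None for good.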

Although not specific to the music domain, we did evaluate it in two music-related contexts:

For the second one, we tried to cover a wide range of possible metadata mistakes, and checked how well our algorithm was coping with such bad metadata. A week ago, I compared the results with the Picard Musicbrainz tagger 0.9.0, and here are the results (you also have to keep in mind that our algorithm is quite a bit slower, as the Musicbrainz API is not really designed for the sort of things we do with it), for the track I Want to Hold Your Hand by the Beatles, in the Meet the Beatles! album:

  • Artist field missing:
    • GNAT: Correct
    • Picard: Matches the same track, but on the And Now: The Beatles compilation
  • Artist set to random string:
    • GNAT: Correct
    • Picard: Matches the same track, but on another release (track 1 of The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!))
  • Artist set to Beetles:
    • GNAT: Correct
    • Picard: Matches the same track, but on another release (track 1 of The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!))
  • Artist set to Al Green (who actually made a cover of that song):
    • GNAT: Mapped to Al Green's cover version on Soul Tribute to the Beatles
    • Picard: Same
  • Album field missing:
    • GNAT: Matches the same track, but on another release (track 1 of The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!))
    • Picard: Matches the same track, but on the single
  • Album set to random string:
    • GNAT: Matches the same track, but on another release (track 1 of The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!))
    • Picard: Matches the same track, but on the single
  • Album set to Meat the Beatles:
    • GNAT: Matches the same track, but on another release (track 1 of The Capitol Albums, Volume 1 (disc 1: Meet the Beatles!))
    • Picard: Matches the same track, but on the compilation The Beatles Beat
  • Track set to random string:
    • GNAT: Correct
    • Picard: No results
  • Track set to I Wanna Hold Your Hand:
    • GNAT: Correct
    • Picard: Matches the same track, but on the compilation The Beatles Beat
  • Perfect metadata:
    • GNAT: Correct
    • Picard: Matches the same track, but on the compilation The Beatles Beat

Most of the compilation results of Picard are actually not wrong, as the track length of our test file is closer to the track length on the compilation than the track length on the Meet the Beatles album.

Of course, this is not an extensive evaluation of how the Picard lookup mechanism compares with GNAT. And GNAT is not able to compete at all with Picard, as it was clearly not designed for the same purpose (GNAT is meant to interlink RDF datasets).

The python implementation of our algorithm is under the BSD license, and available in the motools sourceforge project. The Prolog implementation (working on RDF datasets) is also available in the motools sourceforge project.

Monday 25 February 2008

Playing with SPARQL and XMPP

Chatting with Dan Brickley at the Semantic Camp last week got me quite curious about mixing SPARQL and XMPP, so I decided to give it a try :-)

I first tried sparqlxmpp by Chris Schmidt, a Python implementation using redland as a back-end. Unfortunately, I ran into some trouble (a weird error attribute was inserted in the XML between sending and receiving, which made the whole thing crash).

So I decided to give xOperator a try, whose 0.1 version was announced last week. It is really easy to use and flexible (notably because you can specify third-party SPARQL end-points as new datasets, and because you can template SPARQL queries, a bit like SparqlBot).

Here is the result. You can add jamendo@jabber.org to your Jabber roster, which provides SPARQL/XMPP on top of the Jamendo SPARQL end-point. Then, here is an example session (moustaki is me, and jamendo is the xoperator instance):

<moustaki> list templates
<jamendo> These are the defined templates:
<jamendo> LIST ME * ARTISTS  select ?n where {?a rdf:type mo:MusicArtist; foaf:name ?n} LIMIT %%1%%
<jamendo> GIVE ME THE LOCATION OF *  select ?l where {?a a mo:MusicArtist;foaf:name '%%1%%'; foaf:based_near ?g. ?g 
                <http://www.geonames.org/ontology#name> ?l}
<jamendo> GIVE ME ALBUMS MADE BY *  select ?l where {?a a mo:MusicArtist; foaf:name '%%1%%'; foaf:made ?r. ?r a mo:Record; dc:title ?l}
<jamendo> WHERE CAN I GET *  select ?d where {?a dc:title '%%1%%'; mo:available_as ?d}
<jamendo> GIVE ME THE LOCATION OF *  select ?l where {?a a mo:MusicArtist;foaf:name '%%1%%'; foaf:based_near ?g. ?g 
                <http://www.geonames.org/ontology#name> ?l}

<moustaki> list me 5 artists
<jamendo> Store jamendo answered: 
<jamendo> "Cicada"
<jamendo> "Hace Soul"
<jamendo> "vincent j"
<jamendo> "NoU"
<jamendo> "Margin of Safety"

<moustaki> give me the location of Cicada
<jamendo> Store jamendo answered: 
<jamendo> "Département des Bouches-du-Rhône"

<moustaki> give me albums made by Cicada
<jamendo> Store jamendo answered: 
<jamendo> "Rebirth"   
<jamendo> "AuthentiK Yogourt"

<moustaki> where can I get AuthentiK Yogourt
<jamendo> Store jamendo answered: 
<jamendo> <http://www.jamendo.com/get/album/id/album/p2p/redirect/8309/?p2pnet=bittorrent&are=mp32>
<jamendo> <http://www.jamendo.com/get/album/id/album/p2p/redirect/8309/?p2pnet=bittorrent&are=ogg3>
<jamendo> <http://www.jamendo.com/get/album/id/album/p2p/redirect/8309/?p2pnet=ed2k&are=mp32>
<jamendo> <http://www.jamendo.com/get/album/id/album/p2p/redirect/8309/?p2pnet=ed2k&are=ogg3>
<jamendo> <http://www.jamendo.com/get/track/id/album/audio/play/8309/?item_o=track_no_asc&aue=ogg2&n=all>
<jamendo> <http://www.jamendo.com/get/track/id/album/audio/xspf/8309/?item_o=track_no_asc&aue=ogg2&n=all>

Now, making it interact with GNAT and GNARQL, two tools able to create a SPARQL end point holding information about your personal music collection, is the next step :)
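
To make the templating idea concrete, here is a minimal Python sketch of how a "*" wildcard pattern could be turned into a SPARQL query by filling the %%1%% placeholder. This is not how xOperator is implemented; the names TEMPLATES and expand are made up for the illustration, and only two of the templates from the session are copied in.

```python
import re

# Two templates copied from the session above: a pattern with a "*" wildcard,
# mapped to a SPARQL query containing a %%1%% placeholder.
TEMPLATES = {
    "LIST ME * ARTISTS":
        "select ?n where {?a rdf:type mo:MusicArtist; foaf:name ?n} LIMIT %%1%%",
    "GIVE ME ALBUMS MADE BY *":
        "select ?l where {?a a mo:MusicArtist; foaf:name '%%1%%'; "
        "foaf:made ?r. ?r a mo:Record; dc:title ?l}",
}


def expand(message, templates=TEMPLATES):
    """Return the query of the first template matching message, or None."""
    for pattern, query in templates.items():
        # Turn "LIST ME * ARTISTS" into a regex where "*" is a capture group.
        regex = "(.+)".join(re.escape(part) for part in pattern.split("*"))
        match = re.fullmatch(regex, message.strip(), flags=re.IGNORECASE)
        if match:
            # Naive substitution: a real bot should escape quotes in the value.
            return query.replace("%%1%%", match.group(1))
    return None


print(expand("list me 5 artists"))
print(expand("give me albums made by Cicada"))
```

The matching is case-insensitive, which is why the lower-case messages in the session would still hit the upper-case patterns.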

Sunday 17 February 2008

Yay for SemanticCamp!

I was at the SemanticCamp event this week-end - it was great fun! Lots and lots of Semantic-Web/Microformat geeks! We did a small C4DM session on the Saturday, with Chris, Kurt and David, basically going through the Music Ontology, interlinked music datasets (especially the new classical music composers one), and the software in the MOTOOLS sourceforge project (mainly GNAT, our music collection linker, and GNARQL, our music information aggregator, with a live demo of this screencast). We finished on a SPARQL end-point providing access to content-based features, based on our Henry software.

Slides, code and demos are available here.

My personal highlights of the Saturday were the DBPedia presentation by Georgi, the Automatic indexing of science by Andrew, and the BBC /programmes presentation, where they finally unveiled their evil plans :-)

I did join them at the end to talk about one of the bubbles on their pentagram of data, the RDF programmes data.

On the Sunday, we had a really great discussion about audio/video on the Semantic Web, with people from Joost, from the BBC, from Talis and from URIPlay. I guess one of the main achievements was the mapping of the URIPlay ontology and the BBC one (well, ok, it was just an owl:sameAs away :-) ). I did not actually play, but the Semantopoly game looked like great fun!

I really enjoyed Nicholas' presentation about streaming RDF along with radio streams, and his neat hardware hacks to create a Wifi radio station out of a vintage Philips one! Then, I guess we had quite a geeky session with Tom, crawling the Linking Open Data cloud with CURL, and hand-editing a FOAF file to manage several online identities.

I found Premasagar's presentation on compound microformats really interesting, as it made me realise a particular "limitation" of microformats (perhaps I am not using exactly the right word, here) that I really didn't get before.

Not to mention the great beers on Saturday, etc. etc. :-) It was a really great week-end! Thank you Tom and Daniel!!

Wednesday 6 February 2008

Playing with Linked Data, Jamendo, Geonames, Slashfacet and Songbird

Today, I made a small screencast about mixing the ingredients named in the title: Linked Data, Jamendo, Geonames, Slashfacet and Songbird.

All of that was extremely easy to set up (it actually took me more time to figure out how to make a screencast on a Linux box :-) which I finally did using vnc2swf). Basically, just some tweaked configuration files for ClioPatria, and a small CSS hack, and that was it...

The result is there:

Songbird, Linked Data, Mazzle and Jamendo

(Note that only a few Jamendo artists are displayed now... Otherwise, Google Maps would just crash my laptop :-) ).

Tuesday 22 January 2008

Pushing your Last.FM friends in the FOAF-O-Sphere

I just committed some changes to the last.fm linked data service. As well as your last scrobbled tracks, linked to the corresponding Musicbrainz URIs, it now spits out your list of last.fm friends (using their URIs on this service).

This makes it quite nice to explore the last scrobbles of the friends of your friends (hello Kurt and Ben!) :)

The friends of my friends on last.fm

Friday 11 January 2008

Your AudioScrobbler data as Linked Data

I just put online a small service, which converts your AudioScrobbler data to RDF, designed using the Music Ontology: it exposes your last 10 scrobbled tracks.

The funny thing is that it links the tracks, records, and artists to corresponding dereferenceable URIs in the Musicbrainz dataset. So the tracks you were last listening to are part of the Data Web!

Just try it by getting this URI:

http://dbtune.org/last-fm/<last.fm username>

For example, mine is:

http://dbtune.org/last-fm/moustaki
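
As a rough illustration of "getting this URI" from a script, here is a minimal Python sketch using only the standard library. The helper names scrobbles_request and fetch are made up, and the Accept header assumes the service does content negotiation for RDF/XML, which is an assumption on top of what is described here.

```python
import urllib.parse
import urllib.request

BASE = "http://dbtune.org/last-fm/"


def scrobbles_request(username):
    """Build (without sending) a request for a user's scrobbles as RDF."""
    return urllib.request.Request(
        BASE + urllib.parse.quote(username),
        # Assumption: the service honours this header and returns RDF/XML.
        headers={"Accept": "application/rdf+xml"},
    )


def fetch(username):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(scrobbles_request(username), timeout=10) as response:
        return response.read().decode("utf-8")


request = scrobbles_request("moustaki")
print(request.full_url)
```

Calling fetch("moustaki") would then perform the actual HTTP GET.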

Of course, by being linked to dereferenceable URIs in Musicbrainz, you are able to access the birth dates of the artists you last listened to, or use the links published by DBPedia to plot your last artists played on a map, by just crawling the Data Web a little.

Then, you can link that to your FOAF URI. Mine now holds the following statement:

<http://moustaki.org/foaf.rdf#moustaki> owl:sameAs <http://dbtune.org/last-fm/moustaki>.

Now, my URI looks quite nice, in the Tabulator generic data browser!

Me and my scrobbles

Wednesday 12 December 2007

HENRY: A small N3 parser/reasoner for SWI-Prolog

Yesterday, I finally took some time to package the little hack I wrote last week, based on the SWI N3 parser I wrote back in September.

Update: A newer version with lots of bug fixes is available there.

It's called Henry, it is really small, hopefully quite understandable, and it does N3 reasoning. The good thing is that you can embed such reasoning in the SWI Semantic Web Server, and then access an N3-entailed RDF store using SPARQL.

How to use it?

Just get this tarball, extract it, and make sure you have SWI-Prolog installed, with its Semantic Web library.

Then, the small swicwm.pl script provides a small command-line tool to test it (roughly equivalent, in CWM terms, to cwm $1 --think --data --rdf).

Here is a simple example (shipped in the package, along with other, more fun examples).

The file uncle.n3 holds:

@prefix : <http://example.org/> .
:yves :parent :fabienne.
:fabienne :brother :olivier.
{?c :parent ?f. ?f :brother ?u} => {?c :uncle ?u}.

And:

$ ./swicwm.pl examples/uncle.n3

<!DOCTYPE rdf:RDF [
    <!ENTITY ns1 'http://example.org/'>
    <!ENTITY rdf 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
]>

<rdf:RDF
    xmlns:ns1="&ns1;"
    xmlns:rdf="&rdf;"
>
<rdf:Description rdf:about="&ns1;fabienne">
  <ns1:brother rdf:resource="&ns1;olivier"/>
</rdf:Description>

<rdf:Description rdf:about="&ns1;yves">
  <ns1:parent rdf:resource="&ns1;fabienne"/>
  <ns1:uncle rdf:resource="&ns1;olivier"/>
</rdf:Description>

</rdf:RDF>

How does it work?

Henry is built around my SWI N3 parser, which basically translates the N3 into a quad form that can be stored in the SWI RDF store. The two tricks to reach such a representation are the following:

  • Each formula is seen as a named graph, identified by a blank node (there exists a graph, where the following is true...);
  • Universal quantification is captured through a specific set of atoms (a bit like __bnode1 captures an existentially quantified variable).

For example:

@prefix : <http://example.org/> .
{?c :parent ?f. ?f :brother ?u} => {?c :uncle ?u}.

would be translated to:

subject||predicate||object||graph
__uqvar_c||http://example.org/parent||__uqvar_f||__graph1
__uqvar_f||http://example.org/brother||__uqvar_u||__graph1
__uqvar_c||http://example.org/uncle||__uqvar_u||__graph2
__graph1||log:implies||__graph2||uncle.n3

Then, when running the compile predicate, such a representation is translated into a bunch of Prolog clauses, such as, in our example:

rdf(C,'http://example.org/uncle',U) :- rdf(C,'http://example.org/parent',F), rdf(F,'http://example.org/brother',U).
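
To show how the quad form carries enough information to run the rule, here is a minimal Python sketch that reads the rule back from the quads and applies it. Henry itself compiles the quads into Prolog clauses and lets Prolog do the work; this sketch swaps that for naive forward chaining, purely as an illustration. The function names are made up, and storing the two data triples in the "uncle.n3" graph is my assumption.

```python
# The four rule quads from the table above, plus the two data triples of the
# uncle.n3 example (assumed to live in the "uncle.n3" document graph).
EX = "http://example.org/"
QUADS = [
    ("__uqvar_c", EX + "parent", "__uqvar_f", "__graph1"),
    ("__uqvar_f", EX + "brother", "__uqvar_u", "__graph1"),
    ("__uqvar_c", EX + "uncle", "__uqvar_u", "__graph2"),
    ("__graph1", "log:implies", "__graph2", "uncle.n3"),
    (EX + "yves", EX + "parent", EX + "fabienne", "uncle.n3"),
    (EX + "fabienne", EX + "brother", EX + "olivier", "uncle.n3"),
]


def is_var(term):
    """Universally quantified variables are marked by a reserved prefix."""
    return term.startswith("__uqvar_")


def graph(quads, name):
    """Return the triples stored in the named graph."""
    return [(s, p, o) for s, p, o, g in quads if g == name]


def match(pattern, fact, binding):
    """Extend binding so that pattern equals fact, or return None."""
    binding = dict(binding)
    for term, value in zip(pattern, fact):
        if is_var(term):
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return binding


def solve(body, facts, binding=None):
    """Yield every binding that satisfies all patterns of body."""
    binding = binding or {}
    if not body:
        yield binding
        return
    for fact in facts:
        extended = match(body[0], fact, binding)
        if extended is not None:
            yield from solve(body[1:], facts, extended)


def entail(quads, document):
    """Apply the log:implies rules of document until nothing new appears."""
    triples = graph(quads, document)
    rules = [(graph(quads, s), graph(quads, o))
             for s, p, o in triples if p == "log:implies"]
    facts = {t for t in triples if t[1] != "log:implies"}
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            for binding in list(solve(body, facts)):
                for pattern in head:
                    new = tuple(binding.get(term, term) for term in pattern)
                    if new not in facts:
                        facts.add(new)
                        changed = True
    return facts


for triple in sorted(entail(QUADS, "uncle.n3")):
    print(triple)
```

Unlike the Prolog clause above, which is evaluated on demand when a query comes in, this version materialises every derivable triple up front.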

Such rules are defined in a particular entailment module, allowing it to be plugged in the SWI Semantic Web server. Of course, rules can get into an infinite loop, and this will make the whole thing crash :-)

I tried to make the handling of lists and builtins as clear and simple as possible. Defining a builtin is as simple as registering a new predicate, associating a particular URI to a prolog predicate (see builtins.pl for an example).

An example using both lists and builtins is the following one. By issuing this SPARQL query to an N3-entailed store:

PREFIX list: <http://www.w3.org/2000/10/swap/list#>

SELECT ?a
WHERE
{?a list:in ("a" "b" "c")}

You will get back a, b and c (you have no clue how much I struggled to make this thing work :-) ).
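
The idea of a builtin as "a URI associated with a predicate" can be sketched in a few lines of Python. This is only an analogy for what builtins.pl does in Prolog: the registry BUILTINS, the builtin decorator and the query helper are all hypothetical names, and None stands in for an unbound variable.

```python
LIST_IN = "http://www.w3.org/2000/10/swap/list#in"

# Hypothetical registry: predicate URI -> Python generator function.
BUILTINS = {}


def builtin(uri):
    """Decorator associating a predicate URI with a Python function."""
    def register(function):
        BUILTINS[uri] = function
        return function
    return register


@builtin(LIST_IN)
def list_in(subject, members):
    """Yield each value of subject for which 'subject list:in members' holds."""
    if subject is None:        # unbound variable: enumerate the list
        yield from members
    elif subject in members:   # bound value: succeed once if it is a member
        yield subject


def query(subject, predicate, obj):
    """Evaluate one triple pattern whose predicate is a registered builtin."""
    return list(BUILTINS[predicate](subject, obj))


print(query(None, LIST_IN, ["a", "b", "c"]))
```

With the subject left unbound, the call mirrors the SPARQL query above and enumerates the members of the list.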

But take care!

Of course, there are lots of bugs and issues, and I am sure there are lots of cases where it'll crash and burn :-) Anyway, I hope it will be useful.

Tuesday 4 December 2007

Linking open data: interlinking the BBC John Peel sessions and the DBPedia datasets

For the last Hackday, the BBC released data about the John Peel sessions. In June, I did publish them as Linked Data, using a SWI-Prolog representation of this data, bundled with a custom P2R mapping. But so far, it was not interlinked with any dataset, making it a small island :-)

In order to enrich my interlinking experiences, I wanted to tackle a dataset I never really tried to link to: DBPedia, holding structured data extracted from Wikipedia (well, the Magnatune RDF dataset is linked to it, but just for geographical locations, which was fairly easy).

And my conclusion is... it is not that easy :-)

Try 1 - Matching on labels:

The first thing I tried was matching directly on labels, using an algorithm which might be summarized by:

1 - For all items I_i in the John Peel sessions dataset we want to link (instances of foaf:Agent or mo:MusicalWork), take the label L_i ({I_i rdfs:label L_i})

2 - Issue the following SPARQL query to the DBPedia dataset:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
SELECT ?u
WHERE 
{       
?u rdfs:label L_i
}

3 - For all results R_i_j of this query, assert I_i owl:sameAs R_i_j

Just for fun, here are the results of such a query. You can't imagine how many people are called exactly the same... For example Jules Verne in the BBC John Peel sessions dataset is quite different from Jules Verne... Also, Jules Verne is quite different from the Jules Verne category, though they share the same label.
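
The three steps above, and the reason they over-generate, can be sketched in Python. To keep it self-contained, the SPARQL lookup is replaced by an in-memory label index, and the peel: and dbpedia: identifiers are placeholders I made up for the illustration, not real URIs from either dataset.

```python
from collections import defaultdict


def link_by_label(source_labels, target_labels):
    """Return an owl:sameAs triple for every exact label match."""
    # Stand-in for step 2: index the target dataset by label.
    index = defaultdict(list)
    for resource, label in target_labels:
        index[label].append(resource)
    # Steps 1 and 3: one link per (item, candidate) pair sharing a label.
    return [
        (item, "owl:sameAs", candidate)
        for item, label in source_labels
        for candidate in index.get(label, [])
    ]


# Placeholder data: one item on our side, two same-label resources opposite.
PEEL = [("peel:jules_verne", "Jules Verne")]
DBPEDIA = [
    ("dbpedia:Jules_Verne", "Jules Verne"),
    ("dbpedia:Category:Jules_Verne", "Jules Verne"),
    ("dbpedia:Nirvana", "Nirvana"),
]

for link in link_by_label(PEEL, DBPEDIA):
    print(link)
```

Every resource sharing the label gets linked, category included, which is exactly the problem described above.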

Try 2 - Still matching on labels, but with restrictions:

In the agent case, the disambiguation appears easy to achieve, by just expressing "I am actually looking for someone who could be somehow related to the John Peel sessions". But, err, Wikipedia (hence, DBPedia) is a bit messy sometimes, and it is quite difficult to find a reliable and consistent way of expressing this criterion. So I had to sample the John Peel data (taking some producers, some engineers, some artists, some bands) and work out manually how I could restrict the range of resources I was looking for in DBPedia and still be able to retrieve all my linked agents. This leads to the following SPARQL query (involving L_i as defined earlier):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
SELECT ?u
WHERE 
{       
   {
      {?u <http://dbpedia.org/property/name> L_i} UNION 
      {?u rdfs:label L_i} UNION 
      {?u <http://dbpedia.org/property/bandName> L_i} 
    } 
    {
       {?u <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:infobox_musical_artist>} UNION 
       {?u a <http://dbpedia.org/class/yago/Group100031264>} UNION 
       {?u a ?mus.?mus rdfs:subClassOf <http://dbpedia.org/class/yago/Musician110340312>} UNION 
       {?u a ?artist. ?artist rdfs:subClassOf <http://dbpedia.org/class/yago/Creator109614315>} 
    } 
}

And for musical works?

Of course, this does not hold as soon as I broaden the range of resources I want to link. I first tried to use exactly the same methodology (basically restricting the resources I was looking for to be related to the Yago song concept). But, err, it did not work that well :-) You can't imagine how many songs have the same name! Just look at the results - this is enlightening! So far, the best I found was Walked Away, which appears to be the most popular title :-)

So what did I do to disambiguate? I took these results, got the RDF corresponding to the DBPedia resources, and made a literal search using the SWI literal index module on the abstract, looking for the name of the artist involved in the performance of this work. That's a bit hacky, but well, it did well even with cover songs (like Nirvana's Love Buzz).
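
The disambiguation step can be sketched as a simple filter in Python. The real code uses the SWI literal index module; here a plain case-insensitive substring test stands in for it, and the ex: identifiers and abstracts are invented for the example.

```python
def disambiguate(candidates, artist_name):
    """Keep the candidates whose abstract mentions the performing artist."""
    needle = artist_name.casefold()
    return [uri for uri, abstract in candidates if needle in abstract.casefold()]


# Invented candidates sharing the title "Love Buzz", with made-up abstracts.
CANDIDATES = [
    ("ex:Love_Buzz", "A song later covered by Nirvana on their first single."),
    ("ex:Love_Buzz_(album)", "An album by a completely unrelated band."),
]

print(disambiguate(CANDIDATES, "Nirvana"))
```

Because the test only looks for the performer's name in the abstract, a cover version still matches the page of the original song, as long as that page mentions the cover.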

Results

Surprisingly, the results do not seem to be too bad! I still have to check them more, but nothing seems obviously wrong, at first glance (I went through all the links manually, looking for something that would not make sense). The links are available here.

The SWI-Prolog code doing the trick is available here. Sorry, the code is a bit messy, and got increasingly hacky...
