DBTune blog


Tag - code


Friday 19 August 2011

4Store stuff

Update: The repository below is not maintained anymore, as official packages have been pushed into Debian. They are not yet available for Ubuntu 11.04, though. In order to install 4store on Natty, you have to install the following packages from the Oneiric repository, in order:

  • libyajl1
  • libgmp10
  • libraptor2
  • librasqal3
  • lib4store0
  • 4store

And you should have a running 4store (1.1.3).

Old post, for reference: I've been playing a lot with Garlik's 4store recently, and I have been building a few things around it. I just finished building packages for Ubuntu Jaunty, which you can get by adding the following lines to your /etc/apt/sources.list:

deb http://moustaki.org/apt jaunty main
deb-src http://moustaki.org/apt jaunty main

And then, an apt-get update && apt-get install 4store should do the trick. The packages are available for i386 and amd64. It is also one of my first packages, so feedback is welcome (I may have gotten it completely wrong). Once it is installed, you can create a database and start a SPARQL server.

I've also been writing two client libraries for 4store, both available on Github (a rough sketch of the HTTP interaction they wrap follows the list):

  • 4store-php, a PHP library to interact with 4store over HTTP (unlike Alexandre's PHP library, which interacts with 4store through the command-line tools);
  • 4store-ruby, a Ruby library to interact with 4store over HTTP or HTTPS.
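
As a rough illustration (not taken from either library), here is a minimal Python sketch of talking to 4store's SPARQL endpoint over HTTP. The port, endpoint path and JSON results format are assumptions about a locally running 4s-httpd, so adjust them to your setup:

import json
import urllib.parse
import urllib.request

# Assumed endpoint of a local 4store SPARQL server, e.g. started with
# `4s-httpd -p 8000 <kb-name>` -- adjust host, port and path to your setup.
ENDPOINT = "http://localhost:8000/sparql/"

def sparql_select(query):
    """POST a SPARQL SELECT query and return the JSON result bindings."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        ENDPOINT,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]

if __name__ == "__main__":
    for row in sparql_select("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"):
        print(row["s"]["value"], row["p"]["value"], row["o"]["value"])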

Thursday 14 January 2010

Live SPARQL end-point for BBC Programmes

Update: We seem to have an issue with the 4store instance hosting the dataset, so the data is stale since the end of February. Update 2: All should be back to normal and in sync. Please comment on this post if you spot any issue, or general slowness.

Last year, we got OpenLink and Talis to crawl BBC Programmes and provide two SPARQL end-points on top of the aggregated data. However, getting the data by crawling meant that the end-points did not have all the data, and that the data could get quite outdated -- especially as our programme data changes a lot.

At the moment, our data comes from two sources: PIPs (the central programme database at the BBC) and PIT (our content management system for programme information). In order to populate the /programmes database, we monitor changes on these two sources and replicate them in our database. We have a small piece of Ruby/ActiveRecord software (which we call the Tapp) that handles this process.

I ran a small experiment, converting our ActiveRecord objects to RDF and issuing an HTTP POST or HTTP DELETE request to a 4store instance for each change we receive. This means that this 4store instance is kept in sync with the upstream data sources.

It took a while to backfill, but it is now up-to-date. Check out the SPARQL end-point, a test SPARQL query form and the size of the endpoint (currently about 44 million triples).

The end-point holds all information about services, programmes, categories, versions, broadcasts, ondemands, time intervals and segments, as defined within the Programme Ontology. All of these resources are held within their own named graph, which means we have a very large number of graphs (about 5 million). It makes it far easier to update the endpoint, as we can just replace the whole graph whenever something changes for a resource.
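
As a rough illustration of that update strategy (and not the actual Ruby code used in the Tapp), here is a minimal Python sketch. It assumes 4store's HTTP data interface, where a PUT on /data/<graph-uri> replaces a named graph and a DELETE drops it; the host, port and payload are placeholders:

import urllib.request

# Placeholder address of the 4store HTTP server holding the /programmes data.
STORE = "http://localhost:8000"

def replace_graph(graph_uri, rdfxml):
    """Replace the whole named graph for a resource with a fresh RDF/XML document."""
    # 4store addresses a named graph by appending its URI to /data/
    request = urllib.request.Request(
        STORE + "/data/" + graph_uri,
        data=rdfxml.encode("utf-8"),
        headers={"Content-Type": "application/rdf+xml"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return response.status

def delete_graph(graph_uri):
    """Drop the named graph entirely, e.g. when the upstream resource disappears."""
    request = urllib.request.Request(STORE + "/data/" + graph_uri, method="DELETE")
    with urllib.request.urlopen(request) as response:
        return response.status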

This is still highly experimental though, and I have already found a few bugs: some episodes seem to be missing (for example, some Strictly Come Dancing episodes are missing, for some reason). I've also encountered some really weird crashes of the machine hosting the end-point when concurrently pushing a large number of RDF documents at it - I still haven't managed to identify the cause. To summarise: it might die without notice :-)

Here are some example SPARQL queries:

  • All programmes related to James Bond:
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?uri ?label
WHERE {
  ?uri po:category 
    <http://www.bbc.co.uk/programmes/people/bmFtZS9ib25kLCBqYW1lcyAobm8gcXVhbGlmaWVyKQ#person> ; rdfs:label ?label
}
  • Find all EastEnders broadcast dates after 2009-01-01, along with the type of the version that was broadcast:
PREFIX event: <http://purl.org/NET/c4dm/event.owl#> 
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#> 
PREFIX po: <http://purl.org/ontology/po/> 
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?version_type ?broadcast_start
WHERE
{ <http://www.bbc.co.uk/programmes/b006m86d#programme> po:episode ?episode .
  ?episode po:version ?version .
  ?version a ?version_type .
  ?broadcast po:broadcast_of ?version .
  ?broadcast event:time ?time .
  ?time tl:start ?broadcast_start .
  FILTER ((?version_type != <http://purl.org/ontology/po/Version>) && (?broadcast_start > "2009-01-01T00:00:00Z"^^xsd:dateTime))}
  • Find all programmes that have played tracks by two given artists (identified here by their BBC Music artist URIs):
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
PREFIX owl: <http://www.w3.org/2002/07/owl#> 
SELECT DISTINCT ?programme ?label
WHERE {
  ?event1 po:track ?track1 .
  ?track1 foaf:maker ?maker1 . ?maker1 owl:sameAs <http://www.bbc.co.uk/music/artists/67f66c07-6e61-4026-ade5-7e782fad3a5d#artist> .
  ?event2 po:track ?track2 .
  ?track2 foaf:maker ?maker2 . ?maker2 owl:sameAs <http://www.bbc.co.uk/music/artists/fb7272ba-f130-4f0a-934d-6eeea4c18c9a#artist> .
  ?event1 event:time ?t1 .
  ?event2 event:time ?t2 .
  ?t1 tl:timeline ?tl .
  ?t2 tl:timeline ?tl .
  ?version po:time ?t .
  ?t tl:timeline ?tl .
  ?programme po:version ?version .
  ?programme rdfs:label ?label .
}

Tuesday 27 October 2009

Music recommendation and Linked Data

Yesterday we presented a tutorial at ISMIR about Linked Data for music-related information. More information is available on the tutorial website, along with the slides.

In particular, we had two sets of slides dealing with the relationship between music recommendation and linked data. As this is something we're investigating within the NoTube project, I thought I would write up a bit more about it.

Let's focus on artist to artist recommendation for now. If we look at last.fm for recommendations for New Order, here is what we get.

Artists similar to New Order, from last.fm

Similarly, using the Echonest API for similar artists, we get back an ordered list of artists similar to New Order, including Orchestral Manoeuvres in the Dark, Depeche Mode, etc.

Now, let's play word associations for a few bands and musical genres. My colleague Michael Smethurst took the Sex Pistols, Acid House and Public Enemy, and drew the following associations:

Sex Pistols associated words

Acid House word associations

Public Enemy word associations

We can see that among the different terms in these diagrams, some refer to people, to TV programmes, to fashion styles, to drugs, to music hardware, to places, to laws, to political groups, to record labels, etc. Just a couple of these terms are actually other bands or tracks. If you were to describe these artists just in musical terms, you'd probably be missing the point. And all these things are also linked to each other: you could play word associations for any of them and see what the connections between Public Enemy and the Sex Pistols are. So how does that relate to recommendations? When recommending an artist from another artist, the context is key. You need to provide an explanation of why they actually relate to each other, whether it's through common members, drugs, belonging to the same independent record label, acoustic similarity (and if so, how exactly), etc. The main hypothesis here is that users are much more likely to accept a recommendation that is explicitly backed by some contextual information.

On the BBC website, we cover quite a few domains, and we try to create as many links as possible between these domains, by following the Linked Data principles. From our BBC Music site, we can explore much more information, from other BBC content (programmes, news etc.) to other Linked Data sources, e.g. DBpedia, Freebase and Musicbrainz. This provides us with a wealth of structured information that we would ultimately want to use for driving and backing up our recommendations.

The MusicBore I've described earlier on this blog uses much the same approach. Playlists are generated by following paths in Linked Data, and each artist is introduced by a sentence generated from the path leading from the seed artist to the target artist. The prototype described in that paper from the SDOW workshop last year also illustrates that approach.

So we developed a small prototype of this kind of idea, rqommend (and when I say small, it is very small :) ). Basically, we define "relatedness rules" in the form of SPARQL queries, like "Two artists born in Detroit in the 60s are related". We could go for very general rules, e.g. "Any path between two artists makes them related", but it would be very hard to generate an accurate textual explanation for such a rule, and it might give some, hem, not very interesting connections. Then, we just run these rules over an aggregation of Linked Data, and generate recommendations from them. Here is a greasemonkey script injecting such recommendations into BBC Music (see for example the Fugazi page). It injects Linked Data-based recommendations, along with the associated explanations, within BBC artist pages. For example, for New Order (a sketch of what such a rule could look like follows the screenshot):

BBC Music recs for New Order
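
To make the "relatedness rule" idea more concrete, here is a rough sketch of how one such rule could be written down and evaluated. The query, the DBpedia-style property names and the endpoint are illustrative assumptions, not the actual rqommend code:

import json
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"  # assumed source of the aggregated Linked Data

# One hypothetical relatedness rule: "two artists born in Detroit in the 60s are related".
RULE = {
    "explanation": "{a} and {b} were both born in Detroit in the 1960s",
    "query": """
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT DISTINCT ?a ?b WHERE {
          ?a a dbo:MusicalArtist ;
             dbo:birthPlace <http://dbpedia.org/resource/Detroit> ;
             dbo:birthDate ?da .
          ?b a dbo:MusicalArtist ;
             dbo:birthPlace <http://dbpedia.org/resource/Detroit> ;
             dbo:birthDate ?db .
          FILTER (?a != ?b && YEAR(?da) >= 1960 && YEAR(?da) < 1970
                           && YEAR(?db) >= 1960 && YEAR(?db) < 1970)
        } LIMIT 20
    """,
}

def run_rule(rule):
    """Evaluate one rule and yield (artist, artist, textual explanation) tuples."""
    data = urllib.parse.urlencode({"query": rule["query"]}).encode("utf-8")
    request = urllib.request.Request(
        ENDPOINT, data=data,
        headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(request) as response:
        bindings = json.load(response)["results"]["bindings"]
    for row in bindings:
        a, b = row["a"]["value"], row["b"]["value"]
        yield a, b, rule["explanation"].format(a=a, b=b)

if __name__ == "__main__":
    for artist_a, artist_b, why in run_rule(RULE):
        print(artist_a, "->", artist_b, ":", why)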

To conclude, I think there is a really strong influence of traditional information retrieval systems on the music information retrieval community. But what makes Google, for example, particularly successful is that it exploits links, not just the documents themselves. We definitely need to move towards the same sort of model: exploiting the links surrounding music, and all the cross-domain information that makes it so rich, to create better music recommendation systems that combine what is recommended with why it is recommended.

Tuesday 12 May 2009

Yahoo Hackday 2009


We went to the Yahoo Hackday this weekend, with a couple of people from the C4DM and the BBC. Apart from a flaky wireless connection on the Saturday, it was a really great event, with lots of interesting talks and hacks.

On the Saturday, we learned about Searchmonkey. I tried to create a small Searchmonkey application during the talk, but eventually got frustrated. Apparently, Searchmonkey indexes RDFa and eRDF, but doesn't follow <link rel="alternate"/> links towards RDF representations (neither does it try to do content negotiation). So in order to create a Searchmonkey application for BBC Programmes, I needed to either include RDFa in all the pages (which, hem, was difficult to do in an hour :-) ) or write an XSLT against our RDF/XML representations, which would just be Wrong, as there are lots of different ways to serialise the same RDF in an RDF/XML document.

We also learned about the Guardian Open Platform and Data Store, which holds a huge amount of interesting information. The license terms are also really permissive, even allowing commercial uses of this data. I can't even imagine how useful this data would be if it were linked to other open datasets, e.g. DBpedia, Geonames or Eurostat.

I also got a bit confused by YQL, which seems to be really similar to SPARQL, at least in the underlying concept ("a query language for the web"). However, it is backed by lots of interesting data: almost all of Yahoo's services, and a few third-party wrappers, e.g. for Last.fm. I wonder how hard it would be to write a SPARQL end-point that wraps YQL queries?

Finally, on Saturday evening and Sunday morning, we got some time to actually hack :-) Kurt made a nice MySpace hack, which does an artist lookup on MySpace using BOSS and exposes relevant information extracted using the DBTune RDF wrapper, without having to look at an overloaded MySpace page. It uses the Yahoo Media Player to play the audio files this page links to.

At the same time, we got around to trying out some of the things that can be built using the linked data we publish at the BBC, especially the segment RDF I announced on the linked data mailing list a couple of weeks ago. We built a small application which, from a place, gives you BBC programmes featuring an artist that is related in some way to that place. For example, Cardiff, Bristol, London or Lancashire. It might be a bit slow (and the number of results is limited) as I didn't have time to implement any sort of caching: the application crawls from DBpedia to BBC Music to BBC Programmes on each request. I just put the (really hacky) code online.

And we actually won the Backstage prize with these hacks! :-)

This last hack illustrates to some extent the things we are investigating as part of the BBC use-cases of the NoTube project. Using these rich connections between things (programmes, artists, events, locations, etc.), it becomes possible to provide data-rich recommendations backed by real stories (and not only "if you like this, you may like that"). I mentioned these issues in the last chapter of my thesis, and will try to follow up on that here!

Thursday 29 January 2009

Prolog message queue

It's been a long time since I last posted anything here, but things have been pretty hectic recently (I am a doctor, now!! I'll post my thesis here soon).

I've just hacked together a really small implementation of an HTTP-driven SWI-Prolog message queue. I often find myself doing quite expensive computations in Prolog, and the best way to distribute them easily is to have a message queue on which you post messages to process (in this case, Prolog terms), and a pool of workers that pick messages and process them. Then, if your program is still too slow, you can easily add a couple of workers to speed things up.
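
The hack itself is SWI-Prolog, but the pattern is language-agnostic. Just to illustrate the architecture, here is a minimal sketch of the same idea in Python: an HTTP endpoint that accepts messages and a pool of workers draining a shared queue (the port, worker count and "processing" step are placeholders):

import queue
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

jobs = queue.Queue()

class EnqueueHandler(BaseHTTPRequestHandler):
    """POSTed bodies become messages on the queue; workers pick them up later."""
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        jobs.put(self.rfile.read(length).decode("utf-8"))
        self.send_response(202)  # accepted, will be processed asynchronously
        self.end_headers()

def worker(worker_id):
    """Each worker blocks on the queue and processes one message at a time."""
    while True:
        message = jobs.get()
        print(f"worker {worker_id} processing {message!r}")  # expensive computation goes here
        jobs.task_done()

if __name__ == "__main__":
    # Going faster is just a matter of raising the number of workers.
    for i in range(4):
        threading.Thread(target=worker, args=(i,), daemon=True).start()
    HTTPServer(("localhost", 8080), EnqueueHandler).serve_forever()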

Monday 29 September 2008

D2R server, SNORQL and Firefox 3

In case it might be useful for someone else (I've had several requests for it offline), here is a small patch to make D2R server work with the latest versions of ARQ, in order to make the SNORQL SPARQL explorer work in Firefox 3. I sent the patch to Richard some time ago, so hopefully the newest D2R should work with the latest versions of ARQ.

Oh, and I've finished (well, almost, just a couple more lines to add to the conclusion) writing up, and started this morning at the BBC!

Thursday 31 July 2008

Semantic search on aggregated music data

I just moved the semantic search demo to a faster server, so it should hopefully be a lot more reliable. This demo uses the amazing ClioPatria on top of an aggregation of music-related data. This aggregation was simply constructed by taking a bunch of Creative Commons MP3s, running GNAT on them, and crawling linked data starting from the web identifiers outputted by GNAT.

I also set up the search tab to work correctly. For example, when you search for "punk", you get the following results.

Punk search 1

Punk search 2

Note that the results are explained: "punk" might be related to the title, the biography, a tag, the lyrics, content-based similarity to something tagged as punk (although it looks like Henry crashed in the middle of the aggregation, so not a lot of such data is available yet), etc. Moreover, you get back different types of resources: artists, records, tracks, lyrics, performances etc.

For example, if you click on one of the records, you get the following.

Punk search 3

This record is available under a Creative Commons license, so you can get direct access to the corresponding XSPF playlist, Bittorrent items etc., by following the Music Ontology "available as" property. For example, you can click on an XSPF playlist, and listen to the selected record.

Punk search 4

Of course, you can still do the previous things - plotting music artists (or search results, just take a look at the "view" drop-down box) on a map or a time-line, browsing using facets, etc.

Btw, if you like DBTune, please vote for it in the Triplify Challenge! :-)

Wednesday 30 July 2008

Last.fm events and DBpedia mobile

For a recent event at the Dana Centre, I was asked to make a small demo of some nice things you can do with Semantic Web technologies. As it is no fun to re-use demos, I decided to go for something new. So after two hours of hacking and skyping with Christian Becker, we added support for recommended events to the last.fm linked data exporter. I also implemented a bit of geo-coding on the server side (although, with the new last.fm API, I guess this part is becoming useless).

Then, thanks to RDF goodness, it was really straightforward to make that work with DBpedia mobile. DBpedia mobile is a service that gets your geo-location from your mobile device and displays a map with nearby sights, using data from DBpedia. DBpedia mobile also uses the RDF cache of a really nice linked data browser called Marbles.

So, after browsing your DBTune last-fm URI in Marbles, you can go to DBpedia mobile and see recommended events alongside nearby sights. To do so, select the Performances (by moustaki) filter. Here is what I get for my profile, when at the university:

DBpedia mobile and last.fm events

Tuesday 1 July 2008

Echonest Analyze XML to Music Ontology RDF

I wrote a small XSL stylesheet to transform the XML results of the Echonest Analyze API into Music Ontology RDF. The Echonest Analyze API is a really great (and simple) web service to process audio files and get back an XML document describing some of their features (rhythm, structure, pitch, timbre, etc.). A lot of people have already done really great things with it, from collection management to visualisation.

The XSL is available on that page. The resulting RDF can be queried using SPARQL. For example, the following query selects the boundaries of structural segments (chorus, verse, etc.):

PREFIX af: <http://purl.org/ontology/af/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>

SELECT ?start ?duration
FROM <http://dbtune.org/echonest/analyze-example.rdf>
WHERE
{
?e      a af:StructuralSegment;
        event:time ?time.
?time   tl:start ?start;
        tl:duration ?duration.
}

I also added to that page the small bit you need to add to the Echonest Analyze XML to make it GRDDL-ready. That means that the XML document can be automatically translated to actual RDF data (which can then be aggregated, stored, linked to, queried, etc.).

<Analysis    xmlns:grddl="http://www.w3.org/2003/g/data-view#" 
                grddl:transformation="http://dbtune.org/echonest/echonest.xsl">
...
</Analysis>

This provides a lot more data to aggregate for describing my music collection!

If there is one thing I really wish could be integrated in the Echonest API, it would be a Musicbrainz lookup... Right now, I have to manually link the data I get from it to the rest of my aggregated data. If the Echonest results could include a link to the corresponding Musicbrainz resource, it would really simplify this step :-)

Wednesday 25 June 2008

Linking Open Data: BBC playcount data as linked data

For the Mashed event this weekend, the BBC released some really interesting data. This includes playcount data, stating how often an artist is featured within a particular BBC programme (at the brand or episode level).

During the event, I wrote some RDF translators for this data, linking web identifiers in the DBTune Musicbrainz linked data to web identifiers in the BBC Programmes linked data. We used it with Kurt and Ben in our hack. Ben made a nice write-up about it. By finding web identifiers for tracks in a collection and following links to the BBC Programmes data, and finally connecting this Programmes data to the box holding all recorded BBC radio programmes over a year that was available at the event, we can quite easily generate playlists from an audio collection. Two python scripts implementing this mechanism are available there. The first one uses solely brands data, whereas the second one uses episodes data (and therefore helps to get fewer and more accurate items in the resulting playlist). Finally, the thing we spent the most time on was the SQLite storage for our RDF cache :-)

This morning, I published the playcount data as linked data. I wrote a new DBTune service for that. It publishes a set of web identifiers for playcount data, interlinking Musicbrainz and BBC Programmes. I also put online a SPARQL end-point holding all this playcount data along with aggregated data from Musicbrainz and the BBC Programmes linked data (around 2 million triples overall).

For example, you can try the following SPARQL query:

SELECT ?brand ?title ?count
WHERE {
   ?artist a mo:MusicArtist;
      foaf:name "The Beatles". 
   ?pc pc:object ?artist;
       pc:count ?count.
   ?brand a po:Brand;
       pc:playcount ?pc;
       dc:title ?title 
    FILTER (?count>10)}

This will return every BBC brand that has featured The Beatles more than 10 times.

Thanks to Nicholas and Patrick for their help!

Mashed!

I was at Mashed (the former Hack Day) this weekend - a really good and geeky event, organised by the BBC at Alexandra Palace. We arrived on the Saturday morning for some talks, detailing the different things we'd be able to play with over the weekend. Amongst these, a full DVB-T multiplex (apparently, it was the first time since 1956 that a TV signal was broadcast from Alexandra Palace), lots of data from the BBC Programmes team and a box full of recorded radio content from the last year.

After these presentations, the 24-hour hacking session began. We sat down with Kurt and Ben and wrote a small hack which basically starts from a personal music collection and creates a playlist of recorded BBC programmes for you. I will write a bit more about this later today.

During the 24 hours of hacking, we had a Rock Band session on a big screen, a real-world Tron game (basically, two guys running with GPS phones, guided by two people watching their trail on a Google satellite map :-) ), a rocket launch...

Finally, at 2pm on the Sunday, people presented their hacks. Almost 50 hacks were presented, all extremely interesting. Take a look at the complete list of hacks! On the music side, Patrick's recommender was particularly interesting. It used Latent Semantic Analysis on playcount data for artists in BBC brands and episodes to recommend brands from artists or artists from artists. It gave some surprising results :-) Jamie Munroe resurrected the FPFF Musicbrainz fingerprinting algorithm (which was apparently due to replace the old TRM one before MusicIP offered their services to Musicbrainz) to identify tracks played several times in BBC programmes. The WeDoID3 team talked about creating RSS feeds from embedded metadata in audio and video, but the demo didn't work.

My personal highlight was the hack (which actually won a prize) from Team Bob. Here is a screencast of it:


BBC Dylan - News 24 Revisited (Clip) from James Adam on Vimeo.

Thanks to Matthew Cashmore and the rest of the BBC backstage team for this great event! (and thanks to the sponsors for all the free stuff - I think I have enough T-shirts for about a year now :-))

Tuesday 20 May 2008

Ceriese: RDF translator for Eurostat data

Riese logo

Some time ago, I did a bit of work on the RIESE project, which aims to publish the Eurostat dataset on the Semantic Web and interlink it with further datasets (e.g. Geonames and DBpedia). This can look a bit far from my interests in music data, but there is a connection which illustrates the power of linked data, as explained at the end of this post.

Original data

There are three distinct things we consider in the Eurostat dataset:

  • A table of contents in HTML defining the hierarchical structure of the Eurostat datasets;
  • Tab-separated values dictionary files defining the ~80 000 data codes used in the dataset (e.g. "eu27" for the European Union of 27 countries);
  • The actual statistical data, in tab-separated values files. Around 4000 datasets for roughly 350 million statistical items.

Ontology

The first thing we need to figure out when exposing data on the Semantic Web is the model we'll link to. This led to the design of SCOVO (Statistical Core Vocabulary). The concepts in this ontology can be depicted as follows:

SCOVO ontology

The interesting thing about this model is that the statistical item is considered as a primary entity. We used the Event ontology as a basis - a statistical item is a particular classification of a space/time region. This makes the model really flexible and extensible. We can for example attach multiple dimensions to a particular item, resources pertaining to its creation, etc.
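
As a rough sketch of what a statistical item looks like under this model, here is a short rdflib (Python) example; the SCOVO namespace and the example dimensions are my own reading of the vocabulary, and the population figure is made up:

from rdflib import Graph, Literal, Namespace, RDF

SCV = Namespace("http://purl.org/NET/scovo#")    # assumed SCOVO namespace
EX = Namespace("http://example.org/eurostat/")   # placeholder URIs

g = Graph()
g.bind("scv", SCV)

item = EX["item/population-eu27-2007"]
g.add((item, RDF.type, SCV.Item))
g.add((item, RDF.value, Literal(495000000)))     # made-up figure, purely illustrative
g.add((item, SCV.dataset, EX["dataset/population"]))
# The item is the primary entity, so several dimensions can hang off it directly:
g.add((item, SCV.dimension, EX["region/eu27"]))
g.add((item, SCV.dimension, EX["year/2007"]))

print(g.serialize(format="turtle"))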

RDF back-end

I wanted to see how to publish such large amounts of RDF data and how my publication tools perform, so I designed the Ceriese software to handle that.

The first real concern when dealing with such large amounts of data is, of course, scalability. The overall Eurostat dataset is over 3 billion triples. Given that we don't have high-end machines with lots of memory, using the core SWI RDF store was out of the question (I wonder if any in-memory triple store can handle 1 billion triples, btw).

So there are three choices at this point:

  • Use a database-backed triple store;
  • Dump static RDF files in a file system served through Apache;
  • Generate the RDF on-the-fly.

We don't have the sort of money it takes (for both the hardware and the software) for the first choice to really scale, so we tried the second and the third solutions. I took my old Prolog-2-RDF software, which I am using to publish the Jamendo dataset, and we wrote some P2R mapping files converting the tab-separated value files. We then made P2R dump small RDF files in a file-system hierarchy, corresponding to the descriptions of the different web resources we wanted to publish. Finally, some Apache tweaks and Michael and Wolfgang's work on XHTML/RDFa publishing were enough to make the dataset available in the web of data.

But this approach had two main problems. First, it took ages to run this RDF dump, so we never actually managed to complete it once. Also, it was impossible to provide a SPARQL querying facility: no aggregation of the data was made available.

So we eventually settled on the third solution. I put on my Prolog hacker hat, and tried to optimise P2R to make it fast enough. I did it by using the same trick I used in my small N3 reasoner, Henry. P2R mappings are compiled into native Prolog clauses (rdf(S,P,O) :- ... ), which cuts down the search space a lot. TSV files are accessed within those rules and parsed on-the-fly. The location of the TSV file to access is derived from a small in-memory index. Parsed TSV files are cached for a whole query, to avoid parsing the same file for different triple patterns in the query.
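
The same trick translates outside Prolog. Here is a small Python sketch of the idea -- triple patterns answered by parsing the relevant TSV file on demand, with the parse cached so repeated patterns don't hit the file again. The file layout (one subject/predicate/object per row) is a placeholder rather than the real Eurostat format, and the cache here lives for the whole process rather than a single query:

import csv
from functools import lru_cache

@lru_cache(maxsize=None)
def load_tsv(path):
    """Parse a TSV file once and keep the rows in memory."""
    with open(path, newline="") as f:
        return [tuple(row) for row in csv.reader(f, delimiter="\t")]

def triples(path, subject=None, predicate=None, obj=None):
    """Answer an rdf(S,P,O)-style triple pattern against one TSV-backed dataset."""
    for s, p, o in load_tsv(path):
        if subject in (None, s) and predicate in (None, p) and obj in (None, o):
            yield s, p, o

# A query with several triple patterns over the same dataset only parses the file once:
# for s, p, o in triples("data/example.tsv", predicate="population"):
#     print(s, o)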

The same mechanism is applied to derive a SKOS hierarchy from the HTML table of contents.

Now, a DESCRIBE query takes less than 0.5 seconds for any item, on my laptop. Not perfect, but, still...

A solution to improve the access time a lot would be to dump the TSV file in a relational database, and access this database in our P2R mappings instead of the raw TSV files.

Trying it out and creating your own SPARQL end-point from Eurostat data is really easy.

  • Get SWI-Prolog;
  • Get the software from there;
  • Get the raw Eurostat data (or get it from the Eurostat web-site, as this one can be slightly out-dated);
  • Put it in data/, in your Ceriese directory;
  • Launch start.pl;
  • Go to http://localhost:3020/

Now what?

Michael and Wolfgang did an amazing job putting together a really nice web interface, publishing the data in XHTML+RDFa. They also included some interlinking, especially for geographical locations, which are now linked to Geonames.

So what's the point from my music geek point of view? Well, now, after aggregating Semantic Web data about my music collection (using these tools), I can sort hip-hop artists by murder rates in their city :-) This is quite fun as it is (especially as the Eurostat dataset holds a really diverse range of statistics), but it would be really interesting to mine that to get some associations between statistical data and musical facts. This would surely lead to interesting sociological results (e.g. how does musical "genre" associate with particular socio-economic indicators?)

Wednesday 14 May 2008

Data-rich music collection management

I just put a live demo of something I showed earlier on this blog.

You can explore the Creative Commons-licensed part of my music collection (mainly coming from Jamendo) using aggregated Semantic Web data.

For example, here is what you get after clicking on "map" on the right-hand side and "MusicArtist" on the left-hand side:

Data-rich music collection management

The aggregation is done using the GNAT and GNARQL tools available in the motools sourceforge project. The data comes from datasets within the Linking Open Data project. The UI is done by the amazing ClioPatria software, with very little configuration.

An interesting thing is to load this demo into Songbird, as it can aggregate and play the audio as you crawl around.

Check the demo!

Update: It looks like it doesn't work with IE, but it is fine with Opera and FF2 or FF3. If the map doesn't load at first, just try again and it should be ok.

Monday 7 April 2008

D2RQ mapping for Musicbrainz

I just started a D2R mapping for Musicbrainz, which makes it fairly easy to create a SPARQL end-point and to provide linked data access to Musicbrainz. A D2R instance loaded with the mapping as it is now is also available (be gentle, it is running on a cheap computer :-) ).

On top of the things that are available within the Zitgist mapping, it adds:

  • A SPARQL end-point;
  • Support for tags;
  • Support for a couple of advanced relationships (still working my way through them, though);
  • An instrument taxonomy directly generated from the db, and related to performance events;
  • Support for orchestras;
  • Links to DBpedia for places and Lingvoj for languages

There is still a lot to do, though: it is really a start. The mapping is available on the motools sourceforge project. I hope to post a follow-up soon! (including examples of funny SPARQL queries :-) ).

Update: For some obscure port-forwarding reasons, the SNORQL interface to the SPARQL end point does not work on the test server.

Update 2: This is fixed. (thanks to the anonymous SPARQL crash tester which helped me find the bug, by the way :-) )

Wednesday 12 December 2007

HENRY: A small N3 parser/reasoner for SWI-Prolog

Yesterday, I finally took some time to package the little hack I wrote last week, based on the SWI N3 parser I wrote back in September.

Update: A newer version with lots of bug fixes is available there.

It's called Henry, it is really small, hopefully quite understandable, and it does N3 reasoning. The good thing is that you can embed such reasoning in the SWI Semantic Web Server, and then access an N3-entailed RDF store using SPARQL.

How to use it?

Just get this tarball, extract it, and make sure you have SWI-Prolog installed, with its Semantic Web library.

Then, the small swicwm.pl script provides a small command-line tool to test it (roughly equivalent, in CWM terms, to cwm $1 -think -data -rdf).

Here is a simple example (shipped in the package, along with other, funnier examples).

The file uncle.n3 holds:

@prefix : <http://example.org/> .
:yves :parent :fabienne.
:fabienne :brother :olivier.
{?c :parent ?f. ?f :brother ?u} => {?c :uncle ?u}.

And:

$ ./swicwm.pl examples/uncle.n3

<!DOCTYPE rdf:RDF [
    <!ENTITY ns1 'http://example.org/'>
    <!ENTITY rdf 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
]>

<rdf:RDF
    xmlns:ns1="&ns1;"
    xmlns:rdf="&rdf;"
>
<rdf:Description rdf:about="&ns1;fabienne">
  <ns1:brother rdf:resource="&ns1;olivier"/>
</rdf:Description>

<rdf:Description rdf:about="&ns1;yves">
  <ns1:parent rdf:resource="&ns1;fabienne"/>
  <ns1:uncle rdf:resource="&ns1;olivier"/>
</rdf:Description>

</rdf:RDF>

How does it work?

Henry is built around my SWI N3 parser, which basically translates the N3 into a quad form that can be stored in the SWI RDF store. The two tricks to reach such a representation are the following:

  • Each formula is seen as a named graph, identified by a blank node (there exists a graph, where the following is true...);
  • Universal quantification is captured through a specific set of atoms (a bit like __bnode1 captures an existentially quantified variable).

For example:

@prefix : <http://example.org/> .
{?c :parent ?f. ?f :brother ?u} => {?c :uncle ?u}.

would be translated to:

subject||predicate||object||graph
__uqvar_c||http://example.org/parent||__uqvar_f||__graph1
__uqvar_f||http://example.org/brother||__uqvar_u||__graph1
__uqvar_c||http://example.org/uncle||__uqvar_u||__graph2
__graph1||log:implies||__graph2||uncle.n3

Then, when running the compile predicate, such a representation is translated into a bunch of Prolog clauses, such as, in our example:

rdf(C,'http://example.org/uncle',U) :- rdf(C,'http://example.org/parent',F), rdf(F,'http://example.org/brother',U).

Such rules are defined in a particular entailment module, allowing them to be plugged into the SWI Semantic Web server. Of course, rules can get into an infinite loop, and this will make the whole thing crash :-)

I tried to make the handling of lists and builtins as clear and simple as possible. Defining a builtin is as simple as registering a new predicate, associating a particular URI with a Prolog predicate (see builtins.pl for an example).

An example using both lists and builtins is the following. By issuing this SPARQL query to an N3-entailed store:

PREFIX list: <http://www.w3.org/2000/10/swap/list#>

SELECT ?a
WHERE
{?a list:in ("a" "b" "c")}

You will get back a, b and c (you have no clue how much I struggled to make this thing work :-) ).

But take care!

Of course, there are lots of bugs and issues, and I am sure there are lots of cases where it'll crash and burn :-) Anyway, I hope it will be useful.

Friday 16 November 2007

Sindice module for SWI-Prolog

I just committed in the motools sourceforge project a small SWI-Prolog module for accessing the Sindice Semantic Web search engine.

Just about 20 lines of code, and it handles Sindice URI lookups (find documents referencing a particular URI), and keyword lookups (find documents mentioning a similar literal). I guess it sort of proves how well designed the Sindice service is!

Anyway, a typical SWI-Prolog session making use of this module would look like:

?- use_module(sindice).
Yes
?- sindice_q(uri('http://dbtune.org/jamendo/artist/5')).
% Parsed "http://sindice.com/query/lookup?uri=http%3a%2f%2fdbtune.org%2fjamendo%2fartist%2f5" in 0.00 sec; 2 triples
Yes
?- sindice_r(Q,U).
Q = 'http://sindice.com/query/lookup?uri=http://dbtune.org/jamendo/artist/5',
U = 'http://dbtune.org:2105/sparql/?query=describe <http://dbtune.org/jamendo/artist/5>' ;

Q = 'http://sindice.com/query/lookup?uri=http://dbtune.org/jamendo/artist/5',
U = 'http://moustaki.org/resources/jamendo_mbz.rdf'

Then, it's up to you to crawl further (using rdf_load). By replacing uri with keyword, a keyword search is performed, whose results are accessible in the same way.

Thursday 30 August 2007

GNAT 0.1 released

Chris Sutton and I have done some work since the first release of GNAT, and it is now in a releasable state!

You can get it here.

What does it do?

As mentioned in my previous blog post, GNAT is a small piece of software able to link your personal music collection to the Semantic Web. It will find dereferenceable identifiers available somewhere on the web for tracks in your collection. Basically, GNAT crawls through your collection, and tries by several means to find the corresponding Musicbrainz identifier, which is then used to find the corresponding dereferenceable URI in Zitgist. Then, RDF/XML files are put in the corresponding folder:

/music
/music/Artist1
/music/Artist1/AlbumA/info_metadata.rdf
/music/Artist1/AlbumA/info_fingerprint.rdf
/music/Artist1/AlbumB/info_metadata.rdf
/music/Artist1/AlbumB/info_fingerprint.rdf

What next?

These files hold a number of <http://zitgist.com/music/track/...> mo:available_as <local file> statements. They can then be used by a tool such as GNARQL (which will be properly released next week), which swallows them, exposes a SPARQL end-point, and provides some linked data crawling facilities (to gather more information about the artists in our collection, for example), therefore allowing us to use the links pictured here (yes, sorry, I didn't know how to properly introduce the new linking-open-data schema - it looks good! and keeps on growing! :-) ):

Linking-Open-Data

Two identification strategies

GNAT can use two different identification strategies:

  • Metadata lookup: in this mode, only the available tags are used to identify the track. We chose an identification algorithm which is slower (although if you try to identify, for example, a collection with lots of releases, you won't notice it too much, as only the first track of a release will be slower to identify), but works a bit better than Picard's metadata lookup. Basically, the algorithm used is the same as the one I used to link the Jamendo dataset to the Musicbrainz one (a rough sketch of it follows this list).
  • Fingerprinting: in this mode, the Music IP fingerprinting client is used in order to find a PUID for the track, which is then used to get back to the Musicbrainz ID. This mode is obviously better when the tags are crap :-)
  • The two strategies can be run in parallel, and most of the time, the best identification results are obtained by combining the two...
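
For illustration, here is a rough Python sketch of that cascaded metadata lookup. It goes through today's MusicBrainz search web service rather than the library GNAT actually uses, and the Lucene field names are my best guess; real code should also stick to at most one request per second:

import json
import time
import urllib.parse
import urllib.request

MB = "https://musicbrainz.org/ws/2"

def mb_search(entity, lucene_query, limit=5):
    """Query the MusicBrainz search service; sleep so we never exceed 1 request/second."""
    params = urllib.parse.urlencode({"query": lucene_query, "fmt": "json", "limit": limit})
    request = urllib.request.Request(f"{MB}/{entity}?{params}",
                                     headers={"User-Agent": "gnat-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    time.sleep(1)
    return result

def identify_track(artist_name, release_title, track_title):
    """Cascade: identify the artist, then the release for that artist, then the track."""
    for artist in mb_search("artist", f'artist:"{artist_name}"').get("artists", []):
        releases = mb_search("release", f'release:"{release_title}" AND arid:{artist["id"]}')
        for release in releases.get("releases", []):
            recordings = mb_search("recording", f'recording:"{track_title}" AND reid:{release["id"]}')
            for recording in recordings.get("recordings", []):
                return artist["id"], release["id"], recording["id"]
    return None

# identify_track("<artist>", "<release>", "<track title>")  # hypothetical example call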

Usage

  • To perform a metadata lookup for the music collection available at /music:

./AudioCollection.py metadata /music

  • To perform a fingerprint-based lookup for the music collection available at /music:

./AudioCollection.py fingerprint /music

  • To clean every previously performed identifications:

./AudioCollection.py clean /music

Dependencies

  • MOPY (included) - Music Ontology PYthon interface
  • genpuid (included) - MusicIP fingerprinting client
  • rdflib - easy_install rdflib
  • mutagen - easy_install mutagen
  • Musicbrainz2 - You need a version later than 02.08.2007 (sorry)

Thursday 23 August 2007

Small Musicbrainz library for SWI-Prolog

I just put online a really small SWI-Prolog module, allowing you to run some queries against the Musicbrainz web service. It provides the following predicates:

  • find_artist_id(+Name,-ID,-Score), which finds artist IDs given a name, along with a Lucene score
  • find_release_id(+Name,-ID,-Score), which provides the same thing for a release
  • find_track_id(+Name,-ID,-Score), same thing for a track

I wrote only three predicates, because to identify a track, I often found the best way was not to do one single Musicbrainz query with the track name, the artist name, and the release name if it is available, but to do the following:

  • Try to identify the artist
  • For each artist found, try to identify the release (if it's available)
  • For each release try to identify the track

(Which is in fact really similar to what I have done for linking the Jamendo dataset to the Musicbrainz one).

Indeed, when you do a single query, it seems like the Musicbrainz web service does an exact match on the extra arguments, which fails if the album or the artist is badly spelled. And I did not manage to write a good Lucene query that did the identification with such accuracy... I will detail that a bit more when the next-generation GNAT is in a releasable state :) But take care not to flood the Musicbrainz web service! No more than one query per second!

Thursday 5 July 2007

Specification generation script

I just put online a small Prolog script, allowing you to generate an XHTML specification out of an RDF vocabulary (it should work for both RDFS and OWL). It is really similar to the Python script the SIOC community uses (thanks Uldis for the code :-) ), as far as the external behaviour of the script goes. It provides a few enhancements though, like support for the status of terms, OWL constructs, and a few other things. You can check the Music Ontology specification to see an example output, generated from an RDFS/OWL file.

Here is the script.

Monday 11 June 2007

Linking open data: interlinking the Jamendo and the Musicbrainz datasets

This post deals with further interlinking experiences based on the Jamendo dataset, in particular equivalence mining - that is, stating that a resource in the Jamendo dataset is the same as a resource in the Musicbrainz dataset.

For example, we want to derive automatically that http://dbtune.org/jamendo/artist/5 is the same as http://musicbrainz.org/artist/0781a... (I will use this example throughout this post, as it illustrates many of the problems I had to overcome).

Independent artists and the failure of literal lookup

In my previous post, I detailed a linking example which was basically a literal lookup, to get back from a particular string (such as Paris, France) to a URI identifying this geographical location, through a web service (in this case, the Geonames one). This relies on the hypothesis that one literal can be associated with exactly one URI. For example, if the string is just Paris, the linking process fails: should we link to a URI identifying Paris, Texas or Paris, France?

For mainstream artists, having at most one URI in the Musicbrainz dataset associated with a given string seems like a fair assumption. There is no way I could start a band called Metallica, I think :-)

But for independent artists, this is not true... For example, the French band Both has exactly the same name as a US band. We therefore need a disambiguation process here.

Another problem arises when a band in the Jamendo dataset, like NoU, is not in the Musicbrainz dataset, but there is another band called Nou there. We need to throw away such wrong matches.

Disambiguation and propagation

Now, let's try to identify whether http://dbtune.org/jamendo/artist/5 is equivalent to http://zitgist.com/music/artist/078... or http://zitgist.com/music/artist/5f9..., and to check that http://dbtune.org/jamendo/artist/10... is not equivalent to http://zitgist.com/music/artist/7c4....

By GETting these URIs, we can access their RDF descriptions, which are designed according to the Music Ontology. We can use these descriptions to express that, if two artists have produced records with similar names, they are more likely to be the same. This also implies that the matched records are likely to be the same. So, at the same time, we disambiguate and we propagate the equivalence relationships.

Algorithm

This leads us to the following equivalence mining algorithm. We define a predicate similar_to(+Resource1,?Resource2,-Confidence), which captures the notion of similarity between two objects. In our Jamendo/Musicbrainz mapping example, we define this predicate as follows (we use a Prolog-like notation---variables start with an upper-case character, and the mode is given in the head: ?=in or out, +=in, -=out):

     similar_to(+Resource1, -Resource2, -Confidence) is true if
               Resource1 is a mo:MusicArtist
               Resource1 has a name Name
               The musicbrainz web service, when queried with Name, returns ID associated with Confidence
               Resource2 is the concatenation of 'http://zitgist.com/music/artist/' and ID

and

     similar_to(+Resource1, +Resource2, -Confidence) is true if
               Resource1 is a mo:Record or a mo:Track
               Resource2 is a mo:Record or a mo:Track
               Resource1 and Resource2 have a similar title, with a confidence Confidence

Moreover, in the other cases, similar_to is always true, but the confidence is then 0.

Now, we define a path (a set of predicates), which will be used to propagate the equivalence. In our example, it is {foaf:made,mo:has_track}: we are starting from a MusicArtist resource, which made some records, and these records have tracks.

The equivalence mining algorithm is defined as follows. We first run the process depicted here:

Equivalence Mining algorithm

Every newly appearing resource is dereferenced, so the algorithm works in a linked data environment. It just uses one start URI as an input.

Then, we define a mapping as a set of tuples {Uri1,Uri2}, associated with a confidence C, which is the sum of the confidences associated with every tuple. The resulting mapping is the one with the highest confidence (provided it is higher than a threshold, in order to drop wrong matches such as the NoU one mentioned earlier).
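
Here is a rough Python sketch of that scoring step, with a naive string similarity standing in for the actual similar_to predicate; the threshold value and the similarity measure are arbitrary assumptions:

from difflib import SequenceMatcher

def similar_to(label1, label2):
    """Crude stand-in for similar_to/3: a confidence that two labels denote the same thing."""
    return SequenceMatcher(None, label1.lower(), label2.lower()).ratio()

def best_mapping(candidate_mappings, threshold=2.0):
    """Each candidate mapping is a list of (uri1, uri2, label1, label2) tuples gathered
    while following the propagation path; its confidence is the sum of the tuple
    confidences. Keep the best one only if it clears the threshold, so that wrong
    matches (NoU vs. Nou) get dropped."""
    scored = [(sum(similar_to(l1, l2) for _, _, l1, l2 in tuples), tuples)
              for tuples in candidate_mappings]
    if not scored:
        return None
    confidence, tuples = max(scored, key=lambda pair: pair[0])
    if confidence < threshold:
        return None
    return [(u1, u2) for u1, u2, _, _ in tuples], confidence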

Implementation

I wrote an implementation of such an algorithm, using SWI-Prolog (everything is GPL'd). In order to make it run, you need the CVS version of SWI, compiled with the http, semweb and nlp packages. You can test it by loading ldmapper.pl in SWI and then running:

?- mapping('http://dbtune.org/jamendo/artist/5',Mapping).

To adapt it to other datasets, you just have to add some similar_to clauses, and define which path you want to follow. Or, if you are not a Prolog geek, just give me a list of URIs you want to map, along with a path and the sort of similarity you want to introduce: I'll be happy to do it!

Results

I experimented with this implementation in order to automatically link together the Jamendo and Musicbrainz datasets. As the current implementation is not multi-threaded (it runs the algorithm on one artist after another), it is a bit slow (one day to link the entire dataset). It derived 1702 equivalence statements (these statements are available there), distributed over tracks, artists and records, and it spotted with good confidence that every other artist/track/record in Jamendo is not referenced within Musicbrainz.

Here are some examples:
