DBTune blog

Monday 29 September 2008

D2R server, SNORQL and Firefox 3

In case it might be useful for someone else (I've had several requests for it offline), here is a small patch to make D2R server work with the latest versions of ARQ, in order to make the SNORQL SPARQL explorer work in Firefox 3. I sent the patch to Richard some time ago, so hopefully the newest D2R should work with the latest versions of ARQ.

Oh, and I've finished (well, almost, just a couple more lines to add to the conclusion) writing up, and started this morning at the BBC!

Sunday 7 September 2008

DBTune wins the second prize in the Triplify challenge!

I submitted DBTune to the Triplify challenge, a couple of months ago. The text of the submission is there. The results of the challenge were given on Friday, at the i-semantics conference. Many many thanks to Michael Hausenblas for representing DBTune there!

And, DBTune won the second prize! Here is a picture of the prize ceremony:

Congratulations to the winners, LinkedMDB, for their amazing work and well-deserved prize, and many thanks to Sören Auer for organizing the challenge!

Wednesday 3 September 2008

Good-bye C4DM, hello BBC!

I've been rather quiet for the last month: intense PhD writing. I have been trying to get it fully written by the end of September. Indeed, I will be joining BBC Audio & Music at the end of the month. I am really really excited about that! Of course, I am a bit sad to leave the Centre for Digital Music, after three fantastic years spent there: great people, great work, great projects, great art and great beer :-)

Thursday 31 July 2008

Semantic search on aggregated music data

I just moved the semantic search demo to a faster server, so it should hopefully be a lot more reliable. This demo uses the amazing ClioPatria on top of an aggregation of music-related data. This aggregation was simply constructed by taking a bunch of Creative Commons MP3s, running GNAT on them, and crawling linked data starting from the web identifiers output by GNAT.

I also set up the search tab to work correctly. For example, when you search for "punk", you get the following results.

Punk search 1

Punk search 2

Note that the results are explained: "punk" might be related to the title, the biography, a tag, the lyrics, content-based similarity to something tagged as punk (although it looks like Henry crashed in the middle of the aggregation, so not a lot of such data is available yet), etc. Moreover, you get back different types of resources: artists, records, tracks, lyrics, performances etc.

For example, if you click on one of the records, you get the following.

Punk search 3

This record is available under a Creative Commons license, so you can get direct access to the corresponding XSPF playlist, BitTorrent items etc., by following the Music Ontology "available as" property. For example, you can click on an XSPF playlist, and listen to the selected record.

Punk search 4

Of course, you can still do the previous things - plotting music artists (or search results, just take a look at the "view" drop-down box) on a map, on a time-line, browse using facets, etc.

Btw, if you like DBTune, please vote for it in the Triplify Challenge! :-)

Wednesday 30 July 2008

Last.fm events and DBpedia mobile

For a recent event at the Dana Centre, I was asked to make a small demo of some nice things you can do with Semantic Web technologies. As it is no fun to re-use demos, I decided to go for something new. So after two hours of hacking and skyping with Christian Becker, we added support for recommended events to the last.fm linked data exporter. I also implemented a bit of geo-coding on the server side (although, with the new last.fm API, I guess this part is becoming useless).

Then, thanks to RDF goodness, it was really straightforward to make that work with DBpedia mobile. DBpedia mobile is a service that gets your geo-location from your mobile device and shows you a map with nearby sights, using data from DBpedia. DBpedia mobile also uses the RDF cache of a really nice linked data browser called Marbles.

So, after browsing your DBTune last-fm URI in Marbles, you can go to DBpedia mobile and see recommended events alongside nearby sights. To do so, select the Performances (by moustaki) filter. Here is what I get for my profile, when at the university:

DBpedia mobile and last.fm events

Monday 28 July 2008

Music Ontology linked data on BBC.co.uk/music

Just a couple of minutes ago on the Music Ontology mailing list, Nicholas Humfrey from the BBC announced the availability of linked data on BBC Music.

$ rapper -o turtle \
   http://www.bbc.co.uk/music/artists/cc197bad-dc9c-440d-a5b5-d52ba2e14234

[...]
<http://www.bbc.co.uk/music/artists/cc197bad-dc9c-440d-a5b5-d52ba2e14234#artist>
   a mo:MusicGroup;
   foaf:name "Coldplay";
   owl:sameAs <http://dbpedia.org/resource/Coldplay>;
   mo:member
<http://www.bbc.co.uk/music/artists/18690715-59fa-4e4d-bcf3-8025cf1c23e0#artist>,
<http://www.bbc.co.uk/music/artists/d156ceb2-fd90-4e82-baea-829bbdf1c127#artist>,
<http://www.bbc.co.uk/music/artists/6953c4db-7214-4724-a140-e87550bde420#artist>,
<http://www.bbc.co.uk/music/artists/98d1ec5a-dd97-4c0b-9c83-7928aac89bca#artist>
[...]

This is just really, really, really great... Congratulations to the /music team!

Update: Tom Scott just wrote a really nice post about the new BBC music site, explaining what the BBC is trying to achieve by going down the linked data path.

Sunday 27 July 2008

Musicbrainz RDF updated

Well, I guess everything is in the title :-) The dump used is now that of the 26th of July. I also moved everything to a much faster server. Also, the D2R mapping is still not 100% complete - I am really slowly getting through it, as PhD writing takes almost all my time these days. I recently added owl:sameAs links to the DBTune Myspace service, so you can easily get from Musicbrainz artists to the corresponding MP3s available on MySpace and their social networks. See for example Madonna, linked through owl:sameAs to the corresponding DBpedia artist and to the corresponding Myspace artist.

Friday 25 July 2008

List of accepted ISMIR 2008 papers

Just spotted through Paul's blog: the list of accepted ISMIR 2008 papers is now available online. All the papers sound really interesting, so I guess it will be a really good ISMIR!! I am especially glad to see that the Variations3 people will present their work on FRBR-based musical metadata. They seem to have done a lot of interesting things over the last year! I also hope we can make things connect in some ways with MO, thanks to this common FRBR backbone.

Anyway, I can't wait for the actual proceedings which, apparently, will be available online prior to the conference. Quite a few of the selected papers are already available on the Web as pre-prints, though (this really interesting one from Patrick Rabbat and Francois Pachet, for example).

I should have uploaded it earlier, but here is the paper we wrote with Mark Sandler. It describes all the structured data publishing and interlinking work we've been doing over the last year, based on the Music Ontology framework we described last year. We tried to illustrate that by (hopefully) fun examples (Mozart and Metallica are closer than you think... :-) ). It also describes a SPARQL-based web service for feature extraction, driven by workflows written in N3.

Thursday 17 July 2008

Literal search using the Jamendo SPARQL end-point

I just wrote a small SWI-Prolog module for literal search using the ClioPatria SPARQL end-point. It uses the rdf_litindex module, and performs a metaphone search on existing literals in the database. All of that is triggered through a built-in RDF predicate.

Here is an example query you can perform on the Jamendo SPARQL end-point (make sure you select lit as the entailment - it will be the default one soon):

SELECT ?o
WHERE
{"punk jazz" <http://purl.org/ontology/swi#soundslike> ?o}

This query binds ?o to all resources within the end-point that are associated with matching literals. For example, you would get back:

The module is available there.

Wednesday 9 July 2008

Nominated!

We learned yesterday that DBTune was nominated for the Triplify Challenge! The other seven projects are really interesting as well, so I guess the competition will be really high! The final results will be given at the I-Semantics conference in early September.

Also, Tim Berners-Lee gave a great talk about linked data and the semantic web on Radio 4 earlier today. The first use-case he mentions sounds quite familiar: finding bands based on geo-location data. He already mentioned that in one of his blog posts, linking to this screencast.

An interesting discussion took place on the Linking Open Data mailing list just afterwards, to gather use-cases for explaining to a general public what linked data can be useful for.

Tuesday 1 July 2008

Echonest Analyze XML to Music Ontology RDF

I wrote a small XSL stylesheet to transform the XML results of the Echonest Analyze API to Music Ontology RDF. The Echonest Analyze API is a really great (and simple) web service to process audio files and get back an XML document describing some of their features (rhythm, structure, pitch, timbre, etc.). A lot of people already did really great things with it, from collection management to visualisation.

The XSL is available on that page. The resulting RDF can be queried using SPARQL. For example, the following query selects the boundaries of structural segments (chorus, verse, etc.):

PREFIX af: <http://purl.org/ontology/af/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>

SELECT ?start ?duration
FROM <http://dbtune.org/echonest/analyze-example.rdf>
WHERE
{
?e      a af:StructuralSegment;
        event:time ?time.
?time   tl:start ?start;
        tl:duration ?duration.
}

I also added on that page the small bit to add to the Echonest Analyze XML to make it GRDDL-ready. That means that the XML document can be automatically translated to actual RDF data (which can then be aggregated, stored, linked to, queried, etc.).

<Analysis    xmlns:grddl="http://www.w3.org/2003/g/data-view#" 
                grddl:transformation="http://dbtune.org/echonest/echonest.xsl">
...
</Analysis>
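
To make the GRDDL step a bit more concrete, here is a minimal Python sketch of how a GRDDL-aware client might discover which transformation to apply. It only reads the grddl:transformation attribute from the root element; actually running the XSL would need an XSLT processor, which is not part of the Python standard library. The function name is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the GRDDL data-view vocabulary, as used in the snippet above.
GRDDL_NS = "http://www.w3.org/2003/g/data-view#"


def grddl_transformations(xml_text):
    """Return the transformation URIs declared on the root element, if any.

    Hypothetical helper: it only discovers the stylesheet URIs,
    it does not fetch or apply them.
    """
    root = ET.fromstring(xml_text)
    value = root.get("{%s}transformation" % GRDDL_NS, "")
    # The attribute may hold several whitespace-separated URIs.
    return value.split()


sample = """<Analysis xmlns:grddl="http://www.w3.org/2003/g/data-view#"
    grddl:transformation="http://dbtune.org/echonest/echonest.xsl">
</Analysis>"""

print(grddl_transformations(sample))
# ['http://dbtune.org/echonest/echonest.xsl']
```

A document without the attribute simply yields an empty list, so a client can fall back to treating it as plain XML.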

This provides a lot more data to aggregate for describing my music collection!

If there is one thing I really wish could be integrated in the Echonest API, it would be a Musicbrainz lookup... Right now, I have to manually link the data I get from it to the rest of my aggregated data. If the Echonest results could include a link to the corresponding Musicbrainz resource, it would really simplify this step :-)

Wednesday 25 June 2008

Linking Open Data: BBC playcount data as linked data

For the Mashed event this weekend, the BBC released some really interesting data. This includes playcount data, stating how often an artist is featured within particular BBC programmes (at the brand or episode level).

During the event, I wrote some RDF translators for this data, linking web identifiers in the DBTune Musicbrainz linked data to web identifiers in the BBC Programmes linked data. We used it with Kurt and Ben in our hack. Ben made a nice write-up about it. By finding web identifiers for tracks in a collection and following links to the BBC Programmes data, and finally connecting this Programmes data to the box holding all recorded BBC radio programmes over a year that was available at the event, we can quite easily generate playlists from an audio collection. Two python scripts implementing this mechanism are available there. The first one uses solely brands data, whereas the second one uses episodes data (and therefore helps to get fewer and more accurate items in the resulting playlist). Finally, the thing we spent the most time on was the SQLite storage for our RDF cache :-)

This morning, I published the playcount data as linked data. I wrote a new DBTune service for that. It publishes a set of web identifiers for playcount data, interlinking Musicbrainz and BBC Programmes. I also put online a SPARQL end-point holding all this playcount data along with aggregated data from Musicbrainz and the BBC Programmes linked data (around 2 million triples overall).

For example, you can try the following SPARQL query:

SELECT ?brand ?title ?count
WHERE {
   ?artist a mo:MusicArtist;
      foaf:name "The Beatles". 
   ?pc pc:object ?artist;
       pc:count ?count.
   ?brand a po:Brand;
       pc:playcount ?pc;
       dc:title ?title 
    FILTER (?count>10)}

This will return every BBC brand that has featured The Beatles more than 10 times.

Thanks to Nicholas and Patrick for their help!

Mashed!

I was at Mashed (the former Hack Day) this weekend - a really good and geeky event, organised by the BBC at Alexandra Palace. We arrived on the Saturday morning for some talks, detailing the different things we'd be able to play with over the weekend. Amongst these, a full DVB-T multiplex (apparently, it was the first time since 1956 that a TV signal was broadcast from Alexandra Palace), lots of data from the BBC Programmes team and a box full of recorded radio content over the last year.

After these presentations, the 24-hour hacking session began. We sat down with Kurt and Ben and wrote a small hack which basically starts from a personal music collection and creates a playlist of recorded BBC programmes for you. I will write a bit more about this later today.

During the 24-hour hack, we had a Rock Band session on a big screen, a real-world Tron game (basically, two guys running with GPS phones, guided by two people watching their trail on a Google satellite map :-) ), a rocket launch...

Finally, at 2pm on the Sunday, people presented their hacks. Almost 50 hacks were presented, all extremely interesting. Take a look at the complete list of hacks! On the music side, Patrick's recommender was particularly interesting. It used Latent Semantic Analysis on playcount data for artists in BBC brands and episodes to recommend brands from artists or artists from artists. It gave some surprising results :-) Jamie Munroe resurrected the FPFF Musicbrainz fingerprinting algorithm (which was apparently due to replace the old TRM one before MusicIP offered their services to Musicbrainz) to identify tracks played several times in BBC programmes. The WeDoID3 team talked about creating RSS feeds from embedded metadata in audio and video, but the demo didn't work.

My personal highlight was the hack (which actually won a prize) from Team Bob. Here is a screencast of it:


BBC Dylan - News 24 Revisited (Clip) from James Adam on Vimeo.

Thanks to Matthew Cashmore and the rest of the BBC backstage team for this great event! (and thanks to the sponsors for all the free stuff - I think I have enough T-shirts for about a year now :-))

Thursday 12 June 2008

Describing the content of RDF datasets

There seems to be an overall consensus in the Linking Open Data community that we need a way to describe in RDF the different datasets published and interlinked within the project. There is already a Wiki page detailing some aspects of the corresponding vocabulary, called voiD (vocabulary of interlinked datasets).

One thing I would really like this vocabulary to do would be to describe exactly the inner content of a dataset - what could we find in this SPARQL end-point or in this RDF document? I thought quite a lot about this recently, as I am beginning to really need that. Indeed, when you have RDF documents describing lots of audio annotations, whose generation is really computationally intensive, you want to pick just the one that fits your request. There have been quite a lot of similar efforts in the past. However, most of them rely on one sort of reification or another, which makes them quite hard to actually use.

After some failed tries, I came up with the following, which I hope is easy and expressive enough :-)

It relies on a single property void:example, which links a resource identifying a particular dataset to a small RDF document holding an example of what you could find in that dataset. Then, with just a bit of SPARQL magic, you can easily query for datasets having a particular capability. Easy, isn't it? :-)

Here is a real-world example of that. A first RDF document describes one of the DBTune datasets:

:ds1
        a void:Dataset;
        rdfs:label "Jamendo end-point on DBtune";
        dc:source <http://jamendo.com/>;
        foaf:maker <http://moustaki.org/foaf.rdf#moustaki>;
        void:sparql_end_point <http://dbtune.org:2105/sparql/>;
        void:example <http://moustaki.org/void/jamendo_example.n3>;
        .

The void:example property points towards a small RDF file, giving an example of what you can find within this dataset.

Then, the following SPARQL query asks whether this dataset has a SPARQL end-point and holds information about music records, associated tags, and places to download them.

PREFIX void: <http://purl.org/ontology/void#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX tags: <http://www.holygoat.co.uk/owl/redwood/0.1/tags/>

ASK
FROM NAMED <http://moustaki.org/void/void.n3>
FROM NAMED <http://moustaki.org/void/jamendo_example.n3>
{
        GRAPH <http://moustaki.org/void/void.n3> {
                ?ds a void:Dataset;
                        void:sparql_end_point ?sparql;
                        void:example ?ex.
        }
        GRAPH ?ex {
                ?r a mo:Record;
                        mo:available_as ?l;
                        tags:taggedWithTag ?t.
        }
}

I tried this query with ARQ, and it works perfectly :-)

$ sparql --query=void.sparql
Ask => Yes
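
For readers without a SPARQL engine at hand, the logic of the query can be roughly sketched in plain Python. This is only an illustration of the idea, not how ARQ evaluates it: the matcher below is a naive basic-graph-pattern matcher, prefixed names are kept as plain strings, and the triples standing in for the example document (ex:record, ex:download, ex:tag) are made up, since its actual content is not reproduced in this post.

```python
def is_var(term):
    """Variables are written SPARQL-style, with a leading question mark."""
    return term.startswith("?")


def match(patterns, triples, binding=None):
    """Yield each variable binding under which all patterns occur in triples."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(binding)
        ok = True
        for p, t in zip(first, triple):
            if is_var(p):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(p, t) != t:
                    ok = False
                    break
            elif p != t:
                ok = False
                break
        if ok:
            yield from match(rest, triples, new)


# Stand-in for the voiD description (mirrors the :ds1 description above).
void_doc = {
    (":ds1", "rdf:type", "void:Dataset"),
    (":ds1", "void:sparql_end_point", "http://dbtune.org:2105/sparql/"),
    (":ds1", "void:example", "http://moustaki.org/void/jamendo_example.n3"),
}

# Hypothetical content for the example document, keyed by its URI.
examples = {
    "http://moustaki.org/void/jamendo_example.n3": {
        ("ex:record", "rdf:type", "mo:Record"),
        ("ex:record", "mo:available_as", "ex:download"),
        ("ex:record", "tags:taggedWithTag", "ex:tag"),
    }
}

DATASET_PATTERN = [
    ("?ds", "rdf:type", "void:Dataset"),
    ("?ds", "void:sparql_end_point", "?sparql"),
    ("?ds", "void:example", "?ex"),
]


def datasets_with(capability):
    """Datasets with an end-point whose example graph matches the capability."""
    found = []
    for b in match(DATASET_PATTERN, void_doc):
        example = examples.get(b["?ex"], set())
        if next(match(capability, example), None) is not None:
            found.append(b["?ds"])
    return found


records_with_tags_and_downloads = [
    ("?r", "rdf:type", "mo:Record"),
    ("?r", "mo:available_as", "?l"),
    ("?r", "tags:taggedWithTag", "?t"),
]
print(datasets_with(records_with_tags_and_downloads))  # [':ds1']
```

The point of the design is that the capability is expressed as an ordinary graph pattern, so no reification or dedicated "capability" vocabulary is needed: whatever you can ask of the dataset, you can ask of its example.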

Update: It also works with ARC2, although ARC2 does not automatically load the documents named in the SPARQL FROM clauses. You can try the same query on this SPARQL end-point, into which the two documents (the voiD description and the example) were loaded beforehand.

Update 2: A nice blog post about automatically generating the data you need for describing an end-point - thanks shellac for the pointer!

Update 3: Following discussion on the #swig IRC channel.

Tuesday 3 June 2008

Sorted Sound at the Dana Centre

If you're around London on Thursday, a couple of people from the Centre for Digital Music in Queen Mary, University of London (including myself) will talk about the research we do in music technologies, at the Dana Centre in South Kensington.

The event description mainly focuses on search. Kurt will indeed demo Soundbite, and Ben and Michela from Goldsmiths College will demo a fast content-based search on large music databases. However, Chris, Katy and Matthew will demo the Sonic Visualiser, a great piece of open source software to analyse and visualise audio data. I will talk about Semantic Web technologies, in particular the Music Ontology and Linked Data. I will also be demoing some things related to organising music collections using Semantic Web data, and user interfaces to interact with them in unusual ways. As Tom puts it, the Semantic Web is not all about search :-)

Tuesday 20 May 2008

Ceriese: RDF translator for Eurostat data

Riese logo

Some time ago, I did a bit of work on the RIESE project, aiming at publishing the Eurostat dataset on the Semantic Web, and interlinking it with further datasets (eg. Geonames and DBpedia). This can look a bit far from my interests in music data, but there is a connexion which illustrates the power of linked data, as explained at the end of this post.

Original data

There are three distinct things we consider in the Eurostat dataset:

  • A table of content in HTML defining the hierarchical structure of the Eurostat datasets;
  • Tab-separated values dictionary files defining the ~80 000 data codes used in the dataset (eg. "eu27" for the European Union of 27 countries);
  • The actual statistical data, in tab-separated values files. Around 4000 datasets for roughly 350 million statistical items.

Ontology

The first thing we need to figure out when exposing data on the Semantic Web is the model we'll link to. This led to the design of SCOVO (Statistical Core Vocabulary). The concepts in this ontology can be depicted as follows:

SCOVO ontology

The interesting thing about this model is that the statistical item is considered as a primary entity. We used the Event ontology as a basis - a statistical item is a particular classification of a space/time region. This allows us to be really flexible and extensible. We can for example attach multiple dimensions to a particular item, resources pertaining to its creation, etc.

RDF back-end

I wanted to see how to publish such large amounts of RDF data and how my publication tools perform, so I designed the Ceriese software to handle that.

The first real concern when dealing with such large amounts of data is, of course, scalability. The overall Eurostat dataset is over 3 billion triples. Given that we don't have high-end machines with lots of memory, using the core SWI RDF store was out of the question (I wonder if any in-memory triple store can handle 1 billion triples, btw).

So there are three choices at this point:

  • Use a database-backed triple store;
  • Dump static RDF files in a file-system served through Apache;
  • Generate the RDF on-the-fly.

We don't have the sort of money it takes (for both the hardware and the software) for the first choice to really scale, so we tried the second and the third solution. I took my old Prolog-2-RDF software that I am using to publish the Jamendo dataset and we wrote some P2R mapping files converting the tab-separated value files. Then, we made P2R dump small RDF files in a file-system hierarchy, corresponding to the description of the different web resources we wanted to publish. Then, some Apache tweaks and Michael and Wolfgang's work on XHTML/RDFa publishing were enough to make the dataset available in the web of data.

But this approach had two main problems. First, it took ages to run this RDF dump, so we never actually managed to complete it. Second, it was impossible to provide a SPARQL querying facility, so no aggregation of data was made available.

So we eventually settled on the third solution. I put on my Prolog hacker hat, and tried to optimise P2R to make it fast enough. I did it by using the same trick I used in my small N3 reasoner, Henry. P2R mappings are compiled as native Prolog clauses (rdf(S,P,O) :- ... ), which cuts down the search space a lot. TSV files are accessed within those rules and parsed on-the-fly. The location of the TSV file to access is derived from a small in-memory index. Parsed TSV files are cached for the duration of a query, to avoid parsing the same file again for different triple patterns in that query.
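
For those who don't read Prolog, here is a very rough Python analogue of the mechanism, not the actual P2R code: triples are generated lazily from a TSV source located through a small in-memory index, and the parsed file is cached for the lifetime of one query. The TSV layout, the dataset code "demo_pop", the figures and the identifier scheme are all made up for the illustration, and the TSV content is inlined where the real index would point to files.

```python
import csv
import io

# Hypothetical in-memory index: dataset code -> TSV content.
# In the real setting this would map to a file location instead.
TSV_SOURCES = {
    "demo_pop": "geo\t2006\t2007\nuk\t60.4\t60.8\nfr\t63.2\t63.6\n",
}


class Query:
    """Holds the parsed-file cache for the lifetime of a single query."""

    def __init__(self):
        self.cache = {}
        self.parse_count = 0  # only here to show the cache at work

    def rows(self, dataset):
        # Parse a TSV source at most once per query.
        if dataset not in self.cache:
            self.parse_count += 1
            source = io.StringIO(TSV_SOURCES[dataset])
            self.cache[dataset] = list(csv.DictReader(source, delimiter="\t"))
        return self.cache[dataset]

    def triples(self, dataset, predicate=None):
        """Lazily generate (subject, predicate, object) triples; none are stored."""
        for row in self.rows(dataset):
            for year, value in row.items():
                if year == "geo":
                    continue
                subject = "item/%s/%s/%s" % (dataset, row["geo"], year)
                for p, o in (("value", value), ("geo", row["geo"]), ("year", year)):
                    if predicate is None or p == predicate:
                        yield (subject, p, o)


q = Query()
values = list(q.triples("demo_pop", "value"))  # first triple pattern
places = list(q.triples("demo_pop", "geo"))    # second pattern, same query
print(len(values), q.parse_count)  # 4 1
```

Two triple patterns hit the same source but it is parsed only once; a new Query object starts with an empty cache, which keeps memory use bounded between queries.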

The same mechanisms are applied to derive a SKOS hierarchy from the HTML table of contents.

Now, a DESCRIBE query takes less than 0.5 seconds for any item, on my laptop. Not perfect, but, still...

One way to improve the access time a lot would be to dump the TSV files in a relational database, and access this database in our P2R mappings instead of the raw TSV files.

Trying it out and creating your own SPARQL end-point from Eurostat data is really easy.

  • Get SWI-Prolog;
  • Get the software from there;
  • Get the raw Eurostat data (or get it from the Eurostat web-site, as this one can be slightly out-dated);
  • Put it in data/, in your Ceriese directory;
  • Launch start.pl;
  • Go to http://localhost:3020/

Now what?

Michael and Wolfgang did amazing work putting together a really nice web interface, publishing the data in XHTML+RDFa. They also included some interlinking, especially for geographical locations, which are now linked to Geonames.

So what's the point from my music geek point of view?? Well, now, after aggregating Semantic Web data about my music collection (using these tools), I can sort hip-hop artists by murder rates in their city :-) This is quite fun as it is (especially as the Eurostat dataset holds a really diverse range of statistics), but it would be really interesting to mine that to get some associations between statistical data and musical facts. This would surely lead to interesting sociological results (eg. how does musical "genre" associate with particular socio-economic indicators?)

Wednesday 14 May 2008

Data-rich music collection management

I just put a live demo of something I showed earlier on this blog.

You can explore the Creative Commons-licensed part of my music collection (mainly coming from Jamendo) using aggregated Semantic Web data.

For example, here is what you get after clicking on "map" on the right-hand side and "MusicArtist" on the left-hand side:

Data-rich music collection management

The aggregation is done using the GNAT and GNARQL tools available in the motools sourceforge project. The data comes from datasets within the Linking Open Data project. The UI is done by the amazing ClioPatria software, with very little configuration.

An interesting thing is to load this demo into Songbird, as it can aggregate and play the audio as you crawl around.

Check the demo!

Update: It looks like it doesn't work with IE, but it is fine with Opera and FF2 or FF3. If the map doesn't load at first, just try again and it should be ok.

Monday 12 May 2008

Linked Data on the Web 2008

I just got back from Beijing (I did a two-week trip around China after the actual conference), where I attended the Linked Data on the Web workshop and the WWW conference.

The workshop was really good, gathering lots of people from the Linking Open Data community (it was the first time I met most of these people, after more than one year working with them :-) ).

The attendance was much higher than expected, with around 100 people registered for the workshop.

It started well with this sentence by Tim Berners-Lee in the workshop introduction:

Linked Data is the Semantic Web done right, and the Web done right.

That's a pretty good way to start a day :-) Then, Chris Bizer did a good overview of what the community has achieved in one year, illustrated by the different versions of Richard's diagram:

All the talks and papers were extremely high quality. I got particularly interested by some of them, including Tim's presentation on the new SPARQL/Update capabilities of the Tabulator data browser. This allows easy interaction with data wikis, where everyone can add or correct information.

I really liked Alexandre Passant's presentation on the Flickr exporter, which highlights a mechanism that I used for the Last.fm linked data exporter: linking several identities on several web-sites is just an owl:sameAs link away. Alexandre also did another presentation on MOAT (Meaning of a Tag), a really interesting project which lets you relate tags to Semantic Web URIs. For example, it makes it easy to draw a link between my tag "paris texas" and the movie Paris, Texas in DBpedia.

I got a bit confused by Paul Miller's presentation about licensing open data. I have been aware of these efforts mainly through the work of the Open Knowledge Foundation and the Open Data Commons project, and I think these are truly crucial issues: we need open data, and explicit licensing. But perhaps the audience was not so well chosen: most (if not all) of us in the Linking Open Data community do not own the data we publish as RDF and interlink. DBpedia exports data extracted from Wikipedia, DBTune exports data from different music-related sources such as Jamendo or Last.fm, etc. The only data that we can possibly explicitly license are links (the only thing we actually own), and they do not have any value without the data :-) So I guess the outreach should mainly be done to raw data publishers rather than Semantic Web translators? But hopefully, in the near future, the two communities will be the same!

One of my personal highlights was also Christian Becker's presentation about DBpedia mobile: a location-enabled linked data browser for mobile devices, giving you nearby sights and detailed descriptions, restaurants, hotels, etc. We chatted a bit after the workshop with Alexandre and Christian about adding Last.fm events to the DBTune exporter to also display nearby gigs (with optional filtering based on your foaf:interests, of course :-) ).

Jun Zhao's presentation about linked data and provenance for biological resources was extremely interesting: they are dealing with problems strongly similar to ours in a Music Information Retrieval context. How to trust a particular statement (for example, a structural segmentation of a particular track) found on the web? We need to know whether it was written by a human, or derived through a set of algorithms, and in this case, we might want to choose timbre-based instead of chroma-based workflows in the case of Rock music, for example. This is the sort of thing we implemented within our Henry software (more to come on that later, including an online demo as soon as I put it on better hardware, and (hopefully) a PhD :-D ).

Wolfgang Halb did a presentation about our Riese project, but more on that later as I wrote the back-end software powering it and I'd like to give it a full blog entry soon.

I did a presentation about automatic interlinking algorithms on the data web, with a focus on music-related datasets. I detailed an algorithm we developed for this purpose, propagating similarity measures around web data as long as we can't take an interlinking (creating a bunch of owl:sameAs links) decision. This algorithm is good in the sense that it gives a really low rate of false-positives. On the test-set detailed in the paper, it made no wrong decisions. I blogged about this algorithm earlier.

Some people expressed concerns about the proliferation of owl:sameAs links (highlighted in this presentation by Paolo Bouquet). But I truly think it is a necessary thing, as long as web identifiers are tied to their actual representation. I need to be able to have a web identifier for a song in Jamendo and a web identifier for the same song in Musicbrainz, and I need a way to link these together: owl:sameAs is perfect for that. I wouldn't trust a centralised "identity" system (what actually is identity anyway? :-) ), as it would break the nice decentralised information paradigm we're implementing within the Linking Open Data project.

Anyway, lots of great people, a great time, lots of interesting discussions and new ideas... I am really looking forward to WWW 2009 in Madrid and the next workshop!!!

Monday 7 April 2008

D2RQ mapping for Musicbrainz

I just started a D2R mapping for Musicbrainz, which makes it fairly easy to create a SPARQL end-point and to provide linked data access out of Musicbrainz. A D2R instance loaded with the mapping as it is now is also available (be gentle, it is running on a cheap computer :-) ).

In addition to what is available within the Zitgist mapping, it provides:

  • A SPARQL end-point;
  • Support for tags;
  • Support for a couple of advanced relationships (still working my way through them, though);
  • An instrument taxonomy directly generated from the database, and related to performance events;
  • Support for orchestras;
  • Links to DBpedia for places and to Lingvoj for languages.

There is still a lot to do, though: it is really a start. The mapping is available on the motools sourceforge project. I hope to post a follow-up soon! (including examples of funny SPARQL queries :-) ).

Update: For some obscure port-forwarding reasons, the SNORQL interface to the SPARQL end point does not work on the test server.

Update 2: This is fixed. (thanks to the anonymous SPARQL crash tester who helped me find the bug, by the way :-) )

Wednesday 2 April 2008

13.1 billion triples

After a rough estimation, it looks like the services hosted on DBTune provide access to 13.1 billion triples, therefore making a significant addition to the data web!

Here is the break-down of such an estimation:

  • MySpace: 250 million people * 50 triples (on average) = 12.5 billion triples ;
  • AudioScrobbler: 1.5 million users (only Europe?) * 400 triples = 600 million ;
  • Jamendo: 1.1 million triples + 5000 links to other data sources ;
  • Magnatune: 322 000 triples + 233 links ;
  • BBC John Peel sessions: 277 000 triples + 2100 links ;
  • Chord URI service: I don't count it, as it is potentially infinite (the RDF descriptions are generated from the chord symbol in the URI).

However, SPARQL end-points are not available for AudioScrobbler and MySpace, as the RDF is generated on-the-fly, from the XML feeds for the former, and from scraping for the latter.
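
For the record, the back-of-the-envelope sum behind the headline figure can be reproduced in a few lines of Python. The numbers are the ones from the list above; the links to other data sources and the Chord URI service are left out, as they do not change the rounded total.

```python
MILLION = 10**6

# Per-service estimates, taken from the break-down above.
estimates = {
    "MySpace": 250 * MILLION * 50,      # 250 million people * 50 triples
    "AudioScrobbler": 1_500_000 * 400,  # 1.5 million users * 400 triples
    "Jamendo": 1_100_000,
    "Magnatune": 322_000,
    "BBC John Peel sessions": 277_000,
}

total = sum(estimates.values())
print(total)                     # 13101699000
print(round(total / 10**9, 1))   # 13.1
```

MySpace and AudioScrobbler alone account for the 13.1 billion; the three dumped datasets add less than two million triples between them.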

Now, I wish linked data could be provided directly by the data sources themselves :-) (Again, all the code used to run the DBTune services is available in the motools project on Sourceforge).
