DBTune blog


Tag - sparql


Thursday 14 January 2010

Live SPARQL end-point for BBC Programmes

Update: We seem to have an issue with the 4store instance hosting the dataset, so the data has been stale since the end of February. Update 2: All should be back to normal and in sync. Please comment on this post if you spot any issue, or general slowness.

Last year, we got OpenLink and Talis to crawl BBC Programmes and provide two SPARQL end-points on top of the aggregated data. However, because the data was gathered by crawling, the end-points did not hold all of it, and it could get quite outdated -- especially as our programme data changes a lot.

At the moment, our data comes from two sources: PIPs (the central programme database at the BBC) and PIT (our content management system for programme information). In order to populate the /programmes database, we monitor changes on these two sources and replicate them in our database. We have a small piece of Ruby/ActiveRecord software (which we call the Tapp) that handles this process.

I ran a small experiment: converting our ActiveRecord objects to RDF and firing an HTTP POST or HTTP DELETE request at a 4store instance for each change we receive. This means the 4store instance is kept in sync with the upstream data sources.

It took a while to backfill, but it is now up-to-date. Check out the SPARQL end-point, a test SPARQL query form and the size of the endpoint (currently about 44 million triples).

The end-point holds all information about services, programmes, categories, versions, broadcasts, ondemands, time intervals and segments, as defined within the Programme Ontology. All of these resources are held within their own named graph, which means we have a very large number of graphs (about 5 million). It makes it far easier to update the endpoint, as we can just replace the whole graph whenever something changes for a resource.
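With one named graph per resource, replaying an upstream change boils down to a single HTTP call per resource. The sketch below illustrates the idea in plain Python; the base URL is a placeholder, the "/data/<graph-uri>" addressing is an assumption based on 4store's HTTP server interface, and whole-graph replacement is shown as a PUT (4store also accepts POSTs) -- this is not the Tapp's actual code.

```python
from typing import Optional, Tuple

# ASSUMPTION: placeholder base URL for a 4store SPARQL server.
FOURSTORE = "http://example.org:8000"

def graph_url(resource_uri: str) -> str:
    """Address of the named graph holding one resource's triples
    (4store exposes named graphs under /data/<graph-uri>)."""
    return FOURSTORE + "/data/" + resource_uri

def sync_change(resource_uri: str, rdf_document: Optional[str]) -> Tuple[str, str]:
    """Map one upstream change to an HTTP call: replace the resource's
    whole graph when it changed, delete the graph when it disappeared."""
    if rdf_document is None:
        return ("DELETE", graph_url(resource_uri))
    return ("PUT", graph_url(resource_uri))
```

Because the graph is replaced wholesale, there is no need to diff the old and new RDF descriptions of a resource.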

This is still highly experimental though, and I have already found a few bugs: some episodes seem to be missing (for example, some Strictly Come Dancing episodes, for some reason). I have also encountered some really weird crashes of the machine hosting the end-point when concurrently pushing a large number of RDF documents at it -- I haven't yet managed to identify the cause. To summarise: it might die without notice :-)

Here are some example SPARQL queries:

  • All programmes related to James Bond:
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?uri ?label
WHERE {
  ?uri po:category 
    <http://www.bbc.co.uk/programmes/people/bmFtZS9ib25kLCBqYW1lcyAobm8gcXVhbGlmaWVyKQ#person> ; rdfs:label ?label
}
  • Find all EastEnders broadcast dates after 2009-01-01, along with the type of the version that was broadcast:
PREFIX event: <http://purl.org/NET/c4dm/event.owl#> 
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#> 
PREFIX po: <http://purl.org/ontology/po/> 
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?version_type ?broadcast_start
WHERE
{ <http://www.bbc.co.uk/programmes/b006m86d#programme> po:episode ?episode .
  ?episode po:version ?version .
  ?version a ?version_type .
  ?broadcast po:broadcast_of ?version .
  ?broadcast event:time ?time .
  ?time tl:start ?broadcast_start .
  FILTER ((?version_type != <http://purl.org/ontology/po/Version>) && (?broadcast_start > "2009-01-01T00:00:00Z"^^xsd:dateTime))}
  • Find all programmes that featured both the Foo Fighters and Al Green (adapted from a query in a previous post):
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
PREFIX owl: <http://www.w3.org/2002/07/owl#> 
SELECT DISTINCT ?programme ?label
WHERE {
  ?event1 po:track ?track1 .
  ?track1 foaf:maker ?maker1 . ?maker1 owl:sameAs <http://www.bbc.co.uk/music/artists/67f66c07-6e61-4026-ade5-7e782fad3a5d#artist> .
  ?event2 po:track ?track2 .
  ?track2 foaf:maker ?maker2 . ?maker2 owl:sameAs <http://www.bbc.co.uk/music/artists/fb7272ba-f130-4f0a-934d-6eeea4c18c9a#artist> .
  ?event1 event:time ?t1 .
  ?event2 event:time ?t2 .
  ?t1 tl:timeline ?tl .
  ?t2 tl:timeline ?tl .
  ?version po:time ?t .
  ?t tl:timeline ?tl .
  ?programme po:version ?version .
  ?programme rdfs:label ?label .
}

Monday 15 June 2009

And another fun BBC SPARQL query

This query returns BBC programmes featuring artists originating from France (this is just a straight adaptation of the last query in my previous post).

The results are quite fun! Apparently, the big French hits on the BBC are from Jean-Michel Jarre, Air, Modjo, Phoenix (are they known in France? I've only heard of them in the UK) and Vanessa Paradis.

Note that the tracklisting data we expose in our RDF only goes back a couple of months, which might explain why the list is not longer.

Thursday 11 June 2009

BBC SPARQL end-points

We recently announced on the BBC backstage blog the availability of two SPARQL end-points, one hosted by Talis and one by OpenLink. These two companies aggregated the RDF data we publish at http://www.bbc.co.uk/programmes and http://www.bbc.co.uk/music. This opens up quite a lot of fascinating SPARQL queries. Talis already compiled a small list, and here are a couple I just designed:

  • Give me programmes that deal with the fictional character James Bond - results
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?uri ?label
WHERE {
  ?uri po:person 
    <http://www.bbc.co.uk/programmes/people/bmFtZS9ib25kLCBqYW1lcyAobm8gcXVhbGlmaWVyKQ#person> ; rdfs:label ?label
}
  • Give me artists that were featured in the same programme as the Foo Fighters - results
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
SELECT DISTINCT ?artist2 ?label2
WHERE {
  ?event1 po:track ?track1 .
  ?track1 foaf:maker <http://www.bbc.co.uk/music/artists/67f66c07-6e61-4026-ade5-7e782fad3a5d#artist> .
  ?event2 po:track ?track2 .
  ?track2 foaf:maker ?artist2 .
  ?artist2 rdfs:label ?label2 .
  ?event1 po:time ?t1 .
  ?event2 po:time ?t2 .
  ?t1 tl:timeline ?tl .
  ?t2 tl:timeline ?tl .
  FILTER (?t1 != ?t2)
}
  • Give me programmes that featured both Al Green and the Foo Fighters (yes! there is one result!!) - results
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
SELECT DISTINCT ?programme ?label
WHERE {
  ?event1 po:track ?track1 .
  ?track1 foaf:maker <http://www.bbc.co.uk/music/artists/67f66c07-6e61-4026-ade5-7e782fad3a5d#artist> .
  ?event2 po:track ?track2 .
  ?track2 foaf:maker <http://www.bbc.co.uk/music/artists/fb7272ba-f130-4f0a-934d-6eeea4c18c9a#artist> .
  ?event1 po:time ?t1 .
  ?event2 po:time ?t2 .
  ?t1 tl:timeline ?tl .
  ?t2 tl:timeline ?tl .
  ?version po:time ?t .
  ?t tl:timeline ?tl .
  ?programme po:version ?version .
  ?programme rdfs:label ?label .
}
  • All programmes that featured an artist originating from Northern Ireland - results
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
PREFIX dbprop: <http://dbpedia.org/property/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?programme ?label ?artistlabel ?dbpmaker
WHERE {
  ?event1 po:track ?track1 .
  ?track1 foaf:maker ?maker .
  ?maker rdfs:label ?artistlabel .
  ?maker owl:sameAs ?dbpmaker .
  ?dbpmaker dbprop:origin <http://dbpedia.org/resource/Northern_Ireland> .
  ?event1 po:time ?t1 .
  ?t1 tl:timeline ?tl .
  ?version po:time ?t .
  ?t tl:timeline ?tl .
  ?programme po:version ?version .
  ?programme rdfs:label ?label .
}

(Note that we only need the owl:sameAs in the above query because the Talis end-point doesn't support inference.)

Let us know what kind of query you can come up with this data! :-)

Wednesday 29 October 2008

SPARQLing a funk legend

I just came across this awesome blog post from Kurt. It starts from a real music question (he saw Maceo Parker live, and wanted to know whether he wrote one of the songs played in the encore: Pass the Peas), and finds an answer using Semantic Web technologies, in particular SPARQL.

Great stuff!

Thursday 17 July 2008

Literal search using the Jamendo SPARQL end-point

I just wrote a small SWI-Prolog module for literal search using the ClioPatria SPARQL end-point. It uses the rdf_litidex module, and performs a metaphone search on existing literals in the database. All of that is triggered through a built-in RDF predicate.

Here is an example query you can perform on the Jamendo SPARQL end-point (make sure you select lit as the entailment - it will be the default one soon):

SELECT ?o
WHERE
{"punk jazz" <http://purl.org/ontology/swi#soundslike> ?o}

This query binds ?o to all resources within the end-point that are associated with literals matching the search string.

The module is available there.
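To give an intuition for this kind of phonetic matching: the module uses the metaphone algorithm, but the simpler (and older) Soundex algorithm illustrates the same idea -- two literals "sound alike" when they reduce to the same phonetic key. This is an illustrative sketch only, not code from the rdf_litidex module:

```python
def soundex(word: str) -> str:
    """Classic Soundex: reduce a word to its first letter plus three
    digits, so that similar-sounding words share the same key."""
    codes = {}
    for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")]:
        for letter in letters:
            codes[letter] = digit
    word = "".join(c for c in word.upper() if c.isalpha())
    if not word:
        return ""
    digits, prev = [], codes.get(word[0], "")
    for c in word[1:]:
        if c in "HW":              # H and W are transparent
            continue
        d = codes.get(c, "")       # vowels carry no digit and reset the run
        if d and d != prev:
            digits.append(d)
        prev = d
    return (word[0] + "".join(digits) + "000")[:4]

# soundex("Robert") == soundex("Rupert") == "R163"
```

An index of such keys over the literals in the store is what makes a "sounds like" predicate fast: you hash the query string once and look up every literal sharing its key.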

Wednesday 25 June 2008

Linking Open Data: BBC playcount data as linked data

For the Mashed event this weekend, the BBC released some really interesting data. This includes playcount data, stating how often an artist is featured within a particular BBC programme (at the brand or episode level).

During the event, I wrote some RDF translators for this data, linking web identifiers in the DBTune Musicbrainz linked data to web identifiers in the BBC Programmes linked data. Kurt, Ben and I used it in our hack, and Ben made a nice write-up about it. By finding web identifiers for the tracks in an audio collection, following links to the BBC Programmes data, and finally connecting this Programmes data to the box holding a year's worth of recorded BBC radio programmes that was available at the event, we can quite easily generate playlists from an audio collection. Two Python scripts implementing this mechanism are available there. The first one uses only brands data, whereas the second one uses episodes data (and therefore yields fewer, more accurate items in the resulting playlist). Funnily enough, the thing we spent the most time on was the SQLite storage for our RDF cache :-)

This morning, I published the playcount data as linked data. I wrote a new DBTune service for that. It publishes a set of web identifiers for playcount data, interlinking Musicbrainz and BBC Programmes. I also put online a SPARQL end-point holding all this playcount data along with aggregated data from Musicbrainz and the BBC Programmes linked data (around 2 million triples overall).

For example, you can try the following SPARQL query:

SELECT ?brand ?title ?count
WHERE {
   ?artist a mo:MusicArtist;
      foaf:name "The Beatles". 
   ?pc pc:object ?artist;
       pc:count ?count.
   ?brand a po:Brand;
       pc:playcount ?pc;
       dc:title ?title 
    FILTER (?count>10)}

This will return every BBC brand that has featured The Beatles more than 10 times.
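For reference, trying such a query against the end-point over the SPARQL protocol just means URL-encoding it into the query parameter of a GET request. A minimal sketch with the Python standard library (the end-point URL below is a placeholder, not the DBTune one):

```python
from urllib.parse import urlencode

def sparql_get_url(endpoint: str, query: str) -> str:
    """Per the SPARQL protocol, a query travels as the URL-encoded
    'query' parameter of a GET request against the end-point."""
    return endpoint + "?" + urlencode({"query": query})

# Hypothetical end-point URL, for illustration:
url = sparql_get_url("http://example.org/sparql",
                     'SELECT ?title WHERE { ?brand dc:title ?title }')
```

Fetching that URL (e.g. with urllib.request) would return the result set, typically as SPARQL XML results or JSON depending on the Accept header.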

Thanks to Nicholas and Patrick for their help!

Thursday 12 June 2008

Describing the content of RDF datasets

There seems to be an overall consensus in the Linking Open Data community that we need a way to describe in RDF the different datasets published and interlinked within the project. There is already a Wiki page detailing some aspects of the corresponding vocabulary, called voiD (vocabulary of interlinked datasets).

One thing I would really like this vocabulary to do is describe exactly the inner content of a dataset -- what could we find in this SPARQL end-point, or in this RDF document? I have thought quite a lot about this recently, as I am beginning to really need it. Indeed, when you have RDF documents describing lots of audio annotations, whose generation is really computation-intensive, you want to pick just the one that fits your request. There have been quite a few similar efforts in the past. However, most of them rely on one sort of reification or another, which makes them quite hard to actually use.

After some failed tries, I came up with the following, which I hope is easy and expressive enough :-)

It relies on a single property void:example, which links a resource identifying a particular dataset to a small RDF document holding an example of what you could find in that dataset. Then, with just a bit of SPARQL magic, you can easily query for datasets having a particular capability. Easy, isn't it? :-)

Here is a real-world example of that. A first RDF document describes one of the DBtune datasets:

:ds1
        a void:Dataset;
        rdfs:label "Jamendo end-point on DBtune";
        dc:source <http://jamendo.com/>;
        foaf:maker <http://moustaki.org/foaf.rdf#moustaki>;
        void:sparql_end_point <http://dbtune.org:2105/sparql/>;
        void:example <http://moustaki.org/void/jamendo_example.n3>;
        .

The void:example property points towards a small RDF file, giving an example of what you can find within this dataset.

Then, the following SPARQL query asks whether this dataset has a SPARQL end-point and holds information about music records, associated tags, and places to download them.

PREFIX void: <http://purl.org/ontology/void#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX tags: <http://www.holygoat.co.uk/owl/redwood/0.1/tags/>

ASK
FROM NAMED <http://moustaki.org/void/void.n3>
FROM NAMED <http://moustaki.org/void/jamendo_example.n3>
{
        GRAPH <http://moustaki.org/void/void.n3> {
                ?ds a void:Dataset;
                        void:sparql_end_point ?sparql;
                        void:example ?ex.
        }
        GRAPH ?ex {
                ?r a mo:Record;
                        mo:available_as ?l;
                        tags:taggedWithTag ?t.
        }
}

I tried this query with ARQ, and it works perfectly :-)

$ sparql --query=void.sparql
Ask => Yes

Update: It also works with ARC2, although it does not automatically load the graphs named in the SPARQL FROM NAMED clauses. You can try the same query on this SPARQL end-point, which has previously loaded the two documents (the voiD description and the example).

Update 2: A nice blog post about automatically generating the data you need for describing an end-point - thanks shellac for the pointer!

Update 3: Following discussion on the #swig IRC channel.

Monday 7 April 2008

D2RQ mapping for Musicbrainz

I just started a D2R mapping for Musicbrainz, which makes it fairly easy to create a SPARQL end-point and to provide linked data access on top of Musicbrainz. A D2R instance loaded with the current state of the mapping is also available (be gentle, it is running on a cheap computer :-) ).

On top of what is available within the Zitgist mapping, this adds:

  • A SPARQL end-point;
  • Support for tags;
  • Support for a couple of advanced relationships (still working my way through them, though);
  • An instrument taxonomy generated directly from the database, and related to performance events;
  • Support for orchestras;
  • Links to DBpedia for places and to Lingvoj for languages.

There is still a lot to do, though: it is really a start. The mapping is available on the motools sourceforge project. I hope to post a follow-up soon! (including examples of funny SPARQL queries :-) ).

Update: For some obscure port-forwarding reasons, the SNORQL interface to the SPARQL end point does not work on the test server.

Update 2: This is fixed. (thanks to the anonymous SPARQL crash tester which helped me find the bug, by the way :-) )

Monday 25 February 2008

Playing with SPARQL and XMPP

Chatting with Dan Brickley at the Semantic Camp last week got me quite curious about mixing SPARQL and XMPP, so I decided to give it a try :-)

I first tried sparqlxmpp by Chris Schmidt, a Python implementation using Redland as a back-end. Unfortunately, I ran into some trouble (a weird error attribute being inserted in the XML between sending and receiving, which made the whole thing crash).

So I decided to try xOperator, whose 0.1 version was announced last week. It is really easy to use and flexible, notably because you can specify third-party SPARQL end-points as new datasets, and because you can template SPARQL queries (a bit like SparqlBot).

Here is the result. You can add jamendo@jabber.org to your Jabber roster, which provides SPARQL/XMPP on top of the Jamendo SPARQL end-point. Then, here is an example session (moustaki is me, and jamendo is the xoperator instance):

<moustaki> list templates
<jamendo> These are the defined templates:
<jamendo> LIST ME * ARTISTS  select ?n where {?a rdf:type mo:MusicArtist; foaf:name ?n} LIMIT %%1%%
<jamendo> GIVE ME THE LOCATION OF *  select ?l where {?a a mo:MusicArtist;foaf:name '%%1%%'; foaf:based_near ?g. ?g 
                <http://www.geonames.org/ontology#name> ?l}
<jamendo> GIVE ME ALBUMS MADE BY *  select ?l where {?a a mo:MusicArtist; foaf:name '%%1%%'; foaf:made ?r. ?r a mo:Record; dc:title ?l}
<jamendo> WHERE CAN I GET *  select ?d where {?a dc:title '%%1%%'; mo:available_as ?d}
<jamendo> GIVE ME THE LOCATION OF *  select ?l where {?a a mo:MusicArtist;foaf:name '%%1%%'; foaf:based_near ?g. ?g 
                <http://www.geonames.org/ontology#name> ?l}

<moustaki> list me 5 artists
<jamendo> Store jamendo answered: 
<jamendo> "Cicada"
<jamendo> "Hace Soul"
<jamendo> "vincent j"
<jamendo> "NoU"
<jamendo> "Margin of Safety"

<moustaki> give me the location of Cicada
<jamendo> Store jamendo answered: 
<jamendo> "Département des Bouches-du-Rhône"

<moustaki> give me albums made by Cicada
<jamendo> Store jamendo answered: 
<jamendo> "Rebirth"   
<jamendo> "AuthentiK Yogourt"

<moustaki> where can I get AuthentiK Yogourt
<jamendo> Store jamendo answered: 
<jamendo> <http://www.jamendo.com/get/album/id/album/p2p/redirect/8309/?p2pnet=bittorrent&are=mp32>
<jamendo> <http://www.jamendo.com/get/album/id/album/p2p/redirect/8309/?p2pnet=bittorrent&are=ogg3>
<jamendo> <http://www.jamendo.com/get/album/id/album/p2p/redirect/8309/?p2pnet=ed2k&are=mp32>
<jamendo> <http://www.jamendo.com/get/album/id/album/p2p/redirect/8309/?p2pnet=ed2k&are=ogg3>
<jamendo> <http://www.jamendo.com/get/track/id/album/audio/play/8309/?item_o=track_no_asc&aue=ogg2&n=all>
<jamendo> <http://www.jamendo.com/get/track/id/album/audio/xspf/8309/?item_o=track_no_asc&aue=ogg2&n=all>

Now, making it interact with GNAT and GNARQL, two tools able to create a SPARQL end point holding information about your personal music collection, is the next step :)

Wednesday 12 December 2007

HENRY: A small N3 parser/reasoner for SWI-Prolog

Yesterday, I finally took some time to package the little hack I wrote last week, based on the SWI-Prolog N3 parser I wrote back in September.

Update: A newer version with lots of bug fixes is available there.

It's called Henry; it is really small, hopefully quite understandable, and it does N3 reasoning. The good thing is that you can embed such reasoning in the SWI-Prolog Semantic Web server, and then access an N3-entailed RDF store using SPARQL.

How to use it?

Just get this tarball, extract it, and make sure you have SWI-Prolog installed, with its Semantic Web library.

Then, the small swicwm.pl script provides a command-line tool to test it (roughly equivalent, in CWM terms, to cwm $1 --think --data --rdf).

Here is a simple example (shipped in the package, along with other, funnier examples).

The file uncle.n3 holds:

@prefix : <http://example.org/> .
:yves :parent :fabienne.
:fabienne :brother :olivier.
{?c :parent ?f. ?f :brother ?u} => {?c :uncle ?u}.

And:

$ ./swicwm.pl examples/uncle.n3

<!DOCTYPE rdf:RDF [
    <!ENTITY ns1 'http://example.org/'>
    <!ENTITY rdf 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
]>

<rdf:RDF
    xmlns:ns1="&ns1;"
    xmlns:rdf="&rdf;"
>
<rdf:Description rdf:about="&ns1;fabienne">
  <ns1:brother rdf:resource="&ns1;olivier"/>
</rdf:Description>

<rdf:Description rdf:about="&ns1;yves">
  <ns1:parent rdf:resource="&ns1;fabienne"/>
  <ns1:uncle rdf:resource="&ns1;olivier"/>
</rdf:Description>

</rdf:RDF>

How does it work?

Henry is built around my SWI N3 parser, which basically translates the N3 into a quad form that can be stored in the SWI RDF store. The two tricks used to reach such a representation are the following:

  • Each formula is seen as a named graph, identified by a blank node (there exists a graph where the following is true...);
  • Universal quantification is captured through a specific set of atoms (a bit like __bnode1 captures an existentially quantified variable).

For example:

@prefix : <http://example.org/> .
{?c :parent ?f. ?f :brother ?u} => {?c :uncle ?u}.

would be translated to:

subject||predicate||object||graph
__uqvar_c||http://example.org/parent||__uqvar_f||__graph1
__uqvar_f||http://example.org/brother||__uqvar_u||__graph1
__uqvar_c||http://example.org/uncle||__uqvar_u||__graph2
__graph1||log:implies||__graph2||uncle.n3

Then, when the compile predicate runs, this representation is translated into a bunch of Prolog clauses; in our example:

rdf(C,'http://example.org/uncle',U) :- rdf(C,'http://example.org/parent',F), rdf(F,'http://example.org/brother',U).

Such rules are defined in a particular entailment module, allowing them to be plugged into the SWI Semantic Web server. Of course, rules can get into an infinite loop, and this will make the whole thing crash :-)
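Henry compiles the rule into a backward-chaining Prolog clause, as shown above. As an outside illustration of what the rule computes (not Henry's actual machinery), the same uncle rule can be run forward to a fixpoint over plain (subject, predicate, object) tuples, with namespaces elided:

```python
# Triples as (subject, predicate, object) tuples; namespaces elided.
facts = {("yves", "parent", "fabienne"),
         ("fabienne", "brother", "olivier")}

def think(triples):
    """Apply {?c :parent ?f. ?f :brother ?u} => {?c :uncle ?u}
    until no new triple appears (a tiny forward-chaining fixpoint)."""
    derived = set(triples)
    while True:
        new = {(c, "uncle", u)
               for (c, p, f) in derived if p == "parent"
               for (f2, q, u) in derived if q == "brother" and f2 == f}
        if new <= derived:
            return derived
        derived |= new

# think(facts) also contains ("yves", "uncle", "olivier"),
# matching the swicwm.pl output shown earlier.
```

The fixpoint loop makes rule chaining explicit, whereas the Prolog clause derives the same triple lazily, at query time.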

I tried to make the handling of lists and builtins as clear and simple as possible. Defining a builtin is as simple as registering a new predicate, associating a particular URI with a Prolog predicate (see builtins.pl for an example).

An example using both lists and builtins is the following one. By issuing this SPARQL query to an N3-entailed store:

PREFIX list: <http://www.w3.org/2000/10/swap/list#>

SELECT ?a
WHERE
{?a list:in ("a" "b" "c")}

You will get back a, b and c (you have no idea how much I struggled to make this thing work :-) ).

But take care!

Of course, there are lots of bugs and issues, and I am sure there are lots of cases where it'll crash and burn :-) Anyway, I hope it will be useful.