Linked Data on the Web 2008
By Yves on Monday 12 May 2008, 08:53 - Permalink
I just got back from Beijing (I took a two-week trip around China after the conference), where I attended the Linked Data on the Web workshop and the WWW conference.
The workshop was really good, gathering lots of people from the Linking Open Data community (it was the first time I met most of these people, after more than one year working with them :-) ).
The attendance was much higher than expected, with around 100 people registered for the workshop.

It started well with this sentence by Tim Berners-Lee in the workshop introduction:
Linked Data is the Semantic Web done right, and the Web done right.
That's a pretty good way to start a day :-) Then, Chris Bizer gave a good overview of what the community has achieved in one year, illustrated by the different versions of Richard's diagram.

All the talks and papers were of extremely high quality. I was particularly interested in some of them, including Tim's presentation on the new SPARQL/Update capabilities of the Tabulator data browser. This allows easy interaction with data wikis, where everyone can add or correct information.

I really liked Alexandre Passant's presentation on the Flickr exporter, which highlights a mechanism that I used for the Last.fm linked data exporter: linking several identities on several websites is just an owl:sameAs link away.
Alexandre also gave another presentation on MOAT (Meaning of a Tag), a really interesting project that makes it possible to relate tags to Semantic Web URIs. For example, it makes it easy to link my tag "paris texas" to the movie Paris, Texas in DBpedia.
I got a bit confused by Paul Miller's presentation about licensing open data. I have been aware of these efforts mainly through the work of the Open Knowledge Foundation and the Open Data Commons project, and I think these are truly crucial issues: we need open data, and explicit licensing. But perhaps the audience was not so well chosen: most (if not all) of us in the Linking Open Data community do not own the data we publish as RDF and interlink. DBpedia exports data extracted from Wikipedia, DBTune exports data from different music-related sources such as Jamendo or Last.fm, etc. The only data that we can possibly license explicitly are the links (the only thing we actually own), and they do not have any value without the data :-) So I guess the outreach should mainly be directed at raw data publishers rather than Semantic Web translators?
But hopefully, in the near future, the two communities will be the same!

One of my personal highlights was also Christian Becker's presentation about DBpedia mobile: a location-enabled linked data browser for mobile devices, giving you nearby sights and detailed descriptions, restaurants, hotels, etc. We chatted a bit after the workshop with Alexandre and Christian about adding Last.fm events to the DBTune exporter to also display nearby gigs (with optional filtering based on your foaf:interests, of course :-) ).
Jun Zhao's presentation about linked data and provenance for biological resources was extremely interesting: they are dealing with problems very similar to ours in a Music Information Retrieval context. How can we trust a particular statement (for example, a structural segmentation of a particular track) found on the web? We need to know whether it was written by a human or derived through a set of algorithms, and in the latter case, we might want to choose timbre-based instead of chroma-based workflows for Rock music, for example. This is the sort of thing we implemented within our Henry software (more to come on that later, including an online demo as soon as I put it on better hardware, and (hopefully) a PhD :-D ).
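To make the provenance idea a bit more concrete, here is a minimal sketch in Python. It is not how Henry works: the `Segmentation` record, the `produced_by` and `workflow` fields, and the preference table are all hypothetical names introduced for illustration. It only shows how provenance attached to a statement can drive the choice between competing statements.

```python
from dataclasses import dataclass

@dataclass
class Segmentation:
    track: str
    boundaries: list       # segment boundaries, in seconds
    produced_by: str       # "human" or "algorithm" (hypothetical labels)
    workflow: str = ""     # e.g. "timbre" or "chroma" (hypothetical labels)

# Assumption for illustration: the workflow we would rather trust per genre.
PREFERRED_WORKFLOW = {"rock": "timbre"}

def pick_segmentation(candidates, genre):
    """Prefer human annotations, then the workflow suited to the genre."""
    preferred = PREFERRED_WORKFLOW.get(genre)

    def rank(seg):
        if seg.produced_by == "human":
            return 0
        return 1 if seg.workflow == preferred else 2

    # min() keeps the first candidate among those with the same rank.
    return min(candidates, key=rank)

candidates = [
    Segmentation("track-1", [0.0, 31.5, 62.0], "algorithm", "chroma"),
    Segmentation("track-1", [0.0, 30.8, 61.2], "algorithm", "timbre"),
]
chosen = pick_segmentation(candidates, "rock")
print(chosen.workflow)
```

Without provenance, the two segmentations above would be indistinguishable statements about the same track.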
Wolfgang Halb did a presentation about our Riese project, but more on that later as I wrote the back-end software powering it and I'd like to give it a full blog entry soon.
I gave a presentation about automatic interlinking algorithms on the data web, with a focus on music-related datasets. I detailed an algorithm we developed for this purpose, which propagates similarity measures around web data for as long as we can't make an interlinking decision (that is, creating a bunch of owl:sameAs links). This algorithm is good in the sense that it gives a really low rate of false positives (on the test set detailed in the paper, it made no wrong decisions). I blogged about this algorithm earlier.
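The general idea can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm from the paper: the graph layout (`label` and `related` keys), the string similarity from `difflib`, the averaging step, and the `threshold` and `margin` values are all assumptions made for the example.

```python
from difflib import SequenceMatcher

def label_similarity(a, b):
    """Plain string similarity between two labels, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def similarity(a, b, graph_a, graph_b, depth):
    """Label similarity, averaged with that of related resources when depth > 0."""
    score = label_similarity(graph_a[a]["label"], graph_b[b]["label"])
    rel_a, rel_b = graph_a[a]["related"], graph_b[b]["related"]
    if depth == 0 or not rel_a or not rel_b:
        return score
    # For each neighbour of a, take its best match among the neighbours of b.
    neighbour_score = sum(
        max(similarity(x, y, graph_a, graph_b, depth - 1) for y in rel_b)
        for x in rel_a
    ) / len(rel_a)
    return (score + neighbour_score) / 2

def decide(scores, threshold=0.9, margin=0.1):
    """Return the best candidate only if it is clearly ahead, else None."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_score >= threshold and best_score - runner_up >= margin:
        return best
    return None

def interlink(source, candidates, graph_a, graph_b, max_depth=2):
    """Look further into the neighbourhood until a decision can be made."""
    for depth in range(max_depth + 1):
        scores = {c: similarity(source, c, graph_a, graph_b, depth)
                  for c in candidates}
        match = decide(scores)
        if match is not None:
            return match
    return None  # still ambiguous: create no owl:sameAs link

# Hypothetical data: two artists with the same name in the target dataset.
graph_a = {
    "a:artist": {"label": "The Example Band", "related": ["a:rec1", "a:rec2"]},
    "a:rec1": {"label": "First Album", "related": []},
    "a:rec2": {"label": "Second Album", "related": []},
}
graph_b = {
    "b:artist1": {"label": "The Example Band", "related": ["b:rec1", "b:rec2"]},
    "b:artist2": {"label": "The Example Band", "related": ["b:rec3"]},
    "b:rec1": {"label": "First Album", "related": []},
    "b:rec2": {"label": "Second Album", "related": []},
    "b:rec3": {"label": "Unrelated Sessions", "related": []},
}
print(interlink("a:artist", ["b:artist1", "b:artist2"], graph_a, graph_b))
```

On labels alone the two candidates tie, so no decision is made; once the records are taken into account, only one candidate remains plausible. Refusing to decide when the margin is too small is what keeps false positives down in this sketch.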

Some people expressed concerns about the proliferation of
owl:sameAs links (highlighted in this
presentation by Paolo Bouquet). But I truly think it is a necessary thing,
as long as web identifiers are tied to their actual representation. I need to
be able to have a web identifier for a song in Jamendo and a web identifier for
the same song in Musicbrainz, and I need a way to link these together:
owl:sameAs is perfect for that. I wouldn't trust a centralised
"identity" system (what actually is identity anyway? :-) ), as it would break
the nice decentralised information paradigm we're implementing within the
Linking Open Data project: we are not Freebase.
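As a small illustration of how a consumer can work with decentralised identifiers, the following Python sketch groups URIs that are connected by owl:sameAs links, using a union-find structure. The example.org URIs are hypothetical placeholders, not real Jamendo, Musicbrainz or Last.fm identifiers, and the triples are plain tuples rather than the output of any particular RDF library.

```python
from collections import defaultdict

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def same_as_clusters(triples):
    """Group URIs into sets of identifiers connected by owl:sameAs links."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for subject, predicate, obj in triples:
        if predicate == OWL_SAME_AS:
            parent[find(subject)] = find(obj)

    clusters = defaultdict(set)
    for uri in list(parent):
        clusters[find(uri)].add(uri)
    return list(clusters.values())

# Hypothetical identifiers for the same song in three datasets.
triples = [
    ("http://example.org/jamendo/track/1", OWL_SAME_AS,
     "http://example.org/musicbrainz/track/a"),
    ("http://example.org/musicbrainz/track/a", OWL_SAME_AS,
     "http://example.org/lastfm/track/x"),
    ("http://example.org/jamendo/track/2",
     "http://purl.org/dc/elements/1.1/title", "Another song"),
]
for cluster in same_as_clusters(triples):
    print(sorted(cluster))
```

Each dataset keeps its own identifier, and the grouping is computed by whoever consumes the links, with no central identity service involved.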
Anyway, lots of great people, a great time, lots of interesting discussions and new ideas... I am really looking forward to WWW 2009 in Madrid and the next workshop!!!
Comments
Yves
yes, it was a good workshop, wasn't it? :-) I hope you had a good time touring China. Other than the wandering (almost) Dr Heath, we just had to make do with two crammed days immediately after the conference, but managed to fit quite a bit in.
To respond, specifically, to your point about Linked Data... yes, we definitely do need to be (and we are) talking to the data owners themselves.
We also need to be talking to those who re-integrate, recombine and reuse the data (such as yourself). Firstly, in the eyes of the law 'ignorance is no defence.' Just because you can re-use someone's data doesn't mean you're allowed to. Those who reuse data need to be aware of the issues around what they're doing... especially as early 'experiments' and 'demonstrators' grow to become services that users rely upon; and that the data owners might actually notice and (possibly) question.
Secondly, we all need to be applying ourselves to understanding ways in which a whole raft of licensing terms can be better expressed in machine-actionable form. I'd prefer that users didn't need to engage with most of this stuff. It would be far better if we had data stores that expressed their licensing terms, and applications that obeyed those terms as they worked with data coming from diverse sources. That requires a real meeting of minds between license drafters, data owners, and application builders. Several starts have been made, but we're certainly not there yet.
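A very rough sketch of what "applications that obey those terms" could look like, in Python. The `permits` field and the action names ("reuse", "redistribute", "derive") are invented for this example and do not correspond to any existing licensing vocabulary; the only point is that the check can be automated once the terms are machine-readable.

```python
def usable_sources(sources, required):
    """Keep only the sources whose declared terms permit every required action."""
    return [
        source["name"]
        for source in sources
        # A source with no declared terms is treated as permitting nothing.
        if required <= source.get("permits", set())
    ]

# Hypothetical datasets and terms.
sources = [
    {"name": "dataset-a", "permits": {"reuse", "redistribute", "derive"}},
    {"name": "dataset-b", "permits": {"reuse"}},
    {"name": "dataset-c"},  # no explicit licensing at all
]
print(usable_sources(sources, {"reuse", "derive"}))
```

Treating missing terms as "not permitted" is a deliberately conservative choice in this sketch.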
We all, I think, want data to be as freely and widely available for re-use as possible. Licenses such as Open Data Commons aren't intended to constrain use. They're intended to explicitly and proactively describe all the things that you can do with someone else's data. They're an important missing piece in the journey from innovative and exciting experimentation (where, if we're honest, 'rights' and 'ownership' are rarely as carefully acknowledged or respected as they should be) to robust and sustainable delivery of services and applications.
Hello Paul!
I actually completely agree with you. Explicit licensing in a machine-readable format is the key (and Creative Commons is the perfect example proving it). And the more open the licensing is, the better :-)
It is also a daily problem that "re-users" of data have to face. It is clearly not usual for data providers to give explicit licensing (apart from some exceptions, like Musicbrainz) - in the best case they provide buggy licensing that doesn't really apply to data.
We must ask the providers not only for the data, but also for explicit, machine-readable licensing of it. And as you say, it needs a real meeting of minds between the three communities.
But I have the feeling (at least in the domain I am mainly interested in) that most data providers don't even realise there is a problem here: only license drafters and re-users do. So we *really* need a serious outreach there.
It might be useful to look at how such a realisation occurred in the Creative Commons and open-source worlds. In the CC world, I guess it is mainly due to the numerous court cases, and a sudden realisation that traditional licensing schemes were not suited to new creative processes. In the open source world, I really have no idea :-)
I just hope we won't need court cases to make such a realisation occur...