First BBC microsite powered by a triple-store
By Yves on Tuesday 13 July 2010, 15:46
Jem Rayfield wrote a very interesting post on the technologies used by the World Cup BBC web site, which also got covered by Read Write Web.
All this is very exciting: the World Cup website proved that triple store technologies can be used to drive a production website with significant traffic. I expect many more parts of the BBC web infrastructure to evolve in the same way :-)
There are two issues we are still trying to solve, though:
- We need to be able to cluster our triples along several dimensions. For example, we may want to have a graph for a particular programme, and a much larger graph for a particular dataset (e.g. programme data, wildlife finder data, world cup data). The smaller graph is used to make our updates relatively cheap (we replace the whole graph whenever we receive an update). The bigger graph is used to give some degree of isolation between the different sources of data. For that, we need graphs within graphs. It can be done with N3-type graph literals, but is impossible to achieve in a standard quad-store setup, where one single triple can't be part of several graphs.
- With regard to programme data, the main bottleneck we're facing is the number of updates per second we need to be able to process, which most available triple stores struggle to keep up with. The 4store instance on DBTune does keep up, but this has a negative impact on query performance, as the write operations block the reads. We were quite surprised to see that the available triple store benchmarks do not take write throughput into account!
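The first issue can be illustrated with a toy in-memory quad store. This is only a sketch under stated assumptions: the `QuadStore` class, the graph names and the `ex:` identifiers are hypothetical and do not correspond to any real triple store API. It shows that a triple has to be stored once per graph to belong to both clusters, and that replacing the small graph leaves a stale copy in the large one.

```python
from collections import defaultdict


class QuadStore:
    """Toy quad store (hypothetical): maps a graph name to a set of triples."""

    def __init__(self):
        self._graphs = defaultdict(set)

    def add(self, graph, triple):
        """Store one quad, i.e. `triple` inside `graph`."""
        self._graphs[graph].add(triple)

    def replace_graph(self, graph, triples):
        """Cheap update path: drop the whole graph and load the new triples."""
        self._graphs[graph] = set(triples)

    def graphs_containing(self, triple):
        """Return the sorted names of all graphs that hold `triple`."""
        return sorted(g for g, ts in self._graphs.items() if triple in ts)

    def quad_count(self):
        """Total number of stored quads across all graphs."""
        return sum(len(ts) for ts in self._graphs.values())


store = QuadStore()
old = ("ex:programme1", "rdf:type", "po:Programme")  # illustrative identifiers

# To sit in both clusters, the same triple is stored as two separate quads.
store.add("graph:programme1", old)
store.add("graph:programmes", old)

# An update replaces only the small per-programme graph.
new = ("ex:programme1", "rdf:type", "po:Brand")
store.replace_graph("graph:programme1", [new])

# The copy in the dataset-level graph was not touched by the update.
stale_graphs = store.graphs_containing(old)
```

After the update, `stale_graphs` still lists the dataset-level graph, so the two copies have to be kept in sync by hand.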
Comments
Hi Yves, it seems like you use a combination of triple stores at the BBC. Is that correct? Do you have a feeling for the strengths and weaknesses of each? Do you see yourselves going for one in the end, or using different stores in different parts of the architecture?
Thanks,
Paul
Hello Paul!
The World Cup website, which is currently the only website to use a triple store as backend storage in production, is using BigOWLIM. The 4store instance I described in this blog is just a test instance, to see if 4store can keep up with our write throughput.
Ultimately, we'd want to use just one single triple store solution (much easier for our operations team to handle...), even though they all have their strengths and weaknesses (e.g. 4store is very efficient on writes but does not do any inference; BigOWLIM is less efficient on writes but does very quick inference, etc.)
Best,
y
Hi Yves: thanks so much for blogging some of these behind-the-scenes details. Can you speak to what sort of inferencing you are doing with BigOWLIM? Have you run across any triple stores that allow for replication, so you can write to one, and then replicate to read-only stores?
Hello Ed!
I believe Jem and his team are already using replication in BigOWLIM - this is certainly something that is very useful, especially if you want to maintain some degree of isolation between different systems.
Cheers,
y
Thanks Yves. I didn't know BigOWLIM did replication. If you get the chance to blog about the sort of inferencing you did on the World Cup data, that would be interesting.
Hello Ed!
That would be a question for Jem I think :-) Feel free to send him an email! (Ping me if you don't have his email address)
Cheers,
y
"For that, we need graphs within graphs. It can be done with N3-type graph literals, but is impossible to achieve in a standard quad-store setup, where one single triple can't be part of several graphs."
Can't you use metadata about the graph URIs from the plain quad-store?
Hello Dan!
Well, not really. For example, we want to be able to say that:
{ <http://www.bbc.co.uk/programmes/b00... a po:Programme }
is both part of the 'b006m86d' graph (all information about Eastenders), so that we can replace the whole graph whenever the Eastenders information is updated, and part of the 'programmes' graph (all information about BBC Programmes), so that we can ensure a degree of isolation between programmes data, world cup data, wildlife data etc.
In a typical quad-store setup, you would have to store that statement twice, i.e. storing both quads:
'http://www.bbc.co.uk/programmes/b00... 'rdf:type' 'po:Programme' 'eastenders'
and
'http://www.bbc.co.uk/programmes/b00... 'rdf:type' 'po:Programme' 'programmes'
Of course, you could hack around it, in that particular example, by saying 'eastenders' 'isPartOf' 'programmes', but it starts to make your SPARQL queries horribly complex, especially if your clustering does not follow a hierarchy, or if the hierarchy is several layers deep (e.g. 'programmes' 'isPartOf' 'bbc' etc.).
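To see why the 'isPartOf' workaround complicates queries, here is a minimal sketch in Python rather than SPARQL. The helper names `member_graphs` and `triples_in`, the `ex:` identifiers and the 'worldcup' graph are assumptions made for illustration. Every lookup against a cluster first has to walk the 'isPartOf' links to find all the graphs that belong to it, however deep the nesting goes.

```python
def member_graphs(cluster, part_of):
    """Return `cluster` plus every graph that is transitively 'isPartOf' it.

    `part_of` maps a graph name to the set of graphs it is declared part of.
    """
    members = {cluster}
    frontier = [cluster]
    while frontier:
        parent = frontier.pop()
        for child, parents in part_of.items():
            if parent in parents and child not in members:
                members.add(child)
                frontier.append(child)
    return members


def triples_in(cluster, quads, part_of):
    """Collect the triples of all quads whose graph belongs to `cluster`."""
    graphs = member_graphs(cluster, part_of)
    return {(s, p, o) for (s, p, o, g) in quads if g in graphs}


# 'eastenders' isPartOf 'programmes', 'programmes' isPartOf 'bbc', and so on.
part_of = {
    "eastenders": {"programmes"},
    "programmes": {"bbc"},
    "worldcup": {"bbc"},
}

# Illustrative quads; the subject identifiers are made up.
quads = [
    ("ex:some-programme", "rdf:type", "po:Programme", "eastenders"),
    ("ex:some-match", "rdf:type", "ex:Match", "worldcup"),
]

programme_triples = triples_in("programmes", quads, part_of)
bbc_graphs = member_graphs("bbc", part_of)
```

Each statement is stored only once here, but the membership walk has to be repeated (or expressed as a path in the query) for every query, and it gets worse when the clustering is not a clean hierarchy.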
Cheers,
y
Hello
We are the developers of BigOWLIM, and I will take the opportunity to comment, as this blog raises two important issues.
“For that, we need graphs within graphs.” Yes, we found this necessary for a large range of data management scenarios. We invented and implemented in BigOWLIM a mechanism called “tripleset”, which essentially allows one to deal efficiently with parts of datasets, in this case a SPARQL dataset, which can be seen as a collection of named graphs. The contents of such a dataset can be seen as a set of <S,P,O,NG> quadruples or as an RDF multi-graph. Triplesets can include any subset of the contents of the dataset; they can contain statements from different named graphs. While in theory one can do the same with NGs, we preferred to invent a new mechanism, because we wanted to use NGs solely for data-source provenance. Whenever a dataset is combined from multiple sources, the information from each source is loaded in a separate named graph. This makes it possible to update only a fraction of a dataset when one of the sources is updated (addressing your point that “The smaller graph is used to make our updates relatively cheap (we replace the whole graph whenever we receive an update)”). If NGs are used this way, you need a different mechanism to designate fractions of your dataset which are to be used for “operational” purposes. And this is where the tripleset mechanism comes in: you can think of it as a way to tag parts of datasets. We have recently presented triplesets as a proposal for extension of the RDF data model at the “RDF Next Steps” workshop, http://www.w3.org/2001/sw/wiki/RDF/...
About the “write performance”. One should consider that in BigOWLIM writing also means materialisation: we do the entire inference work when the data is loaded or updated. Everything that can be inferred from the current contents of the repository is inferred and indexed at the end of each transaction. And this comes at a price: if BigOWLIM is used as a plain RDF store, i.e. if RDF is loaded without any inference (which is what most of our competitors do), loading is about 2 times faster than with the same engine configured to perform materialisation with respect to the RDFS semantics. Loading with OWL Horst inference is 4 times slower than the plain RDF store configuration. And OWL 2 RL inference is even slower. Still, in the BBC scenario, updates are quick enough even though materialisation is performed.
I should also mention that writes in BigOWLIM do not block reads; we find such blocking unacceptable for many applications. Write transactions happen in parallel with read operations (e.g. query evaluation). The results of the write operations become visible in a transaction isolation mode comparable to “READ COMMITTED”.
And a final detail: when materialisation is used, one should take special care of deletions. When a statement is deleted, the repository should invalidate all materialised statements which are no longer inferable. This is tough! Thus, most of the systems which do materialisation require that all materialised statements are dropped after deletion and re-inferred again. This is slow and does not work for scenarios where deletes are performed all the time and materialisation cannot be delayed. To overcome this we have implemented in BigOWLIM a fairly complicated “smooth invalidation” mechanism, which allows us to invalidate the irrelevant inferences with a speed comparable to the one for inference. As far as I am aware, no one else in the industry can do this. And without it the updates that were made in the World Cup 2010 sites would have required much more time to get processed. Probably, they would have been finalized just before the 2012 Olympics :-)
Sorry for the long comment, but these are complex matters and one cannot explain them in a tweet.
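The cost of deletions under materialisation can be sketched with a toy example. The code below is an assumption-laden illustration of the naive "drop everything and re-infer" strategy mentioned above, not of BigOWLIM's “smooth invalidation”; the `materialise` function, the `MaterialisedStore` class and the `ex:` names are all hypothetical, and only two RDFS-style subclass rules are applied.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def materialise(explicit):
    """Forward-chain two subclass rules to a fixpoint.

    Rules: (x type C), (C subClassOf D)       => (x type D)
           (A subClassOf B), (B subClassOf C) => (A subClassOf C)
    Returns the explicit triples plus everything inferred from them.
    """
    closure = set(explicit)
    while True:
        inferred = set()
        for (s, p, o) in closure:
            if p not in (TYPE, SUBCLASS):
                continue
            for (s2, p2, o2) in closure:
                if p2 == SUBCLASS and s2 == o:
                    inferred.add((s, p, o2))
        if inferred <= closure:
            return closure
        closure |= inferred


class MaterialisedStore:
    """Toy store that keeps the inferred closure up to date on every write."""

    def __init__(self):
        self.explicit = set()
        self.closure = set()

    def add(self, triple):
        self.explicit.add(triple)
        self.closure = materialise(self.explicit)

    def delete(self, triple):
        self.explicit.discard(triple)
        # Naive invalidation: throw away all inferences and re-infer from scratch.
        self.closure = materialise(self.explicit)


store = MaterialisedStore()
store.add(("ex:Striker", SUBCLASS, "ex:Player"))
store.add(("ex:Player", SUBCLASS, "ex:Person"))
store.add(("ex:player1", TYPE, "ex:Striker"))
size_before = len(store.closure)

# Deleting one explicit statement invalidates several inferred ones.
store.delete(("ex:Striker", SUBCLASS, "ex:Player"))
size_after = len(store.closure)
```

In this sketch a single delete recomputes the whole closure, which is exactly the behaviour that becomes too slow when deletes arrive continuously.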
Cheers
Naso
Thank you so much for your lengthy response Naso! That looks very promising!
To give you a few metrics of what we'd need to start managing programmes data in a triple store, we would want to handle between 5 and 10 updates per second, and between 50 and 100 queries per second. For programmes data, there is a limited need for reasoning, so we can consider switching it off entirely, which would save us some time.
Many many thanks,
Yves