Named Graphs and quad stores
By Yves on Tuesday 21 December 2010, 10:53 - Permalink
I was recently looking again at the Named Graphs paper, and was wondering why the consensus for implementing Named Graphs support in triple stores was to store quads.
Named Graphs are defined as follows. We consider U as being the set of all web identifiers, L as being the set of all literals and B the set of all blank nodes. U , L and B are pairwise disjoint. Let N = U ∪L∪B, then the set T = N × U × N is the set of all RDF triples. The set of RDF graphs G is the power set of T, and a named graph is a pair ng = (u, g), where u ∈ U and g ∈ G.
As mentioned in one my
previous post, one thing that I am really keen on is the idea of having
triples that belong to multiple graphs. This situation is already happening
in the wild
a fair bit. If you look at a /programmes
aggregation feed (all available Radio 4 programmes starting with g
),
it mentions multiple programmes. All of the statements about each of these
programmes also belong to the corresponding programme graph (e.g. Gardener's Question).
Nothing in the above definition forbids that from happening - each graph and sub-graph can be named. ( (a,b,c), (d,f,g) ) can be named x and ( (a,b,c) ) can be named y.
Now, most triple stores implement Named Graphs by, in fact, storing quads. They store tuples of the form (subject, predicate, object, graph). Which means that when you try to store the two graphs above, they will in fact store (a,b,c,x), (d,f,g,x) and (a,b,c,y) --- the triples will be replicated in each graph they are mentioned in. For the above /programmes example, it is terribly inefficient. If you crawl /programmes, you will end up hitting lots of such aggregation URLs, leading in a tremendous amount of data duplication - lots of triples will be repeated across graphs. Perhaps it would be better to store something like (a,b,c,(x,y)) and (d,f,g,(x))?
I would be curious to know if anyone else is hitting this issue. I can think of multiple use-cases for triples belonging to multiple graphs: access control, versioning, attribution at different granularities...
Comments
Have a look at "triple sets" from the OWLIM people.
http://www.ontotext.com/publication...
I would prefer that every triple would have an identifier in the 4th slot.
Using this identifier one could indicate to which graphs the statement belongs and add other reified info.
As already mentioned, I guess you are searching for something like 'statement identifier'. The 'triple set' approach from OWLIM seems to have promise (cf. also my thoughts related to this topic 1 and 2).
AFAIK, Virtuoso has also a 5th row in their triple store for internal use as a 'statement identifier'.
1 http://www.semanticoverflow.com/que...
2 http://lists.w3.org/Archives/Public...
I use a separate graph2triple table in ARC (with triple ids) since I ran into massive combinatorial explosions with an earlier quad-column layout (see link below).
As a result, however, I couldn't implement some of SPARQL's features directly with the normalized graph table (certain combinations of GRAPH + OPTIONAL, IIRC, but then ARC is probably the last store that directly translates SPARQL to a single SQL query, that'll become impossible in SPARQL 1.1. anyway).
Best,
Benji
(genuine question:!)
To what extent is this really an issue with the formal account of named graphs (and SPARQL's version of it), versus an implementation detail for people actually building RDF storage and query systems? Can't the redundancy you mention be optimised away internally within the implementation?
Hello!
Thanks for all the detailed answers - a couple of comments:
@zazi @PaulZH It looks like triple sets works around the issue by adding a 5th element, is that right? If so I guess it has the same issue - triples in multiple graphs will end up being replicated?
@danbri It is an implementation detail :-) I don't think there is nothing wrong with the formal account of Named Graphs (well, apart from the fact that it requires graphs to be named, but that's another thing). I am just wondering why the 'quad-store' trick for implementing named graphs is dominant, when it does seem to lead to a fair lot of data duplication.
For implementation, storing as triple+list would mean the base level storage had to have lists and support for looking things up in lists.
Storing as quads means that an indexed graph column can be used allowing GRAPH <G> { ....} to work well.
It's a tradoff. For persistent stores - indexed columns are fast; list processing is not when using standard indexing techniques at scale. Whether the space reduction (if any the RDF term slots are usually just integers, either 4 or 8 bytes before compression), and better caching in RAM (disk version) compensates for lack of direct indexing will depend on the application.
For in-memory, the typical list data structure can take quite a few bytes itself (e.g space to grow or internal pointers). A very compact structure could be designed; it's likely to need to reorganise during data insertion which can be costly. If you wanted read-only-once-loaded, this is probably not a problem.
In summary - it all depends!
I'm with Andy on this one, but, generally speaking, I think that performance trumps storage space these days. If what you're building is targeted at end users, then interactive performance trumps pretty much all other concerns (or your software doesn't get used). Quads allows for easier indexing and better performance, so I'd assume most stores lean that way for that reason.
I spend recently some more thoughts about named semantic graph storing on the overall hot topic 'context modelling' for semantic graphs. My current conclusion is also the usage of quadruple with an explicit 4th column for a 'statement identifier'. This design allows triple (statement) identification over the enclosures of named semantic graphs und furthermore the treatment of provenance for statements and graphs in a same way and also the handling of n-ary relations and its mapping to binary relations (property reification).
Here is an example:
<#alice> :friend <#bob> <#s1> # a simple binary relation
<#ab> rdf:type :HumanRelationship <#s2> # a property reification (an n-ary relation)
<#ab> :participant <#alice> <#s3>
<#ab> :participant <#bob> <#s4>
<#ab> :relationshipType :friendship <#s5>
<#ab> :since "2010-12-12"^xsd:date <#s6>
<#s1> :reified <#ab> <#s7> # an inference that results in a reification relation
<#g1> rdf:type rdfg:Graph <#s8> # a graph enclosure
<#g1> :contains <#s1> <s#9>
<#g1> :contains <#s2> <#s10>
<#g1> :contains <#s3> <#s11>
<#g1> :contains <#s4> <#s12>
<#g1> :contains <#s5> <#s13>
<#g1> :contains <#s6> <#s14>
<#g1> :contains <#s7> <#s15>
<#g2> rdf:type rdfg:Graph <#s16> # another graph enclosure
<#g2> :contains <#s1> <#s17>
-> I can easily get all statements of a graph by using the :contains relation (indexing)
-> I can easily get semantically related property-oriented context descriptions (n-ary relations) by using the :reified relation
-> provenance on the statement and on the graph level can easily be done by describing the <#g*> and <#s*> information resources:
-> one can also decouple a reused statement by changing its statement identifier; that means, the triple of the statement are still the same but the relation to the original statement might now be another e.g., reflected by a provenance statement e.g., <#s20> :original <#s19>
-> statement descriptions can be typed i.e. by rdf:Statement
Can such a storing be efficient for the intended use case?
which might be planning to pursue along with automotive online. These internet websites have real useful information that could be available to players. Specific quantity connected with extra fees is often imposed, even so could well be certain that there is added online auctions that you might locate for your. Without ever all your membership rights, you'll probably still obviously try to find some of these auction sales; but in case you are a person in your car or truck winning bidder web, all information will likely you ought to.
united states of america?verts consular advice sheet set. Stocking an acceptable health insurance and having a medical related evacuation insurance policy need to be taken seriously. North america . Embassy will aid you to training course amounts if you decide to end up being sickly possibly you're sprained but you are in charge of the instalments. To acquire with an emergency infirmary on a rural part, evacuating you really with fresh air costs thousands of pounds. Bearing an insurance claim develop and a insurance packages character note is relevant along with understanding what sort of health companies your medical insurance consists of, are often the couple of things a person.
affected individuals are given delay pills, is due to the best power and even the main victims' relatives and also neighbors happen to be got into contact with. Vacation rentals To do with Tours And thus Passenger cars Seem to be Dicey Throughout Region That Have A healthy Range of St Wrecks An adolescent Frenchwoman, given that had been later declared, had become in pain an effective appendicitis combat found in Afghanistan when i noticed to become one mind boggling doctors issue throughout items runs. Your.