## Named Graphs and quad stores

By Yves on Tuesday 21 December 2010, 10:53 - Permalink

I was recently looking again at the Named Graphs paper, and was wondering why the consensus for implementing Named Graphs support in triple stores was to store quads.

Named Graphs are defined as follows. We consider **U** as being
the set of all web identifiers, **L** as being the set of all
literals and **B** the set of all blank nodes. **U**
, **L** and **B** are pairwise disjoint. Let
**N** = **U** ∪**L**∪**B**,
then the set **T** = **N** × **U** ×
**N** is the set of all RDF triples. The set of RDF graphs
**G** is the power set of **T**, and a named graph is
a pair **ng** = (**u**, **g**), where
**u** ∈ **U** and **g** ∈
**G**.

As mentioned in one my
previous post, one thing that I am really keen on is the idea of having
triples that belong to multiple graphs. This situation is already happening
in the wild

a fair bit. If you look at a /programmes
aggregation feed (all available Radio 4 programmes starting with g

),
it mentions multiple programmes. All of the statements about each of these
programmes also belong to the corresponding programme graph (e.g. Gardener's Question).

Nothing in the above definition forbids that from happening - each graph and sub-graph can be named. ( (a,b,c), (d,f,g) ) can be named x and ( (a,b,c) ) can be named y.

Now, most triple stores implement Named Graphs by, in fact, storing quads. They store tuples of the form (subject, predicate, object, graph). Which means that when you try to store the two graphs above, they will in fact store (a,b,c,x), (d,f,g,x) and (a,b,c,y) --- the triples will be replicated in each graph they are mentioned in. For the above /programmes example, it is terribly inefficient. If you crawl /programmes, you will end up hitting lots of such aggregation URLs, leading in a tremendous amount of data duplication - lots of triples will be repeated across graphs. Perhaps it would be better to store something like (a,b,c,(x,y)) and (d,f,g,(x))?

I would be curious to know if anyone else is hitting this issue. I can think of multiple use-cases for triples belonging to multiple graphs: access control, versioning, attribution at different granularities...

## Comments

Have a look at "triple sets" from the OWLIM people.

http://www.ontotext.com/publication...

I would prefer that every triple would have an identifier in the 4th slot.

Using this identifier one could indicate to which graphs the statement belongs and add other reified info.

As already mentioned, I guess you are searching for something like 'statement identifier'. The 'triple set' approach from OWLIM seems to have promise (cf. also my thoughts related to this topic 1 and 2).

AFAIK, Virtuoso has also a 5th row in their triple store for internal use as a 'statement identifier'.

1 http://www.semanticoverflow.com/que...

2 http://lists.w3.org/Archives/Public...

I use a separate graph2triple table in ARC (with triple ids) since I ran into massive combinatorial explosions with an earlier quad-column layout (see link below).

As a result, however, I couldn't implement some of SPARQL's features directly with the normalized graph table (certain combinations of GRAPH + OPTIONAL, IIRC, but then ARC is probably the last store that directly translates SPARQL to a single SQL query, that'll become impossible in SPARQL 1.1. anyway).

Best,

Benji

(genuine question:!)

To what extent is this really an issue with the formal account of named graphs (and SPARQL's version of it), versus an implementation detail for people actually building RDF storage and query systems? Can't the redundancy you mention be optimised away internally within the implementation?

Hello!

Thanks for all the detailed answers - a couple of comments:

@zazi @PaulZH It looks like triple sets works around the issue by adding a 5th element, is that right? If so I guess it has the same issue - triples in multiple graphs will end up being replicated?

@danbri It is an implementation detail :-) I don't think there is nothing wrong with the formal account of Named Graphs (well, apart from the fact that it requires graphs to be named, but that's another thing). I am just wondering why the 'quad-store' trick for implementing named graphs is dominant, when it does seem to lead to a fair lot of data duplication.

For implementation, storing as triple+list would mean the base level storage had to have lists and support for looking things up in lists.

Storing as quads means that an indexed graph column can be used allowing GRAPH <G> { ....} to work well.

It's a tradoff. For persistent stores - indexed columns are fast; list processing is not when using standard indexing techniques at scale. Whether the space reduction (if any the RDF term slots are usually just integers, either 4 or 8 bytes before compression), and better caching in RAM (disk version) compensates for lack of direct indexing will depend on the application.

For in-memory, the typical list data structure can take quite a few bytes itself (e.g space to grow or internal pointers). A very compact structure could be designed; it's likely to need to reorganise during data insertion which can be costly. If you wanted read-only-once-loaded, this is probably not a problem.

In summary - it all depends!

I'm with Andy on this one, but, generally speaking, I think that performance trumps storage space these days. If what you're building is targeted at end users, then interactive performance trumps pretty much all other concerns (or your software doesn't get used). Quads allows for easier indexing and better performance, so I'd assume most stores lean that way for that reason.

I spend recently some more thoughts about named semantic graph storing on the overall hot topic 'context modelling' for semantic graphs. My current conclusion is also the usage of quadruple with an explicit 4th column for a 'statement identifier'. This design allows triple (statement) identification over the enclosures of named semantic graphs und furthermore the treatment of provenance for statements and graphs in a same way and also the handling of n-ary relations and its mapping to binary relations (property reification).

Here is an example:

<#alice> :friend <#bob> <#s1> # a simple binary relation

<#ab> rdf:type :HumanRelationship <#s2> # a property reification (an n-ary relation)

<#ab> :participant <#alice> <#s3>

<#ab> :participant <#bob> <#s4>

<#ab> :relationshipType :friendship <#s5>

<#ab> :since "2010-12-12"^xsd:date <#s6>

<#s1> :reified <#ab> <#s7> # an inference that results in a reification relation

<#g1> rdf:type rdfg:Graph <#s8> # a graph enclosure

<#g1> :contains <#s1> <s#9>

<#g1> :contains <#s2> <#s10>

<#g1> :contains <#s3> <#s11>

<#g1> :contains <#s4> <#s12>

<#g1> :contains <#s5> <#s13>

<#g1> :contains <#s6> <#s14>

<#g1> :contains <#s7> <#s15>

<#g2> rdf:type rdfg:Graph <#s16> # another graph enclosure

<#g2> :contains <#s1> <#s17>

-> I can easily get all statements of a graph by using the :contains relation (indexing)

-> I can easily get semantically related property-oriented context descriptions (n-ary relations) by using the :reified relation

-> provenance on the statement and on the graph level can easily be done by describing the <#g*> and <#s*> information resources:

-> one can also decouple a reused statement by changing its statement identifier; that means, the triple of the statement are still the same but the relation to the original statement might now be another e.g., reflected by a provenance statement e.g., <#s20> :original <#s19>

-> statement descriptions can be typed i.e. by rdf:Statement

Can such a storing be efficient for the intended use case?