Database forestry: not a coding contest

Weedpacket

Given that there are often posts about how to do trees in a database, I thought it might be fun to have a wee expo to show how it's done. The easiest way to get a lot of different approaches is to get a lot of different people to do one each.

The idea is that you're storing a treelike structure in your database - where each record has zero or more children and at most one parent. What the records themselves are is not relevant, but typical usage is that one record is cited and you need to retrieve the entire subtree that has that record as its root.

How might you code such a thing? What's your choice of db schema? No doubt many of you have already faced this task before, and even built yourself a generic class for the job. So how does it work?

Due to the variation in DBMSs (subselects? stored procedures? etc.) out there, there seems little gain in comparing speed unless a common platform can be found that can host every variation that gets posted. So aesthetics are centre stage here (it's cruising, not drag racing). Of course, you're free to benchmark when describing whys and wherefores.

Shrike

Well to get the ball rolling: I wrote something a while back to create website navigation from XML. So I have adapted this to demonstrate how a tree structure is used.

The biggest benefit of using XML is that the tree structure is inherent in the document. A slightly lesser benefit is that I can use SimpleXML and XPath to manipulate the XML document with ease.

The example is here and source code is here. It uses some PHP 5 features, but could be rewritten for PHP 4, assuming SimpleXML would work on 4.
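To illustrate the point about the tree being inherent in the XML, here is a minimal sketch of pulling a subtree out of a navigation document. It is not Shrike's code: Python's `xml.etree.ElementTree` stands in for SimpleXML/XPath, and the `<nav>`/`<item>` layout, the `id` and `title` attributes, and the function name are all assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical navigation document; the element and attribute names are assumed.
NAV_XML = """
<nav>
  <item id="home" title="Home">
    <item id="products" title="Products">
      <item id="widgets" title="Widgets"/>
      <item id="gadgets" title="Gadgets"/>
    </item>
    <item id="about" title="About"/>
  </item>
</nav>
"""

def subtree_titles(xml_text, node_id):
    """Return the titles of the cited item and all of its descendants."""
    root = ET.fromstring(xml_text)
    # ElementTree's limited XPath support is enough to locate an item by id.
    # node_id is interpolated directly, so it should be a trusted value.
    node = root.find(f".//item[@id='{node_id}']")
    if node is None:
        return []
    # iter() walks the element itself and everything below it, in document order.
    return [el.get("title") for el in node.iter("item")]

print(subtree_titles(NAV_XML, "products"))
```

No parent or child bookkeeping is needed here: the nesting of the elements is the tree.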

Sgarissta

Having just faced this exact scenario, let me pose another question (I'll post my solution below). Given the tree scenario, which method seems more effective?

1.) A single table consisting of

[code]

|NODE_ID | DATA | PARENT_ID |

[/code]

Where PARENT_ID is a self-join back to a NODE_ID. This obviously has the disadvantage that a given node doesn't know its children, and therefore building an entire subtree requires multiple queries (or a select of the majority of the table).
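The "multiple queries" cost of the single-table layout can be sketched as follows. This is only an illustration under assumptions: it uses Python with an in-memory SQLite database, the table name `nodes`, the sample rows, and the function names are hypothetical, and it issues one query per level of the subtree rather than one per node.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory single-table (parent pointer) tree with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nodes ("
        " node_id INTEGER PRIMARY KEY,"
        " data TEXT,"
        " parent_id INTEGER REFERENCES nodes(node_id))"
    )
    # Hypothetical sample tree: 1 is the root; 2 and 3 are its children;
    # 4 and 5 hang under 2; 6 hangs under 3.
    conn.executemany(
        "INSERT INTO nodes VALUES (?, ?, ?)",
        [(1, "root", None), (2, "a", 1), (3, "b", 1),
         (4, "a1", 2), (5, "a2", 2), (6, "b1", 3)],
    )
    return conn

def subtree_ids(conn, root_id):
    """Return the ids in the subtree rooted at root_id, level by level."""
    found = [root_id]
    frontier = [root_id]
    while frontier:
        # One query per level: fetch every child of the current frontier.
        marks = ",".join("?" * len(frontier))
        rows = conn.execute(
            f"SELECT node_id FROM nodes WHERE parent_id IN ({marks}) "
            "ORDER BY node_id",
            frontier,
        ).fetchall()
        frontier = [row[0] for row in rows]
        found.extend(frontier)
    return found

conn = build_demo_db()
print(subtree_ids(conn, 2))  # [2, 4, 5]
```

The number of round trips grows with the depth of the subtree, which is the trade-off described above. Some DBMSs can do the same walk in one statement with a recursive query, but that depends on the platform.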

2.) Two tables as defined below...

[code]

NODES_TABLE
| NODE_ID | DATA |

NODE_CHILDREN
| NODE_CHILD_ID | NODE_ID | CHILD_ID |

[/code]

In the second scenario there is an entry in the NODE_CHILDREN table for every combination of a parent and a child. This obviously has the disadvantage of requiring more data to be stored, but can increase the speed/efficiency of sub-tree queries based on a given parent.

Given this, I could not decide which solution was the most efficient in all situations. In several cases I needed the flexibility of the single table, and the ability to "recurse" from a child on up through each of its successive parents. On the other hand, in the case of drawing an interface to display the tree, the second solution required far less computation, and less extra data needed to be queried. My solution was to implement both side by side. This gave me the flexibility of using whichever method fit the given scenario, at the cost of only one extra column storing a single ID.
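A rough sketch of the side-by-side arrangement, assuming Python with an in-memory SQLite database. The lowercase table and column names mirror the layouts above, but the helper functions (`add_node`, `ancestors`, `children`) are hypothetical and not Sgarissta's actual code.

```python
import sqlite3

def make_db():
    """Create both layouts: a parent_id column and a separate children table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE nodes (
            node_id   INTEGER PRIMARY KEY,
            data      TEXT,
            parent_id INTEGER               -- the one extra column
        );
        CREATE TABLE node_children (
            node_child_id INTEGER PRIMARY KEY,
            node_id       INTEGER NOT NULL, -- the parent
            child_id      INTEGER NOT NULL
        );
    """)
    return conn

def add_node(conn, node_id, data, parent_id=None):
    """Insert a node, keeping both representations in step."""
    conn.execute("INSERT INTO nodes VALUES (?, ?, ?)", (node_id, data, parent_id))
    if parent_id is not None:
        conn.execute(
            "INSERT INTO node_children (node_id, child_id) VALUES (?, ?)",
            (parent_id, node_id),
        )

def ancestors(conn, node_id):
    """Walk up via parent_id; returns the nearest parent first."""
    chain = []
    query = "SELECT parent_id FROM nodes WHERE node_id = ?"
    row = conn.execute(query, (node_id,)).fetchone()
    while row is not None and row[0] is not None:
        chain.append(row[0])
        row = conn.execute(query, (row[0],)).fetchone()
    return chain

def children(conn, node_id):
    """Look down one level via the node_children table."""
    rows = conn.execute(
        "SELECT child_id FROM node_children WHERE node_id = ? ORDER BY child_id",
        (node_id,),
    )
    return [row[0] for row in rows]
```

The price of having both is that every insert, move, or delete has to update two places, which is why the sketch funnels writes through a single `add_node` helper.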

ahundiak

Sgarissta,

Weedpacket specified that you need to be able to retrieve the complete sub-tree of an element. I don't think your second table can do that without repeated queries. In fact, I don't really see a difference between your two solutions.

To pull a complete sub-tree (without getting into stored procedures and whatnot) probably requires a table like:

ancestor_id
descendent_id
generation

With one entry for each ancestor/descendant relation. The generation indicates whether a node is a child, a grandchild, or further down.

Easy to query. Slow to update. And contains redundant data.
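The easy-query, slow-update trade-off can be sketched like this, assuming Python with an in-memory SQLite database. The table name `relations`, the helper names, and the convention that generation 1 means "direct child" are assumptions for the example; `add_child` also assumes the new node is a leaf with no subtree of its own yet.

```python
import sqlite3

def make_relations_db():
    """Create the ancestor/descendant table described above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE relations ("
        " ancestor_id INTEGER NOT NULL,"
        " descendent_id INTEGER NOT NULL,"
        " generation INTEGER NOT NULL,"
        " PRIMARY KEY (ancestor_id, descendent_id))"
    )
    return conn

def add_child(conn, parent_id, child_id):
    """Attach a new leaf: one row per ancestor, hence the slow update."""
    # The direct parent/child relation (generation 1, by assumption).
    conn.execute("INSERT INTO relations VALUES (?, ?, 1)", (parent_id, child_id))
    # Every ancestor of the parent is also an ancestor of the new child,
    # one generation further away.
    conn.execute(
        "INSERT INTO relations (ancestor_id, descendent_id, generation) "
        "SELECT ancestor_id, ?, generation + 1 "
        "FROM relations WHERE descendent_id = ?",
        (child_id, parent_id),
    )

def descendants(conn, root_id):
    """One query returns the whole subtree below root_id."""
    return conn.execute(
        "SELECT descendent_id, generation FROM relations "
        "WHERE ancestor_id = ? ORDER BY generation, descendent_id",
        (root_id,),
    ).fetchall()

conn = make_relations_db()
for parent, child in [(1, 2), (1, 3), (2, 4), (4, 5)]:
    add_child(conn, parent, child)
print(descendants(conn, 1))  # [(2, 1), (3, 1), (4, 2), (5, 3)]
```

Four parent/child links produce seven rows here, which is the redundancy mentioned above: the deeper the tree, the more rows each new node costs.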

Sgarissta

Originally posted by ahundiak

Weedpacket specified that you need to be able to retrieve the complete sub-tree of an element. I don't think your second table can do that without repeated queries. In fact, I don't really see a difference between your two solutions.

Maybe I'm just misreading... but all I see is the ability to pull a complete sub-tree; nowhere does it limit it to a single query.

Originally posted by Weedpacket
...you need to retrieve the entire subtree that has that record as its root

As for your solution, maybe I'm missing something, but I fail to see how it doesn't suffer from the same problem (requiring multiple queries) as my second solution. The only real additions your solution makes are the "level" of the current node in the tree and its "parent" node. But no matter how I look at it, getting more than a single node's immediate relations will require you to recurse (especially when looking "down" the tree).

Weedpacket

Originally posted by Sgarissta
Maybe I'm just misreading... but all I see is the ability to pull a complete sub-tree; nowhere does it limit it to a single query.

That's true; the terms are deliberately very loose. I didn't even say if the tree was large or small°! The idea is to get - and debate - as many different views of the issue as possible. (One may also want to manipulate the tree in certain ways, or determine other properties of the tree/records; needs inform design, form follows function.)

° Or even what constitutes a "large" tree.

ahundiak

In my design, for each node there would be one record for each child, grandchild, great-grandchild etc. So one query could retrieve the complete sub-tree.

Of course, my design does not work as presented. A parent_id would need to be added to each relation. And it would take a bit of processing to then rebuild the tree.

So it's one fast query and some slow processing, or multiple queries with less processing.
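The "bit of processing" needed to rebuild the tree from flat rows might look something like the sketch below. It assumes each row fetched by the single query carries a `(node_id, parent_id)` pair; the function name and the nested-dict output shape are choices made for the example, not part of the design described above.

```python
from collections import defaultdict

def build_tree(root_id, rows):
    """Nest flat (node_id, parent_id) rows into {node_id: {child_id: {...}}}."""
    # Group the rows by parent so each node's children can be looked up directly.
    children = defaultdict(list)
    for node_id, parent_id in rows:
        children[parent_id].append(node_id)

    def nest(node_id):
        # Recursion depth equals tree depth; very deep trees would need
        # an iterative version instead.
        return {child: nest(child) for child in children.get(node_id, [])}

    return {root_id: nest(root_id)}

# Hypothetical result set for the subtree under node 1.
rows = [(2, 1), (3, 1), (4, 2), (5, 4)]
print(build_tree(1, rows))  # {1: {2: {4: {5: {}}}, 3: {}}}
```

Each row is touched once to group it and once to nest it, so the processing is linear in the size of the subtree; whether that beats several smaller queries depends on the tree and the platform.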

Sxooter

If you're using PostgreSQL, look in the contrib/ltree directory for functions designed to do this.

Ahh, the advantages of an extensible database.