What Is the Most Efficient/Elegant Way to Parse a Flat Table into a Tree

What is the most efficient/elegant way to parse a flat table into a tree?

Now that MySQL 8.0 supports recursive queries, we can say that all popular SQL databases support recursive queries in standard syntax.

WITH RECURSIVE MyTree AS (
SELECT * FROM MyTable WHERE ParentId IS NULL
UNION ALL
SELECT m.* FROM MyTABLE AS m JOIN MyTree AS t ON m.ParentId = t.Id
)
SELECT * FROM MyTree;

I tested recursive queries in MySQL 8.0 in my presentation Recursive Query Throwdown in 2017.

Below is my original answer from 2008:


There are several ways to store tree-structured data in a relational database. What you show in your example uses two methods:

  • Adjacency List (the "parent" column) and
  • Path Enumeration (the dotted-numbers in your name column).

Another solution is called Nested Sets, and it can be stored in the same table too. Read "Trees and Hierarchies in SQL for Smarties" by Joe Celko for a lot more information on these designs.

I usually prefer a design called Closure Table (aka "Adjacency Relation") for storing tree-structured data. It requires another table, but then querying trees is pretty easy.

I cover Closure Table in my presentation Models for Hierarchical Data with SQL and PHP and in my book SQL Antipatterns: Avoiding the Pitfalls of Database Programming.

CREATE TABLE ClosureTable (
ancestor_id INT NOT NULL REFERENCES FlatTable(id),
descendant_id INT NOT NULL REFERENCES FlatTable(id),
PRIMARY KEY (ancestor_id, descendant_id)
);

Store all paths in the Closure Table, where there is a direct ancestry from one node to another. Include a row for each node to reference itself. For example, using the data set you showed in your question:

INSERT INTO ClosureTable (ancestor_id, descendant_id) VALUES
(1,1), (1,2), (1,4), (1,6),
(2,2), (2,4),
(3,3), (3,5),
(4,4),
(5,5),
(6,6);

Now you can get a tree starting at node 1 like this:

SELECT f.* 
FROM FlatTable f
JOIN ClosureTable a ON (f.id = a.descendant_id)
WHERE a.ancestor_id = 1;

The output (in MySQL client) looks like the following:

+----+
| id |
+----+
| 1 |
| 2 |
| 4 |
| 6 |
+----+

In other words, nodes 3 and 5 are excluded, because they're part of a separate hierarchy, not descending from node 1.


Re: comment from e-satis about immediate children (or immediate parent). You can add a "path_length" column to the ClosureTable to make it easier to query specifically for an immediate child or parent (or any other distance).

INSERT INTO ClosureTable (ancestor_id, descendant_id, path_length) VALUES
(1,1,0), (1,2,1), (1,4,2), (1,6,1),
(2,2,0), (2,4,1),
(3,3,0), (3,5,1),
(4,4,0),
(5,5,0),
(6,6,0);

Then you can add a term in your search for querying the immediate children of a given node. These are descendants whose path_length is 1.

SELECT f.* 
FROM FlatTable f
JOIN ClosureTable a ON (f.id = a.descendant_id)
WHERE a.ancestor_id = 1
AND path_length = 1;

+----+
| id |
+----+
| 2 |
| 6 |
+----+

Re comment from @ashraf: "How about sorting the whole tree [by name]?"

Here's an example query to return all nodes that are descendants of node 1, join them to the FlatTable that contains other node attributes such as name, and sort by the name.

SELECT f.name
FROM FlatTable f
JOIN ClosureTable a ON (f.id = a.descendant_id)
WHERE a.ancestor_id = 1
ORDER BY f.name;

Re comment from @Nate:

SELECT f.name, GROUP_CONCAT(b.ancestor_id order by b.path_length desc) AS breadcrumbs
FROM FlatTable f
JOIN ClosureTable a ON (f.id = a.descendant_id)
JOIN ClosureTable b ON (b.descendant_id = a.descendant_id)
WHERE a.ancestor_id = 1
GROUP BY a.descendant_id
ORDER BY f.name

+------------+-------------+
| name | breadcrumbs |
+------------+-------------+
| Node 1 | 1 |
| Node 1.1 | 1,2 |
| Node 1.1.1 | 1,2,4 |
| Node 1.2 | 1,6 |
+------------+-------------+

A user suggested an edit today. SO moderators approved the edit, but I am reversing it.

The edit suggested that the ORDER BY in the last query above should be ORDER BY b.path_length, f.name, presumably to make sure the ordering matches the hierarchy. But this doesn't work, because it would order "Node 1.1.1" after "Node 1.2".

If you want the ordering to match the hierarchy in a sensible way, that is possible, but not simply by ordering by the path length. For example, see my answer to MySQL Closure Table hierarchical database - How to pull information out in the correct order.

How to efficiently build a tree from a flat structure?

Store IDs of the objects in a hash table mapping to the specific object. Enumerate through all the objects and find their parent if it exists and update its parent pointer accordingly.

class MyObject
{ // The actual object
public int ParentID { get; set; }
public int ID { get; set; }
}

class Node
{
public List<Node> Children = new List<Node>();
public Node Parent { get; set; }
public MyObject AssociatedObject { get; set; }
}

IEnumerable<Node> BuildTreeAndGetRoots(List<MyObject> actualObjects)
{
Dictionary<int, Node> lookup = new Dictionary<int, Node>();
actualObjects.ForEach(x => lookup.Add(x.ID, new Node { AssociatedObject = x }));
foreach (var item in lookup.Values) {
Node proposedParent;
if (lookup.TryGetValue(item.AssociatedObject.ParentID, out proposedParent)) {
item.Parent = proposedParent;
proposedParent.Children.Add(item);
}
}
return lookup.Values.Where(x => x.Parent == null);
}

What is the best practice for fetching a tree of nodes from database for further rendering?

I usually recommend a design called Closure Table.

See example in my answer to What is the most efficient/elegant way to parse a flat table into a tree?

I also designed this presentation: Models for Hierarchical Data with SQL and PHP. I developed a PHP app that render a tree in 0.3 seconds, from a collection of hierarchical data with 490k nodes.

I blogged about Closure Table here: Rendering Trees with Closure Table.

I wrote a chapter about different strategies for hierarchical data in my book, SQL Antipatterns: Avoiding the Pitfalls of Database Programming.

How to retrieve the path to a node in a tree - efficiently (related to the post 'parse a flat table into a tree?')

SELECT ancestor_id
FROM ClosureTable
WHERE descendant_id = 4;

Returns the values 1, 2, 4. But they are returned on separate rows, and they give no indication that they're in the right order (we might not assume numerical order corresponds to tree hierarchy order).

Sometimes you would also store the depth for every path in the ClosureTable. But even if not, you can count how many ancestors a given node has, and use that to sort:

SELECT ct1.ancestor_id, COUNT(*) AS depth
FROM ClosureTable ct1
JOIN ClosureTable ct2 ON (ct1.ancestor_id = ct2.descendant_id)
WHERE ct1.descendant_id = 4
GROUP BY ct1.ancestor_id
ORDER BY depth;

Yes, this still returns the result in three rows. If you use MySQL, you have access to GROUP_CONCAT(). Otherwise it's easy to fetch three rows and concatenate their values in application code.

SQL query to find empty nodes in a tree structure table

I could use recursion to find the data nodes within each 'node' node. The first one is for SqlServer:

with treeTable as (
select *
from (values
(1,'1','node',0,null)
,(2,'2','node',1,1)
,(3,'3','node',1,1)
,(4,'4','node',2,2)
,(5,'5','node',2,3)
,(6,'6','node',2,3)
,(7,'7','node',3,4)
,(8,'8','node',3,6)
,(9,'**','data',2,2)
,(10,'**','data',2,2)
,(11,'**','data',3,4)
,(12,'**','data',3,5)
) T(id, name, type, depth, parentId)
),
Hry as (
SELECT StartNode=TN.id, ThisNode=TN.id, TN.name, tn.type, tn.depth, L=0
FROM treeTable TN
union all
SELECT StartNode=H.StartNode, ThisNode=TN.id, TN.name, TN.type, TN.depth, L=L+1
FROM treeTable TN
inner join
Hry H
on H.ThisNode=TN.ParentId
)
/* Now repeat for each 'node' node: recursively traverse each sub-tree and see if there are data nodes underneath */
select T.id
from treeTable T
left join
Hry H
on T.id=H.StartNode
and H.type='data'
where T.type='node'
and H.StartNode is null

For Sqlite

with treeTable as (
select 1 as id, '1'as name,'node' as type,0 as depth ,null as parentID union all
select 2 as id, '2'as name,'node' as type,1 as depth , 1 as parentID union all
select 3 as id, '3'as name,'node' as type,1 as depth , 1 as parentID union all
select 4 as id, '4'as name,'node' as type,2 as depth , 2 as parentID union all
select 5 as id, '5'as name,'node' as type,2 as depth , 3 as parentID union all
select 6 as id, '6'as name,'node' as type,2 as depth , 3 as parentID union all
select 7 as id, '7'as name,'node' as type,3 as depth , 4 as parentID union all
select 8 as id, '8'as name,'node' as type,3 as depth , 6 as parentID union all
select 9 as id,'**'as name,'data' as type,2 as depth , 2 as parentID union all
select 10 as id,'**'as name,'data' as type,2 as depth , 2 as parentID union all
select 11 as id,'**'as name,'data' as type,3 as depth , 4 as parentID union all
select 12 as id,'**'as name,'data' as type,3 as depth , 5 as parentID
),
Hry as (
SELECT TN.id as StartNode, TN.id as ThisNode, TN.name, tn.type, tn.depth, 0 as L
FROM treeTable TN
union all
SELECT H.StartNode as StartNode, TN.id as ThisNode, TN.name, TN.type, TN.depth, L+1 as L
FROM treeTable TN
inner join
Hry H
on H.ThisNode=TN.ParentId
)
/* Now repeat for each 'node' node: recursively traverse each sub-tree and see if there are data nodes underneath */
select T.id
from treeTable T
left join
Hry H
on T.id=H.StartNode
and H.type='data'
where T.type='node'
and H.StartNode is null

Whats faster in Oracle? Small table with tree structure vs. Huge flat table

I would definitely go for the first option (hierarchical approach). I think it's better to model the data correctly than to just use a bad data model to gain performance. Since you are modeling a hierarchy here, it makes sense to store it that way in the DB.

If you want the best of both worlds, my recommendation would be to look at using a materialized view to "flatten" the hierarchical data, then you are still storing the data properly, but you get the performance gains (if any) by using the materialized view.

There's almost always a way to follow a good data model and still find ways to get good performance. But a bad data model will cost you for years to come, and it takes great pain to correct it later.

However, even with the flattened approach, you have to consider that you are increasing the number of records dramatically, especially as you get to the leaf nodes in the tree, so I'd be surprised if having a flat hierarchy table (your second approach) would improve performance since there are many more records to process.



Related Topics



Leave a reply



Submit