Optimized SQL for Tree Structures

Tree depth in SQL

You'll need recursion capability. Using recursive queries this will be:

WITH RECURSIVE node_ancestors(node_id, parent_id) AS (
SELECT id, id FROM tree_node WHERE id IN (1, 2, 3)
UNION ALL
SELECT na.node_id, tree_node.parent_id
FROM node_ancestors AS na, tree_node
WHERE tree_node.id = na.parent_id AND tree_node.parent_id IS NOT NULL
)
SELECT node_id, COUNT(parent_id) AS depth FROM node_ancestors GROUP BY node_id;

The other options are doing the recursion in stored procedures, doing the recursion in the application and limiting the amount of recursion and using lots of joins. (The last option isn't really efficient for non-trivial depths)

How do I optimize sql for reading multiple trees?

I have solved my problem now as follows:

I've split up my single table with all entries into two tables. One table contains all root entries of the trees, the other one contains all children (and the children of the children and so on).

So the first query is done against the root table, the 2nd one against the children. Because the children table is only about quarter of size now, the query is way faster.

Also, I realized that one of my indexes was defined wrong so that it did not get into action. I realized this by an explain when I tried my problem on a db2 database using the db2 tools.

So, my tip for everyone out there: Keep your tables small and use explains to see if you're indexes are doing what they're supposed to do.

What is the most efficient/elegant way to parse a flat table into a tree?

Now that MySQL 8.0 supports recursive queries, we can say that all popular SQL databases support recursive queries in standard syntax.

WITH RECURSIVE MyTree AS (
SELECT * FROM MyTable WHERE ParentId IS NULL
UNION ALL
SELECT m.* FROM MyTABLE AS m JOIN MyTree AS t ON m.ParentId = t.Id
)
SELECT * FROM MyTree;

I tested recursive queries in MySQL 8.0 in my presentation Recursive Query Throwdown in 2017.

Below is my original answer from 2008:


There are several ways to store tree-structured data in a relational database. What you show in your example uses two methods:

  • Adjacency List (the "parent" column) and
  • Path Enumeration (the dotted-numbers in your name column).

Another solution is called Nested Sets, and it can be stored in the same table too. Read "Trees and Hierarchies in SQL for Smarties" by Joe Celko for a lot more information on these designs.

I usually prefer a design called Closure Table (aka "Adjacency Relation") for storing tree-structured data. It requires another table, but then querying trees is pretty easy.

I cover Closure Table in my presentation Models for Hierarchical Data with SQL and PHP and in my book SQL Antipatterns: Avoiding the Pitfalls of Database Programming.

CREATE TABLE ClosureTable (
ancestor_id INT NOT NULL REFERENCES FlatTable(id),
descendant_id INT NOT NULL REFERENCES FlatTable(id),
PRIMARY KEY (ancestor_id, descendant_id)
);

Store all paths in the Closure Table, where there is a direct ancestry from one node to another. Include a row for each node to reference itself. For example, using the data set you showed in your question:

INSERT INTO ClosureTable (ancestor_id, descendant_id) VALUES
(1,1), (1,2), (1,4), (1,6),
(2,2), (2,4),
(3,3), (3,5),
(4,4),
(5,5),
(6,6);

Now you can get a tree starting at node 1 like this:

SELECT f.* 
FROM FlatTable f
JOIN ClosureTable a ON (f.id = a.descendant_id)
WHERE a.ancestor_id = 1;

The output (in MySQL client) looks like the following:

+----+
| id |
+----+
| 1 |
| 2 |
| 4 |
| 6 |
+----+

In other words, nodes 3 and 5 are excluded, because they're part of a separate hierarchy, not descending from node 1.


Re: comment from e-satis about immediate children (or immediate parent). You can add a "path_length" column to the ClosureTable to make it easier to query specifically for an immediate child or parent (or any other distance).

INSERT INTO ClosureTable (ancestor_id, descendant_id, path_length) VALUES
(1,1,0), (1,2,1), (1,4,2), (1,6,1),
(2,2,0), (2,4,1),
(3,3,0), (3,5,1),
(4,4,0),
(5,5,0),
(6,6,0);

Then you can add a term in your search for querying the immediate children of a given node. These are descendants whose path_length is 1.

SELECT f.* 
FROM FlatTable f
JOIN ClosureTable a ON (f.id = a.descendant_id)
WHERE a.ancestor_id = 1
AND path_length = 1;

+----+
| id |
+----+
| 2 |
| 6 |
+----+

Re comment from @ashraf: "How about sorting the whole tree [by name]?"

Here's an example query to return all nodes that are descendants of node 1, join them to the FlatTable that contains other node attributes such as name, and sort by the name.

SELECT f.name
FROM FlatTable f
JOIN ClosureTable a ON (f.id = a.descendant_id)
WHERE a.ancestor_id = 1
ORDER BY f.name;

Re comment from @Nate:

SELECT f.name, GROUP_CONCAT(b.ancestor_id order by b.path_length desc) AS breadcrumbs
FROM FlatTable f
JOIN ClosureTable a ON (f.id = a.descendant_id)
JOIN ClosureTable b ON (b.descendant_id = a.descendant_id)
WHERE a.ancestor_id = 1
GROUP BY a.descendant_id
ORDER BY f.name

+------------+-------------+
| name | breadcrumbs |
+------------+-------------+
| Node 1 | 1 |
| Node 1.1 | 1,2 |
| Node 1.1.1 | 1,2,4 |
| Node 1.2 | 1,6 |
+------------+-------------+

A user suggested an edit today. SO moderators approved the edit, but I am reversing it.

The edit suggested that the ORDER BY in the last query above should be ORDER BY b.path_length, f.name, presumably to make sure the ordering matches the hierarchy. But this doesn't work, because it would order "Node 1.1.1" after "Node 1.2".

If you want the ordering to match the hierarchy in a sensible way, that is possible, but not simply by ordering by the path length. For example, see my answer to MySQL Closure Table hierarchical database - How to pull information out in the correct order.

Database Structure for Tree Data Structure

You mention the most commonly implemented, which is Adjacency List:
https://blogs.msdn.microsoft.com/mvpawardprogram/2012/06/25/hierarchies-convert-adjacency-list-to-nested-sets

There are other models as well, including materialized path and nested sets:
http://communities.bmc.com/communities/docs/DOC-9902

Joe Celko has written a book on this subject, which is a good reference from a general SQL perspective (it is mentioned in the nested set article link above).

Also, Itzik Ben-Gann has a good overview of the most common options in his book "Inside Microsoft SQL Server 2005: T-SQL Querying".

The main things to consider when choosing a model are:

1) Frequency of structure change - how frequently does the actual structure of the tree change. Some models provide better structure update characteristics. It is important to separate structure changes from other data changes however. For example, you may want to model a company's organizational chart. Some people will model this as an adjacency list, using the employee ID to link an employee to their supervisor. This is usually a sub-optimal approach. An approach that often works better is to model the org structure separate from employees themselves, and maintain the employee as an attribute of the structure. This way, when an employee leaves the company, the organizational structure itself does not need to be changes, just the association with the employee that left.

2) Is the tree write-heavy or read-heavy - some structures work very well when reading the structure, but incur additional overhead when writing to the structure.

3) What types of information do you need to obtain from the structure - some structures excel at providing certain kinds of information about the structure. Examples include finding a node and all its children, finding a node and all its parents, finding the count of child nodes meeting certain conditions, etc. You need to know what information will be needed from the structure to determine the structure that will best fit your needs.

How to store directory / hierarchy / tree structure in the database?

There are many ways to store hierarchies in SQL databases. Which one to choose depends on which DBMS product you use, and how the data will be used. As you have used the MSSQL2005 tag, I think you should start considering the "Adjacency List" model; if you find that it doesn't perform well for your application, then have a look at Vadim Tropashko's comparison which highlights differences between models with a focus on multiple performance characteristics.

Tree Structure Query in SQL Server

It looks like what is going on is that the compiler is not able to push down the outer predicate IsDeleted = 0 into the CTE. It therefore cannot use the umg_query_index as recommended by O. Jones.

Why that is the case, is because each run of the recursive section could return rows which are IsDeleted = 1, which need to feed back into the next run. While it is true that they do not appear in the final result, it is still possible that such a row may have child rows that do need to appear in the final resultset. So there is no way to eliminate them from the recursive join.

You have two options:

  1. Change your IX_UserManagement_GroupDef_2 index, so that it also contains the IsDeleted column, it does not need to be in the key, it can be an INCLUDE:
CREATE NONCLUSTERED INDEX [IX_UserManagement_GroupDef_2] ON [dbo].[UserManagement_GroupDef]
([ParentGroupId] ASC) INCLUDE (IsDeleted)
WITH (DROP_EXISTING = ON, ONLINE = ON) ON [PRIMARY]

  1. Probably the better option, push down the predicate yourself. Then it will use O. Jones' index on both parts of the CTE.
WITH tblChild AS
(
SELECT *
FROM UserManagement_GroupDef
WHERE ID = 155 AND IsDeleted = 0

UNION ALL

SELECT
u.*
FROM
UserManagement_GroupDef u
JOIN
tblChild ON u.ParentGroupId = tblChild.ID
WHERE u.IsDeleted = 0
)
SELECT *
FROM tblChild

This query has different semantics from your original one. It will not return any rows for which the parent is IsDeleted = 1.

At this point you could also change your new index to be a filtered one, as this means any IsDeleted rows will not be stored.

CREATE NONCLUSTERED INDEX umg_query_index ON UserManagement_GroupDef
(ParentGroupId) INCLUDE (IsDeleted) WHERE IsDeleted = 0
WITH (DROP_EXISTING = ON, ONLINE = ON) ON [PRIMARY];

You must include the IsDeleted column in the index, otherwise the optimizer may fail to use it, due to faulty logic as currently implemented.


One further note: single-column indexes are not usually that useful and should generally be avoided. It is not worth it for the server to try union two indexes together, it will scan the clustered index instead.

I suggest you drop the indexes IX_UserManagement_GroupDef and IX_UserManagement_GroupDef_3 as they are now completely superfluous.



Related Topics



Leave a reply



Submit