Database Structure For Tree Data Structure

Database Structure for Tree Data Structure

You mention the most commonly implemented, which is Adjacency List:
https://blogs.msdn.microsoft.com/mvpawardprogram/2012/06/25/hierarchies-convert-adjacency-list-to-nested-sets

There are other models as well, including materialized path and nested sets:
http://communities.bmc.com/communities/docs/DOC-9902

Joe Celko has written a book on this subject, which is a good reference from a general SQL perspective (it is mentioned in the nested set article link above).

Also, Itzik Ben-Gann has a good overview of the most common options in his book "Inside Microsoft SQL Server 2005: T-SQL Querying".

The main things to consider when choosing a model are:

1) Frequency of structure change - how frequently does the actual structure of the tree change. Some models provide better structure update characteristics. It is important to separate structure changes from other data changes however. For example, you may want to model a company's organizational chart. Some people will model this as an adjacency list, using the employee ID to link an employee to their supervisor. This is usually a sub-optimal approach. An approach that often works better is to model the org structure separate from employees themselves, and maintain the employee as an attribute of the structure. This way, when an employee leaves the company, the organizational structure itself does not need to be changes, just the association with the employee that left.

2) Is the tree write-heavy or read-heavy - some structures work very well when reading the structure, but incur additional overhead when writing to the structure.

3) What types of information do you need to obtain from the structure - some structures excel at providing certain kinds of information about the structure. Examples include finding a node and all its children, finding a node and all its parents, finding the count of child nodes meeting certain conditions, etc. You need to know what information will be needed from the structure to determine the structure that will best fit your needs.

Store tree data structure in database

You could store the data in a table using nested sets.

http://en.wikipedia.org/wiki/Nested_set_model#Example

I worry that your millions of nodes may make life difficult if you intend to constantly add new items. Perhaps that concern could be mitigated by using rational numbers instead of integers as the left and right values. Add a column for depth to speed up your desire to ask for descendants. I wrote some SQL to create the table and the stored procedures you asked for. I did it in SQL Server do the syntax might be slightly different but it's all standard SQL statements being executed. Also I just manually decided the upper and lower bounds for each Node. Obviously you'd have to deal with writing the code to get these nodes inserted (and maintained) in your database.

CREATE TABLE Tree(
    Node nvarchar(10) NOT NULL,
    Value int NOT NULL,
    L int NOT NULL,
    R int NOT NULL,
    Depth int NOT NULL,
);

INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('A', 100,  1, 28, 0);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('B', 100,  2,  3, 1);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('C', 300,  4,  5, 1);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('D', 150,  6, 25, 1);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('E', 200, 26, 27, 1);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('F', 400,  7,  8, 2);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('G', 250,  9, 10, 2);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('H', 500, 11, 12, 2);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('I', 350, 13, 21, 2);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('J', 100, 21, 22, 2);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('K',  50, 23, 24, 2);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('L', 100, 14, 15, 3);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('M', 300, 16, 17, 3);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('N', 200, 18, 19, 3);

CREATE PROCEDURE grandValue
    @Node NVARCHAR(10)
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @lbound INT;
    DECLARE @ubound INT;
    SELECT @lbound = L, @ubound = R FROM Tree WHERE Node = @Node
    SELECT SUM(Value) AS Total FROM TREE WHERE L >= @lbound AND R <= @ubound
    RETURN
END;

EXECUTE grandValue 'C';
EXECUTE grandValue 'I';
EXECUTE grandValue 'A';

CREATE PROCEDURE children
    @Node NVARCHAR(10)
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @lbound INT;
    DECLARE @ubound INT;
    DECLARE @depth INT;
    SELECT @lbound = L, @ubound = R, @depth=Depth FROM Tree WHERE Node = @Node
    SELECT Node FROM TREE WHERE L > @lbound AND R < @ubound AND Depth = (@depth + 1)
    RETURN
END;

EXECUTE children 'C';
EXECUTE children 'I';
EXECUTE children 'A';

CREATE PROCEDURE family
    @Node NVARCHAR(10)
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @lbound INT;
    DECLARE @ubound INT;
    SELECT @lbound = L, @ubound = R FROM Tree WHERE Node = @Node
    SELECT Node FROM TREE WHERE L > @lbound AND R < @ubound
    RETURN
END;

EXECUTE family 'C';
EXECUTE family 'I';
EXECUTE family 'A';

CREATE PROCEDURE parent
    @Node NVARCHAR(10)
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @lbound INT;
    DECLARE @ubound INT;
    DECLARE @depth INT;
    SELECT @lbound = L, @ubound = R, @depth = Depth FROM Tree WHERE Node = @Node
    SELECT Node FROM TREE WHERE L < @lbound AND R > @ubound AND Depth = (@depth - 1)
    RETURN
END;

EXECUTE parent 'C';
EXECUTE parent 'I';
EXECUTE parent 'A';

CREATE PROCEDURE ancestor
    @Node NVARCHAR(10)
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @lbound INT;
    DECLARE @ubound INT;
    SELECT @lbound = L, @ubound = R FROM Tree WHERE Node = @Node
    SELECT Node FROM TREE WHERE L < @lbound AND R > @ubound
    RETURN
END;

EXECUTE ancestor 'C';
EXECUTE ancestor 'I';
EXECUTE ancestor 'A';

For creating the nested sets in the table in the first place you can run some code to generate the inserts or start with the first node and then successively add each additional node - although since each add potentially modifies many of the nodes in the set there can be a lot of thrashing of the database as you build this.

Here's a stored procedure for adding a node as a child of another node:

CREATE PROCEDURE insertNode
    @ParentNode NVARCHAR(10), @NewNodeName NVARCHAR(10), @NewNodeValue INT
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @ubound INT;
    DECLARE @depth INT;
    SELECT @ubound = R, @depth = Depth FROM Tree WHERE Node = @ParentNode
    UPDATE Tree SET L = L + 2 WHERE L >= @ubound
    UPDATE Tree SET R = R + 2 WHERE R >= @ubound
    INSERT INTO Tree (Node, Value, L, R, Depth) VALUES (@NewNodeName, @NewNodeValue,  @ubound, @ubound + 1, @depth + 1);
    RETURN
END;

I got this from http://www.evanpetersen.com/item/nested-sets.html who also shows a nice graph walking algorithm for creating the initial L and R values. You'd have to enhance this to keep track of the depth as well but that's be easy.

How to represent a tree like structure in a db

I showed a solution similar to your nodes & edges tables, in my answer to the StackOverflow question: What is the most efficient/elegant way to parse a flat table into a tree? I call this solution "Closure Table".

I did a presentation on different methods of storing and using trees in SQL, Models for Hierarchical Data with SQL and PHP. I demonstrated that with the right indexes (depending on the queries you need to run), the Closure Table design can have very good performance, even over large collections of edges (about 500K edges in my demo).

I also covered the design in my book, SQL Antipatterns: Avoiding the Pitfalls of Database Programming.

Database structure and querying hierarchal data and trees of data

There are already good approaches out there which are more simple than the solution you propose.

Here are a couple of links which explain how to do it (we use this ourselves for much the same problem you describe and it works well).

Managing Hierarchical Data in MySQL (from MySQL)
Storing Hierarchical Data in a Database (from Sitepoint, but a clearer explanation, I think)

This makes inserting/updating more complex, but selecting portions of the tree structure far faster (with only one query). It allows finding all children of any given node in one query, and finding all the ancestors of a given node with one query.

Database Structure For Tree Data Structure