More Efficient Hierarchy System

I guess something along these lines should do (untested, must be adapted to your needs):

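// Note: the mysql_* functions were removed in PHP 7; use mysqli or PDO in current code.
$q = mysql_query("SELECT id, parent_id, name FROM categories");
while ($r = mysql_fetch_row($q)) {
  $names[$r[0]] = $r[2];
  $children[$r[1]][] = $r[0];  // index each category under its parent
}

// Assumes top-level categories have parent_id 0.
function render_select($root=0, $level=-1) {
  global $names, $children;
  if ($root != 0)
    echo '<option>' . str_repeat('&nbsp;', $level) . $names[$root] . '</option>';
  if (isset($children[$root]))  // leaf categories have no entry
    foreach ($children[$root] as $child)
      render_select($child, $level+1);
}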

echo '<select>';
render_select();
echo '</select>';

An even funkier way of doing this is to use SQL stored procedures, but that may be overkill in this case.
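For comparison, here is a minimal Python sketch of the same parent-to-children map idea. The rows and category names are made up for illustration, and `render_options` is a hypothetical helper, not part of any library; it assumes top-level categories have a parent ID of 0.

```python
from collections import defaultdict

# Hypothetical rows of (id, parent_id, name); parent_id 0 marks a top-level category.
rows = [
    (1, 0, "Books"),
    (2, 1, "Fiction"),
    (3, 1, "Science"),
    (4, 2, "Fantasy"),
    (5, 0, "Music"),
]

names = {}
children = defaultdict(list)
for node_id, parent_id, name in rows:
    names[node_id] = name
    children[parent_id].append(node_id)  # index each node under its parent


def render_options(root=0, level=-1):
    """Return one indented <option> line per node below root, depth first."""
    lines = []
    if root != 0:
        lines.append("<option>" + "&nbsp;" * level + names[root] + "</option>")
    for child in children.get(root, []):  # .get avoids creating empty entries
        lines.extend(render_options(child, level + 1))
    return lines


options = render_options()
print("\n".join(["<select>"] + options + ["</select>"]))
```

The whole table is read once and the recursion only touches the in-memory map, so no query is issued per level.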

How do I efficiently search this hierarchical structure?

Add all the nodes to a dictionary with the code as the key (you only need to do this once). A dictionary look-up is basically O(1).

void FillDictionary(Dictionary<string, Node> dictionary, Node node)
{
  if (dictionary.ContainsKey(node.Code))
    return;

  dictionary.Add(node.Code, node);

  foreach (Node child in node.Children)
    FillDictionary(dictionary, child);
}

If you know the root, usage will be:

var dictionary = new Dictionary<string, Node>();
FillDictionary(dictionary, rootNode);

If you don't, you can call the FillDictionary() method on all your nodes with the same dictionary.
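The same indexing idea can be sketched in Python roughly as follows. The `Node` class, the codes, and `fill_index` are hypothetical stand-ins for whatever node type you actually have.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    code: str
    children: list = field(default_factory=list)


def fill_index(index, node):
    """Add node and all of its descendants to index, keyed by code."""
    if node.code in index:
        return  # already visited, e.g. reached again from another starting node
    index[node.code] = node
    for child in node.children:
        fill_index(index, child)


leaf = Node("A1a")
root = Node("A", [Node("A1", [leaf]), Node("A2")])

index = {}
fill_index(index, root)
found = index.get("A1a")     # average-case constant-time lookup
missing = index.get("nope")  # None when the code is absent
```

The early return is what makes it safe to call the function on every node with the same dictionary when the root is unknown.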

What is the most efficient/elegant way to parse a flat table into a tree?

Now that MySQL 8.0 supports recursive queries, we can say that all popular SQL databases support recursive queries in standard syntax.

WITH RECURSIVE MyTree AS (
    SELECT * FROM MyTable WHERE ParentId IS NULL
    UNION ALL
    SELECT m.* FROM MyTable AS m JOIN MyTree AS t ON m.ParentId = t.Id
)
SELECT * FROM MyTree;

I tested recursive queries in MySQL 8.0 in my presentation Recursive Query Throwdown in 2017.
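If you want to try the recursive query without a MySQL server, a rough sketch using Python's built-in sqlite3 module could look like this (SQLite also accepts WITH RECURSIVE). The column list and the names of nodes 3 and 5 are assumptions, and the anchor is changed to start from one node instead of all roots.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (Id INTEGER PRIMARY KEY, ParentId INTEGER, Name TEXT)")
conn.executemany(
    "INSERT INTO MyTable (Id, ParentId, Name) VALUES (?, ?, ?)",
    [
        (1, None, "Node 1"),
        (2, 1, "Node 1.1"),
        (3, None, "Node 2"),    # assumed name
        (4, 2, "Node 1.1.1"),
        (5, 3, "Node 2.1"),     # assumed name
        (6, 1, "Node 1.2"),
    ],
)

# Same shape as the query above, but anchored at a single node (here node 1).
subtree = conn.execute(
    """
    WITH RECURSIVE MyTree AS (
        SELECT * FROM MyTable WHERE Id = ?
        UNION ALL
        SELECT m.* FROM MyTable AS m JOIN MyTree AS t ON m.ParentId = t.Id
    )
    SELECT Id, Name FROM MyTree ORDER BY Id
    """,
    (1,),
).fetchall()
print(subtree)
```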

Below is my original answer from 2008:

There are several ways to store tree-structured data in a relational database. What you show in your example uses two methods:

Adjacency List (the "parent" column) and
Path Enumeration (the dotted-numbers in your name column).

Another solution is called Nested Sets, and it can be stored in the same table too. Read "Trees and Hierarchies in SQL for Smarties" by Joe Celko for a lot more information on these designs.
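To give a feel for Nested Sets, here is a small Python sketch that numbers a tree with left/right bounds and then finds descendants with a simple range test. The tree, the `bounds` map, and the helper names are assumptions for illustration; in a database the two numbers would be columns on each row.

```python
from collections import defaultdict
from itertools import count

# Hypothetical adjacency list as child -> parent (None for roots).
parent = {1: None, 2: 1, 3: None, 4: 2, 5: 3, 6: 1}

children = defaultdict(list)
for node, par in sorted(parent.items()):
    children[par].append(node)

bounds = {}          # node -> (lft, rgt)
counter = count(1)   # shared numbering across the whole forest


def number(node):
    """Assign lft on the way down and rgt on the way back up."""
    lft = next(counter)
    for child in children.get(node, []):
        number(child)
    bounds[node] = (lft, next(counter))


for root in children[None]:
    number(root)


def descendants(node):
    """Nodes whose lft falls strictly inside node's (lft, rgt) interval."""
    lft, rgt = bounds[node]
    return sorted(n for n, (l, _) in bounds.items() if lft < l < rgt)
```

The trade-off is that reads become a single range comparison, while inserting or moving a node means renumbering part of the tree.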

I usually prefer a design called Closure Table (aka "Adjacency Relation") for storing tree-structured data. It requires another table, but then querying trees is pretty easy.

I cover Closure Table in my presentation Models for Hierarchical Data with SQL and PHP and in my book SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.

CREATE TABLE ClosureTable (
  ancestor_id   INT NOT NULL REFERENCES FlatTable(id),
  descendant_id INT NOT NULL REFERENCES FlatTable(id),
  PRIMARY KEY (ancestor_id, descendant_id)
);

Store all paths in the Closure Table, where there is a direct ancestry from one node to another. Include a row for each node to reference itself. For example, using the data set you showed in your question:

INSERT INTO ClosureTable (ancestor_id, descendant_id) VALUES
  (1,1), (1,2), (1,4), (1,6),
  (2,2), (2,4),
  (3,3), (3,5),
  (4,4),
  (5,5),
  (6,6);
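Those rows do not have to be written by hand. As a sketch, they can be derived from the adjacency list by walking from each node up to its root; the `parent` map below is inferred from the rows above, and `closure_rows` is a hypothetical helper.

```python
# Adjacency list as child -> parent (None for roots), inferred from the INSERT above.
parent = {1: None, 2: 1, 3: None, 4: 2, 5: 3, 6: 1}


def closure_rows(parent):
    """Return (ancestor_id, descendant_id) for every node and each of its ancestors."""
    rows = []
    for node in parent:
        ancestor = node
        while ancestor is not None:   # the first pass adds the self-referencing row
            rows.append((ancestor, node))
            ancestor = parent[ancestor]
    return sorted(rows)


rows = closure_rows(parent)
print(rows)
```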

Now you can get a tree starting at node 1 like this:

SELECT f.* 
FROM FlatTable f 
  JOIN ClosureTable a ON (f.id = a.descendant_id)
WHERE a.ancestor_id = 1;

The output (in MySQL client) looks like the following:

+----+
| id |
+----+
|  1 | 
|  2 | 
|  4 | 
|  6 | 
+----+

In other words, nodes 3 and 5 are excluded, because they're part of a separate hierarchy, not descending from node 1.

Re: comment from e-satis about immediate children (or immediate parent). You can add a "path_length" column to the ClosureTable to make it easier to query specifically for an immediate child or parent (or any other distance).

INSERT INTO ClosureTable (ancestor_id, descendant_id, path_length) VALUES
  (1,1,0), (1,2,1), (1,4,2), (1,6,1),
  (2,2,0), (2,4,1),
  (3,3,0), (3,5,1),
  (4,4,0),
  (5,5,0),
  (6,6,0);

Then you can add a term in your search for querying the immediate children of a given node. These are descendants whose path_length is 1.

SELECT f.* 
FROM FlatTable f 
  JOIN ClosureTable a ON (f.id = a.descendant_id)
WHERE a.ancestor_id = 1
  AND a.path_length = 1;

+----+
| id |
+----+
|  2 | 
|  6 | 
+----+
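The same column answers the "immediate parent" question by swapping the roles of the two ID columns. Here is a rough, self-contained sketch using Python's sqlite3 module with the rows from above; it leaves out the join to FlatTable, and `children_of` and `parent_of` are hypothetical helper names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ClosureTable (
        ancestor_id   INT NOT NULL,
        descendant_id INT NOT NULL,
        path_length   INT NOT NULL,
        PRIMARY KEY (ancestor_id, descendant_id)
    )
    """
)
conn.executemany(
    "INSERT INTO ClosureTable VALUES (?, ?, ?)",
    [
        (1, 1, 0), (1, 2, 1), (1, 4, 2), (1, 6, 1),
        (2, 2, 0), (2, 4, 1),
        (3, 3, 0), (3, 5, 1),
        (4, 4, 0),
        (5, 5, 0),
        (6, 6, 0),
    ],
)


def children_of(node_id):
    """Immediate children: descendants at distance 1."""
    sql = ("SELECT descendant_id FROM ClosureTable "
           "WHERE ancestor_id = ? AND path_length = 1 ORDER BY descendant_id")
    return [row[0] for row in conn.execute(sql, (node_id,))]


def parent_of(node_id):
    """Immediate parent: the ancestor at distance 1, or None for a root."""
    sql = ("SELECT ancestor_id FROM ClosureTable "
           "WHERE descendant_id = ? AND path_length = 1")
    row = conn.execute(sql, (node_id,)).fetchone()
    return row[0] if row else None
```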

Re comment from @ashraf: "How about sorting the whole tree [by name]?"

Here's an example query to return all nodes that are descendants of node 1, join them to the FlatTable that contains other node attributes such as name, and sort by the name.

SELECT f.name
FROM FlatTable f 
JOIN ClosureTable a ON (f.id = a.descendant_id)
WHERE a.ancestor_id = 1
ORDER BY f.name;

Re comment from @Nate:

SELECT f.name, GROUP_CONCAT(b.ancestor_id order by b.path_length desc) AS breadcrumbs
FROM FlatTable f 
JOIN ClosureTable a ON (f.id = a.descendant_id) 
JOIN ClosureTable b ON (b.descendant_id = a.descendant_id) 
WHERE a.ancestor_id = 1 
GROUP BY a.descendant_id 
ORDER BY f.name

+------------+-------------+
| name       | breadcrumbs |
+------------+-------------+
| Node 1     | 1           |
| Node 1.1   | 1,2         |
| Node 1.1.1 | 1,2,4       |
| Node 1.2   | 1,6         |
+------------+-------------+

A user suggested an edit today. SO moderators approved the edit, but I am reversing it.

The edit suggested that the ORDER BY in the last query above should be ORDER BY b.path_length, f.name, presumably to make sure the ordering matches the hierarchy. But this doesn't work, because it would order "Node 1.1.1" after "Node 1.2".

If you want the ordering to match the hierarchy in a sensible way, that is possible, but not simply by ordering by the path length. For example, see my answer to MySQL Closure Table hierarchical database - How to pull information out in the correct order.
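To illustrate the difference, here is a small Python sketch. It is one possible approach (sorting by the full root-to-node path) and not necessarily what the linked answer does; the in-memory `closure` rows and `names` map simply mirror the subtree under node 1.

```python
closure = [  # (ancestor_id, descendant_id, path_length)
    (1, 1, 0), (1, 2, 1), (1, 4, 2), (1, 6, 1),
    (2, 2, 0), (2, 4, 1),
    (4, 4, 0),
    (6, 6, 0),
]
names = {1: "Node 1", 2: "Node 1.1", 4: "Node 1.1.1", 6: "Node 1.2"}


def breadcrumbs(node):
    """Names of the ancestors of node, from the root down to node itself."""
    ancestors = [(length, anc) for anc, desc, length in closure if desc == node]
    return [names[anc] for length, anc in sorted(ancestors, reverse=True)]


# Ordering by (distance from node 1, name) groups nodes by depth, not by subtree.
depth = {desc: length for anc, desc, length in closure if anc == 1}
by_length = sorted(names, key=lambda n: (depth[n], names[n]))

# Ordering by the full root-to-node path keeps each subtree together.
by_path = sorted(names, key=breadcrumbs)

print([names[n] for n in by_length])
print([names[n] for n in by_path])
```

With this data the first ordering puts "Node 1.1.1" after "Node 1.2", while the second keeps it directly under "Node 1.1".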

Hierarchy data structure with efficient query algorithm

If your data fits into memory then you can implement this by putting a Set of children in each node of the hierarchy and then walking the sets to determine if the path is valid, for example:

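import java.util.HashSet;
import java.util.Set;

class University {
  private final Set<Major> majors = new HashSet<>();
  public Set<Major> getMajors() { return majors; }
}

class Major {
  private final Set<Student> students = new HashSet<>();
  public Set<Student> getStudents() { return students; }
}

class Student {}

class Main {
  // true if the path is valid, else false
  public boolean query(University university, Major major, Student student) {
    return university.getMajors().contains(major) &&
      major.getStudents().contains(student);
  }
}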

If you also need to walk the reverse path (i.e. if you need a bidirectional hierarchy) then you can put a Set of parents in each child.

This will run in average case O(d) where d is the depth of the hierarchy if you use HashSets, and in worst case O(d * lg(n)) where n is the size of the sets if you use TreeSets.
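As a rough illustration of the bidirectional variant, here is a Python sketch in which every node keeps hash sets of both its children and its parents. The `Entity` class, the helper names, and the sample objects are all hypothetical.

```python
class Entity:
    """A node that tracks both its children and its parents in hash sets."""

    def __init__(self, name):
        self.name = name
        self.children = set()
        self.parents = set()

    def add_child(self, child):
        self.children.add(child)
        child.parents.add(self)  # keep the reverse link in sync


def is_valid_path(*path):
    """True if each element is a child of the one before it (top-down)."""
    return all(child in parent.children for parent, child in zip(path, path[1:]))


def is_valid_reverse_path(*path):
    """True if each element is a parent of the one before it (bottom-up)."""
    return all(parent in child.parents for child, parent in zip(path, path[1:]))


university = Entity("Example University")
cs, maths = Entity("CS"), Entity("Maths")
alice = Entity("alice")
university.add_child(cs)
university.add_child(maths)
cs.add_child(alice)
```

Each step is one set membership test, so a path of depth d costs d lookups in either direction.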

If your data doesn't fit into memory then you may want to consider using a graph database, e.g. Neo4j.

Edit: You can make the code more generic at the cost of type safety by using Map<String, E> at each level, assuming that each object has a unique name or some other string identifier.

import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

abstract class Hierarchical<E extends Hierarchical<?>> {
  protected final Map<String, E> children = new HashMap<>();

  // consumes one key per level; true if every key resolves to a child
  public boolean query(Queue<String> query) {
    String key = query.poll();
    if (key != null) {
      E value = children.get(key);
      if (value != null) {
        return query.isEmpty() || value.query(query);
      }
    }
    return false;
  }
}

class University extends Hierarchical<Major> {}

class Major extends Hierarchical<Student> {}

// special case for the bottom of the hierarchy: a student has no children
class Student extends Hierarchical<Hierarchical<?>> {
  @Override
  public boolean query(Queue<String> query) {
    throw new UnsupportedOperationException("query should never reach this depth");
  }
}

class Main {
  // true if the path is valid, else false
  public boolean query(Hierarchical<?> root, Queue<String> query) {
    return root.query(query);
  }
}

This has the same runtime as before: average case O(d) with a HashMap, worst case O(d * lg(n)) with a TreeMap. The query consists only of a queue of strings; at each level of the hierarchy the first string is removed and looked up in the Map. If a child node is found, the query proceeds to that child, until either the queue is empty (return true) or a node isn't found (return false).
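The queue-walking logic can be sketched in a few lines of Python using nested dictionaries in place of the class hierarchy. The data and the `query` function are hypothetical, and the walk is written as a loop rather than as recursion.

```python
from collections import deque

# Hypothetical hierarchy: university -> major -> student, keyed by unique names.
hierarchy = {
    "Example University": {
        "CS": {"alice": {}, "bob": {}},
        "Maths": {"carol": {}},
    },
}


def query(children, keys):
    """Consume one key per level; True if every key resolves to a child."""
    keys = deque(keys)
    if not keys:
        return False  # an empty query matches nothing
    while keys:
        child = children.get(keys.popleft())
        if child is None:
            return False
        children = child
    return True
```

For example, `query(hierarchy, ["Example University", "CS", "alice"])` walks three levels, while a query that names a student under the wrong major stops at the first missing key.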

How to efficiently build a tree from a flat structure?

Store the IDs of the objects in a hash table that maps each ID to its object. Then enumerate through all the objects; for each one, look up its parent, if one exists, and set the parent pointer accordingly.

using System.Collections.Generic;
using System.Linq;

class MyObject
{ // The actual object
    public int ParentID { get; set; }
    public int ID { get; set; }
}

class Node
{
    public List<Node> Children = new List<Node>();
    public Node Parent { get; set; }
    public MyObject AssociatedObject { get; set; }
}

IEnumerable<Node> BuildTreeAndGetRoots(List<MyObject> actualObjects)
{
    Dictionary<int, Node> lookup = new Dictionary<int, Node>();
    actualObjects.ForEach(x => lookup.Add(x.ID, new Node { AssociatedObject = x }));
    foreach (var item in lookup.Values) {
        Node proposedParent;
        if (lookup.TryGetValue(item.AssociatedObject.ParentID, out proposedParent)) {
            item.Parent = proposedParent;
            proposedParent.Children.Add(item);
        }
    }
    return lookup.Values.Where(x => x.Parent == null);
}
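A point worth seeing in action is that the two-pass approach does not care about input order: a child may appear before its parent. Here is a minimal Python sketch of the same algorithm; the `Node` class and the sample records are hypothetical.

```python
class Node:
    def __init__(self, node_id, parent_id):
        self.id = node_id
        self.parent_id = parent_id
        self.parent = None
        self.children = []


def build_tree_and_get_roots(records):
    """records: iterable of (id, parent_id). Returns the nodes with no known parent."""
    # Pass 1: create every node so that look-ups by ID always succeed.
    lookup = {rec_id: Node(rec_id, parent_id) for rec_id, parent_id in records}
    # Pass 2: link each node to its parent, if that parent exists.
    for node in lookup.values():
        proposed_parent = lookup.get(node.parent_id)
        if proposed_parent is not None:
            node.parent = proposed_parent
            proposed_parent.children.append(node)
    return [node for node in lookup.values() if node.parent is None]


# Deliberately unordered: node 4 is listed before its parent 2.
records = [(4, 2), (2, 1), (1, 0), (6, 1), (3, 0), (5, 3)]
roots = build_tree_and_get_roots(records)
```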

Hierarchy Data shift

Building on the idea of finding the longest one first, and taking it as reference for future boss-padding, yields the following code, which works fine for the case above.

The idea is to build a 'bosses' lookup table, built from the longest leaf-to-root path we find. Whenever we have a boss that is in the lookup table, we make sure that he appears in the same position as he appears in the longest path, padding with nulls as necessary.

import org.json.JSONArray;
import java.util.ArrayList;
import java.util.HashMap;

public class T {
    static String refactor(String jsonData) {
        JSONArray array = new JSONArray(jsonData);

        // find longest array in original container
        JSONArray longest = null;
        for (int i=0; i<array.length(); i++) {
            JSONArray a = array.getJSONArray(i);
            if (longest == null || a.length() > longest.length()) {
                longest = a;
            }
        }

        // build a map with the people in "longest", for quick lookup
        HashMap<String, Integer> bosses = new HashMap<String, Integer>();
        for (int i=0; i<longest.length(); i+=2) {
            bosses.put(longest.getString(i) + "|" + longest.getString(i+1), i);
        }

        // prepare target container       
        ArrayList<JSONArray> container = new ArrayList<JSONArray>();

        // fill in missing values
        for (int i=0; i<array.length(); i++) {
            JSONArray a = array.getJSONArray(i);
            ArrayList<String> refactored = new ArrayList<String>();
            // copy leaf employee
            refactored.add(a.getString(0));
            refactored.add(a.getString(1));
            for (int j=2; j<a.length(); j+=2) {
                // possibly fill in nulls before adding this boss
                String boss = a.getString(j) + "|" + a.getString(j+1);
                if (bosses.containsKey(boss)) {
                    // measure from the current output size, since earlier padding shifts positions
                    for (int k=refactored.size(); k<bosses.get(boss); k++) {
                        // pad with nulls until we reach target position
                        refactored.add(null);
                    }
                }
                refactored.add(a.getString(j));
                refactored.add(a.getString(j+1));
            }
            container.add(new JSONArray(refactored));
        }
        return new JSONArray(container).toString();
    }

    public static void main(String args[]) {
        System.out.println(refactor(args[0]));
    }
}
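To make the padding idea concrete, here is a small Python sketch run on made-up data. Each chain is assumed to be a flat list of name/title pairs running from the leaf employee up to the top boss; `pad_chains` is a hypothetical helper, and it measures the padding from the current length of the output row.

```python
def pad_chains(chains):
    """Align each chain's bosses with their positions in the longest chain."""
    longest = max(chains, key=len)
    # position of each (name, title) pair within the longest chain
    position = {
        (longest[i], longest[i + 1]): i for i in range(0, len(longest), 2)
    }
    result = []
    for chain in chains:
        padded = list(chain[:2])  # copy the leaf employee
        for j in range(2, len(chain), 2):
            boss = (chain[j], chain[j + 1])
            target = position.get(boss, len(padded))
            padded.extend([None] * (target - len(padded)))  # no-op if already in place
            padded.extend(boss)
        result.append(padded)
    return result


# Hypothetical input: three employees sharing parts of one reporting line.
chains = [
    ["Ann", "Engineer", "Bob", "Manager", "Cat", "Director", "Dan", "CEO"],
    ["Eve", "Analyst", "Cat", "Director", "Dan", "CEO"],
    ["Fay", "Assistant", "Dan", "CEO"],
]
padded = pad_chains(chains)
```

After padding, "Dan" sits at index 6 in every row, so the rows can be compared column by column.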

Optimal way to model documents hierarchy in CouchDB

There's no right answer to this question, hence the lack of a definitive answer. It mostly depends on what kind of usage you want to optimize for.

You state that the retrieval speed of documents that belong to a certain category (and its subcategories) is most important. The first two solutions allow you to create a view that emits a blog post multiple times, once for each category in the chain from the leaf to the root. Selecting all documents can thus be done with a single (and therefore fast) query. The only difference between the second solution and the first is that the parsing of the category "path" into components moves from the code that inserts the document to the map function of the view. I would prefer the first solution, as it makes the map function simpler to implement and is a bit more flexible (e.g. it allows a category's name to contain a slash character).

In your scenario you probably also want to create a reduced view which counts the number of blog posts for each category. This is very simple with either of these solutions. With a fitting reduce function, the number of posts in every category can be retrieved using a single request.

A downside of the first two solutions is that renaming or moving a category from one parent to another requires every document to be updated. The third solution allows that without touching the documents. But from the description of your scenario I assume that retrieval by category is very frequent and category renaming/moving is very rare.

Solution 4

I propose a fourth solution where blog post documents hold references to category documents but still reference all the ancestors of the post's category. This allows categories to be renamed without touching the blog posts and allows you to store additional metadata with a category (e.g. translations of the category name or a description):

{
    "_id": "8e7a440862347a22f4a1b2ca7f000e83",
    "type": "post",
    "author": "dexter",
    "title": "Hello",
    "category_ids": ["3", "2", "1"]
}

{
    "_id": "1",
    "type": "category",
    "name": "OO"
}

{
    "_id": "2",
    "type": "category",
    "name": "Programming",
    "parent": "1"
}

{
    "_id": "3",
    "type": "category",
    "name": "C++",
    "parent": "2"
}

You will still have to store each category's parent with the category itself, which duplicates data held in the posts, to allow categories to be traversed (e.g. for displaying a tree of categories for navigation).

You can extend this solution or any of your solutions to allow a post to be categorized under multiple categories, or a category to have multiple parents. When a post is categorized in multiple categories, you will need to store the union of the ancestors of each category in the post's document while preserving the categories selected by the author to allow them to be displayed with the post or edited later.

Let's assume that there is an additional category named "Ajax" with ancestors "JavaScript", "Programming" and "OO". To simplify the following example, I've chosen the document IDs of the categories to equal the category's name.

{
    "_id": "8e7a440862347a22f4a1b2ca7f000e83",
    "type": "post",
    "author": "dexter",
    "title": "Hello",
    "category_ids": ["C++", "Ajax"],
    "category_anchestor_ids": ["C++", "Programming", "OO", "Ajax", "JavaScript"]
}

To allow a category to have multiple parents, just store multiple parent IDs with a category. You will need to eliminate duplicates when collecting all the ancestors of a category.
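Computing that ancestor list when a post is saved could look roughly like the following Python sketch. The `parents` map stands in for the category documents (IDs equal to names, as in the example above), each category may list several parents, and `ancestor_ids` is a hypothetical helper.

```python
# Hypothetical stand-in for the category documents: category ID -> parent IDs.
parents = {
    "OO": [],
    "Programming": ["OO"],
    "C++": ["Programming"],
    "JavaScript": ["Programming"],
    "Ajax": ["JavaScript"],
}


def ancestor_ids(category_ids):
    """Selected categories plus all of their ancestors, without duplicates."""
    seen = []
    stack = list(reversed(category_ids))
    while stack:
        category = stack.pop()
        if category in seen:
            continue  # already reached via another category or parent
        seen.append(category)
        stack.extend(reversed(parents[category]))
    return seen


post_ancestors = ancestor_ids(["C++", "Ajax"])
print(post_ancestors)
```

The list membership test keeps the sketch short; with many categories a set alongside the list would be the more efficient choice.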

View for Solution 4

Suppose you want to get all the blog posts for a specific category. We will use a database with the following sample data:

{ "_id": "100", "type": "category", "name": "OO"                              }
{ "_id": "101", "type": "category", "name": "Programming", "parent_id": "100" }
{ "_id": "102", "type": "category", "name": "C++",         "parent_id": "101" }
{ "_id": "103", "type": "category", "name": "JavaScript",  "parent_id": "101" }
{ "_id": "104", "type": "category", "name": "AJAX",        "parent_id": "103" }

{ "_id": "200", "type": "post", "title": "OO Post",          "category_id": "100", "category_anchestor_ids": ["100"]                      }
{ "_id": "201", "type": "post", "title": "Programming Post", "category_id": "101", "category_anchestor_ids": ["101", "100"]               }
{ "_id": "202", "type": "post", "title": "C++ Post",         "category_id": "102", "category_anchestor_ids": ["102", "101", "100"]        }
{ "_id": "203", "type": "post", "title": "AJAX Post",        "category_id": "104", "category_anchestor_ids": ["104", "103", "101", "100"] }

In addition to that, we use a view called posts_by_category in a design document called _design/blog with the following map function:

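function (doc) {
    // emit each post once per category in its ancestor chain
    if (doc.type == 'post' && doc.category_anchestor_ids) {
        for (var i = 0; i < doc.category_anchestor_ids.length; i++) {
            emit([doc.category_anchestor_ids[i]], doc);
        }
    }
}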

Then we can get all the posts in the Programming category (which has ID "101") or one of its subcategories using a GET request to the following URL:

http://localhost:5984/so/_design/blog/_view/posts_by_category?reduce=false&key=["101"]

This will return a view result with the keys set to the category ID and the values set to the post documents. The same view can also be used to get a summary list of all categories and the number of posts in each category and its children. We add the following reduce function to the view:

function (keys, values, rereduce) {
    if (rereduce) {
        return sum(values)
    } else {
        return values.length
    }
}

And then we use the following URL:

http://localhost:5984/so/_design/blog/_view/posts_by_category?group_level=1

This will return a reduced view result with the keys again set to the category ID and the values set to the number of posts in each category. In this example, the category names would have to be fetched separately, but it is possible to create a view where each row in the reduced view result already contains the category name.
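To check what the two requests should return for the sample data, the view can be emulated in plain Python. This is only an emulation of the map and reduce steps, not CouchDB's API; `map_post` is a hypothetical stand-in for the map function, and plain strings are used as keys instead of one-element arrays.

```python
from collections import defaultdict

posts = [
    {"_id": "200", "title": "OO Post", "category_anchestor_ids": ["100"]},
    {"_id": "201", "title": "Programming Post", "category_anchestor_ids": ["101", "100"]},
    {"_id": "202", "title": "C++ Post", "category_anchestor_ids": ["102", "101", "100"]},
    {"_id": "203", "title": "AJAX Post", "category_anchestor_ids": ["104", "103", "101", "100"]},
]


def map_post(doc):
    """Mirror of the map function: one (key, value) row per ancestor category."""
    return [(category_id, doc) for category_id in doc["category_anchestor_ids"]]


rows = [row for doc in posts for row in map_post(doc)]

# Counterpart of the reduce=false request with key "101":
# titles of the posts in or below the Programming category.
in_programming = sorted(doc["title"] for key, doc in rows if key == "101")

# Counterpart of the grouped reduce: number of posts per category ID.
counts = defaultdict(int)
for key, _ in rows:
    counts[key] += 1

print(in_programming)
print(dict(counts))
```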

What's faster in Oracle? Small table with tree structure vs. Huge flat table

I would definitely go for the first option (hierarchical approach). I think it's better to model the data correctly than to just use a bad data model to gain performance. Since you are modeling a hierarchy here, it makes sense to store it that way in the DB.

If you want the best of both worlds, my recommendation would be to look at using a materialized view to "flatten" the hierarchical data. That way you are still storing the data properly, but you get the performance gains (if any) from the materialized view.

There's almost always a way to follow a good data model and still find ways to get good performance. But a bad data model will cost you for years to come, and it takes great pain to correct it later.

However, even with the flattened approach, you have to consider that you are increasing the number of records dramatically, especially as you get to the leaf nodes in the tree, so I'd be surprised if having a flat hierarchy table (your second approach) would improve performance since there are many more records to process.
