Hierarchical/Tree Database for Directories Path in Filesystem

Hierarchical/tree database for directories path in filesystem

Here's a quick closure table example for SQLite. I've not included the statements for inserting items into an existing tree. Instead, I've just created the statements manually. You can find the insert and delete statements in the Models for hierarchical data slides.

For the sake of my sanity when inserting the IDs for the directories, I renamed the directories to match their IDs:

        (ROOT)
      /        \ 
    Dir2        Dir3
    /    \           \
  Dir4   Dir5        Dir6
  /          
Dir7

Create tables

CREATE TABLE `filesystem` (
  `id` INTEGER,
  `dirname` TEXT,
  PRIMARY KEY (`id`)
);

CREATE TABLE `tree_path` (
  `ancestor` INTEGER,
  `descendant` INTEGER,
  PRIMARY KEY (`ancestor`, `descendant`)
);

Insert directories into filesystem table

INSERT INTO filesystem (id, dirname) VALUES (1, 'ROOT');
INSERT INTO filesystem (id, dirname) VALUES (2, 'Dir2');
INSERT INTO filesystem (id, dirname) VALUES (3, 'Dir3');
INSERT INTO filesystem (id, dirname) VALUES (4, 'Dir4');
INSERT INTO filesystem (id, dirname) VALUES (5, 'Dir5');
INSERT INTO filesystem (id, dirname) VALUES (6, 'Dir6');
INSERT INTO filesystem (id, dirname) VALUES (7, 'Dir7');

Create the closure table paths

INSERT INTO tree_path (ancestor, descendant) VALUES (1, 1);
INSERT INTO tree_path (ancestor, descendant) VALUES (1, 2);
INSERT INTO tree_path (ancestor, descendant) VALUES (1, 3);
INSERT INTO tree_path (ancestor, descendant) VALUES (1, 4);
INSERT INTO tree_path (ancestor, descendant) VALUES (1, 5);
INSERT INTO tree_path (ancestor, descendant) VALUES (1, 6);
INSERT INTO tree_path (ancestor, descendant) VALUES (1, 7);
INSERT INTO tree_path (ancestor, descendant) VALUES (2, 2);
INSERT INTO tree_path (ancestor, descendant) VALUES (2, 4);
INSERT INTO tree_path (ancestor, descendant) VALUES (2, 5);
INSERT INTO tree_path (ancestor, descendant) VALUES (2, 7);
INSERT INTO tree_path (ancestor, descendant) VALUES (3, 3);
INSERT INTO tree_path (ancestor, descendant) VALUES (3, 6);
INSERT INTO tree_path (ancestor, descendant) VALUES (4, 4);
INSERT INTO tree_path (ancestor, descendant) VALUES (4, 7);
INSERT INTO tree_path (ancestor, descendant) VALUES (5, 5);
INSERT INTO tree_path (ancestor, descendant) VALUES (6, 6);
INSERT INTO tree_path (ancestor, descendant) VALUES (7, 7);

Run some queries

# (ROOT) and subdirectories
SELECT f.id, f.dirname FROM filesystem f
  JOIN tree_path t
    ON t.descendant = f.id
 WHERE t.ancestor = 1;

+----+---------+
| id | dirname |
+----+---------+
|  1 | ROOT    |
|  2 | Dir2    |
|  3 | Dir3    |
|  4 | Dir4    |
|  5 | Dir5    |
|  6 | Dir6    |
|  7 | Dir7    |
+----+---------+

# Dir3 and subdirectories
SELECT f.id, f.dirname
  FROM filesystem f
  JOIN tree_path t
    ON t.descendant = f.id
 WHERE t.ancestor = 3;

+----+---------+
| id | dirname |
+----+---------+
|  3 | Dir3    |
|  6 | Dir6    |
+----+---------+

# Dir5 and parent directories
SELECT f.id, f.dirname
  FROM filesystem f
  JOIN tree_path t
    ON t.ancestor = f.id
 WHERE t.descendant = 5;

+----+---------+
| id | dirname |
+----+---------+
|  1 | ROOT    |
|  2 | Dir2    |
|  5 | Dir5    |
+----+---------+

# Dir7 and parent directories
SELECT f.id, f.dirname
  FROM filesystem f
  JOIN tree_path t
    ON t.ancestor = f.id
 WHERE t.descendant = 7;

+----+---------+
| id | dirname |
+----+---------+
|  1 | ROOT    |
|  2 | Dir2    |
|  4 | Dir4    |
|  7 | Dir7    |
+----+---------+

SELECT f.id, f.dirname
  FROM filesystem f
  JOIN tree_path t
    ON t.ancestor = f.id
 WHERE t.descendant = (
SELECT id
  FROM filesystem
 WHERE dirname LIKE '%7%'
);

+----+---------+
| id | dirname |
+----+---------+
|  1 | ROOT    |
|  2 | Dir2    |
|  4 | Dir4    |
|  7 | Dir7    |
+----+---------+

How to store directory / hierarchy / tree structure in the database?

There are many ways to store hierarchies in SQL databases. Which one to choose depends on which DBMS product you use, and how the data will be used. As you have used the MSSQL2005 tag, I think you should start considering the "Adjacency List" model; if you find that it doesn't perform well for your application, then have a look at Vadim Tropashko's comparison which highlights differences between models with a focus on multiple performance characteristics.

MySQL : Storing directory structure and query it

Plan A: Have the directory path in a row in a separate table from the filename.

Plan B: Like Plan A, but with the path being multiple rows in a "hierarchical" arrangement in the new table.

Plan C: Leave the "dir/fn" as is, then use LIKE or RLIKE to locate a row, then use SUBSTRING_INDEX (or REGEXP_REPLACE in MariaDB) to substitute.

Find record in closure table based on path

In its basic form, the closure table just contains all pairs that are connected transitively.

To answer your query without recursion, the concatenated path string must be there already.

Do you need to select paths starting from the root node only?

The you could follow one of the following two options:

Store the absolute path as an additional column in your nodes table.
Or directly usse the absolute paths as key of your nodes, hence they are also the foreign keys in the closure table, and you can directly select them.

Do you need to select relatively to some starting node?

In that case, you could just store the relative path for each pair in the closure table, e.g., making it closure(ancestor_id int, descendant_id int, relative_path_between varchar). The INSERT operations used to build up the closure tables can be extended easily by the respective concat expression, e.g., to insert a new child Y for parent X with relative path Z, you would do

insert into closure_table(ancestor_id, descendant_id, relative_path_between)
select ancestor_id, 
       <Y>, 
       relative_path_between || '/' || <Z>
from closure_table 
where descendant_id = <X>
union all
select <Y>,<Y>,'';

(cf Bill Karwin's slides, slides 68 -- 78)

Store json-like hierarchical data as nested directory tree?

The closest thing I can think of that could be seen as some kind of convention for doing this are Linux configuration files. In modern Linux, you often split the configuration of a service into multiple files residing in a certain directory, e.g. /etc/exim4/conf.d/ instead of having a single file /etc/exim/exim4.conf. There are multiple reasons doing this:

Some configuration may be provided by the package manager (e.g. linking to other services that are installed via package manager), while other parts are user-defined. Since there would be a conflict if the user edits a file provided by the package manager, they can instead create a new file and enter additional configuration there.
For large configuration files (like for exim4), it's easier to navigate the configuration if you have multiple files for different concerns (hardcore vim users might disagree).
You can enable / disable parts of the configuration easier by renaming / moving the file that contains a certain part.

We can learn a bit from this: Separation into distinct files should happen if the semantic of the content is orthogonal, i.e. the semantic of one file does not depend on the semantic of another file. This is of course a rule for sibling files; we cannot really deduct rules for serializing a tree structure as directory tree from it. However, we can definitely see reasons for not splitting every value in an own file.

You mention problems of encoding special characters into a file name. You will only have this problem if you go against conventions! The implicit convention on file and directory names is that they act as locator / ID for files, never as content. Again, we can learn a bit from Linux config files: Usually, there is a master file that contains an include statement which loads all the split files. The include statement gives a path glob expression which locates the other files. The path to those files is irrelevant for the semantics of their content. Technically, we can do something similar with YAML.

Assume we want to split this single YAML file into multiple files (pardon my lack of creativity):

spam:
  spam: spam
  egg: sausage
baked beans:
- spam
- spam
- bacon

A possible transformation would be this (read stuff ending with / as directory, : starts file content):

confdir/
  main.yaml:
    spam: !include spammap/main.yaml
    baked beans: !include beans/
  spammap/
    main.yaml:
      spam: !include spam.yaml
      egg: !include egg.yaml
    spam.yaml:
      spam
    egg.yaml:
      sausage
  beans/
    1.yaml:
      spam
    2.yaml:
      spam
    3.yaml:
      bacon

(In YAML, !include is a local tag. With most implementations, you can register a custom constructor for it, thus loading the whole hierarchy as single document.)

As you can see, I put every hierarchy level and every value into a separate file. I use two kinds of includes: A reference to a file will load the content of that file; a reference to a directory will generate a sequence where each item's value is the content of one file in that directory, sorted by file name. As you can see, the file and directory names are never part of the content, sometimes I opted to name them differently (e.g. baked beans -> beans/) to avoid possible file system problems (spaces in filenames in this case – usually not a serious problem nowadays). Also, I adhere to the filename extension convention (having the files carry .yaml). This would be more quirky if you put content into the file names.

I named the starting file on each level main.yaml (not needed in beans/ since it's a sequence). While the exact name is arbitrary, this is a convention used in several other tools, e.g. Python with __init__.py or the Nix package manager with default.nix. Then I placed additional files or directories besides this main file.

Since including other files is explicit, it is not a problem with this approach to put a larger part of the content into a single file. Note that JSON lacks YAML's tags functionality, but you can still walk through a loaded JSON file and preprocess values like {"!include": "path"}.

To sum up: While there is not directly a convention how to do what you want, parts of the problem have been solved at different places and you can inherit wisdom from that.

Here's a minimal working example of how to do it with PyYAML. This is just a proof of concept; several features are missing (e.g. autogenerated file names will be ascending numbers, no support for serializing lists into directories). It shows what needs to be done to store information about the data layout while being transparent to the user (data can be accessed like a normal dict structure). It remembers file names stuff has been loaded from and stores to those files again.

import os.path
from pathlib import Path

import yaml
from yaml.reader import Reader
from yaml.scanner import Scanner
from yaml.parser import Parser
from yaml.composer import Composer
from yaml.constructor import SafeConstructor
from yaml.resolver import Resolver
from yaml.emitter import Emitter
from yaml.serializer import Serializer
from yaml.representer import SafeRepresenter

class SplitValue(object):
  """This is a value that should be written into its own YAML file."""

  def __init__(self, content, path = None):
    self._content = content
    self._path = path

  def getval(self):
    return self._content

  def setval(self, value):
    self._content = value

  def __repr__(self):
    return self._content.__repr__()

class TransparentContainer(object):
  """Makes SplitValues transparent to the user."""

  def __getitem__(self, key):
    val = super(TransparentContainer, self).__getitem__(key)
    return val.getval() if isinstance(val, SplitValue) else val

  def __setitem__(self, key, value):
    val = super(TransparentContainer, self).__getitem__(key)
    if isinstance(val, SplitValue) and not isinstance(value, SplitValue):
      val.setval(value)
    else:
      super(TransparentContainer, self).__setitem__(key, value)

class TransparentList(TransparentContainer, list):
  pass

class TransparentDict(TransparentContainer, dict):
  pass

class DirectoryAwareFileProcessor(object):
  def __init__(self, path, mode):
    self._basedir = os.path.dirname(path)
    self._file = open(path, mode)

  def close(self):
    try:
      self._file.close()
    finally:
      self.dispose() # implemented by PyYAML

  # __enter__ / __exit__ to use this in a `with` construct
  def __enter__(self):
    return self

  def __exit__(self, type, value, traceback):
    self.close()

class FilesystemLoader(DirectoryAwareFileProcessor, Reader, Scanner,
    Parser, Composer, SafeConstructor, Resolver):
  """Loads YAML file from a directory structure."""
  def __init__(self, path):
    DirectoryAwareFileProcessor.__init__(self, path, 'r')
    Reader.__init__(self, self._file)
    Scanner.__init__(self)
    Parser.__init__(self)
    Composer.__init__(self)
    SafeConstructor.__init__(self)
    Resolver.__init__(self)

def split_value_constructor(loader, node):
  path = loader.construct_scalar(node)
  with FilesystemLoader(os.path.join(loader._basedir, path)) as childLoader:
    return SplitValue(childLoader.get_single_data(), path)

FilesystemLoader.add_constructor(u'!include', split_value_constructor)

def transp_dict_constructor(loader, node):
  ret = TransparentDict()
  ret.update(loader.construct_mapping(node, deep=True))
  return ret

# override constructor for !!map, the default resolved tag for mappings
FilesystemLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
    transp_dict_constructor)

def transp_list_constructor(loader, node):
  ret = TransparentList()
  ret.append(loader.construct_sequence(node, deep=True))
  return ret

# like above, for !!seq
FilesystemLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_SEQUENCE_TAG,
    transp_list_constructor)

class FilesystemDumper(DirectoryAwareFileProcessor, Emitter,
    Serializer, SafeRepresenter, Resolver):
  def __init__(self, path):
    DirectoryAwareFileProcessor.__init__(self, path, 'w')
    Emitter.__init__(self, self._file)
    Serializer.__init__(self)
    SafeRepresenter.__init__(self)
    Resolver.__init__(self)

    self.__next_unique_name = 1
    Serializer.open(self)

  def gen_unique_name(self):
    val = self.__next_unique_name
    self.__next_unique_name = self.__next_unique_name + 1
    return str(val)

  def close(self):
    try:
      Serializer.close(self)
    finally:
      DirectoryAwareFileProcessor.close(self)

def split_value_representer(dumper, data):
  if data._path is None:
    if isinstance(data._content, TransparentContainer):
      data._path = os.path.join(dumper.gen_unique_name(), "main.yaml")
    else:
      data._path = dumper.gen_unique_name() + ".yaml"
  Path(os.path.dirname(data._path)).mkdir(exist_ok=True)
  with FilesystemDumper(os.path.join(dumper._basedir, data._path)) as childDumper:
    childDumper.represent(data._content)
  return dumper.represent_scalar(u'!include', data._path)

yaml.add_representer(SplitValue, split_value_representer, FilesystemDumper)

def transp_dict_representer(dumper, data):
  return dumper.represent_dict(data)

yaml.add_representer(TransparentDict, transp_dict_representer, FilesystemDumper)

def transp_list_representer(dumper, data):
  return dumper.represent_list(data)

# example usage:

# explicitly specify values that should be split.
myData = TransparentDict({
  "spam": SplitValue({
    "spam": SplitValue("spam", "spam.yaml"),
    "egg": SplitValue("sausage", "sausage.yaml")}, "spammap/main.yaml")})

with FilesystemDumper("root.yaml") as dumper:
  dumper.represent(myData)

# load values from stored files.
# The loaded data remembers which values have been in which files.
with FilesystemLoader("root.yaml") as loader:
  loaded = loader.get_single_data()

# modify a value as if it was a normal structure.
# actually updates a SplitValue
loaded["spam"]["spam"] = "baked beans"
# dumps the same structure as before, with the modified value.
with FilesystemDumper("root.yaml") as dumper:
  dumper.represent(loaded)

Hierarchical/Tree Database for Directories Path in Filesystem