How to Build a Dendrogram from a Directory Tree

Here's a possible approach to what you originally asked for, which is something like the system tree command. It gives you a data.tree object that is pretty flexible and could be made to plot the way you want, though it's not entirely clear to me what you want:

path <- c(
"root/a/some/file.R",
"root/a/another/file.R",
"root/a/another/cool/file.R",
"root/b/some/data.csv",
"root/b/more/data.csv"
)

library(data.tree); library(plyr)

x <- lapply(strsplit(path, "/"), function(z) as.data.frame(t(z)))
x <- rbind.fill(x)
x$pathString <- apply(x, 1, function(x) paste(trimws(na.omit(x)), collapse="/"))
(mytree <- data.tree::as.Node(x))

          levelName
1  root
2   ¦--a
3   ¦   ¦--some
4   ¦   ¦   °--file.R
5   ¦   °--another
6   ¦       ¦--file.R
7   ¦       °--cool
8   ¦           °--file.R
9   °--b
10      ¦--some
11      ¦   °--data.csv
12      °--more
13          °--data.csv

plot(mytree)

You can get the parts you want (I think), but you'll have to do the legwork and figure out the conversion between data types in data.tree: https://cran.r-project.org/web/packages/data.tree/vignettes/data.tree.html#tree-conversion

I use this approach in my pathr package's tree function when use.data.tree = TRUE: https://github.com/trinker/pathr#tree

EDIT: Per @Luke's comment below, data.tree::as.Node takes the paths directly, as a data frame with a pathString column:

(mytree <- data.tree::as.Node(data.frame(pathString = path)))

          levelName
1  root
2   ¦--a
3   ¦   ¦--some
4   ¦   ¦   °--file.R
5   ¦   °--another
6   ¦       ¦--file.R
7   ¦       °--cool
8   ¦           °--file.R
9   °--b
10      ¦--some
11      ¦   °--data.csv
12      °--more
13          °--data.csv
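
If the paths come from a real directory rather than a hand-typed vector, base R's list.files can generate them. A minimal sketch, assuming a throwaway directory created under a temporary location (the directory and file names are made up for illustration):

```r
# Build a small throwaway directory tree (all names are hypothetical).
base <- tempfile("dirtree_")
root <- file.path(base, "root")
dirs <- file.path(root, c("a/some", "a/another/cool", "b/more"))
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
files <- file.path(root, c("a/some/file.R",
                           "a/another/cool/file.R",
                           "b/more/data.csv"))
invisible(file.create(files))

# list.files() returns file paths relative to `root`; prepend the root's
# name so every path starts from the same top-level node.
rel_paths <- list.files(root, recursive = TRUE)
path <- file.path(basename(root), rel_paths)
print(path)
```

The resulting path vector has the same shape as the hand-typed one and can be passed as the pathString column in the as.Node call.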

Making simple phylogenetic dendrogram (tree) from a list of species

It's probably a bit lame to answer my own question, but I found an easier solution. Maybe it helps someone one day.

library(ape)
taxa <- as.phylo(~Kingdom/Phylum/Class/Order/Species, data = dat)

col.grp <- merge(data.frame(Species = taxa$tip.label), dat[c("Species", "Group")], by = "Species", sort = FALSE)

cols <- ifelse(col.grp$Group == "Benthos", "burlywood4",
        ifelse(col.grp$Group == "Zooplankton", "blueviolet",
        ifelse(col.grp$Group == "Fish", "dodgerblue",
        ifelse(col.grp$Group == "Phytoplankton", "darkolivegreen2", ""))))

plot(taxa, type = "cladogram", tip.col = cols)

Note that all columns have to be factors. This demonstrates the workflow with R: it takes a week to find something out, although the code itself is just a couple of lines =)
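
The colour assignment can also be written as a lookup in a named vector, which is easier to extend than nested ifelse calls. A sketch with a made-up group vector standing in for col.grp$Group (the original dat is not shown here, so these values are assumptions):

```r
# Hypothetical stand-in for col.grp$Group (a factor, as in the answer).
group <- factor(c("Fish", "Benthos", "Zooplankton", "Phytoplankton", "Other"))

# Named vector mapping each group to its colour.
group_cols <- c(
  Benthos       = "burlywood4",
  Zooplankton   = "blueviolet",
  Fish          = "dodgerblue",
  Phytoplankton = "darkolivegreen2"
)

# Index by name; as.character() matters because indexing by a factor
# would use its integer codes instead of its labels.
cols <- unname(group_cols[as.character(group)])
cols[is.na(cols)] <- ""  # same fallback as the nested ifelse version
print(cols)
```

Groups missing from the lookup come back as NA, so the last assignment restores the empty-string fallback used in the answer.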

Plotting a dendrogram

Don't coerce the "dist" object to a matrix; hclust expects a "dist" object.

x <- matrix(rnorm(100), nrow = 5)
d <- dist(x)
dd <- hclust(d) ## works fine
plot(dd)

hclust(as.matrix(d))  ## fails
# Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") :
# missing value where TRUE/FALSE needed
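
If all you have is a square distance matrix, it can be converted back with as.dist before clustering. A small sketch on the same kind of random data (the seed is only there to make it repeatable):

```r
set.seed(1)
x <- matrix(rnorm(100), nrow = 5)
d <- dist(x)

m <- as.matrix(d)          # square, symmetric 5 x 5 matrix
dd <- hclust(as.dist(m))   # convert back to "dist" before clustering

# Compare with the tree built from the original "dist" object.
print(identical(dd$merge, hclust(d)$merge))
```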

Generating a high-resolution dendrogram plot in R

You can achieve this with standard R functions.

Plot a dendrogram

To plot a dendrogram from a distance matrix you can use the hclust function. See its man page for further details on the algorithms available.

# To produce a dummy distance matrix
distMatrix <- dist(matrix(1:9, ncol=3))

# To convert it into a tree
tree <- hclust(distMatrix)
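
The agglomeration algorithm is chosen with the method argument of hclust. A brief sketch on random points (made up for illustration) comparing three of the documented methods:

```r
set.seed(42)
pts <- matrix(rnorm(40), ncol = 2)   # 20 made-up points
distMatrix <- dist(pts)

# Fit one tree per linkage method; "complete" is the default.
methods <- c("complete", "average", "single")
trees <- lapply(methods, function(m) hclust(distMatrix, method = m))
names(trees) <- methods

# The height of the final merge differs between linkage methods.
print(sapply(trees, function(t) max(t$height)))
```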

For the plot, the dendrogram class provides a useful plot method. Just convert the hclust output to a dendrogram and plot it:

dendro <- as.dendrogram(tree)

The plot method provides a horiz argument that swaps the X and Y axes. Try the following:

plot(dendro, horiz=TRUE)
plot(dendro, horiz=FALSE)

Manage its size

Readability depends on the device you use to export the image. R can produce very large images; it is up to you to set the size and resolution. See the man page for png or pdf for details (width, height and, for png, res are the relevant arguments).

Another option is the graphical parameters: by adjusting the various cex values, you can resize the labels. See the man page of par for details.

Readability is a human judgment, so I don't think you will find an automated way to get a readable plot, but with a little manual tuning you can achieve it with the tools I mentioned. If automation is mandatory, some par values reported by R, such as cin, can be used to predict the required device width, but tuning it manually is much simpler.
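
As a rough sketch of the manual approach, the device width can be scaled with the number of leaves and the text shrunk with cex. The 0.25 inches per leaf and the cex value are arbitrary starting points to tune by eye, not recommendations, and the data and output file are made up for illustration:

```r
# Made-up data with many observations, so labels would overlap by default.
set.seed(1)
tree <- hclust(dist(matrix(rnorm(400), ncol = 2)))
dendro <- as.dendrogram(tree)

n_leaves <- length(tree$order)
out_file <- tempfile(fileext = ".pdf")

# Scale the device width with the number of leaves (0.25 in per leaf
# is an arbitrary guess).
pdf(out_file, width = max(7, 0.25 * n_leaves), height = 7)
par(cex = 0.6)   # shrink text drawn on the device
plot(dendro)
invisible(dev.off())
```

The png device takes width and height in pixels by default, plus res for the nominal resolution, so the same idea applies there with different units.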

New axis

The axis function can help you.
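
One way to use it, sketched here under the assumption that the goal is a custom height axis: suppress the default axis with yaxt = "n" and draw a new one with axis. The tick positions computed with pretty are only an example.

```r
dendro <- as.dendrogram(hclust(dist(matrix(1:9, ncol = 3))))

# Tick positions covering the range of merge heights.
ticks <- pretty(c(0, attr(dendro, "height")))

pdf(tempfile(fileext = ".pdf"))       # any device works; a temp PDF keeps this self-contained
plot(dendro, yaxt = "n")              # draw the tree without the default height axis
axis(side = 2, at = ticks, las = 1)   # add a custom axis on the left
invisible(dev.off())
```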

Creating dendrograms manually: how to fix 'merge' matrix has invalid contents in plot.hclust?

The validity of a hclust tree is checked by the .validity.hclust function. Its source code is given here. Look at lines 121-135.

That you got the error means that your tree is not valid because of its merge matrix. It has non-unique elements (e.g., 1 and 2). In a properly constructed merge matrix, all entries are unique and run from -N_obs to N_obs-2 (zero excluded), where N_obs is a (positive) number of observations. This is checked by the following if test in the code:

if(identical(sort(as.integer(merge)), c(-(n:1L), +seq_len(n-2L))))
    TRUE
else
    "'merge' matrix has invalid contents"

From the reference of hclust:

merge an n − 1 by 2 matrix.

Row i of merge describes the merging of clusters at step i of the
clustering. If an element j in the row is negative, then observation
− j was merged at this stage. If j is positive then the merge was
with the cluster formed at the (earlier) stage j of the algorithm.
Thus negative entries in merge indicate agglomerations of singletons,
and positive entries indicate agglomerations of non-singletons.

All negative entries are singletons (observations), and positive numbers are merges of existing clusters and refer to merging steps of the algorithm.
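
The same check can be run by hand before plotting. A sketch that mirrors the test quoted from the source (the helper name check_merge is made up here; it is not part of R):

```r
# Hypothetical helper mirroring the contents check in .validity.hclust.
check_merge <- function(merge) {
  n <- nrow(merge) + 1L   # number of observations
  identical(sort(as.integer(merge)), c(-(n:1L), seq_len(n - 2L)))
}

# A merge matrix produced by hclust passes the check.
good <- hclust(dist(iris[1:20, -5]))$merge
print(check_merge(good))

# Reusing an entry (here cluster 1 appears twice in the last row) fails it.
bad <- good
bad[nrow(bad), ] <- c(1L, 1L)
print(check_merge(bad))
```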

So, revise your hclust object. Here is some code to give you an idea what a proper hclust object looks like:

iris2 <- iris[1:20,-5]
species_labels <- iris[1:20, 5]
d_iris <- dist(iris2)
tree_iris <- hclust(d_iris, method = "complete")

Take a closer look at tree_iris$merge.

UPDATE

After I got more time, I decided to fix your code. I modified the merge entry of the tree. This is what the working code that reproduces your dendrogram looks like:

tree <- list()
tree$merge <- matrix(c( -1, -7, # row 1
-2, -6, # row 2
-3, -12, # row 3
-4, -14, # row 4
-5, -8, # row 5
-9, -11, # row 6
-13, -20, # row 7
-15, -19, # row 8
1, 8, # row 9: 1,7,15,19
2, 5, # row 10: 2,6,5,8
3, 6, # row 11: 3,12,9,11
10, -18, # row 12: 2,6,5,8 + 18
9, 11, # row 13: 1,7,15,19 + 3,12,9,11
12, 4, # row 14: row 12 + row 4
-10, 7, # row 15: row 7 + 10
-16, -17, # row 16
13, 14, # row 17: row 13 + row 14
15, 16, # row 18: row 15 + row 16
17, 18), # row 19: row 17 + row 18
ncol = 2,
byrow = TRUE)
tree$height <- c(0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.11167131, 0.11167131, 0.11167131, 0.12832304, 0.17304035, 0.17304035, 0.17304035, 0.17304035, 0.22965349, 0.22965349, 0.23334799)
tree$labels <- as.character(1:20)
tree$order <- c(1, 7, 15, 19, 3, 12, 9, 11, 2, 6, 5, 8, 18, 4, 14, 13, 20, 10, 16, 17)
class(tree) <- "hclust"
plot(tree)

Cut a dendrogram

cut will cut the tree at a specified height. It returns a list with the upper and lower portions:

cut(dend, h = depth.cutoff)

# $upper
# 'dendrogram' with 2 branches and 5 members total, at height 5.887262
#
# $lower
# $lower[[1]]
# 'dendrogram' with 2 branches and 6 members total, at height 4.515119
#
# $lower[[2]]
# 'dendrogram' with 2 branches and 2 members total, at height 3.789259
#
# $lower[[3]]
# 'dendrogram' with 2 branches and 5 members total, at height 3.837733
#
# $lower[[4]]
# 'dendrogram' with 2 branches and 3 members total, at height 3.845031
#
# $lower[[5]]
# 'dendrogram' with 2 branches and 4 members total, at height 4.298743

plot(cut(dend, h = depth.cutoff)$upper, horiz = TRUE)
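
Since dend and depth.cutoff come from the question and are not defined here, the following self-contained sketch uses the built-in USArrests data and an arbitrary cut height of 100 as stand-ins:

```r
# Stand-in dendrogram built from a built-in dataset.
dend <- as.dendrogram(hclust(dist(USArrests)))
depth.cutoff <- 100   # arbitrary height chosen for illustration

parts <- cut(dend, h = depth.cutoff)
print(names(parts))          # "upper" and "lower"
print(length(parts$lower))   # number of subtrees below the cut

# Each element of parts$lower is itself a dendrogram.
print(class(parts$lower[[1]]))
```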
