How to build a dendrogram from a directory tree?
Here is one approach that gives you what you originally asked for: output like the tree command. It produces a data.tree object, which is flexible and can be made to plot the way you want, although it is not entirely clear to me what that is:
path <- c(
"root/a/some/file.R",
"root/a/another/file.R",
"root/a/another/cool/file.R",
"root/b/some/data.csv",
"root/b/more/data.csv"
)
library(data.tree); library(plyr)
x <- lapply(strsplit(path, "/"), function(z) as.data.frame(t(z)))
x <- rbind.fill(x)
x$pathString <- apply(x, 1, function(x) paste(trimws(na.omit(x)), collapse="/"))
(mytree <- data.tree::as.Node(x))
1  root
2   ¦--a
3   ¦   ¦--some
4   ¦   ¦   °--file.R
5   ¦   °--another
6   ¦       ¦--file.R
7   ¦       °--cool
8   ¦           °--file.R
9   °--b
10      ¦--some
11      ¦   °--data.csv
12      °--more
13          °--data.csv
plot(mytree)
You can get the parts you want (I think), but you will have to do the legwork of converting between data types in data.tree: https://cran.r-project.org/web/packages/data.tree/vignettes/data.tree.html#tree-conversion
I use this approach in the tree function of my pathr package when use.data.tree = TRUE: https://github.com/trinker/pathr#tree
EDIT: Per @Luke's comment below, data.tree::as.Node takes a path directly:
(mytree <- data.tree::as.Node(data.frame(pathString = path)))
            levelName
1  root
2   ¦--a
3   ¦   ¦--some
4   ¦   ¦   °--file.R
5   ¦   °--another
6   ¦       ¦--file.R
7   ¦       °--cool
8   ¦           °--file.R
9   °--b
10      ¦--some
11      ¦   °--data.csv
12      °--more
13          °--data.csv
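Since the question asks for a dendrogram, here is a minimal sketch of the conversion step mentioned above. It assumes the data.tree package is installed and relies on its as.dendrogram method for Node objects; the variable name dend is only illustrative, and the resulting heights are whatever data.tree assigns by default, not distances.

```r
library(data.tree)

# Same example paths as above
path <- c(
  "root/a/some/file.R",
  "root/a/another/file.R",
  "root/a/another/cool/file.R",
  "root/b/some/data.csv",
  "root/b/more/data.csv"
)

# Build the tree directly from the path strings
mytree <- as.Node(data.frame(pathString = path))

# Convert the data.tree Node to a base R dendrogram and plot it
dend <- as.dendrogram(mytree)
plot(dend, center = TRUE)
```

The result is an ordinary "dendrogram" object, so the usual plot arguments (for example horiz = TRUE) should apply to it.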
Making simple phylogenetic dendrogram (tree) from a list of species
It's probably a bit lame to answer my own question, but I found an easier solution. Maybe it helps someone one day.
library(ape)
taxa <- as.phylo(~Kingdom/Phylum/Class/Order/Species, data = dat)
# Match each tip label to its group, keeping the tip order of the tree
col.grp <- merge(data.frame(Species = taxa$tip.label), dat[c("Species", "Group")], by = "Species", sort = FALSE)
# Map each group to a tip colour
cols <- ifelse(col.grp$Group == "Benthos", "burlywood4",
        ifelse(col.grp$Group == "Zooplankton", "blueviolet",
        ifelse(col.grp$Group == "Fish", "dodgerblue",
        ifelse(col.grp$Group == "Phytoplankton", "darkolivegreen2", ""))))
plot(taxa, type = "cladogram", tip.col = cols)
Note that all columns have to be factors. This is typical of the workflow with R: it takes a week to figure something out, although the code itself is just a couple of lines =)
Plotting a dendrogram
Don't coerce the "dist" object to a matrix; hclust expects a "dist" object.
x <- matrix(rnorm(100), nrow = 5)
d <- dist(x)
dd <- hclust(d) ## works fine
plot(dd)
hclust(as.matrix(d)) ## fails
# Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") :
# missing value where TRUE/FALSE needed
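If all you have is a square matrix of distances, one option is to convert it back with as.dist before clustering. A minimal sketch, using made-up random data and illustrative variable names (m, d2, hc):

```r
set.seed(42)                      # only to make the example reproducible
x <- matrix(rnorm(100), nrow = 5)

m  <- as.matrix(dist(x))          # a plain 5 x 5 matrix of distances
d2 <- as.dist(m)                  # back to a "dist" object
hc <- hclust(d2)                  # works, because the input is a "dist" object
plot(hc)
```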
Generating high-resolution dendrogram plot in R
You can achieve this with standard R functions.
Plot a dendrogram
To build a dendrogram from a distance matrix, you can use the hclust function. See its man page for details on the available algorithms.
# To produce a dummy distance matrix
distMatrix <- dist(matrix(1:9, ncol=3))
# To convert it into a tree
tree <- hclust(distMatrix)
For plotting, the dendrogram class provides a useful plot method. Just convert the hclust output to a dendrogram and plot it:
dendro <- as.dendrogram(tree)
This method provides a horiz argument that swaps the X and Y axes. Try the following:
plot(dendro, horiz=TRUE)
plot(dendro, horiz=FALSE)
Manage its size
Readability depends on the device you use to export the image. R can produce huge images; it is up to the user to set the size and resolution. See the man page for png or pdf for further details (width, height and res are the interesting arguments).
Another avenue is the graphical parameters: by adjusting the various cex values, you can resize the labels. See the man page of par for further details.
Readability is a human judgment, so I don't think you will find an automated way to obtain a readable plot, but with a little manual tuning you can achieve it with the tools I mentioned. If automation is mandatory, you can use some par elements generated by R, such as cin, to predict the needed device width, but it is much simpler to tune it manually.
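As a rough sketch of the manual approach, the following writes the dummy dendrogram to a PNG file. The file name, pixel dimensions, resolution, cex and margin values are arbitrary assumptions to be tuned for your own data, and it assumes your R build has a working png device.

```r
# Dummy data, as in the example above
distMatrix <- dist(matrix(1:9, ncol = 3))
dendro <- as.dendrogram(hclust(distMatrix))

# Hypothetical output file; width and height are in pixels, res in ppi
out_file <- file.path(tempdir(), "dendrogram.png")
png(out_file, width = 3000, height = 2000, res = 300)

# Shrink the text and widen the right margin so horizontal labels fit
op <- par(cex = 0.8, mar = c(4, 4, 2, 6))
plot(dendro, horiz = TRUE)
par(op)

dev.off()
```

For a tree with many leaves, increasing height (or width when horiz = FALSE) is usually the first thing to try.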
New axis
The axis function can help you.
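For instance, a minimal sketch that suppresses the default height axis and draws a custom one on the right-hand side. The choice of side and tick positions is only an illustration; ticks is a made-up name.

```r
dendro <- as.dendrogram(hclust(dist(matrix(1:9, ncol = 3))))

# Draw the dendrogram without its default y axis
plot(dendro, yaxt = "n")

# Add a new axis on the right, with ticks spanning 0 to the root height
ticks <- pretty(c(0, attr(dendro, "height")))
axis(side = 4, at = ticks)
```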
Creating dendrograms manually: how to fix 'merge' matrix has invalid contents in plot.hclust?
The validity of an hclust tree is checked by the .validity.hclust function. Its source code is given here. Look at lines 121-135.
The error means that your tree is not valid because of its merge matrix: it has non-unique elements (e.g., 1 and 2). In a properly constructed merge matrix, all entries are unique and run from -N_obs to N_obs-2 (zero excluded), where N_obs is the (positive) number of observations. This is checked by the following if test in the code:
if(identical(sort(as.integer(merge)), c(-(n:1L), +seq_len(n-2L))))
    TRUE
else
    "'merge' matrix has invalid contents"
From the reference of hclust:
merge an n − 1 by 2 matrix.
Row i of merge describes the merging of clusters at step i of the
clustering. If an element j in the row is negative, then observation
− j was merged at this stage. If j is positive then the merge was
with the cluster formed at the (earlier) stage j of the algorithm.
Thus negative entries in merge indicate agglomerations of singletons,
and positive entries indicate agglomerations of non-singletons.
All negative entries are singletons (observations), and positive numbers are merges of existing clusters and refer to merging steps of the algorithm.
So, revise your hclust object. Here is some code to give you an idea of what a proper hclust object looks like:
iris2 <- iris[1:20,-5]
species_labels <- iris[,5]
d_iris <- dist(iris2)
tree_iris <- hclust(d_iris, method = "complete")
Take a closer look at tree_iris$merge.
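To see the rule in action, here is a small sketch that applies the same test by hand to a valid tree and to a deliberately broken copy. The names expected and bad are illustrative, and n is taken as the number of rows of merge plus one, following the "n − 1 by 2" description above.

```r
tree_iris <- hclust(dist(iris[1:20, -5]), method = "complete")

n <- nrow(tree_iris$merge) + 1L            # number of observations
expected <- c(-(n:1L), seq_len(n - 2L))    # -n, ..., -1, 1, ..., n-2

# A valid merge matrix contains exactly these values, each once
identical(sort(as.integer(tree_iris$merge)), expected)

# Duplicating one entry breaks the rule
bad <- tree_iris$merge
bad[2, 1] <- bad[1, 1]
identical(sort(as.integer(bad)), expected)
```

Running the same comparison on your own merge matrix shows quickly whether an entry is duplicated or missing.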
UPDATE
After I got more time, I decided to fix your code. I modified the merge entry of the tree. This is what the working code that reproduces your dendrogram looks like:
tree <- list()
tree$merge <- matrix(c( -1,  -7, # row 1
                        -2,  -6, # row 2
                        -3, -12, # row 3
                        -4, -14, # row 4
                        -5,  -8, # row 5
                        -9, -11, # row 6
                       -13, -20, # row 7
                       -15, -19, # row 8
                         1,   8, # row 9: 1,7,15,19
                         2,   5, # row 10: 2,6,5,8
                         3,   6, # row 11: 3,12,9,11
                        10, -18, # row 12: 2,6,5,8 + 18
                         9,  11, # row 13: 1,7,15,19 + 3,12,9,11
                        12,   4, # row 14: row 12 + row 4
                       -10,   7, # row 15: row 7 + 10
                       -16, -17, # row 16
                        13,  14, # row 17: row 13 + row 14
                        15,  16, # row 18: row 15 + row 16
                        17,  18), # row 19: row 17 + row 18
                     ncol = 2,
                     byrow = TRUE)
tree$height <- c(0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.11167131, 0.11167131, 0.11167131, 0.12832304, 0.17304035, 0.17304035, 0.17304035, 0.17304035, 0.22965349, 0.22965349, 0.23334799)
tree$labels <- as.character(1:20)
tree$order <- c(1, 7, 15, 19, 3, 12, 9, 11, 2, 6, 5, 8, 18, 4, 14, 13, 20, 10, 16, 17)
class(tree) <- "hclust"
plot(tree)
Cut a dendrogram
cut will cut the tree at a specified height. It returns a list of the upper and lower portions:
cut(dend, h = depth.cutoff)
# $upper
# 'dendrogram' with 2 branches and 5 members total, at height 5.887262
#
# $lower
# $lower[[1]]
# 'dendrogram' with 2 branches and 6 members total, at height 4.515119
#
# $lower[[2]]
# 'dendrogram' with 2 branches and 2 members total, at height 3.789259
#
# $lower[[3]]
# 'dendrogram' with 2 branches and 5 members total, at height 3.837733
#
# $lower[[4]]
# 'dendrogram' with 2 branches and 3 members total, at height 3.845031
#
# $lower[[5]]
# 'dendrogram' with 2 branches and 4 members total, at height 4.298743
plot(cut(dend, h = depth.cutoff)$upper, horiz = TRUE)
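The snippet above assumes dend and depth.cutoff already exist. A self-contained sketch with made-up random data, where the cutoff is arbitrarily set to half the root height:

```r
set.seed(1)                                   # reproducible dummy data
dend <- as.dendrogram(hclust(dist(matrix(rnorm(100), nrow = 20))))

# Arbitrary cutoff for illustration: half the height of the root node
depth.cutoff <- attr(dend, "height") / 2

parts <- cut(dend, h = depth.cutoff)
names(parts)           # the list components: "upper" and "lower"
length(parts$lower)    # number of subtrees below the cut

# Plot only the part of the tree above the cut
plot(parts$upper, horiz = TRUE)
```

Each element of parts$lower is itself a dendrogram, so the subtrees can be plotted individually in the same way.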