How to convert a symmetric matrix into dist object?
It sounds like you already have a distance matrix computed and want to use it in hclust. As @shadow said, you can use as.dist(yourMatrix) to convert it to the dist format.
Given a symmetric table of distances:
> yourMatrix <- matrix(c(1,2,3,4,2,1,2,1,3,2,1,3,4,1,3,1), nrow=4)
> yourMatrix
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    2    1    2    1
[3,]    3    2    1    3
[4,]    4    1    3    1
> as.dist(yourMatrix)
  1 2 3
2 2
3 3 2
4 4 1 3
Make sure that the values in your matrix are dissimilarities (distances) rather than similarity scores. Note that as.dist() reads only the lower triangle of the matrix, so the diagonal values are ignored.
How to convert data.frame into distance matrix for hierarchical clustering?
temp = as.vector(na.omit(unlist(df1)))
NM = unique(c(colnames(df1), row.names(df1)))
mydist = structure(temp, Size = length(NM), Labels = NM,
                   Diag = FALSE, Upper = FALSE, method = "euclidean", #Optional
                   class = "dist")
mydist
#     DA   DB   DC   DD
#DB 0.39
#DC 0.44 0.35
#DD 0.30 0.48 0.32
#DE 0.50 0.80 0.91 0.70
plot(hclust(mydist))
DATA
df1 = structure(list(DA = c(0.39, 0.44, 0.3, 0.5), DB = c(NA, 0.35,
0.48, 0.8), DC = c(NA, NA, 0.32, 0.91), DD = c(NA, NA, NA, 0.7
)), .Names = c("DA", "DB", "DC", "DD"), class = "data.frame", row.names = c("DB",
"DC", "DD", "DE"))
Convert scipy condensed distance matrix to lower matrix read by rows
Here's a quick implementation--but it creates the square redundant distance matrix as an intermediate step:
In [128]: import numpy as np
In [129]: from scipy.spatial.distance import squareform
c is the condensed form of the distance matrix:
In [130]: c = np.array([1, 2, 3, 4, 5, 6])
d is the redundant square distance matrix:
In [131]: d = squareform(c)
Here's your condensed lower triangle distances:
In [132]: d[np.tril_indices(d.shape[0], -1)]
Out[132]: array([1, 2, 4, 3, 5, 6])
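To see why reading the lower triangle row by row reorders the values, it can help to look at the square matrix itself. The following is a minimal sketch that reuses the same c as above; the variable names rows, cols and lower_by_rows are only illustrative.

```python
import numpy as np
from scipy.spatial.distance import squareform

# Condensed form: the upper triangle of the square matrix, read row by row.
c = np.array([1, 2, 3, 4, 5, 6])

# Redundant square form: symmetric, with zeros on the diagonal.
d = squareform(c)
print(d)
# The matrix is:
#   0 1 2 3
#   1 0 4 5
#   2 4 0 6
#   3 5 6 0

# np.tril_indices walks the strict lower triangle row by row:
# (1,0), (2,0), (2,1), (3,0), (3,1), (3,2)
rows, cols = np.tril_indices(d.shape[0], -1)
lower_by_rows = d[rows, cols]
print(lower_by_rows)  # the values 1, 2, 4, 3, 5, 6
```

Row 2 of the lower triangle holds the values 2 and 4, which are not adjacent in c; that is where the permutation comes from.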
Here's a method that avoids forming the redundant distance matrix. The function condensed_index(i, j, n) takes the row i and column j of the redundant distance matrix, with j > i, and returns the corresponding index in the condensed distance array.
In [169]: def condensed_index(i, j, n):
     ...:     return n*i - i*(i+1)//2 + j - i - 1
     ...:
As above, c is the condensed distance array.
In [170]: c
Out[170]: array([1, 2, 3, 4, 5, 6])
In [171]: n = 4
In [172]: i, j = np.tril_indices(n, -1)
Note that the arguments are reversed in the following call:
In [173]: indices = condensed_index(j, i, n)
indices gives the desired permutation of the condensed distance array.
In [174]: c[indices]
Out[174]: array([1, 2, 4, 3, 5, 6])
(Basically the same function as condensed_index(i, j, n) was given in several answers to this question.)
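The two approaches can be wrapped up and checked against each other. This is only a sketch: condensed_to_lower_by_rows is a hypothetical helper name, and the comparison uses random data rather than anything from the original question.

```python
import numpy as np
from scipy.spatial.distance import squareform


def condensed_index(i, j, n):
    """Index into the condensed array for row i, column j (requires i < j)."""
    return n * i - i * (i + 1) // 2 + j - i - 1


def condensed_to_lower_by_rows(c, n):
    """Reorder a condensed distance array into lower-triangle, row-by-row order.

    Hypothetical helper; n is the number of observations.
    """
    c = np.asarray(c)
    if c.shape != (n * (n - 1) // 2,):
        raise ValueError("c must have length n*(n-1)/2")
    rows, cols = np.tril_indices(n, -1)        # rows > cols in the lower triangle
    return c[condensed_index(cols, rows, n)]   # swap so the smaller index comes first


# Compare against the squareform-based approach for a few sizes.
rng = np.random.default_rng(0)
for n in (2, 3, 4, 7):
    c = rng.random(n * (n - 1) // 2)
    d = squareform(c)
    expected = d[np.tril_indices(n, -1)]
    assert np.array_equal(condensed_to_lower_by_rows(c, n), expected)
```

The index-based version never allocates the n-by-n matrix, which matters mainly when n is large.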
Convert distance pairs to distance matrix to use in hierarchical clustering
You say you will use scipy for clustering, so I assume that means you will use the function scipy.cluster.hierarchy.linkage. linkage accepts the distance data in "condensed" form, so you don't have to create the full symmetric distance matrix. (See, e.g., How does condensed distance matrix work? (pdist), for a discussion on the condensed form.)
So all you have to do is get obj_distances.values() into a known order and pass that to linkage. That's what is done in the following snippet:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

obj_distances = {
    ('obj2', 'obj3'): 1.8,
    ('obj3', 'obj1'): 1.95,
    ('obj1', 'obj4'): 2.5,
    ('obj1', 'obj2'): 2.0,
    ('obj4', 'obj2'): 2.1,
    ('obj3', 'obj4'): 1.58,
}

# Put each key pair in a canonical order, so we know that if (a, b) is a key,
# then a < b. If this is already true, then the next three lines can be
# replaced with
#     sorted_keys, distances = zip(*sorted(obj_distances.items()))
# Note: we assume there are no keys where the two objects are the same,
# and that every pair of objects appears exactly once.
keys = [sorted(k) for k in obj_distances.keys()]
values = obj_distances.values()
sorted_keys, distances = zip(*sorted(zip(keys, values)))

# linkage accepts the "condensed" format of the distances.
Z = linkage(distances)

# Optional: create a sorted list of the objects.
labels = sorted(set([key[0] for key in sorted_keys] + [sorted_keys[-1][-1]]))

dendrogram(Z, labels=labels)
plt.show()  # display the plot when running as a script
[Figure: the dendrogram of obj1 to obj4 produced by the snippet above]
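Sorting the keys only yields a valid condensed array when every pair of objects is present. A slightly more defensive sketch is shown below; condensed_from_pairs is a hypothetical helper, not part of SciPy, and it builds the condensed array in the same pair order that itertools.combinations produces for the sorted labels.

```python
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import linkage


def condensed_from_pairs(pair_distances):
    """Return (labels, condensed distances) from a {(a, b): distance} dict.

    Hypothetical helper: keys may be given in either order, but every
    unordered pair of distinct labels must be present.
    """
    lookup = {frozenset(k): v for k, v in pair_distances.items()}
    labels = sorted({name for pair in pair_distances for name in pair})
    condensed = []
    # combinations() yields (0,1), (0,2), ..., (1,2), ...: the condensed order.
    for a, b in combinations(labels, 2):
        key = frozenset((a, b))
        if key not in lookup:
            raise KeyError(f"missing distance for pair ({a}, {b})")
        condensed.append(lookup[key])
    return labels, np.array(condensed, dtype=float)


obj_distances = {
    ('obj2', 'obj3'): 1.8,
    ('obj3', 'obj1'): 1.95,
    ('obj1', 'obj4'): 2.5,
    ('obj1', 'obj2'): 2.0,
    ('obj4', 'obj2'): 2.1,
    ('obj3', 'obj4'): 1.58,
}

labels, distances = condensed_from_pairs(obj_distances)
Z = linkage(distances)  # default method is single linkage
```

The resulting labels and Z can be passed to dendrogram(Z, labels=labels) exactly as in the snippet above.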
Copying distance matrix as a matrix into excel from R
For copying data from R to Excel I wrote this little function, which you can use:
writeClipboardDf <- function (dat, format = 1, colnames = TRUE, rownames = FALSE,
                              token = "\t") {
  if (is.null(dim(dat))) {
    # Plain vector: write one element per line.
    writeClipboard(as.character(dat), format)
  }
  else {
    # Join the fields of each row with the separator token.
    strDf <- apply(dat, 1, paste, collapse = token)
    if (rownames) {
      rn <- rownames(dat)
      if (is.null(rn)) {
        rn <- seq_len(nrow(dat))
      }
      if (colnames) {
        # colnames() works for both matrices and data frames;
        # names() returns NULL for a matrix.
        strDf <- c(paste(colnames(dat), collapse = token),
                   strDf)
        rn <- c("", rn)
      }
      strDf <- paste(rn, strDf, sep = token)
    }
    else {
      if (colnames) {
        strDf <- c(paste(colnames(dat), collapse = token),
                   strDf)
      }
    }
    writeClipboard(strDf, format)
  }
}
Basically, it takes the input and joins the fields with tabs (\t) so that you can conveniently paste the data into Excel. All you have to do then is call writeClipboardDf(as.matrix(dist(x$columnname))). Note that writeClipboard() is only available in R on Windows.