Convert and Save Distance Matrix to a Specific Format

How to convert a symmetric matrix into dist object?

It sounds like you already have a matrix calculated, and want to use that in hclust. Like @shadow said, you can use as.dist(yourMatrix) to convert to the dist format.

Given a symmetric table of distances:

> yourMatrix<-matrix(c(1,2,3,4,2,1,2,1,3,2,1,3,4,1,3,1), nrow=4)
> yourMatrix
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    2    1    2    1
[3,]    3    2    1    3
[4,]    4    1    3    1
> as.dist(yourMatrix)
  1 2 3
2 2
3 3 2
4 4 1 3

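Make sure that the values in your matrix are dissimilarities or distances rather than similarity scores; as.dist does not check this. It reads only the lower triangle and ignores the diagonal, which is why the 1s on the diagonal above do not appear in the result.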
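For readers doing the same thing in Python, here is a minimal sketch of the equivalent conversion with scipy's squareform, using the same values as yourMatrix. The variable names are illustrative. Unlike as.dist, squareform's default checks expect a zero diagonal, so the sketch zeroes it explicitly first.

```python
import numpy as np
from scipy.spatial.distance import squareform

# Same values as yourMatrix in the R example above.
your_matrix = np.array([
    [1, 2, 3, 4],
    [2, 1, 2, 1],
    [3, 2, 1, 3],
    [4, 1, 3, 1],
], dtype=float)

# Work on a copy and zero the diagonal, since a distance matrix
# is expected to have zero self-distances.
d = your_matrix.copy()
np.fill_diagonal(d, 0.0)

# The matrix must be symmetric for the condensed form to be meaningful.
assert np.allclose(d, d.T)

# Condensed form: the upper triangle read row by row.
condensed = squareform(d)
print(condensed)
```

The condensed vector holds the same six values that as.dist keeps, just stored as a flat array.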

How to convert data.frame into distance matrix for hierarchical clustering?

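# Flatten the lower triangle column by column, dropping the NA entries.
temp = as.vector(na.omit(unlist(df1)))
# All object labels, in order: column names first, then any extra row names.
NM = unique(c(colnames(df1), row.names(df1)))
mydist = structure(temp, Size = length(NM), Labels = NM,
                   Diag = FALSE, Upper = FALSE, method = "euclidean", # Optional
                   class = "dist")
mydist
#      DA   DB   DC   DD
# DB 0.39
# DC 0.44 0.35
# DD 0.30 0.48 0.32
# DE 0.50 0.80 0.91 0.70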

plot(hclust(mydist))

[Figure: dendrogram produced by plot(hclust(mydist))]
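A rough Python counterpart, as a sketch only: the values are copied by hand from df1 (listed under DATA), and the array layout and variable names are assumptions made for illustration. It relies on the fact that R's dist object stores the lower triangle column by column, which is the same ordering as scipy's condensed form.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

labels = ["DA", "DB", "DC", "DD", "DE"]

# Lower triangle as in df1: rows DB..DE, columns DA..DD, NaN where empty.
lower = np.array([
    [0.39, np.nan, np.nan, np.nan],
    [0.44, 0.35, np.nan, np.nan],
    [0.30, 0.48, 0.32, np.nan],
    [0.50, 0.80, 0.91, 0.70],
])

# Read the table column by column and drop the NaN placeholders.
flat = lower.T.ravel()
condensed = flat[~np.isnan(flat)]

# Expand to the full symmetric matrix to inspect individual entries.
square = squareform(condensed)

# hclust defaults to complete linkage, while scipy's linkage defaults to
# single linkage, so the method is named explicitly here.
Z = linkage(condensed, method="complete")
print(Z)
```

Passing labels to scipy.cluster.hierarchy.dendrogram(Z, labels=labels) would then draw the tree with the object names.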

DATA

df1 = structure(list(DA = c(0.39, 0.44, 0.3, 0.5), DB = c(NA, 0.35, 
0.48, 0.8), DC = c(NA, NA, 0.32, 0.91), DD = c(NA, NA, NA, 0.7
)), .Names = c("DA", "DB", "DC", "DD"), class = "data.frame", row.names = c("DB",
"DC", "DD", "DE"))

Convert scipy condensed distance matrix to lower matrix read by rows

Here's a quick implementation--but it creates the square redundant distance matrix as an intermediate step:

In [128]: import numpy as np

In [129]: from scipy.spatial.distance import squareform

c is the condensed form of the distance matrix:

In [130]: c = np.array([1, 2, 3, 4, 5, 6])

d is the redundant square distance matrix:

In [131]: d = squareform(c)

Here's your condensed lower triangle distances:

In [132]: d[np.tril_indices(d.shape[0], -1)]
Out[132]: array([1, 2, 4, 3, 5, 6])

Here's a method that avoids forming the redundant distance matrix. The function condensed_index(i, j, n) takes the row i and column j of the redundant distance matrix, with j > i, and returns the corresponding index in the condensed distance array.

In [169]: def condensed_index(i, j, n):
     ...:     return n*i - i*(i+1)//2 + j - i - 1
     ...:

As above, c is the condensed distance array.

In [170]: c
Out[170]: array([1, 2, 3, 4, 5, 6])

In [171]: n = 4

In [172]: i, j = np.tril_indices(n, -1)

Note that the arguments are reversed in the following call:

In [173]: indices = condensed_index(j, i, n)

indices gives the desired permutation of the condensed distance array.

In [174]: c[indices]
Out[174]: array([1, 2, 4, 3, 5, 6])

(Basically the same function as condensed_index(i, j, n) was given in several answers to this question.)
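The two approaches can be wrapped up and cross-checked in a standalone script. This is a sketch under a few assumptions: the helper name lower_by_rows is made up for illustration, and n is recovered from the length of the condensed array rather than passed in.

```python
import math

import numpy as np
from scipy.spatial.distance import squareform


def condensed_index(i, j, n):
    """Index into the condensed array for entry (i, j) with j > i."""
    return n * i - i * (i + 1) // 2 + j - i - 1


def lower_by_rows(c):
    """Reorder a condensed distance array as the lower triangle read by rows.

    Hypothetical helper: it avoids building the square matrix.
    """
    c = np.asarray(c)
    m = c.shape[0]
    # Solve m = n*(n-1)/2 for n and verify that the length is valid.
    n = (1 + math.isqrt(1 + 8 * m)) // 2
    if n * (n - 1) // 2 != m:
        raise ValueError("length is not a valid condensed size")
    rows, cols = np.tril_indices(n, -1)
    # In the lower triangle rows > cols, so the arguments are swapped.
    return c[condensed_index(cols, rows, n)]


# Cross-check against the squareform route on a 7x7 example.
c = np.arange(1, 22)
d = squareform(c)
expected = d[np.tril_indices(7, -1)]
result = lower_by_rows(c)
print(np.array_equal(result, expected))
```

The index-based version uses memory proportional to the condensed array, which matters when n is large.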

Convert distance pairs to distance matrix to use in hierarchical clustering

You say you will use scipy for clustering, so I assume that means you will use the function scipy.cluster.hierarchy.linkage. linkage accepts the distance data in "condensed" form, so you don't have to create the full symmetric distance matrix. (See, e.g., How does condensed distance matrix work? (pdist), for a discussion on the condensed form.)

So all you have to do is get obj_distances.values() into a known order and pass that to linkage. That's what is done in the following snippet:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

obj_distances = {
    ('obj2', 'obj3'): 1.8,
    ('obj3', 'obj1'): 1.95,
    ('obj1', 'obj4'): 2.5,
    ('obj1', 'obj2'): 2.0,
    ('obj4', 'obj2'): 2.1,
    ('obj3', 'obj4'): 1.58,
}

# Put each key pair in a canonical order, so we know that if (a, b) is a key,
# then a < b. If this is already true, then the next three lines can be
# replaced with
#     sorted_keys, distances = zip(*sorted(obj_distances.items()))
# Note: we assume there are no keys where the two objects are the same,
# and that every pair of objects appears exactly once.
keys = [sorted(k) for k in obj_distances.keys()]
values = obj_distances.values()
sorted_keys, distances = zip(*sorted(zip(keys, values)))

# linkage accepts the "condensed" format of the distances.
Z = linkage(distances)

# Optional: create a sorted list of the objects.
labels = sorted(set([key[0] for key in sorted_keys] + [sorted_keys[-1][-1]]))

dendrogram(Z, labels=labels)
plt.show()

[Figure: the resulting dendrogram]
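Sorting the keys only gives the right condensed order when every pair is present exactly once. A slightly more defensive sketch is shown here; the function name condensed_from_pairs is hypothetical, and it walks the pairs in condensed order and fails loudly if one is missing.

```python
from itertools import combinations


def condensed_from_pairs(pair_distances):
    """Return (labels, condensed) from a dict keyed by unordered pairs.

    Hypothetical helper: raises KeyError if any pair of objects is missing.
    """
    # frozenset keys make ('a', 'b') and ('b', 'a') equivalent.
    lookup = {frozenset(pair): dist for pair, dist in pair_distances.items()}
    labels = sorted({name for pair in pair_distances for name in pair})
    condensed = []
    # combinations() over sorted labels yields (0,1), (0,2), ..., (1,2), ...
    # which is exactly the order of the condensed form.
    for a, b in combinations(labels, 2):
        key = frozenset((a, b))
        if key not in lookup:
            raise KeyError(f"missing distance for pair ({a}, {b})")
        condensed.append(lookup[key])
    return labels, condensed


obj_distances = {
    ('obj2', 'obj3'): 1.8,
    ('obj3', 'obj1'): 1.95,
    ('obj1', 'obj4'): 2.5,
    ('obj1', 'obj2'): 2.0,
    ('obj4', 'obj2'): 2.1,
    ('obj3', 'obj4'): 1.58,
}

labels, condensed = condensed_from_pairs(obj_distances)
print(labels)
print(condensed)
```

The resulting list can be passed to linkage in the same way as distances in the snippet above.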

Copying distance matrix as a matrix into excel from R

For copying data from R to Excel I wrote this little function, which you can use:

writeClipboardDf <- function (dat, format = 1, colnames = TRUE, rownames = FALSE,
                              token = "\t") {
    if (is.null(dim(dat))) {
        # Plain vector: write one element per line.
        writeClipboard(as.character(dat), format)
    }
    else {
        # One string per row, with fields separated by token.
        strDf <- apply(dat, 1, paste, collapse = token)
        if (rownames) {
            rn <- rownames(dat)
            if (is.null(rn)) {
                rn <- 1:(nrow(dat))
            }
            if (colnames) {
                # colnames() works for both matrices and data frames;
                # names() returns NULL for a matrix.
                strDf <- c(paste(colnames(dat), collapse = token),
                           strDf)
                rn <- c("", rn)
            }
            strDf <- paste(rn, strDf, sep = token)
        }
        else {
            if (colnames) {
                strDf <- c(paste(colnames(dat), collapse = token),
                           strDf)
            }
        }
        writeClipboard(strDf, format)
    }
}

Basically, it takes the input and joins the fields with tabs (\t) so that you can conveniently paste the data into Excel. All you have to do then is call writeClipboardDf(as.matrix(dist(x$columnname))). Note that writeClipboard is only available on Windows.
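The same tab-separated layout can be produced in Python. This is a sketch of the text-building step only: the function name matrix_to_tsv, the sample points, and the labels are assumptions for illustration, and copying to the clipboard is left out because it depends on the platform.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def matrix_to_tsv(matrix, labels):
    """Format a square matrix as tab-separated text with row and column labels."""
    # Header row starts with an empty cell above the row labels.
    header = "\t".join([""] + list(labels))
    rows = [
        "\t".join([label] + [f"{value:g}" for value in row])
        for label, row in zip(labels, matrix)
    ]
    return "\n".join([header] + rows)


# Three illustrative 2-D points; pdist returns the condensed distances
# and squareform expands them to the full matrix.
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
square = squareform(pdist(points))

tsv = matrix_to_tsv(square, ["a", "b", "c"])
print(tsv)
```

Writing tsv to a file, or pasting the printed text into a spreadsheet, should give one cell per value.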


