Convert a Dataframe to an Object of Class "Dist" Without Actually Calculating Distances in R

I had a similar problem not too long ago and solved it like this:

n <- max(table(df$site.x)) + 1  # +1 leaves room for the diagonal
res <- lapply(with(df, split(Distance, df$site.x)), function(x) c(rep(NA, n - length(x)), x))
res <- do.call("rbind", res)
res <- rbind(res, rep(NA, n))
res <- as.dist(t(res))
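
For comparison, here is a minimal sketch of another way to get the same kind of object: fill the lower triangle of a square matrix by name and hand it to as.dist(). The data frame below is made up for illustration, and the site.y column name is an assumption (only site.x and Distance appear in the code above). It also assumes each pair is listed once, with site.x appearing before site.y in the site ordering, so that every value lands below the diagonal.

```r
# Hypothetical long-format table: one row per pair of sites
df <- data.frame(
  site.x   = c("A", "A", "A", "B", "B", "C"),
  site.y   = c("B", "C", "D", "C", "D", "D"),
  Distance = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

# All site labels, in order of first appearance
sites <- unique(c(df$site.x, df$site.y))

# Square matrix of zeros with named rows and columns
m <- matrix(0, nrow = length(sites), ncol = length(sites),
            dimnames = list(sites, sites))

# Fill the lower triangle by name: row = site.y, column = site.x
m[cbind(df$site.y, df$site.x)] <- df$Distance

# as.dist() keeps only the lower triangle; nothing is recalculated
d <- as.dist(m)
d
```

If some pairs are stored the other way round, they would end up above the diagonal and as.dist() would silently ignore them, so it is worth checking the orientation first.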

How to convert data.frame into distance matrix for hierarchical clustering?

temp = as.vector(na.omit(unlist(df1)))
NM = unique(c(colnames(df1), row.names(df1)))
mydist = structure(temp, Size = length(NM), Labels = NM,
                   Diag = FALSE, Upper = FALSE, method = "euclidean", # Optional
                   class = "dist")
mydist
#      DA   DB   DC   DD
# DB 0.39
# DC 0.44 0.35
# DD 0.30 0.48 0.32
# DE 0.50 0.80 0.91 0.70

plot(hclust(mydist))
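
The structure() call works because a "dist" object is just the lower triangle stored column by column, plus a few attributes. A small sketch with three made-up labels (A, B, C are placeholders, not the data from this answer) shows where each value ends up when the object is expanded back to a full matrix:

```r
# Lower-triangle values listed column by column: (B,A), (C,A), (C,B)
v    <- c(0.39, 0.44, 0.35)
labs <- c("A", "B", "C")

# Attach the attributes that make a plain numeric vector a "dist" object
d <- structure(v, Size = length(labs), Labels = labs,
               Diag = FALSE, Upper = FALSE, class = "dist")

# Expand to a full symmetric matrix to inspect the layout
m <- as.matrix(d)
m
```

This is why unlist() on a lower-triangular data frame (followed by dropping the NA values) gives the values in the order that "dist" expects.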

DATA

df1 = structure(list(DA = c(0.39, 0.44, 0.3, 0.5), DB = c(NA, 0.35, 
0.48, 0.8), DC = c(NA, NA, 0.32, 0.91), DD = c(NA, NA, NA, 0.7
)), .Names = c("DA", "DB", "DC", "DD"), class = "data.frame", row.names = c("DB",
"DC", "DD", "DE"))

Efficient Way to Convert CSV of Sparse Distances to Dist Object R

An object of class "dist" is a dense object. Converting from the sparse representation would require a vector with on the order of

R> 0.5*(91000000*90999999)
[1] 4.1405e+15

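elements (give or take the diagonal). In R, the maximum length of a vector was long capped at 2^31 - 1 (long vectors, added in R 3.0.0, raise that cap on 64-bit builds but do not make this much memory available):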

R> 2^31 - 1
[1] 2147483647

which is far smaller than the number of elements needed to store the dense "dist" object, so it won't be possible; that is the reason for the error from dist(). For similar reasons, you won't be able to store the lower-triangle version of the data as a dense object, as it too is held as a vector with the same length limit.
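
The arithmetic above can be sketched as a quick feasibility check. The 8 bytes per element is the size of an R double; the variable names are only illustrative.

```r
n <- 91e6                      # number of objects

# Number of pairwise distances in the lower triangle
n_pairs <- n * (n - 1) / 2

# Does that exceed the classic vector length limit?
too_long <- n_pairs > .Machine$integer.max

# Memory needed to hold the values as doubles, in petabytes (10^15 bytes)
petabytes <- n_pairs * 8 / 1e15

n_pairs
too_long
petabytes                      # roughly 33
```

Even ignoring the length limit, the memory requirement alone rules out a dense object of this size.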

At this point I think you'll need to explain more about the actual problem and what you want the dissimilarity object for (in another question). Do you need all dissimilarities between the 91 million objects, or could you get by with a sample that fits within the length limits of R's vectors?
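
If a sample is acceptable, a minimal sketch might look like the following. The matrix x is simulated stand-in data and the sample size of 100 is arbitrary; neither comes from the original question.

```r
set.seed(1)

# Stand-in for a large data set: 1000 objects, 3 variables
x <- matrix(rnorm(1000 * 3), ncol = 3)

# Draw a random subset of rows that is small enough to handle densely
idx <- sample(nrow(x), 100)

# Dense "dist" object on the sample only
d <- dist(x[idx, ])

length(d)   # 100 * 99 / 2 pairwise distances
```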

Euclidean distance of instances of one dataframe with all the instances of another dataframe

I assume that by "class attribute" you mean that the data set has a categorical variable that needs to be excluded from the calculations. This is accomplished simply by indexing the data and excluding the column in question. In addition, it is convenient to convert the data to a matrix object.

> data_train <- as.matrix(iris[ trainIndex, 1:4])
> data_test <- as.matrix(iris[-trainIndex, 1:4])
> dim(data_train)
[1] 105 4
> dim(data_test)
[1] 45 4
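
The snippet above assumes that trainIndex already exists. One hypothetical way to create it, assuming a simple random 70/30 split (the seed and the split method are illustrative choices, not taken from the question):

```r
set.seed(42)

# Randomly pick 105 of the 150 iris rows for training
trainIndex <- sample(nrow(iris), size = 105)

# Keep only the four numeric columns and convert to matrices
data_train <- as.matrix(iris[ trainIndex, 1:4])
data_test  <- as.matrix(iris[-trainIndex, 1:4])

dim(data_train)
dim(data_test)
```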

Now for the Euclidean distances, you can do

> distance <- sapply(1:nrow(data_test), function(i)
+ sqrt(rowSums(sweep(data_train, 2, data_test[i, ])^2)))
> dim(distance)
[1] 105 45

distance is a 105 by 45 matrix, where the i-th column contains the Euclidean distances between the i-th row of data_test and each of the 105 rows of data_train.
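
As a sanity check, the sweep-based calculation can be compared with dist() on a small example. This sketch uses two arbitrary slices of iris (the row ranges are chosen only for illustration) and pulls the cross-block out of the full distance matrix:

```r
# Two small numeric blocks standing in for the train and test sets
a <- as.matrix(iris[1:5, 1:4])
b <- as.matrix(iris[101:103, 1:4])

# Same technique as above: one column of distances per row of b
cross <- sapply(seq_len(nrow(b)), function(i)
  sqrt(rowSums(sweep(a, 2, b[i, ])^2)))

# Reference: full pairwise distances, then keep rows of a vs. rows of b
full <- as.matrix(dist(rbind(a, b)))
ref  <- full[1:5, 6:8]

all.equal(unname(cross), unname(ref))
```

The dist() route computes many distances that are then thrown away, so the sweep() version scales better when only the cross-distances are needed.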

Then, to find the k smallest distances in each column, you can do the following:

> k <- 3
> apply(distance, 2, function(x) sort(x)[1:k])
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.1000000 0.1414214 0.3000000 0.2828427 0.4582576 0.2000000 0.1414214
[2,] 0.1414214 0.2449490 0.3464102 0.3000000 0.5099020 0.2236068 0.1414214
[3,] 0.1414214 0.2645751 0.3605551 0.3316625 0.5656854 0.2449490 0.1732051
[,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,] 0.3000000 0.244949 0.1414214 0.2236068 0.2645751 0.1414214 0.2236068
[2,] 0.3316625 0.300000 0.1732051 0.3000000 0.3162278 0.2000000 0.2449490
[3,] 0.3464102 0.300000 0.2449490 0.3162278 0.3741657 0.2000000 0.2449490
[,15] [,16] [,17] [,18] [,19] [,20] [,21]
[1,] 0.1414214 0.2645751 0.3872983 0.3605551 0.4898979 0.1414214 0.1414214
[2,] 0.1732051 0.3316625 0.5099020 0.4582576 0.5196152 0.2449490 0.3162278
[3,] 0.2236068 0.4582576 0.5196152 0.6708204 0.5477226 0.4242641 0.3162278
[,22] [,23] [,24] [,25] [,26] [,27] [,28]
[1,] 0.3605551 0.3316625 0.3000000 0.1414214 0.1414214 0.2000000 0.2645751
[2,] 0.4242641 0.3741657 0.3872983 0.1732051 0.2645751 0.4123106 0.5744563
[3,] 0.4690416 0.4000000 0.4358899 0.3000000 0.2828427 0.4795832 0.6082763
[,29] [,30] [,31] [,32] [,33] [,34] [,35]
[1,] 0.1732051 0.2645751 0.4358899 0.2236068 0.4123106 0.5477226 0.2236068
[2,] 0.1732051 0.3162278 0.5291503 0.3741657 0.8185353 0.8944272 0.3000000
[3,] 0.2236068 0.4242641 0.5477226 0.4242641 0.8602325 1.2489996 0.3000000
[,36] [,37] [,38] [,39] [,40] [,41] [,42]
[1,] 0.3162278 0.2645751 0.1732051 0.2828427 0.4582576 0.3316625 0.3162278
[2,] 0.3162278 0.7000000 0.3872983 0.3605551 0.4690416 0.3605551 0.4358899
[3,] 0.3316625 0.9695360 0.4242641 0.4242641 0.5099020 0.3741657 0.4358899
[,43] [,44] [,45]
[1,] 0.2449490 0.2449490 0.2449490
[2,] 0.3464102 0.3605551 0.3000000
[3,] 0.3464102 0.4690416 0.6164414
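
If the goal is a nearest-neighbour lookup, the row indices of those k smallest distances are usually needed as well. A minimal sketch, using a small random matrix as a stand-in for distance (the names nearest_idx and nearest_val are illustrative):

```r
set.seed(1)

# Stand-in for the distance matrix: 5 training rows, 4 test rows
distance <- matrix(runif(20), nrow = 5)
k <- 3

# order() gives the positions of the k nearest training rows per test row
nearest_idx <- apply(distance, 2, function(x) order(x)[1:k])

# sort() gives the corresponding distances, as in the code above
nearest_val <- apply(distance, 2, function(x) sort(x)[1:k])

nearest_idx
```

Each column of nearest_idx can then be used to index the training labels.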

Calculate distance between each pair of coordinates in wide dataframe

The problem you're having is that apply(...) coerces its first argument to a matrix. By definition, a matrix must have all elements of the same data type. Since one of the columns in dat (dat$subcounty) is character, apply(...) coerces everything to character. In your test dataset, everything was numeric, so you didn't have this problem.

This should work:

library(sp)  # provides spDistsN1()
dat$dist.km <- sapply(seq_len(nrow(dat)), function(i)
  spDistsN1(as.matrix(dat[i, 3:4]), as.matrix(dat[i, 5:6]), longlat = TRUE))
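
The coercion itself is easy to reproduce with base R. In this sketch the data frame and its column names (other than subcounty) are made up for illustration:

```r
# Hypothetical data: one character column plus numeric coordinates
dat <- data.frame(
  subcounty = c("a", "b"),
  id        = 1:2,
  lon1      = c(36.8, 36.9),
  lat1      = c(-1.3, -1.2),
  stringsAsFactors = FALSE
)

# apply() converts the whole data frame to a matrix first,
# so every row arrives in the function as a character vector
types <- apply(dat, 1, function(row) class(row))
types

# Indexing the numeric columns row by row keeps them numeric
ok <- sapply(seq_len(nrow(dat)), function(i)
  is.numeric(as.matrix(dat[i, 3:4])))
ok
```

This is why looping over row indices with sapply() and subsetting only the coordinate columns avoids the problem.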

