R - How to Get Row & Column Subscripts of Matched Elements from a Distance Matrix

R - How to get row & column subscripts of matched elements from a distance matrix

A distance matrix is a lower triangular matrix in packed format: the strict lower triangle is stored column by column as a 1D vector. You can check this via

str(distMatrix)
# Class 'dist' atomic [1:10] 1 4 10 15 3 9 14 6 11 5
# ...

Even if we call dist(vec1, diag = TRUE, upper = TRUE), the vector is still the same; only the printing style changes. That is, no matter how you call dist, you always get a vector.
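As a quick check of this claim, the sketch below compares the stored values of two dist calls on the same vec1. Only the Diag and Upper attributes (and the recorded call) should differ, and those attributes only control printing.

```r
vec1 <- c(2, 3, 6, 12, 17)
d_default <- dist(vec1)
d_full    <- dist(vec1, diag = TRUE, upper = TRUE)

# as.vector() drops all attributes, leaving only the stored distances
same_values <- identical(as.vector(d_default), as.vector(d_full))
same_values

# the stored length is n * (n - 1) / 2, not n * n
n <- attr(d_default, "Size")
length(d_full) == n * (n - 1) / 2

# the printing flags are what differ between the two objects
c(attr(d_default, "Diag"), attr(d_full, "Diag"))
```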

This answer focuses on how to transform between 1D and 2D indices, so that you can work with a "dist" object without first making it a complete matrix using as.matrix. If you do want to make it a matrix, use the dist2mat function defined in "as.matrix on a distance object is extremely slow; how to make it faster?".


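2D to 1D

For element (i, j) with i > j of an n x n matrix: k = (2n - j)(j - 1) / 2 + (i - j). The derivation is in the appendix below.

1D to 2D

j = floor(((2n + 1) - sqrt((2n - 1)^2 - 8(k - 1))) / 2), then i = j + k - (2n - j)(j - 1) / 2. The derivation is in the appendix below.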


R functions

It is easy to write vectorized R functions for those index transforms. The only care needed is with out-of-bound indices, for which NA should be returned.

## 2D index to 1D index
f <- function (i, j, dist_obj) {
  if (!inherits(dist_obj, "dist")) stop("please provide a 'dist' object")
  n <- attr(dist_obj, "Size")
  valid <- (i >= 1) & (j >= 1) & (i > j) & (i <= n) & (j <= n)
  k <- (2 * n - j) * (j - 1) / 2 + (i - j)
  k[!valid] <- NA_real_
  k
}

## 1D index to 2D index
finv <- function (k, dist_obj) {
  if (!inherits(dist_obj, "dist")) stop("please provide a 'dist' object")
  n <- attr(dist_obj, "Size")
  valid <- (k >= 1) & (k <= n * (n - 1) / 2)
  k_valid <- k[valid]
  j <- rep.int(NA_real_, length(k))
  j[valid] <- floor(((2 * n + 1) - sqrt((2 * n - 1) ^ 2 - 8 * (k_valid - 1))) / 2)
  i <- j + k - (2 * n - j) * (j - 1) / 2
  cbind(i, j)
}

These functions are extremely cheap in memory usage, as they work with indices instead of matrices.
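One way to convince yourself that the two index formulas are right is to compare them against subscripts that R computes itself. The sketch below is only an illustrative check, with n = 6 chosen arbitrarily; it uses which(lower.tri(...), arr.ind = TRUE) as the reference ordering and applies the same arithmetic as f and finv.

```r
n <- 6                                    # arbitrary size for the check
ref <- which(lower.tri(matrix(0, n, n)), arr.ind = TRUE)
i <- ref[, "row"]
j <- ref[, "col"]

# 2D -> 1D: should enumerate 1, 2, ..., n * (n - 1) / 2 in column-major order
k <- (2 * n - j) * (j - 1) / 2 + (i - j)
forward_ok <- all(k == seq_len(n * (n - 1) / 2))

# 1D -> 2D: should recover the reference subscripts
j2 <- floor(((2 * n + 1) - sqrt((2 * n - 1)^2 - 8 * (k - 1))) / 2)
i2 <- j2 + k - (2 * n - j2) * (j2 - 1) / 2
inverse_ok <- all(i2 == i & j2 == j)

c(forward_ok, inverse_ok)
```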


Applying finv to your question

You can use

vec1 <- c(2,3,6,12,17)
distMatrix <- dist(vec1)

finv(which(distMatrix == 5), distMatrix)
# i j
#[1,] 5 4

Generally speaking, a distance matrix contains floating point numbers. It is risky to use == to judge whether two floating point numbers are equal. Read "Why are these numbers not equal?" for more details and possible strategies.
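One common strategy is to match within a tolerance instead of testing exact equality. The sketch below is a minimal illustration on the same vec1; the names target and tol and the value 1e-8 are assumptions made for this example, not part of the original answer.

```r
vec1 <- c(2, 3, 6, 12, 17)
distMatrix <- dist(vec1)

target <- 5
tol <- 1e-8   # assumed tolerance; pick one that suits the scale of your data

# positions in the packed vector whose value is within tol of the target
k <- which(abs(as.vector(distMatrix) - target) < tol)
k
```

The resulting k can then be passed to finv in place of which(distMatrix == 5).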


Alternative with dist2mat

Using the dist2mat function given in "as.matrix on a distance object is extremely slow; how to make it faster?", we may use which(..., arr.ind = TRUE) on the full matrix.

library(Rcpp)
sourceCpp("dist2mat.cpp")
mat <- dist2mat(distMatrix, 128)
which(mat == 5, arr.ind = TRUE)
# row col
#5 5 4
#4 4 5

Appendix: Markdown (needs MathJax support) for the picture

## 2D index to 1D index

The lower triangular looks like this: $$\begin{pmatrix} 0 & 0 & \cdots & 0\\ \times & 0 & \cdots & 0\\ \times & \times & \cdots & 0\\ \vdots & \vdots & \ddots & 0\\ \times & \times & \cdots & 0\end{pmatrix}$$ If the matrix is $n \times n$, then there are $(n - 1)$ elements ("$\times$") in the 1st column, and $(n - j)$ elements in the $j$<sup>th</sup> column. Thus, for element $(i,\ j)$ (with $i > j$, $j < n$) in the lower triangular, there are $$(n - 1) + \cdots + (n - (j - 1)) = \frac{(2n - j)(j - 1)}{2}$$ "$\times$" in the previous $(j - 1)$ columns, and it is the $(i - j)$<sup>th</sup> "$\times$" in the $j$<sup>th</sup> column. So it is the $$\left\{\frac{(2n - j)(j - 1)}{2} + (i - j)\right\}^{\textit{th}}$$ "$\times$" in the lower triangular.

----

## 1D index to 2D index

Now for the $k$<sup>th</sup> "$\times$" in the lower triangular, how can we find its matrix index $(i,\ j)$? We take two steps: 1) find $j$; 2) obtain $i$ from $k$ and $j$.

The first "$\times$" of the $j$<sup>th</sup> column, i.e., $(j + 1,\ j)$, is the $\left\{\frac{(2n - j)(j - 1)}{2} + 1\right\}^{\textit{th}}$ "$\times$" of the lower triangular, thus $j$ is the maximum value such that $\frac{(2n - j)(j - 1)}{2} + 1 \leq k$. This is equivalent to finding the max $j$ so that $$j^2 - (2n + 1)j + 2(k + n - 1) \geq 0.$$ The LHS is a quadratic polynomial in $j$, and the solution is the largest integer no larger than its smaller root (i.e., the root on the left side): $$j = \left\lfloor\frac{(2n + 1) - \sqrt{(2n-1)^2 - 8(k-1)}}{2}\right\rfloor.$$ Then $i$ can be obtained from $$i = j + k - \left\{\frac{(2n - j)(j - 1)}{2}\right\}.$$
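As a worked check of both formulas (an illustration with $n = 5$ and $k = 10$, the values that arise in the vec1 example of the answer): $$j = \left\lfloor\frac{11 - \sqrt{9^2 - 8 \cdot 9}}{2}\right\rfloor = \left\lfloor\frac{11 - 3}{2}\right\rfloor = 4, \qquad i = 4 + 10 - \frac{(10 - 4)(4 - 1)}{2} = 5.$$ Going the other way, element $(5,\ 4)$ maps to $$\frac{(10 - 4)(4 - 1)}{2} + (5 - 4) = 9 + 1 = 10,$$ so the two transforms agree.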

How to efficiently extract a row or column from a dist distance matrix

Resort to function f in my old answer here.

f <- function (i, j, dist_obj) {
  if (!inherits(dist_obj, "dist")) stop("please provide a 'dist' object")
  n <- attr(dist_obj, "Size")
  valid <- (i >= 1) & (j >= 1) & (i > j) & (i <= n) & (j <= n)
  k <- (2 * n - j) * (j - 1) / 2 + (i - j)
  k[!valid] <- NA_real_
  k
}

A helper function to extract a single row / column (a slice).

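SliceExtract_dist <- function (dist_obj, k) {
  if (length(k) > 1) stop("The function is not 'vectorized'!")
  n <- attr(dist_obj, "Size")
  if (k < 1 || k > n) stop("k out of bound!")
  ## entries (k, j) with j < k: row k of the lower triangle
  ## (seq_len gives an empty vector when k = 1, unlike 1:(k - 1))
  j <- seq_len(k - 1)
  i <- rep.int(k, k - 1)
  v1 <- dist_obj[f(i, j, dist_obj)]
  ## entries (i, k) with i > k: column k of the lower triangle
  ## (empty when k = n)
  i <- k + seq_len(n - k)
  j <- rep.int(k, n - k)
  v2 <- dist_obj[f(i, j, dist_obj)]
  ## the diagonal entry is 0
  c(v1, 0, v2)
}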

Example

set.seed(0)
( d <- dist(cbind(runif(5),runif(5))) )
#          1         2         3         4
#2 0.9401067
#3 0.9095143 0.1162289
#4 0.5618382 0.3884722 0.3476762
#5 0.4275871 0.6968296 0.6220650 0.3368478

SliceExtract_dist(d, 1)
#[1] 0.0000000 0.9401067 0.9095143 0.5618382 0.4275871

SliceExtract_dist(d, 2)
#[1] 0.9401067 0.0000000 0.1162289 0.3884722 0.6968296

SliceExtract_dist(d, 3)
#[1] 0.9095143 0.1162289 0.0000000 0.3476762 0.6220650

SliceExtract_dist(d, 4)
#[1] 0.5618382 0.3884722 0.3476762 0.0000000 0.3368478

SliceExtract_dist(d, 5)
#[1] 0.4275871 0.6968296 0.6220650 0.3368478 0.0000000

Sanity check

as.matrix(d)
#          1         2         3         4         5
#1 0.0000000 0.9401067 0.9095143 0.5618382 0.4275871
#2 0.9401067 0.0000000 0.1162289 0.3884722 0.6968296
#3 0.9095143 0.1162289 0.0000000 0.3476762 0.6220650
#4 0.5618382 0.3884722 0.3476762 0.0000000 0.3368478
#5 0.4275871 0.6968296 0.6220650 0.3368478 0.0000000
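The visual comparison above can also be automated. The sketch below is a self-contained check under stated assumptions: slice_dist is a hypothetical, compact stand-in for SliceExtract_dist that inlines the same index arithmetic, and every slice is compared against the matching row of as.matrix(d).

```r
# hypothetical helper: row/column k of a "dist" object via packed indices
slice_dist <- function (d, k) {
  n <- attr(d, "Size")
  stopifnot(length(k) == 1, k >= 1, k <= n)
  x <- as.vector(d)
  j <- seq_len(k - 1)                        # entries (k, j), j < k
  v1 <- x[(2 * n - j) * (j - 1) / 2 + (k - j)]
  i <- k + seq_len(n - k)                    # entries (i, k), i > k
  v2 <- x[(2 * n - k) * (k - 1) / 2 + (i - k)]
  c(v1, 0, v2)
}

set.seed(0)
d <- dist(cbind(runif(5), runif(5)))
m <- unname(as.matrix(d))

# TRUE if every slice agrees with the corresponding row of the full matrix
all_match <- all(vapply(seq_len(nrow(m)),
                        function (k) isTRUE(all.equal(slice_dist(d, k), m[k, ])),
                        logical(1)))
all_match
```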

Note: Function to extract diagonals readily exists.

How do I manipulate/access elements of an instance of dist class using core R?

I don't have a straight answer to your question, but if you are using the Euclidean distance, have a look at the rdist function from the fields package. Its implementation (in Fortran) is faster than dist, and the output is of class matrix. At the very least, it shows that some developers have chosen to move away from this dist class, maybe for the exact reason you are mentioning. If you are concerned that using a full matrix for storing a symmetric matrix is an inefficient use of memory, you could convert it to a triangular matrix.

library("fields")
points <- matrix(runif(1000*100), nrow=1000, ncol=100)

system.time(dist1 <- dist(points))
#    user  system elapsed
#   7.277   0.000   7.338

system.time(dist2 <- rdist(points))
#    user  system elapsed
#   2.756   0.060   2.851

class(dist2)
# [1] "matrix"
dim(dist2)
# [1] 1000 1000
dist2[1:3, 1:3]
#              [,1]         [,2]         [,3]
# [1,] 0.0000000001 3.9529674733 3.8051198575
# [2,] 3.9529674733 0.0000000001 3.6552146293
# [3,] 3.8051198575 3.6552146293 0.0000000001
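To put a rough number on the memory concern, the base R sketch below compares the size of a packed dist object with its full-matrix form. The data (200 random points in 3D) are made up for illustration, and the exact byte counts will vary with R version and platform.

```r
set.seed(1)
pts <- matrix(runif(200 * 3), nrow = 200)   # hypothetical data: 200 points in 3D

d <- dist(pts)          # packed: 200 * 199 / 2 = 19900 values
m <- as.matrix(d)       # full:   200 * 200     = 40000 values

sizes <- c(dist = as.numeric(object.size(d)),
           matrix = as.numeric(object.size(m)))
sizes
sizes[["matrix"]] / sizes[["dist"]]   # how many times larger the full matrix is
```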

How to make a distance matrix from distance measurements from a loop in R?

Why are you using a loop in the first place? I'd use expand.grid and by to do the job:

comb = expand.grid(names(signature), names(signature))   # I fixed this line!
scores = by(comb,list(comb$Var1,comb$Var2), FUN=function(x) score(signature[[x[[1]]]],signature[[x[[2]]]]))
class(scores)="matrix"
scores
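Since signature and score come from the question and are not shown here, the following is only a runnable sketch of the same loop-free idea with made-up stand-ins for both. It also swaps by for mapply, which keeps the pairing explicit; treat it as one possible variant, not as the code above.

```r
# hypothetical stand-ins for the objects in the question
signature <- list(a = c(1, 2, 3), b = c(2, 4, 6), c = c(0, 1, 0))
score <- function (x, y) sqrt(sum((x - y)^2))

nm <- names(signature)
comb <- expand.grid(Var1 = nm, Var2 = nm, stringsAsFactors = FALSE)

# one score per (Var1, Var2) pair; expand.grid varies Var1 fastest,
# so filling a matrix column by column puts Var1 on the rows
vals <- mapply(function (p, q) score(signature[[p]], signature[[q]]),
               comb$Var1, comb$Var2)
scores <- matrix(vals, nrow = length(nm), dimnames = list(nm, nm))
scores

# this score is symmetric with a zero diagonal, so as.dist() can pack it
as.dist(scores)
```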

How do I go from cell given by dist back to row and column numbers

There might be a tidier way ...

dist.x <- dist(x)
which(as.matrix(dist.x) == max(dist.x) & lower.tri(dist.x), arr.ind=TRUE)
# row col
# 5 5 1
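A possibly tidier route avoids as.matrix altogether: locate the maximum in the packed vector, then convert that position to a row and column with index arithmetic. The sketch below assumes made-up data for x, since the original x is not shown.

```r
set.seed(42)
x <- matrix(rnorm(12), ncol = 2)   # hypothetical data: 6 points in 2D
dist.x <- dist(x)

n <- attr(dist.x, "Size")
k <- which.max(as.vector(dist.x))  # position in the packed vector

# invert the packed index: column first, then row
j <- floor(((2 * n + 1) - sqrt((2 * n - 1)^2 - 8 * (k - 1))) / 2)
i <- j + k - (2 * n - j) * (j - 1) / 2
c(row = i, col = j)
```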

Extract diagonals from a distance matrix in R

One workaround is to convert the dist object to a matrix and then extract the elements whose row index is one larger than the column index:

mat = as.matrix(dist(mymatrix))
mat[row(mat) == col(mat) + 1]
# [1] 2.828427 3.000000 2.828427
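If the matrix conversion is too costly, the same sub-diagonal can be read straight from the packed vector, because element (j + 1, j) is the first stored entry of column j. This is a sketch with made-up data standing in for mymatrix.

```r
mymatrix <- matrix(c(1, 3, 6, 10, 2, 5, 9, 14), ncol = 2)   # hypothetical data
d <- dist(mymatrix)
n <- attr(d, "Size")

# element (j + 1, j) is the first stored entry of column j
j <- seq_len(n - 1)
k <- (2 * n - j) * (j - 1) / 2 + 1
sub_diag <- as.vector(d)[k]
sub_diag
```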

N'th minimum pair from dist function

You want to split row / col index by entry:

n <- nrow(dt) - 1
j <- rep.int(1:n, n:1) # column number
i <- j + sequence(n:1) # row number
x <- dist(dt)
loc <- data.frame(i, j)
pair <- split(loc, x)

Sometimes it is a good idea to enforce factor levels:

lev <- sort(unique(x))
pair <- split(loc, factor(x, lev))
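To see what the result looks like, here is the same recipe on a tiny made-up dt in which two pairs share the smallest distance. The data are an assumption for illustration, and as.vector is used so that the factor is built from a plain numeric vector.

```r
dt <- data.frame(x = c(0, 1, 2, 5))   # hypothetical data with one tied distance

n <- nrow(dt) - 1
j <- rep.int(1:n, n:1)                # column number
i <- j + sequence(n:1)                # row number
x <- as.vector(dist(dt))              # plain numeric vector of distances
loc <- data.frame(i, j)

lev <- sort(unique(x))
pair <- split(loc, factor(x, lev))

pair[[1]]    # all pairs at the smallest distance
names(pair)  # the distinct distances, in increasing order
```

Each list element is a data frame, so ties show up as several rows under one distance.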

Misc

My solution above is exhaustive, in that even if you only want the indices for the minimum, it returns the full list. You can then extract what you need, for example pair[3] for the 3rd minimum.

While this is interesting in its own right, it is inefficient if you only ever want the result for one entry and discard the rest. In that case, my answer to "R - How to get row & column subscripts of matched elements from a distance matrix" helps; it also covers the basics of packed lower triangular matrices.

Min Max and Mean values from Distance Matrix in R

I made your data into a usable form with:

patient_data <- data.frame(
PatientId = c(1, 1, 1, 2, 2, 2),
cX = c(5348, 6360, 5398, 5348, 6360, 5398),
cY = c(4902, 4887, 4874, 4902, 4887, 4874)
)
  PatientId   cX   cY
1         1 5348 4902
2         1 6360 4887
3         1 5398 4874
4         2 5348 4902
5         2 6360 4887
6         2 5398 4874

Then you are looking for the dplyr::group_by and dplyr::group_modify functions. You can use dplyr::group_map to check the output from the different steps.

library(magrittr)

patient_data %>%
  dplyr::group_by(PatientId) %>%
  dplyr::group_modify(~ {
    distance_matrix <- .x %>% dist(diag = FALSE, upper = TRUE) %>% as.matrix() # get distance matrix
    diag(distance_matrix) <- NA # set diagonal values to NA
    data.frame( # get min/max/avg for each row of the distance matrix
      cell_id = seq(nrow(distance_matrix)),
      min_dist = apply(distance_matrix, MARGIN = 1, FUN = min, na.rm = TRUE),
      max_dist = apply(distance_matrix, MARGIN = 1, FUN = max, na.rm = TRUE),
      avg_dist = apply(distance_matrix, MARGIN = 1, FUN = mean, na.rm = TRUE)
    )
  })
# A tibble: 6 × 5
# Groups:   PatientId [2]
  PatientId cell_id min_dist max_dist avg_dist
      <dbl>   <int>    <dbl>    <dbl>    <dbl>
1         1       1     57.3    1012.     535.
2         1       2    962.     1012.     987.
3         1       3     57.3     962.     510.
4         2       1     57.3    1012.     535.
5         2       2    962.     1012.     987.
6         2       3     57.3     962.     510.
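If dplyr is not available, the same per-patient summary can be sketched in base R with split, lapply and rbind. The helper name row_stats is made up for this example, and the result is a plain data frame, not a tibble.

```r
patient_data <- data.frame(
  PatientId = c(1, 1, 1, 2, 2, 2),
  cX = c(5348, 6360, 5398, 5348, 6360, 5398),
  cY = c(4902, 4887, 4874, 4902, 4887, 4874)
)

# hypothetical helper: min/max/mean distance per cell within one patient
row_stats <- function (g) {
  m <- as.matrix(dist(g[, c("cX", "cY")]))   # distances within one patient
  diag(m) <- NA                              # ignore self-distances
  data.frame(
    PatientId = g$PatientId,
    cell_id   = seq_len(nrow(m)),
    min_dist  = apply(m, 1, min,  na.rm = TRUE),
    max_dist  = apply(m, 1, max,  na.rm = TRUE),
    avg_dist  = apply(m, 1, mean, na.rm = TRUE)
  )
}

groups <- split(patient_data, patient_data$PatientId)
result <- do.call(rbind, lapply(groups, row_stats))
rownames(result) <- NULL
result
```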

Memory-efficient method to create dist object from distance matrix

My current solution is to calculate the dist object directly from lat and lon vectors, without generating the intermediate distance matrix at all. On large matrices, this is several hundred times faster than the "conventional" pipeline of geosphere::distm() followed by stats::as.dist() and uses only as much memory as required to store the final dist object.

The following Rcpp source is based on using the functions from here to calculate haversine distance in C++, together with an adaptation of @Alexis's method to iterate through the lower triangle elements in C++.

#include <Rcpp.h>
using namespace Rcpp;

// Haversine distance in metres between two points given in radians
double distanceHaversine(double latf, double lonf, double latt, double lont, double tolerance){
  double d;
  double dlat = latt - latf;
  double dlon = lont - lonf;
  d = (sin(dlat * 0.5) * sin(dlat * 0.5)) + (cos(latf) * cos(latt)) * (sin(dlon * 0.5) * sin(dlon * 0.5));
  // clamp values that rounding pushed above 1, so sqrt(1 - d) stays real
  if(d > 1 && d <= tolerance){
    d = 1;
  }
  return 2 * atan2(sqrt(d), sqrt(1 - d)) * 6378137.0;
}

double toRadians(double deg){
  return deg * 0.01745329251; // PI / 180;
}

//-----------------------------------------------------------
// [[Rcpp::export]]
NumericVector calc_dist(Rcpp::NumericVector lat,
                        Rcpp::NumericVector lon,
                        double tolerance = 10000000000.0) {
  std::size_t nlat = lat.size();
  std::size_t nlon = lon.size();
  if (nlat != nlon) throw std::range_error("lat and lon different lengths");
  if (nlat < 2) throw std::range_error("Need at least 2 points");
  // number of entries in the strict lower triangle
  std::size_t size = nlat * (nlat - 1) / 2;
  NumericVector ans(size);
  std::size_t k = 0;
  double latf;
  double latt;
  double lonf;
  double lont;

  // fill the packed vector column by column, the order a "dist" object uses
  for (std::size_t j = 0; j < (nlat-1); j++) {
    for (std::size_t i = j + 1; i < nlat; i++) {
      latf = toRadians(lat[i]);
      lonf = toRadians(lon[i]);
      latt = toRadians(lat[j]);
      lont = toRadians(lon[j]);
      ans[k++] = distanceHaversine(latf, lonf, latt, lont, tolerance);
    }
  }

  return ans;
}

/*** R
# wrap the packed vector with the attributes a "dist" object needs
as_dist = function(lat, lon, tolerance = 10000000000.0) {
  dd = calc_dist(lat, lon, tolerance)
  attr(dd, "class") = "dist"
  attr(dd, "Size") = length(lat)
  attr(dd, "call") = match.call()
  attr(dd, "Diag") = FALSE
  attr(dd, "Upper") = FALSE
  return(dd)
}
*/
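For checking the compiled version on small inputs, a pure-R counterpart can be handy. The sketch below is an assumption-laden illustration, not the code above: as_dist_r and its radius argument are names made up here, it clamps with pmin in place of the tolerance test, and it builds index vectors of length n(n - 1)/2, so it uses more memory than the C++ loop.

```r
# hypothetical pure-R counterpart of as_dist(); same haversine formula and radius
as_dist_r <- function (lat, lon, radius = 6378137) {
  n <- length(lat)
  stopifnot(length(lon) == n, n >= 2)
  # (i, j) subscripts of the lower triangle in column-major (packed) order
  j <- rep.int(seq_len(n - 1), (n - 1):1)
  i <- j + sequence((n - 1):1)
  to_rad <- pi / 180
  dlat <- (lat[j] - lat[i]) * to_rad
  dlon <- (lon[j] - lon[i]) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat[i] * to_rad) * cos(lat[j] * to_rad) * sin(dlon / 2)^2
  a <- pmin(a, 1)   # guard against rounding slightly above 1
  d <- 2 * atan2(sqrt(a), sqrt(1 - a)) * radius
  structure(d, Size = n, Diag = FALSE, Upper = FALSE, class = "dist")
}

# two points on the equator, 90 degrees of longitude apart, plus the north pole;
# every pair is a quarter of a great circle apart
d <- as_dist_r(lat = c(0, 0, 90), lon = c(0, 90, 0))
d
```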

