In R, Match Function for Rows or Columns of Matrix

match will work on lists of atomic vectors. So to match rows of one matrix to another, you could do:

match(data.frame(t(x)), data.frame(t(y)))

t transposes the rows into columns, then data.frame creates a list of the columns in the transposed matrix.
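
As a minimal sketch of the idea, assuming two small made-up matrices x and y (not taken from the original question), each row of x is looked up among the rows of y. Since match converts list elements to character strings before comparing them, it seems safest for both matrices to have the same storage type (here both are double).

```r
# Hypothetical example data; both matrices are of type double.
x <- rbind(c(1, 4), c(2, 5), c(3, 6))
y <- rbind(c(3, 6), c(1, 4), c(9, 9))

# Transpose so that rows become columns, then turn the columns into a list.
row_idx <- match(data.frame(t(x)), data.frame(t(y)))

# row_idx[i] is the row of y equal to row i of x, or NA when there is none.
row_idx
#[1]  2 NA  1
```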

How would I identify which columns and rows match between two data matrices?

You might want to have a look at the %in% operator in R. According to your question, you might want something like this:

m1[,1] %in% m2[,1]
#[1] TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE

You can then pair it with sum, which counts the matches, or mean, which gives the proportion of matches (multiply by 100 for the percentage required):

sum(m1[,1] %in% m2[,1])
#[1] 5
mean(m1[,1] %in% m2[,1])
#[1] 0.625
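
For a version that can be run as is, here is a sketch with made-up matrices. The names m1 and m2 follow the answer, but the contents (taxon names in the first column, an arbitrary count in the second) are assumptions chosen for illustration.

```r
# Hypothetical data: first column holds taxon names, second a made-up count.
m1 <- cbind(taxon = paste0("Taxon", 1:8), count = 8:1)
m2 <- cbind(taxon = paste0("Taxon", c(1, 3, 4, 6, 7, 9)), count = 1:6)

# TRUE where the taxon in m1 also appears in the first column of m2.
in_m2 <- m1[, 1] %in% m2[, 1]

n_matches <- sum(in_m2)           # number of matching taxa
pct_matches <- 100 * mean(in_m2)  # percentage of m1 taxa found in m2
```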

EDIT: As requested by the OP in the comments of this post, to list the entries that match (or do not match), there are various methods, my personal favourite being the which function:

m1[which(m1[,1] %in% m2[,1]),]
#[1] "Taxon1" "Taxon3" "Taxon4" "Taxon6" "Taxon7"
m1[which(!(m1[,1] %in% m2[,1])),]
#[1] "Taxon2" "Taxon5" "Taxon8"

Again, this is only one method out of many (I can count 3 right now...), so I suggest you explore the other options...

How would I match the rows of one matrix to the rows in another matrix, regardless of the column order?

We can sort the values within each row of both datasets, so that column order no longer matters

x1 <- t(apply(X, 1, sort))
y1 <- t(apply(Y, 1, sort))

and then match the pasted rows of each dataset, which returns, for each row of Y, the index of the matching row in X

match(do.call(paste, as.data.frame(y1)), do.call(paste, as.data.frame(x1)))
#[1] 1 4 9
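
A self-contained sketch of the same two steps, assuming small made-up matrices X and Y (the values are hypothetical and the rows contain no NA):

```r
# Hypothetical data: the rows of Y are permutations of some rows of X.
X <- rbind(c(3, 1, 2), c(4, 5, 6), c(9, 7, 8))
Y <- rbind(c(8, 9, 7), c(1, 2, 3), c(0, 0, 0))

# Sort within each row; apply() returns the sorted rows as columns, hence t().
x1 <- t(apply(X, 1, sort))
y1 <- t(apply(Y, 1, sort))

# Collapse each sorted row into one string and look up Y's rows in X.
idx <- match(do.call(paste, as.data.frame(y1)),
             do.call(paste, as.data.frame(x1)))
idx
#[1]  3  1 NA
```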

Match rows between two matrices

A[apply(A, 1, function(x) all(B[1,] %in% x)),]   
# [,1] [,2] [,3] [,4]
#[1,] 121 114 117 200
#[2,] 413 121 719 117
#[3,] 117 428 121 211
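
The one-liner above keeps every row of A that contains all the values in the first row of B, in any position. A minimal sketch with made-up matrices (A and B here are hypothetical, not the data from the question):

```r
# Hypothetical data: rows 1 and 3 of A contain both 117 and 121.
A <- rbind(c(121, 114, 117, 200),
           c(  5,   6,   7,   8),
           c(413, 121, 719, 117))
B <- rbind(c(117, 121))

# TRUE for each row of A that contains every value of B[1, ].
keep <- apply(A, 1, function(x) all(B[1, ] %in% x))

# drop = FALSE keeps the result a matrix even when only one row matches.
A[keep, , drop = FALSE]
```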

Match list to rows of matrix in R

With only a few columns, we can reduce computations by exploiting two cases: a column of a with no non-zero values can be skipped, and a column with more than one unique non-zero value means that no row of b can match:

ff = function(a, b)
{
    i = seq_len(nrow(b)) # starting candidate matches: every row of b
    for(j in seq_len(ncol(a))) {
        aj = a[, j]
        nzaj = aj[aj != 0L]
        if(!length(nzaj)) next # if all(a[, j] == 0) save some operations
        if(sum(tabulate(nzaj) > 0L) > 1L) return(integer()) # > 1 distinct non-zero value in a column: no row of b can match
        i = i[b[i, j] == nzaj[[1L]]] # update candidate matches
    }

    return(i)
}
lapply(a, function(x) ff(x, b))
#[[1]]
#integer(0)
#
#[[2]]
#[1] 3 4
#
#[[3]]
#[1] 6
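
To trace the logic on something small, here is a sketch with made-up inputs. The function ff is repeated from above so that the snippet runs on its own; the matrices b, a_ok and a_clash are hypothetical.

```r
# Copy of ff from above, so the example is self-contained.
ff <- function(a, b) {
  i <- seq_len(nrow(b))                              # all rows of b are candidates
  for (j in seq_len(ncol(a))) {
    nzaj <- a[a[, j] != 0L, j]                       # non-zero values in column j
    if (!length(nzaj)) next                          # nothing to check here
    if (sum(tabulate(nzaj) > 0L) > 1L) return(integer())  # conflicting values
    i <- i[b[i, j] == nzaj[[1L]]]                    # keep rows that agree
  }
  i
}

b <- rbind(c(1L, 2L, 3L), c(1L, 3L, 3L), c(2L, 2L, 1L))

# Column 1 requires a 1 and column 3 requires a 3: rows 1 and 2 of b qualify.
a_ok <- rbind(c(1L, 0L, 0L), c(0L, 0L, 3L))
res_ok <- ff(a_ok, b)

# Column 1 requires both a 1 and a 2, which no single row of b can satisfy.
a_clash <- rbind(c(1L, 0L, 0L), c(2L, 0L, 0L))
res_clash <- ff(a_clash, b)
```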

With data of your actual size:

set.seed(911)
a2 = replicate(300L, matrix(sample(0:3, 20 * 5, TRUE, c(0.97, 0.01, 0.01, 0.01)), 20, 5), simplify = FALSE)
b2 = matrix(sample(1:3, 15 * 5, TRUE), 15, 5)
identical(OP(a2, b2), lapply(a2, function(x) ff(x, b2)))
#[1] TRUE
microbenchmark::microbenchmark(OP(a2, b2), lapply(a2, function(x) ff(x, b2)), times = 50)
#Unit: milliseconds
#                              expr        min         lq       mean     median         uq       max neval cld
#                        OP(a2, b2) 686.961815 730.840732 760.029859 753.790094 785.310056 863.04577    50   b
# lapply(a2, function(x) ff(x, b2))   8.110542   8.450888   9.381802   8.949924   9.872826  15.51568    50  a

OP is:

OP = function (a, b) 
{
    temp = Map(function(y) t(y), Map(function(a) apply(a, 1,
        function(x) {
            apply(b, 1, function(y) identical(x[x != 0], y[x != 0]))
        }), a))
    lapply(temp, function(x) which(apply(x, 2, prod) == 1))
}

Returning matrix column indices matching value(s) in R

res <- arrayInd(match(values, mat), .dim = dim(mat))
res[res[, 1] != seq_len(nrow(res)), 2] <- NA
# [,1] [,2]
# [1,] 1 2
# [2,] 2 1
# [3,] 3 3
# [4,] 2 NA
# [5,] 5 4
# [6,] 6 1
# [7,] 7 10
# [8,] 3 NA
# [9,] 9 1
#[10,] 10 1
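
The two lines above can be read as follows: match finds the linear index of the first occurrence of each value in mat, arrayInd converts that index into a (row, column) pair, and the column is set to NA whenever the hit is not in the row with the same position as the value. A small sketch with made-up inputs (mat and values are hypothetical; values has one entry per row of mat). Because match only reports the first occurrence in column-major order, this sketch assumes a value does not also occur earlier in a different row.

```r
# Hypothetical data: values[i] is searched for in row i of mat.
mat <- rbind(c(5, 1, 7),
             c(2, 8, 9),
             c(4, 6, 3))
values <- c(1, 4, 3)

# Linear index of the first occurrence, converted to (row, column) pairs.
res <- arrayInd(match(values, mat), .dim = dim(mat))

# Discard hits that fall in a row other than the one being queried.
res[res[, 1] != seq_len(nrow(res)), 2] <- NA

# Column index per value: 1 is in column 2 of row 1, 4 is not in row 2,
# and 3 is in column 3 of row 3.
col_idx <- res[, 2]
```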

R match rowwise values with column names in multiple columns and get column value

Another option in base R is split-unsplit:

data$New_Col <- unsplit(Map(`[`, 
                            data[paste0("Name_", LETTERS[1:4])],
                            split(seq_len(nrow(data)), data$PartName)),
                        data$PartName)

It scales better than indexing the data frame with a matrix of the form cbind(i, j). The latter approach has significant overhead due to an intermediate coercion of the data frame to matrix, which involves a deep copy of all of the variables.

If you do go with split-unsplit, then make sure that PartName is a factor with suitable levels, as you need the second and third arguments of Map to correspond elementwise. In this case, it would be good practice to start with:

data$PartName <- factor(data$PartName, levels = LETTERS[1:4])
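
To make the mechanics concrete, here is a small sketch on a made-up data frame. The values are hypothetical; the column names follow the Name_A to Name_D pattern used above.

```r
# Hypothetical data: PartName says which Name_* column to read in each row.
data <- data.frame(
  Name_A = c(1, 2, 3, 4),
  Name_B = c(10, 20, 30, 40),
  Name_C = c(100, 200, 300, 400),
  Name_D = c(1000, 2000, 3000, 4000),
  PartName = c("B", "A", "D", "B")
)
# Levels in the same order as the Name_* columns, including the unused "C".
data$PartName <- factor(data$PartName, levels = LETTERS[1:4])

cols <- data[paste0("Name_", LETTERS[1:4])]        # one column per level
rows <- split(seq_len(nrow(data)), data$PartName)  # row indices per level

# Take from each column only the rows belonging to its level, then reassemble.
data$New_Col <- unsplit(Map(`[`, cols, rows), data$PartName)
data$New_Col
#[1]   10    2 3000   40
```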

For the curious:

set.seed(1L)
n <- 1e+06L
r <- 25L
x <- as.data.frame(replicate(r, rnorm(n), simplify = FALSE))
names(x) <- paste0("Name_", LETTERS[1:r])
x$PartName <- LETTERS[1:r][sample.int(r, n, TRUE)]

library("data.table")
setDTthreads(4L)
y <- as.data.table(x)

f1 <- function(x) {
    n <- length(x)
    i <- seq_len(nrow(x))
    j <- match(x$PartName, sub("^Name_", "", names(x)[-n]))
    x[-n][cbind(i, j)]
}
f2 <- function(x) {
    nms <- names(x)[-length(x)]
    g <- factor(x$PartName, levels = sub("^Name_", "", nms))
    unsplit(Map(`[`, x[nms], split(seq_len(nrow(x)), g)), g)
}
f3 <- function(x) {
    x[, New_Col := .SD[[paste0("Name_", .BY[[1L]])]], by = PartName]
}

bench::mark(f1(x), f2(x), f3(y), iterations = 100L, check = FALSE, filter_gc = FALSE)
## # A tibble: 3 × 13
##   expression      min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc
##   <bch:expr> <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>
## 1 f1(x)        86.1ms  92.3ms      10.9   225.1MB    6.95    100    64      9.21s <NULL> <Rprofmem> <bench_tm> <tibble>
## 2 f2(x)        43.4ms  45.8ms      21.2    61.1MB    3.60    100    17      4.73s <NULL> <Rprofmem> <bench_tm> <tibble>
## 3 f3(y)        77.9ms  79.7ms      12.4    21.1MB    0.247   100     2      8.08s <NULL> <Rprofmem> <bench_tm> <tibble>

Extracting rows and columns of a matrix if row names and column names have a partial match

An easier option is to reshape to 'long' format by converting the matrix to a data.frame with as.data.frame.table, and then subset the rows where 'Var1' (the row name) equals 'Var2' (the column name) with its digits removed:

out <- subset(as.data.frame.table(a), Var1 == sub("\\d+", "", Var2),
              select = c(Var2, Freq))
with(out, setNames(Freq, Var2))
      aaa1       aaa2       aaa3       bbb1       bbb2       bbb3       ccc1       ccc2       ccc3
0.01495641 1.57504185 2.32762287 0.42652979 0.41329383 0.07119408 0.64530516 1.39629918 0.17042160

Or with row/column indexing

i1 <- match( sub("\\d+", "", colnames(a)), rownames(a))
a[cbind(i1, seq_along(i1))]
[1] 0.01495641 1.57504185 2.32762287 0.42652979 0.41329383 0.07119408 0.64530516 1.39629918 0.17042160
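
A runnable sketch of the indexing variant, assuming a small made-up matrix a whose column names are a row name followed by digits (the dimnames and values are hypothetical):

```r
# Hypothetical data: rows "aaa" and "bbb", columns named <row name><digit>.
a <- matrix(1:6, nrow = 2,
            dimnames = list(c("aaa", "bbb"), c("aaa1", "aaa2", "bbb1")))

# For each column, find the row whose name equals the column name minus digits.
i1 <- match(sub("\\d+", "", colnames(a)), rownames(a))

# A two-column index matrix picks one (row, column) cell per column.
picked <- a[cbind(i1, seq_along(i1))]
setNames(picked, colnames(a))
```

Here picked holds a["aaa", "aaa1"], a["aaa", "aaa2"] and a["bbb", "bbb1"], in that order.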

