In R, Match Function for Rows or Columns of Matrix

match works on lists of atomic vectors, so to match the rows of one matrix against the rows of another, you could do:

match(data.frame(t(x)), data.frame(t(y)))

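t transposes the matrix so that its rows become columns, and data.frame then turns those columns into a list with one element per original row. The result gives, for each row of x, the index of the first identical row of y, or NA if there is none.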
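As a minimal sketch of the idea, assuming two small made-up matrices (the values of x and y below are illustrative and not from the original question):

```r
# Hypothetical example data: each row is one unit to be matched
x <- rbind(c(1, 4), c(2, 5), c(3, 6))
y <- rbind(c(3, 6), c(1, 4), c(9, 9))

# After t() and data.frame(), each original row is one list element,
# so match() compares whole rows rather than single values
idx <- match(data.frame(t(x)), data.frame(t(y)))
idx  # row 1 of x is row 2 of y, row 2 has no match, row 3 is row 1 of y
```

Both matrices are built the same way here so that their rows compare as equal; mixing storage types between x and y is not covered by this sketch.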

How would I identify which columns and rows match between two data matrices?

Have a look at the %in% operator in R. Based on your question, you probably want something like this:

m1[,1] %in% m2[,1]
#[1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE

You can then combine it with functions such as sum or mean to get the number or the proportion of matches (multiply the proportion by 100 for a percentage):

sum(m1[,1] %in% m2[,1])
#[1] 5
mean(m1[,1] %in% m2[,1])
#[1] 0.625

EDIT: As requested by the OP in the comments, there are various ways to extract the entries that match (or do not match), my personal favourite being the which function:

m1[which(m1[,1] %in% m2[,1]),]
#[1] "Taxon1" "Taxon3" "Taxon4" "Taxon6" "Taxon7"
m1[which(!(m1[,1] %in% m2[,1])),]
#[1] "Taxon2" "Taxon5" "Taxon8"

Again, this is only one method out of many (I can count three right now), so I suggest you explore the other options.
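For a self-contained illustration, here is a sketch with made-up data. The matrices m1 and m2 below are assumptions, chosen only so that the counts agree with the output shown above; the real data from the question is not available.

```r
# Hypothetical one-column matrices of taxon names
m1 <- matrix(paste0("Taxon", 1:8), ncol = 1)
m2 <- matrix(paste0("Taxon", c(1, 3, 4, 6, 7, 9)), ncol = 1)

hit <- m1[, 1] %in% m2[, 1]  # TRUE where an entry of m1 also occurs in m2
n_hit <- sum(hit)            # number of matches
p_hit <- mean(hit)           # proportion of matches

shared <- m1[hit, ]          # entries of m1 that are found in m2
absent <- m1[!hit, ]         # entries of m1 that are missing from m2
```

Indexing directly with the logical vector is one of the alternatives to which; for a logical vector without NA values it selects the same rows.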

How would I match the rows of one matrix to the rows in another matrix, regardless of the column order?

We can sort the values within each row of both datasets:

x1 <- t(apply(X, 1, sort))
y1 <- t(apply(Y, 1, sort))

and then match the pasted rows of the two datasets, which returns, for each row of Y, the index of the matching row in X:

match(do.call(paste, as.data.frame(y1)), do.call(paste, as.data.frame(x1)))
#[1] 1 4 9
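A small runnable sketch of the same two steps, assuming made-up numeric matrices X and Y (the original data is not shown, so these values are purely illustrative):

```r
# Hypothetical data: the rows of Y are rows of X with shuffled column order
X <- rbind(c(3, 1, 2), c(4, 5, 6), c(9, 8, 7))
Y <- rbind(c(2, 3, 1), c(7, 9, 8))

# Sorting within each row removes the effect of column order
x1 <- t(apply(X, 1, sort))
y1 <- t(apply(Y, 1, sort))

# Collapse each sorted row into one string and look it up
idx <- match(do.call(paste, as.data.frame(y1)),
             do.call(paste, as.data.frame(x1)))
idx  # rows of X that correspond to rows 1 and 2 of Y
```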

Match rows between two matrices

The following keeps the rows of A that contain every value in the first row of B:

A[apply(A, 1, function(x) all(B[1,] %in% x)),]
#     [,1] [,2] [,3] [,4]
#[1,]  121  114  117  200
#[2,]  413  121  719  117
#[3,]  117  428  121  211
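To try this on its own, here is a sketch with small made-up matrices; the contents of A and B are assumptions for illustration only.

```r
# Hypothetical data: keep rows of A that contain both 117 and 121
A <- rbind(c(121, 114, 117, 200),
           c(1, 2, 3, 4),
           c(413, 121, 719, 117))
B <- rbind(c(117, 121),
           c(5, 6))

# For each row of A, check that every value of B[1, ] occurs in it
keep <- apply(A, 1, function(x) all(B[1, ] %in% x))
res <- A[keep, ]
res  # rows 1 and 3 of A
```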

Match list to rows of matrix in R

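Since there are only a few columns, we can loop over them and reduce the computation by exploiting two cases: a column with no non-zero values needs no check at all, and a column with more than one distinct non-zero value cannot match any row of b: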

ff = function(a, b)
{
    i = seq_len(nrow(b))  #starting candidate matches
    for(j in seq_len(ncol(a))) {
        aj = a[, j]
        nzaj = aj[aj != 0L]
        if(!length(nzaj)) next  #if all(a[, j] == 0) save some operations
        if(sum(tabulate(nzaj) > 0L) > 1L) return(integer())  #if a column has more than one distinct non-zero value, nothing can match
        i = i[b[i, j] == nzaj[[1L]]]  #update candidate matches
    }

    return(i)
}
lapply(a, function(x) ff(x, b))
#[[1]]
#integer(0)
#
#[[2]]
#[1] 3 4
#
#[[3]]
#[1] 6
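To make the behaviour concrete, here is a sketch with tiny made-up inputs. The function is repeated so the block runs on its own, and the objects a_list and b are assumptions for illustration, not the data from the question. Zeros in each pattern matrix act as wildcards.

```r
ff <- function(a, b) {
    i <- seq_len(nrow(b))                  # all rows of b start as candidates
    for (j in seq_len(ncol(a))) {
        nzaj <- a[a[, j] != 0L, j]         # non-zero entries of column j
        if (!length(nzaj)) next            # nothing to check in this column
        if (sum(tabulate(nzaj) > 0L) > 1L) return(integer())  # conflicting values
        i <- i[b[i, j] == nzaj[[1L]]]      # keep candidates that agree
    }
    i
}

# Hypothetical data
b <- rbind(c(1, 2, 3), c(1, 2, 1), c(2, 2, 3))
a_list <- list(
    rbind(c(1, 0, 0), c(0, 2, 0)),  # needs column 1 == 1 and column 2 == 2
    rbind(c(1, 0, 0), c(2, 0, 0))   # column 1 holds both 1 and 2: impossible
)

res <- lapply(a_list, function(x) ff(x, b))
res  # rows 1 and 2 of b for the first pattern, no rows for the second
```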

With data of your actual size:

set.seed(911)
a2 = replicate(300L, matrix(sample(0:3, 20 * 5, TRUE, c(0.97, 0.01, 0.01, 0.01)), 20, 5), simplify = FALSE)
b2 = matrix(sample(1:3, 15 * 5, TRUE), 15, 5)
identical(OP(a2, b2), lapply(a2, function(x) ff(x, b2)))
#[1] TRUE
microbenchmark::microbenchmark(OP(a2, b2), lapply(a2, function(x) ff(x, b2)), times = 50)
#Unit: milliseconds
#                              expr        min         lq       mean     median         uq       max neval cld
#                        OP(a2, b2) 686.961815 730.840732 760.029859 753.790094 785.310056 863.04577    50   b
# lapply(a2, function(x) ff(x, b2))   8.110542   8.450888   9.381802   8.949924   9.872826  15.51568    50  a

Here OP is the original poster's approach, wrapped in a function:

OP = function (a, b) 
{
    temp = Map(function(y) t(y), Map(function(a) apply(a, 1, 
        function(x) {
            apply(b, 1, function(y) identical(x[x != 0], y[x != 
                0]))
        }), a))
    lapply(temp, function(x) which(apply(x, 2, prod) == 1))
}

returning matrix column indices matching value(s) in R

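#linear index of the first occurrence of each value in mat, converted to (row, column)
res <- arrayInd(match(values, mat), .dim = dim(mat))
#drop the column index when the match is not in the row corresponding to the value's position
res[res[, 1] != seq_len(nrow(res)), 2] <- NA
res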
#      [,1] [,2]
# [1,]    1    2
# [2,]    2    1
# [3,]    3    3
# [4,]    2   NA
# [5,]    5    4
# [6,]    6    1
# [7,]    7   10
# [8,]    3   NA
# [9,]    9    1
#[10,]   10    1
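A minimal sketch of how the two lines behave, assuming a small made-up matrix mat and a vector values with one entry per row (both are illustrative, not the original data):

```r
# Hypothetical data: values[i] is looked for in row i of mat
mat <- rbind(c(5, 7, 9),
             c(8, 6, 4))
values <- c(6, 4)

# match() returns column-major linear indices; arrayInd() converts them
res <- arrayInd(match(values, mat), .dim = dim(mat))

# 6 is found in row 2, not row 1, so its column index is discarded
res[res[, 1] != seq_len(nrow(res)), 2] <- NA
res
```

Note that match only reports the first occurrence of each value in column-major order, so this sketch does not cover a value that appears both in its own row and earlier in another row.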

R match rowwise values with column names in multiple columns and get column value

Another option in base R is split-unsplit:

data$New_Col <- unsplit(Map(`[`, 
                            data[paste0("Name_", LETTERS[1:4])],
                            split(seq_len(nrow(data)), data$PartName)),
                        data$PartName)

It scales better than indexing the data frame with a matrix of the form cbind(i, j). The latter approach has significant overhead due to an intermediate coercion of the data frame to matrix, which involves a deep copy of all of the variables.

If you do go with split-unsplit, then make sure that PartName is a factor with suitable levels, as you need the second and third arguments of Map to correspond elementwise. In this case, it would be good practice to start with:

data$PartName <- factor(data$PartName, levels = LETTERS[1:4])
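As a small worked sketch of split-unsplit, assuming a made-up data frame with only two Name_ columns (the column names follow the pattern above, the values are illustrative):

```r
# Hypothetical data: PartName says which Name_ column to read in each row
data <- data.frame(Name_A = c(1, 2, 3),
                   Name_B = c(10, 20, 30),
                   PartName = c("B", "A", "B"))

# Factor levels must line up with the order of the Name_ columns
data$PartName <- factor(data$PartName, levels = c("A", "B"))

# Row indices per level, then pick those rows from the matching column
rows <- split(seq_len(nrow(data)), data$PartName)
picked <- Map(`[`, data[paste0("Name_", c("A", "B"))], rows)

# unsplit() puts the picked values back in the original row order
data$New_Col <- unsplit(picked, data$PartName)
data$New_Col  # Name_B for rows 1 and 3, Name_A for row 2
```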

For the curious:

set.seed(1L)
n <- 1e+06L
r <- 25L
x <- as.data.frame(replicate(r, rnorm(n), simplify = FALSE))
names(x) <- paste0("Name_", LETTERS[1:r])
x$PartName <- LETTERS[1:r][sample.int(r, n, TRUE)]

library("data.table")
setDTthreads(4L)
y <- as.data.table(x)

f1 <- function(x) {
    n <- length(x)
    i <- seq_len(nrow(x))
    j <- match(x$PartName, sub("^Name_", "", names(x)[-n]))
    x[-n][cbind(i, j)]
}
f2 <- function(x) {
    nms <- names(x)[-length(x)]
    g <- factor(x$PartName, levels = sub("^Name_", "", nms))
    unsplit(Map(`[`, x[nms], split(seq_len(nrow(x)), g)), g)
}
f3 <- function(x) {
    x[, New_Col := .SD[[paste0("Name_", .BY[[1L]])]], by = PartName]
}

bench::mark(f1(x), f2(x), f3(y), iterations = 100L, check = FALSE, filter_gc = FALSE)
## # A tibble: 3 × 13
##   expression      min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
##   <bch:expr> <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
## 1 f1(x)        86.1ms  92.3ms      10.9   225.1MB    6.95    100    64      9.21s <NULL> <Rprofmem> <bench_tm> <tibble>
## 2 f2(x)        43.4ms  45.8ms      21.2    61.1MB    3.60    100    17      4.73s <NULL> <Rprofmem> <bench_tm> <tibble>
## 3 f3(y)        77.9ms  79.7ms      12.4    21.1MB    0.247   100     2      8.08s <NULL> <Rprofmem> <bench_tm> <tibble>

Extracting rows and columns of a matrix if row names and column names have a partial match

An easier option is to reshape to 'long' format by converting the table to a data.frame with as.data.frame.table, and then subset the rows based on the values of 'Var1' and 'Var2':

out <- subset(as.data.frame.table(a), Var1 == sub("\\d+", "", Var2),
              select = c(Var2, Freq))
with(out, setNames(Freq, Var2))
#    aaa1       aaa2       aaa3       bbb1       bbb2       bbb3       ccc1       ccc2       ccc3 
#0.01495641 1.57504185 2.32762287 0.42652979 0.41329383 0.07119408 0.64530516 1.39629918 0.17042160

Or with row/column indexing:

i1 <- match(sub("\\d+", "", colnames(a)), rownames(a))
a[cbind(i1, seq_along(i1))]
#[1] 0.01495641 1.57504185 2.32762287 0.42652979 0.41329383 0.07119408 0.64530516 1.39629918 0.17042160
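A compact sketch of the indexing variant, assuming a made-up matrix whose column names are the row names followed by digits (the dimnames and values are illustrative):

```r
# Hypothetical matrix: rows "aaa", "bbb"; columns "aaa1", "aaa2", "bbb1"
a <- matrix(1:6, nrow = 2,
            dimnames = list(c("aaa", "bbb"), c("aaa1", "aaa2", "bbb1")))

# Strip the digits from each column name and find the matching row
i1 <- match(sub("\\d+", "", colnames(a)), rownames(a))

# A two-column index matrix picks one (row, column) cell per column
vals <- setNames(a[cbind(i1, seq_along(i1))], colnames(a))
vals
```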
