In R, match function for rows or columns of matrix
match() will work on lists of atomic vectors. So to match rows of one matrix to another, you could do:
match(data.frame(t(x)), data.frame(t(y)))
t() transposes the rows into columns, then data.frame() creates a list of the columns in the transposed matrix.
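As a minimal sketch of the idea above, here are two small made-up matrices (x and y are illustrative, not data from the question). Each row of x is looked up among the rows of y, and NA marks a row with no counterpart. The sketch assumes both matrices have the same storage type (both double here), which seems the safest setup because the comparison works on a character representation of each column.

```r
# Hypothetical example data: rows of x are looked up among rows of y
x <- matrix(c(1, 2,
              3, 4,
              5, 6), nrow = 3, byrow = TRUE)
y <- matrix(c(5, 6,
              1, 2,
              9, 9), nrow = 3, byrow = TRUE)

# Transpose so each original row becomes a column,
# then let data.frame() turn those columns into list elements
row_idx <- match(data.frame(t(x)), data.frame(t(y)))
row_idx
```

With this data, row 1 of x is found at row 2 of y, row 2 has no match, and row 3 is found at row 1.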
How would I identify which columns and rows match between two data matrices?
You might want to have a look at the %in% operator in R. According to your question, you might want something like this:
m1[,1] %in% m2[,1]
#[1] TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE
You can then pair it with functions such as sum() or mean(), which give the number and the proportion of matches respectively (multiply the proportion by 100 for the percentage):
sum(m1[,1] %in% m2[,1])
#[1] 5
mean(m1[,1] %in% m2[,1])
#[1] 0.625
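For readers without the original data, the following sketch uses hypothetical one-column matrices m1 and m2, chosen here only so that five of the eight entries are shared. The names n_shared and pct_shared are illustrative.

```r
# Hypothetical one-column matrices of taxon names (not the OP's actual data)
m1 <- matrix(paste0("Taxon", 1:8), ncol = 1)
m2 <- matrix(paste0("Taxon", c(1, 3, 4, 6, 7, 9)), ncol = 1)

in_m2 <- m1[, 1] %in% m2[, 1]    # logical vector, one entry per row of m1
n_shared <- sum(in_m2)           # TRUE counts as 1, so this is the number of matches
pct_shared <- 100 * mean(in_m2)  # proportion of matches, scaled to a percentage
```

Note that the result describes m1 relative to m2; swapping the two sides of %in% answers the question from m2's point of view.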
EDIT: As requested by the OP in the comments of this post, there are various methods for extracting the matching and non-matching entries, my personal favourite being the which() function:
m1[which(m1[,1] %in% m2[,1]),]
#[1] "Taxon1" "Taxon3" "Taxon4" "Taxon6" "Taxon7"
m1[which(!(m1[,1] %in% m2[,1])),]
#[1] "Taxon2" "Taxon5" "Taxon8"
Again, this is only one method out of many (I can count 3 right now...), so I suggest you explore the other options...
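Two base R alternatives, offered as a sketch and not necessarily the ones the answer had in mind, are the set functions intersect() and setdiff(), and plain logical indexing without which(). The data below is hypothetical.

```r
# Hypothetical one-column matrices of taxon names
m1 <- matrix(paste0("Taxon", 1:8), ncol = 1)
m2 <- matrix(paste0("Taxon", c(1, 3, 4, 6, 7, 9)), ncol = 1)

shared  <- intersect(m1[, 1], m2[, 1])  # values present in both matrices
only_m1 <- setdiff(m1[, 1], m2[, 1])    # values in m1 that are absent from m2

# Logical indexing; drop = FALSE keeps the result a one-column matrix
m1_shared <- m1[m1[, 1] %in% m2[, 1], , drop = FALSE]
```

intersect() and setdiff() discard duplicated values, so the indexing form is the better fit when repeated rows must be preserved.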
R: How would I match the rows of one matrix to the rows in another matrix, regardless of the column order?
We can sort() each dataset by row
x1 <- t(apply(X, 1, sort))
y1 <- t(apply(Y, 1, sort))
and then do a match() on the pasted rows of each dataset to return the row index of each match
match(do.call(paste, as.data.frame(y1)), do.call(paste, as.data.frame(x1)))
#[1] 1 4 9
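A self-contained sketch of the same two steps, using small hypothetical matrices X and Y (the data behind the output above is not shown in the answer):

```r
# Hypothetical data: Y holds rows of X with their columns shuffled
X <- matrix(c(1, 2, 3,
              4, 5, 6,
              7, 8, 9), nrow = 3, byrow = TRUE)
Y <- matrix(c(3, 1, 2,
              9, 7, 8), nrow = 2, byrow = TRUE)

# apply() returns the sorted rows as columns, hence the t()
x1 <- t(apply(X, 1, sort))
y1 <- t(apply(Y, 1, sort))

# Collapse each sorted row into one string and look up Y's rows in X
key_x <- do.call(paste, as.data.frame(x1))
key_y <- do.call(paste, as.data.frame(y1))
idx <- match(key_y, key_x)
idx
```

Here the first row of Y corresponds to row 1 of X and the second to row 3. Sorting makes the comparison independent of column order; pasting with the default space separator keeps keys such as "1 23" and "12 3" distinct.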
Match rows between two matrices
A[apply(A, 1, function(x) all(B[1,] %in% x)),]
# [,1] [,2] [,3] [,4]
#[1,] 121 114 117 200
#[2,] 413 121 719 117
#[3,] 117 428 121 211
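The one-liner above keeps every row of A that contains all the values in the first row of B, in any position. A minimal sketch with hypothetical A and B (the original inputs are not shown):

```r
# Hypothetical data: keep rows of A that contain both 121 and 117
A <- matrix(c(121, 114, 117,
              300, 121, 500,
              117, 428, 121), nrow = 3, byrow = TRUE)
B <- matrix(c(121, 117), nrow = 1)

# For each row of A, check that every value of B[1, ] occurs somewhere in it
keep <- apply(A, 1, function(x) all(B[1, ] %in% x))
matched <- A[keep, , drop = FALSE]  # drop = FALSE guards the single-match case
matched
```

Rows 1 and 3 are kept; row 2 contains 121 but not 117.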
Match list to rows of matrix in R
With only a few columns, we can reduce the computations by taking advantage of columns that have more than one unique non-zero value (no row can match) or no non-zero values (nothing to check):
ff = function(a, b)
{
    i = seq_len(nrow(b)) #starting candidate matches
    for(j in seq_len(ncol(a))) {
        aj = a[, j]
        nzaj = aj[aj != 0L]
        if(!length(nzaj)) next #if all(a[, j] == 0) save some operations
        if(sum(tabulate(nzaj) > 0L) > 1L) return(integer()) #more than one distinct non-zero value in a column: no row of b can match, so stop looping
        i = i[b[i, j] == nzaj[[1L]]] #update candidate matches
    }
    return(i)
}
lapply(a, function(x) ff(x, b))
#[[1]]
#integer(0)
#
#[[2]]
#[1] 3 4
#
#[[3]]
#[1] 6
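The inputs behind the output above are not shown, so here is a small hypothetical pair (a list a of integer matrices and a matrix b) that exercises the three paths through ff. The function is restated only so that the sketch runs on its own.

```r
# ff as defined above, repeated so this sketch is self-contained
ff <- function(a, b) {
    i <- seq_len(nrow(b))
    for (j in seq_len(ncol(a))) {
        nzaj <- a[a[, j] != 0L, j]
        if (!length(nzaj)) next
        if (sum(tabulate(nzaj) > 0L) > 1L) return(integer())
        i <- i[b[i, j] == nzaj[[1L]]]
    }
    i
}

# Hypothetical data; zeros in a act as "don't care" entries
b <- matrix(c(1L, 2L,
              1L, 3L,
              2L, 2L), nrow = 3, byrow = TRUE)
a <- list(
    matrix(c(1L, 0L, 0L, 2L), nrow = 2, byrow = TRUE),  # needs b[, 1] == 1 and b[, 2] == 2
    matrix(c(1L, 0L, 2L, 0L), nrow = 2, byrow = TRUE),  # column 1 holds both 1 and 2
    matrix(0L, nrow = 2, ncol = 2)                      # all zeros: no constraint
)
res <- lapply(a, function(x) ff(x, b))
res
```

The first matrix matches only row 1 of b, the second can match nothing because one column demands two different values, and the third matches every row. Since tabulate() counts positive integers only, the sketch assumes non-negative integer entries.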
With data of your actual size:
set.seed(911)
a2 = replicate(300L, matrix(sample(0:3, 20 * 5, TRUE, c(0.97, 0.01, 0.01, 0.01)), 20, 5), simplify = FALSE)
b2 = matrix(sample(1:3, 15 * 5, TRUE), 15, 5)
identical(OP(a2, b2), lapply(a2, function(x) ff(x, b2)))
#[1] TRUE
microbenchmark::microbenchmark(OP(a2, b2), lapply(a2, function(x) ff(x, b2)), times = 50)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# OP(a2, b2) 686.961815 730.840732 760.029859 753.790094 785.310056 863.04577 50 b
# lapply(a2, function(x) ff(x, b2)) 8.110542 8.450888 9.381802 8.949924 9.872826 15.51568 50 a
OP is:
OP = function (a, b)
{
    temp = Map(function(y) t(y), Map(function(a) apply(a, 1,
        function(x) {
            apply(b, 1, function(y) identical(x[x != 0], y[x != 0]))
        }), a))
    lapply(temp, function(x) which(apply(x, 2, prod) == 1))
}
Returning matrix column indices matching value(s) in R
res <- arrayInd(match(values, mat), .dim = dim(mat))
res[res[, 1] != seq_len(nrow(res)), 2] <- NA
res
# [,1] [,2]
# [1,] 1 2
# [2,] 2 1
# [3,] 3 3
# [4,] 2 NA
# [5,] 5 4
# [6,] 6 1
# [7,] 7 10
# [8,] 3 NA
# [9,] 9 1
#[10,] 10 1
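To make the two steps concrete, here is a sketch with a hypothetical matrix mat and a vector values holding one value per row. match() returns the linear (column-major) position of the first occurrence of each value, arrayInd() converts that position to a row/column pair, and the second line blanks the column whenever the hit is not in the row the value belongs to.

```r
# Hypothetical data: values[i] is the value to locate in row i of mat
mat <- matrix(c(10, 20, 30,
                40, 50, 60,
                70, 80, 90), nrow = 3, byrow = TRUE)
values <- c(20, 40, 10)

# Linear index of the first occurrence, converted to (row, column) pairs
res <- arrayInd(match(values, mat), .dim = dim(mat))

# Keep the column only if the value was found in its own row
res[res[, 1] != seq_len(nrow(res)), 2] <- NA
res
```

The third value, 10, sits in row 1 and not row 3, so its column becomes NA. Because match() reports only the first occurrence in the whole matrix, a value that appears in an earlier position elsewhere and again in its own row would also come back as NA with this approach.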
R match rowwise values with column names in multiple columns and get column value
Another option in base R is split-unsplit:
data$New_Col <- unsplit(Map(`[`,
data[paste0("Name_", LETTERS[1:4])],
split(seq_len(nrow(data)), data$PartName)),
data$PartName)
It scales better than indexing the data frame with a matrix of the form cbind(i, j). The latter approach has significant overhead due to an intermediate coercion of the data frame to matrix, which involves a deep copy of all of the variables.
If you do go with split-unsplit, then make sure that PartName is a factor with suitable levels, as you need the second and third arguments of Map to correspond elementwise. In this case, it would be good practice to start with:
data$PartName <- factor(data$PartName, levels = LETTERS[1:4])
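A small worked sketch of split-unsplit, using a hypothetical two-column version of data so the intermediate pieces are easy to inspect (the names cols, rows and picked are illustrative):

```r
# Hypothetical data: PartName says which Name_* column to read in each row
data <- data.frame(Name_A = c(1, 2, 3, 4),
                   Name_B = c(10, 20, 30, 40),
                   PartName = c("B", "A", "A", "B"))
data$PartName <- factor(data$PartName, levels = c("A", "B"))

cols <- data[paste0("Name_", levels(data$PartName))]  # columns, in level order
rows <- split(seq_len(nrow(data)), data$PartName)     # row numbers per level
picked <- Map(`[`, cols, rows)                        # Name_A[rows$A], Name_B[rows$B]

# unsplit() puts the picked values back in the original row order
data$New_Col <- unsplit(picked, data$PartName)
data
```

Rows 2 and 3 take their value from Name_A and rows 1 and 4 from Name_B. The pairing in Map is positional, which is why the factor levels must line up with the column order.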
For the curious:
set.seed(1L)
n <- 1e+06L
r <- 25L
x <- as.data.frame(replicate(r, rnorm(n), simplify = FALSE))
names(x) <- paste0("Name_", LETTERS[1:r])
x$PartName <- LETTERS[1:r][sample.int(r, n, TRUE)]
library("data.table")
setDTthreads(4L)
y <- as.data.table(x)
f1 <- function(x) {
    n <- length(x)
    i <- seq_len(nrow(x))
    j <- match(x$PartName, sub("^Name_", "", names(x)[-n]))
    x[-n][cbind(i, j)]
}
f2 <- function(x) {
    nms <- names(x)[-length(x)]
    g <- factor(x$PartName, levels = sub("^Name_", "", nms))
    unsplit(Map(`[`, x[nms], split(seq_len(nrow(x)), g)), g)
}
f3 <- function(x) {
    x[, New_Col := .SD[[paste0("Name_", .BY[[1L]])]], by = PartName]
}
bench::mark(f1(x), f2(x), f3(y), iterations = 100L, check = FALSE, filter_gc = FALSE)
## # A tibble: 3 × 13
## expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
## <bch:expr> <bch:tm> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
## 1 f1(x) 86.1ms 92.3ms 10.9 225.1MB 6.95 100 64 9.21s <NULL> <Rprofmem> <bench_tm> <tibble>
## 2 f2(x) 43.4ms 45.8ms 21.2 61.1MB 3.60 100 17 4.73s <NULL> <Rprofmem> <bench_tm> <tibble>
## 3 f3(y) 77.9ms 79.7ms 12.4 21.1MB 0.247 100 2 8.08s <NULL> <Rprofmem> <bench_tm> <tibble>
Extracting rows and columns of a matrix if row names and column names have a partial match
An easier option is to reshape to 'long' by converting to data.frame from table, and then subset the rows based on the values of 'Var1' and 'Var2'
out <- subset(as.data.frame.table(a), Var1 == sub("\\d+", "", Var2),
select =c(Var2, Freq))
with(out, setNames(Freq, Var2))
aaa1 aaa2 aaa3 bbb1 bbb2 bbb3 ccc1 ccc2 ccc3
0.01495641 1.57504185 2.32762287 0.42652979 0.41329383 0.07119408 0.64530516 1.39629918 0.17042160
Or with row/column indexing
i1 <- match( sub("\\d+", "", colnames(a)), rownames(a))
a[cbind(i1, seq_along(i1))]
[1] 0.01495641 1.57504185 2.32762287 0.42652979 0.41329383 0.07119408 0.64530516 1.39629918 0.17042160
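The row/column indexing variant can be traced on a tiny hypothetical matrix a whose column names are a row name followed by digits:

```r
# Hypothetical data: column "aaa1" belongs to row "aaa", "bbb1" to row "bbb"
a <- matrix(1:6, nrow = 2,
            dimnames = list(c("aaa", "bbb"), c("aaa1", "aaa2", "bbb1")))

# Strip the digits from each column name and find the row with that name
i1 <- match(sub("\\d+", "", colnames(a)), rownames(a))

# A two-column index matrix picks one (row, column) cell per column
out <- setNames(a[cbind(i1, seq_along(i1))], colnames(a))
out
```

Each column contributes exactly one cell: a["aaa", "aaa1"], a["aaa", "aaa2"] and a["bbb", "bbb1"]. A column whose prefix matches no row name yields NA in i1, and the indexed result is then NA for that column.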