Using grep to subset rows from a data.table, comparing row content
If you're happy using the stringi package, here is a way that takes advantage of the fact that the stringi functions vectorise over both pattern and string:
library(stringi)
DT[stri_detect_fixed(num, y), x := num]
Depending on the data, it may be faster than the method posted by Veerenda Gadekar.
library(data.table)
library(microbenchmark)
library(stringi)

DT <- data.table(num = paste0(sample(1000), sample(2001:2010, 1000, TRUE)),
                 y = as.character(sample(2001:2010, 1000, TRUE)))
microbenchmark(
  vg = DT[, x := grep(y, num, value = TRUE, fixed = TRUE), by = .(num, y)],
  nk = DT[stri_detect_fixed(num, y), x := num]
)
#Unit: microseconds
# expr min lq mean median uq max neval
# vg 6027.674 6176.397 6513.860 6278.689 6370.789 9590.398 100
# nk 975.260 1007.591 1116.594 1047.334 1110.734 3833.051 100
Using grep to help subset a data frame
It's pretty straightforward using [ to extract: grep will give you the positions at which it matched your search pattern (unless you use value = TRUE).
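The examples below assume a small data frame like the following (hypothetical; the original question's data is not shown):

```r
# Made-up data frame: column x holds codes, one of which starts with "G45"
My.Data <- data.frame(x = c("A123", "G459"), y = 1:2)
```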
grep("^G45", My.Data$x)
# [1] 2
Since you're searching within the values of a single column, that actually corresponds to the row index. So, use that with [ (where you would use My.Data[rows, cols] to get specific rows and columns).
My.Data[grep("^G45", My.Data$x), ]
# x y
# 2 G459 2
The help page for subset shows how you can use grep and grepl with subset if you prefer that function over [. Here's an example.
subset(My.Data, grepl("^G45", My.Data$x))
# x y
# 2 G459 2
As of R 3.3, there's now also the startsWith function, which you can again use with subset (or with any of the other approaches above). According to its help page, it's considerably faster than using substring or grepl.
subset(My.Data, startsWith(as.character(x), "G45"))
# x y
# 2 G459 2
Select data.table columns with grep-like partial matching
library(data.table)
dt <- data.table(CurrentAssets=rnorm(10),FixedAssets=rnorm(10), CurrentLiabilities=rnorm(10),Capital=rnorm(10))
dt
## CurrentAssets FixedAssets CurrentLiabilities Capital
## 1: -1.27610992 -0.2989316 0.20688252 0.6504636
## 2: 0.01065576 1.3088539 1.22533006 0.7550024
## 3: 0.53308022 -1.3459419 -0.99627142 -0.7589336
## 4: 0.30737237 -0.4291044 2.20328357 0.2157515
## 5: -1.37391990 0.8581097 -0.08161687 0.7067757
## 6: 0.28664468 0.2308479 0.38675487 -0.3467660
## 7: -0.22902454 1.3365470 0.10128697 0.3246363
## 8: 0.05159736 -2.0702850 0.78404464 -1.7612696
## 9: 0.51817847 -0.8365225 -0.04778573 0.6170114
##10: 0.50859575 0.5683021 -0.13780167 -0.9243434
Just some random columns. The accounts don't balance.
You can define the columns, then do ...
colnames <- c("CurrentAssets","FixedAssets", "CurrentLiabilities","Capital")
dt[, .SD, .SDcols = grep("Assets", colnames, value = TRUE)]
If you don't want to type colnames and value = TRUE all the time, you can build your own function like the following.
mygrep <- function(x){
colnames <- c("CurrentAssets","FixedAssets", "CurrentLiabilities","Capital")
grep(x,colnames,value=TRUE)
}
Now the drawback of mygrep is that you have to hard-code the column names inside it. An improvement would be to pass the data.table to the function.
mygrep <- function(x,dt){
colnames <- colnames(dt)
grep(x,colnames,value=TRUE)
}
dt[,.SD,.SDcols=mygrep("Assets",dt)]
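On recent data.table versions (1.12.0 or later), you can skip the helper entirely: .SDcols accepts patterns() for regex matching on column names. A minimal sketch:

```r
library(data.table)

dt <- data.table(CurrentAssets = rnorm(10), FixedAssets = rnorm(10),
                 CurrentLiabilities = rnorm(10), Capital = rnorm(10))

# patterns() matches column names by regular expression inside .SDcols
dt[, .SD, .SDcols = patterns("Assets")]
```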
Edit
Just found another way to do the same thing using a macro in R. You will need the gtools package to use macros. We define a macro subdt.
library(gtools)
subdt <- defmacro(dt,pattern,expr={
dt[,.SD,.SDcols=grep(pattern,colnames(dt),value=TRUE)]
})
then do
subdt(dt,"Assets")
Macros are powerful because they expand into code before evaluation.
Find a row in a data.table that is same as the header
Anti-join:
setkeyv(foo, names(foo)) # Reorders the data, though
foo[!list(names(foo))]
a b
1: 1 a
2: 1 b
3: 2 a
4: 2 b
Without setting keys:
nfoo <- names(foo)
foo[!setNames(as.list(nfoo), nfoo), on = nfoo]
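A reproducible sketch of the keyless version (the original foo is not shown; this assumes character columns where one row repeats the header):

```r
library(data.table)

# A table whose third row duplicates the column names "a" and "b"
foo <- data.table(a = c("1", "1", "a", "2", "2"),
                  b = c("a", "b", "b", "a", "b"))

# Anti-join against a one-row lookup built from the column names:
# drops exactly the row where a == "a" and b == "b"
nfoo <- names(foo)
foo[!setNames(as.list(nfoo), nfoo), on = nfoo]
```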
Looping grepl() through data.table (R)
Data.table is good at grouped operations; I think that's how it can help, assuming you have many rows with the same industry:
DT[ DT[, .I[grep(industry, category)], by = industry]$V1 ]
This uses the current idiom for subsetting by group, thanks to @eddi.
Comments. These might help further:
- If you have many rows with the same industry-category combo, try by = .(industry, category).
- Try something else in place of grep (like the options in Ken and Richard's answers).
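The grouped-subset idiom above can be sketched on toy data (the industry and category values here are invented):

```r
library(data.table)

DT <- data.table(industry = c("tech", "tech", "farm"),
                 category = c("biotech", "fintech", "finance"))

# Within each industry group, .I[grep(industry, category)] returns the
# global row numbers whose category matches the group's industry string;
# $V1 collects them into one vector used to subset DT
DT[DT[, .I[grep(industry, category)], by = industry]$V1]
```

Here only the two "tech" rows survive, since "farm" does not occur in "finance".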
Memory and Performance using grepl on large data.table
Use the stringi library; it's more performant.
stri_detect_fixed(Dt$title1, Dt$title2)
should be what you're looking for. Thanks to Frank, who actually found the exact data.table form:
Dt[, stri_detect_fixed(title1, title2)]
The functions with the _fixed suffix are faster than the _regex ones.
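A minimal reproducible sketch (the Dt here is made up for illustration; the question's actual table is not shown):

```r
library(data.table)
library(stringi)

Dt <- data.table(title1 = c("The Godfather Part II", "Alien"),
                 title2 = c("Godfather", "Predator"))

# TRUE where title2 occurs as a literal substring of title1
Dt[, stri_detect_fixed(title1, title2)]
# [1]  TRUE FALSE
```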
How to subset multiple columns from df including grep match
Assuming I understood what you would like to do, a possible solution that may not be useful and/or may be redundant:
my_selector <- function(df,partial_name,...){
positional_names <- match(...,names(df))
df[,c(positional_names,grep(partial_name,names(df)))]
}
my_selector(iris, partial_name = "Petal","Species")
A "simpler" option would be to use grep and the like to match the target names at once:
iris[grep("Spec.*|Peta.*", names(iris))]
Or more concisely, as suggested by @akrun, we can do:
iris[grep("(Spec|Peta).*", names(iris))]
For more columns, we could do something like:
my_selector(iris, partial_name = "Petal",c("Species","Sepal.Length"))
Species Sepal.Length Petal.Length Petal.Width
1 setosa 5.1 1.4 0.2
2 setosa 4.9 1.4 0.2
Note however that in the above function, the columns are selected counter-intuitively in that the names supplied last are selected first.
Result for the first part(truncated):
Species Petal.Length Petal.Width
1 setosa 1.4 0.2
2 setosa 1.4 0.2
3 setosa 1.3 0.2
4 setosa 1.5 0.2
5 setosa 1.4 0.2
6 setosa 1.7 0.4
7 setosa 1.4 0.3
Efficient way to subset data.table based on value in any of selected columns
One option is to specify the 'cols' of interest in .SDcols, loop through the Subset of Data.table (.SD), generate a list of logical vectors, Reduce it to a single logical vector with |, and use that to subset the rows:
i1 <- dt[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]
test2 <- dt[i1]
identical(test1, test2)
#[1] TRUE
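For reference, a self-contained sketch with invented data (the original dt, cols, and test1 are not shown in the excerpt):

```r
library(data.table)

dt <- data.table(a = c(10, 1, 3), b = c(5, 10, 2), c = c(7, 8, 9))
cols <- c("a", "b")

# TRUE for rows where any of the selected columns equals 10
i1 <- dt[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]
dt[i1]
```

Rows 1 and 2 are kept (a == 10 and b == 10 respectively); column c is never inspected.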
Selecting rows in data.table on the basis of a substring match to any of multiple columns
We can specify the columns to compare in .SDcols, loop through them with lapply, convert each to logical using %like%, check whether there is at least one TRUE per row using Reduce, and use that to subset the elements from 'DetailCol1'.
the_dt[the_dt[, Reduce(`|`, lapply(.SD, `%like%`, "ARP")),
.SDcols= DataCol1:DataCol3], DetailCol1]
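A sketch with invented data matching the column names used above:

```r
library(data.table)

the_dt <- data.table(DetailCol1 = c("id1", "id2", "id3"),
                     DataCol1 = c("ARP1", "x", "y"),
                     DataCol2 = c("a", "b", "c"),
                     DataCol3 = c("p", "q", "zARP"))

# %like% does a grepl() match on each column; Reduce(`|`, ...) keeps rows
# where at least one of DataCol1:DataCol3 contains "ARP"
the_dt[the_dt[, Reduce(`|`, lapply(.SD, `%like%`, "ARP")),
              .SDcols = DataCol1:DataCol3], DetailCol1]
```

Rows 1 and 3 match ("ARP1" in DataCol1, "zARP" in DataCol3), so "id1" and "id3" are returned.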
r data.table grep error with large file but not with example
We specify the columns of interest in .SDcols, loop through the Subset of Data.table (.SD) with lapply, and check for the string "RCP" with grepl to return a list of logical vectors, which is then Reduced to a single logical vector with | (OR):
i1 <- livestock[, Reduce("|", lapply(.SD, function(x)
grepl("RCP", x))), .SDcols = c("Abstract", "Author.Keywords")]
If the substring "RCP" needs to be in all the columns specified in .SDcols, then use & instead of | in Reduce:
i1 <- livestock[, Reduce("&", lapply(.SD, function(x)
grepl("RCP", x))), .SDcols = c("Abstract", "Author.Keywords")]
Use the logical vector in i to subset the rows and assign "RCP" to the RCP column:
livestock[i1, RCP := "RCP"]
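A hedged reconstruction with toy data (the real livestock table is not shown in the question):

```r
library(data.table)

livestock <- data.table(
  Abstract = c("RCP scenarios compared", "cattle grazing study", "RCP 8.5 model"),
  Author.Keywords = c("climate", "grazing", "RCP; emissions"))

# TRUE where "RCP" appears in either column
i1 <- livestock[, Reduce("|", lapply(.SD, function(x)
  grepl("RCP", x))), .SDcols = c("Abstract", "Author.Keywords")]

# Assign "RCP" only to the matching rows; row 2 stays NA
livestock[i1, RCP := "RCP"]
livestock$RCP
# [1] "RCP" NA    "RCP"
```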