Using Grep to Subset Rows from a Data.Table, Comparing Row Content

If you're happy to use the stringi package, this approach takes advantage of the fact that the stringi functions vectorise over both pattern and string:

DT[stri_detect_fixed(num, y), x := num]

Depending on the data, it may be faster than the method posted by Veerenda Gadekar.

library(data.table)
library(stringi)
library(microbenchmark)

DT <- data.table(num = paste0(sample(1000), sample(2001:2010, 1000, TRUE)),
                 y = as.character(sample(2001:2010, 1000, TRUE)))
microbenchmark(
  vg = DT[, x := grep(y, num, value = TRUE, fixed = TRUE), by = .(num, y)],
  nk = DT[stri_detect_fixed(num, y), x := num]
)

# Unit: microseconds
#  expr      min       lq     mean   median       uq      max neval
#    vg 6027.674 6176.397 6513.860 6278.689 6370.789 9590.398   100
#    nk  975.260 1007.591 1116.594 1047.334 1110.734 3833.051   100
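To make the element-wise behaviour concrete, here is a minimal sketch on a three-row table. The toy values are made up for illustration and are not the benchmark data above; it assumes the data.table and stringi packages are installed.

```r
library(data.table)
library(stringi)

# Hypothetical toy data: does each `num` contain the `y` from the same row?
DT <- data.table(num = c("12001", "22002", "32003"),
                 y   = c("2001",  "2005",  "2003"))

# stri_detect_fixed() pairs string i with pattern i, so no by= grouping is needed
hits <- stri_detect_fixed(DT$num, DT$y)

# Assign by reference on the matching rows only; the other rows get NA
DT[stri_detect_fixed(num, y), x := num]
print(DT)
```

Rows where `num` does not contain `y` are left as `NA` in `x` rather than dropped.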

Using grep to help subset a data frame

It's pretty straightforward using [ to extract.

grep gives you the positions at which your search pattern matched (unless you use value = TRUE, in which case it returns the matching values themselves).

grep("^G45", My.Data$x)
# [1] 2

Since you're searching within the values of a single column, that actually corresponds to the row index. So, use that with [ (where you would use My.Data[rows, cols] to get specific rows and columns).

My.Data[grep("^G45", My.Data$x), ]
#      x y
# 2 G459 2

The help-page for subset shows how you can use grep and grepl with subset if you prefer using this function over [. Here's an example.

subset(My.Data, grepl("^G45", My.Data$x))
#      x y
# 2 G459 2

As of R 3.3, there's now also the startsWith function, which you can again use with subset (or with any of the other approaches above). According to the help page for the function, it's considerably faster than using substring or grepl.

subset(My.Data, startsWith(as.character(x), "G45"))
#      x y
# 2 G459 2
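The approaches above can be compared side by side in a small self-contained sketch. The My.Data built here is a hypothetical stand-in (the original data frame is not shown), chosen so that only row 2 starts with "G45".

```r
# Hypothetical stand-in for My.Data; only the second value starts with "G45"
My.Data <- data.frame(x = c("G123", "G459", "H459"),
                      y = 1:3,
                      stringsAsFactors = FALSE)

idx <- grep("^G45", My.Data$x)    # integer positions of the matches
lgl <- grepl("^G45", My.Data$x)   # one TRUE/FALSE per row

by_index   <- My.Data[idx, ]                           # subset by position
by_logical <- My.Data[lgl, ]                           # subset by logical vector
by_prefix  <- My.Data[startsWith(My.Data$x, "G45"), ]  # no regex involved

print(by_index)
```

All three subsets select the same row. A logical vector from grepl or startsWith can also be combined with other conditions using & and |, which is harder to do with the integer positions from grep.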

Select data.table columns with grep-like partial matching

library(data.table)

dt <- data.table(CurrentAssets=rnorm(10),FixedAssets=rnorm(10), CurrentLiabilities=rnorm(10),Capital=rnorm(10))

dt

##     CurrentAssets FixedAssets CurrentLiabilities    Capital
##  1:   -1.27610992  -0.2989316         0.20688252  0.6504636
##  2:    0.01065576   1.3088539         1.22533006  0.7550024
##  3:    0.53308022  -1.3459419        -0.99627142 -0.7589336
##  4:    0.30737237  -0.4291044         2.20328357  0.2157515
##  5:   -1.37391990   0.8581097        -0.08161687  0.7067757
##  6:    0.28664468   0.2308479         0.38675487 -0.3467660
##  7:   -0.22902454   1.3365470         0.10128697  0.3246363
##  8:    0.05159736  -2.0702850         0.78404464 -1.7612696
##  9:    0.51817847  -0.8365225        -0.04778573  0.6170114
## 10:    0.50859575   0.5683021        -0.13780167 -0.9243434

Just some random columns. The accounts don't balance.
You can define the column names, then do:

colnames <- c("CurrentAssets","FixedAssets", "CurrentLiabilities","Capital")
dt[,.SD,.SDcols=grep("Assets",colnames,value =TRUE)]

If you don't want to type colnames and value=TRUE all the time you can build your own function like the following.

mygrep <- function(x) {
  colnames <- c("CurrentAssets", "FixedAssets", "CurrentLiabilities", "Capital")
  grep(x, colnames, value = TRUE)
}

The drawback of mygrep is that you need to type the column names manually. An improvement would be to pass the data.table to the function.

mygrep <- function(x, dt) {
  colnames <- colnames(dt)
  grep(x, colnames, value = TRUE)
}

dt[,.SD,.SDcols=mygrep("Assets",dt)]
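As a point of comparison, here is a short sketch that selects the same columns without a helper function. It assumes data.table 1.12.0 or later, where .SDcols accepts patterns(); the three-row table is a cut-down, made-up version of the one above.

```r
library(data.table)

set.seed(1)
dt <- data.table(CurrentAssets = rnorm(3), FixedAssets = rnorm(3),
                 CurrentLiabilities = rnorm(3), Capital = rnorm(3))

# grep on the table's own names, so nothing has to be typed by hand
via_grep <- dt[, .SD, .SDcols = grep("Assets", names(dt), value = TRUE)]

# Assumption: data.table >= 1.12.0, where patterns() is allowed in .SDcols
via_patterns <- dt[, .SD, .SDcols = patterns("Assets")]

print(names(via_grep))
```

Both calls should return only the CurrentAssets and FixedAssets columns.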

Edit
Just found another way to do the same thing using a macro in R. You will need the gtools package to use macros.

We define a macro subdt.

library(gtools)
subdt <- defmacro(dt, pattern, expr = {
  dt[, .SD, .SDcols = grep(pattern, colnames(dt), value = TRUE)]
})

Then do:

subdt(dt,"Assets")

Macros are powerful because they write the code before it is evaluated.

Find a row in a data.table that is same as the header

Anti-join:

setkeyv(foo, names(foo)) # Reorders the data, though
foo[!list(names(foo))]

   a b
1: 1 a
2: 1 b
3: 2 a
4: 2 b

Without setting keys:

nfoo <- names(foo)
foo[!setNames(as.list(nfoo), nfoo), on = nfoo]
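A minimal sketch of the keyless version on made-up data. The table foo here is hypothetical: its columns are character, and its second row repeats the header, which is the situation the anti-join is meant to clean up.

```r
library(data.table)

# Hypothetical table whose second row is a stray copy of the header
foo <- data.table(a = c("1", "a", "2"),
                  b = c("a", "b", "b"))

nfoo <- names(foo)

# Build a one-row lookup list(a = "a", b = "b") and anti-join on all columns
clean <- foo[!setNames(as.list(nfoo), nfoo), on = nfoo]
print(clean)
```

Because no key is set, the remaining rows keep their original order.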

Looping grepl() through data.table (R)

Data.table is good at grouped operations; I think that's how it can help, assuming you have many rows with the same industry:

DT[ DT[, .I[grep(industry, category)], by = industry]$V1 ]

This uses the current idiom for subsetting by group, thanks to @eddi.


Comments. These might help further:

  • If you have many rows with the same industry-category combo, try by=.(industry,category).

  • Try something else in the place of grep (like the options in Ken and Richard's answers).
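To show what the grouped idiom computes, here is a small sketch with made-up industry and category values. Within each group, industry is a single value, so it can be used directly as the grep pattern.

```r
library(data.table)

# Hypothetical data: keep rows whose category mentions the row's own industry
DT <- data.table(industry = c("steel", "steel", "oil", "oil"),
                 category = c("steel pipes", "oil rigs", "oil refining", "retail"))

# .I holds the row numbers of the current group; grep picks the matching ones
keep <- DT[, .I[grep(industry, category)], by = industry]$V1

res <- DT[keep]
print(res)
```

grep runs once per distinct industry rather than once per row, which is where the saving comes from when many rows share an industry.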

Memory and Performance using grepl on large data.table

Use the stringi library; it performs better.

stri_detect_fixed(Dt$title1, Dt$title2) should be what you're looking for.

Thanks to Frank, who found the exact data.table answer:

Dt[, stri_detect_fixed(title1, title2)]

The functions with suffix ..._fixed are faster than the _regex ones.
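Besides speed, the _fixed and _regex variants can also give different answers, because _fixed treats the pattern as literal text. A small sketch with made-up titles, assuming stringi and data.table are installed:

```r
library(data.table)
library(stringi)

# Hypothetical titles; the "." in title2 is meant literally
Dt <- data.table(title1 = c("a.b report", "axb report"),
                 title2 = c("a.b", "a.b"))

# Literal match: only the first title contains the text "a.b"
fixed_hit <- Dt[, stri_detect_fixed(title1, title2)]

# Regex match: "." matches any character, so "axb" matches as well
regex_hit <- Dt[, stri_detect_regex(title1, title2)]

print(data.table(Dt, fixed_hit, regex_hit))
```

If the second column holds plain text rather than regular expressions, the _fixed variant is usually the one you want for correctness as well as speed.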

How to subset multiple columns from df including grep match

Assuming I understood what you would like to do, a possible solution that may not be useful and/or may be redundant:

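my_selector <- function(df, partial_name, ...) {
  # Positions of the columns named exactly in ...
  positional_names <- match(c(...), names(df))
  # Exact matches first, then every column whose name matches partial_name
  df[, c(positional_names, grep(partial_name, names(df)))]
}
my_selector(iris, partial_name = "Petal", "Species")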

A "simpler" option would be to use grep and the like to match the target names at once:

iris[grep("Spec.*|Peta.*", names(iris))]

Or even simpler, as suggested by @akrun, we can do:

iris[grep("(Spec|Peta).*", names(iris))]

For more columns, we could do something like:

my_selector(iris, partial_name = "Petal", c("Species", "Sepal.Length"))
  Species Sepal.Length Petal.Length Petal.Width
1  setosa          5.1          1.4         0.2
2  setosa          4.9          1.4         0.2

Note, however, that the function returns the columns in a counter-intuitive order: the names supplied last in the call (through ...) come first in the result.

Result for the first part (truncated):

  Species Petal.Length Petal.Width
1  setosa          1.4         0.2
2  setosa          1.4         0.2
3  setosa          1.3         0.2
4  setosa          1.5         0.2
5  setosa          1.4         0.2
6  setosa          1.7         0.4
7  setosa          1.4         0.3

Efficient way to subset data.table based on value in any of selected columns

One option is to specify the columns of interest ('cols') in .SDcols, loop through the Subset of Data.table (.SD), generate a list of logical vectors, Reduce it to a single logical vector with | (OR), and use that to subset the rows.

i1 <- dt[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]
test2 <- dt[i1]
identical(test1, test2)
#[1] TRUE
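The snippet above relies on dt, cols and test1 from the question, which are not shown here. A self-contained sketch with made-up stand-ins for all three might look like this:

```r
library(data.table)

# Hypothetical data: keep rows where column a or column b equals 10
dt <- data.table(id = 1:4,
                 a  = c(10, 1, 2, 3),
                 b  = c(0, 10, 5, 6),
                 c  = c(10, 10, 10, 10))   # not in cols, so it is ignored
cols <- c("a", "b")

# One logical vector per selected column, collapsed row-wise with OR
i1 <- dt[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]
test2 <- dt[i1]

# Hand-written equivalent for comparison (stand-in for the question's test1)
test1 <- dt[a == 10 | b == 10]
print(test2)
```

The Reduce form scales to any number of columns in cols without the condition having to be written out by hand.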

Selecting rows in data.table on the basis of a substring match to any of multiple columns

We can specify the columns to compare in .SDcols, loop through them with lapply, convert each to a logical vector using %like%, check whether there is at least one TRUE in each row using Reduce, and use that to subset the elements of 'DetailCol1'.

the_dt[the_dt[, Reduce(`|`, lapply(.SD, `%like%`, "ARP")),
.SDcols= DataCol1:DataCol3], DetailCol1]
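Here is a runnable sketch of the same idea on made-up data. The column names follow the answer, the values are hypothetical, and the columns are listed by name in .SDcols instead of as a range.

```r
library(data.table)

# Hypothetical table: return DetailCol1 where any DataCol contains "ARP"
the_dt <- data.table(DetailCol1 = c("d1", "d2", "d3"),
                     DataCol1   = c("ARP x", "foo", "bar"),
                     DataCol2   = c("a", "b", "c"),
                     DataCol3   = c("x", "y ARP", "z"))

# %like% is data.table's infix wrapper around grepl(pattern, vector)
hit <- the_dt[, Reduce(`|`, lapply(.SD, `%like%`, "ARP")),
              .SDcols = c("DataCol1", "DataCol2", "DataCol3")]

details <- the_dt[hit, DetailCol1]
print(details)
```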

r data.table grep error with large file but not with example

We specify the columns of interest in .SDcols, loop through the Subset of Data.table (.SD) with lapply, and check for the string "RCP" with grepl. This returns a list of logical vectors, which is Reduced to a single logical vector with | (OR).

i1 <- livestock[, Reduce("|", lapply(.SD, function(x) 
grepl("RCP", x))), .SDcols = c("Abstract", "Author.Keywords")]

If the substring "RCP" needs to be in all the columns specified in .SDcols, use & instead of | in Reduce:

i1 <- livestock[, Reduce("&", lapply(.SD, function(x) 
grepl("RCP", x))), .SDcols = c("Abstract", "Author.Keywords")]

Use the logical vector in i to subset the rows and assign the "RCP" to RCP column

livestock[i1, RCP := "RCP"]
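The difference between the | and & versions is easiest to see on a few rows. The livestock table below is a made-up stand-in with the same two column names; the real data is not shown.

```r
library(data.table)

# Hypothetical stand-in for the livestock table
livestock <- data.table(
  Abstract        = c("RCP8.5 scenario", "grazing systems", "uses RCP", "pasture"),
  Author.Keywords = c("climate", "RCP4.5", "RCP", "soil")
)
cols <- c("Abstract", "Author.Keywords")

# TRUE where "RCP" appears in at least one of the columns
any_col <- livestock[, Reduce(`|`, lapply(.SD, function(x) grepl("RCP", x))),
                     .SDcols = cols]

# TRUE only where "RCP" appears in every column
all_col <- livestock[, Reduce(`&`, lapply(.SD, function(x) grepl("RCP", x))),
                     .SDcols = cols]

# Flag the "any" matches by reference; unmatched rows stay NA
livestock[any_col, RCP := "RCP"]
print(livestock)
```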

