How to Select R Data.Table Rows Based on Substring Match (A La SQL Like)

data.table provides a like() function, which returns TRUE for every element that matches a pattern.

> Months[like(Name,"mb")]
        Name Number
1: September      9
2:  November     11
3:  December     12

Or use the infix form %like%, which reads more naturally:

> Months[Name %like% "mb"]
        Name Number
1: September      9
2:  November     11
3:  December     12

Note that %like% and like() use grepl (which returns a logical vector) rather than grep (which returns integer locations). That is so the result can be combined with other logical conditions:

> Months[Number<12 & Name %like% "mb"]
        Name Number
1: September      9
2:  November     11
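The difference between the two return types can be seen in a small sketch. The Months table is not defined in the answer above, so the construction from month.name below is an assumption made only for illustration.

```r
library(data.table)

# Assumed reconstruction of the Months table used above
Months <- data.table(Name = month.name, Number = 1:12)

idx  <- grep("mb", Months$Name)    # integer positions of the matching rows
flag <- grepl("mb", Months$Name)   # one TRUE/FALSE value per row

print(idx)
print(flag)

# A logical vector lines up row by row with other conditions,
# which is why it can be combined with `&`
res <- Months[Number < 12 & flag]
print(res)
```

Combining the integer result of grep with & would silently recycle it and coerce it to logical, so the per-row vector from grepl is the safer building block here.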

You also get the power of regular expression search (not just the % or * wildcards).
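As a short sketch of what regular expressions add over plain wildcards, the patterns below use anchors and alternation. The Months table is again an assumed reconstruction built from month.name.

```r
library(data.table)

# Assumed reconstruction of the Months table
Months <- data.table(Name = month.name, Number = 1:12)

# "^" anchors the match at the start of the string
starts_j <- Months[Name %like% "^J"]

# "$" anchors the match at the end of the string
ends_ber <- Months[Name %like% "ber$"]

# Alternation inside a group: names starting with "Ma" or "Au"
ma_or_au <- Months[Name %like% "^(Ma|Au)"]

print(starts_j)
print(ends_ber)
print(ma_or_au)
```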

Selecting rows in data.table on the basis of a substring match to any of multiple columns

We can specify the columns to compare in .SDcols, loop over them with lapply, turn each one into a logical vector with %like%, and combine those vectors with Reduce and | so that a row is TRUE when at least one column matches. That logical vector then subsets the rows, and we return 'DetailCol1' for the matches.

the_dt[the_dt[, Reduce(`|`, lapply(.SD, `%like%`, "ARP")),
              .SDcols = DataCol1:DataCol3], DetailCol1]
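To make the steps easier to follow, here is a minimal sketch on made-up data. The contents of the_dt are hypothetical (the original table is not shown), and the columns are named explicitly in .SDcols rather than as a range.

```r
library(data.table)

# Hypothetical data: only the column names come from the answer above
the_dt <- data.table(
  DetailCol1 = c("row1", "row2", "row3", "row4"),
  DataCol1   = c("ARP-1", "xyz",     "abc",  "def"),
  DataCol2   = c("foo",   "bar",     "xARPx", "baz"),
  DataCol3   = c("one",   "two",     "three", "four")
)

# One logical vector per column, then OR them together row by row
hit <- the_dt[, Reduce(`|`, lapply(.SD, `%like%`, "ARP")),
              .SDcols = c("DataCol1", "DataCol2", "DataCol3")]
print(hit)

# Use the combined vector to pick the matching rows
matched <- the_dt[hit, DetailCol1]
print(matched)
```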

R data.table select rows based on partial string match from character vector

I have a solution in mind using lapply and tstrsplit. There is probably a more elegant way, but it does the job:

lapply(1:nrow(dt), function(i) {
dt[i,'match' := any(trimws(tstrsplit(as.character(dt[i,'sha']),";")) %in% pselection)]
})

dt[(match)]
          title                sha match
1:  First title              12345  TRUE
2: Second Title 2345; 66543; 33423  TRUE
3:  Third Title   22222; 12345678;  TRUE

The idea is to split each row of the sha column on ";" (trimming whitespace, otherwise row 3 will not match) and check whether any of the resulting values appears in pselection.
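The same split, trim, and check idea can also be written without a row-by-row loop. This is only a sketch of one possible alternative, not the answer's own method: the fourth row and the contents of pselection are invented for illustration, since the original values are not shown.

```r
library(data.table)

# First three rows follow the output above; the fourth row is hypothetical
dt <- data.table(
  title = c("First title", "Second Title", "Third Title", "Fourth Title"),
  sha   = c("12345", "2345; 66543; 33423", "22222; 12345678;", "999; 888")
)

# Hypothetical selection vector
pselection <- c("12345", "66543", "12345678")

# strsplit() is vectorised over rows and returns one character vector per row;
# vapply() then reduces each vector to a single TRUE/FALSE
dt[, match := vapply(
  strsplit(sha, ";", fixed = TRUE),
  function(parts) any(trimws(parts) %in% pselection),
  logical(1)
)]

print(dt[(match)])
```

Because %in% compares whole values, "12345" does not match "12345678" by accident; row 3 matches only because "12345678" itself is in the assumed pselection.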

Subset a data.table by a vector of substrings

We can use grep after pasting the vector into a single pattern, collapsing with |.

X[grep(paste(Vec, collapse="|"), H)]

Or we can pass the same collapsed pattern to like (as suggested by @Tensibal):

X[like(H, pattern = paste(Vec, collapse="|"))]
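One thing to keep in mind with the collapsed pattern is that each element of the vector is interpreted as a regular expression. The sketch below uses made-up data for X and Vec to show the effect, and fixed_any is a hypothetical helper (not part of data.table) that matches each element literally instead.

```r
library(data.table)

# Hypothetical data: the values contain regex metacharacters on purpose
X   <- data.table(H = c("a.b", "axb", "c+d", "ccd"), id = 1:4)
Vec <- c("a.b", "c+d")

# Collapsed pattern "a.b|c+d": "." and "+" act as regex operators
regex_hits <- X[grep(paste(Vec, collapse = "|"), H)]
print(regex_hits)

# Hypothetical helper: literal (fixed) match against any element of `patterns`
fixed_any <- function(x, patterns) {
  Reduce(`|`, lapply(patterns, function(p) grepl(p, x, fixed = TRUE)))
}

literal_hits <- X[fixed_any(H, Vec)]
print(literal_hits)
```

When the vector holds plain alphanumeric substrings the two approaches select the same rows, and the collapsed pattern is the shorter option.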

Using grep to subset rows from a data.table, comparing row content

If you're happy using the stringi package, this is a way that takes advantage of the fact that the stringi functions vectorise both pattern and string:

DT[stri_detect_fixed(num, y), x := num]
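A small sketch of what "vectorise both pattern and string" means in practice: each element of num is checked against the element of y in the same row. The three rows below are made up for illustration.

```r
library(data.table)
library(stringi)

# Hypothetical data: does `num` contain the year in `y` on the same row?
DT <- data.table(
  num = c("12001", "22005", "32003"),
  y   = c("2001",  "2002",  "2003")
)

# Element-wise comparison: num[i] is searched for y[i]
pairwise <- stri_detect_fixed(DT$num, DT$y)
print(pairwise)

# Rows where the pattern is found get x := num; the others stay NA
DT[stri_detect_fixed(num, y), x := num]
print(DT)
```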

Depending on the data, it may be faster than the method posted by Veerenda Gadekar.

library(data.table)
library(stringi)
library(microbenchmark)

DT <- data.table(num = paste0(sample(1000), sample(2001:2010, 1000, TRUE)),
                 y = as.character(sample(2001:2010, 1000, TRUE)))
microbenchmark(
  vg = DT[, x := grep(y, num, value=TRUE, fixed=TRUE), by = .(num, y)],
  nk = DT[stri_detect_fixed(num, y), x := num]
)

#Unit: microseconds
# expr      min       lq     mean   median       uq      max neval
#   vg 6027.674 6176.397 6513.860 6278.689 6370.789 9590.398   100
#   nk  975.260 1007.591 1116.594 1047.334 1110.734 3833.051   100

