dplyr filter() with SQL-like %wildcard%

What is the R equivalent of SQL's LIKE '%searched_word%'?

Given

schools <- data.frame(rank = 1:20,
                      name = rep(c("X Public School", "Y Private School"), 10))

try this:

subset(schools, grepl("Public School", name))

or this:

schools[ grep("Public School", schools$name), ]

or this:

library(sqldf)
sqldf("SELECT * FROM schools WHERE name LIKE '%Public School%'")

or this:

library(data.table)
data.table(schools)[ grep("Public School", name) ]

or this:

library(dplyr)
schools %>% filter(grepl("Public School", name))
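All of these treat the search string as a regular expression. That is safe here because "Public School" contains no regex metacharacters; if the search string might contain characters such as `.` or `(`, pass `fixed = TRUE` for a literal match, which is the closest analogue of SQL's `LIKE '%...%'`. A small sketch, reusing the data above:

```r
# Literal (non-regex) matching: fixed = TRUE treats the pattern as a
# plain substring, so '.' and '(' are not regex metacharacters.
schools <- data.frame(rank = 1:20,
                      name = rep(c("X Public School", "Y Private School"), 10))

hits <- schools[grepl("Public School", schools$name, fixed = TRUE), ]
nrow(hits)  # 10 of the 20 rows match
```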

extracting part of a data frame with a wild card

I would use grepl, like in the following example:

df <- data.frame(fruits = c("apple", "banana", "cherry"))
df %>%
  filter(grepl("app", fruits))

grepl uses regular expressions, and you can use them to check for patterns in your strings.
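A few common patterns, continuing the fruit example, show how anchors and quantifiers behave:

```r
fruits <- c("apple", "banana", "cherry", "pineapple")

grepl("app", fruits)     # substring anywhere:    TRUE FALSE FALSE  TRUE
grepl("^app", fruits)    # anchored at the start: TRUE FALSE FALSE FALSE
grepl("apple$", fruits)  # anchored at the end:   TRUE FALSE FALSE  TRUE
grepl("an+a", fruits)    # 'a', one or more 'n', then 'a': only "banana"
```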

Pass SQL functions in dplyr filter function on database



A "dplyr-only" solution would be this

tbl(my_con, "my_table") %>%
  filter(batch_name %like% "batch_A_%") %>%
  collect()

Full reprex:

suppressPackageStartupMessages({
  library(dplyr)
  library(dbplyr)
  library(RPostgreSQL)
})

my_con <- dbConnect(
  PostgreSQL(),
  user = "my_user",
  password = "my_password",
  host = "my_host",
  dbname = "my_db"
)

my_table <- tribble(
  ~batch_name, ~value,
  "batch_A_1", 1,
  "batch_A_2", 2,
  "batch_A_2", 3,
  "batch_B_1", 8,
  "batch_B_2", 9
)

copy_to(my_con, my_table)

tbl(my_con, "my_table") %>%
  filter(batch_name %like% "batch_A_%") %>%
  collect()
#> # A tibble: 3 x 2
#> batch_name value
#> * <chr> <dbl>
#> 1 batch_A_1 1
#> 2 batch_A_2 2
#> 3 batch_A_2 3

dbDisconnect(my_con)
#> [1] TRUE

This works because any function that dplyr doesn't know how to
translate is passed along to the database as-is; see
?dbplyr::translate_sql.

Hat-tip to @PaulRougieux for his recent comment.
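`%like%` only works on a database backend, where dbplyr passes it through as SQL `LIKE`. For a local data frame you can get the same behaviour by translating the `LIKE` pattern to a regex; `like_to_regex()` below is a hypothetical helper for illustration, not part of dplyr, and assumes the pattern contains no regex metacharacters other than `%` and `_`:

```r
# Hypothetical helper: translate a SQL LIKE pattern into an R regex.
#   %  (any run of characters)  -> .*
#   _  (any single character)   -> .
# LIKE matches the whole value, so anchor with ^ and $.
like_to_regex <- function(pattern) {
  rx <- gsub("%", ".*", pattern, fixed = TRUE)
  rx <- gsub("_", ".",  rx,      fixed = TRUE)
  paste0("^", rx, "$")
}

like_to_regex("batch_A_%")
# "^batch.A..*$"

grepl(like_to_regex("batch_A_%"), c("batch_A_1", "batch_B_1"))
# TRUE FALSE
```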

Sparklyr Spark SQL Filter based on multiple wildcards

From the question: "I can't use multiple rlike() statements separated with | (OR) because the real example includes about 200 values in f_params."

That sounds like a rather artificial constraint, but if you really want to avoid a single regular expression, you can always compose an explicit disjunction:

library(rlang)

sc_df %>%
  filter(!!rlang::parse_quo(glue::glue_collapse(glue::glue(
    "(names %rlike% '{f_params}')"),
    " %or% "  # or " | "
  ), rlang::caller_env()))
# Source: spark<?> [?? x 2]
names place
<chr> <chr>
1 Brandon Pasadena
2 Brandi West Hollywood
3 Eric South Bay
4 Erin South Bay

If f_params are guaranteed to be valid regexp literals it should be much faster to simply concatenate the string:

sc_df %>%
  filter(names %rlike% glue::glue_collapse(glue::glue("{f_params}"), "|"))
# Source: spark<?> [?? x 2]
names place
<chr> <chr>
1 Brandon Pasadena
2 Brandi West Hollywood
3 Eric South Bay
4 Erin South Bay

If not, you can try applying Hmisc::escapeRegex first:

sc_df %>%
  filter(
    names %rlike% glue::glue_collapse(glue::glue(
      "{Hmisc::escapeRegex(f_params)}"
    ), "|")
  )

but please keep in mind that Spark uses Java regular expressions, so the escaping might not cover some edge cases.
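If Hmisc is not available, the same escaping idea can be sketched in base R. The `escape_regex()` helper below is hypothetical, assuming the standard POSIX/ERE metacharacter set:

```r
# Hypothetical base-R stand-in for Hmisc::escapeRegex: backslash-escape
# every regex metacharacter so each value matches only itself.
escape_regex <- function(x) gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", x)

f_params <- c("Brand", "Er(i)n")
pattern <- paste(escape_regex(f_params), collapse = "|")
cat(pattern, "\n")
# Brand|Er\(i\)n
```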

Pattern matching using a wildcard

If you want to examine elements inside a data frame, you should not be using ls(), which only looks at the names of objects in the current workspace (or, when called inside a function, in the current environment). Row names or elements inside such objects are not visible to ls() (unless, of course, you add an environment argument to the ls() call). Try grep(), the workhorse function for pattern matching in character vectors:

result <- a[ grep("blue", a$x) , ]  # Note need to use `a$` to get at the `x`

If you want to use subset, then consider the closely related function grepl(), which returns a vector of logicals that can be used in the subset argument:

subset(a, grepl("blue", a$x))
#       x
# 2 blue1
# 3 blue2

Edit: Adding one "proper" use of glob2rx within subset():

result <- subset(a, grepl(glob2rx("blue*"), x))
result
#       x
# 2 blue1
# 3 blue2

I don't think I actually understood glob2rx until I came back to this question. (I did understand the scoping issues that were at the root of the questioner's difficulties. Anybody reading this should now scroll down to Gavin's answer and upvote it.)
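For reference, glob2rx() (in the utils package, loaded by default) simply rewrites shell-style wildcards as regular expressions, which you can then hand to grep()/grepl():

```r
# glob2rx() converts shell-style wildcards to the equivalent regex:
# * becomes .* and ? becomes . , with the pattern anchored by ^ and $
glob2rx("blue*")   # "^blue"  (the trailing ".*$" is trimmed by default)
glob2rx("blue?")   # "^blue.$"

a <- data.frame(x = c("red1", "blue1", "blue2"))
subset(a, grepl(glob2rx("blue*"), x))
```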

Subset numeric data using wildcards?

You can use grepl() (and other regular-expression matching functions) with the regular expression 01$. The $ anchors the match to the end of the string, so 01$ matches only IDs that end in "01".

myData[grepl("01$", myData$ID), ]
#       ID someData
# 1 202001       10
# 4 203001       40

@thelatemail has one dplyr method in the comments, also using grepl().

filter(myData, grepl("01$", ID))

And speaking of ways to skin a cat (substr(ID, 5, 6) extracts exactly the "01" from a six-digit ID):

filter(myData, substr(ID, 5, 6) == "01")
#       ID someData
# 1 202001       10
# 2 203001       40
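The same anchoring works for prefixes with ^. Reconstructing data like the question's (the exact values here are assumptions; grepl() coerces the numeric ID column to character before matching):

```r
# Reconstructed example data for illustration.
myData <- data.frame(ID = c(202001, 202002, 203001, 203002),
                     someData = c(10, 20, 40, 50))

myData[grepl("^202", myData$ID), ]  # IDs starting with 202
myData[grepl("01$", myData$ID), ]   # IDs ending in 01
```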

r-dplyr equivalent of sql query returning monthly utilisation of contracts

You get lubridate errors when working with certain date-time formats. It works if you remove as.Date() and %m+%:

df %>%
  filter(start < "2016-03-01" &
         start + months(duration) >= "2016-03-01")
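A base-R sketch of the same "contract active during the month" test, without lubridate. The data and the `add_months()` helper below are assumptions for illustration, not from the question:

```r
# Advance a Date by whole months using seq.Date(), avoiding lubridate.
add_months <- function(dates, n) {
  out <- vapply(seq_along(dates), function(i)
    as.numeric(seq(dates[i], by = "month", length.out = n[i] + 1)[n[i] + 1]),
    numeric(1))
  as.Date(out, origin = "1970-01-01")
}

df <- data.frame(start    = as.Date(c("2016-01-15", "2016-03-10")),
                 duration = c(3, 1))  # duration in months

# A contract counts for March 2016 if it started before the month began
# and its end date falls on or after the month's first day.
active <- df$start < as.Date("2016-03-01") &
          add_months(df$start, df$duration) >= as.Date("2016-03-01")
df[active, ]  # only the first contract qualifies
```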

