What is the R equivalent of SQL's LIKE '%searched_word%'?
Given
schools <- data.frame(rank = 1:20,
                      name = rep(c("X Public School", "Y Private School"), 10))
try this:
subset(schools, grepl("Public School", name))
or this:
schools[ grep("Public School", schools$name), ]
or this:
library(sqldf)
sqldf("SELECT * FROM schools WHERE name LIKE '%Public School%'")
or this:
library(data.table)
data.table(schools)[ grep("Public School", name) ]
or this:
library(dplyr)
schools %>% filter(grepl("Public School", name))
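One subtlety worth noting: SQL's LIKE '%...%' does a literal substring match, whereas grepl() interprets its pattern as a regular expression. A minimal base-R sketch: if the search term may contain regex metacharacters, pass fixed = TRUE to match it literally.

```r
schools <- data.frame(rank = 1:20,
                      name = rep(c("X Public School", "Y Private School"), 10))

# fixed = TRUE matches the string literally, like SQL's LIKE,
# so characters such as "." or "(" are not treated as regex syntax
subset(schools, grepl("Public School", name, fixed = TRUE))
```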
Extracting part of a data frame with a wildcard
I would use grepl, like in the following example:
library(dplyr)

df <- data.frame(fruits = c("apple", "banana", "cherry"))
df %>%
  filter(grepl("app", fruits))
grepl uses regular expressions, and you can use them to check for patterns in your strings.
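Because grepl() takes regular expressions, anchors also work, not just bare substrings. A small base-R illustration on a plain character vector:

```r
fruits <- c("apple", "banana", "cherry")

grepl("an", fruits)   # substring anywhere:    FALSE TRUE  FALSE
grepl("^ch", fruits)  # anchored at the start: FALSE FALSE TRUE
grepl("y$", fruits)   # anchored at the end:   FALSE FALSE TRUE
```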
Pass SQL functions in dplyr filter function on database
A "dplyr-only" solution would be this:
tbl(my_con, "my_table") %>%
  filter(batch_name %like% "batch_A_%") %>%
  collect()
Full reprex:
suppressPackageStartupMessages({
  library(dplyr)
  library(dbplyr)
  library(RPostgreSQL)
})

my_con <- dbConnect(
  PostgreSQL(),
  user = "my_user",
  password = "my_password",
  host = "my_host",
  dbname = "my_db"
)

my_table <- tribble(
  ~batch_name, ~value,
  "batch_A_1", 1,
  "batch_A_2", 2,
  "batch_A_2", 3,
  "batch_B_1", 8,
  "batch_B_2", 9
)

copy_to(my_con, my_table)

tbl(my_con, "my_table") %>%
  filter(batch_name %like% "batch_A_%") %>%
  collect()
#> # A tibble: 3 x 2
#> batch_name value
#> * <chr> <dbl>
#> 1 batch_A_1 1
#> 2 batch_A_2 2
#> 3 batch_A_2 3
dbDisconnect(my_con)
#> [1] TRUE
This works because any function that dplyr doesn't know how to translate is passed along to the database as-is; see ?dbplyr::translate_sql.
Hat-tip to @PaulRougieux for his recent comment here.
Sparklyr Spark SQL Filter based on multiple wildcards
I can't use multiple rlike() statements separated with | (OR) because the real example includes about 200 values in f_params
That sounds like a rather artificial constraint, but if you really want to avoid a single regular expression you can always compose an explicit disjunction:
library(rlang)

sc_df %>%
  filter(!!rlang::parse_quo(glue::glue_collapse(
    glue::glue("(names %rlike% '{f_params}')"),
    " %or% "  # or " | "
  ), rlang::caller_env()))
# Source: spark<?> [?? x 2]
names place
<chr> <chr>
1 Brandon Pasadena
2 Brandi West Hollywood
3 Eric South Bay
4 Erin South Bay
If the f_params values are guaranteed to be valid regexp literals, it should be much faster to simply concatenate the strings:
sc_df %>%
  filter(names %rlike% glue::glue_collapse(glue::glue("{f_params}"), "|"))
# Source: spark<?> [?? x 2]
names place
<chr> <chr>
1 Brandon Pasadena
2 Brandi West Hollywood
3 Eric South Bay
4 Erin South Bay
If not, you can try to apply Hmisc::escapeRegex first:
sc_df %>%
  filter(
    names %rlike% glue::glue_collapse(glue::glue(
      "{Hmisc::escapeRegex(f_params)}"
    ), "|")
  )
but please keep in mind that Spark uses Java regular expressions, so this might not cover some edge cases.
Pattern matching using a wildcard
If you want to examine elements inside a data frame you should not be using ls(), which only looks at the names of objects in the current workspace (or, if used inside a function, in the current environment). Row names and elements inside such objects are not visible to ls() (unless, of course, you add an environment argument to the ls() call). Try using grep(), which is the workhorse function for pattern matching of character vectors:
result <- a[ grep("blue", a$x) , ] # Note need to use `a$` to get at the `x`
If you want to use subset(), then consider the closely related function grepl(), which returns a vector of logicals that can be used as the subset argument:
subset(a, grepl("blue", a$x))
x
2 blue1
3 blue2
Edit: Adding one "proper" use of glob2rx within subset():
result <- subset(a, grepl(glob2rx("blue*"), x))
result
x
2 blue1
3 blue2
I don't think I actually understood glob2rx until I came back to this question. (I did understand the scoping issues that were at the root of the questioner's difficulties. Anybody reading this should now scroll down to Gavin's answer and upvote it.)
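To see what glob2rx() is doing, it helps to print the regex it produces: the shell-style wildcard is translated into an anchored regular expression, which is then an ordinary pattern for grepl().

```r
# glob2rx() converts a shell-style wildcard into the equivalent regex
glob2rx("blue*")  # "^blue"

x <- c("red", "blue1", "blue2")
grepl(glob2rx("blue*"), x)  # FALSE TRUE TRUE
```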
Subset numeric data using wildcards?
You can use grepl() (and other regular-expression matching functions) with the regular expression 01$. The $ anchors the match to the end of the string.
myData[grepl("01$", myData$ID), ]
# ID someData
# 1 202001 10
# 4 203001 40
@thelatemail has one dplyr method in the comments, also using grepl():
filter(myData, grepl("01$", ID))
And speaking of ways to skin a cat:
filter(myData, substr(ID, 5, 7) == "01")
# ID someData
# 1 202001 10
# 2 203001 40
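If the suffix really is a fixed set of trailing characters, base R's endsWith() avoids regular expressions entirely. A sketch using a reconstructed myData: only the two "01" rows are visible in the output above, so the other rows here are assumptions for illustration.

```r
# reconstructed sample; only the "01" rows are known from the output above
myData <- data.frame(ID = c(202001, 202002, 203002, 203001),
                     someData = c(10, 20, 30, 40))

# endsWith() does a plain (non-regex) suffix comparison
myData[endsWith(as.character(myData$ID), "01"), ]
```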
r-dplyr equivalent of sql query returning monthly utilisation of contracts
You get lubridate errors when working with certain date-time formats. It works if you remove as.Date and %m+%:
df %>%
  filter(start < "2016-03-01" &
         start + months(duration) >= "2016-03-01")