Pass SQL functions in dplyr filter function on database
A "dplyr-only" solution would be this:
tbl(my_con, "my_table") %>%
filter(batch_name %like% "batch_A_%") %>%
collect()
Full reprex:
suppressPackageStartupMessages({
library(dplyr)
library(dbplyr)
library(RPostgreSQL)
})
my_con <-
dbConnect(
PostgreSQL(),
user = "my_user",
password = "my_password",
host = "my_host",
dbname = "my_db"
)
my_table <- tribble(
~batch_name, ~value,
"batch_A_1", 1,
"batch_A_2", 2,
"batch_A_2", 3,
"batch_B_1", 8,
"batch_B_2", 9
)
copy_to(my_con, my_table)
tbl(my_con, "my_table") %>%
filter(batch_name %like% "batch_A_%") %>%
collect()
#> # A tibble: 3 x 2
#> batch_name value
#> * <chr> <dbl>
#> 1 batch_A_1 1
#> 2 batch_A_2 2
#> 3 batch_A_2 3
dbDisconnect(my_con)
#> [1] TRUE
This works because any function that dplyr doesn't know how to translate is passed along to the database as is; see ?dbplyr::translate_sql.
Hat-tip to @PaulRougieux for his recent comment.
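To inspect the SQL such a filter produces without connecting to a real server, one option is to render the query against a simulated connection. The sketch below assumes dbplyr's lazy_frame() and simulate_postgres() helpers; the column layout mirrors my_table, and the exact quoting in the generated SQL may differ between dbplyr versions.

```r
library(dplyr)
library(dbplyr)

# lazy_frame() builds a table backed by a simulated connection,
# so no real database is needed to look at the generated SQL.
lf <- lazy_frame(
  batch_name = character(),
  value = numeric(),
  con = simulate_postgres()
)

# %like% has no R definition here; dbplyr turns it into a SQL infix operator.
q <- lf %>% filter(batch_name %like% "batch_A_%")

# Render the SQL text that would be sent to the database.
sql_text <- as.character(sql_render(q))
cat(sql_text, "\n")
```

The rendered WHERE clause should contain a LIKE comparison against 'batch_A_%', which is the part PostgreSQL evaluates on the server.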
How to pass database query to strings using dplyr filter function
collect() returns a data frame (a tibble), which is a table and cannot be implicitly converted into a character vector. Instead of as.character(), you can pipe the result into readr's write_csv("query_result.csv") to save the table to a file, or into pull(col1) %>% as.character() to get the column named col1 as a character vector.
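A minimal sketch of the pull() route, using a small local tibble as a stand-in for whatever collect() returned. The object name query_result and its contents are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the result of collect().
query_result <- tibble(
  col1 = c(10, 20, 30),
  col2 = c("a", "b", "c")
)

# pull() extracts one column as a plain vector,
# which as.character() can then convert element by element.
col1_chr <- query_result %>%
  pull(col1) %>%
  as.character()

print(col1_chr)
```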
In dplyr how does the sql builder work?
Maybe the thing that is confusing you is that the dplyr functions tbl and filter don't actually send the query to the database for execution. When you run
tbl(con, "table1") %>% filter(col1 > 12)
what is returned is a tbl_dbi object that contains a SQL query. When you run this line of code interactively in R, the returned tbl_dbi object is then passed to the print function. For the tbl_dbi to be printed, the query must be executed in the database. You can see this by saving the output to a variable.
q <- tbl(con, "table1") %>% filter(col1 > 12)
class(q)
In the above two lines the query was not executed in the database. The tbl function returned a tbl_dbi object and filter modified that tbl_dbi object. Finally, the result was saved to the variable q.
When we print q, the SQL is sent to the database. So the tbl function does not need to know about any other dplyr functions that are called after it (like filter in this case). It behaves the same no matter what: it always returns a tbl_dbi object.
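The lazy behaviour can be checked against a throwaway in-memory SQLite database. This is only a sketch: it assumes RSQLite and dbplyr are installed, and the table1 contents are invented for the example.

```r
library(dplyr)

# Throwaway in-memory database with a tiny example table.
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
copy_to(con, data.frame(col1 = c(5, 20, 30)), "table1")

# Building the query does not fetch any rows yet.
q <- tbl(con, "table1") %>% filter(col1 > 12)
is_lazy <- inherits(q, "tbl_dbi")

# show_query() prints the SQL stored in the object.
show_query(q)

# collect() runs the query and brings the rows into R.
result <- collect(q)

DBI::dbDisconnect(con)
```

Here q only describes the query; the rows arrive in R when collect() (or print) is called.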
Now how dbplyr builds up more complex queries from simpler ones is beyond me.
Here is some code that implements your example.
library(dplyr)
shoppingList <- function(x){
stopifnot(is.character(x))
class(x) <- c("first", "shoppingList", class(x))
x
}
item <- function(x, y){
if("first" %in% class(x)){
out <- paste(x, y)
} else {
out <- paste0(x, " and ", y)
}
class(out) <- c("shoppingList", class(out))
out
}
print.shoppingList <- function(x, ...){
# code that only runs when we print an object of class shoppingList
out <- if("first" %in% class(x)) paste(x, "nothing") else x
print(paste0("***", out, "***"))
# print methods conventionally return their argument invisibly
invisible(x)
}
shoppingList("I need to get")
#> [1] "***I need to get nothing***"
shoppingList("I need to get") %>% item("apples") %>% item("oranges")
#> [1] "***I need to get apples and oranges***"
But how does print know to send SQL to the database? My (oversimplified) conceptual answer is that print is a generic function that behaves differently depending on the class of the object passed in. There are actually many print functions. In the example above I created a special print function for objects of class shoppingList. You could imagine a special print.tbl_dbi function that knows how to handle tbl_dbi objects by sending the query they contain to the database they connect to and then printing the result. I think the actual implementation is more complicated, but hopefully this provides some intuition.
Non-standard evaluation (NSE) in dplyr's filter_ & pulling data from MySQL
It's not really related to SQL. This example in R does not work either:
df <- data.frame(
v1 = sample(5, 10, replace = TRUE),
v2 = sample(5,10, replace = TRUE)
)
df %>% filter_(~ "v1" == 1)
It does not work because you need to pass filter_ the expression ~ v1 == 1, not the expression ~ "v1" == 1.
To solve the problem, simply use the quoting operator quo and the unquoting operator !!:
library(dplyr)
which_column <- quo(v1)
df %>% filter(!!which_column == 1)
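A self-contained version of the same idea, with fixed data so the result is reproducible. The second variant, which starts from a column name held in a string and converts it with rlang::sym(), is an addition for illustration and not part of the original answer.

```r
library(dplyr)

# Fixed example data instead of sample(), so the result is reproducible.
df <- data.frame(
  v1 = c(1, 2, 1, 3, 5),
  v2 = c(4, 4, 2, 1, 3)
)

# Quote the column once, then unquote it inside filter().
which_column <- quo(v1)
by_quosure <- df %>% filter(!!which_column == 1)

# If the column name arrives as a string, turn it into a symbol first.
column_name <- "v1"
by_symbol <- df %>% filter(!!rlang::sym(column_name) == 1)

print(by_quosure)
```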
How to build a wrapper function for querying database using dbplyr and dplyr, having the query vary
@Waldi hits on the crux of the problem, which is that the pipe expects a function, not an expression, as the rhs. In the specific, choose-from-a-list case, you control the expression building, so this is manageable. You can use magrittr semantics and the dot placeholder to build an expression from kind_of_query. This in turn can be used to create the complete expression (query) with rlang::quo and the !! operator.
get_data_from_db <- function(kind_of_query) {
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
on.exit(DBI::dbDisconnect(con))
mtcars_db <- dplyr::copy_to(con, mtcars)
if (kind_of_query == "from_hadley_book") {
my_query <-
rlang::expr(
{
filter(., cyl > 2) %>%
select(mpg:hp) %>%
head(10)
}
)
}
if (kind_of_query == "mins_for_mpg_disp_drat") {
my_query <-
rlang::expr(
{summarise(., min_mpg = min(mpg), min_disp = min(disp), min_drat = min(drat))}
)
}
query <- rlang::quo(
mtcars_db %>%
!!my_query %>%
collect()
)
rlang::eval_tidy(query)
}
This is actually an overly sophisticated approach. If you're writing the expression for the kind_of_query, you might as well just simplify it by writing a function.
get_data_from_db2 <- function(kind_of_query) {
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
on.exit(DBI::dbDisconnect(con))
mtcars_db <- dplyr::copy_to(con, mtcars)
if (kind_of_query == "from_hadley_book") {
my_fx <- function(x){
x %>%
filter(cyl > 2) %>%
select(mpg:hp) %>%
head(10)
}
}
if (kind_of_query == "mins_for_mpg_disp_drat") {
my_fx <- function(x){
summarise(x, min_mpg = min(mpg), min_disp = min(disp), min_drat = min(drat))
}
}
mtcars_db %>%
my_fx %>%
collect()
}
The problem comes with the general case. In the current proposed interface, you are trying to inject an argument value into a user-defined expression. The !! operator forces evaluation, so when building the new expression, the user expression is inserted within () to force its evaluation before anything is passed from the lhs of the pipe. Manipulating the expression then likely requires deparse, as suggested by @Waldi, or some low-level manipulation of the abstract syntax tree.
The simpler solution, if possible, would be to have your users pass in a function, similar to purrr::map or lapply. This would drastically simplify the function implementation:
get_data_from_db_general <- function(kind_of_query) {
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
on.exit(DBI::dbDisconnect(con))
mtcars_db <- dplyr::copy_to(con, mtcars)
mtcars_db %>%
kind_of_query %>%
collect()
}
get_data_from_db_general(
kind_of_query = function(x){
x %>%
filter(cyl > 2) %>%
select(mpg:hp) %>%
head(10)
}
)
# A tibble: 10 x 4
mpg cyl disp hp
<dbl> <dbl> <dbl> <dbl>
1 21 6 160 110
2 21 6 160 110
3 22.8 4 108 93
4 21.4 6 258 110
5 18.7 8 360 175
6 18.1 6 225 105
7 14.3 8 360 245
8 24.4 4 147. 62
9 22.8 4 141. 95
10 19.2 6 168. 123
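The same function-passing interface should also cover the summarise query from earlier. The sketch below restates get_data_from_db_general so it runs on its own; the na.rm = TRUE arguments are an addition here, intended to avoid dbplyr's missing-value warning.

```r
library(dplyr)

# Restated so the block is self-contained.
get_data_from_db_general <- function(kind_of_query) {
  con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
  on.exit(DBI::dbDisconnect(con))
  mtcars_db <- copy_to(con, mtcars)
  mtcars_db %>%
    kind_of_query() %>%
    collect()
}

# The caller supplies the query as an ordinary function of the lazy table.
mins <- get_data_from_db_general(
  function(x) {
    summarise(
      x,
      min_mpg = min(mpg, na.rm = TRUE),
      min_disp = min(disp, na.rm = TRUE),
      min_drat = min(drat, na.rm = TRUE)
    )
  }
)

print(mins)
```

Because the argument is a plain function, no quoting or unquoting is needed inside get_data_from_db_general.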