Using Sqldf and Rpostgresql Together

Using sqldf and RPostgreSQL together

Using sqldf with RPostgreSQL

sqldf will automatically work with the test database in PostgreSQL if it sees that RPostgreSQL is loaded. So you can create a test database in PostgreSQL and then use sqldf with that

or, you can specify the name of a different database.

See: sqldf FAQ 12

Using sqldf with RSQLite

If you want to use sqldf with RSQLite rather than with RPostgreSQL you can use sqldf's drv argument to force it use a non-default driver. e.g.

sqldf("select foo from bar...",drv="SQLite")

or, you can set the driver globally using the "sqldf.driver" option. From within R:

options(sqldf.driver = "SQLite")

or, another possibility if you wish to use RSQLite is to detach RPostgreSQL before you use sqldf and load it again afterwards.

See ?sqldf for details.

Using RpostgreSQL with sqldf crashes R

If the only reason not to use H2 is date_trunc then here are some ways around that.

1) Macros Here is a workaround for truncation to the beginning of year/quarter/month/week. These functions act as macros which expand into code accepted by H2. Be sure to prefix sqldf with fn$ and surround each with back quotes to turn on substitution. Note that in each case the argument is a string. Add the verbose=TRUE argument to sqldf to see the generated code.

library(RH2)
library(sqldf)

trunc_year <- function(x) sprintf("DATEADD(day, 1-day_of_year(%s), %s)", x, x)
trunc_qtr <- function(x) sprintf("DATEADD(month, 3*(quarter(%s)-1), %s)", x, 
  trunc_year(x))
trunc_month <- function(x) sprintf("DATEADD(day, 1-day_of_month(%s), %s)", x, x)
trunc_week <- function(x) sprintf("DATEADD(day, -iso_day_of_week(%s), %s)", x, x)

# test
DF <- data.frame(x = as.Date("2021-11-15"))
fn$sqldf("select x, 
                `trunc_year('x')` year, 
                `trunc_qtr('x')` qtr,
                `trunc_month('x')` month,
                `trunc_week('x')` week
          from DF")
##            x       year        qtr      month       week
## 1 2021-11-15 2021-01-01 2021-10-01 2021-11-01 2021-11-14

2) Patch RH2 Another possibility is to patch your installation of RH2 with the newer version of H2 which has date_trunc. To do this below we remove the the h2-1.3.175.jar file in RH2, which contains H2, and replace it with the newer version h2-1.4.200.jar. This should work as long as you have write permission in the java subdirectory of your RH2 installation. Just run this code in R and your RH2 installation will be patched.

u <- "https://h2database.com/h2-2019-10-14.zip"
z <- basename(u)
tmp <- tempdir()  # create temporary dir 
old.dir <- setwd(tmp)  # go to it
download.file(u, z)  # download u
unzip(z, "h2/bin/h2-1.4.200.jar")  # extract jar from zip file

# copy new jar file to RH2 installation and remove old one
new.jar <- file.path(tmp, "h2", "bin", "h2-1.4.200.jar")
old.jar <- dir(system.file("java", package = "RH2"), "h2.*jar$", full.names = TRUE)
if (basename(old.jar) != basename(new.jar)) {
  file.copy(new.jar, dirname(old.jar))
  file.remove(old.jar)
}
setwd(old.dir)  # return to original directory

# test
library(RH2)
library(sqldf)
sqldf("select h2version()")
##   '1.4.200'
## 1   1.4.200
# see more tests in (3) below

3) Rebuild RH2 Another possibility is create a new source version of RH2 with the newer version of H2 that has date_trunc. The first argument is 'year', 'quarter', 'month' or 'week' and the second argument is a date. Thus, if you are willing to rebuild RH2 with this new H2 version then it is available.

(a) Download the source of RH2 from https://cran.r-project.org/package=RH2 and detar it to create a directory tree RH2.

(b) Download this h2 zip https://h2database.com/h2-2019-10-14.zip and extract h2/bin/h2-1.4.200.jar from that zip file and place the jar file in the RH2/inst/java directory of the RH2 source removing the old one, h2-1.3.175.jar

# rebuild RH2
# cd to the RH2 directory containing its DESCRIPTION file
setwd("...whatever.../RH2")

# build RH2 
#  on Windows this needs Rtools40 (not an R package)
#  https://cran.r-project.org/bin/windows/Rtools/
library(devtools)
build()
install(args = "--no-multiarch")

# test
library(RH2)
library(sqldf)
sqldf("select h2version()") # check it's the latest version
##   '1.4.200'
## 1   1.4.200

DF <- data.frame(x = as.Date("2000-11-15"))
sqldf("select date_trunc('month', x) from DF")
##   DATE_TRUNC('month', x)
## 1             2000-11-01

Use sqldf to join exactly on id and on the most recent date in a lagged window

Consider window functions such as RANK() of which likely dplyr::row_number() is adopted (among other SQL semantics like select, group_by, case_when). SQLite (default dialect of sqldf) recently added support of window functions in version 3.25.0 (September 2018 release).

If not available in sqldf (depending on version), use a Postgres backend via RPostgreSQL. See author docs. Possibly too or soon, RMySQL will be another supported backend as MySQL 8 recently added support of window functions.

library(RPostgreSQL)
library(sqldf)

D <- sqldf('WITH cte AS
               (SELECT *,
                       RANK() OVER (PARTITION BY "B".row ORDER BY "B".date DESC) AS rn
                FROM "A"
                LEFT JOIN "B"
                    ON "A".id = "B".id
                   AND ("A".date - "B".date) BETWEEN 3*30 and 3*365
               )

           SELECT * FROM cte
           WHERE rn = 1')

Can sqldf be used to import the data of an already existing table in a database into a data.frame in R?

Thanks to G. Grothendieck for the answer. Indeed it is perfectly possible to select data from already existing tables in the database. My mistake was that I was thinking that the name of the dataframe and the corresponding table must always be the same, whereas if I understand correctly, this is only the case when a data.frame data is mapped to a temporary table in the database. As a result when I tried to select data, I had an error message saying that a table with the same name already existed in my database.

Anyway, just as a test to see whether this works, I did the following in PostgreSQL (postgres user and test database which is owned by postgres)

test=# create table person(fname text, lname text, email text);
CREATE TABLE
test=# insert into person(fname, lname, email) values ('fname-01', 'lname-01', 'fname-01.lname-01@gmail.com'), ('fname-02', 'lname-02', 'fname-02.lname-02@gmail.com'), ('fname-03', 'lname-03', 'fname-03.lname-03@gmail.com');
INSERT 0 3
test=# select * from person;
  fname   |  lname   |            email            
----------+----------+-----------------------------
 fname-01 | lname-01 | fname-01.lname-01@gmail.com
 fname-02 | lname-02 | fname-02.lname-02@gmail.com
 fname-03 | lname-03 | fname-03.lname-03@gmail.com
(3 rows)

test=#

Then I wrote the following in R

options(sqldf.RPostgreSQL.user = "postgres", 
  sqldf.RPostgreSQL.password = "postgres",
  sqldf.RPostgreSQL.dbname = "test",
  sqldf.RPostgreSQL.host = "localhost", 
  sqldf.RPostgreSQL.port = 5432)
###
###
library(tidyverse)
library(RPostgreSQL)
library(sqldf)
###
###
result_df <- sqldf("select * from person")

And indeed we can see that result_df contains the data stored in the table person.

> result_df
     fname    lname                       email
1 fname-01 lname-01 fname-01.lname-01@gmail.com
2 fname-02 lname-02 fname-02.lname-02@gmail.com
3 fname-03 lname-03 fname-03.lname-03@gmail.com
> 
>

case when in sqldf / R

Known as conditional aggregation in SQL (often used for pivoting data), as @Gregor comments simply run an aggregate like MAX() (even MIN() will work) around the CASE statements:

SELECT A,
       MAX(CASE WHEN A = '1' AND B = '1' THEN Col ELSE NULL END) as Test_1, 
       MAX(CASE WHEN A = '1' AND B = '2' THEN Col ELSE NULL END) as Test_2, 
       MAX(CASE WHEN A = '1' AND B = '3' THEN Col ELSE NULL END) as Test_3, 
       MAX(CASE WHEN A = '2' AND B = '1' THEN Col ELSE NULL END) as Test_4, 
       MAX(CASE WHEN A = '2' AND B = '2' THEN Col ELSE NULL END) as Test_5, 
       MAX(CASE WHEN A = '2' AND B = '3' THEN Col ELSE NULL END) as Test_6
FROM DF
GROUP BY A

Using Sqldf and Rpostgresql Together