Transfer data from database to Spark using sparklyr
Sparklyr >= 0.6.0
You can use spark_read_jdbc.
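A minimal sketch of a spark_read_jdbc call, using the same placeholder PostgreSQL URL, credentials, and table name as the low-level example further down (none of these values are real; the driver jar must already be on Spark's classpath, for example via spark.jars.packages):

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# Read a table over JDBC into Spark and register it as "df".
df <- spark_read_jdbc(
  sc, "df",
  options = list(
    url      = "jdbc:postgresql://host/database",  # placeholder URL
    dbtable  = "table",
    user     = "scott",                            # placeholder credentials
    password = "tiger",
    driver   = "org.postgresql.Driver"
  )
)
```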
Sparklyr < 0.6.0
I hope there is a more elegant solution out there, but here is a minimal example using the low-level API:
Make sure that Spark has access to the required JDBC driver, for example by adding its coordinates to spark.jars.packages. For example, with PostgreSQL (adjust for the current version) you could add

spark.jars.packages org.postgresql:postgresql:9.4.1212

to SPARK_HOME/conf/spark-defaults.conf.
Load the data and register it as a temporary view:

name <- "foo"

spark_session(sc) %>%
  invoke("read") %>%
  # JDBC URL and table name
  invoke("option", "url", "jdbc:postgresql://host/database") %>%
  invoke("option", "dbtable", "table") %>%
  # Optional credentials
  invoke("option", "user", "scott") %>%
  invoke("option", "password", "tiger") %>%
  # Driver class, here for PostgreSQL
  invoke("option", "driver", "org.postgresql.Driver") %>%
  # Read and register as a temporary view
  invoke("format", "jdbc") %>%
  invoke("load") %>%
  # Spark 2.x; use registerTempTable in 1.x
  invoke("createOrReplaceTempView", name)

You can pass multiple options at once using an environment:

invoke("options", as.environment(list(
  user = "scott", password = "tiger", url = "jdbc:..."
)))

Load the temporary view with dplyr:

dplyr::tbl(sc, name)
Be sure to read about the further JDBC options, with a focus on partitionColumn, lowerBound/upperBound and numPartitions. For additional details see, for example, How to use JDBC source to write and read data in (Py)Spark? and How to improve performance for slow Spark jobs using DataFrame and JDBC connection?
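As a hedged sketch of how those partitioning options fit together with spark_read_jdbc (the column name and bounds below are made-up placeholders; the partition column must be numeric, and the bounds only control how the ranges are split, not which rows are read):

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# Read the table in 8 parallel partitions, splitting the numeric "id"
# column into ranges between lowerBound and upperBound.
df <- spark_read_jdbc(
  sc, "df",
  options = list(
    url             = "jdbc:postgresql://host/database",  # placeholder URL
    dbtable         = "table",
    driver          = "org.postgresql.Driver",
    partitionColumn = "id",      # placeholder: any numeric column
    lowerBound      = "1",       # placeholder bounds
    upperBound      = "100000",
    numPartitions   = "8"
  )
)
```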
How to access a Databricks database with sparklyr
If it helps anyone, here's what I found that seems to work:
- Set the default database
- Read a table from the default database

library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")
tbl_change_db(sc, "mydb")
foo <- spark_read_table(sc, "sometable")
sparklyr spark_read_table from a specific database
You need to use the tbl_change_db function to change the current database:
tbl_change_db(sc, "marketing")
data <- spark_read_table(sc, "scv_tbl")
How to store data in a Spark cluster using sparklyr?
Spark is technically an engine that runs on a computer/cluster to execute tasks. It is not a database or file system. You can save the data to a file system when you are done and load it again during your next session.
https://en.wikipedia.org/wiki/Apache_Spark
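For instance, a common pattern is to persist a Spark DataFrame to Parquet at the end of one session and read it back in the next; this is a sketch with placeholder paths and names, not from the original answer:

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# End of session: write the Spark DataFrame out as Parquet files.
spark_write_parquet(foo, path = "hdfs:///data/foo_parquet")  # placeholder path

# Next session: load it back into Spark under the name "foo".
foo <- spark_read_parquet(sc, name = "foo",
                          path = "hdfs:///data/foo_parquet")
```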
Read SQL table into SparklyR
Dan,
You can try something like this:
install.packages('devtools')
devtools::install_github('imanuelcostigan/RSQLServer')
require(RSQLServer)
require(dplyr)
require(sparklyr)

sc <- spark_connect(master = "local")  # or your cluster master

src <- RSQLServer::src_sqlserver("corsql10.corwin.local", database = "Project_DB")
data <- tbl(src, "Participants")
DBI::dbWriteTable(sc, "spark_Participants", data)
First, define the data source from SQL Server. Second, write it to Spark. tbl should create a reference to the SQL Server table without loading it into memory. Note that the RSQLServer package is no longer well maintained and was removed from CRAN because its bugs were not fixed, so you will have to troubleshoot it yourself. Here is a good resource: Accessing MSSQL Server with R
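Since RSQLServer is off CRAN, an alternative worth sketching is to read the SQL Server table directly over JDBC with spark_read_jdbc, in the same spirit as the Oracle answer below. The driver jar name, credentials, and URL here are assumptions based on Microsoft's mssql-jdbc driver, not part of the original answer:

```r
library(sparklyr)

config <- spark_config()
# Assumption: Microsoft's JDBC driver jar, downloaded separately.
config[["sparklyr.jars.default"]] <- "mssql-jdbc-7.4.1.jre8.jar"

sc <- spark_connect(master = "local", config = config)

# Read the Participants table straight into Spark over JDBC.
participants <- spark_read_jdbc(
  sc, "spark_Participants",
  options = list(
    url      = "jdbc:sqlserver://corsql10.corwin.local;databaseName=Project_DB",
    dbtable  = "Participants",
    driver   = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    user     = "user",       # placeholder credentials
    password = "password"
  )
)
```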
use sparklyr with Oracle database connection
Knowing that you asked for an ODBC way, here is a JDBC solution (probably useful for other users, and ODBC is not mentioned in the question title).
You need to have ojdbc7.jar somewhere (in this case in your working directory, but I recommend storing it centrally and providing the path here).
Change the required values like spark_home etc.
If you are running R on your client computer (and not on an edge node in the cluster), you might use Livy to connect to Spark.
library(sparklyr)
library(RJDBC)

##### Spark
config <- spark_config()

### tell config the location of the Oracle jar
config[["sparklyr.jars.default"]] <- "ojdbc7.jar"

### example spark_home
sc <- spark_connect(master = "yarn-client",
                    spark_home = "/usr/lib/spark",
                    version = "2.2.0",
                    config = config)

datspark <- spark_read_jdbc(sc, "table", options = list(
    url = "jdbc:oracle:thin:@//<ip>:1521/<schema>",
    driver = "oracle.jdbc.OracleDriver",
    user = "user",
    password = "password",
    dbtable = "table"),
  memory = FALSE # don't cache the whole (big) table
)

### your R code here

spark_disconnect(sc)