How to Load a CSV File into SparkR on RStudio

Loading CSV files in SparkR

Liste is a local list, which can be written with write.csv; data is a SparkR DataFrame, which cannot be written with write.csv: doing so only writes its pointer, not the DataFrame itself. That is why the output file is only 33 kB.
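A minimal sketch of the two usual workarounds, assuming data is the SparkR DataFrame in question and the output paths are placeholders (collect only works if the data fits in driver memory):

# Collect the distributed DataFrame to a local data.frame; then write.csv works
local_df <- collect(data)
write.csv(local_df, "data_full.csv", row.names=FALSE)

# Or write distributed output directly via spark-csv (produces a directory of part files)
write.df(data, path="data_out", source="com.databricks.spark.csv", mode="overwrite")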

Using sparklyr in RStudio, can I upload a LOCAL CSV file to a Spark cluster?

You cannot. The file has to be reachable from each machine in your cluster, either as a local copy or placed on a distributed file system / object storage.
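What you can do from sparklyr is read the file into the local R session and ship it to the cluster with copy_to. A sketch, assuming the file fits in local memory (the master, file name, and table name are placeholders):

library(sparklyr)

sc <- spark_connect(master = "yarn-client")   # assumed cluster master
local_df <- read.csv("my_local_file.csv")     # read on the RStudio/driver machine
remote_tbl <- copy_to(sc, local_df, name = "my_table", overwrite = TRUE)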

How to read a CSV into SparkR version 1.4?

You have to start the SparkR console each time like this:

sparkR --packages com.databricks:spark-csv_2.10:1.0.3
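Inside that shell, sc and sqlContext are already created for you, so the file can be read with read.df (the path here is a placeholder):

df <- read.df(sqlContext, "flights.csv", source = "com.databricks.spark.csv", header = "true")
head(df)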

Empty output when reading a CSV file into RStudio using SparkR

Pre-built Spark distributions are still built with Scala 2.10, not 2.11. So, if you use such a distribution (which I think you do), you also need a spark-csv build for Scala 2.10, not for Scala 2.11 (as the one used in your code). The following code should then work fine:

library(rJava)
library(SparkR)
library(nycflights13)

df <- flights[1:4, 1:4]
df
#   year month day dep_time
# 1 2013     1   1      517
# 2 2013     1   1      533
# 3 2013     1   1      542
# 4 2013     1   1      544

write.csv(df, file="~/scripts/temp.csv", quote=FALSE, row.names=FALSE)

sc <- sparkR.init(sparkHome="/usr/local/bin/spark-1.5.1-bin-hadoop2.6/",
                  master="local",
                  sparkPackages="com.databricks:spark-csv_2.10:1.2.0") # 2.10 here
sqlContext <- sparkRSQL.init(sc)
df_spark <- read.df(sqlContext, "/home/vagrant/scripts/temp.csv",
                    "com.databricks.spark.csv", header="true")
head(df_spark)
#   year month day dep_time
# 1 2013     1   1      517
# 2 2013     1   1      533
# 3 2013     1   1      542
# 4 2013     1   1      544

Importing a CSV file in RStudio from HDFS using SparkR

You can use the fread function from the data.table package to read from HDFS. You have to specify the path of the hdfs executable on your system. For instance, assuming that the path to hdfs is /usr/bin/hdfs, you can try something like this:

your_table <- fread("/usr/bin/hdfs dfs -text /afs/Accounts.csv")

If your "Accounts.csv" is a directory, you can use a wildcard as well: /afs/Accounts.csv/*. You can also specify the column classes, for instance:

your_table <- fread("/usr/bin/hdfs dfs -text /afs/Accounts.csv", fill = TRUE, header = TRUE, 
colClasses = c("numeric", "character", ...))
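With newer data.table releases (>= 1.11.6), passing the shell command through the cmd argument is the preferred, less ambiguous form; the same call would look like this:

your_table <- fread(cmd = "/usr/bin/hdfs dfs -text /afs/Accounts.csv",
                    fill = TRUE, header = TRUE)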

I hope this helps.

Spark 2.0.0: SparkR CSV Import

I have the same problem, and a similar failure appears even with this simple code:

createDataFrame(iris)

Maybe something is wrong in the installation?

UPD: Yes! I found the solution.

The solution is based on this question: Apache Spark MLlib with DataFrame API gives java.net.URISyntaxException when createDataFrame() or read().csv(...)

For R, just start the session with this code:

sparkR.session(sparkConfig = list(spark.sql.warehouse.dir = "file:///C:/temp"))
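Once the session starts cleanly, Spark 2.0's built-in CSV source can be used without the spark-csv package; a sketch (the file path is a placeholder):

df <- read.df("C:/temp/flights.csv", source = "csv", header = "true", inferSchema = "true")
head(df)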

Loading com.databricks.spark.csv via RStudio

This is the right syntax (after hours of trying).
(Note: the first line is the crucial one; pay attention to the double quotes.)

Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')

library(SparkR)
library(magrittr)

# Initialize SparkContext and SQLContext
sc <- sparkR.init(appName="SparkR-Flights-example")
sqlContext <- sparkRSQL.init(sc)

# The SparkSQL context should already be created for you as sqlContext
sqlContext
# Java ref type org.apache.spark.sql.SQLContext id 1

# Load the flights CSV file using `read.df`. Note that we use the CSV reader Spark package here.
flights <- read.df(sqlContext, "nycflights13.csv", "com.databricks.spark.csv", header="true")
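A couple of quick sanity checks once the load succeeds (these assume nycflights13.csv sits in the working directory, as above):

# Inspect the first rows and the row count of the Spark DataFrame
head(flights)
count(flights)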

