Empty output when reading a csv file into Rstudio using SparkR
Pre-built Spark distributions are still built with Scala 2.10, not 2.11 (this holds for the Spark 1.5.1 build used below). So if you use such a distribution (which I think you do), you also need a spark-csv build for Scala 2.10, not for Scala 2.11 (the one you use in your code). The following code should then work fine:
library(rJava)
library(SparkR)
library(nycflights13)
df <- flights[1:4, 1:4]
df
#   year month day dep_time
# 1 2013     1   1      517
# 2 2013     1   1      533
# 3 2013     1   1      542
# 4 2013     1   1      544
write.csv(df, file="~/scripts/temp.csv", quote=FALSE, row.names=FALSE)
sc <- sparkR.init(sparkHome = "/usr/local/bin/spark-1.5.1-bin-hadoop2.6/",
                  master = "local",
                  sparkPackages = "com.databricks:spark-csv_2.10:1.2.0") # 2.10 here
sqlContext <- sparkRSQL.init(sc)
df_spark <- read.df(sqlContext, "/home/vagrant/scripts/temp.csv", "com.databricks.spark.csv", header="true")
head(df_spark)
#   year month day dep_time
# 1 2013     1   1      517
# 2 2013     1   1      533
# 3 2013     1   1      542
# 4 2013     1   1      544
How to read csv into sparkR ver 1.4?
You have to start the sparkR console with the spark-csv package each time, like this:
sparkR --packages com.databricks:spark-csv_2.10:1.0.3
Read a csv file in sparkR where columns have spaces
The following worked for me:
df = collect(df)
colnames_df<-colnames(df)
colnames_df<-gsub(" ","_",colnames_df)
colnames(df)<-colnames_df
df <- createDataFrame(sqlContext, df)
printSchema(df)
Here we first need to collect the data locally, which converts the Spark data frame to a normal R data frame. I am sceptical whether this is a good solution, as I would rather not call collect. However, I investigated and found that even to use ggplot libraries we need to convert the data into a local data frame.
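The renaming step itself can be tried in isolation on a plain R data frame. This is only a sketch: sanitize_names is a hypothetical helper name, and the two column names are made up for illustration. SparkR also has withColumnRenamed, which might let you rename columns without collect, but that is untested here.

```r
# Hypothetical helper: trim the names and replace runs of whitespace with "_".
sanitize_names <- function(x) gsub("\\s+", "_", trimws(x))

# A small local data frame whose column names contain spaces (made-up data).
local_df <- data.frame(a = 1:2, b = 3:4)
names(local_df) <- c("first name", "last  name")

# Apply the clean-up to the column names only; the data is untouched.
names(local_df) <- sanitize_names(names(local_df))
print(names(local_df))
```

Using "\\s+" instead of a single space also collapses repeated spaces and tabs into one underscore.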
importing csv file in rstudio from hdfs using sparkR
You can use the fread function of the data.table library to read from HDFS. You'd have to specify the path of the hdfs executable in your system. For instance, assuming that the path to hdfs is /usr/bin/hdfs, you can try something like this:
your_table <- fread("/usr/bin/hdfs dfs -text /afs/Accounts.csv")
If your "Accounts.csv" is a directory, you can use a wildcard as well: /afs/Accounts.csv/*
You can also specify the column classes. For instance:
your_table <- fread("/usr/bin/hdfs dfs -text /afs/Accounts.csv", fill = TRUE, header = TRUE,
colClasses = c("numeric", "character", ...))
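If data.table is not available, the same idea (parse the output of a shell command as CSV) can be sketched in base R with pipe(). This is an illustration under assumptions, not the method above: cat on a temporary file stands in for "/usr/bin/hdfs dfs -text", and the id and name columns are made up.

```r
# Write a small CSV to a temporary file; it stands in for a file on HDFS.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, name = c("a", "b", "c")), tmp, row.names = FALSE)

# Assumption: `cat` replaces "/usr/bin/hdfs dfs -text <path>" so the sketch
# runs without a cluster; swap in the real command on your system.
cmd <- paste("cat", shQuote(tmp))

# read.csv opens the pipe, parses the command's output, and closes it again.
accounts <- read.csv(pipe(cmd), colClasses = c("integer", "character"))
str(accounts)
```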
I hope this helps.
How to use write.df store a csv file when using Sparkr and Rstudio?
Spark partitions your data into blocks so that it can distribute those partitions over the nodes in your cluster. When writing the data, it retains this partitioning: it creates a directory and writes each partition to a separate file. This way it can take better advantage of distributed file systems (writing each block in parallel to HDFS/S3), and it doesn't have to collect all the data on a single machine, which may not be capable of handling the amount of data.
The two files with the long names are the 2 partitions of your data and hold the actual CSV data. You can see this by copying them, renaming the copies with a .csv extension and double clicking them, or with something like head longfilename.
You can test whether the write was successful by trying to read it back in: give Spark the path to the directory and it will recognize it as a partitioned file, through the metadata and _SUCCESS files you mentioned.
If you do need all the data in one file, you can do that by using repartition to reduce the number of partitions to 1 and then writing it:
b <- repartition(a, 1)
write.df(b,"mine/b.csv")
This will result in just one long-named file which is a CSV file with all the data.
(I don't use SparkR, so this is untested; in Scala/PySpark you would prefer to use coalesce rather than repartition, but I couldn't find an equivalent SparkR function.)
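If the output is small enough to fit on one machine, the part files can also be stitched together locally in plain R. The sketch below simulates such an output directory in a temporary folder. It assumes that the part files are named part-* and that each one starts with a header line; both depend on how the data was written, so check your own output first.

```r
# Simulate a partitioned output directory (assumed layout: part-* files plus _SUCCESS).
out_dir <- file.path(tempdir(), "b.csv")
dir.create(out_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:2), file.path(out_dir, "part-00000"), row.names = FALSE)
write.csv(data.frame(x = 3:4), file.path(out_dir, "part-00001"), row.names = FALSE)
file.create(file.path(out_dir, "_SUCCESS"))

# Pick up only the part files, in a stable order, and skip the marker file.
parts <- sort(list.files(out_dir, pattern = "^part-", full.names = TRUE))

# Read each part as CSV and stack the pieces into one local data frame.
combined <- do.call(rbind, lapply(parts, read.csv))
print(combined)
```

This pulls everything onto one machine, so it only makes sense for results that are already small.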
Spark 2.0.0: SparkR CSV Import
I have the same problem. I also see a similar problem with this simple code:
createDataFrame(iris)
Maybe something is wrong with the installation?
Update: yes, I found a solution.
It is based on this answer: Apache Spark MLlib with DataFrame API gives java.net.URISyntaxException when createDataFrame() or read().csv(...)
For R, just start the session with this code:
sparkR.session(sparkConfig = list(spark.sql.warehouse.dir="/file:C:/temp"))