Empty output when reading a CSV file into RStudio using SparkR

Pre-built Spark distributions are still built with Scala 2.10, not 2.11. So, if you use such a distribution (which I think you do), you also need a spark-csv build for Scala 2.10, not for Scala 2.11 (like the one you use in your code). The following code should then work fine:

library(rJava)
library(SparkR)
library(nycflights13)

df <- flights[1:4, 1:4]
df
  year month day dep_time
1 2013     1   1      517
2 2013     1   1      533
3 2013     1   1      542
4 2013     1   1      544

write.csv(df, file="~/scripts/temp.csv", quote=FALSE, row.names=FALSE)

sc <- sparkR.init(sparkHome="/usr/local/bin/spark-1.5.1-bin-hadoop2.6/",
                  master="local",
                  sparkPackages="com.databricks:spark-csv_2.10:1.2.0") # 2.10 here
sqlContext <- sparkRSQL.init(sc)
df_spark <- read.df(sqlContext, "/home/vagrant/scripts/temp.csv", "com.databricks.spark.csv", header="true")
head(df_spark)
  year month day dep_time
1 2013     1   1      517
2 2013     1   1      533
3 2013     1   1      542
4 2013     1   1      544

How to read CSV into SparkR ver 1.4?

You have to start the SparkR console each time like this:

sparkR --packages com.databricks:spark-csv_2.10:1.0.3
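
When the shell starts this way, sc and sqlContext are already created for you, so reading the file is straightforward. A minimal sketch (the file name here is illustrative):

# Sketch: read a CSV with the spark-csv source loaded via --packages above.
df <- read.df(sqlContext, "cars.csv",
              source = "com.databricks.spark.csv", header = "true")
head(df)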

Read a CSV file in SparkR where columns have spaces

The following worked for me:

df <- collect(df)                           # bring the data into local R
colnames_df <- colnames(df)
colnames_df <- gsub(" ", "_", colnames_df)  # replace spaces in column names
colnames(df) <- colnames_df
df <- createDataFrame(sqlContext, df)       # push the renamed data back to Spark
printSchema(df)

Here we need to collect the data locally first, which converts the Spark DataFrame into a regular R data frame. I am sceptical whether this is a good solution, as I don't want to call collect; however, I investigated and found that even to use the ggplot libraries we need to convert the data into a local data frame.
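
If calling collect is the concern, a sketch of an alternative that keeps the data in Spark, assuming SparkR's withColumnRenamed (available since 1.4):

# Sketch: rename columns with spaces on the Spark side, avoiding collect().
for (old_name in columns(df)) {
  new_name <- gsub(" ", "_", old_name)
  if (new_name != old_name) {
    df <- withColumnRenamed(df, old_name, new_name)
  }
}
printSchema(df)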

Importing a CSV file in RStudio from HDFS using SparkR

You can use the fread function of the data.table library to read from HDFS. You'd have to specify the path of the hdfs executable on your system. For instance, assuming that the path to hdfs is /usr/bin/hdfs, you can try something like this:

your_table <- fread("/usr/bin/hdfs dfs -text /afs/Accounts.csv")

If your "Accounts.csv" is a directory, you can use a wildcard as well /afs/Accounts.csv/* You can also specify the column classes. For instance:

your_table <- fread("/usr/bin/hdfs dfs -text /afs/Accounts.csv",
                    fill = TRUE, header = TRUE,
                    colClasses = c("numeric", "character", ...))
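
As an aside, since the question mentions SparkR: if the spark-csv package is loaded (as in the answers above), you can also read the file from HDFS in a distributed way instead of streaming it through the local node. A sketch, where the hdfs:// URI is illustrative:

# Sketch: read directly from HDFS with SparkR; assumes spark-csv is loaded.
df <- read.df(sqlContext, "hdfs:///afs/Accounts.csv",
              source = "com.databricks.spark.csv", header = "true")
head(df)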

I hope this helps.

How to use write.df to store a CSV file when using SparkR and RStudio?

Spark partitions your data into blocks, so it can distribute those partitions over the nodes in your cluster. When writing the data, it retains this partitioning: it creates a directory and writes each partition to a separate file. This way it can take better advantage of distributed file systems (writing each block in parallel to HDFS/S3), and it doesn't have to collect all the data on a single machine, which might not be capable of handling the amount of data.

The two files with the long names are the 2 partitions of your data and hold the actual CSV data. You can see this by copying them, renaming the copies with a .csv extension and double clicking them, or with something like head longfilename.

You can test whether the write was successful by trying to read it back in: give Spark the path to the directory and it will recognize it as a partitioned file, through the metadata and _SUCCESS files you mentioned.

If you do need all the data in one file, you can get that by using repartition to reduce the number of partitions to 1 and then writing it:

b <- repartition(a, 1)
# specify the csv source explicitly; otherwise write.df uses Spark's
# default format (parquet)
write.df(b, "mine/b.csv", source = "com.databricks.spark.csv", header = "true")

This will result in just one long-named file which is a CSV file with all the data.
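
To make the read-back check above concrete, a sketch that reuses the sqlContext and the spark-csv source assumed earlier:

# Sketch: verify the write by reading the partitioned directory back in.
b2 <- read.df(sqlContext, "mine/b.csv",
              source = "com.databricks.spark.csv", header = "true")
count(b2)  # should equal count(a)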

(I don't use SparkR, so this is untested; in Scala/PySpark you would prefer coalesce over repartition, but I couldn't find an equivalent SparkR function.)

Spark 2.0.0: SparkR CSV Import

I had the same problem, and I hit a similar failure even with this simple code:

createDataFrame(iris)

Maybe something is wrong with the installation?

UPD: Yes! I found a solution.

This solution is based on this question: Apache Spark MLlib with DataFrame API gives java.net.URISyntaxException when createDataFrame() or read().csv(...)

For R, just start the session with this code:

sparkR.session(sparkConfig = list(spark.sql.warehouse.dir="/file:C:/temp"))
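
With the session up, Spark 2.0's built-in csv source should then work without the spark-csv package. A sketch (the file path is illustrative):

# Sketch: CSV import with the native csv source in Spark 2.0.
df <- read.df("C:/temp/flights.csv", source = "csv",
              header = "true", inferSchema = "true")
head(df)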


Related Topics



Leave a reply



Submit