How to Export a Table DataFrame in PySpark to CSV

How to export a table dataframe in PySpark to csv?

If the DataFrame fits in driver memory and you want to save it to the local file system, you can convert the Spark DataFrame to a local pandas DataFrame with the toPandas method and then call to_csv, for example df.toPandas().to_csv('mycsv.csv').
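
A minimal sketch of the toPandas route, assuming `df` is a PySpark DataFrame small enough to collect on the driver and that pandas is installed there. The helper name `export_small_df` is hypothetical:

```python
def export_small_df(df, path):
    """Collect a small Spark DataFrame to the driver and write one local CSV.

    `df` is assumed to be a pyspark.sql.DataFrame that fits in driver memory.
    """
    # toPandas() pulls every row to the driver as a pandas DataFrame.
    pandas_df = df.toPandas()
    # index=False leaves out the pandas row index, which the Spark data never had.
    pandas_df.to_csv(path, index=False)
    return path
```

Calling export_small_df(df, 'mycsv.csv') should leave a single ordinary CSV file on the driver's local disk.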


Otherwise you can use spark-csv:

  • Spark 1.3: df.save('mycsv.csv', 'com.databricks.spark.csv')
  • Spark 1.4+: df.write.format('com.databricks.spark.csv').save('mycsv.csv')


In Spark 2.0+ you can use the built-in csv data source directly: df.write.csv('mycsv.csv')
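
A sketch of the built-in writer with a header row and overwrite mode, assuming `df` is a Spark 2.0+ DataFrame. The helper name `write_csv_dir` is hypothetical:

```python
def write_csv_dir(df, out_dir):
    """Write a Spark 2.0+ DataFrame as CSV using the built-in data source."""
    # Spark creates out_dir as a directory of part files, one per partition.
    # header=True writes the column names; mode="overwrite" replaces an
    # existing output directory instead of raising an error.
    df.write.csv(out_dir, header=True, mode="overwrite")
```

Note that the path names a directory, not a file, even when it ends in .csv.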


How to export data from Spark SQL to CSV

You can write the contents of a DataFrame in CSV format with df.write.csv, passing an output directory.
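
In PySpark, going from a Spark SQL query to CSV might look like the sketch below. It assumes `spark` is an existing SparkSession (Spark 2.0+); the helper name `export_query_to_csv` is hypothetical:

```python
def export_query_to_csv(spark, query, out_dir):
    """Run a Spark SQL query and write its result as CSV part files."""
    # spark.sql returns a DataFrame holding the query result.
    result = spark.sql(query)
    # Each partition of the result becomes one part file under out_dir.
    result.write.csv(out_dir, header=True)
    return result
```

For instance, export_query_to_csv(spark, "SELECT * FROM testtable", out_dir) would write the table's rows under whatever directory `out_dir` names.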

If you need to write the whole DataFrame into a single CSV file, reduce it to one partition with coalesce(1) before writing.
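
A sketch of the single-file variant. Spark still creates a directory, but with one partition it holds a single part file. The helper names are hypothetical, and `find_part_file` assumes the output landed on the local file system with the usual `part-*.csv` naming:

```python
import glob
import os


def write_single_csv(df, out_dir):
    """Write the whole DataFrame as one CSV part file inside out_dir."""
    # coalesce(1) merges all partitions into one, so a single task
    # writes all the data; this only suits modest data sizes.
    df.coalesce(1).write.csv(out_dir, header=True, mode="overwrite")


def find_part_file(out_dir):
    """Return the path of the first part file in a local output directory."""
    matches = sorted(glob.glob(os.path.join(out_dir, "part-*.csv")))
    return matches[0] if matches else None
```

The part file can then be renamed or moved if a specific file name is needed.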

For Spark 1.x, you can use spark-csv to write the results into CSV files.

The Scala snippet below should help:

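import org.apache.spark.sql.hive.HiveContext
// sc - existing spark context
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM testtable")
// Write with the spark-csv package (Spark 1.4+ writer API); the output path is a placeholder
df.write.format("com.databricks.spark.csv").save("/path/to/output_dir")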

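To write the contents into a single file:

import org.apache.spark.sql.hive.HiveContext
// sc - existing spark context
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM testtable")
// coalesce(1) merges the data into one partition, so one part file is written (placeholder path)
df.coalesce(1).write.format("com.databricks.spark.csv").save("/path/to/output_dir")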

How can I convert a Pyspark dataframe to a CSV without sending it to a file?

Easy way: convert your DataFrame to a pandas DataFrame with toPandas(), then save it to a string. To get a string rather than a file, call to_csv with path_or_buf=None. Then send the string in an API call.

From to_csv() documentation:


path_or_buf : str or file handle, default None

File path or object, if None is provided the result is returned as a string.

So your code would likely look like this:

csv_string = df.toPandas().to_csv(path_or_buf=None)

Alternatives: use tempfile.SpooledTemporaryFile with a large buffer to create an in-memory file. Or you can even use a regular file, just make your buffer large enough and don't flush or close the file. Take a look at Corey Goldberg's explanation of why this works.
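
The in-memory file alternative could be sketched as follows. Note that this version swaps pandas for the standard-library csv module and streams rows with toLocalIterator(), so it is an illustration of the SpooledTemporaryFile idea and not the exact approach above. The helper name `df_to_csv_string` and the 10 MB default are assumptions:

```python
import csv
import tempfile


def df_to_csv_string(df, max_size=10 * 1024 * 1024):
    """Render a Spark DataFrame as CSV text without naming a file on disk.

    Assumes `df` exposes `columns` and `toLocalIterator()`, as
    pyspark.sql.DataFrame does.
    """
    # The buffer stays in memory until more than max_size bytes are written,
    # then spills to an anonymous temporary file.
    with tempfile.SpooledTemporaryFile(
        max_size=max_size, mode="w+", newline="", encoding="utf-8"
    ) as buf:
        writer = csv.writer(buf)
        writer.writerow(df.columns)
        # Rows are tuple-like, so csv.writer can write them directly.
        for row in df.toLocalIterator():
            writer.writerow(row)
        # Rewind and read everything back as one string.
        buf.seek(0)
        return buf.read()
```

The returned string can be passed as the body of an API request in the same way as the to_csv result.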

