How to export a table dataframe in PySpark to csv?
If the DataFrame fits in driver memory and you want to save it to the local file system, you can convert the Spark DataFrame to a local Pandas DataFrame using the toPandas method and then simply use to_csv:
df.toPandas().to_csv('mycsv.csv')
Otherwise you can use spark-csv:
Spark 1.3
df.save('mycsv.csv', 'com.databricks.spark.csv')
Spark 1.4+
df.write.format('com.databricks.spark.csv').save('mycsv.csv')
In Spark 2.0+ you can use the csv data source directly:
df.write.csv('mycsv.csv')
How to export data from Spark SQL to CSV
You can use the statement below to write the contents of a dataframe in CSV format:
df.write.csv("/data/home/csv")
If you need to write the whole dataframe into a single CSV file, then use:
df.coalesce(1).write.csv("/data/home/sample.csv")
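Note that even with coalesce(1), Spark creates a directory at the given path (here /data/home/sample.csv) that holds a single part-* file plus marker files such as _SUCCESS. It does not create a plain file with that name. If a standalone file on the local file system is needed, the part file can be moved out afterwards. The following is a minimal sketch under that assumption. promote_single_part_file is a hypothetical helper name, not part of Spark, and it only works for local paths, not HDFS or object stores.

```python
import glob
import os
import shutil


def promote_single_part_file(output_dir, target_path):
    """Move the lone part-* file out of a Spark output directory.

    Hypothetical helper (not a Spark API). Assumes output_dir is a local
    directory produced by something like df.coalesce(1).write.csv(output_dir).
    """
    # "part-*" does not match hidden checksum files such as ".part-*.crc"
    # or the "_SUCCESS" marker, so only real data files are counted.
    part_files = glob.glob(os.path.join(output_dir, "part-*"))
    if len(part_files) != 1:
        raise ValueError(
            "expected exactly one part file in %r, found %d"
            % (output_dir, len(part_files))
        )
    shutil.move(part_files[0], target_path)
    return target_path
```

For example, after the write above, promote_single_part_file("/data/home/sample.csv", "/data/home/sample_single.csv") would leave one ordinary CSV file at the second path (both paths are illustrative).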
For Spark 1.x, you can use spark-csv to write the results into CSV files.
The Scala snippet below shows how:
import org.apache.spark.sql.hive.HiveContext
// sc - existing spark context
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM testtable")
df.write.format("com.databricks.spark.csv").save("/data/home/csv")
To write the contents into a single file
import org.apache.spark.sql.hive.HiveContext
// sc - existing spark context
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM testtable")
df.coalesce(1).write.format("com.databricks.spark.csv").save("/data/home/sample.csv")
How can I convert a Pyspark dataframe to a CSV without sending it to a file?
Easy way: convert your dataframe to a Pandas dataframe with toPandas(), then save it to a string. To save to a string rather than a file, call to_csv with path_or_buf=None. Then send the string in an API call.
From the to_csv() documentation:
Parameters
path_or_buf : str or file handle, default None
File path or object, if None is provided the result is returned as a string.
So your code would likely look like this:
csv_string = df.toPandas().to_csv(path_or_buf=None)
Alternatives: use tempfile.SpooledTemporaryFile with a large buffer to create an in-memory file. Or you can even use a regular file, just make your buffer large enough and don't flush or close the file. Take a look at Corey Goldberg's explanation of why this works.
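The SpooledTemporaryFile alternative could look roughly like the sketch below. It uses only the standard library csv module rather than Pandas, which is a swapped-in technique and not what the answer above prescribes. rows_to_csv_string and the 10 MB default for max_size are assumptions chosen for illustration. Data stays in memory until it exceeds max_size, at which point the file rolls over to disk.

```python
import csv
import tempfile


def rows_to_csv_string(header, rows, max_size=10 * 1024 * 1024):
    """Serialize a header and an iterable of row tuples to one CSV string.

    Hypothetical helper: the name and the default max_size are illustrative.
    """
    # newline="" lets the csv module control line endings itself.
    with tempfile.SpooledTemporaryFile(
        max_size=max_size, mode="w+", newline=""
    ) as buf:
        writer = csv.writer(buf)
        writer.writerow(header)
        writer.writerows(rows)
        buf.seek(0)  # rewind so the whole buffer can be read back
        return buf.read()
```

With a Spark DataFrame this might be called as rows_to_csv_string(df.columns, df.collect()), since collected Row objects behave like tuples. Like toPandas(), collect() pulls every row to the driver, so this only suits data that fits in driver memory.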