How to read gz compressed file by pyspark
The Spark documentation states that gzip-compressed files
can be read directly:
All of Spark’s file-based input methods, including textFile, support
running on directories, compressed files, and wildcards as well. For
example, you can use textFile("/my/directory"),
textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
I'd suggest running the following commands and checking the result:
rdd = sc.textFile("data/label.gz")  # the gzip file is decompressed transparently
print(rdd.take(10))  # first 10 lines of the file
Assuming that Spark finds the file data/label.gz,
it will print the first 10 lines of the file.
Note that a relative path like data/label.gz is resolved
against the Spark user's home folder in HDFS. Is the file there?
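If the file actually lives on the local filesystem rather than in HDFS, one option is to pass an explicit file:// URI so the path is not resolved against HDFS. The sketch below only builds such a URI with the standard library; the helper name `local_file_uri` is made up for illustration, and on a cluster the file would also need to exist at the same path on every worker node.

```python
from pathlib import Path


def local_file_uri(path):
    """Return an absolute file:// URI for a path on the local filesystem."""
    # resolve() makes the path absolute; as_uri() adds the file:// scheme.
    return Path(path).resolve().as_uri()


uri = local_file_uri("data/label.gz")
print(uri)

# With an existing SparkContext `sc`, the URI can be passed directly:
# rdd = sc.textFile(uri)
```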
How to read .gz compressed file using spark DF or DS?
Reading a compressed csv is done in the same way as reading an uncompressed csv file. For Spark version 2.0+ it can be done as follows using Scala (note the extra option for the tab delimiter):
val df = spark.read.option("sep", "\t").csv("file.csv.gz")
PySpark:
df = spark.read.csv("file.csv.gz", sep='\t')
The only extra consideration is that a gz file is not splittable, so Spark has to read the whole file on a single core, which slows things down. After the read is done, the data can be repartitioned (shuffled) to increase parallelism.
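A minimal sketch of that repartitioning step, assuming an already running SparkSession is passed in as `spark`. The function name `read_gz_csv` and the default of 8 partitions are illustrative choices, not recommendations; a sensible partition count depends on the data size and the cluster.

```python
def read_gz_csv(spark, path, num_partitions=8, sep="\t"):
    """Read a gzip-compressed CSV and spread it over several partitions.

    `spark` is an existing SparkSession. `num_partitions` is an
    illustrative value and should be tuned for the actual workload.
    """
    # A single .gz file is not splittable, so the read itself is done
    # by one task and produces one partition.
    df = spark.read.csv(path, sep=sep)
    # repartition() adds a shuffle that redistributes the rows, so the
    # stages after it can run on more than one core.
    return df.repartition(num_partitions)


# Usage, assuming `spark` is an active SparkSession:
# df = read_gz_csv(spark, "file.csv.gz")
# print(df.rdd.getNumPartitions())
```

The shuffle has its own cost, so this mostly pays off when the work done after the read is heavier than the read itself.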
How to read a gzip compressed json lines file into PySpark dataframe?
# With the multiline option, all 90 million rows were read as a single row:
df = spark.read.option('multiline', 'true').json('file.jl.gz')
# Without the multiline option, the file was read correctly for me:
df = spark.read.json('file.jl.gz')
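For context on why the default reader fits here: a JSON Lines file holds one complete JSON object per line, whereas the multiline option is meant for a JSON document that spans several lines. The sketch below uses only the Python standard library (no Spark) to build a tiny .jl.gz file and read it back line by line. The file name and records are made up for illustration.

```python
import gzip
import json
import os
import tempfile

# Hypothetical sample records, used only for illustration.
records = [
    {"id": 1, "label": "a"},
    {"id": 2, "label": "b"},
    {"id": 3, "label": "c"},
]

path = os.path.join(tempfile.mkdtemp(), "sample.jl.gz")

# JSON Lines: each record is serialized on its own line.
with gzip.open(path, "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Parsing line by line recovers one object per line, which is the same
# layout spark.read.json expects by default (one row per line).
with gzip.open(path, "rt", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f if line.strip()]

print(len(rows))
```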