How to Read Gz Compressed File by Pyspark

How to read gz compressed file by pyspark

The Spark documentation clearly states that gz files can be read automatically:

All of Spark’s file-based input methods, including textFile, support
running on directories, compressed files, and wildcards as well. For
example, you can use textFile("/my/directory"),
textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").

I'd suggest running the following commands and checking the result:

rdd = sc.textFile("data/label.gz")

print(rdd.take(10))

Assuming that Spark finds the file data/label.gz, it will print the first 10 rows from the file.

Note that a relative path like data/label.gz resolves to the HDFS home directory of the Spark user. Is the file actually there?
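
If the file lives somewhere else, a fully qualified URI removes the ambiguity (a sketch; the paths below are placeholders, not taken from the question):

# Read from an explicit HDFS location under the default namenode
rdd_hdfs = sc.textFile("hdfs:///user/spark/data/label.gz")

# Or read from the local filesystem (the file must be reachable at this
# path on the machines doing the read)
rdd_local = sc.textFile("file:///home/spark/data/label.gz")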

How to read .gz compressed file using spark DF or DS?

Reading a compressed csv is done in the same way as reading an uncompressed csv file. For Spark version 2.0+ it can be done as follows using Scala (note the extra option for the tab delimiter):

val df = spark.read.option("sep", "\t").csv("file.csv.gz")

PySpark:

df = spark.read.csv("file.csv.gz", sep='\t')

The only extra consideration is that a gz file is not splittable, so Spark has to read the whole file on a single core, which slows things down. Once the read is done, the data can be shuffled (repartitioned) to increase parallelism, as sketched below.
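
A rough sketch of that pattern in PySpark (the file name and partition count are illustrative only):

df = spark.read.csv("file.csv.gz", sep='\t')

# The gzip file is read on a single core; repartition afterwards so
# downstream transformations can use the whole cluster.
df = df.repartition(64)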

How to read a gzip compressed json lines file into PySpark dataframe?

#This command collapsed all 90 million rows into a single row, because the 'multiline' option treats each file as one JSON document:

df = spark.read.option('multiline', 'true').json('file.jl.gz')


#This command worked fine for me, since spark.read.json expects JSON Lines (one object per line) by default:

df = spark.read.json('file.jl.gz')
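
Because a gzipped JSON Lines file is also not splittable, letting Spark infer the schema costs an extra single-core pass over the data. Supplying an explicit schema avoids that pass (a sketch; the field names here are made up, not taken from the original file):

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("id", LongType(), True),
    StructField("label", StringType(), True),
])

# With a user-supplied schema, Spark skips the inference pass over the gz file.
df = spark.read.schema(schema).json('file.jl.gz')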


