Load CSV File with Spark

Spark - load CSV file as DataFrame?

Since Spark 2.0, CSV support (formerly the separate spark-csv package) is part of core Spark and doesn't require a separate library.
So you can simply do, for example:

df = spark.read.format("csv").option("header", "true").load("csvfile.csv")

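In Scala (this works for any delimited format: pass "," as the delimiter for CSV, "\t" for TSV, and so on). The com.databricks.spark.csv format name comes from the external spark-csv package used with Spark 1.x; on Spark 2.0 and later, format("csv") is sufficient:

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .load("csvfile.csv")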
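
For PySpark, the same delimiter idea can be wrapped in a small helper. This is only a sketch: read_delimited is a hypothetical name, and it assumes an existing SparkSession is passed in as spark. sep is the CSV reader's option name for the field separator (delimiter is accepted as an alias).

```python
def read_delimited(spark, path, sep=",", header=True):
    """Read a delimited text file into a DataFrame.

    `spark` is assumed to be an existing SparkSession; the function name
    and its parameters are illustrative, not part of the Spark API.
    """
    return (
        spark.read.format("csv")
        # Spark expects option values as strings, e.g. "true" / "false".
        .options(header=str(header).lower(), sep=sep)
        .load(path)
    )
```

With a live session, a TSV file would then be read as read_delimited(spark, "data.tsv", sep="\t").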

Load CSV file with PySpark

Are you sure that all the lines have at least 2 columns? Just to check, can you try something like this:

sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line) > 1) \
    .map(lambda line: (line[0], line[1])) \
    .collect()

Alternatively, you could print the culprit (if any):

sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line) <= 1) \
    .collect()
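
Note that line.split(",") also splits on commas inside quoted fields, so a quoted value can look like extra columns. Here is a hedged sketch of a quote-aware count using Python's csv module. field_count and column_histogram are hypothetical helper names, and sc is assumed to be an existing SparkContext.

```python
import csv


def field_count(line, sep=","):
    """Count the fields in one line, treating double-quoted commas as data."""
    rows = list(csv.reader([line], delimiter=sep))
    return len(rows[0]) if rows else 0


def column_histogram(sc, path, sep=","):
    """Map each distinct field count to the number of lines that have it.

    `sc` is assumed to be an existing SparkContext.
    """
    counts = sc.textFile(path).map(lambda line: field_count(line, sep))
    return dict(counts.countByValue())
```

A well-formed file should yield a single key from column_histogram; any extra keys show how many lines have an unexpected number of fields.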

How to properly import CSV files with PySpark

If you can't correct the input file, then you can try to load it as text then split the values to get the desired columns. Here's an example:

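Input file:

1,2,3,4,5,6,7,8,9,10,0,12,121
1,2,3,4,5,6,7,8,9,10,0,12,121

Read and parse:

from pyspark.sql import functions as F

# Number of leading fields that each get their own column
nb_cols = 5

# Each line is loaded as a single string column named "value"
df = spark.read.text("file.csv")

df = df.withColumn(
    "values",
    F.split("value", ",")
).select(
    *[F.col("values")[i].alias(f"col_{i}") for i in range(nb_cols)],
    # slice() is 1-based: take every field after the first nb_cols and re-join them
    F.array_join(F.expr(f"slice(values, {nb_cols + 1}, size(values))"), ",").alias(f"col_{nb_cols}")
)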

df.show()
#+-----+-----+-----+-----+-----+-------------------+
#|col_0|col_1|col_2|col_3|col_4|              col_5|
#+-----+-----+-----+-----+-----+-------------------+
#|    1|    2|    3|    4|    5|6,7,8,9,10,0,12,121|
#|    1|    2|    3|    4|    5|6,7,8,9,10,0,12,121|
#+-----+-----+-----+-----+-----+-------------------+
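
To see what the slice/array_join step does to a single line, the same logic can be reproduced in plain Python with str.split and a maxsplit argument. This is only an illustration of the idea, not the Spark code path. split_keep_tail is a hypothetical helper name.

```python
def split_keep_tail(line, nb_cols, sep=","):
    """Split off the first nb_cols fields and keep the remainder as one string.

    Lines with fewer fields are padded with None so every row has
    nb_cols + 1 entries.
    """
    parts = line.split(sep, nb_cols)  # performs at most nb_cols splits
    return parts + [None] * (nb_cols + 1 - len(parts))


row = "1,2,3,4,5,6,7,8,9,10,0,12,121"
print(split_keep_tail(row, 5))
# ['1', '2', '3', '4', '5', '6,7,8,9,10,0,12,121']
print(split_keep_tail("1,2,3", 5))
# ['1', '2', '3', None, None, None]
```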

Read CSV file on Spark

You can try increasing the driver memory when you create the Spark session, as below:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "4g") \
    .appName('read-csv') \
    .getOrCreate()
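
One caveat worth checking: getOrCreate() returns an already running session if there is one (for example in a notebook or the pyspark shell), in which case the builder's memory setting may not take effect. Here is a small sketch for reading back the value the running context reports. effective_driver_memory is a hypothetical helper name, and spark is assumed to be an existing SparkSession.

```python
def effective_driver_memory(spark, default="not set"):
    """Return the spark.driver.memory value recorded in the running context.

    `spark` is assumed to be an existing SparkSession; `default` is
    returned when the key was never set.
    """
    return spark.sparkContext.getConf().get("spark.driver.memory", default)
```

Calling print(effective_driver_memory(spark)) right after building the session shows whether the requested value was picked up.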

