How to Remove the Double Quote When the Value Is Empty in Spark

How to remove the double quote when the value is empty in Spark?

You have empty strings in your DataFrame. If you want to write them out as nulls, replace the empty strings with null and then set nullValue=None when saving:

df.replace('', None).write.save(  # replace empty strings with null
    path='out',
    format='csv',
    delimiter='|',
    header=True,
    nullValue=None,               # write nulls with no text representation
)

And it will be saved as:

id|first_name|last_name|zip_code
1||Elsner|57315
2|Noelle||
3|James|Moser|48256
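
If you read the file back, the empty fields match the reader's default nullValue of "" and come back as nulls rather than empty strings. A minimal sketch, assuming the out path from above:

# Read the file back; empty fields are parsed as nulls by default.
df2 = spark.read.csv('out', sep='|', header=True)
df2.show()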

How to remove double quotes in csv generated after pyspark conversion

Adding .option('nullValue', None) to the writer would work.
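
As a minimal sketch (assuming an existing DataFrame df and a hypothetical output path out/):

# Per the answer above: passing None clears the null representation,
# so null cells are written as nothing instead of a quoted "".
(df.write
    .option('nullValue', None)
    .option('header', True)
    .csv('out'))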

Spark CSV writer outputs double quotes for empty string

The Spark DataFrameWriter has two options for the csv format that matter here: nullValue and emptyValue. Set both to null so that nothing, rather than a quoted empty string, is written out; see the DataFrameWriter documentation for details.

In your specific example you can just add the options to your write statement:

myDataset
  .withColumn("map_str", mapToStringUDF(col("map")))
  .drop("map")
  .write
  .option("emptyValue", null)
  .option("nullValue", null)
  .option("header", "false")
  .option("delimiter", "\t")
  .csv("output.csv")

Or here's a full example, including test data:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val data = Seq(
  Row(null, "20200506", "Hello"),
  Row(2, "20200607", null),
  Row(3, null, "World")
)

val schema = List(
  StructField("Item", IntegerType, true),
  StructField("Date", StringType, true),
  StructField("Message", StringType, true)
)

val testDF = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

testDF.write
  .option("emptyValue", null)
  .option("nullValue", null)
  .option("header", "true")
  .csv(PATH)

The resulting raw .csv should look like this:

Item,Date,Message
,20200506,Hello
2,20200607,
3,,World
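
A rough PySpark equivalent of the same test, in case you are on the Python side (a sketch; data, schema, and the placeholder path mirror the Scala version above):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

data = [(None, "20200506", "Hello"), (2, "20200607", None), (3, None, "World")]
schema = StructType([
    StructField("Item", IntegerType(), True),
    StructField("Date", StringType(), True),
    StructField("Message", StringType(), True),
])

test_df = spark.createDataFrame(data, schema)

(test_df.write
    .option("emptyValue", None)
    .option("nullValue", None)
    .option("header", True)
    .csv("PATH"))  # placeholder path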

How to ignore double quotes in Spark Dataframe where we read the input data from CSV?

I tried this with Spark in Scala, and it removed the typographic quotes from the columns:

df = df.withColumn("ename", regexp_replace(col("ename"), "“", ""))
  .withColumn("eloc", regexp_replace(col("eloc"), "“", ""))
  .withColumn("ename", regexp_replace(col("ename"), "”", ""))
  .withColumn("eloc", regexp_replace(col("eloc"), "”", ""))

There must be something similar in Spark's Python API too; see the sketch below.
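
For reference, a PySpark sketch of the same idea (assuming the same hypothetical column names ename and eloc; a character class handles both curly quotes in one pass):

from pyspark.sql.functions import col, regexp_replace

# Strip both typographic double quotes from each column.
df = (df
    .withColumn('ename', regexp_replace(col('ename'), '[“”]', ''))
    .withColumn('eloc', regexp_replace(col('eloc'), '[“”]', '')))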

How to ignore double quotes when reading CSV file in Spark?

From the documentation for pyspark.sql.DataFrameReader.csv:

quote – sets a single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, ". If you would like to turn off quotations, you need to set an empty string.

dfr = spark.read.csv(
    path="path/to/some/file.csv",
    header="true",
    inferSchema="true",
    quote=""
)
dfr.show()
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#| "A| B"| "C"| D"|
#+----+----+----+----+
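
The same thing in the .option style, if you prefer it (an equivalent sketch of the call above):

dfr = (spark.read
    .option('header', 'true')
    .option('inferSchema', 'true')
    .option('quote', '')  # empty string turns off quote handling
    .csv('path/to/some/file.csv'))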

How to replace double quotes with space in Scala

From your use of regexp_replace and mention of columns I assume you mean Spark (if so, you should say so in any future questions). Look at the signatures of the two overloads:

def regexp_replace(e: Column, pattern: Column, replacement: Column): Column
def regexp_replace(e: Column, pattern: String, replacement: String): Column

'\"' is a Char, not a String, so you need "\"" instead.

In Scala without Spark you'd use methods like replaceAllIn, replaceFirstIn, or replaceSomeIn on scala.util.matching.Regex (mentioned mostly for anyone else who finds this question).

Update:

val colString = insertColumns.mkString(",") + s"${month},concat(year(from_unixtime(unix_timestamp((regexp replace(column_name,'\"',"")), 'yyyy-MM-dd-HH.mm.ss.SSSSSS'))),'-',lpad(month(from_unixtime(unix_timestamp((regexp_replace(column_name,'\"',"")), 'yyyy-MM-dd-HH.mm.ss.SSSSSS'))),2,'0'),'-',lpad(day(from_unixtime(unix_timestamp((regexp_replace(column_name,'\"',"")), 'yyyy-MM-dd-HH.mm.ss.SSSSSS'))),2,'0')) AS column_name"

Here the string after + is only

s"${month},concat(year(from_unixtime(unix_timestamp((regexp replace(column_name,'\"',"

and then you have separate string literals

")), 'yyyy-MM-dd-HH.mm.ss.SSSSSS'))),'-',lpad(month(from_unixtime(unix_timestamp((regexp_replace(column_name,'\"',"

etc.

\" inside an s"..." doesn't work as expected, so escaping the quotes with \" won't work; you should use triple-quoted strings.


