How to remove the double quote when the value is empty in Spark?
You have empty strings in your DataFrame. If you want to write them out as nulls, replace the empty strings with null and then set nullValue=None when saving:
(df.replace('', None)      # replace empty strings with null
    .write.save(
        path='out',
        format='csv',
        delimiter='|',
        header=True,
        nullValue=None     # write nulls as bare empty fields, no quotes
    ))
And it will be saved as:
id|first_name|last_name|zip_code
1||Elsner|57315
2|Noelle||
3|James|Moser|48256
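For intuition, here is a plain-Python sketch of the difference on disk (no Spark required; the sample row is row 2 from the output above). By default Spark's CSV writer emits empty strings as quoted "", while the options above produce bare empty fields; both representations parse back to the same values.

```python
import csv
import io

# Row 2 as Spark writes it by default (empty strings quoted) and with
# nullValue set as above (bare empty fields):
with_quotes = '2|Noelle|""|""\r\n'
without_quotes = '2|Noelle||\r\n'

parse = lambda text: next(csv.reader(io.StringIO(text), delimiter='|'))

# Both lines parse back to the same four fields.
print(parse(with_quotes))     # ['2', 'Noelle', '', '']
print(parse(without_quotes))  # ['2', 'Noelle', '', '']
```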
How to remove double quotes in csv generated after pyspark conversion
.option('nullValue', None)
would work
Spark CSV writer outputs double quotes for empty string
The Spark DataFrameWriter's csv format has two options you can set here: nullValue and emptyValue, both of which you can set to null instead of the default empty string. See the DataFrameWriter documentation for details.
In your specific example you can just add the options to your write statement:
myDataset
.withColumn("map_str", mapToStringUDF(col("map")))
.drop("map")
.write
.option("emptyValue", null)
.option("nullValue", null)
.option("header", "false")
.option("delimiter", "\t")
.csv("output.csv")
Or here's a full example, including test data:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val data = Seq(
Row(null, "20200506", "Hello"),
Row(2, "20200607", null),
Row(3, null, "World")
)
val schema = List(
StructField("Item", IntegerType, true),
StructField("Date", StringType, true),
StructField("Message", StringType, true)
)
val testDF = spark.createDataFrame(
spark.sparkContext.parallelize(data),
StructType(schema)
)
testDF.write
.option("emptyValue", null)
.option("nullValue", null)
.option("header", "true")
.csv(PATH)
The resulting raw .csv should look like this:
Item,Date,Message
,20200506,Hello
2,20200607,
3,,World
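Reading that file back outside Spark, the bare empty fields come in as empty strings; a small stdlib-csv sketch of recovering the nulls (the inline string stands in for the written file):

```python
import csv
import io

raw = """Item,Date,Message
,20200506,Hello
2,20200607,
3,,World
"""

rows = list(csv.DictReader(io.StringIO(raw)))
# csv returns '' for the bare fields; map those back to None to recover nulls.
cleaned = [{k: (v or None) for k, v in row.items()} for row in rows]

print(cleaned[0]["Item"])     # None
print(cleaned[1]["Message"])  # None
print(cleaned[2]["Date"])     # None
```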
How to ignore double quotes in Spark Dataframe where we read the input data from CSV?
I tried this with Spark over Scala and it removed the quotes from the columns:
df = df.withColumn("ename", regexp_replace(col("ename"), "“", ""))
.withColumn("eloc", regexp_replace(col("eloc"), "“", ""))
.withColumn("ename", regexp_replace(col("ename"), "”", ""))
.withColumn("eloc", regexp_replace(col("eloc"), "”", ""))
There must be something similar in Spark's Python API too.
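There is: PySpark exposes the same function as pyspark.sql.functions.regexp_replace, and the four calls above can collapse to one per column by using a character class that matches both curly-quote characters. The pattern itself can be checked with plain Python's re (the sample string is made up):

```python
import re

# One character class covers both curly quotes, so each column needs a
# single regexp_replace instead of two.
pattern = "[“”]"

print(re.sub(pattern, "", "“New York”"))  # New York
```

The same pattern string works as the second argument to Spark's regexp_replace.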
How to ignore double quotes when reading CSV file in Spark?
From the documentation for pyspark.sql.DataFrameReader.csv
(emphasis mine):
quote – sets a single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, ". If you would like to turn off quotations, you need to set an empty string.
dfr = spark.read.csv(
path="path/to/some/file.csv",
header="true",
inferSchema="true",
quote=""
)
dfr.show()
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#| "A| B"| "C"| D"|
#+----+----+----+----+
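The output above follows from what quote="" actually does: with quoting disabled, the reader just splits on the delimiter, so literal quote characters stay inside the tokens. A plain-Python sketch of the same splitting (the sample line is made up):

```python
# With quoting off, parsing degenerates to a plain split on the delimiter;
# embedded quote characters are kept as ordinary data.
line = '"A,B","C",D"'
print(line.split(","))  # ['"A', 'B"', '"C"', 'D"']
```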
How to replace double quotes with space in Scala
From using regexp_replace and mentioning columns I assume you mean Spark (if so, you should mention it in any future questions). Look at the signatures of the two overloads:
def regexp_replace(e: Column, pattern: Column, replacement: Column): Column
def regexp_replace(e: Column, pattern: String, replacement: String): Column
'\"' is a Char, not a String, so you need "\"" instead.
In Scala without Spark you'd use methods like replaceAllIn, replaceFirstIn and replaceSomeIn on scala.util.matching.Regex (mentioned mostly for anyone else who finds this question).
Update:
val colString = insertColumns.mkString(",") + s"${month},concat(year(from_unixtime(unix_timestamp((regexp replace(column_name,'\"',"")), 'yyyy-MM-dd-HH.mm.ss.SSSSSS'))),'-',lpad(month(from_unixtime(unix_timestamp((regexp_replace(column_name,'\"',"")), 'yyyy-MM-dd-HH.mm.ss.SSSSSS'))),2,'0'),'-',lpad(day(from_unixtime(unix_timestamp((regexp_replace(column_name,'\"',"")), 'yyyy-MM-dd-HH.mm.ss.SSSSSS'))),2,'0')) AS column_name"
Here the string after + is only
s"${month},concat(year(from_unixtime(unix_timestamp((regexp replace(column_name,'\"',"
and then you have separate string literals
")), 'yyyy-MM-dd-HH.mm.ss.SSSSSS'))),'-',lpad(month(from_unixtime(unix_timestamp((regexp_replace(column_name,'\"',"
etc. \" inside an s"..." interpolator doesn't behave as expected, so escaping the quotes with \" won't work; you should use a triple-quoted string instead.
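As a rough analogue outside Scala, Python's triple-quoted strings show the same idea as the suggested fix: inside """...""" an embedded double quote needs no escaping (the column name here is a made-up placeholder):

```python
# Triple quotes let the '"' sit inside the literal unescaped, mirroring
# Scala's """...""" string literals.
fragment = """regexp_replace(column_name, '"', '')"""
print(fragment)  # regexp_replace(column_name, '"', '')
```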