Difference between === null and isNull in Spark DataDrame
First and foremost don't use null
in your Scala code unless you really have to for compatibility reasons.
Regarding your question it is plain SQL. col("c1") === null
is interpreted as c1 = NULL
and, because NULL
marks undefined values, result is undefined for any value including NULL
itself.
spark.sql("SELECT NULL = NULL").show
+-------------+
|(NULL = NULL)|
+-------------+
| null|
+-------------+
spark.sql("SELECT NULL != NULL").show
+-------------------+
|(NOT (NULL = NULL))|
+-------------------+
| null|
+-------------------+
spark.sql("SELECT TRUE != NULL").show
+------------------------------------+
|(NOT (true = CAST(NULL AS BOOLEAN)))|
+------------------------------------+
| null|
+------------------------------------+
spark.sql("SELECT TRUE = NULL").show
+------------------------------+
|(true = CAST(NULL AS BOOLEAN))|
+------------------------------+
| null|
+------------------------------+
The only valid methods to check for NULL
are:
IS NULL
:spark.sql("SELECT NULL IS NULL").show
+--------------+
|(NULL IS NULL)|
+--------------+
| true|
+--------------+spark.sql("SELECT TRUE IS NULL").show
+--------------+
|(true IS NULL)|
+--------------+
| false|
+--------------+IS NOT NULL
:spark.sql("SELECT NULL IS NOT NULL").show
+------------------+
|(NULL IS NOT NULL)|
+------------------+
| false|
+------------------+spark.sql("SELECT TRUE IS NOT NULL").show
+------------------+
|(true IS NOT NULL)|
+------------------+
| true|
+------------------+
implemented in DataFrame
DSL as Column.isNull
and Column.isNotNull
respectively.
Note:
For NULL
-safe comparisons use IS DISTINCT
/ IS NOT DISTINCT
:
spark.sql("SELECT NULL IS NOT DISTINCT FROM NULL").show
+---------------+
|(NULL <=> NULL)|
+---------------+
| true|
+---------------+
spark.sql("SELECT NULL IS NOT DISTINCT FROM TRUE").show
+--------------------------------+
|(CAST(NULL AS BOOLEAN) <=> true)|
+--------------------------------+
| false|
+--------------------------------+
or not(_ <=> _)
/ <=>
spark.sql("SELECT NULL AS col1, NULL AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
| true|
+---------------+
spark.sql("SELECT NULL AS col1, TRUE AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
| false|
+---------------+
in SQL and DataFrame
DSL respectively.
Related:
Including null values in an Apache Spark Join
Differences between null and NaN in spark? How to deal with it?
null values represents "no value" or "nothing", it's not even an empty string or zero. It can be used to represent that nothing useful exists.
NaN stands for "Not a Number", it's usually the result of a mathematical operation that doesn't make sense, e.g. 0.0/0.0
.
One possible way to handle null values is to remove them with:
df.na.drop()
Or you can change them to an actual value (here I used 0) with:
df.na.fill(0)
Another way would be to select the rows where a specific column is null for further processing:
df.where(col("a").isNull())
df.where(col("a").isNotNull())
Rows with NaN can also be selected using the equivalent method:
from pyspark.sql.functions import isnan
df.where(isnan(col("a")))
Difference between na().drop() and filter(col.isNotNull) (Apache Spark)
With df.na.drop()
you drop the rows containing any null or NaN values.
With df.filter(df.col("onlyColumnInOneColumnDataFrame").isNotNull())
you drop those rows which have null only in the column onlyColumnInOneColumnDataFrame
.
If you would want to achieve the same thing, that would be df.na.drop(["onlyColumnInOneColumnDataFrame"])
.
Spark dataframe not able to Compare Null values
use <=>
instead of ===
val df3 = df1.join(df2, df1("id") === df2("o_id"))
.withColumn("result", when(df1("name") <=> df2("o_name") &&
df1("age") <=> df2("o_age") &&
df1("lastname") <=> df2("o_lastname") &&
df1("city") <=> df2("o_city") &&
df1("country") <=> df2("o_country"), "Varified")
.otherwise("Not Varified")).show()
spark.sql("SELECT NULL AS city1, NULL AS city2").select($"city1" <=> $"city2").show
results
+-----------------+
|(city1 <=> city2)|
+-----------------+
| true |
+-----------------+
Spark Scala : Check if string isn't null or empty
Does this work you to use <=>
val employees = spark.createDataFrame(Seq(("E1","100.0"), ("E2","200.0"),("E3",null),("E4",""))).toDF("employee","salary")
employees.show()
employees.filter(notNullAndEmpty($"salary")).show()
def notNullAndEmpty(c:Column): Column ={
(c.isNotNull and !(c <=> lit("")))
}
Data -
+--------+------+
|employee|salary|
+--------+------+
| E1| 100.0|
| E2| 200.0|
| E3| null|
| E4| |
+--------+------+
Result
+--------+------+
|employee|salary|
+--------+------+
| E1| 100.0|
| E2| 200.0|
+--------+------+
how to filter out a null value from spark dataframe
Let's say you have this data setup (so that results are reproducible):
// declaring data types
case class Company(cName: String, cId: String, details: String)
case class Employee(name: String, id: String, email: String, company: Company)
// setting up example data
val e1 = Employee("n1", null, "n1@c1.com", Company("c1", "1", "d1"))
val e2 = Employee("n2", "2", "n2@c1.com", Company("c1", "1", "d1"))
val e3 = Employee("n3", "3", "n3@c1.com", Company("c1", "1", "d1"))
val e4 = Employee("n4", "4", "n4@c2.com", Company("c2", "2", "d2"))
val e5 = Employee("n5", null, "n5@c2.com", Company("c2", "2", "d2"))
val e6 = Employee("n6", "6", "n6@c2.com", Company("c2", "2", "d2"))
val e7 = Employee("n7", "7", "n7@c3.com", Company("c3", "3", "d3"))
val e8 = Employee("n8", "8", "n8@c3.com", Company("c3", "3", "d3"))
val employees = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
val df = sc.parallelize(employees).toDF
Data is:
+----+----+---------+---------+
|name| id| email| company|
+----+----+---------+---------+
| n1|null|n1@c1.com|[c1,1,d1]|
| n2| 2|n2@c1.com|[c1,1,d1]|
| n3| 3|n3@c1.com|[c1,1,d1]|
| n4| 4|n4@c2.com|[c2,2,d2]|
| n5|null|n5@c2.com|[c2,2,d2]|
| n6| 6|n6@c2.com|[c2,2,d2]|
| n7| 7|n7@c3.com|[c3,3,d3]|
| n8| 8|n8@c3.com|[c3,3,d3]|
+----+----+---------+---------+
Now to filter employees with null
ids, you will do --
df.filter("id is null").show
which will correctly show you following:
+----+----+---------+---------+
|name| id| email| company|
+----+----+---------+---------+
| n1|null|n1@c1.com|[c1,1,d1]|
| n5|null|n5@c2.com|[c2,2,d2]|
+----+----+---------+---------+
Coming to the second part of your question, you can replace the null
ids with 0 and other values with 1 with this --
df.withColumn("id", when($"id".isNull, 0).otherwise(1)).show
This results in:
+----+---+---------+---------+
|name| id| email| company|
+----+---+---------+---------+
| n1| 0|n1@c1.com|[c1,1,d1]|
| n2| 1|n2@c1.com|[c1,1,d1]|
| n3| 1|n3@c1.com|[c1,1,d1]|
| n4| 1|n4@c2.com|[c2,2,d2]|
| n5| 0|n5@c2.com|[c2,2,d2]|
| n6| 1|n6@c2.com|[c2,2,d2]|
| n7| 1|n7@c3.com|[c3,3,d3]|
| n8| 1|n8@c3.com|[c3,3,d3]|
+----+---+---------+---------+
Including null values in an Apache Spark Join
Spark provides a special NULL
safe equality operator:
numbersDf
.join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
.drop(lettersDf("numbers"))
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| null| zzz|
| | hhh|
+-------+-------+
Be careful not to use it with Spark 1.5 or earlier. Prior to Spark 1.6 it required a Cartesian product (SPARK-11111 - Fast null-safe join).
In Spark 2.3.0 or later you can use Column.eqNullSafe
in PySpark:
numbers_df = sc.parallelize([
("123", ), ("456", ), (None, ), ("", )
]).toDF(["numbers"])
letters_df = sc.parallelize([
("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")
]).toDF(["numbers", "letters"])
numbers_df.join(letters_df, numbers_df.numbers.eqNullSafe(letters_df.numbers))
+-------+-------+-------+
|numbers|numbers|letters|
+-------+-------+-------+
| 456| 456| def|
| null| null| zzz|
| | | hhh|
| 123| 123| abc|
+-------+-------+-------+
and %<=>%
in SparkR:
numbers_df <- createDataFrame(data.frame(numbers = c("123", "456", NA, "")))
letters_df <- createDataFrame(data.frame(
numbers = c("123", "456", NA, ""),
letters = c("abc", "def", "zzz", "hhh")
))
head(join(numbers_df, letters_df, numbers_df$numbers %<=>% letters_df$numbers))
numbers numbers letters
1 456 456 def
2 <NA> <NA> zzz
3 hhh
4 123 123 abc
With SQL (Spark 2.2.0+) you can use IS NOT DISTINCT FROM
:
SELECT * FROM numbers JOIN letters
ON numbers.numbers IS NOT DISTINCT FROM letters.numbers
This is can be used with DataFrame
API as well:
numbersDf.alias("numbers")
.join(lettersDf.alias("letters"))
.where("numbers.numbers IS NOT DISTINCT FROM letters.numbers")
Spark replacing null with 0 performance comparison
They are not the same but performance should be similar. na.fill
uses coalesce
but it replaces NaN
and NULLs
, not only NULLS
.
val y = when($"x" === 0, $"x".cast("double")).when($"x" === 1, lit(null)).otherwise(lit("NaN").cast("double"))
val df = spark.range(0, 3).toDF("x").withColumn("y", y)
df.withColumn("y", when($"y".isNull, 0.0).otherwise($"y")).show()
df.na.fill(0.0, Seq("y")).show()
Related Topics
Why Can't I Use an Alias in a Delete Statement
Sqlite Order by Date1530019888000
Ora-00972 Identifier Is Too Long Alias Column Name
How to Do a Case Sensitive Group By
MySQL Full Text Search with Partial Words
Getting "Comma-Separated List Near 'Xx.Yy' Invalid" with Dbms_Utility.Comma_To_Table
Search an Oracle Database for Tables with Specific Column Names
Getting Max Value from Rows and Joining to Another Table
Hamming Distance/Similarity Searches in a Database
Postgresql Tables Exists, But Getting "Relation Does Not Exist" When Querying
How to Find Sum of Multiple Columns in a Table in SQL Server 2005
Order by Items Must Appear in the Select List If Select Distinct Is Specified
How to Pass a Temp Table as a Parameter into a Separate Stored Procedure
How to Search All Columns in a Table
Ssis Best Practice to Load N Tables from Source to Target Server