Difference Between === null and isNull in Spark DataFrame

Difference between === null and isNull in Spark DataFrame

First and foremost, don't use null in your Scala code unless you really have to for compatibility reasons.

Regarding your question, it is plain SQL. col("c1") === null is interpreted as c1 = NULL and, because NULL marks undefined values, the result is undefined for any value, including NULL itself.

spark.sql("SELECT NULL = NULL").show
+-------------+
|(NULL = NULL)|
+-------------+
| null|
+-------------+
spark.sql("SELECT NULL != NULL").show
+-------------------+
|(NOT (NULL = NULL))|
+-------------------+
| null|
+-------------------+
spark.sql("SELECT TRUE != NULL").show
+------------------------------------+
|(NOT (true = CAST(NULL AS BOOLEAN)))|
+------------------------------------+
| null|
+------------------------------------+
spark.sql("SELECT TRUE = NULL").show
+------------------------------+
|(true = CAST(NULL AS BOOLEAN))|
+------------------------------+
| null|
+------------------------------+
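The DataFrame DSL behaves the same way. Below is a minimal sketch; the column name c1 and the toy data are assumptions for illustration, not taken from the original question:

import spark.implicits._
import org.apache.spark.sql.functions.col

// Toy column with one defined value and one NULL
val df = Seq(Some(1), None).toDF("c1")

// col("c1") === null compiles to c1 = NULL, which is undefined for every row
df.select((col("c1") === null).as("eq_null")).show()
// +-------+
// |eq_null|
// +-------+
// |   null|
// |   null|
// +-------+

// As a consequence, filtering with === null matches nothing
df.where(col("c1") === null).count()   // 0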

The only valid methods to check for NULL are:

  • IS NULL:

    spark.sql("SELECT NULL IS NULL").show
    +--------------+
    |(NULL IS NULL)|
    +--------------+
    | true|
    +--------------+
    spark.sql("SELECT TRUE IS NULL").show
    +--------------+
    |(true IS NULL)|
    +--------------+
    | false|
    +--------------+
  • IS NOT NULL:

    spark.sql("SELECT NULL IS NOT NULL").show
    +------------------+
    |(NULL IS NOT NULL)|
    +------------------+
    | false|
    +------------------+
    spark.sql("SELECT TRUE IS NOT NULL").show
    +------------------+
    |(true IS NOT NULL)|
    +------------------+
    | true|
    +------------------+

These are implemented in the DataFrame DSL as Column.isNull and Column.isNotNull, respectively.
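A quick illustration of those DSL methods, reusing the toy c1 DataFrame from the sketch above:

df.filter($"c1".isNull).show()      // keeps only the row where c1 IS NULL
df.filter($"c1".isNotNull).show()   // keeps only the row where c1 IS NOT NULL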

Note:

For NULL-safe comparisons use IS DISTINCT FROM / IS NOT DISTINCT FROM:

spark.sql("SELECT NULL IS NOT DISTINCT FROM NULL").show
+---------------+
|(NULL <=> NULL)|
+---------------+
| true|
+---------------+
spark.sql("SELECT NULL IS NOT DISTINCT FROM TRUE").show
+--------------------------------+
|(CAST(NULL AS BOOLEAN) <=> true)|
+--------------------------------+
| false|
+--------------------------------+

or not(_ <=> _) / <=>

spark.sql("SELECT NULL AS col1, NULL AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
| true|
+---------------+
spark.sql("SELECT NULL AS col1, TRUE AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
| false|
+---------------+

in SQL and DataFrame DSL respectively.
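The negated, NULL-safe inequality is simply the complement. A small sketch wrapping <=> in functions.not (the alias is_distinct is only illustrative):

import org.apache.spark.sql.functions.not

spark.sql("SELECT NULL AS col1, TRUE AS col2")
  .select(not($"col1" <=> $"col2").as("is_distinct")).show()
// +-----------+
// |is_distinct|
// +-----------+
// |       true|
// +-----------+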

Related:

Including null values in an Apache Spark Join

Differences between null and NaN in Spark? How to deal with it?

A null value represents "no value" or "nothing"; it's not even an empty string or zero. It can be used to represent that nothing useful exists.

NaN stands for "Not a Number"; it's usually the result of a mathematical operation that doesn't make sense, e.g. 0.0/0.0.

One possible way to handle null values is to remove them with:

df.na.drop()

Or you can change them to an actual value (here I used 0) with:

df.na.fill(0)

Another way would be to select the rows where a specific column is null for further processing:

df.where(col("a").isNull())
df.where(col("a").isNotNull())

Rows with NaN can also be selected using the equivalent method:

from pyspark.sql.functions import isnan
df.where(isnan(col("a")))

Difference between na().drop() and filter(col.isNotNull) (Apache Spark)

With df.na.drop() you drop the rows containing any null or NaN values.

With df.filter(df.col("onlyColumnInOneColumnDataFrame").isNotNull()) you drop those rows which have null only in the column onlyColumnInOneColumnDataFrame.

If you wanted to achieve the same thing with na.drop, that would be df.na.drop(Seq("onlyColumnInOneColumnDataFrame")) in the Scala API.
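A small sketch of that difference in the Scala API (the second column other and the sample rows are made up for illustration):

// Two columns, so that "null anywhere" and "null in one specific column" differ
val df = Seq(
  ("a", Some(1)),
  (null, Some(2)),   // null only in onlyColumnInOneColumnDataFrame
  ("c", None)        // null only in the other column
).toDF("onlyColumnInOneColumnDataFrame", "other")

df.na.drop().show()                                            // drops rows 2 and 3
df.filter($"onlyColumnInOneColumnDataFrame".isNotNull).show()  // drops only row 2
df.na.drop(Seq("onlyColumnInOneColumnDataFrame")).show()       // same result as the filter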

Spark dataframe not able to Compare Null values

Use <=> instead of ===:

val df3 = df1.join(df2, df1("id") === df2("o_id"))
  .withColumn("result",
    when(df1("name") <=> df2("o_name") &&
         df1("age") <=> df2("o_age") &&
         df1("lastname") <=> df2("o_lastname") &&
         df1("city") <=> df2("o_city") &&
         df1("country") <=> df2("o_country"), "Verified")
      .otherwise("Not Verified"))

df3.show()

spark.sql("SELECT NULL AS city1, NULL AS city2").select($"city1" <=> $"city2").show

Results:

+-----------------+
|(city1 <=> city2)|
+-----------------+
|             true|
+-----------------+

Spark Scala: Check if string isn't null or empty

Does this work for you? It uses <=>.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.lit
import spark.implicits._

def notNullAndEmpty(c: Column): Column = {
  c.isNotNull and !(c <=> lit(""))
}

val employees = spark.createDataFrame(Seq(
  ("E1", "100.0"), ("E2", "200.0"), ("E3", null), ("E4", "")
)).toDF("employee", "salary")

employees.show()
employees.filter(notNullAndEmpty($"salary")).show()

Data:

+--------+------+
|employee|salary|
+--------+------+
|      E1| 100.0|
|      E2| 200.0|
|      E3|  null|
|      E4|      |
+--------+------+

Result:

+--------+------+
|employee|salary|
+--------+------+
|      E1| 100.0|
|      E2| 200.0|
+--------+------+

How to filter out a null value from a Spark DataFrame

Let's say you have this data setup (so that results are reproducible):

// declaring data types
case class Company(cName: String, cId: String, details: String)
case class Employee(name: String, id: String, email: String, company: Company)

// setting up example data
val e1 = Employee("n1", null, "n1@c1.com", Company("c1", "1", "d1"))
val e2 = Employee("n2", "2", "n2@c1.com", Company("c1", "1", "d1"))
val e3 = Employee("n3", "3", "n3@c1.com", Company("c1", "1", "d1"))
val e4 = Employee("n4", "4", "n4@c2.com", Company("c2", "2", "d2"))
val e5 = Employee("n5", null, "n5@c2.com", Company("c2", "2", "d2"))
val e6 = Employee("n6", "6", "n6@c2.com", Company("c2", "2", "d2"))
val e7 = Employee("n7", "7", "n7@c3.com", Company("c3", "3", "d3"))
val e8 = Employee("n8", "8", "n8@c3.com", Company("c3", "3", "d3"))
val employees = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
val df = sc.parallelize(employees).toDF

Data is:

+----+----+---------+---------+
|name|  id|    email|  company|
+----+----+---------+---------+
|  n1|null|n1@c1.com|[c1,1,d1]|
|  n2|   2|n2@c1.com|[c1,1,d1]|
|  n3|   3|n3@c1.com|[c1,1,d1]|
|  n4|   4|n4@c2.com|[c2,2,d2]|
|  n5|null|n5@c2.com|[c2,2,d2]|
|  n6|   6|n6@c2.com|[c2,2,d2]|
|  n7|   7|n7@c3.com|[c3,3,d3]|
|  n8|   8|n8@c3.com|[c3,3,d3]|
+----+----+---------+---------+

Now to filter employees with null ids, you will do --

df.filter("id is null").show

which will correctly show you the following:

+----+----+---------+---------+
|name|  id|    email|  company|
+----+----+---------+---------+
|  n1|null|n1@c1.com|[c1,1,d1]|
|  n5|null|n5@c2.com|[c2,2,d2]|
+----+----+---------+---------+

Coming to the second part of your question, you can replace the null ids with 0 and other values with 1 with this --

df.withColumn("id", when($"id".isNull, 0).otherwise(1)).show

This results in:

+----+---+---------+---------+
|name| id|    email|  company|
+----+---+---------+---------+
|  n1|  0|n1@c1.com|[c1,1,d1]|
|  n2|  1|n2@c1.com|[c1,1,d1]|
|  n3|  1|n3@c1.com|[c1,1,d1]|
|  n4|  1|n4@c2.com|[c2,2,d2]|
|  n5|  0|n5@c2.com|[c2,2,d2]|
|  n6|  1|n6@c2.com|[c2,2,d2]|
|  n7|  1|n7@c3.com|[c3,3,d3]|
|  n8|  1|n8@c3.com|[c3,3,d3]|
+----+---+---------+---------+

Including null values in an Apache Spark Join

Spark provides a special NULL-safe equality operator:

numbersDf
  .join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
  .drop(lettersDf("numbers"))

+-------+-------+
|numbers|letters|
+-------+-------+
|    123|    abc|
|    456|    def|
|   null|    zzz|
|       |    hhh|
+-------+-------+

Be careful not to use it with Spark 1.5 or earlier. Prior to Spark 1.6 it required a Cartesian product (SPARK-11111 - Fast null-safe join).

In Spark 2.3.0 or later you can use Column.eqNullSafe in PySpark:

numbers_df = sc.parallelize([
    ("123", ), ("456", ), (None, ), ("", )
]).toDF(["numbers"])

letters_df = sc.parallelize([
    ("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")
]).toDF(["numbers", "letters"])

numbers_df.join(letters_df, numbers_df.numbers.eqNullSafe(letters_df.numbers))
+-------+-------+-------+
|numbers|numbers|letters|
+-------+-------+-------+
|    456|    456|    def|
|   null|   null|    zzz|
|       |       |    hhh|
|    123|    123|    abc|
+-------+-------+-------+

and %<=>% in SparkR:

numbers_df <- createDataFrame(data.frame(numbers = c("123", "456", NA, "")))
letters_df <- createDataFrame(data.frame(
  numbers = c("123", "456", NA, ""),
  letters = c("abc", "def", "zzz", "hhh")
))

head(join(numbers_df, letters_df, numbers_df$numbers %<=>% letters_df$numbers))
  numbers numbers letters
1     456     456     def
2    <NA>    <NA>     zzz
3                     hhh
4     123     123     abc

With SQL (Spark 2.2.0+) you can use IS NOT DISTINCT FROM:

SELECT * FROM numbers JOIN letters 
ON numbers.numbers IS NOT DISTINCT FROM letters.numbers

This can be used with the DataFrame API as well:

numbersDf.alias("numbers")
.join(lettersDf.alias("letters"))
.where("numbers.numbers IS NOT DISTINCT FROM letters.numbers")

Spark replacing null with 0 performance comparison

They are not the same, but performance should be similar. na.fill uses coalesce, but it replaces NaN as well as NULLs, not only NULLs.

val y = when($"x" === 0, $"x".cast("double")).when($"x" === 1, lit(null)).otherwise(lit("NaN").cast("double"))
val df = spark.range(0, 3).toDF("x").withColumn("y", y)

df.withColumn("y", when($"y".isNull, 0.0).otherwise($"y")).show()
df.na.fill(0.0, Seq("y")).show()
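Under the assumptions baked into the example above (y is NULL for x = 1 and NaN for x = 2), the two variants should differ on the NaN row: the when/isNull version leaves NaN untouched, while na.fill(0.0) replaces both. One hedged way to confirm this is to compare the generated plans:

// Inspect the generated expressions (exact output varies by Spark version)
df.withColumn("y", when($"y".isNull, 0.0).otherwise($"y")).explain()
df.na.fill(0.0, Seq("y")).explain()
// na.fill is expected to show something along the lines of
// coalesce(nanvl(y, null), 0.0), i.e. NaN is mapped to the fill value as well.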

