Spark SQL - Case When Then

Before Spark 1.2.0

The supported syntax (which I just tried out on Spark 1.0.2) seems to be

SELECT IF(1=1, 1, 0) FROM table

This recent thread http://apache-spark-user-list.1001560.n3.nabble.com/Supported-SQL-syntax-in-Spark-SQL-td9538.html links to the SQL parser source, which may or may not help depending on your comfort with Scala. At the very least, the list of keywords starting (at the time of writing) on line 70 should help.

Here's the direct link to the source for convenience: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala.

Update for Spark 1.2.0 and beyond

As of Spark 1.2.0, the more traditional syntax is supported, in response to SPARK-3813: search for "CASE WHEN" in the test source. For example:

SELECT CASE WHEN key = 1 THEN 1 ELSE 2 END FROM testData
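For reference, here is a minimal PySpark sketch of running that query; the testData view and the flag alias are illustrative, not taken from the original question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical testData view with a single "key" column, for illustration only.
spark.createDataFrame([(1,), (2,), (3,)], ["key"]).createOrReplaceTempView("testData")

spark.sql("SELECT key, CASE WHEN key = 1 THEN 1 ELSE 2 END AS flag FROM testData").show()
# +---+----+
# |key|flag|
# +---+----+
# |  1|   1|
# |  2|   2|
# |  3|   2|
# +---+----+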

Update on where to find the current SQL parser source

The parser source can now be found here.

Update for more complex examples

In response to a question below, the modern syntax supports complex Boolean conditions.

SELECT
CASE WHEN id = 1 OR id = 2 THEN "OneOrTwo" ELSE "NotOneOrTwo" END AS IdRedux
FROM customer

You can involve multiple columns in the condition.

SELECT
CASE WHEN id = 1 OR state = 'MA'
THEN "OneOrMA"
ELSE "NotOneOrMA" END AS IdRedux
FROM customer

You can also nest CASE WHEN THEN expressions.

SELECT
CASE WHEN id = 1
THEN "OneOrMA"
ELSE
CASE WHEN state = 'MA' THEN "OneOrMA" ELSE "NotOneOrMA" END
END AS IdRedux
FROM customer
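The same nesting can be expressed in the DataFrame API by nesting when(...).otherwise(...) calls. A rough sketch, assuming the customer table above is registered and has id and state columns:

from pyspark.sql.functions import when, col, lit

spark.table("customer").select(
    "id",
    when(col("id") == 1, lit("OneOrMA"))
        .otherwise(
            when(col("state") == "MA", lit("OneOrMA"))
            .otherwise(lit("NotOneOrMA"))
        )
        .alias("IdRedux")
).show()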

Spark SQL: cast a variable from a case when

You can put the alias outside of the cast:

SELECT cast(CASE WHEN my_var = 'one'
THEN null
ELSE 0
END as int) as new_var
FROM my_table
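For comparison, a DataFrame API sketch of the same idea (using the my_table/my_var names from the query above): the cast is applied to the whole when/otherwise column, and the alias goes on last.

from pyspark.sql.functions import when, col, lit

spark.table("my_table").select(
    when(col("my_var") == "one", lit(None))   # null when my_var is 'one'
        .otherwise(lit(0))
        .cast("int")                          # cast the whole expression
        .alias("new_var")
).show()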

case statement in Spark SQL

This is the solution that worked for my problem:

Select inputTable.*,
  case
    when OPP_amount_eur between 0 and 30000 then 1
    when OPP_amount_eur between 30000 and 50000 then 2
    when OPP_amount_eur between 50000 and 100000 then 3
    when OPP_amount_eur between 100000 and 300000 then 4
    when OPP_amount_eur between 300000 and 500000 then 5
    when OPP_amount_eur between 500000 and 1000000 then 6
    else 7
  end as OPP_amount_eur_binned
from inputTable

Note that between is inclusive on both ends and CASE returns the first matching branch, so a boundary value such as exactly 30000 lands in bin 1.
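If you prefer the DataFrame API, the same binning can be sketched with chained when() calls, assuming inputTable is registered as a table; as with CASE, the first matching branch wins:

from pyspark.sql.functions import when, col

# Chained when() calls mirror the CASE branches above.
binned = (
    when(col("OPP_amount_eur").between(0, 30000), 1)
    .when(col("OPP_amount_eur").between(30000, 50000), 2)
    .when(col("OPP_amount_eur").between(50000, 100000), 3)
    .when(col("OPP_amount_eur").between(100000, 300000), 4)
    .when(col("OPP_amount_eur").between(300000, 500000), 5)
    .when(col("OPP_amount_eur").between(500000, 1000000), 6)
    .otherwise(7)
)

spark.table("inputTable").withColumn("OPP_amount_eur_binned", binned).show()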

How to do case when exists... in spark sql

The normal way to do this is to left outer join to a summary of table b:

Select a.id, Case When IsNull(b.id) Then 'N' else 'Y' end as Col_1
From A Left Outer Join
(Select distinct id from tableb) b On A.id=b.id

That way you are not repeatedly executing a lookup query for every id in A.

Addition
Your comment indicated that you are trying to create multiple Y/N columns based on the values in b. Your example had a Y/N in Col_1 when value was 1, 2, or 3, and a Y/N in Col_2 when value was 4, 5, or 6.
You can get there easily with one summarization of table b :

Select a.id, Case When IsNull(b.val123) Then 'N' Else 'Y' End as Col_1,
       Case When IsNull(b.val456) Then 'N' Else 'Y' End as Col_2
From A Left Outer Join
     (Select id,
             max(Case When value in (1,2,3) Then 'Y' End) as val123,
             max(Case When value in (4,5,6) Then 'Y' End) as val456
      From tableb
      Group By id) b On A.id = b.id

This still accomplishes that lookup with only one summarization of table b.
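A DataFrame API sketch of the same approach, assuming DataFrames a (with an id column) and b (with id and value); the summary of b is computed once and then left-joined:

from pyspark.sql.functions import when, col, max as max_

# Summarise table b once: a 'Y' marker per id for each value group.
b_summary = b.groupBy("id").agg(
    max_(when(col("value").isin(1, 2, 3), "Y")).alias("val123"),
    max_(when(col("value").isin(4, 5, 6), "Y")).alias("val456"),
)

a.join(b_summary, on="id", how="left").select(
    "id",
    when(col("val123").isNull(), "N").otherwise("Y").alias("Col_1"),
    when(col("val456").isNull(), "N").otherwise("Y").alias("Col_2"),
).show()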

How to run case when statement with spark sql?

Try with == true.

spark.sql("""select 
object,
case when green == true then 'A'
when green == false then 'B'
else 'C' end as case_when_statement
from tmp""").show()

//+------+-------------------+
//|object|case_when_statement|
//+------+-------------------+
//|     1|                  A|
//|     2|                  B|
//|     3|                  A|
//+------+-------------------+

In the DataFrame API, use === true:

df.withColumn("case_when_statement",
    when(col("green") === true, lit("A"))
      .when(col("green") === false, lit("B"))
      .otherwise(lit("C")))
  .show()
//+------+-----+-------------------+
//|object|green|case_when_statement|
//+------+-----+-------------------+
//|     1| true|                  A|
//|     2|false|                  B|
//|     3| true|                  A|
//+------+-----+-------------------+

Spark: Using null checking in a CASE WHEN expression to protect against type errors

In many situations, the Spark optimiser will execute ALL parts of your case expression, even though some appear to be unreachable.

In the example given in the question, we can show that Spark executes BOTH:

when str_col_r is null or str_col_l is null  then -1

AND

else rel_length_py(str_col_l, str_col_r) 

even in cases where str_col_r is null or str_col_l is null.

Here is some example code. The dataframe is as follows, where the second row is repeated 100 times.


| str_col_l | str_col_r |
|:----------|:----------|
| a         | b         |
| a string  | null      |
| a string  | null      |
| ...       | ...       |
| a string  | null      |

I have set:

conf.set("spark.sql.shuffle.partitions", "1")
conf.set("spark.default.parallelism", "1")

We run a UDF that sleeps for 1 second whenever it is executed:

%%time
import time
from pyspark.sql import Row
from pyspark.sql.types import DoubleType

def rel_length(str1, str2):
    # Sleeps on every call, so each invocation costs about 1 second.
    time.sleep(1)
    if str1 is None or str2 is None:
        return -0.9
    return len(str1) / len(str2)

spark.udf.register("rel_length_py", rel_length, DoubleType())

rows = [{"str_col_l": "a", "str_col_r": "b"}] + [{"str_col_l": "a string", "str_col_r": None}] * 100

df = spark.createDataFrame(Row(**x) for x in rows)
df.createOrReplaceTempView("str_comp")

sql = """select
    case
        when str_col_r is null or str_col_l is null then -1
        else rel_length_py(str_col_l, str_col_r)
    end as rel
from str_comp
"""
spark.sql(sql).toPandas().head(2)
CPU times: user 183 ms, sys: 61.8 ms, total: 245 ms
Wall time: 1min 46s

i.e. around 100 seconds.

Here the sleep statement is repositioned in the UDF so it sleeps for 1 second ONLY for the first row.

%%time

def rel_length(str1, str2):
    # Null check happens before the sleep, so null rows return immediately.
    if str1 is None or str2 is None:
        return -0.9
    time.sleep(1)
    return len(str1) / len(str2)

spark.udf.register("rel_length_py", rel_length, DoubleType())

rows = [{"str_col_l": "a", "str_col_r": "b"}] + [{"str_col_l": "a string", "str_col_r": None}] * 100

df = spark.createDataFrame(Row(**x) for x in rows)
df.createOrReplaceTempView("str_comp")

sql = """select
    case
        when str_col_r is null or str_col_l is null then -1
        else rel_length_py(str_col_l, str_col_r)
    end as rel
from str_comp
"""
spark.sql(sql).toPandas().head(2)
CPU times: user 14.5 ms, sys: 6.42 ms, total: 20.9 ms
Wall time: 1.36 s

This demonstrates that, at least in some instances, all branches of the CASE expression will execute. I don't believe that all branches are guaranteed to execute, because I've seen working examples that would error if they were.

Implement SQL/CASE Statement in Pyspark where a column 'contains' a list of string or a column 'like' a list of string

rlike(other): SQL RLIKE expression (LIKE with Regex). Returns a boolean Column based on a regex match. Parameters: other – an extended regex expression

from pyspark.sql.functions import when, col, regexp_extract, lit

df = spark.createDataFrame(
    [
        ('A00',),
        ('B00',),
        ('C00',),
        ('K00',),
        ('M00',),
        ('000',),
    ], ["col1"])

df.withColumn(
    'col2',
    when(col('col1').rlike("[a-zA-Z]"), regexp_extract(col('col1'), "[a-zA-Z]", 0))
    .otherwise(lit(None))
).show(10)

# +----+----+
# |col1|col2|
# +----+----+
# | A00|   A|
# | B00|   B|
# | C00|   C|
# | K00|   K|
# | M00|   M|
# | 000|null|
# +----+----+
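For the "contains a list of strings" part of the question, one common approach is to join the list into a single regex and pass it to rlike. A sketch using an illustrative keyword list and the df defined above:

from pyspark.sql.functions import when, col, lit

# Illustrative list of substrings/patterns to look for in col1.
keywords = ["A", "B", "K"]
pattern = "|".join(keywords)   # -> "A|B|K"

df.withColumn(
    "col2",
    when(col("col1").rlike(pattern), lit("matched")).otherwise(lit(None))
).show()

If the list entries can contain regex metacharacters, escape them (for example with re.escape) before joining.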

