SPARK SQL - case when then
Before Spark 1.2.0
The supported syntax (which I just tried out on Spark 1.0.2) seems to be
SELECT IF(1=1, 1, 0) FROM table
This recent thread http://apache-spark-user-list.1001560.n3.nabble.com/Supported-SQL-syntax-in-Spark-SQL-td9538.html links to the SQL parser source, which may or may not help depending on your comfort with Scala. At the very least the list of keywords starting (at time of writing) on line 70 should help.
Here's the direct link to the source for convenience: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala.
Update for Spark 1.2.0 and beyond
As of Spark 1.2.0, the more traditional syntax is supported, in response to SPARK-3813: search for "CASE WHEN" in the test source. For example:
SELECT CASE WHEN key = 1 THEN 1 ELSE 2 END FROM testData
Update for the most recent location of the SQL parser source
The parser source can now be found here.
Update for more complex examples
In response to a question below, the modern syntax supports complex Boolean conditions.
SELECT
CASE WHEN id = 1 OR id = 2 THEN "OneOrTwo" ELSE "NotOneOrTwo" END AS IdRedux
FROM customer
You can involve multiple columns in the condition.
SELECT
CASE WHEN id = 1 OR state = 'MA'
THEN "OneOrMA"
ELSE "NotOneOrMA" END AS IdRedux
FROM customer
You can also nest CASE WHEN THEN expressions.
SELECT
CASE WHEN id = 1
THEN "OneOrMA"
ELSE
CASE WHEN state = 'MA' THEN "OneOrMA" ELSE "NotOneOrMA" END
END AS IdRedux
FROM customer
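For comparison, here is a rough DataFrame API equivalent of the nested example; this is only a sketch, assuming customer is available as a DataFrame with id and state columns:
from pyspark.sql.functions import when, col

# Nested when/otherwise mirrors the nested CASE WHEN above
customer.select(
    when(col("id") == 1, "OneOrMA")
    .otherwise(
        when(col("state") == "MA", "OneOrMA").otherwise("NotOneOrMA")
    )
    .alias("IdRedux")
).show()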
Spark SQL: cast a variable from a case when
You can put the alias outside of the cast:
SELECT cast(CASE WHEN my_var = 'one'
THEN null
ELSE 0
END as int) as new_var
FROM my_table
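A roughly equivalent DataFrame API version builds the conditional first and then casts the whole expression; this is a sketch, assuming my_table can be read with spark.table:
from pyspark.sql.functions import when, col, lit

# CASE WHEN ... END, cast to int, then aliased
spark.table("my_table").select(
    when(col("my_var") == "one", lit(None)).otherwise(lit(0)).cast("int").alias("new_var")
)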
case statement in Spark SQL
This is the solution that worked for my problem:
Select inputTable.*,
case
when OPP_amount_eur between 0 and 30000 then 1
when OPP_amount_eur between 30000 and 50000 then 2
when OPP_amount_eur between 50000 and 100000 then 3
when OPP_amount_eur between 100000 and 300000 then 4
when OPP_amount_eur between 300000 and 500000 then 5
when OPP_amount_eur between 500000 and 1000000 then 6
else 7
end as OPP_amount_eur_binned
from inputTable
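The same banding can be sketched with chained when() calls in the DataFrame API, assuming inputTable is also available as a DataFrame (e.g. via spark.table):
from pyspark.sql.functions import when, col

# Chained when() calls evaluate in order, like the CASE branches above
binned = inputTable.withColumn(
    "OPP_amount_eur_binned",
    when(col("OPP_amount_eur").between(0, 30000), 1)
    .when(col("OPP_amount_eur").between(30000, 50000), 2)
    .when(col("OPP_amount_eur").between(50000, 100000), 3)
    .when(col("OPP_amount_eur").between(100000, 300000), 4)
    .when(col("OPP_amount_eur").between(300000, 500000), 5)
    .when(col("OPP_amount_eur").between(500000, 1000000), 6)
    .otherwise(7),
)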
How to do case when exists... in spark sql
The normal way to do this is to left outer join to a summary of table b:
Select a.id, Case When IsNull(b.id) Then 'N' else 'Y' end as Col_1
From A Left Outer Join
(Select distinct id from tableb) b On A.id=b.id
That way you are not repeatedly executing a lookup query for every id in A.
Addition
Your comment indicated that you are trying to create multiple Y/N columns based on values in table b. Your example had a Y/N Col_1 for values 1, 2, 3 and a Y/N Col_2 for values 4, 5, 6.
You can get there easily with one summarization of table b:
Select a.id, Case When IsNull(b.val123) Then 'N' else 'Y' end as Col_1,
Case When IsNull(b.val456) Then 'N' Else 'Y' end as Col_2
From A Left Outer Join
(Select id, max(Case When value in (1,2,3) Then 'Y' End) as val123,
max(Case When value in (4,5,6) Then 'Y' End) as val456
From tableb
Group By id) b On A.id=b.id
This still accomplishes that lookup with only one summarization of table b.
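For reference, the same summarize-then-join pattern can be sketched in the DataFrame API; the table and column names below follow the SQL above and are assumed to exist:
from pyspark.sql.functions import col, max as max_, when

# Summarize tableb once: one row per id with a 'Y' marker per value group
b_summary = (
    spark.table("tableb")
    .groupBy("id")
    .agg(
        max_(when(col("value").isin(1, 2, 3), "Y")).alias("val123"),
        max_(when(col("value").isin(4, 5, 6), "Y")).alias("val456"),
    )
)

# Left join to A and turn the markers into Y/N columns
result = (
    spark.table("A")
    .join(b_summary, on="id", how="left")
    .select(
        col("id"),
        when(col("val123").isNull(), "N").otherwise("Y").alias("Col_1"),
        when(col("val456").isNull(), "N").otherwise("Y").alias("Col_2"),
    )
)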
How to run case when statement with spark sql?
Try with == true:
spark.sql("""select
object,
case when green == true then 'A'
when green == false then 'B'
else 'C' end as case_when_statement
from tmp""").show()
//+------+-------------------+
//|object|case_when_statement|
//+------+-------------------+
//| 1| A|
//| 2| B|
//| 3| A|
//+------+-------------------+
In the DataFrame API, use === true:
df.withColumn("case_when_statement", when(col("green") === true,lit("A")).when(col("green")===false,lit("B")).otherwise(lit("c"))).show()
//+------+-----+-------------------+
//|object|green|case_when_statement|
//+------+-----+-------------------+
//| 1| true| A|
//| 2|false| B|
//| 3| true| A|
//+------+-----+-------------------+
Spark: Using null checking in a CASE WHEN expression to protect against type errors
In many situations, the Spark optimiser will execute ALL parts of your case expression, even though some appear to be unreachable.
In the example given in the question, we can show that Spark executes BOTH:
when str_col_r is null or str_col_l is null then -1
AND
else rel_length_py(str_col_l, str_col_r)
even in cases where str_col_r is null
or str_col_l is null
Here is some example code. The DataFrame is as follows, where the second row is repeated 100 times.
| str_col_l | str_col_r |
|:------------|:------------|
| a | b |
| a string | null |
| a string | null |
| a string | null |
...96 repeats...
| a string | null |
I have set:
conf.set("spark.sql.shuffle.partitions", "1")
conf.set("spark.default.parallelism", "1")
We run a UDF that sleeps for 1 second whenever it is executed:
%%time
import time
from pyspark.sql import Row
from pyspark.sql.types import DoubleType

def rel_length(str1, str2):
    time.sleep(1)
    if str1 is None or str2 is None:
        return -0.9
    return len(str1) / len(str2)

spark.udf.register("rel_length_py", rel_length, DoubleType())

rows = [{"str_col_l": "a", "str_col_r": "b"}] + [{"str_col_l": "a string", "str_col_r": None}]*100
df = spark.createDataFrame(Row(**x) for x in rows)
df.createOrReplaceTempView("str_comp")
sql = """select
case
when str_col_r is null or str_col_l is null then -1
else rel_length_py(str_col_l, str_col_r)
end
as rel
from str_comp
"""
spark.sql(sql).toPandas().head(2)
CPU times: user 183 ms, sys: 61.8 ms, total: 245 ms
Wall time: 1min 46s
i.e. around 100 seconds: one sleep per row, even though 100 of the 101 rows should have been handled by the null branch.
Here the sleep statement is repositioned in the UDF so it sleeps for 1 second ONLY for the first row.
%%time
def rel_length(str1, str2):
    if str1 is None or str2 is None:
        return -0.9
    time.sleep(1)
    return len(str1) / len(str2)
spark.udf.register("rel_length_py", rel_length, DoubleType())
rows = [{"str_col_l": "a", "str_col_r": "b"}] + [{"str_col_l": "a string", "str_col_r": None}]*100
df = spark.createDataFrame(Row(**x) for x in rows)
df.createOrReplaceTempView("str_comp")
sql = """select
case
when str_col_r is null or str_col_l is null then -1
else rel_length_py(str_col_l, str_col_r)
end
as rel
from str_comp
"""
spark.sql(sql).toPandas().head(2)
CPU times: user 14.5 ms, sys: 6.42 ms, total: 20.9 ms
Wall time: 1.36 s
This proves that at least in some instances all parts of the case statement will execute. I don't believe that all parts are guaranteed to execute, because I've seen working examples that would error if all parts were being executed.
Implement SQL/CASE Statement in Pyspark where a column 'contains' a list of strings or a column is 'like' a list of strings
rlike(other): SQL RLIKE expression (LIKE with Regex). Returns a boolean Column based on a regex match. Parameters: other – an extended regex expression.
from pyspark.sql.functions import when, col, regexp_extract, lit

df = spark.createDataFrame(
[
('A00',),
('B00',),
('C00',),
('K00',),
('M00',),
('000',),
],["col1"])
df.withColumn(
    'col2',
    when(col('col1').rlike("[a-zA-Z]"), regexp_extract(col('col1'), "[a-zA-Z]", 0)).otherwise(lit(None)),
).show(10)
# +----+----+
# |col1|col2|
# +----+----+
# | A00| A|
# | B00| B|
# | C00| C|
# | K00| K|
# | M00| M|
# | 000|null|
# +----+----+
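If the goal is to match against an explicit list of strings rather than a character class, one option (a sketch, reusing the same df) is to join the list into a single regex alternation and pass it to rlike:
from pyspark.sql.functions import when, col, regexp_extract, lit

# Hypothetical list of prefixes; "^(A|B|K)" matches col1 values starting with any of them
prefixes = ["A", "B", "K"]
pattern = "^(" + "|".join(prefixes) + ")"

df.withColumn(
    "col2",
    when(col("col1").rlike(pattern), regexp_extract(col("col1"), pattern, 1)).otherwise(lit(None)),
).show()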