Select "Where Clause" Evaluation Order

Select where clause evaluation order

There are no guarantees about evaluation order. The optimizer tries to find the most efficient way to execute the query, using the available information.

In your case, since c is indexed and d isn't, the optimizer should look into the index to find all rows that match the predicate on c, then retrieve those rows from the table data to evaluate the predicate on d.

However, if it determines that the index on c isn't very selective, it may decide to do the table scan anyway (not the case in your example, but a gender column, say, is rarely selective enough to be usefully indexed).

To determine the execution order, get an explain plan for your query. Realize, however, that the plan may change over time, depending on what the optimizer thinks is the best strategy at that moment.
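
For example, most engines expose the plan with a one-line command. A minimal sketch (EXPLAIN works as shown in MySQL and PostgreSQL; SQL Server and Oracle have their own plan facilities; the table t and columns c, d are hypothetical):

-- show how the engine intends to execute the query, without running it
EXPLAIN
SELECT *
FROM t
WHERE c = 1 AND d = 2;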

Does the order of where clauses matter in SQL?

No, that order doesn't matter (or at least: shouldn't matter).

Any decent query optimizer will look at all the parts of the WHERE clause and figure out the most efficient way to satisfy that query.

I know the SQL Server query optimizer will pick a suitable index - no matter which order your two conditions are in. I assume other RDBMSs will have similar strategies.

What does matter is whether or not you have a suitable index for this!

In the case of SQL Server, it will likely use an index if you have any of the following (a minimal sketch follows the list):

  • an index on (LastName, FirstName)
  • an index on (FirstName, LastName)
  • an index on just (LastName), or just (FirstName) (or both)
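
For instance, a sketch of one such index (the Persons table name is an assumption; only LastName and FirstName come from the question):

-- either order of the two WHERE conditions can use this index
CREATE INDEX IX_Persons_LastFirst
ON Persons (LastName, FirstName);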

On the other hand - again for SQL Server - if you use SELECT * to grab all columns from a table, and the table is rather small, then there's a good chance the query optimizer will just do a table (or clustered index) scan instead of using an index, because the lookups into the data pages to fetch all the other columns quickly get too expensive.
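
One way to tip the balance back toward an index in that situation is a covering index that INCLUDEs the extra columns, so the lookup never happens. A hedged sketch in SQL Server syntax (table and column names are hypothetical):

-- covers the query below entirely, so no lookup into the data pages
CREATE INDEX IX_Persons_Name_Covering
ON Persons (LastName, FirstName)
INCLUDE (Email, Phone);

SELECT LastName, FirstName, Email, Phone
FROM Persons
WHERE LastName = 'Smith'
  AND FirstName = 'Anna';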

SQL - Explicit order of WHERE conditions?

Using a derived table:

SELECT *
FROM (
    SELECT *
    FROM INFORMATION_SCHEMA.TABLES
    WHERE ISNUMERIC(table_name) = 1
) AS i
WHERE CAST(table_name AS INT) <> 0

Alternatively, and more likely to be evaluated in the intended order, you can use a CASE expression:

SELECT *
FROM INFORMATION_SCHEMA.TABLES
WHERE 0 <> (CASE WHEN ISNUMERIC(table_name) = 1
                 THEN CAST(table_name AS INT)
                 ELSE 0 END)

It should be noted that for SQL Server there are situations in which the CASE trick will fail. See the Remarks in the documentation on CASE:

The CASE statement evaluates its conditions sequentially and stops with the first condition whose condition is satisfied. In some situations, an expression is evaluated before a CASE statement receives the results of the expression as its input. Errors in evaluating these expressions are possible. Aggregate expressions that appear in WHEN arguments to a CASE statement are evaluated first, then provided to the CASE statement. For example, the following query produces a divide by zero error when producing the value of the MAX aggregate. This occurs prior to evaluating the CASE expression.

WITH Data (value) AS
(
    SELECT 0
    UNION ALL
    SELECT 1
)
SELECT
    CASE
        WHEN MIN(value) <= 0 THEN 0
        WHEN MAX(1/value) >= 100 THEN 1
    END
FROM Data;

I suspect this might also be true for other RDBMS implementations.
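
A common way around this particular error (a sketch, not from the documentation) is to guard the expression itself, e.g. with NULLIF, so the division never sees a zero:

WITH Data (value) AS
(
    SELECT 0
    UNION ALL
    SELECT 1
)
SELECT
    CASE
        WHEN MIN(value) <= 0 THEN 0
        -- 1/NULL yields NULL, which MAX ignores, so the aggregate no longer errors
        WHEN MAX(1 / NULLIF(value, 0)) >= 100 THEN 1
    END
FROM Data;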

Execution order of conditions in SQL 'where' clause

Are you sure you "don't have the authority" to see an execution plan? What about using AUTOTRACE?

SQL> set autotrace on
SQL> select * from emp
2 join dept on dept.deptno = emp.deptno
3 where emp.ename like 'K%'
4 and dept.loc like 'l%'
5 /

no rows selected

Execution Plan
----------------------------------------------------------

---------------------------------------------------------------------------------
| Id  | Operation                    | Name         | Rows | Bytes | Cost (%CPU)|
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |              |    1 |    62 |     4   (0)|
|   1 |  NESTED LOOPS                |              |    1 |    62 |     4   (0)|
|*  2 |   TABLE ACCESS FULL          | EMP          |    1 |    42 |     3   (0)|
|*  3 |   TABLE ACCESS BY INDEX ROWID| DEPT         |    1 |    20 |     1   (0)|
|*  4 |    INDEX UNIQUE SCAN         | SYS_C0042912 |    1 |       |     0   (0)|
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("EMP"."ENAME" LIKE 'K%' AND "EMP"."DEPTNO" IS NOT NULL)
3 - filter("DEPT"."LOC" LIKE 'l%')
4 - access("DEPT"."DEPTNO"="EMP"."DEPTNO")

As you can see, that gives quite a lot of detail about how the query will be executed. It tells me that:

  • the condition "emp.ename like 'K%'" will be applied first, on the full scan of EMP
  • then the matching DEPT records will be selected via the index on dept.deptno (via the NESTED LOOPS method)
  • finally, the filter "dept.loc like 'l%'" will be applied.

This order of application has nothing to do with the way the predicates are ordered in the WHERE clause, as we can show with this re-ordered query:

SQL> select * from emp
2 join dept on dept.deptno = emp.deptno
3 where dept.loc like 'l%'
4 and emp.ename like 'K%';

no rows selected

Execution Plan
----------------------------------------------------------

---------------------------------------------------------------------------------
| Id  | Operation                    | Name         | Rows | Bytes | Cost (%CPU)|
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |              |    1 |    62 |     4   (0)|
|   1 |  NESTED LOOPS                |              |    1 |    62 |     4   (0)|
|*  2 |   TABLE ACCESS FULL          | EMP          |    1 |    42 |     3   (0)|
|*  3 |   TABLE ACCESS BY INDEX ROWID| DEPT         |    1 |    20 |     1   (0)|
|*  4 |    INDEX UNIQUE SCAN         | SYS_C0042912 |    1 |       |     0   (0)|
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("EMP"."ENAME" LIKE 'K%' AND "EMP"."DEPTNO" IS NOT NULL)
3 - filter("DEPT"."LOC" LIKE 'l%')
4 - access("DEPT"."DEPTNO"="EMP"."DEPTNO")
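
If AUTOTRACE is not available either, EXPLAIN PLAN plus DBMS_XPLAN is another option (a sketch; both are standard Oracle features, though they require the PLAN_TABLE and some privileges):

EXPLAIN PLAN FOR
SELECT * FROM emp
JOIN dept ON dept.deptno = emp.deptno
WHERE emp.ename LIKE 'K%'
AND dept.loc LIKE 'l%';

-- render the plan that was just stored in the PLAN_TABLE
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);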

Which is performed first: the WHERE clause or the JOIN clause?

The conceptual order of query processing is the following (a practical consequence is sketched just after the list):

1. FROM
2. WHERE
3. GROUP BY
4. HAVING
5. SELECT
6. ORDER BY
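
One practical consequence of this conceptual order: an alias defined in the SELECT list is not visible in the WHERE clause, because WHERE is processed first. A minimal sketch (the orders table and its columns are hypothetical):

-- fails on most engines: 'total' does not exist yet when WHERE runs
SELECT price * qty AS total FROM orders WHERE total > 100;

-- works: repeat the expression (or wrap it in a derived table)
SELECT price * qty AS total FROM orders WHERE price * qty > 100;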

But this is just a conceptual order. In fact the engine may decide to rearrange the evaluation. Here is a demonstration. Let's create 2 tables with 1,000,000 rows each:

CREATE TABLE test1 (id INT IDENTITY(1, 1), name VARCHAR(10))
CREATE TABLE test2 (id INT IDENTITY(1, 1), name VARCHAR(10))

;WITH cte AS(SELECT -1 + ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) d FROM
(VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t1(n) CROSS JOIN
(VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t2(n) CROSS JOIN
(VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t3(n) CROSS JOIN
(VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t4(n) CROSS JOIN
(VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t5(n) CROSS JOIN
(VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t6(n))

INSERT INTO test1(name) SELECT 'a' FROM cte

-- populate test2 as well, so the join queries below have rows on both sides
INSERT INTO test2(name) SELECT name FROM dbo.test1

Now run 2 queries:

SELECT * FROM dbo.test1 t1
JOIN dbo.test2 t2 ON t2.id = t1.id AND t2.id = 100
WHERE t1.id > 1

SELECT * FROM dbo.test1 t1
JOIN dbo.test2 t2 ON t2.id = t1.id
WHERE t1.id = 1

Notice that the first query filters most rows out in the join condition, while the second filters in the WHERE condition. Look at the produced plans:

Query 1: TableScan - Predicate: [Test].[dbo].[test2].[id] as [t2].[id]=(100)

Query 2: TableScan - Predicate: [Test].[dbo].[test2].[id] as [t2].[id]=(1)

This means that for the first query the engine decided to evaluate the join condition first to filter out rows, while for the second query it evaluated the WHERE clause first.
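
You can reproduce this yourself; one way (a sketch, SQL Server syntax) is to ask for the textual plan instead of executing the query:

SET SHOWPLAN_TEXT ON;
GO

-- the plan shows t2.id = (100) pushed into the scan of test2,
-- even though the condition is written in the JOIN clause
SELECT * FROM dbo.test1 t1
JOIN dbo.test2 t2 ON t2.id = t1.id AND t2.id = 100
WHERE t1.id > 1;
GO

SET SHOWPLAN_TEXT OFF;
GO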

Order of evaluation of predicates in Spark SQL where clause

General

Good question.

The answer is inferred from testing a scenario and making deductions, as I could not find suitable docs. This is a second attempt, since many statements on the web cannot be backed up.

I think this question is not about Spark 3.x AQE aspects; rather it is
about, say, a DataFrame that is part of Stage N of a Spark app, has
passed the stage of acquiring data from sources at rest, and is now
subject to filtering with multiple predicates.

The central point, then, is: does it matter how the predicates are
ordered, or does Spark (Catalyst) re-order the predicates to minimize
the work to be done?

  • The premise here is that filtering the maximum amount of data out first makes more sense than evaluating a predicate that filters very
    little out.
    • This is a well-known RDBMS point, referring to sargable predicates (a definition that has evolved over time).
      • A lot of that discussion focuses on indexes; Spark and Hive do not have these, but DataFrames are columnar.

Point 1

You can try this in a %sql cell:

 EXPLAIN EXTENDED select k, sum(v) from values (1, 2), (1, 3) t(k, v) group by k;

From this you can see whether any re-arranging of predicates is going
on, but I saw no such aspects in the Physical Plan in non-AQE mode on
Databricks. Refer to
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-qry-explain.html.

I have read here and there that Catalyst can re-arrange filtering. To
what extent requires a lot of research; I was not able to confirm it.
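
One way to check a specific query yourself (a sketch using an inline VALUES table, as in the example above) is to EXPLAIN it with the predicates written in both orders and compare the Filter node in the two physical plans:

EXPLAIN EXTENDED
SELECT k, v FROM VALUES (1, 2), (1, 3) t(k, v)
WHERE v > 2 AND k = 1;

EXPLAIN EXTENDED
SELECT k, v FROM VALUES (1, 2), (1, 3) t(k, v)
WHERE k = 1 AND v > 2;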

Also an interesting read:
https://www.waitingforcode.com/apache-spark-sql/catalyst-optimizer-in-spark-sql/read

Point 2

I ran the following admittedly contrived examples with the same
functional query but with the predicates reversed, using a column that
has high cardinality, testing for a value that does not in fact exist,
and then compared the count of an accumulator that is incremented
inside a UDF each time it is called.

Scenario 1

import org.apache.spark.sql.functions._

def randomInt1to1000000000 = scala.util.Random.nextInt(1000000000) + 1
def randomInt1to10 = scala.util.Random.nextInt(10) + 1
def randomInt1to1000000 = scala.util.Random.nextInt(1000000) + 1

// 1M rows; "hc" has high cardinality, "lc" low cardinality
val df = sc.parallelize(Seq.fill(1000000){
    (randomInt1to1000000, randomInt1to1000000000, randomInt1to10)
  }).toDF("nuid", "hc", "lc")
  .withColumn("text", lpad($"nuid", 3, "0"))
  .withColumn("literal", lit(1))

val accumulator = sc.longAccumulator("udf_call_count")

// the UDF bumps the accumulator on every invocation
spark.udf.register("myUdf", (x: String) => {
  accumulator.add(1)
  x.length
})

accumulator.reset()
// UDF predicate written first; "hc = -4" matches no rows
df.where("myUdf(text) = 3 and hc = -4").select(max($"text")).show(false)
println(s"Number of UDF calls ${accumulator.value}")

returns:

+---------+
|max(text)|
+---------+
|null     |
+---------+

Number of UDF calls 1000000

Scenario 2

import org.apache.spark.sql.functions._

def randomInt1to1000000000 = scala.util.Random.nextInt(1000000000) + 1
def randomInt1to10 = scala.util.Random.nextInt(10) + 1
def randomInt1to1000000 = scala.util.Random.nextInt(1000000) + 1

val dfA = sc.parallelize(Seq.fill(1000000){
    (randomInt1to1000000, randomInt1to1000000000, randomInt1to10)
  }).toDF("nuid", "hc", "lc")
  .withColumn("text", lpad($"nuid", 3, "0"))
  .withColumn("literal", lit(1))

val accumulator = sc.longAccumulator("udf_call_count")

spark.udf.register("myUdf", (x: String) => {
  accumulator.add(1)
  x.length
})

accumulator.reset()
// same query, predicates reversed: "hc = -4" is now written first
dfA.where("hc = -4 and myUdf(text) = 3").select(max($"text")).show(false)
println(s"Number of UDF calls ${accumulator.value}")

returns:

+---------+
|max(text)|
+---------+
|null     |
+---------+

Number of UDF calls 0

My conclusion here is that:

  • There is left-to-right evaluation - in this case - since Scenario 2 registers 0 UDF calls (the accumulator value is 0), as opposed to Scenario 1 with 1M calls registered.

  • So, the re-ordering of predicate processing that, say, Oracle and DB2 may do for Stage 1 predicates does not apply here.

Point 3

However, I note the following from the manual,
https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html:

Evaluation order and null checking

Spark SQL (including SQL and the DataFrame and Dataset APIs) does not
guarantee the order of evaluation of subexpressions. In particular,
the inputs of an operator or function are not necessarily evaluated
left-to-right or in any other fixed order. For example, logical AND
and OR expressions do not have left-to-right “short-circuiting”
semantics.

Therefore, it is dangerous to rely on the side effects or order of
evaluation of Boolean expressions, and the order of WHERE and HAVING
clauses, since such expressions and clauses can be reordered during
query optimization and planning. Specifically, if a UDF relies on
short-circuiting semantics in SQL for null checking, there’s no
guarantee that the null check will happen before invoking the UDF. For
example,

spark.udf.register("strlen", (s: String) => s.length)
spark.sql("select s from test1 where s is not null and strlen(s) > 1") // no guarantee

This WHERE clause does not guarantee the strlen UDF to be invoked
after filtering out nulls.

To perform proper null checking, we recommend that you do either of
the following:

  • Make the UDF itself null-aware and do null checking inside the UDF itself.
  • Use IF or CASE WHEN expressions to do the null check and invoke the UDF in a conditional branch.

spark.udf.register("strlen_nullsafe", (s: String) => if (s != null) s.length else -1)
spark.sql("select s from test1 where s is not null and strlen_nullsafe(s) > 1") // ok
spark.sql("select s from test1 where if(s is not null, strlen(s), null) > 1") // ok

So there is a slight contradiction: the test above showed left-to-right evaluation, but the documentation says that order is not guaranteed.

SQL - Does the order of WHERE conditions matter?

No, the order of the WHERE clauses does not matter.

The optimizer reviews the query and determines the best means of getting the data, based on indexes and such. Even if there were a covering index on the category_id and author columns, either order of the conditions would satisfy the criteria for using it (assuming there isn't something better).
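
For instance, a minimal sketch (the posts table name is an assumption; category_id and author come from the question):

CREATE INDEX idx_posts_cat_author ON posts (category_id, author);

-- both queries can use the index above; the written order of the
-- conditions makes no difference
SELECT * FROM posts WHERE category_id = 3 AND author = 'Alice';
SELECT * FROM posts WHERE author = 'Alice' AND category_id = 3;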

Oracle SQL clause evaluation order

The select list cannot always be evaluated last, because the ORDER BY can use aliases that are defined in the select list, so the ORDER BY must be executed after it. For example:

SELECT foo + bar AS foobar FROM table1 ORDER BY foobar

I'd say that in general the order of execution could be something like this:

  • FROM
  • WHERE
  • GROUP BY
  • SELECT
  • HAVING
  • ORDER BY

The GROUP BY and the WHERE steps could be swapped without changing the result (at least when the WHERE clause references only grouping columns), as could the HAVING and ORDER BY.

In reality things are more complex because the database can reorder the execution according to different execution plans. As long as the result remains the same it doesn't matter in what order it is executed.

Note also that if an index is chosen for the ORDER BY clause the rows could already be in the correct order when they are read from disk. In this case the ORDER BY clause isn't really executed at all.
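
For instance, a sketch (hypothetical index on the emp table used earlier) of the case where the rows already come back sorted:

CREATE INDEX idx_emp_ename ON emp (ename);

-- if the optimizer reads the rows via idx_emp_ename, they arrive
-- already sorted by ename and no separate sort step is required
SELECT ename FROM emp ORDER BY ename;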


