Proving SQL Query Equivalency

Proving SQL query equivalency

The best you can do is compare the 2 query outputs based on a given set of inputs looking for any differences. To say that they will always return the same results for all inputs really depends on the data.

For Oracle one of the better if not best approaches (very efficient) is here (Ctrl+F Comparing the Contents of Two Tables):

http://www.oracle.com/technetwork/issue-archive/2005/05-jan/o15asktom-084959.html

Which boils down to:

select c1,c2,c3, 
       count(src1) CNT1, 
       count(src2) CNT2
  from (select a.*, 
               1 src1, 
               to_number(null) src2 
          from a
        union all
        select b.*, 
               to_number(null) src1, 
               2 src2 
          from b
       )
group by c1,c2,c3
having count(src1) <> count(src2);

Check if two selects are equivalent

If you want to compare the query results try the following:

(select * from query1 MINUS select * from query2) 
UNION ALL
(select * from query2 MINUS select * from query1)

This will result in all rows that are returned by only one of the queries.

Query equivalence evaluation

You probably can't prove it, since the problem seems to be NP-complete; check this SO question on query equivalence (that one is about Oracle, but there are a couple of answers / links that should be relevant for you).

Query with join equivalency?

No, they are not equivalent. Example:

CREATE TABLE a
( keya int ) ;

CREATE TABLE b
( keyb int ) ;

CREATE TABLE c
( keyc int ) ;

INSERT INTO a
  VALUES
  (1) ;

INSERT INTO b
  VALUES
  (1),(2) ;

INSERT INTO c
  VALUES
  (2) ;

Results:

SELECT * 
FROM  a
  LEFT JOIN b 
    ON a.keya = b.keyb
  INNER JOIN c 
    ON c.keyc = b.keyb ;

Result
----------------------
| keya | keyb | keyc |
----------------------

SELECT * 
FROM a
  LEFT JOIN 
    ( SELECT b.*, c.* 
      FROM  b
        INNER JOIN c 
          ON c.keyc = b.keyb
    ) sub 
    ON a.keya = sub.keyb ;

Result
----------------------
| keya | keyb | keyc |
----------------------
|   1  | NULL | NULL |
----------------------

As to why this happens, a LEFT JOIN b INNER JOIN c is parsed as (a LEFT JOIN b) INNER JOIN c which is equivalent to (a INNER JOIN b) INNER JOIN c because the condition on the INNER join cancels the LEFT join.

You can also write the second query in this form - without subquery - which is parsed as a LEFT JOIN (b INNER JOIN c) because of the different placing of the ON clauses:

SELECT * 
FROM a
  LEFT JOIN 
        b
      INNER JOIN c 
        ON c.keyc = b.keyb
    ON a.keya = b.keyb ;

Result
----------------------
| keya | keyb | keyc |
----------------------
|   1  | NULL | NULL |
----------------------

Is Inner join Equivalence relation?

According to SQL standard they are equivalent. In fact, the database engine's query optimizer is free to rewrite it to a form that is faster to run.

In most database engines I have encountered, they are completely equivalent.

However, if you are talking about OUTER JOIN's, then of course the left/right ordering is significant, not just for performance reasons.

How to Prove that using subselect queries in SQL is killing performance of server

"Show, Don't Tell" - Examine and compare the query plans of the queries identified using SQL Profiler. Particularly look out for table scans and bookmark lookups (you want to see index seeks as often as possible). The 'goodness of fit' of query plans depends on up-to-date statistics, what indexes are defined, the holistic query workload.

Execution Plan Basics
Understanding More Complex Query Plans
Using SQL Server Profiler (2005 Version)

Run the queries in SQL Server Management Studio (SSMS) and turn on Query->Include Actual Execution Plan (CTRL+M)

Think yourself lucky they're only subselects (which in some cases the optimiser will produce equivalent 'join plans') and not correlated sub-queries!

Identify a query that is performing a high number of logical reads, re-write it using your preferred technique and then show how few logicals reads it does by comparison.

Here's a tip. To get the total number of logical reads performed, wrap a query in question with:

SET STATISTICS IO ON
GO

-- Run your query here

SET STATISTICS IO OFF
GO

Run your query, and switch to the messages tab in the results pane.

If you are interested in learning more, there is no better book than SQL Server 2008 Query Performance Tuning Distilled, which covers the essential techniques for monitoring, interpreting and fixing performance issues.

Are these SQL Statements equivalent?

As a final result, yes.
Regarding the execution: the query optimizer might end up creating the same query execution plan for both queries.

This will be the case if, according to its approximate statistics (approximate equi-depth histograms for instance - which are not all the time up-to-date, by the way), the optimizer will determine that the first join is more selective than the second one and, consequently, it will execute this one first.

Stmt1 allows you to specify the order of the joins and, considering that you know exactly what the tables contain, this might be a better solution.

SQL Query optimization - distributive law of natural join and difference

NaturalJoin (R,S-T) equivalence Difference(NaturalJoin(R,S),NaturalJoin(R,T))

The general way to deal with this is to replace operator calls by their definitions.

Here is an outline assuming certain equivalences between relation expressions and the tuples they hold. One actually needs to use the equivalences to justify that one's queries return the tuples that one has been asked to get, but this is not usually explained. (One instead learns via a lot of examples and handwaving.)

S & T have the same set of attributes.

X holds rows (...) where X(...), ie (...) IN X.

NATURALJOIN(X,Y) holds rows where X(...) AND Y(...).

DIFFERENCE(X,Y) holds rows where X(...) AND NOT Y(...).

The left holds rows where:

R(...) AND (S(...) AND NOT T(...))
R(...) AND S(...) AND NOT T(...)

The right holds rows where:

(R(...) AND S(...)) AND NOT (R(...) AND T(...))
(R(...) AND S(...)) AND (NOT R(...) OR NOT T(...))
((R(...) AND S(...)) AND NOT R(...)) OR ((R(...) AND S(...)) AND NOT T(...))
(R(...) AND S(...) AND NOT R(...)) OR (R(...) AND S(...) AND NOT T(...))
R(...) AND S(...) AND NOT T(...)

So they are equivalent.

You can convert this to a proof by replacing X(...) by x IN X and using appropriate quantifications (FORALL & FORSOME/EXISTS) and set comprehensions ({variable|wff}).

Re using natural join in reasoning & SQL see this answer and its links.

And if you know what query could be more optimal in a sense of runtime, that would be really helpful.

That depends on your DMBS and its query implementation/optimization. There is no "optimal" without an execution model, cost/benefit function and that function's input arguments. Moreover "optimal" is chaotic--a tiny change in relational & physical DDL, database contents & statistics, query DML, query & update patterns and DBMS implementation can give completely different tradeoffs.

Proving SQL Query Equivalency