Does SQL Server Support the IS DISTINCT FROM Clause?

Does SQL Server support IS DISTINCT FROM clause?

IS [NOT] DISTINCT FROM is supported starting with SQL Server 2022; see:

  • What's new in SQL Server 2022
  • IS [NOT] DISTINCT FROM (Transact-SQL)

For earlier versions of SQL Server, the following Stack Overflow question explains how to work around the missing predicate with equivalent (but more verbose) expressions:

  • How to rewrite IS DISTINCT FROM and IS NOT DISTINCT FROM in SQL Server 2008R2?

Does SELECT DISTINCT differ from SELECT when using a NOT IN clause?

From a functional point of view, the queries with or without DISTINCT are identical (they would delete the same set of rows).

From a performance point of view, I expect SQL Server to produce the same execution plan for both queries, although I cannot prove this.

For other database engines, this may be different. See:

  • https://mariadb.com/kb/en/optimizing-group-by/
  • https://www.quora.com/Should-I-use-DISTINCT-in-a-subquery-when-using-IN
  • https://docs.oracle.com/javadb/10.8.3.0/tuning/ctuntransform867165.html
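The functional equivalence described above can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the parent and child tables are hypothetical, so it only illustrates the semantics of NOT IN with and without DISTINCT, not the execution plan of any particular engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE child  (parent_id INTEGER NOT NULL);
    INSERT INTO parent VALUES (1), (2), (3), (4);
    -- parent_id values repeat on purpose, so DISTINCT has something to remove
    INSERT INTO child  VALUES (1), (1), (2), (2), (2);
""")

# Rows a DELETE ... WHERE id NOT IN (...) would target, without DISTINCT
plain = conn.execute(
    "SELECT id FROM parent "
    "WHERE id NOT IN (SELECT parent_id FROM child) ORDER BY id"
).fetchall()

# The same selection with DISTINCT in the subquery
dedup = conn.execute(
    "SELECT id FROM parent "
    "WHERE id NOT IN (SELECT DISTINCT parent_id FROM child) ORDER BY id"
).fetchall()

print(plain, dedup)
```

Both queries select the same parent rows, because membership in a set does not depend on how many times a value is listed.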

How to rewrite IS DISTINCT FROM and IS NOT DISTINCT FROM in SQL Server 2008R2?

The IS DISTINCT FROM predicate was introduced as feature T151 of SQL:1999, and its readable negation, IS NOT DISTINCT FROM, was added as feature T152 of SQL:2003. The purpose of these predicates is to guarantee that the result of comparing two values is either True or False, never Unknown.

These predicates work with any comparable type (including rows, arrays and multisets), which makes them rather complicated to emulate exactly. However, SQL Server doesn't support most of these types, so we can get pretty far by checking for null operands:

  • a IS DISTINCT FROM b can be rewritten as:

    ((a <> b OR a IS NULL OR b IS NULL) AND NOT (a IS NULL AND b IS NULL))

  • a IS NOT DISTINCT FROM b can be rewritten as:

    (NOT (a <> b OR a IS NULL OR b IS NULL) OR (a IS NULL AND b IS NULL))

Your own answer is incorrect as it fails to consider that FALSE OR NULL evaluates to Unknown. For example, NULL IS DISTINCT FROM NULL should evaluate to False. Similarly, 1 IS NOT DISTINCT FROM NULL should evaluate to False. In both cases, your expressions yield Unknown.
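As a sanity check, the two rewrites can be evaluated for every combination of equal, unequal and null operands. The sketch below runs them through Python's built-in sqlite3 module, which is an assumption made for convenience (SQLite follows the same three-valued logic for <>, AND, OR and NOT); the evaluate helper is hypothetical. A result of None would mean Unknown.

```python
import sqlite3
from itertools import product

# The two rewrites, copied from the answer
DISTINCT_FROM = "((a <> b OR a IS NULL OR b IS NULL) AND NOT (a IS NULL AND b IS NULL))"
NOT_DISTINCT_FROM = "(NOT (a <> b OR a IS NULL OR b IS NULL) OR (a IS NULL AND b IS NULL))"

conn = sqlite3.connect(":memory:")

def evaluate(expr, a, b):
    """Evaluate expr with a and b bound; returns 1 (True), 0 (False) or None (Unknown)."""
    sql = f"SELECT {expr} FROM (SELECT ? AS a, ? AS b)"
    return conn.execute(sql, (a, b)).fetchone()[0]

results = {}
for a, b in product([1, 2, None], repeat=2):
    results[(a, b)] = (evaluate(DISTINCT_FROM, a, b),
                       evaluate(NOT_DISTINCT_FROM, a, b))
    print(a, b, results[(a, b)])
```

Every combination yields 1 or 0, never None, and the two columns are always each other's negation.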

DISTINCT clause used on non-distinct select error

From ENEA Polyhedra reference:

Inclusion of the distinct clause will generate an error if the select
statement could potentially return duplicate rows. Only select
statements whose output columns include all the primary key columns of
the tables specified in the from clause can be successfully executed
with a distinct clause.

So I guess this DBMS doesn't really implement distinct, as this constraint defeats the purpose of the clause. Unless you join a table without any primary key, maybe?

EDIT: It seems this resource is old. Which version of Polyhedra are you using?

SQL Distinct clause not working?

If you want each name to appear only once, then group by comes to mind. One method is:

SELECT c.fldContactName,
       MAX(c.fldsignonlinesetup) as fldsignonlinesetup,
       MAX(c.fldorderdate) as fldorderdate,
       MAX(c.fldemail) as fldemail
FROM tblcustomers c LEFT JOIN
     tblorders o
     ON c.fldcustomerid = o.fldcustomerid
WHERE o.fldorderdate BETWEEN '2013-01-01' AND '2016-12-31' AND
      c.fldemail <> 'NULL' AND c.fldcontactname <> 'NULL' AND
      c.fldcontactname <> '' AND c.fldemail <> '' AND
      c.fldsignonlinesetup = 0
GROUP BY c.fldcontactname
HAVING COUNT(*) = 1
ORDER BY c.fldcontactname ASC;

SELECT DISTINCT just makes sure that no two rows in the result set are identical across all selected columns. It has nothing to do with finding values that occur in only one row. The HAVING clause does this.

Notes:

  • The use of table aliases is good, but aliases that abbreviate the table names (c for tblcustomers, o for tblorders) make the query easier to understand than arbitrary letters.
  • The MAX() is really a no-op. With one row, it returns the value from the one row.
  • The GROUP BY is on the field you care about -- the one you don't want duplicates for.
  • The HAVING clause gets the values with only one row.
  • MySQL does not require the MAX() functions, but I strongly recommend using an aggregation function, so you don't learn bad habits that don't work in other databases and can behave unexpectedly in MySQL.
  • Do you really mean fldemail <> 'NULL', or do you intend c.fldemail IS NOT NULL?
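To make the difference between SELECT DISTINCT and GROUP BY ... HAVING COUNT(*) = 1 concrete, here is a minimal sketch using Python's built-in sqlite3 module. The contacts table and its rows are hypothetical and far simpler than the tables in the question; the point is only that DISTINCT keeps a name that occurs with two different emails, while the HAVING clause drops it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (name TEXT, email TEXT);
    INSERT INTO contacts VALUES
        ('Ann', 'ann@example.com'),
        ('Bob', 'bob@example.com'),
        ('Bob', 'robert@example.com'),
        ('Cy',  'cy@example.com');
""")

# DISTINCT removes fully identical rows only, so both 'Bob' rows survive
distinct_rows = conn.execute(
    "SELECT DISTINCT name, email FROM contacts ORDER BY name, email"
).fetchall()

# GROUP BY + HAVING keeps only names that occur in exactly one row
single_names = conn.execute("""
    SELECT name, MAX(email) AS email
    FROM contacts
    GROUP BY name
    HAVING COUNT(*) = 1
    ORDER BY name
""").fetchall()

print(len(distinct_rows), single_names)
```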

Distinct Counts in a Window Function

Unfortunately, SQL Server does not support COUNT(DISTINCT ...) as a window function.

So you need to nest window functions. I find the simplest and most efficient method is MAX over a DENSE_RANK, but there are others.

The partitioning clause is the equivalent of GROUP BY in a normal aggregate, then the value you are DISTINCTing goes in the ORDER BY of the DENSE_RANK. So you calculate a ranking, while ignoring tied results, then take the maximum rank, per partition.

SELECT
    PRODUCT_ID,
    KEY_ID,
    STORECLUSTER,
    STORECLUSTER_COUNT = MAX(rn) OVER (PARTITION BY PRODUCT_ID, KEY_ID)
FROM (
    SELECT *,
        rn = DENSE_RANK() OVER (PARTITION BY PRODUCT_ID, KEY_ID ORDER BY STORECLUSTER)
    FROM YourTable t
) t;
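The same MAX-over-DENSE_RANK idea can be tried out with Python's built-in sqlite3 module, assuming the bundled SQLite is version 3.25 or later (the first with window functions). The sales table and its rows are hypothetical, and the SQL Server specific alias = expression syntax is replaced with AS. The sketch compares the windowed result against a plain grouped COUNT(DISTINCT ...).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product_id INTEGER, key_id INTEGER, storecluster TEXT);
    INSERT INTO sales VALUES
        (1, 10, 'A'), (1, 10, 'A'), (1, 10, 'B'),
        (2, 20, 'C'), (2, 20, 'C');
""")

# Highest dense rank per partition = number of distinct values in it
windowed = conn.execute("""
    SELECT product_id, key_id, storecluster,
           MAX(rn) OVER (PARTITION BY product_id, key_id) AS storecluster_count
    FROM (
        SELECT *,
               DENSE_RANK() OVER (PARTITION BY product_id, key_id
                                  ORDER BY storecluster) AS rn
        FROM sales
    ) AS t
    ORDER BY product_id, key_id, storecluster
""").fetchall()

# Reference answer from an ordinary aggregate, keyed by product_id
grouped = dict(conn.execute("""
    SELECT product_id, COUNT(DISTINCT storecluster)
    FROM sales
    GROUP BY product_id, key_id
""").fetchall())

for row in windowed:
    print(row, grouped[row[0]])
```

One difference to keep in mind: DENSE_RANK gives NULL its own rank, whereas COUNT(DISTINCT ...) ignores NULLs, so the two can disagree when the ranked column contains nulls.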


SELECT DISTINCT or SELECT efficiency for IN clause

SELECT * FROM account_item WHERE instrument_id IN (SELECT instrument_id FROM instrument WHERE market_id=202)

SELECT * FROM account_item WHERE instrument_id IN (SELECT distinct instrument_id FROM instrument WHERE market_id=202)

The distinct makes no difference (as you might have guessed); both queries give this execution plan:

ALL_ROWS    SELECT STATEMENT   Cost = 703
1.1 HASH JOIN
2.1 INDEX FAST FULL SCAN PRF_INSTRUMENT_COMPANY
2.2 TABLE ACCESS FULL ACCOUNT_ITEM

What does make a difference is:

1) cardinality of the subselect

SELECT * FROM account_item WHERE instrument_id IN (SELECT instrument_id FROM instrument WHERE symbol='MSFT')

ALL_ROWS SELECT STATEMENT Cost = 129
1.1 HASH JOIN
2.1 TABLE ACCESS BY INDEX ROWID INSTRUMENT
3.1 INDEX RANGE SCAN PRF_INSTRUMENT_MATCH
2.2 TABLE ACCESS FULL ACCOUNT_ITEM

2) whether you access unindexed columns of tbl_A (aka select * hurts)

SELECT account_id FROM account_item WHERE instrument_id IN (SELECT distinct instrument_id FROM instrument WHERE market_id=202)

ALL_ROWS SELECT STATEMENT Cost = 608
1.1 HASH JOIN
2.1 INDEX FAST FULL SCAN PRF_INSTRUMENT_COMPANY
2.2 INDEX FAST FULL SCAN PRF_ACCOUNT_ITEM

3) and once you have good cardinality and access only columns that are in the indexes in use:

SELECT account_id FROM account_item WHERE instrument_id IN (SELECT instrument_id FROM instrument WHERE symbol='MSFT')

ALL_ROWS SELECT STATEMENT Cost = 33
1.1 HASH JOIN
2.1 TABLE ACCESS BY INDEX ROWID INSTRUMENT
3.1 INDEX RANGE SCAN PRF_INSTRUMENT_MATCH
2.2 INDEX FAST FULL SCAN PRF_ACCOUNT_ITEM

Your query can also be rewritten using a join instead of a subselect (this returns the same rows as long as instrument_id is unique in instrument):

select a.account_id
from account_item a,
instrument i
where a.instrument_id=i.instrument_id
and i.symbol='MSFT'

ALL_ROWS SELECT STATEMENT Cost = 33
1.1 HASH JOIN
2.1 TABLE ACCESS BY INDEX ROWID INSTRUMENT
3.1 INDEX RANGE SCAN PRF_INSTRUMENT_MATCH
2.2 INDEX FAST FULL SCAN PRF_ACCOUNT_ITEM

With your example tables:

select a.*
from tbl_A a, tbl_B b
where a.val=b.myField
and b...some other condition
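One caveat when swapping IN for a join is that a join repeats an outer row once per matching inner row. The sketch below illustrates this with Python's built-in sqlite3 module; the table names follow the answer, but the rows are hypothetical and instrument deliberately has no unique constraint on instrument_id, which is an assumption made for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE instrument (instrument_id INTEGER, symbol TEXT);
    CREATE TABLE account_item (account_id INTEGER, instrument_id INTEGER);
    -- instrument_id 1 appears twice on purpose (no unique constraint here)
    INSERT INTO instrument VALUES (1, 'MSFT'), (1, 'MSFT'), (2, 'AAPL');
    INSERT INTO account_item VALUES (100, 1), (101, 2);
""")

# IN is a membership test: each account_item row is returned at most once
in_rows = conn.execute("""
    SELECT account_id FROM account_item
    WHERE instrument_id IN
          (SELECT instrument_id FROM instrument WHERE symbol = 'MSFT')
""").fetchall()

# The join returns one row per matching pair, so duplicates can appear
join_rows = conn.execute("""
    SELECT a.account_id
    FROM account_item a
    JOIN instrument i ON a.instrument_id = i.instrument_id
    WHERE i.symbol = 'MSFT'
""").fetchall()

print(in_rows, join_rows)
```

With a primary key or unique index on instrument.instrument_id the two forms return the same rows, which is presumably the situation in the answer's schema.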

Subselect vs. join efficiency could fuel a debate of its own. Oracle is advertised as being able to convert in... subselects to joins, and as you can see, the execution plan is the same here. However, this is not always the case: depending on whether you access unindexed fields of tbl_a (account items), or on the cardinality of the criteria on tbl_b (instruments), things can go quite weird.

Basically, the rule of thumb is:

  • each column used in a where criterion should be indexed, preferably not individually but as part of one index covering all the columns used in the where criteria, e.g. create index prf_fastInstruments on instrument(market_id,symbol)
  • if you can make an index unique, make it unique
  • if you're about to touch lots of rows, such as "all payment records of the century", consider adding a time criterion and putting that column in your index as well. This works as a poor man's partitioning and speeds up queries
  • limit the number of columns you load from the database. Usually select * is not really required. If you select only columns that are in the index, the whole query is answered from the index and far less data has to be read from disk, which makes the query much faster. But if you select * or access even one additional unindexed column, each matching row has to be loaded from the table first.
  • avoid premature optimizations (such as in... vs in distinct); as you can see, other factors have a much bigger impact
  • insert real data: a realistic number of rows with realistic cardinality (for example, use a fake identity generator to create 1 million customers rather than calling them "user1"..."user100000") and look at the execution plan of your selects before you change anything
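The point about selecting only indexed columns can be observed in a small sketch. It uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN rather than Oracle, so the plan format differs from the listings above; the table layout, the note column and the index name idx_item_instrument_account are hypothetical. SQLite labels an index that satisfies the whole query as a covering index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account_item (
        id INTEGER PRIMARY KEY,
        account_id INTEGER,
        instrument_id INTEGER,
        note TEXT               -- not part of any index
    );
    CREATE INDEX idx_item_instrument_account
        ON account_item (instrument_id, account_id);
""")

def plan(sql):
    """Return the detail text of SQLite's query plan for sql."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Only indexed columns are read: the index alone can answer the query
narrow_plan = plan("SELECT account_id FROM account_item WHERE instrument_id = 5")

# select * also needs note, so each match requires a table lookup
wide_plan = plan("SELECT * FROM account_item WHERE instrument_id = 5")

print(narrow_plan)
print(wide_plan)
```

The exact wording of the plan text varies between SQLite versions, but the first plan should mention a covering index and the second should not.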

COUNT DISTINCT with CONDITIONS

You can try this:

select
count(distinct tag) as tag_count,
count(distinct (case when entryId > 0 then tag end)) as positive_tag_count
from
your_table_name;

The first count(distinct...) is easy.
The second one looks more complex but is the same as the first, except that it wraps the column in a case...when expression. The case...when keeps only rows with a positive entryId; for zero or negative values it evaluates to null, and nulls are not included in the count.

One thing to note here is that this needs only a single read of the table. When it seems that you have to read the same table twice or more, it can usually be done in one pass. As a result, the task finishes a lot faster with less I/O.
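A quick way to see the conditional count at work is to run it against a few rows. The sketch below uses Python's built-in sqlite3 module; the tags table and its contents are hypothetical, chosen so that one tag occurs only with a non-positive entryId.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (entryId INTEGER, tag TEXT);
    INSERT INTO tags VALUES
        (1, 'sql'), (2, 'sql'), (0, 'oracle'), (-1, 'mysql'), (3, 'mysql');
""")

# Both counts come from a single pass over the table
tag_count, positive_tag_count = conn.execute("""
    SELECT COUNT(DISTINCT tag),
           COUNT(DISTINCT CASE WHEN entryId > 0 THEN tag END)
    FROM tags
""").fetchone()

print(tag_count, positive_tag_count)
```

Here 'oracle' contributes to the first count only, because its single row has entryId 0 and the case expression turns it into null.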


