Does SQL Server support IS DISTINCT FROM clause?
IS [NOT] DISTINCT FROM is scheduled to be included in SQL Server 2022 (currently in public preview), see:
- What's new in SQL Server 2022
- IS [NOT] DISTINCT FROM (Transact-SQL)
For earlier versions of SQL Server, the following SO question explains how to work around this issue with equivalent (but more verbose) SQL Server expressions:
- How to rewrite IS DISTINCT FROM and IS NOT DISTINCT FROM in SQL Server 2008 R2?
Does SELECT DISTINCT differ from SELECT when using a NOT IN clause?
From a functional point of view, the queries with or without DISTINCT are identical (they would delete the same set of rows).
From a performance point of view, I expect SQL Server to produce the same execution plan for both queries, but I cannot prove this.
For other database engines, this may be different. See:
- https://mariadb.com/kb/en/optimizing-group-by/
- https://www.quora.com/Should-I-use-DISTINCT-in-a-subquery-when-using-IN
- https://docs.oracle.com/javadb/10.8.3.0/tuning/ctuntransform867165.html
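The functional equivalence claimed above can be checked with a small sketch. It assumes SQLite (through Python's sqlite3 module) as a stand-in engine, and the parent and child tables are hypothetical. It only illustrates that the result set is the same; it says nothing about execution plans in SQL Server.

```python
import sqlite3

# Hypothetical tables; SQLite stands in for the real engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent (id INTEGER);
CREATE TABLE child (parent_id INTEGER);
INSERT INTO parent VALUES (1), (2), (3);
INSERT INTO child VALUES (1), (1), (2);  -- duplicate parent_id on purpose
""")

# Same NOT IN predicate, once without and once with DISTINCT in the subquery.
plain = conn.execute(
    "SELECT id FROM parent WHERE id NOT IN (SELECT parent_id FROM child) ORDER BY id"
).fetchall()
deduped = conn.execute(
    "SELECT id FROM parent WHERE id NOT IN (SELECT DISTINCT parent_id FROM child) ORDER BY id"
).fetchall()

print(plain, deduped)
```

Membership in a set does not depend on how many times a value occurs, which is why duplicates in the subquery cannot change the outcome.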
How to rewrite IS DISTINCT FROM and IS NOT DISTINCT FROM in SQL Server 2008 R2?
The IS DISTINCT FROM predicate was introduced as feature T151 of SQL:1999, and its readable negation, IS NOT DISTINCT FROM, was added as feature T152 of SQL:2003. The purpose of these predicates is to guarantee that the result of comparing two values is either True or False, never Unknown.
These predicates work with any comparable type (including rows, arrays and multisets), making it rather complicated to emulate them exactly. However, SQL Server doesn't support most of these types, so we can get pretty far by checking for null arguments/operands:
a IS DISTINCT FROM b can be rewritten as:
((a <> b OR a IS NULL OR b IS NULL) AND NOT (a IS NULL AND b IS NULL))
a IS NOT DISTINCT FROM b can be rewritten as:
(NOT (a <> b OR a IS NULL OR b IS NULL) OR (a IS NULL AND b IS NULL))
Your own answer is incorrect as it fails to consider that FALSE OR NULL evaluates to Unknown. For example, NULL IS DISTINCT FROM NULL should evaluate to False. Similarly, 1 IS NOT DISTINCT FROM NULL should evaluate to False. In both cases, your expressions yield Unknown.
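As a quick sanity check, the two rewritten expressions can be evaluated for every combination of null and non-null operands. The sketch below assumes SQLite (via Python's sqlite3 module) as a stand-in engine, since it follows the same three-valued logic for these operators. The evaluate helper and the :a and :b parameter names are made up for illustration.

```python
import sqlite3

# The two rewrites from the answer, with :a and :b as bound parameters.
DISTINCT_SQL = (
    "SELECT ((:a <> :b OR :a IS NULL OR :b IS NULL) "
    "AND NOT (:a IS NULL AND :b IS NULL))"
)
NOT_DISTINCT_SQL = (
    "SELECT (NOT (:a <> :b OR :a IS NULL OR :b IS NULL) "
    "OR (:a IS NULL AND :b IS NULL))"
)

conn = sqlite3.connect(":memory:")

def evaluate(sql, a, b):
    """Return True or False, or None if the expression is Unknown (NULL)."""
    value = conn.execute(sql, {"a": a, "b": b}).fetchone()[0]
    return None if value is None else bool(value)

# Python None is bound as SQL NULL.
pairs = [(1, 1), (1, 2), (1, None), (None, 1), (None, None)]
results = {
    (a, b): (evaluate(DISTINCT_SQL, a, b), evaluate(NOT_DISTINCT_SQL, a, b))
    for a, b in pairs
}

for pair, (is_distinct, is_not_distinct) in results.items():
    print(pair, is_distinct, is_not_distinct)
```

None of the combinations yields Unknown, and the two columns are always each other's negation.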
DISTINCT clause used on non-distinct select error
From ENEA Polyhedra reference:
Inclusion of the distinct clause will generate an error if the select
statement could potentially return duplicate rows. Only select
statements whose output columns include all the primary key columns of
the tables specified in the from clause can be successfully executed
with a distinct clause.
So I guess this DBMS doesn't really implement distinct, as this constraint defeats the purpose of using the clause. Unless you join a table without any primary key, maybe?
EDIT: Seems like this resource is old. Which version of Polyhedra are you using?
SQL Distinct clause not working?
If you want each name to appear only once, then group by comes to mind. One method is:
SELECT c.fldContactName,
       MAX(c.fldsignonlinesetup) as fldsignonlinesetup,
       MAX(c.fldorderdate) as fldorderdate,
       MAX(c.fldemail) as fldemail
FROM tblcustomers c LEFT JOIN
     tblorders o
     ON c.fldcustomerid = o.fldcustomerid
WHERE o.fldorderdate BETWEEN '2013-01-01' AND '2016-12-31' AND
      c.fldemail <> 'NULL' AND c.fldcontactname <> 'NULL' AND
      c.fldcontactname <> '' AND c.fldemail <> '' AND
      c.fldsignonlinesetup = 0
GROUP BY c.fldcontactname
HAVING COUNT(*) = 1
ORDER BY c.fldcontactname ASC;
SELECT DISTINCT just makes sure that all the columns in the result set are never duplicates. It has nothing to do with finding values with only one row. The HAVING clause does this.
Notes:
- The use of table aliases is good, but abbreviations for table names make the query more understandable.
- The MAX() is really a no-op. With one row, it returns the value from the one row.
- The GROUP BY is on the field you care about -- the one you don't want duplicates for.
- The HAVING clause gets the values with only one row.
- MySQL does not require the MAX() functions, but I strongly recommend using an aggregation function, so you don't learn bad habits that don't work in other databases and can behave unexpectedly in MySQL.
- Do you really mean fldemail <> 'NULL' or do you intend c.fldemail IS NOT NULL?
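The difference between SELECT DISTINCT and GROUP BY with HAVING COUNT(*) = 1 can be seen on a tiny data set. This is a minimal sketch that assumes SQLite (via Python's sqlite3 module) and a cut-down, hypothetical version of tblcustomers with only two columns.

```python
import sqlite3

# Cut-down, hypothetical version of tblcustomers: 'Ann' appears twice.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblcustomers (fldcontactname TEXT, fldemail TEXT);
INSERT INTO tblcustomers VALUES
    ('Ann', 'ann@a.example'),
    ('Ann', 'ann@b.example'),
    ('Bob', 'bob@a.example');
""")

# DISTINCT removes duplicate rows only; both 'Ann' rows differ, so both stay.
distinct_rows = conn.execute("""
    SELECT DISTINCT fldcontactname, fldemail
    FROM tblcustomers
    ORDER BY fldcontactname, fldemail
""").fetchall()

# GROUP BY + HAVING keeps only the names that occur on exactly one row.
single_rows = conn.execute("""
    SELECT fldcontactname, MAX(fldemail) AS fldemail
    FROM tblcustomers
    GROUP BY fldcontactname
    HAVING COUNT(*) = 1
    ORDER BY fldcontactname
""").fetchall()

print(distinct_rows)
print(single_rows)
```

With this data, DISTINCT still returns three rows, while the grouped query returns only the row for 'Bob'.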
Distinct Counts in a Window Function
Unfortunately, SQL Server does not support COUNT(DISTINCT ...) as a window function.
So you need to nest window functions. I find the simplest and most efficient method is MAX over a DENSE_RANK, but there are others.
The partitioning clause is the equivalent of GROUP BY in a normal aggregate, then the value you are DISTINCTing goes in the ORDER BY of the DENSE_RANK. So you calculate a ranking, while ignoring tied results, then take the maximum rank, per partition.
SELECT
    PRODUCT_ID,
    KEY_ID,
    STORECLUSTER,
    STORECLUSTER_COUNT = MAX(rn) OVER (PARTITION BY PRODUCT_ID, KEY_ID)
FROM (
    SELECT *,
        rn = DENSE_RANK() OVER (PARTITION BY PRODUCT_ID, KEY_ID ORDER BY STORECLUSTER)
    FROM YourTable t
) t;
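The MAX over DENSE_RANK technique can be tried out on sample data. The sketch below assumes SQLite 3.25 or later (via Python's sqlite3 module) as a stand-in for SQL Server, so it uses the standard AS alias form instead of the T-SQL alias = expression form. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE YourTable (PRODUCT_ID INTEGER, KEY_ID INTEGER, STORECLUSTER TEXT);
INSERT INTO YourTable VALUES
    (1, 10, 'A'), (1, 10, 'A'), (1, 10, 'B'),  -- two distinct clusters
    (2, 20, 'C');                              -- one distinct cluster
""")

rows = conn.execute("""
    SELECT PRODUCT_ID, KEY_ID, STORECLUSTER,
           MAX(rn) OVER (PARTITION BY PRODUCT_ID, KEY_ID) AS STORECLUSTER_COUNT
    FROM (
        -- Ties share a rank, so the highest rank equals the distinct count.
        SELECT t.*,
               DENSE_RANK() OVER (PARTITION BY PRODUCT_ID, KEY_ID
                                  ORDER BY STORECLUSTER) AS rn
        FROM YourTable t
    ) ranked
    ORDER BY PRODUCT_ID, STORECLUSTER
""").fetchall()

for row in rows:
    print(row)
```

One caveat: DENSE_RANK also assigns a rank to NULL values, whereas COUNT(DISTINCT ...) ignores them, so the two can differ by one when the ranked column is nullable.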
SELECT DISTINCT or SELECT efficiency for IN clause
SELECT * FROM account_item WHERE instrument_id IN (SELECT instrument_id FROM instrument WHERE market_id=202)
SELECT * FROM account_item WHERE instrument_id IN (SELECT distinct instrument_id FROM instrument WHERE market_id=202)
The distinct makes no difference (as you might have guessed); both give this execution plan:
ALL_ROWS SELECT STATEMENT Cost = 703
1.1 HASH JOIN
2.1 INDEX FAST FULL SCAN PRF_INSTRUMENT_COMPANY
2.2 TABLE ACCESS FULL ACCOUNT_ITEM
What makes a difference is:
1) cardinality of the subselect
SELECT * FROM account_item WHERE instrument_id IN (SELECT instrument_id FROM instrument WHERE symbol='MSFT')
ALL_ROWS SELECT STATEMENT Cost = 129
1.1 HASH JOIN
2.1 TABLE ACCESS BY INDEX ROWID INSTRUMENT
3.1 INDEX RANGE SCAN PRF_INSTRUMENT_MATCH
2.2 TABLE ACCESS FULL ACCOUNT_ITEM
2) whether you access unindexed columns of tbl_A (aka select * hurts)
SELECT account_id FROM account_item WHERE instrument_id IN (SELECT distinct instrument_id FROM instrument WHERE market_id=202)
ALL_ROWS SELECT STATEMENT Cost = 608
1.1 HASH JOIN
2.1 INDEX FAST FULL SCAN PRF_INSTRUMENT_COMPANY
2.2 INDEX FAST FULL SCAN PRF_ACCOUNT_ITEM
3) and once you have good cardinality and access only columns which are in the indexes in use:
SELECT account_id FROM account_item WHERE instrument_id IN (SELECT instrument_id FROM instrument WHERE symbol='MSFT')
ALL_ROWS SELECT STATEMENT Cost = 33
1.1 HASH JOIN
2.1 TABLE ACCESS BY INDEX ROWID INSTRUMENT
3.1 INDEX RANGE SCAN PRF_INSTRUMENT_MATCH
2.2 INDEX FAST FULL SCAN PRF_ACCOUNT_ITEM
Your query can be rewritten using a join instead of a subselect.
select a.account_id
from account_item a,
instrument i
where a.instrument_id=i.instrument_id
and i.symbol='MSFT'
ALL_ROWS SELECT STATEMENT Cost = 33
1.1 HASH JOIN
2.1 TABLE ACCESS BY INDEX ROWID INSTRUMENT
3.1 INDEX RANGE SCAN PRF_INSTRUMENT_MATCH
2.2 INDEX FAST FULL SCAN PRF_ACCOUNT_ITEM
With your example tables:
select a.*
from tbl_A a, tbl_B b
where a.val=b.myField
and b...some other condition
The efficiency of subselect vs. join can fuel a new debate. Oracle is advertised as being able to convert in... subselects to joins, and as you can see, the execution plan is the same. However, this is not always the case: depending on whether you access unindexed fields of tbl_a (account items), or on the cardinality of the criteria on tbl_b (instruments), things can get quite weird.
Basically, the rule of thumb is:
- each column used in a where criteria must be indexed, preferably not individually but as a set of columns covering all the columns used in the where criteria, e.g. create index prf_fastInstruments on instrument(market_id,symbol)
- if you have a possibility to make an index unique, make it unique
- if you're about to touch lots of rows, such as "all payment records of the century", consider using a time criteria and put that in your index as well. This works as a poor man's partitioning and speeds up queries
- limit the number of columns you load from the database. Usually select * is not really required. If you select only those columns which are indexed, the whole query is executed from the index and needs to load far less data from the disk, making your query much faster. But if you just select * or access even one additional unindexed column, each matching row has to be loaded from the disk first.
- avoid premature optimizations (such as in... vs in distinct) - as you can see, other factors have a much bigger impact
- insert real data: real number of rows, with real cardinality (such as use a fake identity generator to create 1 million customers, but don't call them "user1"..."user100000") and do an execution plan on your selects before you change anything
COUNT DISTINCT with CONDITIONS
You can try this:
select
count(distinct tag) as tag_count,
count(distinct (case when entryId > 0 then tag end)) as positive_tag_count
from
your_table_name;
The first count(distinct...) is easy.
The second one looks somewhat complex, but it is actually the same as the first one, except that it uses a case...when clause. In the case...when clause, you keep only positive values. Zeros or negative values evaluate to null and are not included in the count.
One thing to note here is that this can be done by reading the table once. When it seems that you have to read the same table twice or more, it can usually be done in a single read. As a result, the task finishes a lot faster with less I/O.
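To see both counts come out of a single pass, here is a minimal sketch. It assumes SQLite (via Python's sqlite3 module) as the engine and invents a few sample rows for your_table_name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE your_table_name (entryId INTEGER, tag TEXT);
INSERT INTO your_table_name VALUES
    (1, 'sql'), (2, 'sql'),       -- 'sql' has positive entries only
    (0, 'oracle'),                -- 'oracle' has no positive entry
    (-1, 'mysql'), (3, 'mysql');  -- 'mysql' has one positive entry
""")

# CASE without ELSE yields NULL for non-matching rows, and COUNT skips NULLs.
tag_count, positive_tag_count = conn.execute("""
    SELECT COUNT(DISTINCT tag),
           COUNT(DISTINCT CASE WHEN entryId > 0 THEN tag END)
    FROM your_table_name
""").fetchone()

print(tag_count, positive_tag_count)
```

Here 'oracle' is counted in the first column but not in the second, because its only row has entryId = 0.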