Differencebetween Postgres Distinct VS Distinct On

What is the difference between Postgres DISTINCT vs DISTINCT ON?

DISTINCT and DISTINCT ON have completely different semantics.

First the theory

DISTINCT applies to an entire tuple. Once the result of the query is computed, DISTINCT removes any duplicate tuples from the result.

For example, assume a table R with the following contents:

#table r;
a | b
---+---
1 | a
2 | b
3 | c
3 | d
2 | e
1 | a

(6 rows)

SELECT distinct * from R will result:

# select distinct * from r;
a | b
---+---
1 | a
3 | d
2 | e
2 | b
3 | c
(5 rows)

Note that distinct applies to the entire list of projected attributes: thus

select distinct * from R

is semantically equivalent to

select distinct a,b from R

You cannot issue

select a, distinct b From R

DISTINCT must follow SELECT. It applies to the entire tuple, not to an attribute of the result.

DISTINCT ON is a postgresql addition to the language. It is similar, but not identical, to group by.

Its syntax is:

 SELECT DISTINCT ON (attributeList) <rest as any query>

For example:

 SELECT DISTINCT ON (a) * from R

It semantics can be described as follows. Compute the as usual--without the DISTINCT ON (a)---but before the projection of the result, sort the current result and group it according to the attribute list in DISTINCT ON (similar to group by). Now, do the projection using the first tuple in each group and ignore the other tuples.

Example:

select * from r order by a;
a | b
---+---
1 | a
2 | e
2 | b
3 | c
3 | d
(5 rows)

Then for every different value of a (in this case, 1, 2 and 3), take the first tuple. Which is the same as:

 SELECT DISTINCT on (a) * from r;
a | b
---+---
1 | a
2 | b
3 | c
(3 rows)

Some DBMS (most notably sqlite) will allow you to run this query:

 SELECT a,b from R group by a;

And this give you a similar result.

Postgresql will allow this query, if and only if there is a functional dependency from a to b. In other words, this query will be valid if for any instance of the relation R, there is only one unique tuple for every value or a (thus selecting the first tuple is deterministic: there is only one tuple).

For instance, if the primary key of R is a, then a->b and:

SELECT a,b FROM R group by a

is identical to:

  SELECT DISTINCT on (a) a, b from r;

Now, back to your problem:

First query:

SELECT DISTINCT count(dimension1)
FROM data_table;

computes the count of dimension1 (number of tuples in data_table that where dimension1 is not null). This query
returns one tuple, which is always unique (hence DISTINCT
is redundant).

Query 2:

SELECT count(*)
FROM (SELECT DISTINCT ON (dimension1) dimension1
FROM data_table
GROUP BY dimension1) AS tmp_table;

This is query in a query. Let me rewrite it for clarity:

WITH tmp_table AS (
SELECT DISTINCT ON (dimension1)
dimension1 FROM data_table
GROUP by dimension1)
SELECT count(*) from tmp_table

Let us compute first tmp_table. As I mentioned above,
let us first ignore the DISTINCT ON and do the rest of the
query. This is a group by by dimension1. Hence this part of the query
will result in one tuple per different value of dimension1.

Now, the DISTINCT ON. It uses dimension1 again. But dimension1 is unique already (due to the group by). Hence
this makes the DISTINCT ON superflouos (it does nothing).
The final count is simply a count of all the tuples in the group by.

As you can see, there is an equivalence in the following query (it applies to any relation with an attribute a):

SELECT (DISTINCT ON a) a
FROM R

and

SELECT a FROM R group by a

and

SELECT DISTINCT a FROM R

Warning

Using DISTINCT ON results in a query might be non-deterministic for a given instance of the database.
In other words, the query might return different results for the same tables.

One interesting aspect

Distinct ON emulates a bad behaviour of sqlite in a much cleaner way. Assume that R has two attributes a and b:

SELECT a, b FROM R group by a

is an illegal statement in SQL. Yet, it runs on sqlite. It simply takes a random value of b from any of the tuples in the group of same values of a.
In Postgresql this statement is illegal. Instead, you must use DISTINCT ON and write:

SELECT DISTINCT ON (a) a,b from R

Corollary

DISTINCT ON is useful in a group by when you want to access a value that is functionally dependent on the group by attributes. In other words, if you know that for every group of attributes they always have the same value of the third attribute, then use DISTINCT ON that group of attributes. Otherwise you would have to make a JOIN to retrieve that third attribute.

GROUP BY and DISTINCT ON working differently for tables and views

You get that error because when you query the table directly, Postgres is able to identify the primary key of the table and knows that grouping by it is enough.

Quote from the manual

When GROUP BY is present, or any aggregate functions are present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or when the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column

(emphasis mine)

When querying the view, Postgres isn't able to detect that functional dependency that makes it possible to have a "shortened" GROUP BY when querying the table directly.

PostgreSQL DISTINCT ON with different ORDER BY

Documentation says:

DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. [...] Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. [...] The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s).

Official documentation

So you'll have to add the address_id to the order by.

Alternatively, if you're looking for the full row that contains the most recent purchased product for each address_id and that result sorted by purchased_at then you're trying to solve a greatest N per group problem which can be solved by the following approaches:

The general solution that should work in most DBMSs:

SELECT t1.* FROM purchases t1
JOIN (
SELECT address_id, max(purchased_at) max_purchased_at
FROM purchases
WHERE product_id = 1
GROUP BY address_id
) t2
ON t1.address_id = t2.address_id AND t1.purchased_at = t2.max_purchased_at
ORDER BY t1.purchased_at DESC

A more PostgreSQL-oriented solution based on @hkf's answer:

SELECT * FROM (
SELECT DISTINCT ON (address_id) *
FROM purchases
WHERE product_id = 1
ORDER BY address_id, purchased_at DESC
) t
ORDER BY purchased_at DESC

Problem clarified, extended and solved here: Selecting rows ordered by some column and distinct on another

Why use IS DISTINCT FROM - Postgres

First, it is convenient. Second, you need to run tests on larger amounts of data. A lot can happen on a database server in a second, so small changes in hundredths of a second are not necessarily indicative of overall performance.

On the positive side, I think Postgres will use an index for is distinct from. I don't think an index will necessarily be used for all the alternatives.

distinct vs group by which is better

Your experience is interesting. I have not seen the single reducer effect for distinct versus group by. Perhaps there is some subtle difference in the optimizer between the two constructs.

A "famous" example in Hive is:

select count(distinct id)
from mytbl;

versus

select count(*)
from (select distinct id
from mytbl
) t;

The former only uses one reducer and the latter operates in parallel. I have seen this both in my experience, and it is documented and discussed (for example, on slides 26 and 27 in this presentation). So, distinct can definitely take advantage of parallelism.

I imagine that as Hive matures, such problems will be fixed. However, it is ironic that Postgres has a similar performance issue with COUNT(DISTINCT), although I think the underlying reason is a little bit different.

Group By Vs Distinct in SQL

Why does this not work?

SELECT DISTINCT(continent), COUNT(name)
FROM world
WHERE population > 200000000;

That is simple. You have an aggregation query, because you have COUNT() in the SELECT. You have no GROUP BY, so any other columns references in the SELECT must be the arguments of aggregations columns. So, continent generates an error.

You seem to also be under the impression that the parentheses around continent have some significance. They do not. Not at all. SQL has a construct, SELECT DISTINCT, which selects distinct values of rows.

Also note that DISTINCT is almost never used with aggregation functions.

Is there any difference between GROUP BY and DISTINCT

MusiGenesis' response is functionally the correct one with regard to your question as stated; the SQL Server is smart enough to realize that if you are using "Group By" and not using any aggregate functions, then what you actually mean is "Distinct" - and therefore it generates an execution plan as if you'd simply used "Distinct."

However, I think it's important to note Hank's response as well - cavalier treatment of "Group By" and "Distinct" could lead to some pernicious gotchas down the line if you're not careful. It's not entirely correct to say that this is "not a question about aggregates" because you're asking about the functional difference between two SQL query keywords, one of which is meant to be used with aggregates and one of which is not.

A hammer can work to drive in a screw sometimes, but if you've got a screwdriver handy, why bother?

(for the purposes of this analogy, Hammer : Screwdriver :: GroupBy : Distinct and screw => get list of unique values in a table column)



Related Topics



Leave a reply



Submit