PostgreSQL DISTINCT ON With Different ORDER BY

PostgreSQL DISTINCT ON with different ORDER BY

Documentation says:

DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. [...] Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. [...] The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s).

Official documentation

So you'll have to add address_id as the first expression in the ORDER BY.

Alternatively, if you want the full row holding the most recent purchase for each address_id, with the result sorted by purchased_at, you are solving a greatest-N-per-group problem. It can be solved with either of the following approaches:

The general solution that should work in most DBMSs:

SELECT t1.* FROM purchases t1
JOIN (
SELECT address_id, max(purchased_at) AS max_purchased_at
FROM purchases
WHERE product_id = 1
GROUP BY address_id
) t2
ON t1.address_id = t2.address_id AND t1.purchased_at = t2.max_purchased_at
WHERE t1.product_id = 1 -- repeat the filter so other products bought at the same time are excluded
ORDER BY t1.purchased_at DESC
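
To try the portable join-on-max() approach without a PostgreSQL server, here is a minimal sketch that runs it against an in-memory SQLite database from Python. The sample rows and the id column are made up for illustration. The sketch also filters the outer table on product_id, which is an assumption about the intent: without it, a different product bought at the same address at the same instant would slip into the result.

```python
import sqlite3

# Hypothetical sample data: (id, address_id, product_id, purchased_at).
rows = [
    (1, 10, 1, "2024-01-01"),
    (2, 10, 1, "2024-03-01"),  # latest product-1 purchase for address 10
    (3, 20, 1, "2024-02-01"),  # only product-1 purchase for address 20
    (4, 20, 2, "2024-04-01"),  # different product, must be ignored
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (id INTEGER, address_id INTEGER, "
    "product_id INTEGER, purchased_at TEXT)"
)
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", rows)

query = """
SELECT t1.id, t1.address_id, t1.purchased_at
FROM purchases t1
JOIN (
    SELECT address_id, max(purchased_at) AS max_purchased_at
    FROM purchases
    WHERE product_id = 1
    GROUP BY address_id
) t2
  ON t1.address_id = t2.address_id
 AND t1.purchased_at = t2.max_purchased_at
WHERE t1.product_id = 1          -- assumption: outer rows are filtered too
ORDER BY t1.purchased_at DESC
"""
latest = conn.execute(query).fetchall()
print(latest)  # [(2, 10, '2024-03-01'), (3, 20, '2024-02-01')]
```

One row per address comes back, newest purchase first, and the product-2 row is left out.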

A more PostgreSQL-oriented solution based on @hkf's answer:

SELECT * FROM (
SELECT DISTINCT ON (address_id) *
FROM purchases
WHERE product_id = 1
ORDER BY address_id, purchased_at DESC
) t
ORDER BY purchased_at DESC
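
The two ORDER BY clauses in that query do different jobs, which is easier to see outside SQL. The following Python sketch only emulates the semantics (it is not how PostgreSQL executes the query), and the sample rows are invented: the inner sort decides which row survives per address_id, the outer sort decides the final presentation order.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (address_id, purchased_at), already limited to product_id = 1.
purchases = [
    (10, "2024-01-05"),
    (20, "2024-03-01"),
    (10, "2024-01-20"),
    (30, "2024-02-10"),
]

# Inner query: ORDER BY address_id, purchased_at DESC.
# Two stable sorts: secondary key first, then the primary key.
ordered = sorted(purchases, key=itemgetter(1), reverse=True)
ordered = sorted(ordered, key=itemgetter(0))

# DISTINCT ON (address_id): keep the first row of each run of equal address_id.
newest_per_address = [next(group) for _, group in groupby(ordered, key=itemgetter(0))]

# Outer query: ORDER BY purchased_at DESC.
result = sorted(newest_per_address, key=itemgetter(1), reverse=True)

print(newest_per_address)  # [(10, '2024-01-20'), (20, '2024-03-01'), (30, '2024-02-10')]
print(result)              # [(20, '2024-03-01'), (30, '2024-02-10'), (10, '2024-01-20')]
```

The intermediate list is ordered by address_id, as DISTINCT ON requires; only the outer sort produces the order by purchase date.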

Problem clarified, extended and solved here: Selecting rows ordered by some column and distinct on another

Select Distinct On while Order By a different column

Since you are selecting events.*, you shouldn't add eventable_id and eventable_type to the output columns redundantly. That would result in duplicate column names. You know that you don't have to include the columns of the DISTINCT ON clause in the target list, right?

Also, it's probably faster to use eventable_type DESC right away, since you have that in your final sort order. That's allowed, too.

SELECT DISTINCT ON (eventable_id, eventable_type)
*
FROM events
WHERE <condition>
ORDER BY eventable_id, eventable_type DESC, created_at DESC

@Denis already covers the rest: make that a subquery and order as you like in the outer query.

The alternative would be a subselect with GROUP BY and max(), but that yields multiple rows per group when the latest created_at per group is not unique. (May or may not be desirable.) And it's probably still slower than DISTINCT ON with an additional ORDER BY step. Test with EXPLAIN ANALYZE.

SELECT e.*
FROM events e
JOIN (
SELECT eventable_id, eventable_type, max(created_at) AS created_at
FROM events
WHERE <condition>
GROUP BY 1, 2
) sub USING (eventable_id, eventable_type, created_at) -- maybe not unique
WHERE <repeat condition if dupes may be eliminated>
ORDER BY e.created_at, e.eventable_id, e.eventable_type DESC
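
The "maybe not unique" caveat is easy to reproduce. Below is a small sketch using Python's sqlite3 module with made-up events, where two rows of the same group share the latest created_at. The id column and the sample values are assumptions, and the <condition> placeholders are simply left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER, eventable_id INTEGER, "
    "eventable_type TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        (1, 7, "Post", "2024-05-01 10:00"),
        (2, 7, "Post", "2024-05-02 09:00"),  # tied for latest in group (7, Post)
        (3, 7, "Post", "2024-05-02 09:00"),  # tied for latest in group (7, Post)
        (4, 8, "Comment", "2024-05-01 12:00"),
    ],
)

tied = conn.execute(
    """
    SELECT e.id
    FROM events e
    JOIN (
        SELECT eventable_id, eventable_type, max(created_at) AS created_at
        FROM events
        GROUP BY eventable_id, eventable_type
    ) sub USING (eventable_id, eventable_type, created_at)
    ORDER BY e.id
    """
).fetchall()
print(tied)  # [(2,), (3,), (4,)]
```

Both tied rows come back for the first group, whereas DISTINCT ON would keep exactly one of them.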

How to select distinct on with different order by Postgresql

select distinct on (letter_one) letter_one, number_int 
from test_dupl2
order by 1, 2 desc;

letter_one | number_int
------------+------------
A | 9
B | 4
(2 rows)

Select Distinct on one column, without ordering by that column

The general answer to your question is that when you use DISTINCT ON (x, ...) in a SELECT statement in PostgreSQL, the database sorts by the values in the DISTINCT ON clause to make it easy to tell whether rows have distinct values (once they're ordered by those values, removing duplicates takes a single pass in which only adjacent rows are compared). Because of this, the database forces you to put the DISTINCT ON columns first in the ORDER BY.

You can work around this by making your original query a subquery, like so:

SELECT t.id FROM
(SELECT DISTINCT ON (countries.id) countries.id
, province_infos.population
, country_infos.founding_date
FROM countries
...
ORDER BY countries.id, province_infos.population DESC, country_infos.founding_date ASC
) t
ORDER BY t.population DESC, t.founding_date ASC

Postgres: Why DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s)?

Well, the ORDER BY is needed to keep those rows together that share the same value for the "distinct columns". The database processes them sequentially discarding all subsequent rows from the same set. If the rows weren't sorted, this wouldn't be easily possible.

Assume this set of rows:

c1 | c2 
---+----
1 | 100
2 | 10
1 | 200
2 | 15

If you want c1 to be unique and pick the highest c2, you would need to use:

select distinct on (c1) *
from the_table
order by c1, c2 desc;

The order by itself will generate the following result:

c1 | c2 
---+----
1 | 200
1 | 100
2 | 15
2 | 10

By processing that result row by row, the database can now efficiently discard every row but the first for each c1 value by simply checking whether that value changes from one row to the next. If the result wasn't sorted, this check would become far more complicated.
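
The single-pass check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not PostgreSQL's actual implementation, and the function name distinct_on_first is made up. It uses the four example rows from this answer and also shows what happens when the input is not sorted.

```python
def distinct_on_first(rows):
    """Keep a row whenever its c1 differs from the previous row's c1 (one pass)."""
    kept = []
    previous_key = object()  # sentinel that compares unequal to every real key
    for row in rows:
        if row[0] != previous_key:
            kept.append(row)
            previous_key = row[0]
    return kept


table = [(1, 100), (2, 10), (1, 200), (2, 15)]

# ORDER BY c1, c2 DESC
ordered = sorted(table, key=lambda r: (r[0], -r[1]))

print(distinct_on_first(ordered))  # [(1, 200), (2, 15)]
print(distinct_on_first(table))    # [(1, 100), (2, 10), (1, 200), (2, 15)]
```

On the unsorted input the adjacent-row comparison never sees two equal c1 values next to each other, so nothing is discarded.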

Get distinct on one column, order by another

Leading expressions in ORDER BY have to agree with expressions in DISTINCT ON:

SELECT DISTINCT ON ("threadId") *
FROM messages
ORDER BY "threadId", "createdAt" DESC;

Detailed explanation:

  • Select first row in each GROUP BY group?

If you want to order results in a different way (like commented), you'll have to wrap the query in an outer query with a second ORDER BY. See:

  • PostgreSQL DISTINCT ON with different ORDER BY

Or something similar, depending on your exact situation and requirements. Getting the best performance may take a more sophisticated query. Recent example:

  • How do I take a DISTINCT ON subquery that is ordered by a separate column, and make it fast?

How to use DISTINCT ON but ORDER BY another expression?

This requirement is necessary to make DISTINCT ON work; to change the final order, you can add an outer query with another ORDER BY clause:

SELECT *
FROM (SELECT DISTINCT ON (s.id)
s.id as subscription_id, subscription_carts.authorized_at, s.*
FROM subscriptions s
JOIN ...
WHERE ...
ORDER BY s.id, subscription_carts.authorized_at
) AS subq
ORDER BY authorized_at;

SQL - how to order by date and select distinct

Your way is correct. You need the "wrong" order to get the correct date value per category_id.

You need to correct this order with an additional ORDER BY step:

SELECT
*
FROM (
SELECT DISTINCT ON (category_id)
category_id,
date
FROM review
ORDER BY category_id, date ASC
) s
ORDER BY date ASC

PostgreSQL SELECT result with a distinct ids that prefers specific values from another table

The ORDER BY that determines which record DISTINCT ON keeps should be on the same level as the DISTINCT ON itself, not in a subquery.

If you did that, the error message would tell you the problem: you have to sort first by the DISTINCT ON columns, then by the tie-breaker columns. Like this:

select distinct on (id) d.id, d.some_text, d.some_array, dl.name, dl.language  
from some_dictionary d
join some_dictionary_language dl on d.id = dl.some_dictionary_id
where dl."language" in ('POLISH', 'ENGLISH')
and (d.some_text ilike '%na%' or dl.name ilike '%na%')
and 1 = ANY(d.some_array)
order by id, case when dl."language" = 'POLISH' then 1 end;

When the DISTINCT ON query doesn't have an ORDER BY, it just makes one up consisting of only the DISTINCT ON columns, leaving the kept row within each group to be selected arbitrarily.
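
To see how the CASE expression acts as a tie-breaker, here is a rough Python emulation with invented rows. In PostgreSQL the CASE yields NULL for every language other than 'POLISH', and NULLs sort last in ascending order by default, so the Polish row comes first within each id. The sketch encodes that as 0 for Polish and 1 for anything else.

```python
# Hypothetical joined rows: (id, name, language).
rows = [
    (1, "dom", "POLISH"),
    (1, "house", "ENGLISH"),
    (2, "tree", "ENGLISH"),
    (3, "cat", "ENGLISH"),
    (3, "kot", "POLISH"),
]


def preference(row):
    # Mirrors ORDER BY id, CASE WHEN language = 'POLISH' THEN 1 END,
    # with the NULLS LAST default made explicit as the value 1.
    return (row[0], 0 if row[2] == "POLISH" else 1)


picked = {}
for row in sorted(rows, key=preference):
    picked.setdefault(row[0], row)  # DISTINCT ON (id): first row per id wins

result = list(picked.values())
print(result)  # [(1, 'dom', 'POLISH'), (2, 'tree', 'ENGLISH'), (3, 'kot', 'POLISH')]
```

Ids that have a Polish row get it; id 2, which has only an English row, falls back to that one.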

PostgreSQL distinct on + order by query optimization

To use the index to avoid the sort, PostgreSQL would first have to scan all of "reporting_processedamazonsnapshot" in index order, then join all of "offers_boooffer" using a nested loop join (so that the order is preserved) and then join all of "offers_offersettings", again using a nested loop join.

Finally, all rows that don't match the condition "offers_offersettings"."account_id" = 20 would be thrown away.

PostgreSQL believes – correctly in my opinion – that it is more efficient to start by reducing the number of rows as much as possible using the condition, then use the most efficient join method to join the tables and then sort for the DISTINCT clause.

I wonder if the following query might be faster:

SELECT DISTINCT ON (q.offer_id) *
FROM offers_offersettings ofs
JOIN offers_boooffer bo ON bo.id = ofs.offer_id
CROSS JOIN LATERAL
(SELECT *
FROM reporting_processedamazonsnapshot r
WHERE r.offer_id = bo.offer_id
ORDER BY r.scraping_date DESC
LIMIT 1) q
WHERE ofs.account_id = 20
ORDER BY q.offer_id ASC, q.scraping_date DESC;

The execution plan would be similar, except that fewer rows would have to be scanned from the index, which should reduce execution time where you need it most.

If you want to speed up the sort, increase work_mem to some 500MB for that query (if you can afford it).


