PostgreSQL DISTINCT ON with different ORDER BY
The official documentation says:
DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. [...] Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. [...] The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s).
So you'll have to add address_id to the ORDER BY.
Alternatively, if you're looking for the full row that contains the most recently purchased product for each address_id, with that result sorted by purchased_at, then you're trying to solve a greatest-N-per-group problem, which can be solved by the following approaches:
The general solution that should work in most DBMSs:
SELECT t1.* FROM purchases t1
JOIN (
SELECT address_id, max(purchased_at) AS max_purchased_at
FROM purchases
WHERE product_id = 1
GROUP BY address_id
) t2
ON t1.address_id = t2.address_id AND t1.purchased_at = t2.max_purchased_at
WHERE t1.product_id = 1 -- repeat the filter so other products with the same timestamp are not returned
ORDER BY t1.purchased_at DESC
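The join-on-max approach can be tried out without a PostgreSQL server. The following is only a minimal sketch: it assumes SQLite (through Python's sqlite3 module) as a stand-in database, and the table contents are made up for illustration; only the column names come from the query above.

```python
import sqlite3

# In-memory SQLite database as a stand-in; the sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases ("
    "id INTEGER PRIMARY KEY, product_id INTEGER, "
    "address_id INTEGER, purchased_at TEXT)"
)
conn.executemany(
    "INSERT INTO purchases (product_id, address_id, purchased_at) VALUES (?, ?, ?)",
    [
        (1, 10, "2024-01-01"),
        (1, 10, "2024-03-01"),  # latest product-1 purchase for address 10
        (2, 10, "2024-03-01"),  # other product, same timestamp
        (1, 20, "2024-02-01"),  # latest product-1 purchase for address 20
        (2, 20, "2024-04-01"),  # other product, must be ignored
    ],
)

# Greatest-N-per-group via a join on the per-group maximum.
latest = conn.execute(
    """
    SELECT t1.address_id, t1.product_id, t1.purchased_at
    FROM purchases t1
    JOIN (
        SELECT address_id, max(purchased_at) AS max_purchased_at
        FROM purchases
        WHERE product_id = 1
        GROUP BY address_id
    ) t2
      ON t1.address_id = t2.address_id
     AND t1.purchased_at = t2.max_purchased_at
    WHERE t1.product_id = 1
    ORDER BY t1.purchased_at DESC
    """
).fetchall()

print(latest)
```

Each address appears once, and the final order follows purchased_at rather than address_id.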
A more PostgreSQL-oriented solution based on @hkf's answer:
SELECT * FROM (
SELECT DISTINCT ON (address_id) *
FROM purchases
WHERE product_id = 1
ORDER BY address_id, purchased_at DESC
) t
ORDER BY purchased_at DESC
Problem clarified, extended and solved here: Selecting rows ordered by some column and distinct on another
Select Distinct On while Order By a different column
Since you are selecting events.*, you shouldn't add eventable_id and eventable_type to the output columns redundantly; that would result in duplicate column names. Columns in the DISTINCT ON clause don't have to appear in the target list.
Also, it's probably faster to use eventable_type DESC right away, since you have that in your final sort order. That's allowed, too.
SELECT DISTINCT ON (eventable_id, eventable_type)
*
FROM events
WHERE <condition>
ORDER BY eventable_id, eventable_type DESC, created_at DESC
@Denis already covers the rest: make that a subquery and order as you like in the outer query.
The alternative would be a subselect with GROUP BY and max(), but that yields multiple rows per group when the latest created_at per group is not unique. (May or may not be desirable.) And it's probably still slower than DISTINCT ON with an additional ORDER BY step. Test with EXPLAIN ANALYZE.
SELECT e.*
FROM events e
JOIN (
SELECT eventable_id, eventable_type, max(created_at) AS created_at
FROM events
WHERE <condition>
GROUP BY 1, 2
) sub USING (eventable_id, eventable_type, created_at) -- maybe not unique
WHERE <repeat condition if dupes may be eliminated>
ORDER BY e.created_at, e.eventable_id, e.eventable_type DESC
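To see the "maybe not unique" caveat in action, here is a small sketch. It assumes SQLite via Python's sqlite3 module as a stand-in for PostgreSQL, and the sample events are hypothetical. Two rows share the latest created_at in one group, so the join returns both.

```python
import sqlite3

# In-memory SQLite database as a stand-in; the sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "id INTEGER PRIMARY KEY, eventable_id INTEGER, "
    "eventable_type TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO events (eventable_id, eventable_type, created_at) VALUES (?, ?, ?)",
    [
        (1, "Post", "2024-01-01"),
        (1, "Post", "2024-01-05"),     # latest for (1, Post)
        (1, "Post", "2024-01-05"),     # tie: same latest timestamp
        (2, "Comment", "2024-01-03"),  # latest for (2, Comment)
    ],
)

rows = conn.execute(
    """
    SELECT e.id, e.eventable_id, e.eventable_type, e.created_at
    FROM events e
    JOIN (
        SELECT eventable_id, eventable_type, max(created_at) AS created_at
        FROM events
        GROUP BY eventable_id, eventable_type
    ) sub USING (eventable_id, eventable_type, created_at)
    ORDER BY e.id
    """
).fetchall()

# Number of rows returned for the group (1, 'Post').
post_rows = [r for r in rows if r[1] == 1 and r[2] == "Post"]
print(len(rows), len(post_rows))
```

DISTINCT ON would keep exactly one row for that group; which one depends on any additional ORDER BY expressions used as tie-breakers.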
How to select distinct on with different order by Postgresql
select distinct on (letter_one) letter_one, number_int
from test_dupl2
order by 1, 2 desc;
 letter_one | number_int
------------+------------
 A          |          9
 B          |          4
(2 rows)
Select Distinct on one column, without ordering by that column
The general answer to your question is that when you use DISTINCT ON (x, ...) in a SELECT statement in PostgreSQL, the database sorts by the expressions in the DISTINCT ON clause so that it can easily tell which rows have distinct values: once the rows are ordered by those values, it takes only one pass to remove duplicates, and only adjacent rows need to be compared. Because of this, the database forces you to sort by the same expressions first.
You can work around this by making your original query a subquery, like so:
SELECT t.id FROM
(SELECT DISTINCT ON (countries.id) countries.id
, province_infos.population
, country_infos.founding_date
FROM countries
...
ORDER BY countries.id, province_infos.population DESC, country_infos.founding_date ASC
) t
ORDER BY t.population DESC, t.founding_date ASC
Postgres: Why DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s)?
Well, the ORDER BY is needed to keep those rows together that share the same value for the "distinct columns". The database processes them sequentially discarding all subsequent rows from the same set. If the rows weren't sorted, this wouldn't be easily possible.
Assume this set of rows:
c1 | c2
---+----
1 | 100
2 | 10
1 | 200
2 | 15
If you want the c1 to be unique and pick the highest c2 you would need to use
select distinct on (c1) *
from the_table
order by c1, c2 desc;
The ORDER BY itself will generate the following result:
c1 | c2
---+----
1 | 200
1 | 100
2 | 15
2 | 10
By processing that result row by row, the database can now efficiently discard every row but the first for each c1 value by simply checking whether that value changes from one row to the next. If the result wasn't sorted, this check would become far more complicated.
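The row-by-row check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not PostgreSQL's actual implementation; the function name distinct_on_first is made up for this example.

```python
def distinct_on_first(rows, key):
    """Keep the first row of each run of adjacent rows that share key(row).

    Only adjacent rows are compared, so the input has to be sorted
    by the key for the result to be one row per key value.
    """
    result = []
    previous = object()  # sentinel that never equals a real key value
    for row in rows:
        current = key(row)
        if current != previous:  # key changed: first row of a new group
            result.append(row)
            previous = current
    return result


# The (c1, c2) rows from the example above, in their original order.
table = [(1, 100), (2, 10), (1, 200), (2, 15)]

# Equivalent of ORDER BY c1, c2 DESC
ordered = sorted(table, key=lambda r: (r[0], -r[1]))

# Equivalent of DISTINCT ON (c1) applied to the sorted rows
kept = distinct_on_first(ordered, key=lambda r: r[0])

# The same single pass over unsorted rows fails to remove duplicates
unsorted_kept = distinct_on_first(table, key=lambda r: r[0])

print(kept)
print(unsorted_kept)
```

On the sorted input the single pass keeps one row per c1 with the highest c2. On the unsorted input the c1 value changes on every row, so nothing is discarded.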
Get distinct on one column, order by another
Leading expressions in ORDER BY have to agree with expressions in DISTINCT ON:
SELECT DISTINCT ON ("threadId") *
FROM messages
ORDER BY "threadId", "createdAt" DESC;
Detailed explanation:
- Select first row in each GROUP BY group?
If you want to order results in a different way (like commented), you'll have to wrap the query in an outer query with a second ORDER BY. See:
- PostgreSQL DISTINCT ON with different ORDER BY
Or similar, depending on your exact situation and requirements. Getting the best performance may take a more sophisticated query. Recent example:
- How do I take a DISTINCT ON subquery that is ordered by a separate column, and make it fast?
How to use DISTINCT ON but ORDER BY another expression?
The requirement that the DISTINCT ON expressions match the leftmost ORDER BY expressions is necessary to make DISTINCT ON work; to change the final order, you can add an outer query with another ORDER BY clause:
SELECT *
FROM (SELECT DISTINCT ON (s.id)
s.id as subscription_id, subscription_carts.authorized_at, s.*
FROM subscriptions s
JOIN ...
WHERE ...
ORDER BY s.id, subscription_carts.authorized_at
) AS subq
ORDER BY authorized_at;
SQL - how to order by date and select distinct
Your way is correct. You need the "wrong" order to get the correct date value per category_id.
You then correct this order with an additional ORDER BY step:
SELECT
*
FROM (
SELECT DISTINCT ON (category_id)
category_id,
date
FROM review
ORDER BY category_id, date ASC
) s
ORDER BY date ASC
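The two ordering steps can be imitated in plain Python to show why both are needed. This is a rough sketch with hypothetical (category_id, date) rows, not something taken from the original question.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (category_id, date) rows standing in for the review table.
reviews = [
    (3, "2024-05-01"),
    (1, "2024-03-10"),
    (3, "2024-01-15"),
    (2, "2024-02-20"),
    (1, "2024-04-01"),
]

# Inner query: ORDER BY category_id, date ASC ...
inner = sorted(reviews, key=itemgetter(0, 1))

# ... then DISTINCT ON (category_id): first row of each category group.
first_per_category = [
    next(group) for _, group in groupby(inner, key=itemgetter(0))
]

# Outer query: ORDER BY date ASC
final = sorted(first_per_category, key=itemgetter(1))

print(first_per_category)
print(final)
```

The inner step picks the earliest date per category but leaves the rows in category order; only the outer sort produces the date order.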
PostgreSQL SELECT result with a distinct ids that prefers specific values from another table
The ORDER BY that determines which row DISTINCT ON keeps should be on the same level as the DISTINCT ON itself, not in a subquery.
If you did that, the error message would tell you the problem: you have to sort by the DISTINCT ON columns first, then by the tie-breaker columns. Like this:
select distinct on (id) d.id, d.some_text, d.some_array, dl.name, dl.language
from some_dictionary d
join some_dictionary_language dl on d.id = dl.some_dictionary_id
where dl."language" in ('POLISH', 'ENGLISH')
and (d.some_text ilike '%na%' or dl.name ilike '%na%')
and 1 = ANY(d.some_array)
order by id, case when dl."language" = 'POLISH' then 1 end;
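In the query above, the CASE expression yields 1 for POLISH rows and NULL otherwise; since NULLs sort last in ascending order by default, a POLISH row is kept whenever one exists for an id.
When the DISTINCT ON query doesn't have an ORDER BY, it just makes one up consisting of only the DISTINCT ON columns, leaving the kept row within each group to be selected arbitrarily.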
PostgreSQL distinct on + order by query optimization
To use the index to avoid the sort, PostgreSQL would first have to scan all of "reporting_processedamazonsnapshot" in index order, then join all of "offers_boooffer" using a nested loop join (so that the order is preserved) and then join all of "offers_offersettings", again using a nested loop join.
Finally, all rows that don't match the condition "offers_offersettings"."account_id" = 20 would be thrown away.
PostgreSQL believes, correctly in my opinion, that it is more efficient to start by reducing the number of rows as much as possible using the condition, then use the most efficient join method to join the tables and then sort for the DISTINCT clause.
I wonder if the following query might be faster:
SELECT DISTINCT ON (q.offer_id) *
FROM offers_offersettings ofs
JOIN offers_boooffer bo ON bo.id = ofs.offer_id
CROSS JOIN LATERAL
(SELECT *
FROM reporting_processedamazonsnapshot r
WHERE r.offer_id = bo.offer_id
ORDER BY r.scraping_date DESC
LIMIT 1) q
WHERE ofs.account_id = 20
ORDER BY q.offer_id ASC, q.scraping_date DESC;
The execution plan would be similar, except that fewer rows would have to be scanned from the index, which should reduce execution time where you need it most.
If you want to speed up the sort, increase work_mem to some 500MB for that query (if you can afford it).