PostgreSQL - Fetch the Rows Which Have the Max Value for a Column in Each GROUP BY Group

PostgreSQL - fetch the rows which have the Max value for a column in each GROUP BY group

On a table with 158k pseudo-random rows (usr_id uniformly distributed between 0 and 10k, trans_id uniformly distributed between 0 and 30):

By query cost, below, I am referring to Postgres' cost-based optimizer's cost estimate (with Postgres' default xxx_cost values), which is a weighted estimate of the required I/O and CPU resources; you can obtain it by firing up pgAdminIII and running "Query/Explain (F7)" on the query with "Query/Explain options" set to "Analyze".

  • Quassnoi's query has a cost estimate of 745k (!), and completes in 1.3 seconds (given a compound index on (usr_id, trans_id, time_stamp))
  • Bill's query has a cost estimate of 93k, and completes in 2.9 seconds (given a compound index on (usr_id, trans_id))
  • Query #1 below has a cost estimate of 16k, and completes in 800ms (given a compound index on (usr_id, trans_id, time_stamp))
  • Query #2 below has a cost estimate of 14k, and completes in 800ms (given a compound function index on (usr_id, EXTRACT(EPOCH FROM time_stamp), trans_id))
    • this is Postgres-specific
  • Query #3 below (Postgres 8.4+) has a cost estimate and completion time comparable to (or better than) query #2 (given a compound index on (usr_id, time_stamp, trans_id)); it has the advantage of scanning the lives table only once and, should you temporarily increase (if needed) work_mem to accommodate the sort in memory, it will be by far the fastest of all queries.

All times above include retrieval of the full 10k-row result set.
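If you prefer the command line to pgAdminIII, the same numbers are available via EXPLAIN from psql; a minimal sketch against the lives table used here (EXPLAIN alone shows the cost estimate, adding ANALYZE also executes the query and reports actual times):

-- estimated cost appears in the top plan line as cost=startup..total
EXPLAIN ANALYZE
SELECT usr_id, MAX(time_stamp)
FROM lives
GROUP BY usr_id;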

Your goal is minimal cost estimate and minimal query execution time, with an emphasis on estimated cost. Query execution time can depend significantly on runtime conditions (e.g. whether the relevant rows are already fully cached in memory or not), whereas the cost estimate does not. On the other hand, keep in mind that the cost estimate is exactly that, an estimate.

The best query execution time is obtained when running on a dedicated database without load (e.g. playing with pgAdminIII on a development PC). Query time will vary in production based on actual machine load and data access spread. When one query appears slightly faster (<20%) than the other but has a much higher cost, it will generally be wiser to choose the one with the longer execution time but the lower cost.

When you expect that there will be no competition for memory on your production machine at the time the query is run (e.g. the RDBMS cache and filesystem cache won't be thrashed by concurrent queries and/or filesystem activity), then the query time you obtained in standalone mode (e.g. pgAdminIII on a development PC) will be representative. If there is contention on the production system, query time will degrade proportionally to the estimated cost ratio: the query with the lower cost does not rely as much on the cache, whereas the query with the higher cost will revisit the same data over and over, triggering additional I/O in the absence of a stable cache. For example:

              cost | time (dedicated machine) |     time (under load) |
-------------------+--------------------------+-----------------------+
some query A:   5k | (all data cached)  900ms | (less i/o)     1000ms |
some query B:  50k | (all data cached)  900ms | (lots of i/o) 10000ms |

Do not forget to run ANALYZE lives once after creating the necessary indices.
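For reference, the indices referred to above can be created along these lines before running ANALYZE (the index names are arbitrary; create only the ones needed by the query you settle on, and note that the expression index assumes time_stamp is a plain timestamp, so EXTRACT is immutable and indexable):

-- for Quassnoi's query and query #1
CREATE INDEX lives_usr_trans_time_idx ON lives (usr_id, trans_id, time_stamp);
-- for query #2: the expression must be wrapped in its own parentheses
CREATE INDEX lives_usr_epoch_trans_idx
    ON lives (usr_id, (EXTRACT(EPOCH FROM time_stamp)), trans_id);
-- for query #3
CREATE INDEX lives_usr_time_trans_idx ON lives (usr_id, time_stamp, trans_id);
-- refresh the planner's statistics
ANALYZE lives;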


Query #1

-- incrementally narrow down the result set via inner joins
-- the CBO may elect to perform one full index scan combined
-- with cascading index lookups, or as hash aggregates terminated
-- by one nested index lookup into lives - on my machine
-- the latter query plan was selected given my memory settings and
-- histogram
SELECT
    l1.*
FROM
    lives AS l1
INNER JOIN (
    SELECT
        usr_id,
        MAX(time_stamp) AS time_stamp_max
    FROM
        lives
    GROUP BY
        usr_id
) AS l2
ON
    l1.usr_id = l2.usr_id AND
    l1.time_stamp = l2.time_stamp_max
INNER JOIN (
    SELECT
        usr_id,
        time_stamp,
        MAX(trans_id) AS trans_max
    FROM
        lives
    GROUP BY
        usr_id, time_stamp
) AS l3
ON
    l1.usr_id = l3.usr_id AND
    l1.time_stamp = l3.time_stamp AND
    l1.trans_id = l3.trans_max;

Query #2

-- cheat to obtain a max of the (time_stamp, trans_id) tuple in one pass
-- this results in a single table scan and one nested index lookup into lives,
-- by far the least I/O intensive operation even in case of great scarcity
-- of memory (least reliant on cache for the best performance)
SELECT
    l1.*
FROM
    lives AS l1
INNER JOIN (
    SELECT
        usr_id,
        MAX(ARRAY[EXTRACT(EPOCH FROM time_stamp), trans_id])
            AS compound_time_stamp
    FROM
        lives
    GROUP BY
        usr_id
) AS l2
ON
    l1.usr_id = l2.usr_id AND
    EXTRACT(EPOCH FROM l1.time_stamp) = l2.compound_time_stamp[1] AND
    l1.trans_id = l2.compound_time_stamp[2];
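This works because Postgres compares arrays element by element, so the maximum of ARRAY[epoch, trans_id] belongs to the row with the latest time_stamp and, among rows tied on time_stamp, the highest trans_id. A quick illustration:

SELECT ARRAY[2, 1] > ARRAY[1, 30];   -- true: the first element decides
SELECT ARRAY[2, 1] > ARRAY[2, 0];    -- true: on a tie, the second element decides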

2013/01/29 update

Finally, as of version 8.4, Postgres supports window functions, meaning you can write something as simple and efficient as:

Query #3

-- use Window Functions
-- performs a SINGLE scan of the table
SELECT DISTINCT ON (usr_id)
    last_value(time_stamp) OVER wnd,
    last_value(lives_remaining) OVER wnd,
    usr_id,
    last_value(trans_id) OVER wnd
FROM lives
WINDOW wnd AS (
    PARTITION BY usr_id ORDER BY time_stamp, trans_id
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
);
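As noted above, this query benefits most when the sort fits in memory; if it spills to disk, you can raise work_mem for the current session before running it (the value below is only an example):

SET work_mem = '64MB';   -- session-level override; pick a size large enough for the sort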

PostgreSQL - Select all rows which have max value in one column

Update for PostgreSQL version 13+

PostgreSQL 13 introduced the FETCH FIRST ... WITH TIES clause:

SELECT id, c1, c2, c3, c4
FROM t
WHERE id = 'something'
ORDER BY c4 DESC
FETCH FIRST 1 ROW WITH TIES

Specifying WITH TIES returns all the rows where c4 equals the largest value in that column, so a separate CTE with a window function is not required. If there is only 1 row where c4 has its maximum value, then only 1 row will be returned.
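A quick self-contained illustration with made-up data:

-- hypothetical data: two rows tie for the largest c4
WITH t(id, c4) AS (
    VALUES ('something', 9), ('something', 9), ('something', 7)
)
SELECT id, c4
FROM t
WHERE id = 'something'
ORDER BY c4 DESC
FETCH FIRST 1 ROW WITH TIES;   -- returns both rows with c4 = 9; LIMIT 1 would return just one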


Original answer (for earlier versions of PostgreSQL):

If there can be multiple rows for the same version number, then use the window function RANK() in a subquery (or CTE) and select the rows where the rank equals 1:

SELECT id, c1, c2, c3, c4
FROM (
    SELECT
        *,
        RANK() OVER (PARTITION BY id ORDER BY c4 DESC) AS c4rank
    FROM t
) ranked
WHERE c4rank = 1
  AND id = 'something'

If you want the most recent version for all ids, just omit the condition id = 'something' from the above statement.

If there can only be 1 row for a given version number, then use ORDER BY and LIMIT 1:

SELECT id, c1, c2, c3, c4
FROM t
WHERE id = 'something'
ORDER BY c4 DESC
LIMIT 1

If you want the most recent version for all ids and there can only be 1 row per (id, version) combination, use DISTINCT ON:

SELECT DISTINCT ON (id) id, c1, c2, c3, c4
FROM t
ORDER BY id, c4 DESC

How to find maximum value per group for all rows in SQL?

You need the MAX() window function:

SELECT *,
       MAX(value) OVER (PARTITION BY name) AS "max"
FROM out_pumptable;


Get the row with the max value in SQL

I feel that the ALL construct can be useful here:

SELECT marca, totOre
FROM Private
WHERE totOre >= ALL (SELECT totOre FROM Private)

Selecting max value grouped by specific column


The fix

Rewrite monetaryGain to be:

MAX(cp.price * cy.value) AS monetaryGain

That way you will not be required to group by cp.price, because it is not output as a group member but used inside an aggregate.

Why?

When you write a GROUP BY query, you can output only columns that are in the GROUP BY list and aggregate function values. This is expected - you expect a single row per group, but there may be several distinct values for a field that is not in the grouping column list.

For the same reason you cannot use non-grouping columns in arithmetic or any other (non-aggregate) function, because this would lead to several results for a single row - there would be no way to display them.

This is a VERY loose explanation, but I hope it helps to grasp the concept.
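A minimal illustration of the rule, using a hypothetical sales(region, price) table:

-- fails: price is neither in GROUP BY nor inside an aggregate
SELECT region, price FROM sales GROUP BY region;
-- works: price is wrapped in an aggregate
SELECT region, MAX(price) FROM sales GROUP BY region;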

Aliases in GROUP BY

Also, you should not use aliases in GROUP BY. Use:

GROUP BY EXTRACT(YEAR FROM cy.date) 

Using an alias in GROUP BY is not allowed. This link might explain why: https://www.postgresql.org/message-id/7608.1259177709%40sss.pgh.pa.us
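Putting both fixes together, the query would take roughly this shape (the table names and the join condition are assumptions, since the full query is not shown here):

SELECT EXTRACT(YEAR FROM cy.date) AS year,
       MAX(cp.price * cy.value) AS monetaryGain
FROM crop_yield cy                             -- hypothetical table names
JOIN crop_price cp ON cp.id = cy.price_id      -- assumed join condition
GROUP BY EXTRACT(YEAR FROM cy.date);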

Get rows with maximum count per one column - while grouping by two columns

Simpler & faster (and correct):

SELECT DISTINCT ON (col1)
       col1, col2, count(*) AS conta
FROM tab
GROUP BY col1, col2
ORDER BY col1, conta DESC;


DISTINCT ON is applied after aggregation, so we don't need a subquery or CTE. Consider the sequence of events in a SELECT query:

  • Best way to get result count before LIMIT was applied
  • Select first row in each GROUP BY group?
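For comparison, this is the subquery formulation that the single-level query above makes unnecessary (same result, one extra query level):

SELECT DISTINCT ON (col1)
       col1, col2, conta
FROM (
    SELECT col1, col2, count(*) AS conta
    FROM tab
    GROUP BY col1, col2
) sub
ORDER BY col1, conta DESC;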

PostgreSQL - Getting row with max value in column


SELECT DISTINCT ON (country_id)
       country_id,
       book_name
FROM country
JOIN publisher ON fk_pub_country = country_id
JOIN book ON fk_book_publisher = pub_id
ORDER BY
       country_id,
       release_date DESC,
       book_id DESC

How to fetch a column name which has maximum value from each group in postgres?

You can use row_number(). If I understand correctly, you want the row with the maximum app_service_id for each month.

WITH cte AS
(
    SELECT
        count(app.product_service_id) AS app_service_id,
        ps.product_service_name,
        CAST(extract('month' FROM to_date(app_date, 'DD/MM/YYYY')) AS INTEGER) AS months,
        CAST(extract('year' FROM to_date(app_date, 'DD/MM/YYYY')) AS INTEGER) AS years,
        CONCAT(extract('year' FROM to_date(app_date, 'DD/MM/YYYY')), ' ', to_char(to_date(app_date, 'DD/MM/YYYY'), 'Month')) AS month_name
    FROM appointment app
    JOIN product_services ps
      ON app.product_service_id = ps.product_service_id
    WHERE to_date(app_date, 'DD/MM/YYYY') > to_date('01-07-2019', 'DD/MM/YYYY')
      AND to_date(app_date, 'DD/MM/YYYY') < to_date('01-07-2020', 'DD/MM/YYYY')
    GROUP BY
        product_service_name,
        month_name,
        years,
        months
)
SELECT
    app_service_id,
    product_service_name,
    months,
    years,
    month_name
FROM
(
    SELECT
        *,
        row_number() OVER (PARTITION BY months ORDER BY app_service_id DESC) AS rn
    FROM cte
) subq
WHERE rn = 1

Selecting rows with max value, in correlation to two other columns

You can achieve this as follows: first group the data by the region field and get the maximum value, then use a simple IN condition on the (region, count) pair:

SELECT * FROM my_table WHERE (region, count) IN (SELECT region, MAX(count) FROM my_table GROUP BY region)


PostgreSQL - Get max value of each group with multiple columns

Use distinct on:

select distinct on (band) t.*
from test t
order by band, random_number desc;


distinct on is a very handy Postgres extension. For performance on large datasets, you want an index on (band, random_number desc).
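A sketch of that index (the index name is arbitrary):

CREATE INDEX test_band_random_number_idx ON test (band, random_number DESC);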


