Optimize Group by Query to Retrieve Latest Row Per User

For best read performance you need a multicolumn index:

CREATE INDEX log_combo_idx
ON log (user_id, log_date DESC NULLS LAST);

To make index-only scans possible, include the otherwise unneeded column payload in a covering index with the INCLUDE clause (Postgres 11 or later):

CREATE INDEX log_combo_covering_idx
ON log (user_id, log_date DESC NULLS LAST) INCLUDE (payload);

See:

  • Do covering indexes in PostgreSQL help JOIN columns?

Fallback for older versions:

CREATE INDEX log_combo_covering_idx
ON log (user_id, log_date DESC NULLS LAST, payload);

Why DESC NULLS LAST?

  • Unused index in range of dates query

For few rows per user_id, or for small tables, DISTINCT ON is typically fastest and simplest - see the sketch after the link below:

  • Select first row in each GROUP BY group?
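
A minimal sketch of the DISTINCT ON approach, assuming the log table and :mydate parameter used throughout this answer:

SELECT DISTINCT ON (user_id)
       user_id, log_date, payload
FROM   log
WHERE  log_date <= :mydate
ORDER  BY user_id, log_date DESC NULLS LAST;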

For many rows per user_id an index skip scan (or loose index scan) is (much) more efficient. That's not implemented up to Postgres 12 - work is ongoing for Postgres 14. But there are ways to emulate it efficiently.

Common Table Expressions require Postgres 8.4+.

LATERAL requires Postgres 9.3+.

The following solutions go beyond what's covered in the Postgres Wiki.

1. No separate table with unique users

With a separate users table, solutions in 2. below are typically simpler and faster. Skip ahead.

1a. Recursive CTE with LATERAL join

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT user_id, log_date, payload
   FROM   log
   WHERE  log_date <= :mydate
   ORDER  BY user_id, log_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT l.user_id, l.log_date, l.payload
      FROM   log l
      WHERE  l.user_id > c.user_id  -- lateral reference
      AND    log_date <= :mydate    -- repeat condition
      ORDER  BY l.user_id, l.log_date DESC NULLS LAST
      LIMIT  1
      ) l
   )
TABLE  cte
ORDER  BY user_id;

This makes it simple to retrieve arbitrary columns and is probably the best choice in current Postgres. More explanation in chapter 2a. below.

1b. Recursive CTE with correlated subquery

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT l AS my_row  -- whole row
   FROM   log l
   WHERE  log_date <= :mydate
   ORDER  BY user_id, log_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT l  -- whole row
           FROM   log l
           WHERE  l.user_id > (c.my_row).user_id
           AND    l.log_date <= :mydate  -- repeat condition
           ORDER  BY l.user_id, l.log_date DESC NULLS LAST
           LIMIT  1)
   FROM   cte c
   WHERE  (c.my_row).user_id IS NOT NULL  -- note parentheses
   )
SELECT (my_row).*  -- decompose row
FROM   cte
WHERE  (my_row).user_id IS NOT NULL
ORDER  BY (my_row).user_id;

Convenient to retrieve a single column or the whole row. The example uses the whole row type of the table. Other variants are possible.

To assert a row was found in the previous iteration, test a single NOT NULL column (like the primary key).

More explanation for this query in chapter 2b. below.

Related:

  • Query last N related rows per row
  • GROUP BY one column, while sorting by another in PostgreSQL

2. With separate users table

Table layout hardly matters as long as exactly one row per relevant user_id is guaranteed. Example:

CREATE TABLE users (
   user_id  serial PRIMARY KEY
 , username text NOT NULL
);

Ideally, the table is physically sorted in sync with the log table. See:

  • Optimize Postgres timestamp query range

Or it's small enough (low cardinality) that it hardly matters. Else, sorting rows in the query can help to further optimize performance. See Gang Liang's addition. If the physical sort order of the users table happens to match the index on log, this may be irrelevant.
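
One way to physically sort users in sync is CLUSTER (a sketch; it rewrites the table once, takes an exclusive lock, and assumes the default primary key index name users_pkey):

CLUSTER users USING users_pkey;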

2a. LATERAL join

SELECT u.user_id, l.log_date, l.payload
FROM   users u
CROSS  JOIN LATERAL (
   SELECT l.log_date, l.payload
   FROM   log l
   WHERE  l.user_id = u.user_id  -- lateral reference
   AND    l.log_date <= :mydate
   ORDER  BY l.log_date DESC NULLS LAST
   LIMIT  1
   ) l;

JOIN LATERAL allows referencing preceding FROM items on the same query level. See:

  • What is the difference between LATERAL JOIN and a subquery in PostgreSQL?

Results in one index (-only) look-up per user.

Returns no row for users missing in the users table. Typically, a foreign key constraint enforcing referential integrity would rule that out.

Also, there is no row for users without a matching entry in log - conforming to the original question. To keep those users in the result, use LEFT JOIN LATERAL ... ON true instead of CROSS JOIN LATERAL, as sketched after the link below:

  • Call a set-returning function with an array argument multiple times
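
A minimal sketch of that variant, keeping all users and returning NULL for log_date and payload where no log entry qualifies:

SELECT u.user_id, l.log_date, l.payload
FROM   users u
LEFT   JOIN LATERAL (
   SELECT l.log_date, l.payload
   FROM   log l
   WHERE  l.user_id = u.user_id
   AND    l.log_date <= :mydate
   ORDER  BY l.log_date DESC NULLS LAST
   LIMIT  1
   ) l ON true;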

Use LIMIT n instead of LIMIT 1 to retrieve more than one row (but not all) per user.

Effectively, all of these do the same:

JOIN LATERAL ... ON true
CROSS JOIN LATERAL ...
, LATERAL ...

The last one has lower priority, though: explicit JOIN binds before comma. That subtle difference can matter with more join tables. See:

  • "invalid reference to FROM-clause entry for table" in Postgres query

2b. Correlated subquery

Good choice to retrieve a single column from a single row. Code example:

  • Optimize groupwise maximum query

The same is possible for multiple columns, but you need more smarts:

CREATE TEMP TABLE combo (log_date date, payload int);

SELECT user_id, (combo1).*  -- note parentheses
FROM  (
   SELECT u.user_id
        , (SELECT (l.log_date, l.payload)::combo
           FROM   log l
           WHERE  l.user_id = u.user_id
           AND    l.log_date <= :mydate
           ORDER  BY l.log_date DESC NULLS LAST
           LIMIT  1) AS combo1
   FROM   users u
   ) sub;

Like LEFT JOIN LATERAL above, this variant includes all users, even without entries in log. You get NULL for combo1, which you can easily filter with a WHERE clause in the outer query if need be.

Nitpick: in the outer query you can't distinguish whether the subquery didn't find a row or all column values happen to be NULL - same result. You need a NOT NULL column in the subquery to avoid this ambiguity.

A correlated subquery can only return a single value. You can wrap multiple columns into a composite type. But to decompose it later, Postgres demands a well-known composite type. Anonymous records can only be decomposed by providing a column definition list.

Use a registered type like the row type of an existing table. Or register a composite type explicitly (and permanently) with CREATE TYPE. Or create a temporary table (dropped automatically at end of session) to register its row type temporarily. Cast syntax: (log_date, payload)::combo
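
For example, the permanent registration (instead of the temp table above) might be:

CREATE TYPE combo AS (log_date date, payload int);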

Finally, we do not want to decompose combo1 on the same query level. Due to a weakness in the query planner this would evaluate the subquery once for each column (still true in Postgres 12). Instead, make it a subquery and decompose in the outer query.

Related:

  • Get values from first and last row per group

Demonstrating all 4 queries with 100k log entries and 1k users:

db<>fiddle here - pg 11

Old sqlfiddle

MySQL fast retrieve last record in each group

Try each of these two queries. Usually at least one will work well for me with "max row" queries like yours.

Query 1:

SELECT d.*
FROM devices d
LEFT OUTER JOIN devices larger_d
  ON larger_d.device_id = d.device_id
  AND larger_d.id > d.id
WHERE larger_d.device_id IS NULL

Query 2:

SELECT d.*
FROM devices d
INNER JOIN (
    SELECT MAX(id) AS id, device_id
    FROM devices
    GROUP BY device_id
) largest_d
  ON largest_d.device_id = d.device_id
  AND largest_d.id = d.id

In both cases, you will need an index on (device_id,id) before you run these queries.
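
That index might be created like this (a sketch; the index name is assumed):

CREATE INDEX idx_devices_device_id_id ON devices (device_id, id);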

In response to your comments on other people's answers: the (id, device_id) index is not equivalent to the one we are suggesting. You do not need to remove it, though it will slow down inserts (just like all indexes do). For this query, however, it is not useful, so you can probably remove it if you don't have a specific reason to keep it.

Retrieving the last record in each group - MySQL

MySQL 8.0 now supports window functions, like almost all popular SQL implementations. With this standard syntax, we can write greatest-n-per-group queries:

WITH ranked_messages AS (
  SELECT m.*, ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
  FROM messages AS m
)
SELECT * FROM ranked_messages WHERE rn = 1;

This and other approaches to finding groupwise maximal rows are illustrated in the MySQL manual.

Below is the original answer I wrote for this question in 2009:


I write the solution this way:

SELECT m1.*
FROM messages m1
LEFT JOIN messages m2
  ON (m1.name = m2.name AND m1.id < m2.id)
WHERE m2.id IS NULL;

Regarding performance, one solution or the other can be better, depending on the nature of your data. So you should test both queries and use the one that performs better on your data.

For example, I have a copy of the StackOverflow August data dump. I'll use that for benchmarking. There are 1,114,357 rows in the Posts table. This is running on MySQL 5.0.75 on my Macbook Pro 2.40GHz.

I'll write a query to find the most recent post for a given user ID (mine).

First using the technique shown by @Eric with the GROUP BY in a subquery:

SELECT p1.postid
FROM Posts p1
INNER JOIN (SELECT pi.owneruserid, MAX(pi.postid) AS maxpostid
            FROM Posts pi
            GROUP BY pi.owneruserid) p2
  ON (p1.postid = p2.maxpostid)
WHERE p1.owneruserid = 20860;

1 row in set (1 min 17.89 sec)

Even the EXPLAIN analysis takes over 16 seconds:

+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
| id | select_type | table      | type   | possible_keys              | key         | key_len | ref          | rows    | Extra       |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL                       | NULL        | NULL    | NULL         |   76756 |             |
|  1 | PRIMARY     | p1         | eq_ref | PRIMARY,PostId,OwnerUserId | PRIMARY     | 8       | p2.maxpostid |       1 | Using where |
|  2 | DERIVED     | pi         | index  | NULL                       | OwnerUserId | 8       | NULL         | 1151268 | Using index |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
3 rows in set (16.09 sec)

Now produce the same query result using my technique with LEFT JOIN:

SELECT p1.postid
FROM Posts p1
LEFT JOIN posts p2
  ON (p1.owneruserid = p2.owneruserid AND p1.postid < p2.postid)
WHERE p2.postid IS NULL AND p1.owneruserid = 20860;

1 row in set (0.28 sec)

The EXPLAIN analysis shows that both tables are able to use their indexes:

+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
| id | select_type | table | type | possible_keys              | key         | key_len | ref   | rows | Extra                                |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
|  1 | SIMPLE      | p1    | ref  | OwnerUserId                | OwnerUserId | 8       | const | 1384 | Using index                          |
|  1 | SIMPLE      | p2    | ref  | PRIMARY,PostId,OwnerUserId | OwnerUserId | 8       | const | 1384 | Using where; Using index; Not exists |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
2 rows in set (0.00 sec)

Here's the DDL for my Posts table:

CREATE TABLE `posts` (
  `PostId` bigint(20) unsigned NOT NULL auto_increment,
  `PostTypeId` bigint(20) unsigned NOT NULL,
  `AcceptedAnswerId` bigint(20) unsigned default NULL,
  `ParentId` bigint(20) unsigned default NULL,
  `CreationDate` datetime NOT NULL,
  `Score` int(11) NOT NULL default '0',
  `ViewCount` int(11) NOT NULL default '0',
  `Body` text NOT NULL,
  `OwnerUserId` bigint(20) unsigned NOT NULL,
  `OwnerDisplayName` varchar(40) default NULL,
  `LastEditorUserId` bigint(20) unsigned default NULL,
  `LastEditDate` datetime default NULL,
  `LastActivityDate` datetime default NULL,
  `Title` varchar(250) NOT NULL default '',
  `Tags` varchar(150) NOT NULL default '',
  `AnswerCount` int(11) NOT NULL default '0',
  `CommentCount` int(11) NOT NULL default '0',
  `FavoriteCount` int(11) NOT NULL default '0',
  `ClosedDate` datetime default NULL,
  PRIMARY KEY (`PostId`),
  UNIQUE KEY `PostId` (`PostId`),
  KEY `PostTypeId` (`PostTypeId`),
  KEY `AcceptedAnswerId` (`AcceptedAnswerId`),
  KEY `OwnerUserId` (`OwnerUserId`),
  KEY `LastEditorUserId` (`LastEditorUserId`),
  KEY `ParentId` (`ParentId`),
  CONSTRAINT `posts_ibfk_1` FOREIGN KEY (`PostTypeId`) REFERENCES `posttypes` (`PostTypeId`)
) ENGINE=InnoDB;

Note to commenters: If you want another benchmark with a different version of MySQL, a different dataset, or different table design, feel free to do it yourself. I have shown the technique above. Stack Overflow is here to show you how to do software development work, not to do all the work for you.

Effectively select latest row for each group in a very large table?

Perhaps a join with a window function will work:

select su.*
from (select s.user_id, u.status, u.timestamp,
             max(u.timestamp) over (partition by s.user_id) as max_timestamp
      from specialusers s join
           users u
           on s.user_id = u.user_id
     ) su
where timestamp = max_timestamp;

This specifically uses max() instead of row_number() on the speculation that it might use slightly fewer resources.
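
For comparison, a row_number() variant over the same assumed tables might look like this:

select su.user_id, su.status, su.timestamp
from (select s.user_id, u.status, u.timestamp,
             row_number() over (partition by s.user_id
                                order by u.timestamp desc) as rn
      from specialusers s join
           users u
           on s.user_id = u.user_id
     ) su
where rn = 1;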

Optimize sub-query selecting last record of each group

Where performance is needed, subqueries in the SELECT clause are indeed a pain and have to be banished :)

You can rewrite this part:

SELECT
    u.id,
    u.user_name,
    ifnull((select longitude from map where user_id = u.id order by map_id desc limit 1), 0) as Longitude,
    ifnull((select latitude from map where user_id = u.id order by map_id desc limit 1), 0) as Latitude,
    (select created from map where user_id = 1 order by created desc limit 1) as LatestTime
FROM users as u

In:

SELECT
    u.id,
    u.user_name,
    COALESCE(m1.longitude, 0) as longitude,
    COALESCE(m1.latitude, 0) as latitude
FROM users u
LEFT JOIN map m1 ON m1.user_id = u.id
LEFT JOIN map m2 ON m2.user_id = m1.user_id AND m2.map_id > m1.map_id
WHERE m2.map_id IS NULL

I wrote a short explanation of the query structure in this answer. It's a really nice trick to learn as it is more readable, subquery-free, and better performance-wise.

I haven't looked at the IN part yet but will if the above doesn't help.

Edit1: You can extract the created date and use a MAX() instead.

SELECT
    u.id,
    u.user_name,
    COALESCE(m1.longitude, 0) as longitude,
    COALESCE(m1.latitude, 0) as latitude,
    created.LatestTime
FROM (SELECT MAX(created) AS LatestTime FROM map WHERE user_id = 1) created
INNER JOIN users u ON TRUE
LEFT JOIN map m1 ON m1.user_id = u.id
LEFT JOIN map m2 ON m2.user_id = m1.user_id AND m2.map_id > m1.map_id
WHERE m2.map_id IS NULL

how do I query sql for a latest record date for each user

select t.username, t.date, t.value
from MyTable t
inner join (
    select username, max(date) as MaxDate
    from MyTable
    group by username
) tm on t.username = tm.username and t.date = tm.MaxDate
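
Note that this returns multiple rows for a user whose maximum date is shared by several records. If exactly one row per user is required, a window-function variant can break ties (a sketch; assumes a database with window-function support):

select username, date, value
from (
    select t.*,
           row_number() over (partition by username order by date desc) as rn
    from MyTable t
) x
where rn = 1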

Optimize updating first, last, and second to last ranked value

Since we have the all-important index on (user_id, created_at), I suggest:

UPDATE users u
SET    first_at    = h.first_at
     , latest_at   = h.latest_at
     , previous_at = h.previous_at
FROM  (
   SELECT u.id, f.first_at, l.last[1] AS latest_at, l.last[2] AS previous_at
   FROM   users u
   CROSS  JOIN LATERAL (
      SELECT ARRAY (
         SELECT h.created_at
         FROM   history h
         WHERE  h.user_id = u.id
         AND    h.type = 'SomeType'  -- ??
         ORDER  BY h.created_at DESC
         LIMIT  2
         ) AS last
      ) l
   CROSS  JOIN LATERAL (
      SELECT created_at AS first_at
      FROM   history h
      WHERE  h.user_id = u.id
      AND    h.type = 'SomeType'  -- ??
      ORDER  BY created_at
      LIMIT  1
      ) f
   WHERE  u.id BETWEEN $1 AND $2
   ) h
WHERE  u.id = h.id
AND   (u.first_at    IS DISTINCT FROM h.first_at
    OR u.latest_at   IS DISTINCT FROM h.latest_at
    OR u.previous_at IS DISTINCT FROM h.previous_at);

This works with non-unique timestamps per user_id, too.

And it's very efficient if there are many rows per user. It's designed to avoid a sequential scan on the big table and make heavy use of the index on (user_id, created_at) instead.

Related:

  • Optimize GROUP BY query to retrieve latest row per user

Assuming most or all users get updated this way, we don't need an index on users. (For the purpose of this UPDATE, no index would be best.)

If there is only a single row in table history for a user, then previous_at is set to NULL. (Your original query has the same effect.)

Only users are updated where qualifying history rows are found.

This added WHERE clause skips updates that would not change anything (at full cost):

AND   (u.first_at    IS DISTINCT FROM h.first_at
    OR u.latest_at   IS DISTINCT FROM h.latest_at
    OR u.previous_at IS DISTINCT FROM h.previous_at)

See:

  • How do I (or can I) SELECT DISTINCT on multiple columns?

The only uncertainty is with WHERE type = 'SomeType'. If that's selective, a partial index with the same predicate would be better - see the sketch below. Then we could even get index-only scans ...
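
Such a partial index might look like this (a sketch; the index name is assumed):

CREATE INDEX history_sometype_idx ON history (user_id, created_at)
WHERE  type = 'SomeType';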

Since the new query should be much faster, you might update more (or all) users at once.

Oracle SQL query: Retrieve latest values per group based on time

Given this data ...

SQL> select * from qtys
  2  /

        ID TS                      QTY
---------- ---------------- ----------
         1 2010-01-04 11:00        152
         2 2010-01-04 11:00        210
         1 2010-01-04 10:45        132
         2 2010-01-04 10:45        318
         4 2010-01-04 10:45        122
         1 2010-01-04 10:30          1
         3 2010-01-04 10:30        214
         2 2010-01-04 10:30       5515
         4 2010-01-04 10:30        210

9 rows selected.

SQL>

... the following query gives what you want ...

SQL> select x.id
  2       , x.ts as "DATE"
  3       , x.qty as "QUANTITY"
  4  from (
  5       select id
  6            , ts
  7            , rank () over (partition by id order by ts desc) as rnk
  8            , qty
  9       from qtys ) x
 10  where x.rnk = 1
 11  /

        ID DATE               QUANTITY
---------- ---------------- ----------
         1 2010-01-04 11:00        152
         2 2010-01-04 11:00        210
         3 2010-01-04 10:30        214
         4 2010-01-04 10:45        122

SQL>

With regards to your additional requirements, you can apply additional filters in the outer WHERE clause. Similarly, you can join additional tables to the inline view as if it were any other table.

Optimal performing query for latest record for each N

Depends on your data (how many rows are there per group?) and your indexes.

See Optimizing TOP N Per Group Queries for some performance comparisons of 3 approaches.

In your case, with millions of rows for only a small number of Vehicles, I would add an index on (VehicleID, TimeStamp) and do the following (see the index sketch after the query):

SELECT CA.*
FROM Vehicles V
CROSS APPLY (SELECT TOP 1 *
             FROM ChannelValue CV
             WHERE CV.VehicleID = V.VehicleID
             ORDER BY TimeStamp DESC) CA
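
The supporting index might be created like this (a sketch; the index name is assumed):

CREATE INDEX IX_ChannelValue_VehicleID_TimeStamp
    ON ChannelValue (VehicleID, TimeStamp);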

