How to Access Latest Row for Each Individual Identifier

Proper way to access latest row for each individual identifier?

This answer seems to point to the DISTINCT ON answer here; however, it also mentions this:

For many rows per customer (low cardinality in column
customer), a loose index scan (a.k.a. "skip scan") would be
(much) more efficient, but that's not implemented up to Postgres 12.
(An implementation for index-only scans is in development for Postgres
13. See here and here.)

For now, there are faster query techniques to substitute for this.
In particular if you have a
separate table holding unique customers, which is the typical use
case. But also if you don't:

  • Optimize GROUP BY query to retrieve latest row per user

Using this other great answer, I found a way to match the performance of a distinct table by using LATERAL.
With a new table test_boats, I can do something like this:

CREATE TABLE test_boats AS (SELECT DISTINCT ON (mmsi) mmsi FROM core_message);

This table creation takes 40+ seconds, which is similar to the time taken by the other answer here.

Then, with the help of LATERAL:

SELECT a.mmsi, b.time
FROM test_boats a
CROSS JOIN LATERAL (
    SELECT b.time
    FROM core_message b
    WHERE a.mmsi = b.mmsi
    ORDER BY b.time DESC
    LIMIT 1
) b
LIMIT 10;

This is blazingly fast: just over 1 millisecond.
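Note that the lateral lookup can only be this fast with a matching multicolumn index on core_message. A hedged sketch in Postgres syntax (the index name is invented; the columns are the ones used in the lateral subquery):

```sql
-- Assumption: (mmsi, time DESC) matches the WHERE + ORDER BY of the
-- lateral subquery, so each per-boat lookup is a single index descent.
CREATE INDEX core_message_mmsi_time_idx ON core_message (mmsi, time DESC);
```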

This will require modifying my program's logic and using a slightly more complex query, but I think I can live with that.

For a fast solution without the need to create a new table, check out the
answer of @ErwinBrandstetter below.


UPDATE: I feel this question is not fully answered yet, as it's not clear why the other proposed solutions perform poorly here.

I tried the benchmark mentioned here. At first, it seems that the DISTINCT ON approach is fast enough if you run a query like the one proposed in the benchmark: +/- 30 ms on my computer.
But that is because the query uses an index-only scan. If you include a field that is not in the index (some_column in the case of the benchmark), performance drops to +/- 100 ms.

Not a dramatic drop in performance yet.
That is why we need a benchmark with a bigger data set, something similar to my case: 40K customers and 8M rows.

Let's try DISTINCT ON again with this new table:

SELECT DISTINCT ON (customer_id) id, customer_id, total 
FROM purchases_more
ORDER BY customer_id, total DESC, id;

This takes about 1.5 seconds to complete.

SELECT DISTINCT ON (customer_id) *
FROM purchases_more
ORDER BY customer_id, total DESC, id;

This takes about 35 seconds to complete.

Now, to come back to my first solution above. It uses an index-only scan and a LIMIT; that's one of the reasons why it is extremely fast. If I recraft that query to not use an index-only scan and drop the outer LIMIT:

SELECT b.*
FROM test_boats a
CROSS JOIN LATERAL (
    SELECT b.*
    FROM core_message b
    WHERE a.mmsi = b.mmsi
    ORDER BY b.time DESC
    LIMIT 1
) b;

This takes about 500 ms, which is still pretty fast.

For a more in-depth benchmark of sorts, see my other answer below.

How to select the last record of each ID

You can use the window function ROW_NUMBER. Here is a solution, given below; I have also made a demo query in DB Fiddle for you. Please check the link: Demo Code in DB-Fiddle

WITH CTE AS (
    SELECT product, user_id,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY product DESC) AS RN
    FROM Mytable
)
SELECT product, user_id FROM CTE WHERE RN = 1;
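The CTE above is portable across engines with window functions. A minimal runnable check using SQLite (3.25+) via Python; the table and sample data are invented for illustration, only the query shape mirrors the answer:

```python
import sqlite3

# In-memory table mimicking the Mytable layout from the answer above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Mytable (product TEXT, user_id INTEGER)")
con.executemany(
    "INSERT INTO Mytable VALUES (?, ?)",
    [("apple", 1), ("pear", 1), ("fig", 2), ("plum", 2)],
)

# Keep only the row with RN = 1 per user_id (the last product, DESC order).
rows = con.execute("""
    WITH CTE AS (
        SELECT product, user_id,
               ROW_NUMBER() OVER (PARTITION BY user_id
                                  ORDER BY product DESC) AS RN
        FROM Mytable
    )
    SELECT product, user_id FROM CTE WHERE RN = 1
""").fetchall()

print(rows)  # one row per user_id
```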

Get most recent row for given ID

Use the aggregate MAX(signin) grouped by id. This will list the most recent signin for each id.

SELECT
    id,
    MAX(signin) AS most_recent_signin
FROM tbl
GROUP BY id;

To get the whole single record, perform an INNER JOIN against a subquery which returns only the MAX(signin) per id.

SELECT
    tbl.id,
    signin,
    signout
FROM tbl
INNER JOIN (
    SELECT id, MAX(signin) AS maxsign FROM tbl GROUP BY id
) ms ON tbl.id = ms.id AND signin = maxsign
WHERE tbl.id = 1;
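This MAX()-plus-join technique runs the same way on any engine. A runnable sketch with SQLite via Python; the tbl contents are invented to match the column names above:

```python
import sqlite3

# Two sessions for id 1 and one for id 2; timestamps stored as
# sortable text so MAX() picks the latest.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl (id INTEGER, signin TEXT, signout TEXT)")
con.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [
        (1, "2023-01-01 08:00", "2023-01-01 16:00"),
        (1, "2023-01-02 09:00", "2023-01-02 17:00"),
        (2, "2023-01-01 07:30", "2023-01-01 15:30"),
    ],
)

# Join back to the base table to recover the full row for the max signin.
row = con.execute("""
    SELECT tbl.id, signin, signout
    FROM tbl
    INNER JOIN (
        SELECT id, MAX(signin) AS maxsign FROM tbl GROUP BY id
    ) ms ON tbl.id = ms.id AND signin = maxsign
    WHERE tbl.id = 1
""").fetchone()

print(row)  # the full row for id 1 with the latest signin
```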

Retrieving the last record in each group - MySQL

MySQL 8.0 now supports windowing functions, like almost all popular SQL implementations. With this standard syntax, we can write greatest-n-per-group queries:

WITH ranked_messages AS (
    SELECT m.*, ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
    FROM messages AS m
)
SELECT * FROM ranked_messages WHERE rn = 1;

This and other approaches to finding groupwise maximal rows are illustrated in the MySQL manual.

Below is the original answer I wrote for this question in 2009:


I write the solution this way:

SELECT m1.*
FROM messages m1 LEFT JOIN messages m2
ON (m1.name = m2.name AND m1.id < m2.id)
WHERE m2.id IS NULL;
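This self-join anti-join trick works in virtually any SQL engine, old MySQL included. A minimal runnable check with SQLite via Python (the messages rows are invented):

```python
import sqlite3

# Two messages for name A, one for name B.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE messages (id INTEGER, name TEXT)")
con.executemany(
    "INSERT INTO messages VALUES (?, ?)",
    [(1, "A"), (2, "A"), (3, "B")],
)

# A row survives only if no other row with the same name has a higher id.
rows = con.execute("""
    SELECT m1.*
    FROM messages m1 LEFT JOIN messages m2
      ON (m1.name = m2.name AND m1.id < m2.id)
    WHERE m2.id IS NULL
""").fetchall()

print(rows)  # the highest id per name
```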

Regarding performance, either solution can be better depending on the nature of your data, so you should test both queries and use whichever performs better on your database.

For example, I have a copy of the StackOverflow August data dump. I'll use that for benchmarking. There are 1,114,357 rows in the Posts table. This is running on MySQL 5.0.75 on my Macbook Pro 2.40GHz.

I'll write a query to find the most recent post for a given user ID (mine).

First using the technique shown by @Eric with the GROUP BY in a subquery:

SELECT p1.postid
FROM Posts p1
INNER JOIN (
    SELECT pi.owneruserid, MAX(pi.postid) AS maxpostid
    FROM Posts pi
    GROUP BY pi.owneruserid
) p2 ON (p1.postid = p2.maxpostid)
WHERE p1.owneruserid = 20860;

1 row in set (1 min 17.89 sec)

Even the EXPLAIN analysis takes over 16 seconds:

+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 76756 | |
| 1 | PRIMARY | p1 | eq_ref | PRIMARY,PostId,OwnerUserId | PRIMARY | 8 | p2.maxpostid | 1 | Using where |
| 2 | DERIVED | pi | index | NULL | OwnerUserId | 8 | NULL | 1151268 | Using index |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
3 rows in set (16.09 sec)

Now produce the same query result using my technique with LEFT JOIN:

SELECT p1.postid
FROM Posts p1 LEFT JOIN posts p2
ON (p1.owneruserid = p2.owneruserid AND p1.postid < p2.postid)
WHERE p2.postid IS NULL AND p1.owneruserid = 20860;

1 row in set (0.28 sec)

The EXPLAIN analysis shows that both tables are able to use their indexes:

+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
| 1 | SIMPLE | p1 | ref | OwnerUserId | OwnerUserId | 8 | const | 1384 | Using index |
| 1 | SIMPLE | p2 | ref | PRIMARY,PostId,OwnerUserId | OwnerUserId | 8 | const | 1384 | Using where; Using index; Not exists |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
2 rows in set (0.00 sec)

Here's the DDL for my Posts table:

CREATE TABLE `posts` (
`PostId` bigint(20) unsigned NOT NULL auto_increment,
`PostTypeId` bigint(20) unsigned NOT NULL,
`AcceptedAnswerId` bigint(20) unsigned default NULL,
`ParentId` bigint(20) unsigned default NULL,
`CreationDate` datetime NOT NULL,
`Score` int(11) NOT NULL default '0',
`ViewCount` int(11) NOT NULL default '0',
`Body` text NOT NULL,
`OwnerUserId` bigint(20) unsigned NOT NULL,
`OwnerDisplayName` varchar(40) default NULL,
`LastEditorUserId` bigint(20) unsigned default NULL,
`LastEditDate` datetime default NULL,
`LastActivityDate` datetime default NULL,
`Title` varchar(250) NOT NULL default '',
`Tags` varchar(150) NOT NULL default '',
`AnswerCount` int(11) NOT NULL default '0',
`CommentCount` int(11) NOT NULL default '0',
`FavoriteCount` int(11) NOT NULL default '0',
`ClosedDate` datetime default NULL,
PRIMARY KEY (`PostId`),
UNIQUE KEY `PostId` (`PostId`),
KEY `PostTypeId` (`PostTypeId`),
KEY `AcceptedAnswerId` (`AcceptedAnswerId`),
KEY `OwnerUserId` (`OwnerUserId`),
KEY `LastEditorUserId` (`LastEditorUserId`),
KEY `ParentId` (`ParentId`),
CONSTRAINT `posts_ibfk_1` FOREIGN KEY (`PostTypeId`) REFERENCES `posttypes` (`PostTypeId`)
) ENGINE=InnoDB;

Note to commenters: If you want another benchmark with a different version of MySQL, a different dataset, or different table design, feel free to do it yourself. I have shown the technique above. Stack Overflow is here to show you how to do software development work, not to do all the work for you.

select rows in sql with latest date for each ID repeated multiple times

This question has been asked before. Please see this question.

Using the accepted answer and adapting it to your problem you get:

SELECT tt.*
FROM myTable tt
INNER JOIN (
    SELECT ID, MAX(Date) AS MaxDateTime
    FROM myTable
    GROUP BY ID
) groupedtt
    ON tt.ID = groupedtt.ID
    AND tt.Date = groupedtt.MaxDateTime

Get value of each latest record grouped by ID

Do you just want DISTINCT ON again?

SELECT DISTINCT ON (p.ID) p.ID, r.*
FROM (
    SELECT DISTINCT ON (r.ID) r.* FROM records r
) r CROSS JOIN LATERAL (
    SELECT p.*
    FROM points p
    ORDER BY r.position <-> p.geom
    LIMIT 1
) p
WHERE r.field1 = p.field1 AND r.field2 = p.field2
ORDER BY p.ID, r.timestamp DESC;

I cannot figure out what you intend by:

(SELECT DISTINCT ON (ID) *
FROM records
)

At a minimum, you should have an ORDER BY:

(SELECT DISTINCT ON (ID) *
FROM records
ORDER BY ID
)

However, your sample data and the name ID suggest that there are no duplicates, so the DISTINCT ON may not be necessary.

Scalable Solution to get latest row for each ID in BigQuery

Quick and dirty option: combine both your queries into one. First get all records with the latest collection_time (using your second query), then dedupe them using your first query:

SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY tab1.ID) AS rn
    FROM (
        SELECT tab1.*
        FROM mytable AS tab1
        INNER JOIN (
            SELECT ID, MAX(collection_time) AS second_time
            FROM mytable
            GROUP BY ID
        ) AS tab2
        ON tab1.ID = tab2.ID AND tab1.collection_time = tab2.second_time
    )
)
WHERE rn = 1

And in standard SQL (as proposed by S.Mohsen sh):

WITH myTable AS (
    SELECT 1 AS ID, 1 AS collection_time
),
tab1 AS (
    SELECT ID, MAX(collection_time) AS second_time
    FROM myTable
    GROUP BY ID
),
tab2 AS (
    SELECT * FROM myTable
),
joint AS (
    SELECT tab2.*
    FROM tab2 INNER JOIN tab1
    ON tab2.ID = tab1.ID AND tab2.collection_time = tab1.second_time
)
SELECT * EXCEPT(rn)
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY ID) AS rn
    FROM joint
)
WHERE rn = 1

SQL: Get last row for each unique id

Use row_number():

select ccs.*
from (select ccs.*,
             row_number() over (partition by ccs.CoatingChambersID order by ccs.LastDt desc) as seqnum
      from [REO].[dbo].[CoatingChamberStateLogs] ccs
     ) ccs
where seqnum = 1;

The use of select distinct with group by is almost never correct.

Another fun way to write the query doesn't require a subquery:

select top (1) with ties ccs.*
from [REO].[dbo].[CoatingChamberStateLogs] ccs
order by row_number() over (partition by ccs.CoatingChambersID order by ccs.LastDt desc);

How to get a latest row for each ID based on a timestamp

a data.table approach

sample data

DT <- fread('ID  Value1  Value2  Value3  Value4  Time
1 1 7 13 19 "2013-11-15 21:12:03:337"
1 2 8 14 20 "2013-12-23 15:12:01:227"
2 3 9 15 21 "2014-12-07 14:37:01:127"
2 4 10 16 22 "2013-12-12 05:23:01:239"
3 5 11 17 23 "2011-12-25 15:12:01:227"
3 6 12 18 24 "2011-12-25 15:12:02:227"', quote = "\"")

code

#first, set milliseconds correct by replacing the last : with a .
DT[, Time := gsub( "(.*)(:)([0-9]*$)", "\\1.\\3", Time)]
#now convert to POSIXct
DT[, Time := as.POSIXct( Time, format = " %Y-%m-%d %H:%M:%OS")]

#now, pull the max Time per group
DT[DT[, .I[which.max(Time)], by=ID]$V1]

output

#    ID Value1 Value2 Value3 Value4                Time
# 1: 1 2 8 14 20 2013-12-23 15:12:01
# 2: 2 3 9 15 21 2014-12-07 14:37:01
# 3: 3 6 12 18 24 2011-12-25 15:12:02

