Proper way to access latest row for each individual identifier?
This answer seems to point toward the DISTINCT ON approach, but it also mentions this:
For many rows per customer (low cardinality in column customer), a loose index scan (a.k.a. "skip scan") would be (much) more efficient, but that's not implemented up to Postgres 12. (An implementation for index-only scans is in development for Postgres 13. See here and here.)
For now, there are faster query techniques to substitute for this. In particular if you have a separate table holding unique customers, which is the typical use case. But also if you don't:
- Optimize GROUP BY query to retrieve latest row per user
Using this other great answer, I found a way to keep the same performance as with a distinct table by using LATERAL.
Using a new table test_boats, I can do something like this:
CREATE TABLE test_boats AS (select distinct on (mmsi) mmsi from core_message);
Creating this table takes 40+ seconds, which is similar to the time taken by the other answer here.
Then, with the help of LATERAL:
SELECT a.mmsi, b.time
FROM test_boats a
CROSS JOIN LATERAL (
    SELECT b.time
    FROM core_message b
    WHERE a.mmsi = b.mmsi
    ORDER BY b.time DESC
    LIMIT 1
) b LIMIT 10;
This is blazingly fast: around 1 millisecond. It requires modifying my program's logic and using a slightly more complex query, but I think I can live with that.
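The queries above are Postgres-specific. As a minimal sketch of the same "latest row per key" lookup, here is a SQLite version driven from Python: SQLite has no LATERAL, so a correlated subquery with the same ORDER BY ... LIMIT 1 shape stands in, and the tiny core_message/test_boats data is made up for illustration.

```python
import sqlite3

# In-memory stand-in for the Postgres setup; sample data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE core_message (mmsi INTEGER, time TEXT);
INSERT INTO core_message VALUES
    (1, '2020-01-01'), (1, '2020-03-01'),
    (2, '2020-02-01'), (2, '2020-02-15');
CREATE TABLE test_boats AS SELECT DISTINCT mmsi FROM core_message;
""")

# Correlated subquery plays the role of CROSS JOIN LATERAL ... LIMIT 1:
# for each boat, fetch only its most recent time.
rows = conn.execute("""
SELECT a.mmsi,
       (SELECT b.time FROM core_message b
        WHERE b.mmsi = a.mmsi
        ORDER BY b.time DESC LIMIT 1) AS time
FROM test_boats a
ORDER BY a.mmsi;
""").fetchall()
print(rows)  # [(1, '2020-03-01'), (2, '2020-02-15')]
```

The per-key LIMIT 1 lookup is what lets the planner descend an index on (mmsi, time DESC) once per boat instead of scanning all rows.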
For a fast solution without the need to create a new table, check out @ErwinBrandstetter's answer below.
UPDATE: I feel this question is not quite answered yet, as it's not very clear why the other proposed solutions perform poorly here.
I tried the benchmark mentioned here. At first, the DISTINCT ON approach seems fast enough if you run a query like the one proposed in the benchmark: about 30 ms on my computer. But that is because the query uses an index-only scan. If you include a field that is not in the index (some_column in the benchmark's case), performance drops to about 100 ms. Not a dramatic drop in performance yet.
That is why we need a benchmark with a bigger data set, something similar to my case: 40K customers and 8M rows. Here
Let's try the DISTINCT ON again with this new table:
SELECT DISTINCT ON (customer_id) id, customer_id, total
FROM purchases_more
ORDER BY customer_id, total DESC, id;
This takes about 1.5 seconds to complete.
SELECT DISTINCT ON (customer_id) *
FROM purchases_more
ORDER BY customer_id, total DESC, id;
This takes about 35 seconds to complete.
Now, to come back to my first solution above: it uses an index-only scan and a LIMIT, which is one reason why it is extremely fast. If I recraft that query to avoid the index-only scan and drop the limit:
SELECT b.*
FROM test_boats a
CROSS JOIN LATERAL (
    SELECT b.*
    FROM core_message b
    WHERE a.mmsi = b.mmsi
    ORDER BY b.time DESC
    LIMIT 1
) b;
This takes about 500 ms, which is still pretty fast.
For a more in-depth benchmark of sorts, see my other answer below.
How to select the last record of each ID
You can use the window function ROW_NUMBER. Here is a solution, given below. I have also made a demo query in DB Fiddle; please check the link: Demo Code in DB-Fiddle
WITH CTE AS (
    SELECT product, user_id,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY product DESC) AS RN
    FROM Mytable
)
SELECT product, user_id FROM CTE WHERE RN = 1;
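The ROW_NUMBER technique can be tried end-to-end with SQLite (window functions need SQLite 3.25+, which ships with recent Python builds); the Mytable sample rows here are made up:

```python
import sqlite3

# Made-up sample data mirroring the answer's hypothetical Mytable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Mytable (product TEXT, user_id INTEGER);
INSERT INTO Mytable VALUES ('a', 1), ('c', 1), ('b', 2);
""")

# Rank rows within each user_id, keep only the top-ranked row per user.
rows = conn.execute("""
WITH CTE AS (
    SELECT product, user_id,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY product DESC) AS rn
    FROM Mytable
)
SELECT product, user_id FROM CTE WHERE rn = 1
ORDER BY user_id;
""").fetchall()
print(rows)  # [('c', 1), ('b', 2)]
```

Swapping the ORDER BY inside OVER (e.g. to a timestamp column DESC) changes which row "wins" per partition.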
Get most recent row for given ID
Use the aggregate MAX(signin) grouped by id. This will list the most recent signin for each id.
SELECT
id,
MAX(signin) AS most_recent_signin
FROM tbl
GROUP BY id
To get the whole record, perform an INNER JOIN against a subquery which returns only the MAX(signin) per id.
SELECT
tbl.id,
signin,
signout
FROM tbl
INNER JOIN (
SELECT id, MAX(signin) AS maxsign FROM tbl GROUP BY id
) ms ON tbl.id = ms.id AND signin = maxsign
WHERE tbl.id=1
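As a runnable illustration of the join-on-MAX(signin) pattern (SQLite via Python, with made-up sample rows, and without the WHERE tbl.id=1 filter so all ids show):

```python
import sqlite3

# Invented sign-in/sign-out rows; id 1 has two sessions, id 2 has one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl (id INTEGER, signin TEXT, signout TEXT);
INSERT INTO tbl VALUES
    (1, '2020-01-01 08:00', '2020-01-01 17:00'),
    (1, '2020-01-02 08:30', '2020-01-02 17:30'),
    (2, '2020-01-01 09:00', '2020-01-01 18:00');
""")

# Join each row back to its group's MAX(signin) to keep the full record.
rows = conn.execute("""
SELECT tbl.id, tbl.signin, tbl.signout
FROM tbl
INNER JOIN (
    SELECT id, MAX(signin) AS maxsign FROM tbl GROUP BY id
) ms ON tbl.id = ms.id AND tbl.signin = ms.maxsign
ORDER BY tbl.id;
""").fetchall()
print(rows)
```

Note the usual caveat: if two rows share the same id and the same maximum signin, the join returns both.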
Retrieving the last record in each group - MySQL
MySQL 8.0 now supports windowing functions, like almost all popular SQL implementations. With this standard syntax, we can write greatest-n-per-group queries:
WITH ranked_messages AS (
SELECT m.*, ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
FROM messages AS m
)
SELECT * FROM ranked_messages WHERE rn = 1;
This and other approaches to finding groupwise maximal rows are illustrated in the MySQL manual.
Below is the original answer I wrote for this question in 2009:
I write the solution this way:
SELECT m1.*
FROM messages m1 LEFT JOIN messages m2
ON (m1.name = m2.name AND m1.id < m2.id)
WHERE m2.id IS NULL;
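Both the anti-join above and a GROUP BY/MAX join should return the same rows; here is a quick cross-check in SQLite with made-up messages data:

```python
import sqlite3

# Invented messages; 'alice' has two rows, so only her latest should survive.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE messages (id INTEGER PRIMARY KEY, name TEXT, body TEXT);
INSERT INTO messages (name, body) VALUES
    ('alice', 'first'), ('bob', 'hi'), ('alice', 'latest');
""")

# Anti-join: keep rows for which no same-name row with a larger id exists.
anti_join = conn.execute("""
SELECT m1.id, m1.name
FROM messages m1 LEFT JOIN messages m2
  ON m1.name = m2.name AND m1.id < m2.id
WHERE m2.id IS NULL
ORDER BY m1.name;
""").fetchall()

# GROUP BY/MAX variant for comparison.
group_by = conn.execute("""
SELECT m.id, m.name
FROM messages m
JOIN (SELECT name, MAX(id) AS maxid FROM messages GROUP BY name) g
  ON m.id = g.maxid
ORDER BY m.name;
""").fetchall()

print(anti_join)  # [(3, 'alice'), (2, 'bob')]
assert anti_join == group_by
```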
Regarding performance, one solution or the other can be better depending on the nature of your data, so you should test both queries and use whichever performs better on your database.
For example, I have a copy of the StackOverflow August data dump. I'll use that for benchmarking. There are 1,114,357 rows in the Posts
table. This is running on MySQL 5.0.75 on my Macbook Pro 2.40GHz.
I'll write a query to find the most recent post for a given user ID (mine).
First, using the technique shown by @Eric with the GROUP BY in a subquery:
SELECT p1.postid
FROM Posts p1
INNER JOIN (SELECT pi.owneruserid, MAX(pi.postid) AS maxpostid
FROM Posts pi GROUP BY pi.owneruserid) p2
ON (p1.postid = p2.maxpostid)
WHERE p1.owneruserid = 20860;
1 row in set (1 min 17.89 sec)
Even the EXPLAIN analysis takes over 16 seconds:
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 76756 | |
| 1 | PRIMARY | p1 | eq_ref | PRIMARY,PostId,OwnerUserId | PRIMARY | 8 | p2.maxpostid | 1 | Using where |
| 2 | DERIVED | pi | index | NULL | OwnerUserId | 8 | NULL | 1151268 | Using index |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
3 rows in set (16.09 sec)
Now produce the same query result using my technique with LEFT JOIN:
SELECT p1.postid
FROM Posts p1 LEFT JOIN posts p2
ON (p1.owneruserid = p2.owneruserid AND p1.postid < p2.postid)
WHERE p2.postid IS NULL AND p1.owneruserid = 20860;
1 row in set (0.28 sec)
The EXPLAIN analysis shows that both tables are able to use their indexes:
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
| 1 | SIMPLE | p1 | ref | OwnerUserId | OwnerUserId | 8 | const | 1384 | Using index |
| 1 | SIMPLE | p2 | ref | PRIMARY,PostId,OwnerUserId | OwnerUserId | 8 | const | 1384 | Using where; Using index; Not exists |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
2 rows in set (0.00 sec)
Here's the DDL for my Posts table:
CREATE TABLE `posts` (
`PostId` bigint(20) unsigned NOT NULL auto_increment,
`PostTypeId` bigint(20) unsigned NOT NULL,
`AcceptedAnswerId` bigint(20) unsigned default NULL,
`ParentId` bigint(20) unsigned default NULL,
`CreationDate` datetime NOT NULL,
`Score` int(11) NOT NULL default '0',
`ViewCount` int(11) NOT NULL default '0',
`Body` text NOT NULL,
`OwnerUserId` bigint(20) unsigned NOT NULL,
`OwnerDisplayName` varchar(40) default NULL,
`LastEditorUserId` bigint(20) unsigned default NULL,
`LastEditDate` datetime default NULL,
`LastActivityDate` datetime default NULL,
`Title` varchar(250) NOT NULL default '',
`Tags` varchar(150) NOT NULL default '',
`AnswerCount` int(11) NOT NULL default '0',
`CommentCount` int(11) NOT NULL default '0',
`FavoriteCount` int(11) NOT NULL default '0',
`ClosedDate` datetime default NULL,
PRIMARY KEY (`PostId`),
UNIQUE KEY `PostId` (`PostId`),
KEY `PostTypeId` (`PostTypeId`),
KEY `AcceptedAnswerId` (`AcceptedAnswerId`),
KEY `OwnerUserId` (`OwnerUserId`),
KEY `LastEditorUserId` (`LastEditorUserId`),
KEY `ParentId` (`ParentId`),
CONSTRAINT `posts_ibfk_1` FOREIGN KEY (`PostTypeId`) REFERENCES `posttypes` (`PostTypeId`)
) ENGINE=InnoDB;
Note to commenters: If you want another benchmark with a different version of MySQL, a different dataset, or different table design, feel free to do it yourself. I have shown the technique above. Stack Overflow is here to show you how to do software development work, not to do all the work for you.
select rows in sql with latest date for each ID repeated multiple times
This question has been asked before. Please see this question.
Using the accepted answer and adapting it to your problem you get:
SELECT tt.*
FROM myTable tt
INNER JOIN
(SELECT ID, MAX(Date) AS MaxDateTime
FROM myTable
GROUP BY ID) groupedtt
ON tt.ID = groupedtt.ID
AND tt.Date = groupedtt.MaxDateTime
Get value of each latest record grouped by ID
Do you just want DISTINCT ON again?
SELECT DISTINCT ON (p.ID) p.ID, r.*
FROM (SELECT DISTINCT ON (r.ID) r.* FROM records r
) r CROSS JOIN LATERAL
(SELECT p.*
FROM points p
ORDER BY r.position <-> p.geom
LIMIT 1
) p
WHERE r.field1 = p.field1 AND r.field2 = p.field2
ORDER BY p.ID, r.timestamp DESC;
I cannot figure out what you intend by:
(SELECT DISTINCT ON (ID) *
FROM records
)
At a minimum, you should have an ORDER BY:
(SELECT DISTINCT ON (ID) *
FROM records
ORDER BY ID
)
However, your sample data and the name ID suggest that there are no duplicates, so the DISTINCT ON may not be necessary.
Scalable Solution to get latest row for each ID in BigQuery
Quick and dirty option: combine both of your queries into one. First get all records with the latest collection_time (using your second query), then dedupe them using your first query:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY tab1.ID) AS rn
FROM (
SELECT tab1.*
FROM mytable AS tab1
INNER JOIN (
SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP BY ID
) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time
)
)
WHERE rn = 1
And with standard SQL (as proposed by S.Mohsen sh):
WITH myTable AS (
SELECT 1 AS ID, 1 AS collection_time
),
tab1 AS (
SELECT ID,
MAX(collection_time) AS second_time
FROM myTable GROUP BY ID
),
tab2 AS (
SELECT * FROM myTable
),
joint AS (
SELECT tab2.*
FROM tab2 INNER JOIN tab1
ON tab2.ID=tab1.ID AND tab2.collection_time=tab1.second_time
)
SELECT * EXCEPT(rn)
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ID) AS rn
FROM joint
)
WHERE rn=1
SQL: Get last row for each unique id
Use row_number():
select ccs.*
from (select ccs.*,
row_number() over (partition by ccs.CoatingChambersID order by ccs.LastDt desc) as seqnum
from [REO].[dbo].[CoatingChamberStateLogs] ccs
) ccs
where seqnum = 1;
The use of select distinct with group by is almost never correct.
Another fun way to write the query doesn't require a subquery:
select top (1) with ties ccs.*
from [REO].[dbo].[CoatingChamberStateLogs] ccs
order by row_number() over (partition by ccs.CoatingChambersID order by ccs.LastDt desc);
How to get a latest row for each ID based on a timestamp
a data.table approach
sample data
DT <- fread('ID Value1 Value2 Value3 Value4 Time
1 1 7 13 19 "2013-11-15 21:12:03:337"
1 2 8 14 20 "2013-12-23 15:12:01:227"
2 3 9 15 21 "2014-12-07 14:37:01:127"
2 4 10 16 22 "2013-12-12 05:23:01:239"
3 5 11 17 23 "2011-12-25 15:12:01:227"
3 6 12 18 24 "2011-12-25 15:12:02:227"', quote = "\"")
code
# first, fix the milliseconds by replacing the last : with a .
DT[, Time := gsub( "(.*)(:)([0-9]*$)", "\\1.\\3", Time)]
#now convert to POSIXct
DT[, Time := as.POSIXct( Time, format = " %Y-%m-%d %H:%M:%OS")]
#now, pull the max Time per group
DT[DT[, .I[which.max(Time)], by=ID]$V1]
output
# ID Value1 Value2 Value3 Value4 Time
# 1: 1 2 8 14 20 2013-12-23 15:12:01
# 2: 2 3 9 15 21 2014-12-07 14:37:01
# 3: 3 6 12 18 24 2011-12-25 15:12:02
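The same max-Time-per-ID idea in plain Python (standard library only), using simplified copies of the sample timestamps:

```python
# Rows mirror the sample data above (ID, Time), with times simplified
# to whole seconds; ISO-formatted timestamps compare correctly as strings.
rows = [
    (1, "2013-11-15 21:12:03"), (1, "2013-12-23 15:12:01"),
    (2, "2014-12-07 14:37:01"), (2, "2013-12-12 05:23:01"),
    (3, "2011-12-25 15:12:01"), (3, "2011-12-25 15:12:02"),
]

# One pass: remember the largest Time seen so far for each ID.
latest = {}
for rid, t in rows:
    if rid not in latest or t > latest[rid]:
        latest[rid] = t

print(sorted(latest.items()))
# [(1, '2013-12-23 15:12:01'), (2, '2014-12-07 14:37:01'), (3, '2011-12-25 15:12:02')]
```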