Efficient latest record query with PostgreSQL
If you don't want to change your data model, you can use DISTINCT ON
to fetch the newest record from table "b" for each entry in "a":
SELECT DISTINCT ON (a.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY a.id, b.date DESC
If you want to avoid a sort step in the query, adding an index like this might help, but I am not sure:
CREATE INDEX b_id_date ON b (id, date DESC)
SELECT DISTINCT ON (b.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY b.id, b.date DESC
Alternatively, if you want to sort records from table "a" some way:
SELECT DISTINCT ON (sort_column, a.id) *
FROM a
INNER JOIN b ON a.id=b.id
ORDER BY sort_column, a.id, b.date DESC
Alternative approaches
However, all of the above queries still need to read all referenced rows from table "b", so if you have lots of data, they might still be too slow.
You could create a new table, which only holds the newest "b" record for each a.id
-- or even move those columns into the "a" table itself.
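One way to maintain such a "latest" table is a trigger that upserts on every insert into "b". This is only a sketch: the table latest_b and its columns are illustrative, and it assumes Postgres 9.5+ for ON CONFLICT and 11+ for EXECUTE FUNCTION (use EXECUTE PROCEDURE on older versions):

```sql
-- Hypothetical table holding only the newest "b" row per id:
CREATE TABLE latest_b (
  id   int PRIMARY KEY REFERENCES a,
  date timestamptz NOT NULL
  -- ... whatever other columns of "b" you need
);

CREATE FUNCTION upsert_latest_b() RETURNS trigger AS $$
BEGIN
  INSERT INTO latest_b (id, date)
  VALUES (NEW.id, NEW.date)
  ON CONFLICT (id) DO UPDATE
  SET    date = EXCLUDED.date
  WHERE  latest_b.date < EXCLUDED.date;  -- only replace with newer rows
  RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_latest_b
AFTER INSERT ON b
FOR EACH ROW EXECUTE FUNCTION upsert_latest_b();
```

Reads then become a trivial join against latest_b, at the cost of a little extra work on every write.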
Get last record of a table in Postgres
If by "last record" you mean the record with the latest timestamp value, then try this:
my_query = client.query("
    SELECT timestamp, value, card
    FROM my_table
    ORDER BY timestamp DESC
    LIMIT 1
");
Optimize GROUP BY query to retrieve latest row per user
For best read performance you need a multicolumn index:
CREATE INDEX log_combo_idx
ON log (user_id, log_date DESC NULLS LAST);
To make index-only scans possible, add the otherwise not needed column payload in a covering index with the INCLUDE clause (Postgres 11 or later):
CREATE INDEX log_combo_covering_idx
ON log (user_id, log_date DESC NULLS LAST) INCLUDE (payload);
See:
- Do covering indexes in PostgreSQL help JOIN columns?
Fallback for older versions:
CREATE INDEX log_combo_covering_idx
ON log (user_id, log_date DESC NULLS LAST, payload);
Why DESC NULLS LAST?
- Unused index in range of dates query
For few rows per user_id, or small tables, DISTINCT ON is typically fastest and simplest:
- Select first row in each GROUP BY group?
For many rows per user_id, an index skip scan (or loose index scan) is (much) more efficient. That's not implemented up to Postgres 12 - work is ongoing for Postgres 14. But there are ways to emulate it efficiently.
Common Table Expressions require Postgres 8.4+. LATERAL requires Postgres 9.3+.
The following solutions go beyond what's covered in the Postgres Wiki.
1. No separate table with unique users
With a separate users table, the solutions in 2. below are typically simpler and faster. Skip ahead.
1a. Recursive CTE with LATERAL join
WITH RECURSIVE cte AS (
   (                        -- parentheses required
   SELECT user_id, log_date, payload
   FROM   log
   WHERE  log_date <= :mydate
   ORDER  BY user_id, log_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT l.user_id, l.log_date, l.payload
      FROM   log l
      WHERE  l.user_id > c.user_id        -- lateral reference
      AND    l.log_date <= :mydate        -- repeat condition
      ORDER  BY l.user_id, l.log_date DESC NULLS LAST
      LIMIT  1
      ) l
   )
TABLE  cte
ORDER  BY user_id;
This makes it simple to retrieve arbitrary columns, and is probably best in current Postgres. More explanation in chapter 2a. below.
1b. Recursive CTE with correlated subquery
WITH RECURSIVE cte AS (
   (                          -- parentheses required
   SELECT l AS my_row         -- whole row
   FROM   log l
   WHERE  log_date <= :mydate
   ORDER  BY user_id, log_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT l           -- whole row
           FROM   log l
           WHERE  l.user_id > (c.my_row).user_id
           AND    l.log_date <= :mydate  -- repeat condition
           ORDER  BY l.user_id, l.log_date DESC NULLS LAST
           LIMIT  1)
   FROM   cte c
   WHERE  (c.my_row).user_id IS NOT NULL  -- note parentheses
   )
SELECT (my_row).*  -- decompose row
FROM   cte
WHERE  (my_row).user_id IS NOT NULL
ORDER  BY (my_row).user_id;
Convenient to retrieve a single column or the whole row. The example uses the whole row type of the table. Other variants are possible.
To assert a row was found in the previous iteration, test a single NOT NULL column (like the primary key).
More explanation for this query in chapter 2b. below.
Related:
- Query last N related rows per row
- GROUP BY one column, while sorting by another in PostgreSQL
2. With separate users table
Table layout hardly matters as long as exactly one row per relevant user_id is guaranteed. Example:
CREATE TABLE users (
user_id serial PRIMARY KEY
, username text NOT NULL
);
Ideally, the table is physically sorted in sync with the log table. See:
- Optimize Postgres timestamp query range
Or it's small enough (low cardinality) that it hardly matters. Else, sorting rows in the query can help to further optimize performance. See Gang Liang's addition. If the physical sort order of the users table happens to match the index on log, this may be irrelevant.
2a. LATERAL join
SELECT u.user_id, l.log_date, l.payload
FROM   users u
CROSS  JOIN LATERAL (
   SELECT l.log_date, l.payload
   FROM   log l
   WHERE  l.user_id = u.user_id  -- lateral reference
   AND    l.log_date <= :mydate
   ORDER  BY l.log_date DESC NULLS LAST
   LIMIT  1
   ) l;
JOIN LATERAL allows you to reference preceding FROM items on the same query level. See:
- What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Results in one index (-only) look-up per user.
Returns no row for users missing in the users table. Typically, a foreign key constraint enforcing referential integrity would rule that out.
Also, no row for users without a matching entry in log - conforming to the original question. To keep those users in the result, use LEFT JOIN LATERAL ... ON true instead of CROSS JOIN LATERAL:
- Call a set-returning function with an array argument multiple times
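For illustration, the LEFT JOIN LATERAL variant might look like this (a sketch based on the query in 2a. above):

```sql
SELECT u.user_id, l.log_date, l.payload
FROM   users u
LEFT   JOIN LATERAL (
   SELECT l.log_date, l.payload
   FROM   log l
   WHERE  l.user_id = u.user_id
   AND    l.log_date <= :mydate
   ORDER  BY l.log_date DESC NULLS LAST
   LIMIT  1
   ) l ON true;  -- users without log entries get NULL for log_date, payload
```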
Use LIMIT n instead of LIMIT 1 to retrieve more than one row (but not all) per user.
Effectively, all of these do the same:
JOIN LATERAL ... ON true
CROSS JOIN LATERAL ...
, LATERAL ...
The last one has lower priority, though: explicit JOIN binds before the comma. That subtle difference can matter with more join tables. See:
- "invalid reference to FROM-clause entry for table" in Postgres query
2b. Correlated subquery
Good choice to retrieve a single column from a single row. Code example:
- Optimize groupwise maximum query
The same is possible for multiple columns, but you need more smarts:
CREATE TEMP TABLE combo (log_date date, payload int);

SELECT user_id, (combo1).*  -- note parentheses
FROM  (
   SELECT u.user_id
        , (SELECT (l.log_date, l.payload)::combo
           FROM   log l
           WHERE  l.user_id = u.user_id
           AND    l.log_date <= :mydate
           ORDER  BY l.log_date DESC NULLS LAST
           LIMIT  1) AS combo1
   FROM   users u
   ) sub;
Like LEFT JOIN LATERAL above, this variant includes all users, even those without entries in log. You get NULL for combo1, which you can easily filter with a WHERE clause in the outer query if need be.
Nitpick: in the outer query you can't distinguish whether the subquery didn't find a row or all column values happen to be NULL - same result. You need a NOT NULL column in the subquery to avoid this ambiguity.
A correlated subquery can only return a single value. You can wrap multiple columns into a composite type. But to decompose it later, Postgres demands a well-known composite type. Anonymous records can only be decomposed providing a column definition list.
Use a registered type like the row type of an existing table. Or register a composite type explicitly (and permanently) with CREATE TYPE. Or create a temporary table (dropped automatically at end of session) to register its row type temporarily. Cast syntax: (log_date, payload)::combo
Finally, we do not want to decompose combo1 on the same query level. Due to a weakness in the query planner, this would evaluate the subquery once for each column (still true in Postgres 12). Instead, make it a subquery and decompose in the outer query.
Related:
- Get values from first and last row per group
Demonstrating all 4 queries with 100k log entries and 1k users:
db<>fiddle here - pg 11
Old sqlfiddle
How to get First and Last record from a sql query?
[Caveat: Might not be the most efficient way to do it]:
(SELECT <some columns>
 FROM mytable
 <maybe some joins here>
 WHERE <various conditions>
 ORDER BY date DESC
 LIMIT 1)
UNION ALL
(SELECT <some columns>
 FROM mytable
 <maybe some joins here>
 WHERE <various conditions>
 ORDER BY date ASC
 LIMIT 1)
PostgreSQL extract last row for each id
The most efficient way is to use Postgres' distinct on operator:
select distinct on (id) id, date, another_info
from the_table
order by id, date desc;
If you want a solution that works across databases (but is less efficient) you can use a window function:
select id, date, another_info
from (
  select id, date, another_info,
         row_number() over (partition by id order by date desc) as rn
  from the_table
) t
where rn = 1
order by id;
The solution with a window function is in most cases faster than using a sub-query.
Select physically last record without ORDER BY
to do a physically last record selection, you should use ctid
- the tuple id, to get the last one - just select max(ctid). smth like:
t=# select ctid,* from t order by ctid desc limit 1;
ctid | t
--------+-------------------------------
(5,50) | 2017-06-13 11:41:04.894666+00
(1 row)
And to do it without order by:
t=# select t from t where ctid = (select max(ctid) from t);
t
-------------------------------
2017-06-13 11:41:04.894666+00
(1 row)
It's worth knowing that you can find the max ctid only with a sequential scan, so checking the physically last row will be costly on large data sets.