Optimize query with OFFSET on large table
A large OFFSET is always going to be slow. Postgres has to order all rows and count the visible ones up to your offset. To skip all previous rows directly you could add an indexed row_number to the table (or create a MATERIALIZED VIEW including said row_number) and work with WHERE row_number > x instead of OFFSET x.
However, this approach is only sensible for read-only (or mostly read-only) data. Implementing the same for table data that can change concurrently is more challenging. You need to start by defining the desired behavior exactly.
I suggest a different approach for pagination:
SELECT *
FROM big_table
WHERE (vote, id) > (vote_x, id_x) -- ROW values
ORDER BY vote, id -- needs to be deterministic
LIMIT n;
Where vote_x and id_x are from the last row of the previous page (for both DESC and ASC). Or from the first row if navigating backwards.
Comparing row values is supported by the index you already have. The feature complies with the ISO SQL standard, but not every RDBMS supports it.
CREATE INDEX vote_order_asc ON big_table (vote, id);
Or for descending order:
SELECT *
FROM big_table
WHERE (vote, id) < (vote_x, id_x) -- ROW values
ORDER BY vote DESC, id DESC
LIMIT n;
This query can use the same index.
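If it helps, here is a runnable sketch of this keyset pagination in Python on SQLite, which (like PostgreSQL) supports row-value comparison. Table and column names follow the example above; the data is made up:

```python
import sqlite3

# Keyset pagination sketch. SQLite 3.15+ supports row-value comparison,
# as PostgreSQL does. Table/column names follow the answer; data is made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, vote INTEGER)")
conn.executemany("INSERT INTO big_table VALUES (?, ?)",
                 [(i, i % 5) for i in range(1, 101)])
conn.execute("CREATE INDEX vote_order_asc ON big_table (vote, id)")

def next_page(vote_x, id_x, n):
    """Rows strictly after (vote_x, id_x) in (vote, id) order."""
    return conn.execute(
        """SELECT id, vote
           FROM   big_table
           WHERE  (vote, id) > (?, ?)   -- ROW values
           ORDER  BY vote, id           -- needs to be deterministic
           LIMIT  ?""",
        (vote_x, id_x, n)).fetchall()

rows, page = [], next_page(-1, -1, 10)   # start "before" everything
while page:
    rows.extend(page)
    last_id, last_vote = page[-1]        # last row drives the next page
    page = next_page(last_vote, last_id, 10)
```

Each page is fetched with a plain index range scan; no rows before (vote_x, id_x) are ever visited.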
I suggest you declare your columns NOT NULL, or acquaint yourself with the NULLS FIRST | LAST construct:
- PostgreSQL sort by datetime asc, null first?
Note two things in particular:

The ROW values in the WHERE clause cannot be replaced with separated member fields. WHERE (vote, id) > (vote_x, id_x) cannot be replaced with:

WHERE vote >= vote_x
AND id > id_x

That would rule out all rows with id <= id_x, while we only want to do that for the same vote and not for the next one. The correct translation would be:

WHERE (vote = vote_x AND id > id_x) OR vote > vote_x

... which doesn't play along with indexes as nicely, and gets increasingly complicated for more columns.

It would be simple for a single column, obviously. That's the special case I mentioned at the outset.
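To make the difference concrete, here is a small check in Python on SQLite (made-up data; vote_x and id_x as in the answer) comparing the row-value form, its correct spelled-out translation, and the wrong separated-fields version:

```python
import sqlite3

# Compare the row-value predicate with its translations. Made-up data:
# votes 0..2, ids 1..5 for each vote.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER, vote INTEGER)")
conn.executemany("INSERT INTO big_table VALUES (?, ?)",
                 [(i, v) for v in range(3) for i in range(1, 6)])

vote_x, id_x = 1, 3

row_value = conn.execute(
    "SELECT vote, id FROM big_table WHERE (vote, id) > (?, ?) "
    "ORDER BY vote, id", (vote_x, id_x)).fetchall()

# Correct spelled-out translation.
spelled_out = conn.execute(
    "SELECT vote, id FROM big_table "
    "WHERE (vote = ? AND id > ?) OR vote > ? ORDER BY vote, id",
    (vote_x, id_x, vote_x)).fetchall()

# Wrong translation: drops rows like (vote=2, id<=3) that should be kept.
wrong = conn.execute(
    "SELECT vote, id FROM big_table WHERE vote >= ? AND id > ? "
    "ORDER BY vote, id", (vote_x, id_x)).fetchall()
```

Here row_value and spelled_out return the same seven rows, while the wrong version silently loses (2, 1), (2, 2) and (2, 3).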
The technique does not work for mixed directions in ORDER BY, like:

ORDER BY vote ASC, id DESC

At least I can't think of a generic way to implement this as efficiently. If at least one of the two columns is a numeric type, you could use a functional index with an inverted value on (vote, (id * -1)) and use the same expression in ORDER BY:

ORDER BY vote ASC, (id * -1) ASC
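A sketch of the inverted-value trick in Python on SQLite (made-up data): ordering by (vote, id * -1) ascending is the same as vote ASC, id DESC, so the row-value comparison applies again.

```python
import sqlite3

# Mixed-direction pagination via an inverted value: (vote, id * -1) ASC
# is equivalent to vote ASC, id DESC. Made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER, vote INTEGER)")
conn.executemany("INSERT INTO big_table VALUES (?, ?)",
                 [(i, v) for v in range(3) for i in range(1, 6)])

def next_page(last_vote, last_neg_id, n):
    """Rows strictly after (last_vote, last_neg_id) in inverted order."""
    return conn.execute(
        """SELECT vote, id FROM big_table
           WHERE  (vote, id * -1) > (?, ?)
           ORDER  BY vote ASC, (id * -1) ASC
           LIMIT  ?""",
        (last_vote, last_neg_id, n)).fetchall()

rows, page = [], next_page(-1, -999, 4)   # start before everything
while page:
    rows.extend(page)
    last_vote, last_id = page[-1]
    page = next_page(last_vote, -last_id, 4)  # note the inverted id
```

The result walks vote ascending while id descends within each vote, without any OFFSET.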
Related:
- SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
- Improve performance for order by with columns from many tables
Note in particular the presentation by Markus Winand I linked to:
- "Pagination done the PostgreSQL way"
How to optimize limit offset when I join multiple tables?
First of all, offsets and limits are unpredictable unless you include ORDER BY clauses in your query. Without ORDER BY, your SQL server is allowed to return result rows in any order it chooses.

Second, large offsets and small limits are a notorious query-performance antipattern. There's not much you can do to make the problem go away.
To get decent performance, it's helpful to rethink why you want to use this kind of access pattern, and then try to use WHERE
filters on some indexed column value.
For example, let's say you're doing this kind of thing.
select a.user_id, b.user_email, c.user_account
from table1 a
left join table2 b on a.user_id = b.user_id
left join table3 c on b.account_id = c.account_id
limit whatever
Let's say you're paginating the query so you get fifty users at a time. Then you can start with a last_seen_user_id variable in your program, initialized to -1.
Your query looks like this:
select a.user_id, b.user_email, c.user_account
from (
select user_id
from table1
where user_id > ?last_seen_user_id?
order by user_id
limit 50
) u
join table1 a on u.user_id = a.user_id
left join table2 b on a.user_id = b.user_id
left join table3 c on b.account_id = c.account_id
order by a.user_id
Then, when you retrieve that result, set your last_seen_user_id to the value from the last row in the result. Run the query again to get the next fifty users. If table1.user_id is a primary key or a unique index, this will be fast.
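The whole loop might look like this in Python on SQLite (stand-in tables and made-up data, with table and column names taken from the query above):

```python
import sqlite3

# Driver loop for the last_seen_user_id pattern. SQLite stand-ins for the
# three tables; names follow the answer, data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (user_id INTEGER PRIMARY KEY);
CREATE TABLE table2 (user_id INTEGER, user_email TEXT, account_id INTEGER);
CREATE TABLE table3 (account_id INTEGER, user_account TEXT);
""")
conn.executemany("INSERT INTO table1 VALUES (?)",
                 [(i,) for i in range(1, 201)])
conn.executemany("INSERT INTO table2 VALUES (?, ?, ?)",
                 [(i, f"u{i}@example.com", i) for i in range(1, 201)])
conn.executemany("INSERT INTO table3 VALUES (?, ?)",
                 [(i, f"acct{i}") for i in range(1, 201)])

last_seen_user_id = -1
users = []
while True:
    page = conn.execute("""
        SELECT a.user_id, b.user_email, c.user_account
        FROM (SELECT user_id FROM table1
              WHERE user_id > ? ORDER BY user_id LIMIT 50) u
        JOIN table1 a ON u.user_id = a.user_id
        LEFT JOIN table2 b ON a.user_id = b.user_id
        LEFT JOIN table3 c ON b.account_id = c.account_id
        ORDER BY a.user_id""", (last_seen_user_id,)).fetchall()
    if not page:
        break
    users.extend(page)
    last_seen_user_id = page[-1][0]   # last row's user_id drives next page
```

The inner subquery only ever scans fifty index entries, no matter how deep into the result set you are.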
How can I speed up a MySQL query with a large offset in the LIMIT clause?
Perhaps you could create an indexing table which provides a sequential key relating to the key in your target table. Then you can join this indexing table to your target table and use a where clause to more efficiently get the rows you want.
#create table to store sequences
CREATE TABLE seq (
seq_no int not null auto_increment,
id int not null,
primary key(seq_no),
unique(id)
);
#create the sequence
TRUNCATE seq;
INSERT INTO seq (id) SELECT id FROM mytable ORDER BY id;
#now get 1000 rows from offset 1000000
SELECT mytable.*
FROM mytable
INNER JOIN seq USING(id)
WHERE seq.seq_no BETWEEN 1000000 AND 1000999;
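A runnable version of this sequence-table idea in Python on SQLite (made-up data with gaps in the ids, smaller numbers for brevity):

```python
import sqlite3

# Sequence-table trick: a dense seq_no maps positions to (gappy) ids.
# Made-up data: ids 3, 6, ..., 3000, so there are gaps as after deletes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)",
                 [(i * 3, f"row{i}") for i in range(1, 1001)])

# Build the sequence: seq_no is dense 1..N in id order.
conn.executescript("""
CREATE TABLE seq (
  seq_no INTEGER PRIMARY KEY AUTOINCREMENT,
  id INTEGER NOT NULL UNIQUE
);
INSERT INTO seq (id) SELECT id FROM mytable ORDER BY id;
""")

# "Offset 100, limit 10" becomes a range scan on seq_no.
rows = conn.execute("""
    SELECT mytable.*
    FROM mytable
    JOIN seq USING (id)
    WHERE seq.seq_no BETWEEN 101 AND 110
    ORDER BY seq.seq_no""").fetchall()
```

The trade-off, as the answer implies, is that seq must be rebuilt (or maintained) whenever mytable changes.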
Why does MYSQL higher LIMIT offset slow the query down?
It's normal that higher offsets slow the query down, since the query needs to count off the first OFFSET + LIMIT records (and take only LIMIT of them). The higher this value is, the longer the query runs.

The query cannot jump right to OFFSET because, first, the records can be of different lengths and, second, there can be gaps from deleted records. It needs to check and count each record on its way.
Assuming that id is the primary key of a MyISAM table, or a unique non-primary key field on an InnoDB table, you can speed it up by using this trick:
SELECT t.*
FROM (
SELECT id
FROM mytable
ORDER BY
id
LIMIT 10000, 30
) q
JOIN mytable t
ON t.id = q.id
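A small demonstration of the late-row-lookup shape in Python on SQLite (made-up data; SQLite is used here only to show the query structure, the performance benefit applies to MySQL as described):

```python
import sqlite3

# Late row lookup: page through ids only in the subquery, then join back
# for the full rows. Made-up data; ids 1..1000 with a bulky payload.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)",
                 [(i, "x" * 100) for i in range(1, 1001)])

rows = conn.execute("""
    SELECT t.*
    FROM (
        SELECT id
        FROM mytable
        ORDER BY id
        LIMIT 30 OFFSET 100
    ) q
    JOIN mytable t ON t.id = q.id
    ORDER BY t.id""").fetchall()
```

The OFFSET still walks 100 entries, but only in the narrow id index; the wide rows are fetched just for the final 30 ids.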
See this article:
- MySQL ORDER BY / LIMIT performance: late row lookups
MySQL - Slow performance for limit offset on json column
When MySQL uses OFFSET, it cannot just skip to row 216204. MySQL indexes are ordered by value, not by row number, so it must actually examine all those rows, which means reading the pages on which the rows are stored, loading them from disk into RAM if necessary.
When you only reference t.id, it can scan along the index. I'll assume id is indexed (maybe even the primary key) and that it's an integer (4 bytes) or bigint (8 bytes). Either way, a lot of those entries fit in a single InnoDB page (pages are a fixed 16KB each).
Whereas the JSON column is like TEXT or BLOB or VARCHAR, in that it is probably stored on additional pages, and it's much more bulky than a single integer. So it takes more pages. If the JSON documents are especially large, it may even take many pages per document.
So that's a lot of storage I/O to load all those JSON documents. It is many times the I/O work to load 216,000 JSON documents, compared to only referencing the primary key. It might be the case that the primary key pages are already cached in RAM anyway, so there is little or no I/O needed to read them.
It might be even worse with AWS Aurora than traditional MySQL, because Aurora uses distributed storage and replicated buffer pools, so there's additional overhead to both the I/O and the loading into RAM.
What can you do instead?
Stop using large values in your OFFSET
. Search by values instead.
SELECT t.json_column
FROM table t
WHERE t.id >= 208000
LIMIT 501;
This is quick because the index helps skip directly to the row where id = 208000 (or whatever value you search for) without examining all the preceding rows. It only needs to examine rows starting there, and it stops examining rows once it has found enough for your LIMIT.
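Here is the OFFSET form next to the seek form in Python on SQLite (made-up data with dense ids, so the two happen to return identical pages; in general the seek form defines pages by value rather than by position, which is exactly what makes it index-friendly):

```python
import sqlite3

# OFFSET vs. seek on a table with a bulky JSON-ish column. Made-up data
# with dense ids 1..1000, so both forms return the same page here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, json_column TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, '{"n": %d}' % i) for i in range(1, 1001)])

# OFFSET: counts off 500 rows before returning anything.
offset_rows = conn.execute(
    "SELECT json_column FROM t ORDER BY id LIMIT 5 OFFSET 500").fetchall()

# Seek: jumps straight to id >= 501 via the primary key index.
seek_rows = conn.execute(
    "SELECT json_column FROM t WHERE id >= 501 ORDER BY id LIMIT 5").fetchall()
```

Only the seek form avoids touching the 500 skipped rows and their JSON pages.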