Optimize Query With Offset on Large Table

A large OFFSET is always going to be slow. Postgres has to order all rows and count the visible ones up to your offset. To skip all previous rows directly you could add an indexed row_number to the table (or create a MATERIALIZED VIEW including said row_number) and work with WHERE row_number > x instead of OFFSET x.

However, this approach is only sensible for read-only (or mostly read-only) data. Implementing the same for table data that can change concurrently is more challenging; you need to start by defining the desired behavior exactly.
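A minimal sketch of the materialized-view variant, assuming the data is mostly static (the view name and the page size of 50 are assumptions):

CREATE MATERIALIZED VIEW big_table_numbered AS
SELECT row_number() OVER (ORDER BY vote, id) AS rn, *
FROM   big_table;

CREATE UNIQUE INDEX ON big_table_numbered (rn);

-- "page 3" at 50 rows per page: skip 100 rows via the index instead of OFFSET
SELECT *
FROM   big_table_numbered
WHERE  rn > 100
ORDER  BY rn
LIMIT  50;

Keep in mind that the view (and its row numbers) only changes on REFRESH MATERIALIZED VIEW.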

I suggest a different approach for pagination:

SELECT *
FROM big_table
WHERE (vote, id) > (vote_x, id_x) -- ROW values
ORDER BY vote, id -- needs to be deterministic
LIMIT n;

Where vote_x and id_x are from the last row of the previous page (for both DESC and ASC). Or from the first if navigating backwards.
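For illustration, the first page simply omits the WHERE clause, and the last row returned supplies (vote_x, id_x) for the next call (the page size of 50 is an assumption):

-- first page: no starting point yet
SELECT *
FROM   big_table
ORDER  BY vote, id
LIMIT  50;

-- next page: plug the vote and id of the last row above into the keyset query
SELECT *
FROM   big_table
WHERE  (vote, id) > (vote_x, id_x)
ORDER  BY vote, id
LIMIT  50;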

This row-value comparison is supported by a plain multicolumn index - the feature is part of the ISO SQL standard, but not every RDBMS supports it:

CREATE INDEX vote_order_asc ON big_table (vote, id);

Or for descending order:

SELECT *
FROM big_table
WHERE (vote, id) < (vote_x, id_x) -- ROW values
ORDER BY vote DESC, id DESC
LIMIT n;

Can use the same index.

I suggest you declare your columns NOT NULL or acquaint yourself with the NULLS FIRST|LAST construct:

  • PostgreSQL sort by datetime asc, null first?
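If vote is nullable, the sort option has to match between the index and the query. A minimal sketch, assuming NULL votes should sort last (the index name is an assumption):

CREATE INDEX vote_order_asc_nl ON big_table (vote ASC NULLS LAST, id);

SELECT *
FROM   big_table
WHERE  (vote, id) > (vote_x, id_x)  -- note: rows with a NULL vote never satisfy this row comparison
ORDER  BY vote ASC NULLS LAST, id
LIMIT  n;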

Note two things in particular:

  1. The ROW values in the WHERE clause cannot be replaced with comparisons on the separate member fields. WHERE (vote, id) > (vote_x, id_x) cannot be replaced with:

    WHERE  vote >= vote_x
    AND id > id_x

    That would rule out all rows with id <= id_x, while we only want to exclude those within the same vote, not for rows with a higher vote. The correct translation would be:

    WHERE (vote = vote_x AND id > id_x) OR vote > vote_x

    ... which doesn't play along with indexes as nicely, and gets increasingly complicated for more columns.

    For a single column this would be simple, obviously.

  2. The technique does not work for mixed directions in ORDER BY like:

    ORDER  BY vote ASC, id DESC

    At least I can't think of a generic way to implement this as efficiently. If at least one of the two columns is a numeric type, you could use a functional index with an inverted value on (vote, (id * -1)) - and use the same expression in ORDER BY:

    ORDER  BY vote ASC, (id * -1) ASC
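
    A fuller sketch of this workaround, assuming id is numeric (the index name is an assumption):

    CREATE INDEX vote_asc_id_desc ON big_table (vote, (id * -1));

    SELECT *
    FROM   big_table
    WHERE  (vote, (id * -1)) > (vote_x, (id_x * -1))  -- same trick in the row comparison
    ORDER  BY vote ASC, (id * -1) ASC
    LIMIT  n;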

Related:

  • SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
  • Improve performance for order by with columns from many tables

Note in particular the presentation by Markus Winand I linked to:

  • "Pagination done the PostgreSQL way"

How to optimize limit offset when I join multiple tables?

First of all, offsets and limits are unpredictable unless you include an ORDER BY clause in your query. Without ORDER BY, the server is allowed to return result rows in any order it chooses.

Second, large offsets with small limits are a notorious query-performance antipattern. There's not much you can do to make the problem go away entirely.

To get decent performance, it's helpful to rethink why you want to use this kind of access pattern, and then try to use WHERE filters on some indexed column value.

For example, let's say you're doing this kind of thing.

select a.user_id, b.user_email, c.user_account
from table1 a
left join table2 b on a.user_id = b.user_id
left join table3 c on b.account_id = c.account_id
limit whatever

Let's say you're paginating the query so you get fifty users at a time. Then you can start with a last_seen_user_id variable in your program, initialized to -1.

Your query looks like this:

select a.user_id, b.user_email, c.user_account
from (
    select user_id
    from table1
    where user_id > ?last_seen_user_id?
    order by user_id
    limit 50
) u
join table1 a on u.user_id = a.user_id
left join table2 b on a.user_id = b.user_id
left join table3 c on b.account_id = c.account_id
order by a.user_id

Then, when you retrieve that result, set your last_seen_user_id to the value from the last row in the result.

Run the query again to get the next fifty users. If table1.user_id is a primary key or a unique index, this will be fast.
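If that index isn't already in place, a minimal sketch of what the pattern relies on (the index name is an assumption):

ALTER TABLE table1 ADD PRIMARY KEY (user_id);

-- or, if a different primary key already exists:
CREATE UNIQUE INDEX uq_table1_user_id ON table1 (user_id);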

How can I speed up a MySQL query with a large offset in the LIMIT clause?

Perhaps you could create an indexing table which provides a sequential key relating to the key in your target table. Then you can join this indexing table to your target table and use a where clause to more efficiently get the rows you want.

#create table to store sequences
CREATE TABLE seq (
    seq_no int not null auto_increment,
    id int not null,
    primary key(seq_no),
    unique(id)
);

#create the sequence
TRUNCATE seq;
INSERT INTO seq (id) SELECT id FROM mytable ORDER BY id;

#now get 1000 rows from offset 1000000
SELECT mytable.*
FROM mytable
INNER JOIN seq USING(id)
WHERE seq.seq_no BETWEEN 1000000 AND 1000999;

Why does MYSQL higher LIMIT offset slow the query down?

It's normal that higher offsets slow the query down, since the query needs to count off the first OFFSET + LIMIT records (and take only LIMIT of them). The higher this value is, the longer the query runs.

The query cannot go right to OFFSET because, first, the records can be of different length, and, second, there can be gaps from deleted records. It needs to check and count each record on its way.

Assuming that id is the primary key of a MyISAM table, or a unique non-primary key field on an InnoDB table, you can speed it up by using this trick:

SELECT t.*
FROM (
    SELECT id
    FROM mytable
    ORDER BY id
    LIMIT 10000, 30
) q
JOIN mytable t ON t.id = q.id

See this article:

  • MySQL ORDER BY / LIMIT performance: late row lookups

MySQL - Slow performance for limit offset on json column

When MySQL uses OFFSET, it cannot just skip to row 216204. MySQL indexes are organized by value, not by row number, so the server must actually examine all of those rows, which means reading the pages on which the rows are stored and loading them from disk into RAM if necessary.

When you only reference t.id, it can scan along the index. I'll assume id is indexed (maybe even the primary key), and it's probably an integer (4 bytes) or bigint (8 bytes). Regardless, a lot of those entries fit in a single InnoDB page (each page is a fixed 16KB).

Whereas the JSON column is like TEXT or BLOB or VARCHAR, in that it is probably stored on additional pages, and it's much more bulky than a single integer. So it takes more pages. If the JSON documents are especially large, it may even take many pages per document.

So that's a lot of storage I/O to load all those JSON documents. It is many times the I/O work to load 216,000 JSON documents, compared to only referencing the primary key. It might be the case that the primary key pages are already cached in RAM anyway, so there is little or no I/O needed to read them.

It might be even worse with AWS Aurora than traditional MySQL, because Aurora uses distributed storage and replicated buffer pools, so there's additional overhead to both the I/O and the loading into RAM.

What can you do instead?

Stop using large values in your OFFSET. Search by values instead.

SELECT t.json_column
FROM table t
WHERE t.id >= 208000
ORDER BY t.id
LIMIT 501;

This is quick because the index lets it skip directly to the row where id = 208000 (or whatever value you search for) without examining all of the preceding rows, and it stops examining rows once it has found enough to satisfy your LIMIT.
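
To page through the results the same way, here's a minimal sketch assuming the application remembers the last id it received (placeholder written in the ?...? style used earlier):

-- next page: start just after the last id returned by the previous query
SELECT t.json_column
FROM table t
WHERE t.id > ?last_seen_id?
ORDER BY t.id
LIMIT 501;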


