PostgreSQL - Repeating Rows from LIMIT OFFSET

PostgreSQL - repeating rows from LIMIT OFFSET

Why does "foo" appear in both queries?

Because all rows that are returned have the same value for the status column. In that case the database is free to return the rows in any order it wants.

If you want a reproducible ordering you need to add a second column to your ORDER BY clause to make it consistent, e.g. the id column:

SELECT students.*
FROM students
ORDER BY students.status ASC,
         students.id ASC

If two rows have the same value for the status column, they will be sorted by the id.
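
With that tie-breaker in place, LIMIT/OFFSET paging becomes stable as well. A minimal sketch, assuming a page size of 20 (the page size is not from the original question):

SELECT students.*
FROM students
ORDER BY students.status ASC,
         students.id ASC
LIMIT 20 OFFSET 20;   -- page 2: rows 21-40, now in a deterministic order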

Repeating some rows at some offsets in PostgreSQL

The problem is that you are sorting by a column (amount) that contains duplicate values. Your ORDER BY clause is not deterministic, hence the results are not stable.

A simple solution is to use a second sorting criterion in order to break the ties (it looks like users.id can do this):

select *                                    -- better to enumerate the columns here
from wallets w
inner join users u on u.id = w.user_id      -- your "left join" is actually an "inner join"
where u.role = 'tester' and w.amount > 0
order by w.amount, u.id                     -- u.id is the second sorting criterion
limit 20 offset 0

LIMIT and OFFSET returning repeating results

According to the PostgreSQL documentation, OFFSET says to skip that many rows before beginning to return rows.

In your query you use LIMIT 20 OFFSET 2, which skips only the first 2 rows and then returns the next 20, so the result overlaps almost entirely with the previous page.

You should use this formula to calculate LIMIT and OFFSET:

LIMIT total_record_show OFFSET total_record_show * (page_number - 1)
--- page 1 (total_record_show: 20, page_number: 1)
LIMIT 20 OFFSET 0

--- page 2 (total_record_show: 20, page_number: 2)
LIMIT 20 OFFSET 20

--- page 3 (total_record_show: 20, page_number: 3)
LIMIT 20 OFFSET 40

PostgreSQL OFFSET field consistency across processes

Without an order by statement, limit and offset are seldom meaningful... SQL offers no guarantee on row order unless you make it explicit. So add an order by clause.

Also, if copying a table wholesale is what you want, it's better to simply:

insert into table2 select * from table1
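
If the copy really has to happen in chunks across processes, the same advice applies: make the chunk boundaries deterministic with an ORDER BY on a unique column. A minimal sketch, assuming table1 has a unique id column and a batch size of 1000 (both are assumptions):

INSERT INTO table2
SELECT *
FROM table1
ORDER BY id                  -- a unique column makes every chunk deterministic
LIMIT 1000 OFFSET 2000;      -- the next batch would use OFFSET 3000, then 4000, ...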

Paging in SQL with LIMIT/OFFSET sometimes results in duplicates on different pages

Do you execute one query per page to display? If yes, I suspect that the database doesn't guarantee a consistent order for items with the same number of votes. So the first query may return { item 1, item 2 } and a second query may return { item 2, item 1 } if both items have the same number of votes. If those items are actually items 10 and 11, the same item may appear on page 1 and then again on page 2.

I had such a problem once. If that's also your case, append an extra clause to the ORDER BY to ensure a consistent ordering of items with the same vote count, e.g.:

ORDER BY picture.vote, picture.ID
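
For completeness, a sketch of the full paged query with that tie-breaker applied; the page size of 20 is an assumption, not from the original question:

SELECT picture.*
FROM picture
ORDER BY picture.vote,        -- primary sort criterion from the question
         picture.ID           -- unique tie-breaker keeps pages from overlapping
LIMIT 20 OFFSET 20;           -- page 2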

PostgreSQL: get the distinct count when using LIMIT and OFFSET

Use a COUNT() window function after you filter the table student_teacher for teacherId = 2, and then join to students.

SELECT s.*, st.total_count
FROM student s
JOIN (
    SELECT *, COUNT(*) OVER() AS total_count
    FROM student_teacher
    WHERE teacherId = 2
) st ON st.studentId = s.id
LIMIT 10 OFFSET 0;

There is no need to join teachers.

Run a query with a LIMIT/OFFSET and also get the total number of rows

Yes. With a simple window function:

SELECT *, count(*) OVER() AS full_count
FROM tbl
WHERE /* whatever */
ORDER BY col1
OFFSET ?
LIMIT ?

Be aware that the cost will be substantially higher than without the total number, but typically still cheaper than two separate queries. Postgres has to actually count all rows either way, which imposes a cost depending on the total number of qualifying rows. Details:

  • Best way to get result count before LIMIT was applied

However, as Dani pointed out, when OFFSET is at least as great as the number of rows returned from the base query, no rows are returned. So we also don't get full_count.
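
A toy illustration of that caveat, assuming the filter happens to match only 10 rows (the row count is invented for the example):

SELECT *, count(*) OVER() AS full_count
FROM tbl
WHERE /* whatever */          -- suppose this matches only 10 rows
ORDER BY col1
LIMIT 20 OFFSET 20;           -- OFFSET >= 10, so zero rows come back
                              -- and full_count is not reported at all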

If that's not acceptable, a possible workaround to always return the full count would be with a CTE and an OUTER JOIN:

WITH cte AS (
    SELECT *
    FROM tbl
    WHERE /* whatever */
)
SELECT *
FROM (
    TABLE cte
    ORDER BY col1
    LIMIT ?
    OFFSET ?
) sub
RIGHT JOIN (SELECT count(*) FROM cte) c(full_count) ON true;

You get one row of NULL values with the full_count appended if OFFSET is too big. Else, it's appended to every row like in the first query.

If a row with all NULL values is a possible valid result you have to check offset >= full_count to disambiguate the origin of the empty row.

This still executes the base query only once. But it adds more overhead to the query and only pays if that's less than repeating the base query for the count.

If indexes supporting the final sort order are available, it might pay to include the ORDER BY in the CTE (redundantly).
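
That variant might look like the following sketch (same placeholders as above); whether the redundant ORDER BY actually helps depends on the indexes that exist:

WITH cte AS (
    SELECT *
    FROM tbl
    WHERE /* whatever */
    ORDER BY col1             -- redundant, but may let Postgres use a matching index
)
SELECT *
FROM (
    TABLE cte
    ORDER BY col1
    LIMIT ?
    OFFSET ?
) sub
RIGHT JOIN (SELECT count(*) FROM cte) c(full_count) ON true;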

Select query with OFFSET/LIMIT is too slow

It's slow because it needs to locate the top offset rows and scan the next 100. No amount of optimization will change that when you're dealing with huge offsets.

This is because your query literally instructs the DB engine to visit lots of rows by using OFFSET 3900000 -- that's 3.9M rows. There aren't many options to speed this up.

Super-fast RAM, SSDs, etc. will help. But you'll only gain by a constant factor in doing so, meaning it's merely kicking the can down the road until you reach a large enough offset.

Ensuring the table fits in memory, with plenty more to spare, will likewise help by a larger constant factor -- except the first time. But this may not be possible with a large enough table or index.

Ensuring you're doing index-only scans will work to an extent. (See velis' answer; it has a lot of merit.) The problem here is that, for all practical purposes, you can think of an index as a table storing a disk location and the indexed fields. (It's more optimized than that, but it's a reasonable first approximation.) With enough rows, you'll still run into problems with a large enough offset.

Trying to store and maintain the precise position of the rows is bound to be an expensive approach too. (This is suggested by e.g. benjist.) While technically feasible, it suffers from limitations similar to those that stem from using MPTT with a tree structure: you'll gain significantly on reads but will end up with excessive write times when a node is inserted, updated or removed in such a way that large chunks of the data need to be updated alongside.

As is hopefully clear by now, there is no real magic bullet when you're dealing with offsets this large. It's often better to look at alternative approaches.

If you're paginating based on the ID (or a date field, or any other indexable set of fields), a potential trick (used by blogspot, for instance) would be to make your query start at an arbitrary point in the index.

Put another way, instead of:

example.com?page_number=[huge]

Do something like:

example.com?page_following=[huge]

That way, you keep a trace of where you are in your index, and the query becomes very fast because it can head straight to the correct starting point without plowing through a gazillion rows:

select * from foo where ID > [huge] order by ID limit 100

Naturally, you lose the ability to jump to e.g. page 3000. But give this some honest thought: when was the last time you jumped to a huge page number on a site instead of going straight for its monthly archives or using its search box?

If you're paginating but want to keep the page offset by any means, yet another approach is to forbid the use of larger page numbers. It's not silly: it's what Google does with search results. When running a search query, Google gives you an estimated number of results (you can get a reasonable number using EXPLAIN), and then lets you browse the top few thousand results -- nothing more. Among other things, they do so for performance reasons -- precisely the one you're running into.
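
If an approximate count is good enough for such an "about N results" display, there is no need to count exactly. A sketch of two common approximations; the table name foo and the filter are placeholders:

EXPLAIN (FORMAT JSON)
SELECT * FROM foo WHERE id > 0;          -- read "Plan Rows" from the planner output

SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'foo';                   -- whole-table estimate from the catalog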

Postgresql returns random rows when using large OFFSET

PostgreSQL has a feature that tries to get multiple concurrent sequential scans on the same large table to all work on the same part of the table at the same time, so that they can share cache space and don't have to each read the same data off disk individually. A side effect of this is that for partial (like with LIMIT) sequential scans done consecutively, each one starts where the previous one left off.

The synchronization points are always at page boundaries, so with a low OFFSET and a low LIMIT you just keep reading data from the same page (and from that page's beginning) over and over again and getting the same data.

You can turn this off with set synchronize_seqscans TO off; if you need to get more stable results for some internal testing purpose. If you do this you are, as von Neumann might say, living in a state of sin.
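
In practice that leaves two options: flip the setting off for the session while testing, or (better) make the ordering explicit so it no longer matters where a scan starts. A sketch, assuming a table big_table with a unique id column (both names are hypothetical):

SET synchronize_seqscans TO off;   -- per-session switch, for internal testing only

SELECT *
FROM big_table                     -- hypothetical table name
ORDER BY id                        -- explicit order makes OFFSET/LIMIT deterministic
LIMIT 100 OFFSET 100000;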


