How to Get the Top 10 Values in Postgresql

How to get top value in Postgres when ties?

There is an SQL standard way to do this. It doesn't use the LIMIT keyword, but rather FETCH

FETCH FIRST 1 ROWS WITH TIES

But support for the WITH TIES part of the standard wasn't added to PostgreSQL until v13.

show top 10 occurring values in column

Since no tiebreaker has been defined, return results "with ties":

SELECT charityid, count(*) AS occurrencs
FROM donations
GROUP BY charityid
ORDER BY occurrencs DESC -- no tiebeaker?
FETCH FIRST 10 ROWS WITH TIES;

See:

  • Get top row(s) with highest value, with ties

And use count(*) here. Faster, and equivalent while no null values are involved.

postgresql selecting highest values

Generally the algorithm is very simple:

  1. Retrieve rows from Student table
  2. Sort the whole resultset from #1 using ORDER BY expressions
  3. Apply LIMIT clause (+ offset) to get a portion of rows from ordered resultset obtained from #2

You can read more on LIMIT here: http://www.postgresql.org/docs/9.4/static/queries-limit.html

In some cases LIMIT is taken into account during ORDER BY operation (sorting) to speed up the query, especially when some index can be used to eliminate a sort operation.

You can examine how this work looking at explain plans.

Let's say there is an index created on this table:

create index student_mark1 on student(student_marks1);

This query gives the following explain plan:

select * From student
order by student_marks2
limit 1;

Limit (cost=5.06..5.06 rows=1 width=178) (actual time=0.088..0.089 rows=1 loops=1)
Output: student_name, student_rollno, student_marks1, student_marks2
-> Sort (cost=5.06..5.57 rows=204 width=178) (actual time=0.088..0.088 rows=1 loops=1)
Output: student_name, student_rollno, student_marks1, student_marks2
Sort Key: student.student_marks2
Sort Method: top-N heapsort Memory: 25kB
-> Seq Scan on public.student (cost=0.00..4.04 rows=204 width=178) (actual time=0.007..0.021 rows=204 loops=1)
Output: student_name, student_rollno, student_marks1, student_marks2

You need to read this plan from bottom to up.

The first operation is Seq scan - it means that the all rows are read from disk (the whole table - see actual rows = 204).

Then the sort operation is performed (ORDER BY). And the last operation is LIMIT 1 (at the top of the plan)

Compare the above plan to this query:

select * From student
order by student_marks1
limit 1;

Limit (cost=0.14..0.24 rows=1 width=178) (actual time=0.010..0.010 rows=1 loops=1)
Output: student_name, student_rollno, student_marks1, student_marks2
-> Index Scan using student_mark1 on public.student (cost=0.14..19.20 rows=204 width=178) (actual time=0.009..0.009 rows=1 loops=1)
Output: student_name, student_rollno, student_marks1, student_marks2

Here the sorting phase is skipped, since we can use the index to retrieve rows in required order (ORDER BY student_marks1 => INDEX ON Student( student_marks1 )).

Please take notice of Actual rows = 1 in the bottom-most operation: `Index scan'.

This means, that PostgreSQL doesn't scan the whole index, but retrieves only 1 (first) row from the index, because it knows, than the query has LIMIT 1 clause. (One sometimes says that PostgreSQL "pushed down" limit 1 clause to the index scan operation and used it to reduce a number of scanned entries in the index).

More on using indices to speed up ORDER BY you can find here: http://www.postgresql.org/docs/8.3/static/indexes-ordering.html

In case the query in your question, the ORDER BY clause contains an expression Student_Marks1+Student_Marks2, not a simple columns. An explain plan for this query looks like this:

select *
From student
order by student_marks1 + student_marks2
limit 2;

Limit (cost=7.10..7.11 rows=2 width=178) (actual time=0.207..0.207 rows=2 loops=1)
Output: student_name, student_rollno, student_marks1, student_marks2, (((student_marks1)::numeric + student_marks2))
-> Sort (cost=7.10..7.61 rows=204 width=178) (actual time=0.205..0.205 rows=2 loops=1)
Output: student_name, student_rollno, student_marks1, student_marks2, (((student_marks1)::numeric + student_marks2))
Sort Key: (((student.student_marks1)::numeric + student.student_marks2))
Sort Method: top-N heapsort Memory: 25kB
-> Seq Scan on public.student (cost=0.00..5.06 rows=204 width=178) (actual time=0.019..0.107 rows=204 loops=1)
Output: student_name, student_rollno, student_marks1, student_marks2, ((student_marks1)::numeric + student_marks2)

But you can still speed up this query creating a function based index, in this way:

create index student_mark12 on student( ( student_marks1 + student_marks2) );

After creating the index, we have now:

Limit  (cost=0.14..0.34 rows=2 width=178) (actual time=0.044..0.047 rows=2 loops=1)
Output: student_name, student_rollno, student_marks1, student_marks2, (((student_marks1)::numeric + student_marks2))
-> Index Scan using student_mark12 on public.student (cost=0.14..20.22 rows=204 width=178) (actual time=0.043..0.046 rows=2 loops=1)
Output: student_name, student_rollno, student_marks1, student_marks2, ((student_marks1)::numeric + student_marks2)

Notice that Postgre uses the index in this case, and retrieves only 2 entries from it (actual rows = 2) according to LIMIT 2 clause.

Get top row(s) with highest value, with ties

The first query fails if any row has quantity IS NULL (as Gordon demonstrates).

The second query only fails if all rows have quantity IS NULL. So it should be usable in most cases. (And it's faster.)

Postgres 13 or newer

Use the standard SQL clause WITH TIES:

SELECT id
FROM product
ORDER BY quantity DESC NULLS LAST
FETCH FIRST 1 ROWS WITH TIES;

db<>fiddle here

Works with any amount of NULL values.

The manual:

SQL:2008 introduced a different syntax to achieve the same result,
which PostgreSQL also supports. It is:

OFFSET start { ROW | ROWS }
FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } { ONLY | WITH TIES }

In this syntax, the start or count value is required by the standard
to be a literal constant, a parameter, or a variable name; as a
PostgreSQL extension, other expressions are allowed, but will
generally need to be enclosed in parentheses to avoid ambiguity. If
count is omitted in a FETCH clause, it defaults to 1. The WITH TIES
option is used to return any additional rows that tie for the last
place in the result set according to the ORDER BY clause; ORDER BY is
mandatory in this case. ROW and ROWS as well as FIRST and NEXT are
noise words that don't influence the effects of these clauses.

Notably, WITH TIES cannot be used with the (non-standard) short syntax LIMIT n.

It's the fastest possible solution. Faster than either of your current queries. More important for performance: have an index on (quantity). Or a more specialized covering index to allow index-only scans (a bit faster, yet):

CREATE INDEX ON product (quantity DESC NULLS LAST) INCLUDE (id);

See:

  • Do covering indexes in PostgreSQL help JOIN columns?

We need NULLS LAST to keep NULL values last in descending order. See:

  • Sort by column ASC, but NULL values first?

Postgres 12 or older

A NULL-safe query:

SELECT id, quantity
FROM product
WHERE quantity IS NOT DISTINCT FROM (SELECT MAX(quantity) FROM product);

Or, probably faster:

SELECT id, quantity
FROM (
SELECT *, rank() OVER (ORDER BY quantity DESC NULLS LAST) AS rnk
FROM product
) sub
WHERE rnk = 1;

See:

  • PostgreSQL equivalent for TOP n WITH TIES: LIMIT "with ties"?

Faster alternatives for big tables:

  • Equivalent for FETCH FIRST WITH TIES in PostgreSQL 11 with comparable performance

Update top N values using PostgreSQL

WITH cte AS (
SELECT id, row_number() OVER (ORDER BY account DESC NULLS LAST) AS rn
FROM accountrecords
ORDER BY account DESC NULLS LAST
LIMIT 10
)
UPDATE accountrecords a
SET accountrank = cte.rn
FROM cte
WHERE cte.id = a.id;

Joining in a table expression is typically faster than correlated subqueries. It is also shorter.

With the window function row_number() distinct numbers are guaranteed. Use rank() (or possibly dense_rank()) if you want rows with equal values for account to share the same number.

Only if there can be NULL values in account, you need to append NULLS LAST for descending sort order, or NULL values sort on top:

  • Sort by column ASC, but NULL values first?

If there can be concurrent write access, the above query is subject to a race condition. Consider:

  • Atomic UPDATE .. SELECT in Postgres
  • Postgres UPDATE … LIMIT 1

However, if that was the case, the whole concept of hard-coding the top ten would be a dubious approach to begin with.

Use a CTE instead of a plain subquery to enforce the LIMIT reliably. See links above.

How to find the three greatest values in each category in PostgreSQL?

This kind of query is typically solved using window functions

select *
from (
SELECT b.product_id,
b.category_id,
a.revenue,
dense_rank() over (partition by b.category_id, b.product_id order by a.revenue desc) as rnk
from A
join b ON B.product_id = A.product_id
where a.order_date between date '2006-01-01' AND date '2006-01-31'
) as t
where rnk <= 3
order by product_id, category_id, revenue desc;

dense_rank() will also deal with ties (products with the same revenue in the same category) so you might actually get more than 3 rows per product/category.

If the same product can show up more than once in table b (for the same category) you need to combine this with a GROUP BY to get the sum of all revenues:

select *
from (
SELECT b.product_id,
b.category_id,
sum(a.revenue) as total_revenue,
dense_rank() over (partition by b.category_id, a.product_id order by sum(a.revenue) desc) as rnk
from a
join b on B.product_id = A.product_id
where a.order_date between date '2006-01-01' AND date '2006-01-31'
group by b.product_id, b.category_id
) as t
where rnk <= 3
order by product_id, category_id, total_revenue desc;

When combining window functions and GROUP BY, the window function will be applied after the GROUP BY.

Find the 3rd top selling item postgres

Yet another option is:

  • aggregating the "qty" values per "product_id" (SUM(qty) GROUP BY product_id)
  • extracting a ranking value for each product_id summed quantities (DENSE_RANK() OVER(ORDER BY SUM(qty) DESC))
  • ordering your output rows with respect to when this ranking value equals 3 (DENSE_RANK() ... = 3)
  • keeping only the first row given your ordering (FETCH FIRST 1 ROWS WITH TIES )
SELECT product_id
FROM "transaction"
GROUP BY product_id
ORDER BY DENSE_RANK() OVER(ORDER BY SUM(qty) DESC) = 3 DESC
FETCH FIRST 1 ROWS WITH TIES

Check the demo here.



Related Topics



Leave a reply



Submit