MAX vs TOP 1 - Which Is Better?

SQL: MAX versus SELECT TOP 1

The difference between MAX and TOP 1 was also discussed in these posts:

MAX vs Top 1 - which is better?

SQL performance MAX()

In an SQL query, is TOP 1 a reliable substitute for aggregate functions such as MIN() or MAX()?

Reliable is a subjective term. Yes, TOP 1 will give you a result, as will MAX() or MIN(). It depends on what you are after.

If you look for a specific user only (as you appear to be in this case) and sort by DATE ascending and use TOP 1, you will get all of the details for that one record. However, if you are looking for the first purchase of every user in the table, then TOP 1 will only give you the info for the very first person who made an order.

On the other hand, if you use SELECT DAACCT, MIN(DAIDAT) FROM table GROUP BY DAACCT, you will get the earliest purchase for each user. This assumes DAIDAT is stored as a proper timestamp (a date with a time component), not just a date value. If you store only the date, multiple records can tie for the earliest value.

TL;DR: If you stick with the concept of the query 1) looking for a very specific user for 2) a very specific product and 3) your dates are stored as proper dates, TOP 1 is a better fit than an aggregate function. If any of these three conditions is not met, reevaluate.
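
The contrast between the two shapes of query can be sketched in Python with the stdlib sqlite3 module (LIMIT 1 standing in for SQL Server's TOP 1; the DAACCT/DAIDAT names come from the answer above, the sample data is invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE purchases (DAACCT TEXT, DAIDAT TEXT)")
con.executemany("INSERT INTO purchases VALUES (?, ?)", [
    ("alice", "2023-01-05 10:00:00"),
    ("alice", "2023-03-01 09:30:00"),
    ("bob",   "2023-02-10 14:00:00"),
])

# "TOP 1" style: one row overall -- only the very first purchase in the table.
first_overall = con.execute(
    "SELECT DAACCT, DAIDAT FROM purchases ORDER BY DAIDAT ASC LIMIT 1"
).fetchone()
print(first_overall)    # ('alice', '2023-01-05 10:00:00')

# Aggregate style: the earliest purchase for *each* user.
first_per_user = con.execute(
    "SELECT DAACCT, MIN(DAIDAT) FROM purchases GROUP BY DAACCT ORDER BY DAACCT"
).fetchall()
print(first_per_user)   # [('alice', '2023-01-05 10:00:00'), ('bob', '2023-02-10 14:00:00')]
```

The ISO-formatted timestamp strings sort correctly as plain text, which is why the ORDER BY and MIN work here without a dedicated date type.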

Performance of max() vs ORDER BY DESC + LIMIT 1

There does not seem to be an index on sensor.station_id, which is important here.

There is an actual difference between max() and ORDER BY DESC + LIMIT 1, which many people seem to miss: NULL values sort first in descending sort order. So ORDER BY timestamp DESC LIMIT 1 returns a NULL value if one exists, while the aggregate function max() ignores NULL values and returns the latest non-null timestamp. ORDER BY timestamp DESC NULLS LAST LIMIT 1 would be equivalent.
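
A conceptual sketch of this NULL difference, using Python's sqlite3 (note the dialect caveat: Postgres sorts NULLs first in DESC order by default, SQLite does not, so the "(ts IS NULL) DESC" sort key below emulates the Postgres default; the table and data are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (ts TEXT)")
con.executemany("INSERT INTO readings VALUES (?)",
                [("2024-01-01",), ("2024-06-01",), (None,)])

# ORDER BY ts DESC under Postgres's NULLS-FIRST default: the NULL row wins.
top1 = con.execute(
    "SELECT ts FROM readings ORDER BY (ts IS NULL) DESC, ts DESC LIMIT 1"
).fetchone()[0]
print(top1)    # None

# max() ignores NULLs and returns the latest non-null value.
mx = con.execute("SELECT max(ts) FROM readings").fetchone()[0]
print(mx)      # '2024-06-01'
```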

For your case, since your column d.timestamp is defined NOT NULL (as your update revealed), there is no effective difference. An index with DESC NULLS LAST and the same clause in the ORDER BY for the LIMIT query should still serve you best. I suggest these indexes (my query below builds on the second one):

sensor(station_id, id)
data(sensor_id, timestamp DESC NULLS LAST)

You can drop the other indexes sensor_ind_timestamp and sensor_ind_timestamp_desc unless they are in use otherwise (unlikely, but possible).

Much more importantly, there is another difficulty: the filter on the first table, sensors, returns few rows, but (possibly) more than one. Postgres expects to find 2 rows (rows=2) in your added EXPLAIN output.

The perfect technique would be an index-skip-scan (a.k.a. loose index scan) for the second table data - which is not currently implemented (up to at least Postgres 15). There are various workarounds. See:

  • Optimize GROUP BY query to retrieve latest row per user

The best should be:

SELECT d.timestamp
FROM   sensors s
CROSS  JOIN LATERAL (
   SELECT timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4
ORDER  BY d.timestamp DESC NULLS LAST
LIMIT  1;

The choice between max() and ORDER BY / LIMIT hardly matters in comparison. You might as well:

SELECT max(d.timestamp) AS timestamp
FROM   sensors s
CROSS  JOIN LATERAL (
   SELECT timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4;

Or:

SELECT max(d.timestamp) AS timestamp
FROM   sensors s
CROSS  JOIN LATERAL (
   SELECT max(timestamp) AS timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ) d
WHERE  s.station_id = 4;

Or even with a correlated subquery, shortest of all:

SELECT max((SELECT max(timestamp) FROM data WHERE sensor_id = s.id)) AS timestamp
FROM sensors s
WHERE station_id = 4;

Note the double parentheses!
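
The double-parenthesis form happens to run unchanged in SQLite, so it can be verified with Python's sqlite3 (table and column names follow the Postgres example above; the sample data is invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sensors (id INTEGER, station_id INTEGER)")
con.execute("CREATE TABLE data (sensor_id INTEGER, timestamp TEXT)")
con.executemany("INSERT INTO sensors VALUES (?, ?)", [(1, 4), (2, 4), (3, 9)])
con.executemany("INSERT INTO data VALUES (?, ?)", [
    (1, "2024-05-01"), (1, "2024-05-03"),
    (2, "2024-05-02"),
    (3, "2024-09-09"),   # belongs to another station; must be ignored
])

# Outer max() over a correlated per-sensor max(): the "shortest of all" form.
latest = con.execute("""
    SELECT max((SELECT max(timestamp) FROM data WHERE sensor_id = s.id))
    FROM sensors s
    WHERE station_id = 4
""").fetchone()[0]
print(latest)    # '2024-05-03'
```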

The additional advantage of LIMIT in a LATERAL join is that you can retrieve arbitrary columns of the selected row, not just the latest timestamp (one column).

Related:

  • Why do NULL values come first when ordering DESC in a PostgreSQL query?
  • What is the difference between a LATERAL JOIN and a subquery in PostgreSQL?
  • Select first row in each GROUP BY group?
  • Optimize groupwise maximum query

The faster of two SQL queries, sort and select top 1, or select MAX

With an index on order_date, they perform the same.

Without an index, MAX is a little faster, since it uses a Stream Aggregate rather than a Top N Sort.

SQL Selecting Top 1 with MAX

I think TOP 1 does what you want:

SELECT TOP 1 State, COUNT(Citation) as MostViolations
FROM dbo.ParkingCitations
GROUP BY State
ORDER BY COUNT(Citation) DESC;

If you want all rows when there are ties, use TOP 1 WITH TIES.
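
TOP 1 WITH TIES is SQL Server syntax with no direct equivalent in most other engines; a portable rewrite keeps every state whose count equals the maximum count. A sketch in Python with sqlite3 (table and column names follow the query above, the data is invented, with MD and VA deliberately tied):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ParkingCitations (State TEXT, Citation TEXT)")
con.executemany("INSERT INTO ParkingCitations VALUES (?, ?)", [
    ("MD", "c1"), ("MD", "c2"),
    ("VA", "c3"), ("VA", "c4"),   # tied with MD
    ("DC", "c5"),
])

# Keep every group whose count matches the maximum group count.
rows = con.execute("""
    SELECT State, COUNT(Citation) AS MostViolations
    FROM ParkingCitations
    GROUP BY State
    HAVING COUNT(Citation) = (
        SELECT MAX(n) FROM (
            SELECT COUNT(Citation) AS n FROM ParkingCitations GROUP BY State
        )
    )
    ORDER BY State
""").fetchall()
print(rows)    # [('MD', 2), ('VA', 2)]
```

A plain TOP 1 / LIMIT 1 would have returned only one of the two tied states, chosen arbitrarily.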

Get top 1 row of each group

;WITH cte AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
    FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

If you expect two entries per day, this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead.
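
The ROW_NUMBER vs DENSE_RANK difference on a tie can be demonstrated with Python's sqlite3 (window functions need SQLite 3.25+; the table follows the answer above, the data is invented, with two status rows sharing the same DateCreated):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE DocumentStatusLogs "
            "(DocumentID INTEGER, Status TEXT, DateCreated TEXT)")
con.executemany("INSERT INTO DocumentStatusLogs VALUES (?, ?, ?)", [
    (1, "S1", "2024-01-01"),
    (1, "S2", "2024-01-02"),   # tied for latest...
    (1, "S3", "2024-01-02"),   # ...with this row
])

tpl = """
    SELECT Status FROM (
        SELECT Status,
               {fn}() OVER (PARTITION BY DocumentID
                            ORDER BY DateCreated DESC) AS rn
        FROM DocumentStatusLogs
    ) WHERE rn = 1 ORDER BY Status
"""
# ROW_NUMBER breaks the tie arbitrarily: exactly one winner per document.
row_number = [r[0] for r in con.execute(tpl.format(fn="ROW_NUMBER"))]
# DENSE_RANK gives both tied rows rank 1: both survive the rn = 1 filter.
dense_rank = [r[0] for r in con.execute(tpl.format(fn="DENSE_RANK"))]
print(row_number)   # one arbitrary winner, e.g. ['S2']
print(dense_rank)   # both tied rows: ['S2', 'S3']
```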

As for normalised or not, it depends if you want to:

  • maintain status in 2 places
  • preserve status history
  • ...

As it stands, you preserve status history. If you want the latest status in the parent table too (which is denormalisation), you'd need a trigger to maintain "status" in the parent, or you could drop this status history table.

Which is more efficient in SQL: MIN or TOP?

Tests were made on SQL Server 2012 (you didn't ask for a specific version, and this is what I have):

-- Create the table
CREATE TABLE ABC (salary int)

-- Insert sample data
DECLARE @i int = 0

WHILE @i < 1000000 -- that's right, a million records
BEGIN
    INSERT INTO ABC VALUES (@i)
    SET @i = @i + 1
END

Include the actual execution plans and run both queries:

select MIN(salary) from ABC
select top 1 salary from ABC order by salary asc

Results:

  • Without indexes: query cost for top 1 was 94% and for min was 6%.
  • With an index on salary - query cost for both was 50% (doesn't matter if the index is clustered or not).

Without an index: [execution plan image]

With an index (clustered and non-clustered resulted in the same execution plan): [execution plan image]

MIN/MAX vs ORDER BY and LIMIT

In the worst case, where you're looking at an unindexed field, using MIN() requires a single full pass of the table, while SORT and LIMIT requires a filesort. Run against a large table, there would likely be a significant difference in perceived performance. As an anecdotal data point, MIN() took 0.36s while SORT and LIMIT took 0.84s against a 106,000-row table on my dev server.

If, however, you're looking at an indexed column, the difference is harder to notice (a meaningless data point of 0.00s in both cases). Looking at the output of EXPLAIN, however, it looks like MIN() is able to simply pluck the smallest value from the index ('Select tables optimized away' and 'NULL' rows), whereas SORT and LIMIT still needs to do an ordered traversal of the index (106,000 rows). The actual performance impact is probably negligible.

It looks like MIN() is the way to go: it's faster in the worst case, indistinguishable in the best case, is standard SQL, and most clearly expresses the value you're trying to get. The only case where using SORT and LIMIT would be desirable is, as mson mentioned, when you're writing a general operation that finds the top or bottom N values from arbitrary columns and it's not worth writing out the special-case operation.

TOP(1) vs MIN problem - Why does SQL Server use a clustered scan rather than seeking the covering index?

Even though RowNumber, the clustered index key, is a key value on IX_Transactions_TransactionDate, the index keys are ordered first by TransactionDate, then by RowNumber. The MIN(RowNumber) may not be on the first row with TransactionDate >= '20191002 04:00:00.000'.

Consider if the IX_Transactions_TransactionDate contained the key values:

(20191002 04:00:00.000,10),
(20191002 05:00:00.000,11),
(20191002 06:00:00.000,1)

The result of

SELECT MIN(RowNumber) FROM FintracTransactions WHERE TransactionDate >= '20191002 04:00:00.000' OPTION(RECOMPILE)

is 1. While the result of:

SELECT TOP(1) RowNumber FROM FintracTransactions WHERE TransactionDate >= '20191002 04:00:00.000' ORDER BY TransactionDate OPTION(RECOMPILE)

is 10.
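
The 1-versus-10 outcome above can be sketched in plain Python: model the index as (TransactionDate, RowNumber) pairs sorted by date, and compare taking the minimum RowNumber over the qualifying range with taking the first qualifying row's RowNumber (the key values are taken from the example above):

```python
# Index keys ordered by TransactionDate first, then RowNumber.
index_keys = [
    ("20191002 04:00:00.000", 10),
    ("20191002 05:00:00.000", 11),
    ("20191002 06:00:00.000", 1),
]
cutoff = "20191002 04:00:00.000"
qualifying = [(d, r) for d, r in index_keys if d >= cutoff]

# MIN(RowNumber): must examine every qualifying key.
min_rownumber = min(r for _, r in qualifying)
# TOP(1) ... ORDER BY TransactionDate: just the first qualifying key.
top1_rownumber = qualifying[0][1]
print(min_rownumber, top1_rownumber)    # 1 10
```

Because the minimum RowNumber can sit anywhere in the date-ordered range, the index cannot answer MIN(RowNumber) with a single seek, which is exactly the optimizer's dilemma described below.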

So the optimizer's real choice is to scan every value from IX_Transactions_TransactionDate after the target date, or to scan the clustered index from the beginning until it finds the first row with a qualifying TransactionDate.

You should see that the execution plan for:

SELECT MIN(RowNumber) FROM [Transactions] with (index=[IX_Transactions_TransactionDate]) WHERE TransactionDate >= '20191002 04:00:00.000' OPTION(RECOMPILE)

has a higher estimated cost than the execution plan for:

SELECT MIN(RowNumber) FROM [Transactions] WHERE TransactionDate >= '20191002 04:00:00.000' OPTION(RECOMPILE)

