MIN/MAX vs ORDER BY and LIMIT

In the worst case, where you're looking at an unindexed field, using MIN() requires a single full pass of the table, while ORDER BY with LIMIT requires a filesort. Run against a large table, there would likely be a significant difference in perceived performance. As an anecdotal data point, MIN() took 0.36s while ORDER BY with LIMIT took 0.84s against a 106,000-row table on my dev server.

If, however, you're looking at an indexed column, the difference is harder to notice (a meaningless data point: 0.00s in both cases). Looking at the output of EXPLAIN, however, it looks like MIN() is able to simply pluck the smallest value from the index (the 'Select tables optimized away' and 'NULL' rows), whereas ORDER BY with LIMIT still needs to do an ordered traversal of the index (106,000 rows). The actual performance impact is probably negligible.

It looks like MIN() is the way to go: it's faster in the worst case, indistinguishable in the best case, is standard SQL, and most clearly expresses the value you're trying to get. The only case where ORDER BY with LIMIT seems desirable is, as mson mentioned, when you're writing a general operation that finds the top or bottom N values of arbitrary columns and it's not worth writing out the special-case operation.
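For illustration, the two shapes being compared look like this (the table and column names are made up for the example):

-- Aggregate form: one full pass when price is unindexed,
-- or a direct lookup when it is indexed
SELECT MIN(price) FROM products;

-- Sort-based form: a filesort when price is unindexed,
-- an ordered index traversal when it is indexed
SELECT price FROM products ORDER BY price ASC LIMIT 1;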

Performance of max() vs ORDER BY DESC + LIMIT 1

There does not seem to be an index on sensor.station_id, which is most probably important here.

There is an actual difference between max() and ORDER BY DESC + LIMIT 1. Many people seem to miss that. NULL values sort first in descending sort order. So ORDER BY timestamp DESC LIMIT 1 returns a row with timestamp IS NULL if it exists, while the aggregate function max() ignores NULL values and returns the latest non-null timestamp.
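A minimal demonstration of that difference, with made-up names (PostgreSQL syntax):

CREATE TABLE demo (ts timestamptz);
INSERT INTO demo VALUES (NULL), ('2015-01-01');

SELECT max(ts) FROM demo;                      -- '2015-01-01': max() ignores the NULL
SELECT ts FROM demo ORDER BY ts DESC LIMIT 1;  -- NULL: sorts first in DESC order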

For your case, since your column d.timestamp is defined NOT NULL (as your update revealed), there is no effective difference. An index with DESC NULLS LAST and the same clause in the ORDER BY for the LIMIT query should still serve you best. I suggest these indexes (my query below builds on the 2nd one):

sensors(station_id, id)
data(sensor_id, timestamp DESC NULLS LAST)
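
Spelled out as DDL, that would look something like this (the index names are placeholders of my own):

CREATE INDEX sensors_station_id_id_idx ON sensors (station_id, id);
CREATE INDEX data_sensor_id_timestamp_idx ON data (sensor_id, timestamp DESC NULLS LAST);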

You can drop the other index variants sensor_ind_timestamp and sensor_ind_timestamp_desc unless you have other queries that still require them (unlikely, but possible).

Much more importantly, there is another difficulty: the filter on the first table sensors returns few rows, but (possibly) more than one. Postgres expects to find 2 rows (rows=2) in your added EXPLAIN output.

The perfect technique would be a loose index scan for the second table data - which is not currently implemented in Postgres 9.4 (or Postgres 9.5). You can rewrite the query to work around this limitation in various ways. Details:

  • Optimize GROUP BY query to retrieve latest record per user

The best should be:

SELECT d.timestamp
FROM   sensors s
CROSS  JOIN LATERAL (
   SELECT timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4
ORDER  BY d.timestamp DESC NULLS LAST
LIMIT  1;

Since the style of the outer query is mostly irrelevant here, you can also just:

SELECT max(d.timestamp) AS timestamp
FROM   sensors s
CROSS  JOIN LATERAL (
   SELECT timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4;

And the max() variant should perform about as fast now:

SELECT max(d.timestamp) AS timestamp
FROM   sensors s
CROSS  JOIN LATERAL (
   SELECT max(timestamp) AS timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ) d
WHERE  s.station_id = 4;

Or even, shortest of all:

SELECT max((SELECT max(timestamp) FROM data WHERE sensor_id = s.id)) AS timestamp
FROM   sensors s
WHERE  station_id = 4;

Note the double parentheses!

The additional advantage of LIMIT in a LATERAL join is that you can retrieve arbitrary columns of the selected row, not just the latest timestamp (one column).
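
For example, a variation of the first query above that also returns the data row's id (assuming data has an id column, which the question doesn't show):

SELECT d.id, d.timestamp
FROM   sensors s
CROSS  JOIN LATERAL (
   SELECT id, timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4
ORDER  BY d.timestamp DESC NULLS LAST
LIMIT  1;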

Related:

  • Why do NULL values come first when ordering DESC in a PostgreSQL query?
  • What is the difference between LATERAL and a subquery in PostgreSQL?
  • Select first row in each GROUP BY group?
  • Optimize groupwise maximum query

Select MAX or Order By Limit 1

I came across this question and thought I'd add what I've found. Note that the columns are indexed. I'm running on MariaDB 10.2.14.

I have a query which looks like SELECT MAX(created) FROM tbl WHERE `group` = 0 AND created IS NOT NULL. There's an index on (group, created) (both are ints, but created can be NULL). There are many entries with group=0, but not many where created IS NULL. tbl uses the Aria storage engine.

EXPLAIN shows the index being used and gives a row count of 46312, with Extra saying "Using where; Using index".

Running the query takes around 0.692s, but the status has something interesting:

Handler_read_key: 1
Handler_read_next: 45131
Handler_read_prev: 0

This suggests that the key is being fully scanned for the maximum; using MIN instead of MAX gives similar results. In other words, MIN/MAX apparently can't use the optimisation of just picking the first/last entry of the index here.

However, if the query is changed to SELECT created FROM tbl WHERE `group` = 0 AND created IS NOT NULL ORDER BY created DESC LIMIT 1, the query takes about the same time to run and EXPLAIN shows the same info, but the status shows:

Handler_read_key: 1
Handler_read_next: 0
Handler_read_prev: 0

I get similar results if the ORDER BY is changed to ASC. It seems that ORDER BY ... LIMIT can skip the index scan entirely, which could lead to faster queries when many rows match the index condition, if my understanding is correct.

Note that for the above results, there's enough RAM and cache allocated to hold all indexes in memory, so, presumably, index scans are fast.

I haven't done any experiments with other conditions (different MySQL versions and storage engines), but I suppose the moral of this story is that checking the status of queries via SHOW SESSION STATUS may help answer these questions.
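
The measurement pattern, for anyone who wants to reproduce it (using the query from above):

FLUSH STATUS;  -- reset the session's Handler_* counters
SELECT MAX(created) FROM tbl WHERE `group` = 0 AND created IS NOT NULL;
SHOW SESSION STATUS LIKE 'Handler_read%';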

At least in this case, the ORDER BY...LIMIT may be more efficient than MIN/MAX even when an index can be used.

min/max performance versus order by limit in sqlite?

When there is an index, both queries just take the first entry from the index.

When there is no index, MIN() does a single pass through the table, and ORDER BY sorts only as much as is needed for the LIMIT, which ends up the same effort. (Older versions of SQLite did sort everything.)
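
So with an index in place, either form reads a single index entry; a sketch with an illustrative schema:

CREATE TABLE t (x INTEGER);
CREATE INDEX t_x ON t (x);

SELECT MIN(x) FROM t;                 -- first non-NULL index entry
SELECT x FROM t ORDER BY x LIMIT 1;   -- first index entry (NULLs sort first in ASC)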

Why would MIN() query be slower than ORDER BY X LIMIT 1?

They do different things, hence one has to work harder.

SELECT min(id)
FROM my_table
WHERE s_time >= now() - INTERVAL 14 DAY;

This searches all of the items in the last two weeks to find the lowest id. INDEX(s_time, id) will help some.

SELECT  id
FROM my_table
WHERE s_time >= now() - INTERVAL 14 DAY
ORDER BY s_time, id
LIMIT 1;

If you have INDEX(s_time, id), then it will look at only one row: the first one from 14 days ago. No scan. No checking to see if it is the smallest id.

Note: If you have PRIMARY KEY(id), INDEX(s_time), then that index is effectively (s_time, id).

Since you probably inserted the rows in s_time order, the results will probably be the same. But the Optimizer has no way of knowing that.
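
For completeness, the covering index as DDL (the index name is my placeholder):

ALTER TABLE my_table ADD INDEX idx_s_time_id (s_time, id);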

Why is MAX() 100 times slower than ORDER BY ... LIMIT 1?

You should add an index on (bar, quux).

Without this index, MySQL can't see how to perform the query efficiently so it has to choose from various inefficient query plans.

In the first example it scans the quux index and, for each row found, looks up the corresponding value of bar in the original table. That makes each row roughly twice as expensive to check, but it gets lucky: a row with the correct value of bar turns up near the beginning of the scan, so it can stop early. This is probably because the value of bar you are searching for occurs frequently, so the chance of stopping early is very high. It may only have to examine a handful of rows before finding a match, and checking so few rows gives a massive overall saving despite the doubled per-row cost. Since you have no index on bar, MySQL doesn't know in advance that the value :bar occurs frequently, so it cannot know that this query will be fast.

In the second example it uses a different plan that always scans the whole table. Each row is read directly from the table, without using an index, so each read is fast, but because you have a lot of rows it is slow overall. If no rows matched :bar, this would be the faster query plan. But if roughly 1% of rows have the desired value of bar, this plan is (very) approximately 100 times slower than the one above. Since you have no index on bar, MySQL doesn't know this in advance.

You could also just add the missing index and then both queries will go much faster.
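
A sketch of that fix; the table name t is a placeholder, since the question doesn't name the table:

ALTER TABLE t ADD INDEX idx_bar_quux (bar, quux);

-- With this index, both forms become a short index dive:
SELECT MAX(quux) FROM t WHERE bar = :bar;
SELECT quux FROM t WHERE bar = :bar ORDER BY quux DESC LIMIT 1;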

COUNT and MAX/MIN and (HAVING?) in one query/ AdventureWorks2017 task

Take these kind of requirements one at a time and build your solution gradually.

Run your query after each additional step to confirm the expected result!

Steps

  1. Only job titles with at least 4 people:

    group by e.JobTitle having count(1) >= 4
  2. Count women proportionally (I interpret this as "the relative number of women within one job title"). This requires the overall count and the women only count:

    count(1) as JobTitleCount and

    count(case when e.Gender = 'F' then 1 end) as FemaleCount

    The relative (proportional) number or percentage then becomes:

    count(case when e.Gender = 'F' then 1 end)*100.0/count(1) as FemalePercentage
  3. "Fewest" and "most" means ranking functions with an order by clause:

    dense_rank() over(order by <percentage from step 2 comes here> ) as RankLeast and

    dense_rank() over(order by <percentage from step 2 comes here> desc) as RankMost
  4. Filter on the job titles that are in first place for either ranking:

    where jfpr.RankLeast = 1 or jfpr.RankMost = 1

Intermediate result

Steps 1. to 3.

select e.JobTitle,
       count(1) as JobTitleCount,
       count(case when e.Gender = 'F' then 1 end) as FemaleCount,
       count(case when e.Gender = 'F' then 1 end)*100.0/count(1) as FemalePercentage,
       dense_rank() over(order by count(case when e.Gender = 'F' then 1 end)*100.0/count(1)) as FemalePercentageRankLeast,
       dense_rank() over(order by count(case when e.Gender = 'F' then 1 end)*100.0/count(1) desc) as FemalePercentageRankMost
from HumanResources.Employee e
group by e.JobTitle
having count(1) >= 4
order by FemalePercentage;

Full solution

Move the entire previous query into a subquery jfpr, leaving out the columns we no longer need for the final result (JobTitleCount and FemaleCount). The order by moves outside the subquery.

select jfpr.JobTitle,
       jfpr.FemalePercentage
from ( select e.JobTitle,
              count(case when e.Gender = 'F' then 1 end)*100.0/count(1) as FemalePercentage,
              dense_rank() over(order by count(case when e.Gender = 'F' then 1 end)*100.0/count(1)) as FemalePercentageRankLeast,
              dense_rank() over(order by count(case when e.Gender = 'F' then 1 end)*100.0/count(1) desc) as FemalePercentageRankMost
       from HumanResources.Employee e
       group by e.JobTitle
       having count(1) >= 4 ) jfpr -- job female percentage rank
where jfpr.FemalePercentageRankLeast = 1
   or jfpr.FemalePercentageRankMost = 1
order by jfpr.FemalePercentage;

Result

On my copy of AdventureWorks (no idea what version I have right now) this produces:

JobTitle                     FemalePercentage
---------------------------- ----------------
Quality Assurance Technician 0.000000000000
Scheduling Assistant         0.000000000000
Janitor                      50.000000000000
Application Specialist       50.000000000000

min and max with and without windowing function

With 2 subqueries to get the latest and earliest amounts:

select distinct t.user_id, t.name,
       (select amount from dt
        where user_id = t.user_id
        order by td desc limit 1)
       -
       (select amount from dt
        where user_id = t.user_id
        order by td limit 1) amount
from dt t


Or:

select t.user_id, t.name,
       max(t.latest * t.amount) - max(t.earliest * t.amount) amount
from (
  select d.user_id, d.name, d.amount,
         d.td = g.earliestdate earliest, d.td = g.latestdate latest
  from dt d
  inner join (
    select user_id, min(td) earliestdate, max(td) latestdate
    from dt
    group by user_id
  ) g on d.user_id = g.user_id and d.td in (g.earliestdate, g.latestdate)
) t
group by t.user_id, t.name


Results:

| user_id | name   | amount |
| ------- | ------ | ------ |
| 1       | Rodeo  | 30.32  |
| 2       | Janice | 13.08  |
| 3       | Sam    | 0      |
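
For what it's worth, on versions with window functions (MySQL 8.0+ / MariaDB 10.2+) the same result can be sketched with first_value() instead of the subqueries; this variant is my own addition, not part of the demos above:

select distinct t.user_id, t.name,
       first_value(t.amount) over (partition by t.user_id order by t.td desc)
     - first_value(t.amount) over (partition by t.user_id order by t.td) amount
from dt t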

