SQL Why Is Select Count(*) , Min(Col), Max(Col) Faster Then Select Min(Col), Max(Col)

SQL why is SELECT COUNT(*) , MIN(col), MAX(col) faster then SELECT MIN(col), MAX(col)

The SQL Server cardinality estimator makes various modelling assumptions such as

  • Independence: Data distributions on different columns are independent unless correlation information is available.
  • Uniformity: Within each statistics object histogram step, distinct values are evenly spread and each value has the same frequency.

Source

There are 810,064 rows in the table.

You have the query

SELECT COUNT(*),
MIN(startdate) AS Firstdate,
MAX(startdate) AS Lastdate
FROM table
WHERE status <> 'A'
AND fk = 4193

1,893 (0.23%) rows meet the fk = 4193 predicate, and of those two fail the status <> 'A' part so overall 1,891 match and need to be aggregated.

You also have two indexes neither of which cover the whole query.

For your fast query it uses an index on fk to directly find rows where fk = 4193 then needs to do 1,893 key lookups to find each row in the clustered index to check the status predicate and retrieve the startdate for aggregation.

When you remove the COUNT(*) from the SELECT list SQL Server no longer has to process every qualifying row. As a result it considers another option.

You have an index on startdate so it could start scanning that from the beginning, doing key lookups back to the base table and as soon as it finds the first matching row stop as it has found the MIN(startdate), Similarly the MAX can be found with another scan starting the other end of the index and working backwards.

SQL Server estimates that each of these scans will end up processing 590 rows before they hit upon one that matches the predicate. Giving 1,180 total lookups vs 1,893 so it chooses this plan.

The 590 figure is just table_size / estimated_number_of_rows_that_match. i.e. the cardinality estimator assumes that the matching rows will be evenly distributed throughout the table.

Unfortunately the 1,891 rows that meet the predicate are not randomly distributed with respect to startdate. In fact they are all condensed into a single 8,205 row segment towards the end of the index meaning that the scan to get to the MIN(startdate) ends up doing 801,859 key lookups before it can stop.

This can be reproduced below.

CREATE TABLE T
(
id int identity(1,1) primary key,
startdate datetime,
fk int,
[status] char(1),
Filler char(2000)
)

CREATE NONCLUSTERED INDEX ix ON T(startdate)

INSERT INTO T
SELECT TOP 810064 Getdate() - 1,
4192,
'B',
''
FROM sys.all_columns c1,
sys.all_columns c2

UPDATE T
SET fk = 4193, startdate = GETDATE()
WHERE id BETWEEN 801859 and 803748 or id = 810064

UPDATE T
SET startdate = GETDATE() + 1
WHERE id > 810064

/*Both queries give the same plan.
UPDATE STATISTICS T WITH FULLSCAN
makes no difference*/

SELECT MIN(startdate) AS Firstdate,
MAX(startdate) AS Lastdate
FROM T
WHERE status <> 'A' AND fk = 4192

SELECT MIN(startdate) AS Firstdate,
MAX(startdate) AS Lastdate
FROM T
WHERE status <> 'A' AND fk = 4193

You could consider using query hints to force the plan to use the index on fk rather than startdate or add the suggested missing index highlighted in the execution plan on (fk,status) INCLUDE (startdate) to avoid this issue.

Getting the min() of a count(*) column

I don't have an oracle station to test on but you should be able to just wrap the aggregator around your SELECT as a subquery/derived table/inline view

So it would be (UNTESTED!!)

SELECT 
AVG(s.c)
, MIN(s.c)
, MAX(s.c)
, s.ID
FROM
--Note this is just your query
(select id, to_char(time), count(*) as c from vehicle_location group by id, to_char(time), min having id = 16) as s
GROUP BY s.ID

Here's some reading on it:

http://www.devshed.com/c/a/Oracle/Inserting-SubQueries-in-SELECT-Statements-in-Oracle/3/

EDIT: Though normally it is a bad idea to select both the MIN and MAX in a single query.

EDIT2: The min/max issue is related to how some RDBMS (including oracle) handle aggregations on indexed columns. It may not affect this particular query but the premise is that it's easy to use the index to find either the MIN or the MAX but not both at the same time because any index may not be used effectively.

Here's some reading on it:

http://momendba.blogspot.com/2008/07/min-and-max-functions-in-single-query.html

SQL MAX of multiple columns?

This is an old answer and broken in many way.

See https://stackoverflow.com/a/6871572/194653 which has way more upvotes and works with sql server 2008+ and handles nulls, etc.

Original but problematic answer:

Well, you can use the CASE statement:

SELECT
CASE
WHEN Date1 >= Date2 AND Date1 >= Date3 THEN Date1
WHEN Date2 >= Date1 AND Date2 >= Date3 THEN Date2
WHEN Date3 >= Date1 AND Date3 >= Date2 THEN Date3
ELSE Date1
END AS MostRecentDate

Querying minimum value in SQL Server is a lot longer than querying all the rows

The root cause of the problem is the non-aligned nonclustered index, combined with the statistical limitation Martin Smith points out (see his answer to another question for details).

Your table is partitioned on [Hour] along these lines:

CREATE PARTITION FUNCTION PF (integer)
AS RANGE RIGHT
FOR VALUES (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23);

CREATE PARTITION SCHEME PS
AS PARTITION PF ALL TO ([PRIMARY]);

-- Partitioned
CREATE TABLE dbo.Connection
(
Id bigint IDENTITY(1,1) NOT NULL,
DateConnection datetime NOT NULL,
TimeConnection time(7) NOT NULL,
[Hour] AS (DATEPART(HOUR, TimeConnection)) PERSISTED NOT NULL,

CONSTRAINT [PK_Connection]
PRIMARY KEY CLUSTERED
(
[Hour] ASC,
[Id] ASC
)
ON PS ([Hour])
);

-- Not partitioned
CREATE UNIQUE NONCLUSTERED INDEX [IX_Connection_Id]
ON dbo.Connection
(
Id ASC
)ON [PRIMARY];

-- Pretend there are lots of rows
UPDATE STATISTICS dbo.Connection WITH ROWCOUNT = 200000000, PAGECOUNT = 4000000;

The query and execution plan are:

SELECT 
MinID = MIN(c.Id)
FROM dbo.Connection AS c WITH (READUNCOMMITTED)
WHERE
c.DateConnection = '2012-06-26';

Selected plan

The optimizer takes advantage of the index (ordered on Id) to transform the MIN aggregate to a TOP (1) - since the minimum value will by definition be the first value encountered in the ordered stream. (If the nonclustered index were also partitioned, the optimizer would not choose this strategy since the required ordering would be lost).

The slight complication is that we also need to apply the predicate in the WHERE clause, which requires a lookup to the base table to fetch the DateConnection value. The statistical limitation Martin mentions explains why the optimizer estimates it will only need to check 119 rows from the ordered index before finding one with a DateConnection value that will match the WHERE clause. The hidden correlation between DateConnection and Id values means this estimate is a very long way off.

In case you are interested, the Compute Scalar calculates which partition to perform the Key Lookup into. For each row from the nonclustered index, it computes an expression like [PtnId1000] = Scalar Operator(RangePartitionNew([dbo].[Connection].[Hour] as [c].[Hour],(1),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23))), and this is used as the leading key of the lookup seek. There is prefetching (read-ahead) on the nested loops join, but this needs to be an ordered prefetch to preserve the sorting required by the TOP (1) optimization.

Solution

We can avoid the statistical limitation (without using query hints) by finding the minimum Id for each Hour value, and then taking the minimum of the per-hour minimums:

-- Global minimum
SELECT
MinID = MIN(PerHour.MinId)
FROM
(
-- Local minimums (for each distinct hour value)
SELECT
MinID = MIN(c.Id)
FROM dbo.Connection AS c WITH(READUNCOMMITTED)
WHERE
c.DateConnection = '2012-06-26'
GROUP BY
c.[Hour]
) AS PerHour;

The execution plan is:

Serial plan

If parallelism is enabled, you will see a plan more like the following, which uses parallel index scan and multi-threaded stream aggregates to produce the result even faster:

Parallel plan

Why does my database query take longer if I select specific columns?

Thanks for posting the STATISTICS IO and Execution Plan. Please post the various indexes and schema associated with these tables.

Why your query times vary

In SELECTing just the orders.* data, the SQL Server Query Optimizer is deciding to use "GoalProcessingStatus.nci_wi_Goal..." index. This index encompasses every GoalProcessingStatus column you reference in the predicate.

In SELECTing the * data, the SQL Server Query Optimizer is deciding to use "IX_U_GoalP..." index and doing subsequent lookups on the Clustered index. This is because it needs all the columns from the GoalProcessingStatus table.

Moving Forward

To move forward with solving this problem, you'll need to figure out why performance on index "GoalProcessingStatus.nci_wi_Goal..." is so poor. It's difficult for me to help without the indexes and schema definitions. Also, the XML Execution plan would also provide more information.

Updated

After revisiting your query "order.*" query, the only columns being referenced in the predicate are GoalProcessingStatusID, Attempts, and IsProcessed. your non-clustered index may lack uniqueness.

sqlite3 select min, max together is much slower than select them separately

This is a known quirk of the sqlite query optimizer, as explained here: http://www.sqlite.org/optoverview.html#minmax:

Queries of the following forms will be optimized to run in logarithmic time assuming appropriate indices exist:

 SELECT MIN(x) FROM table;
SELECT MAX(x) FROM table;

In order for these optimizations to occur, they must appear in exactly the form shown above - changing only the name of the table and column. It is not permissible to add a WHERE clause or do any arithmetic on the result. The result set must contain a single column. The column in the MIN or MAX function must be an indexed column.

UPDATE (2017/06/23): Recently, this has been updated to say that a query containing a single MAX or MIN might be satisfied by an index lookup (allowing for things like arithmetic); however, they still preclude having more than one such aggregation operator in a single query (so MIN,MAX will still be slow):

Queries that contain a single MIN() or MAX() aggregate function whose argument is the left-most column of an index might be satisfied by doing a single index lookup rather than by scanning the entire table. Examples:

SELECT MIN(x) FROM table;
SELECT MAX(x)+1 FROM table;


Related Topics



Leave a reply



Submit