Why is my SQL Server ORDER BY slow despite the ordered column being indexed?
If your query does not contain an ORDER BY, it will return the data in whatever order it was found. There is no guarantee that the data will even be returned in the same order when you run the query again.
When you include an ORDER BY clause, the database has to build a list of the rows in the correct order and then return the data in that order. This can take a lot of extra processing, which translates into extra time.
Sorting also gets more expensive as the rows get wider, which matters if your query returns a large number of columns. At some point you will run out of buffer space, the database will have to start swapping, and performance will go downhill.
Try returning fewer columns (specify the columns you need instead of SELECT *) and see if the query runs faster.
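For example (table and column names here are purely illustrative), a narrower select list gives the sort much less data to carry through its buffers:

```sql
-- Sorting drags every selected column through the sort:
SELECT * FROM Orders ORDER BY CreatedDate;

-- A narrower select list keeps each sorted row small:
SELECT OrderId, CustomerId, CreatedDate
FROM Orders
ORDER BY CreatedDate;
```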
SQL Query is slow when ORDER BY statement added
As far as why the query is slower, the query is required to return the rows "in order", so it either needs to do a sort, or it needs to use an index.
Using the index with a leading column of CreatedDate, SQL Server can avoid a sort. But SQL Server would also have to visit the pages in the underlying table to evaluate whether the row is to be returned, looking at the values in Status and Name columns.
If the optimizer chooses not to use the index with CreatedDate as the leading column, then it needs to first locate all of the rows that satisfy the predicates, and then perform a sort operation to get those rows in order. Then it can return the first fifty rows from the sorted set. (SQL Server wouldn't necessarily need to sort the entire set, but it would need to go through that whole set and do sufficient sorting to guarantee that it's got the "first fifty" that need to be returned.)
NOTE: I suspect you already know this, but to clarify: SQL Server honors the ORDER BY before the TOP 50. If you wanted any 50 rows that satisfied the predicates, but not necessarily the 50 rows with the lowest values of DateCreated, you could restructure/rewrite your query to get (at most) 50 rows first, and then perform the sort on just those.
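As a sketch of that restructuring (using the Status/Name predicates from this question), the inner query picks up any 50 matching rows and only that small set gets sorted. Note this does NOT return the 50 oldest rows:

```sql
-- Grab any 50 rows that satisfy the predicates (no particular order),
-- then sort just those 50.
SELECT d.*
FROM (
    SELECT TOP (50) *
    FROM Documents
    WHERE Status = 'New'
      AND COALESCE(Name, '') IN ('', 'None')
) AS d
ORDER BY d.DateCreated;
```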
A couple of ideas to improve performance
Adding a composite index (as other answers have suggested) may offer some improvement, for example:
CREATE NONCLUSTERED INDEX Documents_IX
ON Documents (Status, DateCreated, Name)
SQL Server might be able to use that index to satisfy the equality predicate on Status, and also return the rows in DateCreated order without a sort operation. SQL Server may also be able to satisfy the predicate on Name from the index, limiting the lookups to pages in the underlying table to just the rows being returned, which it still needs to visit to get "all" of the columns for the row.
For SQL Server 2008 or later, I'd consider a filtered index, dependent on the cardinality of Status='New' (that is, if the rows that satisfy the predicate Status='New' are a relatively small subset of the table):
CREATE NONCLUSTERED INDEX Documents_FIX
ON Documents (Status, DateCreated, Name)
WHERE Status = 'New'
I would also modify the query to specify ORDER BY Status, DateCreated, Name so that the ORDER BY clause matches the index; this doesn't actually change the order in which the rows are returned.
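With the filtered index above, the rewritten query (using the predicates from the question) would look something like this; since Status is constant for every matching row, the extra ORDER BY keys change nothing about the output order:

```sql
SELECT TOP (50) *
FROM Documents
WHERE Status = 'New'
  AND COALESCE(Name, '') IN ('', 'None')   -- predicate from the question
ORDER BY Status, DateCreated, Name;        -- matches the index key order
```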
As a more complicated alternative, I would consider adding a persisted computed column and adding a filtered index on that
ALTER TABLE Documents
ADD new_none_date_created AS
CASE
WHEN Status = 'New' AND COALESCE(Name,'') IN ('','None') THEN DateCreated
ELSE NULL
END
PERSISTED
;
CREATE NONCLUSTERED INDEX Documents_FIXP
ON Documents (new_none_date_created)
WHERE new_none_date_created IS NOT NULL
;
Then the query could be re-written:
SELECT TOP 50 *
FROM Documents
WHERE new_none_date_created IS NOT NULL
ORDER BY new_none_date_created
;
SQL 'ORDER BY' slowness
I did some performance testing last night on a more production-type database (not the developer one) and here is what I found:
Total rows in table A: 13000
Total rows in table B: 5000
Rows returned by join query : 5000
Time taken with ORDER BY clause: ~5.422 seconds
Time taken if not using ORDER BY clause: ~5.345 seconds.
So it looked like the ORDER BY wasn't making much of a difference. (I am okay with the few milliseconds added.)
I also tested by setting all B.SYNTAX values to NULL, to make sure it wasn't just the network latency of transferring so much data.
Then I removed B.SYNTAX from the SELECT clause and the query took only 0.8 seconds!
So it seems that the CLOB column is the whole bottleneck. That doesn't mean I have a solution for making this query faster, but at least I won't spend time writing a sorting algorithm.
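If only a page of rows is ever needed, one common follow-up (a sketch; the table and join column names are assumptions, only B.SYNTAX comes from the question) is to sort and limit on the narrow columns first, and join back for the CLOB only on the rows that are kept:

```sql
SELECT b.id, b.syntax            -- the bulky CLOB, fetched only for kept rows
FROM (
    SELECT b.id
    FROM table_a a
    JOIN table_b b ON b.a_id = a.id
    ORDER BY a.created_date
    FETCH FIRST 50 ROWS ONLY     -- standard SQL; LIMIT/TOP on other engines
) keep
JOIN table_b b ON b.id = keep.id;
```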
Thanks to all who replied. I learned quite a bit and it led me to try a few different things out.
SQL Server 2008: Ordering by datetime is too slow
Ordering by id probably uses a clustered index scan, while ordering by datetime uses either a sort or an index lookup.
Both of these methods are slower than a clustered index scan.
If your table is clustered by id, it basically means the table is already sorted. The records are contained in a B+Tree which has a linked list linking the pages in id order. The engine just has to traverse the linked list to get the records ordered by id.
If the ids were inserted in sequential order, the physical order of the rows will match the logical order, and the clustered index scan will be faster still.
If you want your records ordered by datetime, there are two options:
- Take all records from the table and sort them. The slowness is obvious.
- Use the index on datetime. The index is stored in a separate space on disk, which means the engine needs to shuttle between the index pages and the table pages in a nested loop. That is slower too.
To improve the ordering, you can create a separate covering index on datetime:
CREATE INDEX ix_mytable_datetime ON mytable (datetime) INCLUDE (field1, field2, …)
and include all the columns you use in your query in that index.
This index is like a shadow copy of your table, but with the data sorted in a different order.
This gets rid of the key lookups (since the index contains all the data), which makes ordering by datetime as fast as ordering by id.
Update:
A fresh blog post on this problem:
- SQL Server: clustered index and ordering
MySQL query is slow only when using ORDER BY field DESC and LIMIT
"Avg_row_length: 767640" and lots of MEDIUMTEXT. A row is limited to about 8KB; the overflow goes into "off-record" blocks. Reading those blocks takes extra disk hits.
SELECT * will reach for all those fat columns. The total will be about 50 reads (of 16KB each). This takes time. (Exp 4)
SELECT a,b,c,d ran faster because it did not need to fetch all ~50 blocks per row.
Your secondary index, (systemcreated, referrersiteid, __id) -- only the first column is useful. This is because of systemcreated LIKE 'xxx%'. That is a "range". Once a range is hit, the rest of the index is ineffective. Except...
"Index hints" (USE INDEX(...)) may help today but may make things worse tomorrow when the data distribution changes.
If you can't get rid of the wildcards in LIKE, I recommend these two indexes:
INDEX(systemcreated)
INDEX(referrersiteid)
The real speedup can come from turning the query inside out. That is, find the 16 ids first, then go looking for all those bulky columns:
SELECT r2... -- whatever you want
FROM
(
SELECT __id
FROM referrals
WHERE `systemcreated` LIKE 'XXXXXX%'
AND `referrersiteid` LIKE 'XXXXXXXXXXXX%'
order by __id desc
limit 16
) AS r1
JOIN referrals r2 USING(__id)
ORDER BY __id DESC -- yes, this needs repeating
And keep the 3-column secondary index that you have. Even if it must scan a lot more than 16 rows to find the 16 desired, it is a lot less bulky. This means that the subquery ("derived table") will be moderately fast. Then the outer query will still have 16 lookups -- possibly 16*50 blocks to read. The total number of blocks read will still be a lot less.
There is rarely a noticeable difference between ASC and DESC on ORDER BY.
Why does the Optimizer pick the PK instead of the seemingly better secondary index? The PK might be best, especially if the 16 rows are at the 'end' (DESC) of the table. But that would be a terrible choice if it had to scan the entire table without finding 16 rows.
Meanwhile, the wildcard test makes the secondary index only partially useful. The Optimizer makes a decision based on inadequate statistics. Sometimes it feels like the flip of a coin.
If you use my inside-out reformulation, then I recommend the following two composite indexes -- the Optimizer can make a semi-intelligent, semi-correct choice between them for the derived table:
INDEX(systemcreated, referrersiteid, __id),
INDEX(referrersiteid, systemcreated, __id)
It will continue to say "filesort", but don't worry; it's only sorting 16 rows.
And, remember, SELECT * is hurting performance. (Though maybe you can't fix that.)
order by slows query down massively
Your likely best fix is to make sure both sides of your joins have the same datatype. There's no need for one to be varchar and the other nvarchar.
This is a class of problems that comes up quite frequently in DBs. The database is assuming the wrong thing about the composition of the data it's about to deal with. The costs shown in your estimated execution plan are likely a long way from what you'd get in your actual plan. We all make mistakes and really it would be good if SQL Server learned from its own but currently it doesn't. It will estimate a 2 second return time despite being immediately proven wrong again and again. To be fair, I don't know of any DBMS which has machine-learning to do better.
Where your query is fast
The hardest part is done up front by sorting table3. That means it can do an efficient merge join which in turn means it has no reason to be lazy about spooling.
Where it's slow
Having an ID that refers to the same thing stored as two different types and data lengths shouldn't ever be necessary and will never be a good idea for an ID. In this case it's nvarchar in one place and varchar in another. When that makes it fail to get a cardinality estimate, that's the key flaw, and here's why:
The optimizer is hoping to require only a few unique rows from table3. Just a handful of options like "Male", "Female", "Other". That would be what is known as "low cardinality". So imagine tradeNo actually contained IDs for genders for some weird reason. Remember, it's you, with your human skills of contextualisation, who knows that's very unlikely. The DB is blind to that. So here is what it expects to happen: as it executes the query, the first time it encounters the ID for "Male" it will lazily fetch the associated data (like the word "Male") and put it in the spool. Next, because the input is sorted, it expects just a lot more males, and to simply re-use what it has already put in the spool.
Basically, it plans to fetch the data from tables 1 and 2 in a few big chunks stopping once or twice to fetch new details from table 3. In practice the stopping isn't occasional. In fact, it may even be stopping for every single row because there are lots of different IDs here. The lazy spool is like going upstairs to get one small thing at a time. Good if you think you just need your wallet. Not so good if you're moving house, in which case you'll want a big box (the eager spool).
The likely reason that shrinking the size of the field in table3 helped is that it meant it estimated less of a comparative benefit in doing the lazy spool over a full sort up front. With varchar
it doesn't know how much data there is, just how much there could potentially be. The bigger the potential chunks of data that need shuffling, the more physical work needs doing.
What you can do to avoid in future
Make your table schema and indexes reflect the real shape of the data.
- If an ID can be varchar in one table then it's very unlikely to need the extra characters available in nvarchar for another table. Avoid the need for conversions on IDs, and also use integers instead of characters where possible.
- Ask yourself if any of these tables need tradeNo to be filled in for all rows. If so, make it not nullable on that table. Next, ask if the ID should be unique for any of these tables, and set it up as such in the appropriate index. Unique is the definition of maximum cardinality, so it won't make that mistake again.
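As a sketch of those two fixes (the varchar length and index name are assumptions; table3 and tradeNo come from the question), aligning the type and declaring the true uniqueness might look like:

```sql
-- Store tradeNo with one type everywhere so the join needs no conversion
ALTER TABLE table3 ALTER COLUMN tradeNo varchar(20) NOT NULL;

-- If tradeNo really is unique in table3, tell the optimizer so
CREATE UNIQUE NONCLUSTERED INDEX UX_table3_tradeNo ON table3 (tradeNo);
```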
Nudge in the right direction with join order.
- The order you have your joins in the SQL is a signal to the database about how powerful/difficult you expect each join to be. (Sometimes, as a human, you know more. E.g. if querying for 50-year-old astronauts, you know that filtering for astronauts should be the first filter to apply, but you might begin with the age when searching for 50-year-old office workers.) The heavy stuff should come first. It will ignore you if it thinks it has the information to know better, but in this case it's relying on your knowledge.
If all else fails
- A possible fix would be to INCLUDE all the fields you'll need from table3 in the index on TradeReportId. The reason the indexes couldn't help much already is that they make it easy to identify how to re-sort, but the re-sort still hasn't been physically done. That is the work it was hoping to optimize with a lazy spool; if the data were included, it would already be available, so there would be no work to optimize.
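A sketch of that covering index (the included column names are hypothetical; list whatever columns your query actually reads from table3):

```sql
-- Covering index: the join can be satisfied entirely from the index,
-- with no lazy spool and no trip back to the base table
CREATE NONCLUSTERED INDEX IX_table3_TradeReportId
ON table3 (TradeReportId)
INCLUDE (tradeNo, description, tradeDate);
```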
Speeding ORDER BY clause with index
Non-clustered indexes can absolutely be used to optimize away a sort. Indexes are essentially B-trees (balanced search trees), which means they store the values in sorted order.
However, depending on the query, you may be putting SQL Server into a conundrum.
If you have a table with 100 million rows and your query will match 11 million of them, as below, is it cheaper to use an index on category to select the rows and then sort the results by name, or to read all 100 million rows out of an index pre-sorted by name and then filter out 89 million of them by checking the category?
select ...
from product
where category = ?
order by name;
In theory, SQL Server may be able to use an index on name to read rows in order and use the index on category to filter efficiently, but I'm skeptical. I have rarely seen SQL Server use multiple indexes to access the same table in the same query (assuming a single-table selection, ignoring joins and recursive CTEs). It would have to check the index 100 million times. Index searches have a high per-search overhead, so they are efficient when a single search narrows down the result set by a lot.
Without seeing a schema, statistics, and exact query, it's hard for me to say what makes sense, but I expect I would find SQL Server would use an index for the where clause and sort the results, ignoring an index on the sort column.
An index on the sort column may be used if you are selecting the entire table, though, like: select ... from product order by name;
Again, your mileage may vary. This is speculation based on past experience.
Query with index incredibly slow on Sort
You have an index that allows the VendorId = @__vendorId_0 AND ContractDate BETWEEN @__startDate_1 AND @__endDate_2 predicate to be seeked exactly.
SQL Server estimates that 6,657 rows will match this predicate and need to be sorted, so it requests a memory grant suitable for that number of rows.
In reality, for the parameter values where you see the problem, nearly half a million rows are sorted; the memory grant is insufficient and the sort spills to disc.
50 seconds for 10,299 spilled pages does still sound unexpectedly slow but I assume you may well be on some very low SKU in Azure SQL Database?
Some possible solutions to resolve the issue might be to:
- Force it to use an execution plan that is compiled for parameter values with your largest vendor and widest date range (e.g. with an OPTIMIZE FOR hint). This will mean an excessive memory grant for smaller vendors, though, which may mean other queries have to incur memory grant waits.
- Use OPTION (RECOMPILE) so every invocation is recompiled for the specific parameter values passed. In theory every execution then gets an appropriate memory grant, at the cost of more time spent in compilation.
- Remove the need for a sort at all. If you have an index on VendorId, ContractAmount INCLUDE (ContractDate), then the VendorId = @__vendorId_0 part can be seeked and the index read in ContractAmount order. Once 50 rows have been found that match the ContractDate BETWEEN @__startDate_1 AND @__endDate_2 predicate, query execution can stop. SQL Server might not choose this execution plan without hints, though.
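As a sketch of that last option (the table name Contracts and the index name are assumptions; the columns and parameters come from the question):

```sql
-- Index that delivers rows for one vendor already in ContractAmount order,
-- so the TOP (50) can stop as soon as 50 qualifying rows are found
CREATE NONCLUSTERED INDEX IX_Contracts_VendorId_ContractAmount
ON Contracts (VendorId, ContractAmount)
INCLUDE (ContractDate);

SELECT TOP (50) VendorId, ContractAmount, ContractDate
FROM Contracts
WHERE VendorId = @__vendorId_0
  AND ContractDate BETWEEN @__startDate_1 AND @__endDate_2
ORDER BY ContractAmount
OPTION (RECOMPILE);  -- per-invocation compilation, per the second bullet
```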
I'm not sure how easy or otherwise it is to apply query hints through EF but you could look at forcing a plan via query store if you manage to get the desired plan to appear there.