Why Is My SQL Server ORDER BY Slow Despite the Ordered Column Being Indexed

Why is my SQL Server ORDER BY slow despite the ordered column being indexed?

If your query does not contain an ORDER BY, it will return the data in whatever order it was found. There is no guarantee that the data will even be returned in the same order when you run the query again.

When you include an ORDER BY clause, the database has to build a list of the rows in the correct order and then return the data in that order. This can take a lot of extra processing, which translates into extra time.

Sorting also takes longer when each row is wide, and your query might be returning a large number of columns. At some point you will run out of buffer space, the database will have to start swapping, and performance will go downhill.

Try returning fewer columns (specify the columns you need instead of SELECT *) and see if the query runs faster.
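
For instance, a minimal sketch with purely illustrative table and column names:

    -- Instead of pulling every column...
    SELECT *
    FROM dbo.Orders
    ORDER BY OrderDate;

    -- ...list only the columns you actually need, so less data has to be
    -- sorted and returned (table and column names here are illustrative).
    SELECT OrderId, CustomerId, OrderDate
    FROM dbo.Orders
    ORDER BY OrderDate;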

SQL Query is slow when ORDER BY statement added

As far as why the query is slower, the query is required to return the rows "in order", so it either needs to do a sort, or it needs to use an index.

Using the index with a leading column of DateCreated, SQL Server can avoid a sort. But SQL Server would also have to visit the pages in the underlying table to evaluate whether each row is to be returned, looking at the values in the Status and Name columns.

If the optimizer chooses not to use the index with DateCreated as the leading column, then it needs to first locate all of the rows that satisfy the predicates, and then perform a sort operation to get those rows in order. Then it can return the first fifty rows from the sorted set. (SQL Server wouldn't necessarily need to sort the entire set, but it would need to go through that whole set and do sufficient sorting to guarantee that it's got the "first fifty" that need to be returned.)

NOTE: I suspect you already know this, but to clarify: SQL Server honors the ORDER BY before the TOP 50. If you wanted any 50 rows that satisfied the predicates, but not necessarily the 50 rows with the lowest values of DateCreated, you could restructure/rewrite your query to get (at most) 50 rows, and then sort just those.
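
A rough sketch of that restructuring, assuming the Documents table and the predicates implied later in this answer (Status = 'New' and Name empty or 'None'); note that this returns some 50 matching rows sorted by DateCreated, not necessarily the 50 earliest ones:

    SELECT d.*
    FROM (
        SELECT TOP 50 *
        FROM Documents
        WHERE Status = 'New'
          AND COALESCE(Name, '') IN ('', 'None')
    ) AS d
    ORDER BY d.DateCreated;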


A couple of ideas to improve performance

Adding a composite index (as other answers have suggested) may offer some improvement, for example:

    CREATE NONCLUSTERED INDEX Documents_IX1   -- index name is illustrative
    ON Documents (Status, DateCreated, Name)

SQL Server might be able to use that index to satisfy the equality predicate on Status and also return the rows in DateCreated order without a sort operation. SQL Server may also be able to satisfy the predicate on Name from the index, limiting the number of lookups to pages in the underlying table (which it still needs to do, for the rows that are returned, to get "all" of the columns for the row).


For SQL Server 2008 or later, I'd consider a filtered index, dependent on the cardinality of Status='New' (that is, if the rows that satisfy the predicate Status='New' are a relatively small subset of the table):

    CREATE NONCLUSTERED INDEX Documents_FIX
    ON Documents (Status, DateCreated, Name)
    WHERE Status = 'New';

I would also modify the query to specify ORDER BY Status, DateCreated, Name, so that the ORDER BY clause matches the index; it doesn't really change the order in which the rows are returned.
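
A sketch of the adjusted query, reconstructing the predicates from the computed-column example further down:

    SELECT TOP 50 *
    FROM Documents
    WHERE Status = 'New'
      AND COALESCE(Name, '') IN ('', 'None')
    ORDER BY Status, DateCreated, Name;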


As a more complicated alternative, I would consider adding a persisted computed column and a filtered index on that column:

    ALTER TABLE Documents
    ADD new_none_date_created AS
        CASE
            WHEN Status = 'New' AND COALESCE(Name, '') IN ('', 'None') THEN DateCreated
            ELSE NULL
        END
        PERSISTED;

    CREATE NONCLUSTERED INDEX Documents_FIXP
    ON Documents (new_none_date_created)
    WHERE new_none_date_created IS NOT NULL;

Then the query could be re-written:

    SELECT TOP 50 *
    FROM Documents
    WHERE new_none_date_created IS NOT NULL
    ORDER BY new_none_date_created;

SQL 'ORDER BY' slowness

I did some performance testing last night on a more production-type database (not the developer one) and here is what I found:

Total rows in table A: 13000

Total rows in table B: 5000

Rows returned by the join query: 5000

Time taken with the ORDER BY clause: ~5.422 seconds

Time taken without the ORDER BY clause: ~5.345 seconds.

So it looked like the ORDER BY wasn't making much of a difference. (I am okay with the few milliseconds added.)

I also tested by setting all B.SYNTAX values to NULL to make sure that it wasn't just network latency from transferring so much data.

Then I removed B.SYNTAX from the SELECT clause and the query took only 0.8 seconds!

So it seems that the CLOB column is the bottleneck. That doesn't mean I have found the solution to making this query faster, but at least I won't spend time writing a sorting algorithm.
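
A hedged sketch of the two variants compared above; the join and ORDER BY columns are hypothetical, and B.SYNTAX is the CLOB column from the test:

    -- With the CLOB in the select list: ~5.4 seconds in the test above.
    SELECT A.*, B.SYNTAX
    FROM A JOIN B ON B.A_ID = A.ID
    ORDER BY A.NAME;

    -- Without the CLOB: ~0.8 seconds. The sort was never the real cost;
    -- hauling the CLOB data around was.
    SELECT A.*
    FROM A JOIN B ON B.A_ID = A.ID
    ORDER BY A.NAME;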

Thanks to all who replied. I learned quite a bit and it led me to try a few different things out.

SQL Server 2008: Ordering by datetime is too slow

Ordering by id probably uses a clustered index scan, while ordering by datetime uses either a sort or an index lookup.

Both of these methods are slower than a clustered index scan.

If your table is clustered by id, basically it means it is already sorted. The records are contained in a B+Tree which has a linked list linking the pages in id order. The engine should just traverse the linked list to get the records ordered by id.

If the ids were inserted in sequential order, the physical order of the rows will match the logical order and the clustered index scan will be even faster.
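
In other words, these are the two orderings being compared (table and column names as used in this answer):

    -- Follows the clustered index's page chain; no sort needed.
    SELECT * FROM mytable ORDER BY id;

    -- Needs either a full sort, or a seek on the datetime index plus a key
    -- lookup into the table for every row.
    SELECT * FROM mytable ORDER BY datetime;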

If you want your records to be ordered by datetime, there are two options:

  • Take all records from the table and sort them. The slowness is obvious.
  • Use the index on datetime. The index is stored in a separate space on disk, so the engine needs to shuttle between the index pages and the table pages in a nested loop. That is slower too.

To improve the ordering, you can create a separate covering index on datetime:

CREATE INDEX ix_mytable_datetime ON mytable (datetime) INCLUDE (field1, field2, …)

and include all the columns you use in your query in that index.

This index is like a shadow copy of your table, but with the data sorted in a different order.

This will allow SQL Server to get rid of the key lookups (since the index contains all the data), which will make ordering by datetime as fast as ordering by id.
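
Once that covering index exists, a query such as the one below (columns taken from the example INCLUDE list, purely for illustration) can be answered entirely from the index, already in order, with no key lookups and no sort:

    SELECT field1, field2
    FROM mytable
    ORDER BY datetime;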

Update:

A fresh blog post on this problem:

  • SQL Server: clustered index and ordering

MySQL query is slow only when using ORDER BY field DESC and LIMIT

  • "Avg_row_length: 767640"; lots of MEDIUMTEXT. A row is limited to about 8KB; overflow goes into "off-record" blocks. Reading those blocks takes extra disk hits.

  • SELECT * will reach for all those fat columns. The total will be about 50 reads (of 16KB each). This takes time.

  • (Exp 4) SELECT a,b,c,d ran faster because it did not need to fetch all ~50 blocks per row.

  • In your secondary index, (systemcreated, referrersiteid, __id), only the first column is useful. This is because of systemcreated LIKE 'xxx%'. That is a "range", and once a range is hit, the rest of the index is ineffective. Except...

  • "Index hints" (USE INDEX(...)) may help today but may make things worse tomorrow when the data distribution changes.

  • If you can't get rid of the wild cards in LIKE, I recommend these two indexes:

      INDEX(systemcreated)
      INDEX(referrersiteid)
  • The real speedup can occur by turning the query inside out. That is, find the 16 ids first, then go looking for all those bulky columns:

      SELECT r2...   -- whatever you want
      FROM
      (
          SELECT __id
          FROM referrals
          WHERE `systemcreated` LIKE 'XXXXXX%'
            AND `referrersiteid` LIKE 'XXXXXXXXXXXX%'
          ORDER BY __id DESC
          LIMIT 16
      ) AS r1
      JOIN referrals r2 USING(__id)
      ORDER BY __id DESC   -- yes, this needs repeating

And keep the 3-column secondary index that you have. Even if it must scan a lot more than 16 rows to find the 16 desired, it is a lot less bulky. This means that the subquery ("derived table") will be moderately fast. Then the outer query will still have 16 lookups -- possibly 16*50 blocks to read. The total number of blocks read will still be a lot less.

There is rarely a noticeable difference between ASC and DESC on ORDER BY.

Why does the Optimizer pick the PK instead of the seemingly better secondary index? The PK might be best, especially if the 16 rows are at the 'end' (DESC) of the table. But that would be a terrible choice if it had to scan the entire table without finding 16 rows.

Meanwhile, the wildcard test makes the secondary index only partially useful. The Optimizer makes a decision based on inadequate statistics. Sometimes it feels like the flip of a coin.

If you use my inside-out reformulation, then I recommend the following two composite indexes -- the Optimizer can make a semi-intelligent, semi-correct choice between them for the derived table:

INDEX(systemcreated, referrersiteid, __id),
INDEX(referrersiteid, systemcreated, __id)

It will continue to say "filesort", but don't worry; it's only sorting 16 rows.

And, remember, SELECT * is hurting performance. (Though maybe you can't fix that.)

order by slows query down massively

Your likely best fix is to make sure both sides of your joins have the same datatype. There's no need for one to be varchar and the other nvarchar.

This is a class of problems that comes up quite frequently in DBs. The database is assuming the wrong thing about the composition of the data it's about to deal with. The costs shown in your estimated execution plan are likely a long way from what you'd get in the actual plan. We all make mistakes, and really it would be good if SQL Server learned from its own, but currently it doesn't. It will estimate a 2-second return time despite being immediately proven wrong again and again. To be fair, I don't know of any DBMS that has machine learning to do better.

Where your query is fast

The hardest part is done up front by sorting table3. That means it can do an efficient merge join which in turn means it has no reason to be lazy about spooling.

Where it's slow

Having an ID that refers to the same thing stored as two different types and data lengths shouldn't ever be necessary and will never be a good idea for an ID. In this case it's nvarchar in one place and varchar in another. When that makes it fail to get a cardinality estimate, that's the key flaw, and here's why:

The optimizer is hoping to need only a few unique rows from table3 -- just a handful of options like "Male", "Female", "Other". That would be what is known as "low cardinality". So imagine tradeNo actually contained IDs for genders, for some weird reason. Remember, it's you, with your human skills of contextualisation, who knows that's very unlikely; the DB is blind to that. So here is what it expects to happen: as it executes the query, the first time it encounters the ID for "Male" it will lazily fetch the associated data (like the word "Male") and put it in the spool. Next, because the input is sorted, it expects simply a lot more males, and to reuse what it has already put in the spool.

Basically, it plans to fetch the data from tables 1 and 2 in a few big chunks, stopping once or twice to fetch new details from table 3. In practice the stopping isn't occasional; it may even be stopping for every single row, because there are lots of different IDs here. The lazy spool is like going upstairs to get one small thing at a time. Good if you think you just need your wallet. Not so good if you're moving house, in which case you'll want a big box (the eager spool).

The likely reason that shrinking the size of the field in table3 helped is that the optimizer then saw less of a comparative benefit in the lazy spool over a full sort up front. With varchar it doesn't know how much data there is, just how much there could potentially be. The bigger the potential chunks of data that need shuffling, the more physical work needs doing.

What you can do to avoid in future

Make your table schema and indexes reflect the real shape of the data.

  • If an ID can be varchar in one table then it's very unlikely to need the extra characters available in nvarchar in another table. Avoid the need for conversions on IDs, and use integers instead of characters where possible.
  • Ask yourself if any of these tables needs tradeNo to be filled in for all rows. If so, make it not nullable on that table. Next, ask if the ID should be unique for any of these tables and set it up as such in the appropriate index. Unique is the definition of maximum cardinality, so the optimizer won't make that mistake again (see the sketch after this list).
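
A minimal sketch of those schema fixes, assuming the table and column names used in this answer (table1, table3, tradeNo); the varchar length is illustrative:

    -- Store the ID with the same type and length on both sides of the join.
    ALTER TABLE table1 ALTER COLUMN tradeNo varchar(20) NOT NULL;
    ALTER TABLE table3 ALTER COLUMN tradeNo varchar(20) NOT NULL;

    -- If tradeNo identifies a row in table3, say so: a unique index is the
    -- definition of maximum cardinality, so the optimizer can't over-estimate
    -- how many duplicates it could reuse from a spool.
    CREATE UNIQUE NONCLUSTERED INDEX UX_table3_tradeNo ON table3 (tradeNo);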

Nudge in the right direction with join order.

  • The order you have your joins in the SQL is a signal to the database about how powerful/difficult you expect each join to be. (Sometimes, as a human, you know more: e.g. if querying for 50-year-old astronauts, you know that filtering for astronauts should come first, but you might begin with the age when searching for 50-year-old office workers.) The heavy stuff should come first. The database will ignore you if it thinks it has the information to know better, but in this case it's relying on your knowledge.

If all else fails

  • A possible fix would be to INCLUDE all the fields you'll need from table3 in the index on TradeReportId. The reason the indexes couldn't help much already is that they make it easy to identify how to re-sort, but the re-sort still has to be done physically. That is the work the optimizer was hoping to avoid with a lazy spool; if the data were included in the index it would already be available, so there would be no work left to optimize.
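
A hedged sketch of such an index; the index name is made up and the INCLUDE list is a placeholder for whatever table3 columns the query actually reads:

    CREATE NONCLUSTERED INDEX IX_table3_TradeReportId_covering
        ON table3 (TradeReportId)
        INCLUDE (tradeNo /* ...plus any other table3 columns the query selects */);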

Speeding ORDER BY clause with index

Non-clustered indexes can absolutely be used to optimize away a sort. Indexes are essentially B-trees, which means they store the key values in sorted order.

However, depending on the query, you may be putting SQL Server into a conundrum.

Suppose you have a table with 100 million rows and a query, like the one below, that will match 11 million of them. Is it cheaper to use an index on category to select the rows and then sort the results by name, or to read all 100 million rows out of an index pre-sorted by name and then filter out the 89 million that don't match the category?

select ...
from product
where category = ?
order by name;
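
For concreteness, these are the kinds of single-column indexes the optimizer would be choosing between (index names are illustrative):

    CREATE INDEX ix_product_category ON product (category);  -- seek the matching rows, then sort by name
    CREATE INDEX ix_product_name     ON product (name);      -- read pre-sorted by name, then filter by category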

In theory, SQL Server might be able to use an index on name to read rows in order and an index on category to filter efficiently, but I'm skeptical. I have rarely seen SQL Server use multiple indexes to access the same table in the same query (assuming a single-table select, ignoring joins and recursive CTEs). It would have to check the category index 100 million times, and indexes have a high overhead cost per search, so they are efficient when a single search narrows down the result set by a lot.

Without seeing a schema, statistics, and the exact query, it's hard for me to say what makes sense, but I expect SQL Server would use an index for the WHERE clause and sort the results, ignoring an index on the sort column.

An index on the sort column may be used if you are selecting the entire table, though, as in select ... from product order by name;

Again, your mileage may vary. This is speculation based on past experience.

Query with index incredibly slow on Sort

You have an index that allows the

VendorId = @__vendorId_0 and ContractDate BETWEEN @__startDate_1 AND @__endDate_2

predicate to be seeked exactly.

SQL Server estimates that 6,657 rows will match this predicate and need to be sorted, so it requests a memory grant suitable for that number of rows.

In reality, for the parameter values where you see the problem, nearly half a million rows are sorted; the memory grant is insufficient and the sort spills to disk.

50 seconds for 10,299 spilled pages still sounds unexpectedly slow, but I assume you may well be on a very low SKU in Azure SQL Database?

Some possible ways to resolve the issue:

  1. Force it to use an execution plan that is compiled for parameter values with your largest vendor and wide date range (e.g. with OPTIMIZE FOR hint). This will mean an excessive memory grant for smaller vendors though which may mean other queries have to incur memory grant waits.
  2. Use OPTION (RECOMPILE) so every invocation is recompiled for the specific parameter values passed. This means in theory every execution will get an appropriate memory grant at the cost of more time spent in compilation.
  3. Remove the need for a sort at all. If you have an index on VendorId, ContractAmount INCLUDE (ContractDate) then the VendorId = @__vendorId_0 part can be seeked and the index read in ContractAmount order. Once 50 rows have been found that match the ContractDate BETWEEN @__startDate_1 AND @__endDate_2 predicate then query execution can stop. SQL Server might not choose this execution plan without hints though.
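
Hedged sketches of options 2 and 3: the table name dbo.Contracts and the index name are assumptions, the column and parameter names come from the predicate shown above, and the query shape (TOP 50, ORDER BY ContractAmount) is inferred from option 3.

    -- Option 2: recompile each invocation so the memory grant is sized for
    -- the actual parameter values.
    SELECT TOP 50 *
    FROM dbo.Contracts
    WHERE VendorId = @__vendorId_0
      AND ContractDate BETWEEN @__startDate_1 AND @__endDate_2
    ORDER BY ContractAmount
    OPTION (RECOMPILE);

    -- Option 3: let the VendorId seek return rows already in ContractAmount
    -- order, so no sort (and no large memory grant) is needed at all.
    CREATE NONCLUSTERED INDEX IX_Contracts_VendorId_ContractAmount
        ON dbo.Contracts (VendorId, ContractAmount)
        INCLUDE (ContractDate);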

I'm not sure how easy or otherwise it is to apply query hints through EF, but you could look at forcing a plan via Query Store if you manage to get the desired plan to appear there.


