SQL Server Clustered Index - Order of Index Question

SQL Server Clustered Index - Order of Index Question

You should order your composite clustered index with the most selective column first. This means the column with the most distinct values compared to total row count.

"B*TREE Indexes improve the performance of queries that select a small percentage of rows from a table." http://www.akadia.com/services/ora_index_selectivity.html?

This article is for Oracle, but still relevant.

Also, if you have a query that runs constantly and returns few fields, you may consider creating a composite index that contains all the fields - it will not have to access the base table, but will instead pull data from the index.

ligget78's comment on making sure to mention the first column in a composite index is important to remember.

Column order on a clustered index

As Michael mentioned, the column order in your index is directly linked to what you have in your WHERE clause.

To illustrate this point, as a test I created three tables, each with a different column as the first in the clustered index. Then, I populated them with 10,000 rows of data.

Executing the same SQL query across all three tables yields very different performance results:

set statistics io on
set statistics time on

select * from DtFirst where DtCol between '4/1/2010' and '6/1/2010'
select * from IntFirst where DtCol between '4/1/2010' and '6/1/2010'
select * from BitFirst where DtCol between '4/1/2010' and '6/1/2010'

set statistics io off
set statistics time off

Statistics are as follows:

First table (Date Column first)

Scan count 1, logical reads 3

CPU time = 0 ms, elapsed time = 0 ms.



Second table (Date Column second)

Scan count 1, logical reads 29

CPU time = 0 ms, elapsed time = 113 ms.



Third table (Date Column third)

Scan count 1, logical reads 29

CPU time = 0 ms, elapsed time = 145 ms.



As you can see, querying on the date on the table where the date column is ordered first in the clustered index clearly produces much better results.

Best order of columns in clustered index

For this query, the index you want is either (Active, Deleted, ControlName) or (Deleted, Active, ControlName).

It does not have to be a clustered index.

Sort order of an SQL Server 2008+ clustered index

There is a difference. Inserting out of Cluster Order causes massive fragmentation.

When you run the following code the DESC clustered index is generating additional UPDATE operations at the NONLEAF level.

CREATE TABLE dbo.TEST_ASC(ID INT IDENTITY(1,1) 
,RandNo FLOAT
);
GO
CREATE CLUSTERED INDEX cidx ON dbo.TEST_ASC(ID ASC);
GO

CREATE TABLE dbo.TEST_DESC(ID INT IDENTITY(1,1)
,RandNo FLOAT
);
GO
CREATE CLUSTERED INDEX cidx ON dbo.TEST_DESC(ID DESC);
GO

INSERT INTO dbo.TEST_ASC VALUES(RAND());
GO 100000

INSERT INTO dbo.TEST_DESC VALUES(RAND());
GO 100000

The two Insert statements produce exactly the same Execution Plan but when looking at the operational stats the differences show up against [nonleaf_update_count].

SELECT 
OBJECT_NAME(object_id)
,*
FROM sys.dm_db_index_operational_stats(DB_ID(),OBJECT_ID('TEST_ASC'),null,null)
UNION
SELECT
OBJECT_NAME(object_id)
,*
FROM sys.dm_db_index_operational_stats(DB_ID(),OBJECT_ID('TEST_DESC'),null,null)

There is an extra –under the hood- operation going on when SQL is working with DESC index that runs against the IDENTITY.
This is because the DESC table is becoming fragmented (rows inserted at the start of the page) and additional updates occur to maintain the B-tree structure.

The most noticeable thing about this example is that the DESC Clustered Index becomes over 99% fragmented. This is recreating the same bad behaviour as using a random GUID for a clustered index.
The below code demonstrates the fragmentation.

SELECT 
OBJECT_NAME(object_id)
,*
FROM sys.dm_db_index_physical_stats (DB_ID(), OBJECT_ID('dbo.TEST_ASC'), NULL, NULL ,NULL)
UNION
SELECT
OBJECT_NAME(object_id)
,*
FROM sys.dm_db_index_physical_stats (DB_ID(), OBJECT_ID('dbo.TEST_DESC'), NULL, NULL ,NULL)

UPDATE:

On some test environments I'm also seeing that the DESC table is subject to more WAITS with an increase in [page_io_latch_wait_count] and [page_io_latch_wait_in_ms]

UPDATE:

Some discussion has arisen about what is the point of a Descending Index when SQL can perform Backward Scans. Please read this article about the limitations of Backward Scans.

Clustered Index and Sorting

Note: question is cross-posted here.

if I have a clustered index on [column_a], [column_b], and [column_c] and run the same query from above, will the data ALWAYS come back sorted based on that order since that's the order that the clustered index was created on?

No.

SQL Server does not guarantee that it will return data in any order unless you specify the order. It is easy to prove things can go wrong by simply creating a covering, non-clustered index that leads on a different column:

  • Example db<>fiddle

But things can go sideways in other ways, too, e.g. when parallelism or partitioning come into play and SQL Server re-assembles the data from different threads, or when the query gets more complex using joins or filters and a different plan other than a clustered index scan makes sense. Leaving off the order by clause is telling SQL Server: "I don't care about order."

Also, just as a point of clarification:

If I have an ORDER BY clause on all the columns used in the clustered index, the execution plan will not have a sort operator.

...this is true only if the columns are listed in the exact same order as the key definition. ORDER BY c, b, a is "all the columns" but it obviously will produce different output (and require some type of sort operation to get there).

If you expect and want to be comfortable relying on a certain order, always use an ORDER BY clause.

Further reading:

  • No Seatbelt - Expecting Order without ORDER BY (Conor Cunningham)
  • Without ORDER BY, there is no default sort order. (Alexander Kuznetsov)
  • T-SQL Tuesday #56: SQL Server Assumptions (me - see #3)
  • Bad Habits to Kick : Relying on undocumented behavior (also me)
  • Why is SSMS inserting new rows at the top of a table not the bottom? (dba.se question)

Clustered index default sorting order

Ascending order is more efficient. This is true with all RDBMS I have worked with and it's true for a number of reasons. The biggest issue with descending sorts (ordered-backward scans) in SQL Server is that ordered-backward scans cannot leverage a parallel execution plan.

Keep in mind that sorting has a n(log n) complexity which is slower than linear. In other words, each row gets more expensive to sort as you add more rows. This is why the optimizer often chooses a parallel execution plan to handle sorts. If you have a lot of rows to sort you want your optimizer to have the option to parallelize the sort. So, again - ascending is more efficient.

There are other optimizations not available to the optimizer when you do an ordered-backward scan. For example, when using partitioned window functions (function that use the OVER clause and PARTITION BY) ascending is usually more efficient.

Here are two really good articles about this topic (both by Itzik Ben-Gan):

SQL Server: Avoiding a Sort with Descending Order

Descending Indexes

Speeding ORDER BY clause with index

Non-clustered indexes can absolutely be used to optimize away a sort. Indexes are essentially binary search trees, which means they contain the values sorted in order.

However, depending on the query, you may be putting SQL Server into a conundrum.

If you have a table with 100 million rows, your query will match 11 million of them, like below, is it cheaper to use an index on category to select the rows and sort the results by name, or to read all 100 million rows out of the index pre-sorted by name, and then filter down 89 million of them by checking the category?

select ...
from product
where category = ?
order by name;

In theory, SQL Server may be able to use an index on name to read rows in order and use the index on category to filter efficiently? I'm skeptical. I have rarely seen SQL Server use multiple indexes to access the same table in the same query (assuming a single table selection, ignoring joins or recursive CTE's). It would have to check the index 100 million times. Indexes have a high overhead cost per index search, so they are efficient when a single search narrows down the result set by a lot.

Without seeing a schema, statistics, and exact query, it's hard for me to say what makes sense, but I expect I would find SQL Server would use an index for the where clause and sort the results, ignoring an index on the sort column.

An index on the sort column may be used if you are selecting the entire table though. Like select ... from product order by name;

Again, your milage may vary. This is speculation based off past experience.



Related Topics



Leave a reply



Submit