What Column Should the Clustered Index Be Put On

What column should the clustered index be put on?

An index, clustered or non clustred, can be used by the query optimizer if and only if the leftmost key in the index is filtered on. So if you define an index on columns (A, B, C), a WHERE condition on B=@b, on C=@c or on B=@b AND C=@c will not fully leverage the index (see note). This applies also to join conditions. Any WHERE filter that includes A will consider the index: A=@a or A=@a AND B=@b or A=@a AND C=@c or A=@a AND B=@b AND C=@c.

So in your example if you make the clustred index on part_no as the leftmost key, then a query looking for a specific part_id will not use the index and a separate non-clustered index must exist on part-id.

Now about the question which of the many indexes should be the clustered one. If you have several query patterns that are about the same importance and frequency and contradict each other on terms of the keys needed (eg. frequent queries by either part_no or part_id) then you take other factors into consideration:

  • width: the clustered index key is used as the lookup key by all other non-clustered indexes. So if you choose a wide key (say two uniquidentifier columns) then you are making all the other indexes wider, thus consuming more space, generating more IO and slowing down everything. So between equaly good keys from a read point of view, choose the narrowest one as clustered and make the wider ones non-clustered.
  • contention: if you have specific patterns of insert and delete try to separate them physically so they occur on different portions of the clustered index. Eg. if the table acts as a queue with all inserts at one logical end and all deletes at the other logical end, try to layout the clustered index so that the physical order matches this logical order (eg. enqueue order).
  • partitioning: if the table is very large and you plan to deploy partioning then the partitioning key must be the clustered index. Typical example is historical data that is archived using a sliding window partitioning scheme. Even thow the entities have a logical primary key like 'entity_id', the clustred index is done by a datetime column that is also used for the partitioning function.
  • stability: a key that changes often is a poor candidate for a clustered key as each update the clustered key value and force all non-clustered indexes to update the lookup key they store. As an update of a clustered key will also likely relocate the record into a different page it can cause fragmentation on the clustered index.

Note: not fully leverage as sometimes the engine will choose an non-clustered index to scan instead of the clustered index simply because is narrower and thus has fewer pages to scan. In my example if you have an index on (A, B, C) and a WHERE filter on B=@b and the query projects C, the index will be likely used but not as a seek, as a scan, because is still faster than a full clustered scan (fewer pages).

If a table has 'id' column as it's clustered index, is there any benefit in adding 'id' column as an included column in any other non-clustered index

This answer assumes you are planning to run the following query:

SELECT * FROM xyz WHERE status = 'active' AND date > '2021-06-20';

If you only created a non clustered index on (status, date), then it would cover the WHERE clause, but not the SELECT clause. What this means is that SQL Server might choose to use the index to find the matching records in the query. But when it gets to evaluating the SELECT clause, it would be forced to seek back to the clustered index to find the values for the columns not included in the index, other than the id clustered index column (which includes the name column in this case). There is performance penalty in doing this, and SQL Server might, depending on your data, even choose to not use the index because it does not completely cover the query.

To mitigate this, you can define the index in your question, where you include the name value in the leaf nodes. Note that id will be included by default in the index, so we do not need to explicitly INCLUDE it. By following this approach, the index itself is said to completely cover the query, meaning that SQL Server may use the index for the entire query plan. This can lead to fast performance in many cases.

Optimizing queries based on clustered and non-clustered indexes in SQL?

For SQL Server

Q1 Extra space is only needed for the clustered index if it is not unique. SQL Server will add a 4 byte uniquifier internally to a non-unique clustered index. This is because it uses the cluster key as a rowid in non-clustered indexes.

Q2 A non-clustered index can be read in order. That may aid queries where you specify an order. It may also make merge joins attractive. It will also help with range queries (x < col and y > col).

Q3 SQL Server does an extra "bookmark lookup" when using a non-clustered index. But, this is only if it needs a column that isn't in the index. Note also, that you can include extra columns in the leaf level of indexs. If an index can be used without the additional lookup it is called a covering index.

If a bookmark lookup is required, it doesn't take a high percentage of rows until it's quicker just to scan the whole clustered index. The level depends on row size, key size etc. But 5% of rows is a typical cut off.

Q4 If the most important thing in your application was making both these queries as fast as possible, you could create covering index on both of them:

create index IX_1 on employee (age) include (name, salary);
create index IX_2 on employee (salary) include (name, age);

Note you don't have to specifically include the cluster key, as the non-clustered index has it as the row pointer.

Q5 This is more important for cluster keys than non-cluster keys due to the uniquifier. The real issue though is whether an index is selective or not for your queries. Imagine an index on a bit value. Unless the distribution of data is very skewed, such an index is unlikely to be used for anything.


More info about the uniquifier. Imagine you and a non unique clustered index on age, and a non-clustered index on salary. Say you had the following rows:

age | salary | uniqifier
20 | 1000 | 1
20 | 2000 | 2

Then the salary index would locate rows like so

1000 -> 20, 1
2000 -> 20, 2

Say you ran the query select * from employee where salary = 1000, and the optimizer chose to use the salary index. It would then find the pair (20, 1) from the index lookup, then lookup this value in the main data.

Understanding how primary key columns are included in a non-clustered index

The 'Actual Execution Plan' from a query referencing table 'demo' has
suggested that a new non-unique non-clustered index is required for
column 'b' and should include column 'a'.

...

If primary key column 'a' is already part of the non-clustered index,
is column 'a' stored as an include column or is it part of the
non-clustered key?

In your case column a will be presented on all levels of non-clustered index as the part of clustered index key. The index suggested to you is non-unique so it needs uniquefier and the clustered index key will be used for this purpose.

If the offered index was unique, column a would be stored on the leaf level of this index as the part of row locator that in case of a clustered table is clustered index key.

Column a will not be stored twice if you include it explicitly as included column of your index, so I advice you to include it. It will make difference when one day someone decides to turn your clustered table to a heap (by dropping clustered index). In this case if you did not include column a explicitly in your non clustered index, it will be lost and not contained in your non-clustered index anymore

What do Clustered and Non-Clustered index actually mean?

With a clustered index the rows are stored physically on the disk in the same order as the index. Therefore, there can be only one clustered index.

With a non clustered index there is a second list that has pointers to the physical rows. You can have many non clustered indices, although each new index will increase the time it takes to write new records.

It is generally faster to read from a clustered index if you want to get back all the columns. You do not have to go first to the index and then to the table.

Writing to a table with a clustered index can be slower, if there is a need to rearrange the data.

Should I create index on a SQL table column used frequently in WHERE select clause?

Generally yes - you would usually make the clustered index on the primary key.
The exception to this is when you never make lookups based on the primary key, in which case putting the clustered index on another column might be more pertinent.

You should generally add non-clustered indexes to columns that are used as foreign keys, providing there's a reasonably amount of diversity on that column, which I'll explain with an example.

The same applies to columns being used in where clauses, order by etc.

Example

CREATE TABLE Gender (
GenderId INT NOT NULL PRIMARY KEY CLUSTERED
Value NVARCHAR(50) NOT NULL)

INSERT Gender(Id, Value) VALUES (1, 'Male'), (2, 'Female')

CREATE TABLE Person (
PersonId INT NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED,
Name NVARCHAR(50) NOT NULL,
GenderId INT NOT NULL FOREIGN KEY REFERENCES Gender(GenderId)
)

CREATE TABLE Order (
OrderId INT NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED,
OrderDate DATETIME NOT NULL DEFAULT GETDATE(),
OrderTotal DECIMAL(14,2) NOT NULL,
OrderedByPersonId INT NOT NULL FOREIGN KEY REFERENCES Person(PersonId)
)

In this simple set of tables it would be a good idea to put an index on The OrderedByPersonId Column of the Order table, as you are very likely to want to retrieve all the orders for a given person, and it is likely to have a high amount of diversity.
By a high amount of diversity (or selectiveness) I mean that if you have say 1000 customers, each customer is only likely to have 1 or 2 orders each, so looking up all the values from the order table with a given OrderedByPersonId will result in only a very small proportion of the total records in that table being returned.

By contrast there's not much point in putting an index on the GenderId column in the Person table, as it will have a very low diversity. The query optimiser would not use such an index, and INSERT/UPDATE statements would be a marginally slower because of the extra need to maintain the index.

Now to go back to your example - the answer would have to be "it depends". If you have hundreds of cities in your database then yes, it might be a good idea to index that column
If however you only have 3 or 4 cities, then no - don't bother. As a guideline I might say if the selectivity of the column is 0.9 or higher (ie a where clause selecting a single value in the column would result in only 10% or less of the rows being returned) an index might help, but this is by no means a hard and fast figure!

Even if the column is very selective/diverse you might not bother indexing it if queries are only made very infrequently on it.

One of the easiest things to do though is try your queries with the execution plan displayed in SQL management studio. It will suggest indexes for you if the query optimiser thinks that they'll make a positive impact.

Hope that helps!

SQL Server Clustered Index - Order of Index Question

You should order your composite clustered index with the most selective column first. This means the column with the most distinct values compared to total row count.

"B*TREE Indexes improve the performance of queries that select a small percentage of rows from a table." http://www.akadia.com/services/ora_index_selectivity.html?

This article is for Oracle, but still relevant.

Also, if you have a query that runs constantly and returns few fields, you may consider creating a composite index that contains all the fields - it will not have to access the base table, but will instead pull data from the index.

ligget78's comment on making sure to mention the first column in a composite index is important to remember.

Non clustered index on the same column as the clustered primary key should be unique?

Should I make this index UNIQUE as I did?

Yes

Does it help SQL Server in any way by defining it UNIQUE or should it be a simple index?

No.

It's unique, so you should declare it as unique. It makes no difference to the actual implementation however. PersonID is the "row locator" for this table as it's the clustered index key, so it would be added to any non-unique non-clustered index as a key column. But since it's already there, physical storage doesn't change whether or not you declare the index as unique.

Do I still need the clustered index to be the PersonID as well?

You still need some clustered index, and PersonID is probably a good choice. Note that if you change the clustered index to something else, then you must make the index on PersonID unique.

Is it a problem to have two indexes on the same column like this?

It's not a problem. The clustered index has all the columns in it, so scanning it can be expensive. The seperate index on PersonID is useful for counting the rows, for paging, and (as here) for covering queries that need only the PersonID and Name.

SQL Server non-clustered index is not being used

Based on additional details from the comments, it appears that the index you want SQL Server to use isn't a covering index; this means that the index doesn't contain all the columns that are referenced in the query. As such, if SQL Server were to use said index, then it would need to first do a seek on the index, and then perform a key lookup on the clustered index to get the full details of the row. Such lookups can be expensive.

As a result of the index you want not being covering, SQL Server has determined that the index you want it to to use would produce an inferior query plan to simply scanning the entire clustered index; which is by definition covering as it INCLUDEs all other columns not in the CLUSTERED INDEX.

For your index ITINDETAIL20220504 you have INCLUDEd all the columns that are in your SELECT, which means that it is covering. This means that SQL Server can perform a seek on the index, and get all the information it needs from that seek; which is far less costly that a seek followed by a key lookup and quicker than a scan of the entire clustered index. This is why this information works.


We coould put this into some kind of analogy using a Library type scenario, which is full of Books, to help explain this idea more:

Let's say that the Clustered Index is a list of every book in the library sorted by it's ISBN number (The Primary Key). Along side that ISBN number you have the details of the Author, Title, Publication Date, Publisher, If it's hardcover or softcover, the colour of the spine, the section of the Library the Book is located in, the book case, and the shelf.

Now let's say you want to obtain any books by the the Author Brandon Sanderson published on or after 2015-01-01. If you then wanted to you could go through the entire list, one by one, finding the books by that author, checking the publication date, and then writing down it's location so you can go and visit each of those locations and collect the book. This is effectively a Clustered Index Scan.

Now let's say you have a list of all the books in the Library again. The list contains the Author, Publication Date, and the ISBN (The Primary Key), and is ordered by the Author and the Publication Date. You want to fulfil the same task; obtain any books by the the Author Brandon Sanderson published on or after 2015-01-01. Now you can easily go through that list and find all those books, but you don't know where they are. As a result even after you have gone straight to the Brandon Sanderson "section" of the list, you'll still need to write all the ISBNs down, and then find each of those ISBN in the original list, get their location and title. This is your index ITINDETAIL_004; you can easily find the rows you want to filter to, but you don't have all the information so you have to go somewhere else afterwards.

Lastly we have a 3rd list, this list is ordered by the author and then publication date (like the 2nd list), but also includes the Title, the section of the Library the Book is located in, the book case, and the shelf, as well as the ISBN (Primary key). This list is ideal for your task; it's in the right order, as you can easily go to Brandon Sanderson and then the first book published on or after 2015-01-01, and it has the title and location of the book. This is your INDEX ITINDETAIL20220504 would be; it has the information in the order you want, and contains all the information you asked for.


Saying all this, you can force SQL Server to choose the index, but as I said in my comment:

In truth, rarely is the index you think is the correct is the correct index if SQL Server thinks otherwise. Despite what some believe, SQL Server is very good at making informed choices about what index(es) it should be using, provided your statistics are up to date, and the cached plan isn't based on old and out of date information.

Let's, however, demonstrate what happens if you do with a simple set up:

CREATE TABLE dbo.SomeTable (ID int IDENTITY(1,1) PRIMARY KEY,
SomeGuid uniqueidentifier DEFAULT NEWID(),
SomeDate date)
GO
INSERT INTO dbo.SomeTable (SomeDate)
SELECT DATEADD(DAY,T.I,'19000101')
FROM dbo.Tally(100000,0) T;

Now we have a table with 100001 rows, with a Primary Key on ID. Now let's do a query which is an overly simplified version of yours:

SELECT ID,
SomeGuid
FROM dbo.SomeTable
WHERE SomeDate > '20220504';

No surprise, this results in an Clsutered Index Scan:

Query Plan showing Index Scan of Clustered Primary Key

Ok, let's add an index on SomeDate and run the query again:

CREATE INDEX IX_NonCoveringIndex ON dbo.SomeTable (SomeDate);
GO
--Index Scan of Clustered index
SELECT ID,
SomeGuid
FROM dbo.SomeTable
WHERE SomeDate > '20220504';

Same result, and SSMS has even suggested an index:

Query Plan showing Index Scan of Clustered Primary Key, and a suggested index, which "coincidentally" we'll be creating.

Now, as I mentioned, you can force SQL Server to use a specific index. Let's do that and see what happens:

SELECT ID,
SomeGuid
FROM dbo.SomeTable WITH (INDEX(IX_NonCoveringIndex))
WHERE SomeDate > '20220504';

And this gives exactly the plan I suggested; A key lookup:

Query Plan showing Index Seek of created index, and then a key lookup on the Clustered Index

This is expensive. In fact, if we turn on the statistics for IO and Time, the query without the index hint took 40ms, the one with the hint took 107ms in the first run. Subsequent runs all had the second query taking around double the time of the first. IO wise the first query has a simple scan and 398 logical reads; the latter had 5 scans and 114403 logical reads!

Now, finally, let's add that covering Index and run:

CREATE INDEX IX_CoveringIndex ON dbo.SomeTable (SomeDate) INCLUDE (SomeGuid);
GO

SELECT ID,
SomeGuid
FROM dbo.SomeTable
WHERE SomeDate > '20220504';

Here we can see that seek we wanted:

Query Plan showing Index Seek of created covering index index

If we look at the IO and times again compared to the prior 2, we get 1 scan, 202 logical reads, and it was running in about 25ms.



Related Topics



Leave a reply



Submit