Why Can't I Simply Add an Index That Includes All Columns

Why can't I simply add an index that includes all columns?

First of all, an index in SQL Server can only have at most 900 bytes in its index entry. That alone makes it impossible to have an index with all columns.

Most of all: such an index makes no sense at all. What are you trying to achieve??

Consider this: if you have an index on (LastName, FirstName, Street, City), that index will not be able to be used to speed up queries on

  • FirstName alone
  • City
  • Street

That index would be useful for searches on

  • (LastName), or
  • (LastName, FirstName), or
  • (LastName, FirstName, Street), or
  • (LastName, FirstName, Street, City)

but really nothing else - certainly not if you search for just Street or just City!

The order of the columns in your index makes quite a difference, and the query optimizer can't just use any column somewhere in the middle of an index for lookups.

Consider your phone book: it's order probably by LastName, FirstName, maybe Street. So does that indexing help you find all "Joe's" in your city? All people living on "Main Street" ?? No - you can lookup by LastName first - then you get more specific inside that set of data. Just having an index over everything doesn't help speed up searching for all columns at all.

If you want to be able to search by Street - you need to add a separate index on (Street) (and possibly another column or two that make sense).

If you want to be able to search by Occupation or whatever else - you need another specific index for that.

Just because your column exists in an index doesn't mean that'll speed up all searches for that column!

The main rule is: use as few indices as possible - too many indices can be even worse for a system than having no indices at all.... build your system, monitor its performance, and find those queries that cost the most - then optimize these, e.g. by adding indices.

Don't just blindly index every column just because you can - this is a guarantee for lousy system performance - any index also requires maintenance and upkeep, so the more indices you have, the more your INSERT, UPDATE and DELETE operations will suffer (get slower) since all those indices need to be updated.

Indexing every column in a table

Indexing any table, either memory or file system based, will speed up queries that select or sort results based on that column. This is because the index works like a tree structure and the search distance depends on the depth of the tree, which increases a lot slower than the row count of the column (logarithmic).

Indexing every column does not defeat the purpose of the index, but it will slow up inserts and updates because those changes will cause an update of every index of that table. Also, the indexes take up space on the database server, so that is another drawback to be considered.

Other SO questions to read relating to this question:

Best practices for indexing

What is an index

How many indexes are enough

Why use the INCLUDE clause when creating an index?

If the column is not in the WHERE/JOIN/GROUP BY/ORDER BY, but only in the column list in the SELECT clause is where you use INCLUDE.

The INCLUDE clause adds the data at the lowest/leaf level, rather than in the index tree.
This makes the index smaller because it's not part of the tree

INCLUDE columns are not key columns in the index, so they are not ordered.
This means it isn't really useful for predicates, sorting etc as I mentioned above. However, it may be useful if you have a residual lookup in a few rows from the key column(s)

Another MSDN article with a worked example

SQL Server non-clustered index is not being used

Based on additional details from the comments, it appears that the index you want SQL Server to use isn't a covering index; this means that the index doesn't contain all the columns that are referenced in the query. As such, if SQL Server were to use said index, then it would need to first do a seek on the index, and then perform a key lookup on the clustered index to get the full details of the row. Such lookups can be expensive.

As a result of the index you want not being covering, SQL Server has determined that the index you want it to to use would produce an inferior query plan to simply scanning the entire clustered index; which is by definition covering as it INCLUDEs all other columns not in the CLUSTERED INDEX.

For your index ITINDETAIL20220504 you have INCLUDEd all the columns that are in your SELECT, which means that it is covering. This means that SQL Server can perform a seek on the index, and get all the information it needs from that seek; which is far less costly that a seek followed by a key lookup and quicker than a scan of the entire clustered index. This is why this information works.


We coould put this into some kind of analogy using a Library type scenario, which is full of Books, to help explain this idea more:

Let's say that the Clustered Index is a list of every book in the library sorted by it's ISBN number (The Primary Key). Along side that ISBN number you have the details of the Author, Title, Publication Date, Publisher, If it's hardcover or softcover, the colour of the spine, the section of the Library the Book is located in, the book case, and the shelf.

Now let's say you want to obtain any books by the the Author Brandon Sanderson published on or after 2015-01-01. If you then wanted to you could go through the entire list, one by one, finding the books by that author, checking the publication date, and then writing down it's location so you can go and visit each of those locations and collect the book. This is effectively a Clustered Index Scan.

Now let's say you have a list of all the books in the Library again. The list contains the Author, Publication Date, and the ISBN (The Primary Key), and is ordered by the Author and the Publication Date. You want to fulfil the same task; obtain any books by the the Author Brandon Sanderson published on or after 2015-01-01. Now you can easily go through that list and find all those books, but you don't know where they are. As a result even after you have gone straight to the Brandon Sanderson "section" of the list, you'll still need to write all the ISBNs down, and then find each of those ISBN in the original list, get their location and title. This is your index ITINDETAIL_004; you can easily find the rows you want to filter to, but you don't have all the information so you have to go somewhere else afterwards.

Lastly we have a 3rd list, this list is ordered by the author and then publication date (like the 2nd list), but also includes the Title, the section of the Library the Book is located in, the book case, and the shelf, as well as the ISBN (Primary key). This list is ideal for your task; it's in the right order, as you can easily go to Brandon Sanderson and then the first book published on or after 2015-01-01, and it has the title and location of the book. This is your INDEX ITINDETAIL20220504 would be; it has the information in the order you want, and contains all the information you asked for.


Saying all this, you can force SQL Server to choose the index, but as I said in my comment:

In truth, rarely is the index you think is the correct is the correct index if SQL Server thinks otherwise. Despite what some believe, SQL Server is very good at making informed choices about what index(es) it should be using, provided your statistics are up to date, and the cached plan isn't based on old and out of date information.

Let's, however, demonstrate what happens if you do with a simple set up:

CREATE TABLE dbo.SomeTable (ID int IDENTITY(1,1) PRIMARY KEY,
SomeGuid uniqueidentifier DEFAULT NEWID(),
SomeDate date)
GO
INSERT INTO dbo.SomeTable (SomeDate)
SELECT DATEADD(DAY,T.I,'19000101')
FROM dbo.Tally(100000,0) T;

Now we have a table with 100001 rows, with a Primary Key on ID. Now let's do a query which is an overly simplified version of yours:

SELECT ID,
SomeGuid
FROM dbo.SomeTable
WHERE SomeDate > '20220504';

No surprise, this results in an Clsutered Index Scan:

Query Plan showing Index Scan of Clustered Primary Key

Ok, let's add an index on SomeDate and run the query again:

CREATE INDEX IX_NonCoveringIndex ON dbo.SomeTable (SomeDate);
GO
--Index Scan of Clustered index
SELECT ID,
SomeGuid
FROM dbo.SomeTable
WHERE SomeDate > '20220504';

Same result, and SSMS has even suggested an index:

Query Plan showing Index Scan of Clustered Primary Key, and a suggested index, which "coincidentally" we'll be creating.

Now, as I mentioned, you can force SQL Server to use a specific index. Let's do that and see what happens:

SELECT ID,
SomeGuid
FROM dbo.SomeTable WITH (INDEX(IX_NonCoveringIndex))
WHERE SomeDate > '20220504';

And this gives exactly the plan I suggested; A key lookup:

Query Plan showing Index Seek of created index, and then a key lookup on the Clustered Index

This is expensive. In fact, if we turn on the statistics for IO and Time, the query without the index hint took 40ms, the one with the hint took 107ms in the first run. Subsequent runs all had the second query taking around double the time of the first. IO wise the first query has a simple scan and 398 logical reads; the latter had 5 scans and 114403 logical reads!

Now, finally, let's add that covering Index and run:

CREATE INDEX IX_CoveringIndex ON dbo.SomeTable (SomeDate) INCLUDE (SomeGuid);
GO

SELECT ID,
SomeGuid
FROM dbo.SomeTable
WHERE SomeDate > '20220504';

Here we can see that seek we wanted:

Query Plan showing Index Seek of created covering index index

If we look at the IO and times again compared to the prior 2, we get 1 scan, 202 logical reads, and it was running in about 25ms.

Explanation on index on a datetime field and included columns

We will need more information on your table examples of your queries to address this fully, but:

  • DateTime columns should be highly selective by themselves, so an index with TimeOperation as the first column should address the bulk of queries against TimeOperation.
  • Do not add all columns blindly to an index, or even on included indexes - this will make the index page density worse and be counter productive (you would be duplicating your table in an index).
  • If all data in your database centres around TimeOperation, you might consider building your clustered index around it.
  • If you have queries just on field1 = x then you need a separate index just for field1 (assuming that it is suitably selective), i.e. no TimeOperation on the index if its not in the WHERE clause of your query.
  • Yes, you are right, when SQL locates a record in an index, it needs to do a key (or RID) lookup back into the cluster to retrieve the rest of the columns. If your non clustered index Includes the other columns in your select statement, the lookup can be avoided. But since you are using SELECT(*), covering indexes are unlikely to help .

Edit

Explanation - Selectivity and density are explained in detail here. e.g. iff your queries against TimeOperation return only a small number of rows (rule of thumb is < 5%, but this isn't always), will the index be used, i.e. your query is selective enough for SQL to choose the index on TimeOperation.

The basic starting point would be:

CREATE TABLE [MyTable]
(
intID INT ID identity(1,1) NOT NULL,
field1 NVARCHAR(20),
-- .. More columns, which may be selected, but not filtered
TimeOperation DateTime,

CONSTRAINT PK_MyTable PRIMARY KEY (IntId)
);

And the basic indexes will be

CREATE NONCLUSTERED INDEX IX_MyTable_1 ON [MyTable](TimeOperation);
CREATE NONCLUSTERED INDEX IX_MyTable_2 ON [MyTable](Field1);

Clustering Consideration / Option

If most of your records are inserted in 'serial' ascending TimeOperation order, i.e. intId and TimeOperation will both increase in tandem, then I would leave the clustering on intID (the default) (i.e. table DDL is PRIMARY KEY CLUSTERED (IntId), which is the default anyway).

However, if there is NO correlation between IntId and TimeOperation, and IF most of your queries are of the form SELECT * FROM [MyTable] WHERE TimeOperation between xx and yy then CREATE CLUSTERED INDEX CL_MyTable ON MyTable(TimeOperation) (and changing PK to PRIMARY KEY NONCLUSTERED (IntId)) should improve this query (Rationale: since contiguous times are kept together, fewer pages need to be read, and the bookmark lookup will be avoided). Even better, if values of TimeOperation are guaranteed to be unique, then CREATE UNIQUE CLUSTERED INDEX CL_MyTable ON MyTable(TimeOperation) will improve density as it will avoid the uniqueifier.

Note - for the rest of this answer, I'm assuming that your IntId and TimeOperations ARE strongly correlated and hence the clustering is by IntId.

Covering Indexes

As others have mentioned, your use of SELECT (*) is bad practice and inter alia means covering indexes won't be of any use (the exception being COUNT(*)).
If your queries weren't SELECT(*), but instead e.g.

SELECT TimeOperation, field1
FROM
WHERE TimeOperation BETWEEN x and y -- and returns < 5% data.

Then altering your index on TimeOperation to include field1

CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation) INCLUDE(Field1);

OR adding both to the index (with the most common filter first, or the most selective first if both filters are always present)

CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation, Field1);

Either will avoid the rid / key lookup. The second (,) option will address your query where BOTH TimeOperation and Field1 are filtered in a WHERE or HAVING clause.

Re : What's the difference between index on (TimeOperation, Field1) and separate indexes?

e.g.

CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation, Field1);

will not be useful for the query

SELECT ... FROM MyTable WHERE Field1 = 'xyz';

The index will only be useful for the queries which have TimeOperation

SELECT ... FROM MyTable WHERE TimeOperation between x and y;

OR

SELECT ... FROM MyTable WHERE TimeOperation between x and y AND Field1 = 'xyz';

Hope this helps?

Why use INCLUDE in a SQL index

The answers so far are all correct and all - but they might not convey enough what you gain from a covering index.

In your case, you have a table Foo and some fields, including an Id (which I assume is the primary key), and a SubId which is some additional ID of some kind.

You also have an index IX_Foo which I assume had only Id in it for now.

So now you need to find the SubId for Id=4.

SELECT Id, SubId
FROM Foo
WHERE Id=4
  • SQL Server will look at the SELECT statement and determine it can use IX_Foo
  • it will then go search for the value Id=4 in your index IX_Foo
  • when it finds it, it now needs the value of SubId, too
  • the non-clustered index IX_Foo will contain the clustering key value
  • using that clustering key value, SQL Server will do a "bookmark lookup" to locate the actual data page where your entire data row is located
  • it will fetch that page and extract the value for SubId from it
  • it will return those values to satisfy your query

The main point here is: once SQL Server has found your Id=4 in the IX_Foo index, it will then need to do another I/O operation, a bookmark lookup, to go fetch the whole data row, in order to be able to find the SubId value.

If you have a covering index, e.g. IX_Foo also includes SubId, that extra I/O to do the bookmark lookup is eliminated. Once the value Id=4 is found in the IX_Foo index, that index page in your non-clustered index will also include the value of SubId - SQL Server can now return those two values you asked for in your SELECT query without having to do an extra (potentially expensive and thus slow) bookmark lookup just to go fetch another Id column.

That's the main benefit of covering indices - if you only need one or two extra columns, besides the index values you're doing the lookup on, by including those values into the index itself, you can save yourself a lot of bookmark lookups and thus speed things up significantly. You should however only include very few, and small bits of information - don't duplicate your entire data rows into all non-clustered indices! That's not the point.

UPDATE: the trade-off is this: if you have an index on (Id, SubId), all the pages in the index have both columns - the whole index tree through.

If you INCLUDE(SubId), the SubId fields are only present at the leaf level.

This means

  • SQL Server can't search and compare on SubId (the values are not in the index tree)
  • less space is used since the values are only on the leaf level


Related Topics



Leave a reply



Submit