Role of Selectivity in Index Scan/Seek

From the SimpleTalk article by Robert Sheldon, "14 SQL Server Indexing Questions You Were Too Shy To Ask":

The ratio of unique values within a key column is referred to as index
selectivity. The more unique the values, the higher the selectivity,
which means that a unique index has the highest possible selectivity.
The query engine loves highly selective key columns, especially if
those columns are referenced in the WHERE clause of your frequently
run queries. The higher the selectivity, the faster the query engine
can reduce the size of the result set. The flipside, of course, is
that a column with relatively few unique values is seldom a good
candidate to be indexed.
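
As a rough check, selectivity can be estimated as the ratio of distinct values to total rows. A minimal sketch, assuming a hypothetical dbo.Orders table with a CustomerID column:

select count(distinct CustomerID) * 1.0 / count(*) as selectivity
from dbo.Orders;
-- A result close to 1.0 means the column is highly selective (a good
-- index candidate); a result close to 0 means few distinct values.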

From the SqlServerCentral article:

In general, a nonclustered index should be selective. That is, the
values in the column should be fairly unique and queries that filter
on it should return small portions of the table.

The reason for this is that key/RID lookups are expensive operations
and if a nonclustered index is to be used to evaluate a query it needs
to be covering or sufficiently selective that the costs of the lookups
aren’t deemed to be too high.
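
A covering index sidesteps the lookup cost entirely, because the included columns are stored at the leaf level of the index. A sketch, using hypothetical table and column names:

create index ix_Orders_CustomerID_covering
    on dbo.Orders (CustomerID)
    include (OrderDate, TotalDue);
-- Queries filtering on CustomerID and selecting only OrderDate/TotalDue
-- can now be answered from the index alone, with no key/RID lookups.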

If SQL Server considers the index (or the subset of the index keys that the query would be seeking on) insufficiently selective, then it is very likely that the index will be ignored and the query executed as a clustered index (table) scan.

It is important to note that this does not just apply to the leading
column. There are scenarios where a very unselective column can be
used as the leading column, with the other columns in the index making
it selective enough to be used.
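
To illustrate that scenario with hypothetical names: Status alone has only a handful of distinct values, but combined with the other key columns the index identifies small row ranges, so a seek on all three columns can still be chosen.

create index ix_Orders_Status_Date_Customer
    on dbo.Orders (Status, OrderDate, CustomerID);
-- A seek on (Status, OrderDate, CustomerID) together is selective
-- even though the leading column Status is not.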

At what cardinality does SQL Server switch to an index scan (vs. seek)?

In SQL Server terms, this has been referred to as the tipping point; Kimberly Tripp's blog series is a good read on it: http://www.sqlskills.com/BLOGS/KIMBERLY/category/The-Tipping-Point.aspx

The tipping point is a guideline of 25%-33% of the total number of pages within the table, expressed as rows; e.g. 10,000 data pages would give a tipping point of 2,500-3,333 rows. As guidelines go this is pretty good, and as good as you will get. Remember that the query optimizer is a black box: whilst it will give you a query plan, it only tells you what it decided, not why.
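
The page count behind that arithmetic can be read from sys.dm_db_partition_stats. A sketch of the 25%-33% guideline applied to a hypothetical dbo.Orders table:

select object_name(object_id)        as table_name,
       in_row_data_page_count        as data_pages,
       in_row_data_page_count * 0.25 as tipping_point_low_rows,
       in_row_data_page_count * 0.33 as tipping_point_high_rows
from sys.dm_db_partition_stats
where object_id = object_id('dbo.Orders')  -- hypothetical table
  and index_id in (0, 1);                  -- heap or clustered index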

Actually tipping a covering index, though, is not very easy: even with 100% of the data being selected, a covering index will still favour a seek over a scan in the majority of cases.

That makes sense if you consider that the query optimizer doesn't assign any real cost to the index page hierarchy, and only costs the access to the leaf pages of the index. At that point, scanning or seeking 100% of a covering index is costed the same.

I found from my own experimentation (http://sqlfascination.com/2009/11/07/can-a-covering-nc-index-be-tipped ) that using a BETWEEN clause would cause it to scan, but other WHERE clauses would not; from what I could tell, it was purely down to the route taken through the query engine.

Why is this an Index Scan and not an Index Seek?

Bearing in mind that it may result in a query that performs badly as and when additional changes are made to it, using an INNER LOOP JOIN hint should force the covering index to be used on dbo.LocationCache.

SELECT TOP 100 a.LocationId, b.SearchQuery, b.SearchRank
FROM dbo.Locations a
INNER LOOP JOIN dbo.LocationCache b ON a.LocationId = b.LocationId
WHERE a.CountryId = 2
  AND a.Type = 7;

Using Index scan instead of seek with lookup

It looks like the query optimizer decides to scan the table instead of using an index based on the selectivity of the data.

It can actually be faster to scan the table directly than to seek via the index and then perform a key lookup. This may no longer hold once the table has more rows (say, more than 10,000). Here, 86 matching rows out of 200 is more than 40% of the table.

select a.content from Articles a where a.location = 'Japan, Tokyo';
-- clustered index scan

select a.location from Articles a where a.location = 'Japan, Tokyo';
-- covering index

Scans vs. Seeks

Thus, a seek is generally a more efficient strategy if we have a highly selective seek predicate; that is, if we have a seek predicate that eliminates a large fraction of the table.

SQL Server - Query performs Index Scan instead of Seek

An index on such a low-selectivity column (only two values, 0 and 1) is always going to be ignored; see the tipping point. One option is to make it the leftmost key in the clustered index, and make the primary key constraint on DocumentId a nonclustered index:

CREATE TABLE Content (
    DocumentId bigint,
    CategoryId bigint,
    Title nvarchar(255),
    AuthorUserId bigint,
    Body nvarchar(MAX),
    IsIndexed bit,
    CONSTRAINT pk_DocumentId PRIMARY KEY NONCLUSTERED (DocumentId)
);

create unique clustered index cdxContent on Content (IsIndexed, DocumentId);

Another option is to create a filtered covering index:

create unique index nonIndexedContent on Content (DocumentId)
    include (CategoryId, Title, AuthorUserId, Body)
    where IsIndexed = 0;

This second option could duplicate a lot of content, though. Personally, I would go with the first option.

Turn index scan to an index seek

Your query doesn't have a WHERE filtering clause; it asks for all records in the resultset of that big join.

The query planner would choose an index seek over an index scan if you wanted just one, or a very small number, of rows in your result set. An index (range) scan is a two-step process: first it seeks to the first row it needs (which could be the first row in the index), then it scans the index sequentially to get the rest of the rows it needs.

Scanning a covering index is an efficient way to get a multi-row resultset: the query can be satisfied from the index, so the query planner doesn't have to bounce back and forth between the index and the table data itself.

All is well here.

Why does SQL Server sometimes choose an index scan over a bookmark lookup?

The selectivity of CustomerID will play some part in the query optimiser's decision. If, on one hand, it was unique, then an equality operation will yield at most one result, so a SEEK/LOOKUP operation is almost guaranteed. If, on the other hand, potentially hundreds or thousands of records will match a value for CustomerID, then a clustered-index scan may seem more attractive.

You'd be surprised how selective a filter has to be to preclude a scan. I can't find the article I originally pulled this figure from, but if CustomerID 1234 will match as little as 4% of the records in the table, a scan on the clustered index may be more efficient, or at least appear that way to the optimiser (which doesn't get it right 100% of the time).

It sounds at least plausible that the statistics kept on the non-clustered index on CustomerID are causing the optimiser to toggle between seek and scan based on the selectivity criteria.

You may be able to coax the optimiser towards use of the index by introducing a JOIN or EXISTS operation:

-- Be aware: this approach is untested
select o.*
from Orders o
inner join Customers c on o.CustomerID = c.CustomerID
where c.CustomerID = 1234;

Or:

-- Be aware: this approach is untested
select o.*
from Orders o
where exists (select 1
              from Customers c
              where c.CustomerID = 1234
                and o.CustomerID = c.CustomerID);

Also be aware that with this EXISTS approach, if you don't have an index on the "join" predicate (in this case, the CustomerID column) in both tables, then you will end up with a nested loop, which is painfully slow. Using inner joins seems much safer, but the EXISTS approach has its place from time to time when it can exploit indexes.

These are just suggestions; I can't say if they will be effective or not. Just something to try, or for a resident expert to confirm or deny.


