What's the Difference Between a Table Scan and a Clustered Index Scan

What's the difference between a Table Scan and a Clustered Index Scan?

In a table without a clustered index (a heap table), data pages are not linked together - so traversing pages requires a lookup into the Index Allocation Map.

A clustered table, however, has its data pages linked in a doubly linked list - making sequential scans a bit faster. Of course, in exchange, you have the overhead of keeping the data pages in order on INSERT, UPDATE, and DELETE. A heap table, however, requires a second write to the IAM.

If your query has a RANGE operator (e.g.: SELECT * FROM TABLE WHERE Id BETWEEN 1 AND 100), then a clustered table (being in a guaranteed order) would be more efficient - as it could use the index pages to find the relevant data page(s). A heap would have to scan all rows, since it cannot rely on ordering.

And, of course, a clustered index lets you do a CLUSTERED INDEX SEEK, which is pretty much optimal for performance...a heap with no indexes would always result in a table scan.
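To make the range-query point concrete, here is a minimal Python sketch. It is a toy model, not SQL Server internals: it assumes a heap is simply an unordered list of Ids and a clustered table is the same Ids kept sorted. The function names `heap_range` and `clustered_range` are made up for this illustration.

```python
import bisect
import random

def heap_range(rows, lo, hi):
    """Unordered rows: every row has to be examined."""
    examined = 0
    hits = []
    for row_id in rows:
        examined += 1
        if lo <= row_id <= hi:
            hits.append(row_id)
    return sorted(hits), examined

def clustered_range(sorted_rows, lo, hi):
    """Rows sorted by key: binary-search both bounds, then read only that slice.
    The count returned is rows read; the few steps of the binary search
    (standing in for index-page reads) are ignored in this model."""
    start = bisect.bisect_left(sorted_rows, lo)
    stop = bisect.bisect_right(sorted_rows, hi)
    hits = sorted_rows[start:stop]
    return hits, len(hits)

random.seed(0)
clustered = list(range(1, 10001))   # Ids 1..10000 in key order
heap = clustered[:]
random.shuffle(heap)                # same rows, no guaranteed order

heap_hits, heap_examined = heap_range(heap, 1, 100)
ci_hits, ci_examined = clustered_range(clustered, 1, 100)
print(heap_examined, ci_examined)
```

Both functions return the same rows for BETWEEN 1 AND 100; the difference is only in how many rows each one has to touch to find them.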

So:

  • For your example query where you select all rows, the only difference is the doubly linked list a clustered index maintains. This should make your clustered table just a tiny bit faster than a heap with a large number of rows.

  • For a query with a WHERE clause that can be (at least partially) satisfied by the clustered index, you'll come out ahead because of the ordering - so you won't have to scan the entire table.

  • For a query that is not satisfied by the clustered index, you're pretty much even...again, the only difference being that doubly linked list for sequential scanning. In either case, you're suboptimal.

  • For INSERT, UPDATE, and DELETE a heap may or may not win. The heap doesn't have to maintain order, but does require a second write to the IAM. I think the relative performance difference would be negligible, but also pretty data dependent.

Microsoft has a whitepaper which compares a clustered index to an equivalent non-clustered index on a heap (not exactly the same as I discussed above, but close). Their conclusion is basically to put a clustered index on all tables. I'll do my best to summarize their results (again, note that they're really comparing a non-clustered index to a clustered index here - but I think it's relatively comparable):

  • INSERT performance: clustered index wins by about 3% due to the second write needed for a heap.
  • UPDATE performance: clustered index wins by about 8% due to the second lookup needed for a heap.
  • DELETE performance: clustered index wins by about 18% due to the second lookup needed and the second delete needed from the IAM for a heap.
  • single SELECT performance: clustered index wins by about 16% due to the second lookup needed for a heap.
  • range SELECT performance: clustered index wins by about 29% due to the random ordering for a heap.
  • concurrent INSERT: heap table wins by 30% under load due to page splits for the clustered index.

What's the difference between table scanning a clustered table vs. index scanning?

THE EXPLANATION

Clustered indexes are logically ordered but not physically ordered.

This means that a table scan, if it's done in physical order, may return rows in a different order than a clustered index scan, which follows the logical order.

This logical-physical mapping is controlled by OAM (Object Allocation Map)

Why/when/how is whole clustered index scan chosen rather than full table scan?

Please read my answer under "No direct access to data row in clustered table - why?", first.

"the leaf of clustered index contains the real table row, so full clustered index, with intermediate leaves, contain much more data than the full table(?)"

See, you are mixing up "Table" with storage structures. In the context of your question, e.g. thinking about the size of the CI as opposed to the "table", you must think about the CI minus the leaf level (which is the data row). The CI, index portion only, is tiny. The intermediate levels (like any B-Tree) contain partial (not full) key entries; they exclude the lowest level, which is the full key entry, which sits in the row itself and is not duplicated.

The table (full CI) may be 100GB. The CI only may be 10MB. There is an awful lot that can be determined from the 10MB without having to go to the 100GB.

For understanding: the equivalent NCI on the same table (CI) may be 22MB; the equivalent NCI on the same table if you removed the CI may be 21.5MB (assuming the CI key is reasonable, not fat wide).

"Why/when/how is ever whole clustered index scan chosen over the full table scan?"

Quite often. Again, the context is that we are talking about the CI-minus-Leaf levels. For queries that use only the columns in the CI, the presence of those columns in the CI (any index actually) allows the query to be a "covered query", which means it can be serviced wholly from the index, with no need to go to the data rows. Think range scans on partial keys: BETWEEN x AND y; x <= y; etc.

(There is always the chance that the optimiser will choose a table scan when you think it should choose an index scan, but that is a different story.)

"I still do not understand how/why clustered index full scan can be "better" over full table scan."

(The terms used by MS are less precise than my answers here.) For any query that can be answered from the 10MB CI, I would much rather churn 10MB through the data cache, than 100GB. For the same queries, bounded by a range on the CI key, that's a fraction of the 10MB.

For queries that require a "full table scan", well yes, you must read all the Leaf pages of the CI, which is the 100GB.
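The size gap between the leaf level and the levels above it can be estimated with a little arithmetic. The Python sketch below is only a back-of-the-envelope model: the 8 KiB page size and the fanout of 500 entries per non-leaf page are assumptions of this example, not figures from the answer, and the result changes a lot with key width. It will not reproduce the answer's illustrative 10MB; the point is only the ratio.

```python
import math

def upper_level_pages(leaf_pages, fanout):
    """Pages in each non-leaf level, from just above the leaves up to the root."""
    levels = []
    pages = leaf_pages
    while pages > 1:
        pages = math.ceil(pages / fanout)   # one index entry per child page
        levels.append(pages)
    return levels

PAGE_SIZE = 8 * 1024                        # assumed page size (8 KiB)
FANOUT = 500                                # assumed entries per non-leaf page
leaf_pages = 100 * 1024**3 // PAGE_SIZE     # 100 GiB worth of leaf (data) pages

levels = upper_level_pages(leaf_pages, FANOUT)
upper_pages = sum(levels)
print("pages per upper level:", levels)
print("upper levels total MiB:", upper_pages * PAGE_SIZE / 1024**2)
print("fraction of leaf level:", upper_pages / leaf_pages)
```

Under these assumptions the non-leaf levels come out at a fraction of one percent of the leaf level, which is why a query that can be answered without touching the leaves moves far less data through the cache.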

Heap vs Clustered index full table scan

You asked about MySQL, and that generally means the InnoDB storage engine, which is the default.

InnoDB does not store tables as a heap.

InnoDB tables are always stored as a clustered index, where the clustered index is the primary key. A table-scan is therefore more or less equivalent to an index-scan of the clustered index.

An index in InnoDB is not usually stored sequentially on disk. It's stored as a collection of pages, where pages have a uniform size (16KB by default). The index is obviously much larger than this, and over time insertions and updates expand parts of the index in the middle as well as at the end. To do this efficiently (that is, without needing to rewrite the whole table), random insertions and updates result in the pages being out of order. New pages are placed wherever there is room in the file.

To facilitate scanning through all the pages, each page contains links to the location of the next page and the preceding page. These may be quite far away in the file, so a table-scan will not actually be sequential, it will involve many seeks to other locations in the file.

InnoDB requires that pages are loaded into RAM before it can actually use them in queries. The InnoDB buffer pool is a fixed-size allocation of RAM, which contains a set of pages loaded from disk. Once the pages are in the buffer pool, they can be accessed very quickly, and with virtually no overhead for following links. The overhead of reading a page from disk into the buffer pool is orders of magnitude greater than reading a page once it is in RAM.

So in the case of MySQL:

  • There is no heap
  • Sequential order by clustered index has nothing to do with sequential storage on disk
  • Reads are made to pages in RAM anyway, so the physical layout on disk has little to do with the order pages will be read
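The gap between logical order and file order described above can be pictured with a tiny Python model. This is only an illustration: the page names, keys, and offsets are invented, and real InnoDB pages hold far more than this. Each page carries a link to the next page in key order, while its offset in the file is wherever space happened to be free.

```python
# Each page: the keys it holds, its byte offset in the file, and the next page in key order.
# Offsets are deliberately not in key order, as they might be after many page splits.
pages = {
    "A": {"keys": [1, 2, 3], "offset": 0,     "next": "B"},
    "B": {"keys": [4, 5, 6], "offset": 49152, "next": "C"},
    "C": {"keys": [7, 8, 9], "offset": 16384, "next": "D"},
    "D": {"keys": [10, 11],  "offset": 32768, "next": None},
}

def scan(pages, first):
    """Follow next-page links from the first leaf; collect keys and the offsets visited."""
    keys, offsets = [], []
    page_id = first
    while page_id is not None:
        page = pages[page_id]
        keys.extend(page["keys"])
        offsets.append(page["offset"])
        page_id = page["next"]
    return keys, offsets

keys, offsets = scan(pages, "A")
print(keys)      # keys come back in index order
print(offsets)   # the file positions visited jump around
```

Once all four pages are in memory, following the links costs almost nothing, which is the point of the last bullet.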

What Clustered Index Scan (Clustered) means on SQL Server execution plan?

I would appreciate any explanations to "Clustered Index Scan (Clustered)"

I will try to put this in the simplest way. To follow it, you need to understand both index seek and index scan.

So let's build the table:

USE tempdb
GO

-- Demo table: id will become the clustered index key
CREATE TABLE scanseek (id int, name varchar(50) DEFAULT ('some random names'))

CREATE CLUSTERED INDEX IX_ID_scanseek ON scanseek(id)

-- Load 5000 rows: ids 0..4999, names 'Name0'..'Name4999'
DECLARE @i int
SET @i = 0
WHILE (@i < 5000)
BEGIN
    INSERT INTO scanseek
    SELECT @i, 'Name' + CONVERT(varchar(5), @i)
    SET @i = @i + 1
END

An index seek is where SQL Server uses the b-tree structure of the index to seek directly to matching records.


You can check your table's root and leaf nodes using the DMV below:

-- Check index levels (root, intermediate, leaf)
SELECT
    index_level,
    record_count,
    page_count,
    avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID('tempdb'), OBJECT_ID('scanseek'), NULL, NULL, 'DETAILED')
GO

Now we have a clustered index on column "ID".

Let's look for some directly matching records:

select * from scanseek where id =340

and look at the Execution plan

[Image: execution plan showing a Clustered Index Seek]

You requested rows directly by key in the query, which is why you got a clustered index SEEK.

Clustered index scan: SQL Server reads through the rows from top to bottom in the clustered index, for example when searching on a non-key column. In our table, name is a non-key column, so if we search for data in the name column we will see a clustered index scan, because all the rows sit at the clustered index leaf level.

Example

select * from scanseek where name = 'Name340'

[Image: execution plan showing a Clustered Index Scan]
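If you don't have a SQL Server instance at hand, the same seek-versus-scan contrast can be reproduced in miniature with Python's built-in sqlite3 module. This is an analogy, not SQL Server: in SQLite a table with an INTEGER PRIMARY KEY is stored as a B-tree keyed on that column, which plays the role of the clustered index here, and EXPLAIN QUERY PLAN plays the role of the execution plan. The helper name `plan` is made up for this sketch, and the exact plan wording varies by SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The table's B-tree is keyed on id, so id lookups can descend the tree directly.
conn.execute("CREATE TABLE scanseek (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO scanseek (id, name) VALUES (?, ?)",
    [(i, "Name" + str(i)) for i in range(5000)],
)

def plan(sql):
    """Return the detail text of EXPLAIN QUERY PLAN (the last column of each row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

seek_plan = plan("SELECT * FROM scanseek WHERE id = 340")          # predicate on the key
scan_plan = plan("SELECT * FROM scanseek WHERE name = 'Name340'")  # predicate on a non-key column
print(seek_plan)
print(scan_plan)
```

The first plan reports a SEARCH on the primary key (the seek), while the second reports a SCAN of the whole table, because nothing orders the rows by name.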

please note: I made this answer short for better understanding only, if you have any question or suggestion please comment below.

What is best among clustered index scan vs non-clustered index seek

First of all, there is no 'best' operator. Sometimes reading more data is more efficient than reading some data and then massaging it to get our results. 'Best', like almost everything, is relative.

Let's try to understand what happened in the comments...

The query

select 
min(CampaignID),
max(CampaignID)
from Campaign
where datecreated < dateadd(day, -90, getutcdate())

Which says:

I want the first and the last ID (min/max) of any record where the date is less than a constant date.

Clustered

The first query, without the index/index hint, did what SQL Server thought was cheaper than reading any index, even if it requires more IO (disk usage). This is because finding the minimum and maximum while validating the records in the table is cheaper than selecting half of the table and then reordering/aggregating the rows to find the exact same info.

The clustered index stores all data on disk and is logically ordered by the key columns, in this case CampaignID (I assume). This means that finding the minimum and maximum ID is easy. The minimum is the first ID which matches the criteria, so let's check each ID from the first one and stop once we find a record where the date matches (this will most probably be the first one). The maximum is the first record matching the condition from the end of the index.

Index with the date as key

CREATE NONCLUSTERED INDEX [NCIX] 
ON [dbo].[Campaign](DateCreated)
INCLUDE (Campaignid)

With the first index (date as the key column), SQL Server can use the date to filter the data, true, but it did not help in sorting. It still has to check every record in that index and figure out the minimum and maximum from a possibly unordered set of values.

Index with the ID as key

CREATE NONCLUSTERED INDEX [NCIX] 
ON [dbo].[Campaign](Campaignid)
INCLUDE (DateCreated)

With the second index, where the ID is the key column, SQL Server can use the same trick as with the clustered key. The only difference is that there is no other data to read but the ID and the date, which is much smaller than the whole record would be, so it fits in fewer pages and requires less IO.

SQL Server will most probably choose the second index even if there is no index hint.

How the second index works (approximation by query)

You can get the minimum Campaignid by

SELECT TOP(1)
Campaignid
FROM
[dbo].[Campaign]
WHERE
datecreated < dateadd(day, -90, getutcdate())
ORDER BY
Campaignid ASC

and the maximum with a very similar query

SELECT TOP(1)
Campaignid
FROM
[dbo].[Campaign]
WHERE
datecreated < dateadd(day, -90, getutcdate())
ORDER BY
Campaignid DESC

If you cross join them as subqueries, you pretty much got what the execution plan describes.
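The early-stop behaviour behind those two TOP(1) queries can be sketched in Python. This is a toy model under stated assumptions: the rows, the cutoff date, and the helper `first_match` are invented for illustration, and IDs are assumed to be assigned in creation order.

```python
from datetime import date, timedelta

# Hypothetical rows: (campaign_id, date_created), already ordered by campaign_id.
start = date(2024, 1, 1)
rows = [(i, start + timedelta(days=i)) for i in range(1, 366)]
cutoff = date(2024, 10, 1)   # stands in for dateadd(day, -90, getutcdate())

def first_match(ordered_rows, cutoff):
    """Walk rows in the given order and stop at the first one created before cutoff."""
    examined = 0
    for campaign_id, created in ordered_rows:
        examined += 1
        if created < cutoff:
            return campaign_id, examined
    return None, examined

# ID-ordered index: walk from the front for the minimum, from the back for the maximum.
min_id, min_examined = first_match(rows, cutoff)
max_id, max_examined = first_match(reversed(rows), cutoff)

# Date-keyed index: filter first, then aggregate a set of IDs with no useful order.
matching = [cid for cid, created in rows if created < cutoff]
print(min_id, max_id, min_examined, max_examined, len(matching))
```

In this model the minimum is found on the first row, and the maximum only has to skip the recent rows that fail the date filter, whereas the date-keyed approach touches every matching row before it can aggregate.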

Notes

Here I would add a note: optimizing for only one query is not always the best tactic. You can't optimize for everything. If this query runs once a day/week/quarter, that 14-15 second runtime with the clustered key will most probably do no harm. If the index does not help other queries, I would not create it unless this is a mission-critical query.
