What's the difference between a Table Scan and a Clustered Index Scan?
In a table without a clustered index (a heap table), data pages are not linked together - so traversing pages requires a lookup into the Index Allocation Map.
A clustered table, however, has it's data pages linked in a doubly linked list - making sequential scans a bit faster. Of course, in exchange, you have the overhead of dealing with keeping the data pages in order on INSERT
, UPDATE
, and DELETE
. A heap table, however, requires a second write to the IAM.
If your query has a RANGE
operator (e.g.: SELECT * FROM TABLE WHERE Id BETWEEN 1 AND 100
), then a clustered table (being in a guaranteed order) would be more efficient - as it could use the index pages to find the relevant data page(s). A heap would have to scan all rows, since it cannot rely on ordering.
And, of course, a clustered index lets you do a CLUSTERED INDEX SEEK, which is pretty much optimal for performance...a heap with no indexes would always result in a table scan.
So:
For your example query where you select all rows, the only difference is the doubly linked list a clustered index maintains. This should make your clustered table just a tiny bit faster than a heap with a large number of rows.
For a query with a
WHERE
clause that can be (at least partially) satisfied by the clustered index, you'll come out ahead because of the ordering - so you won't have to scan the entire table.For a query that is not satisified by the clustered index, you're pretty much even...again, the only difference being that doubly linked list for sequential scanning. In either case, you're suboptimal.
For
INSERT
,UPDATE
, andDELETE
a heap may or may not win. The heap doesn't have to maintain order, but does require a second write to the IAM. I think the relative performance difference would be negligible, but also pretty data dependent.
Microsoft has a whitepaper which compares a clustered index to an equivalent non-clustered index on a heap (not exactly the same as I discussed above, but close). Their conclusion is basically to put a clustered index on all tables. I'll do my best to summarize their results (again, note that they're really comparing a non-clustered index to a clustered index here - but I think it's relatively comparable):
INSERT
performance: clustered index wins by about 3% due to the second write needed for a heap.UPDATE
performance: clustered index wins by about 8% due to the second lookup needed for a heap.DELETE
performance: clustered index wins by about 18% due to the second lookup needed and the second delete needed from the IAM for a heap.- single
SELECT
performance: clustered index wins by about 16% due to the second lookup needed for a heap. - range
SELECT
performance: clustered index wins by about 29% due to the random ordering for a heap. - concurrent
INSERT
: heap table wins by 30% under load due to page splits for the clustered index.
Whats the difference between table scanning a clustered table, vs index scanning
THE EXPLANATION
Clustered indexes are logically ordered but not physically ordered.
This means that a table scan if it's done in physical order will return different results than clustered index scan, which is sorted logically.
This logical-physical mapping is controlled by OAM (Object Allocation Map)
Why/when/how is whole clustered index scan chosen rather than full table scan?
Please read my answer under "No direct access to data row in clustered table - why?", first.
"the leaf of clustered index contains the real table row, so full clustered index, with intermediate leaves, contain much more data than the full table(?)"
See you are mixing up "Table" with storage structures. In the context of your question, eg. thinking about the size of the CI as opposed to the "table", well then you must think about the CI minus the leaf level (which is the data row). The CI, index portion only, is tiny. The intermediate levels (like any B-Tree) contain partial (not full) key entries; it excludes the lowest level, which is the full key entry, which sits in the row itself, and is not duplicated.
The table (full CI) may be 10GB. The CI only may be 10MB. There is an awful lot that can be determined from the 10MB without having to go to the 100GB.
For understanding: the equivalent NCI on the same table (CI) may be 22MB; the equivalent NCI on the same table if you removed the CI may be 21.5MB (assuming the CI key is reasonable, not fat wide).
"Why/when/how is ever whole clustered index scan chosen over the full table scan?"
Quite often. Again the context is, we are talking about the CI-minus-Leaf levels. For queries that use only the columns in the CI, the presence of those columns in the CI (any index actually) allow the query to be a "covered query", which means it can by serviced wholly from the index, no need to go to the data rows. Think range scans on partial keys: BETWEEN x AND yY; x <= y; etc.
(There is always the chance that the optimiser will choose a table scan, when you think it should choose an index scan, bu t that is a different story.)
"I still do not understand how/why clustered index full scan can be "better" over full table scan."
(The terms used by MS are less precise than my answers here.) For any query that can be answered from the 10MB CI, I would much rather churn 10MB through the data cache, than 100GB. For the same queries, bounded by a range on the CI key, that's a fraction of the 10MB.
For queries that requires a "full table scan", well yes, you must read all the Leaf pages of the CI, which is the 100GB.
Heap vs Clustered index full table scan
You asked about MySQL, and that generally means the InnoDB storage engine, which is the default.
InnoDB does not store tables as a heap.
InnoDB tables are always stored as a clustered index, where the clustered index is the primary key. A table-scan is therefore more or less equivalent to an index-scan of the clustered index.
Any index in InnoDB is not usually stored sequentially on disk. It's stored as a collection of pages, where page have a uniform size of 16KB. The index is obviously much larger than this, and over time insertions and updates expand parts of the index in the middle as well as at the end. To do this efficiently (that is, without needing to rewrite the whole table), random insertions and updates result in the pages being out of order. New pages created are placed wherever there is room in the file.
To facilitate scanning through all the pages, each page contains links to the location of the next page and the preceding page. These may be quite far away in the file, so a table-scan will not actually be sequential, it will involve many seeks to other locations in the file.
InnoDB requires that pages are loaded into RAM before it can actually use them in queries. The InnoDB buffer pool is a fixed-size allocation of RAM, which contains a set of pages loaded from disk. Once the pages are in the buffer pool, they can be accessed very quickly, and with virtually no overhead for following links. The overhead of reading a page from disk into the buffer pool is orders of magnitude much greater than reading a page once it is in RAM.
So in the case of MySQL:
- There is no heap
- Sequential order by clustered index has nothing to do with sequential storage on disk
- Reads are made to pages in RAM anyway, so the physical layout on disk has little to do with the order pages will be read
What Clustered Index Scan (Clustered) means on SQL Server execution plan?
I would appreciate any explanations to "Clustered Index Scan
(Clustered)"
I will try to put in the easiest manner, for better understanding you need to understand both index seek and scan.
SO lets build the table
use tempdb GO
create table scanseek (id int , name varchar(50) default ('some random names') )
create clustered index IX_ID_scanseek on scanseek(ID)
declare @i int
SET @i = 0
while (@i <5000)
begin
insert into scanseek
select @i, 'Name' + convert( varchar(5) ,@i)
set @i =@i+1
END
An index seek is where SQL server uses the b-tree structure of the index to seek directly to matching records
you can check your table root and leaf nodes using the DMV below
-- check index level
SELECT
index_level
,record_count
,page_count
,avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID('tempdb'),OBJECT_ID('scanseek'),NULL,NULL,'DETAILED')
GO
Now here we have clustered index on column "ID"
lets look for some direct matching records
select * from scanseek where id =340
and look at the Execution plan
you've requested rows directly in the query that's why you got a clustered index SEEK .
Clustered index scan: When Sql server reads through for the Row(s) from top to bottom in the clustered index.
for example searching data in non key column. In our table NAME is non key column so if we will search some data in the name column we will see clustered index scan
because all the rows are in clustered index leaf level.
Example
select * from scanseek where name = 'Name340'
please note: I made this answer short for better understanding only, if you have any question or suggestion please comment below.
What is best among clustered index scan vs non-clustered index seek
First of all, there is no 'best' operator. Sometimes reading more data is more efficient than reading some data and massage them to get our results. 'Best' as almost everything is relative.
Lets try to understand what happened in the comments...
The query
select
min(CampaignID),
max(CampaignID)
from Campaign
where datecreated < dateadd(day, -90, getutcdate())
Which says:
I want the first and the last ID (min/max) of any record where the date is less than a constant date.
Clustered
The first query without the index/index hint did what SQL Server thought is cheaper than reading any index even if it requires more IO (disk usage). This is because finding the minimum and maximum while validating the records in the table is cheaper than selecting half of the table, then reordering/aggregating them find the exact same info.
The clustered index stores all data on disk and is logically ordered by the key columns, in this case CampaignID (I assume). This means, that to find the minimum and maximum ID is easy: The minimum is the first ID which matches the criteria -> lets check each ID from the first one and stop once we find a record where the date is in place (this will most probably be the first one). The maximum is the first record matching the condition from the end of the index.
Index with the date as key
CREATE NONCLUSTERED INDEX [NCIX]
ON [dbo].[Campaign](DateCreated)
INCLUDE (Campaignid)
With the first index (date as the key column), SQL Server can use the date to filter the data, true, but it did not help in sorting. It still has to check every record in that index and figure out the minimum and maximum from a possibly unordered set of values.
Index with the ID as key
CREATE NONCLUSTERED INDEX [NCIX]
ON [dbo].[Campaign](Campaignid)
INCLUDE (DateCreated)
With the second index where the ID was the key column, SQL Server can use the same trick as with the clustered key. The only difference is that there is no other data to read, but the ID and the date, which is much smaller than the whole record would be, therefore it can fit in less pages and requires less IO.
SQL Server will most probably choose the second index even if there is no index hint.
How the second index works (approximation by query)
You can get the minimum Campaignid by
SELECT TOP(1)
Campaignid
FROM
[dbo].[Campaign]
WHERE
datecreated < dateadd(day, -90, getutcdate())
ORDER BY
Campaignid ASC
and the maximum with a very similar query
SELECT TOP(1)
Campaignid
FROM
[dbo].[Campaign]
WHERE
datecreated < dateadd(day, -90, getutcdate())
ORDER BY
Campaignid DESC
If you cross join them as subqueries, you pretty much got what the execution plan describes.
Notes
Here I would add a note: optimizing for only one query is not always the best tactic. You can't optimize for everything, if this query runs once a day/week/quarter, that 14-15 seconds runtime with the clustered key will most probably do no harm. If the index does not help other queries, I would not create it, unless it is a mission critical query.
Related Topics
Get Everything After and Before Certain Character in SQL Server
SQL Server Unique Composite Key of Two Field with Second Field Auto-Increment
Insert Dates in the Return from a Query Where There Is None
Postgresql 9.3: Dynamic Pivot Table
Update Multiple Columns in SQL
SQL Query to Show Gaps Between Multiple Date Ranges
How to Select Records Without Duplicate on Just One Field in SQL
Find SQL Records Containing Similar Strings
Postgresql Column 'Foo' Does Not Exist
Pseudo_Encrypt() Function in Plpgsql That Takes Bigint
SQL Server for Xml Path Make Repeating Nodes
SQL Server Search Using Like While Ignoring Blank Spaces