Improving Performance of Cluster Index Guid Primary Key

Improving performance of cluster index GUID primary key

You need to use newsequentialid() instead see here Some Simple Code To Show The Difference Between Newid And Newsequentialid

GUID primary key, separate clustered index column

There are a few things to consider:

  1. Yes you're correct, the clustered index keys will be present in all nonclustered indexes. Having a smaller key will help with space savings on disk and in the buffer pool.

  2. Having a clustered key of an identity will give you end of the table inserts and potentially (depending on load) make that an insertion hotspot. Where the GUIDS right now are random insert and will not give so much of a hotspot but will cause more page splits which may also adversely affect performance.

  3. To answer the question of improving performance, what is your current problem area? Have any data that we can go off of? If you don't have any problems now, it may not be worth the changes.

  4. When you add the column as an Identity it should seed itself and the order really shouldn't matter.

  5. If you do use an INT column for the key, create a unique non-clustered index on the GUID column to let the optimizer know there will only be a single value (optimization) and to allow for a quick seek. Make it covering if it isn't too expensive.

What are the best practices for using a GUID as a primary key, specifically regarding performance?

GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.

You really need to keep two issues apart:

  1. the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.

  2. the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.

By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based Primary / Clustered Key into two separate key - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.

As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.

Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.

Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.

Quick calculation - using INT vs. GUID as Primary and Clustering Key:

  • Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
  • 6 nonclustered indexes (22.89 MB vs. 91.55 MB)

TOTAL: 25 MB vs. 106 MB - and that's just on a single table!

Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.

  • GUIDs as PRIMARY KEY and/or clustered key
  • The clustered index debate continues
  • Ever-increasing clustering key - the Clustered Index Debate..........again!
  • Disk space is cheap - that's not the point!

PS: of course, if you're dealing with just a few hundred or a few thousand rows - most of these arguments won't really have much of an impact on you. However: if you get into the tens or hundreds of thousands of rows, or you start counting in millions - then those points become very crucial and very important to understand.

Update: if you want to have your PKGUID column as your primary key (but not your clustering key), and another column MYINT (INT IDENTITY) as your clustering key - use this:

CREATE TABLE dbo.MyTable
(PKGUID UNIQUEIDENTIFIER NOT NULL,
MyINT INT IDENTITY(1,1) NOT NULL,
.... add more columns as needed ...... )

ALTER TABLE dbo.MyTable
ADD CONSTRAINT PK_MyTable
PRIMARY KEY NONCLUSTERED (PKGUID)

CREATE UNIQUE CLUSTERED INDEX CIX_MyTable ON dbo.MyTable(MyINT)

Basically: you just have to explicitly tell the PRIMARY KEY constraint that it's NONCLUSTERED (otherwise it's created as your clustered index, by default) - and then you create a second index that's defined as CLUSTERED

This will work - and it's a valid option if you have an existing system that needs to be "re-engineered" for performance. For a new system, if you start from scratch, and you're not in a replication scenario, then I'd always pick ID INT IDENTITY(1,1) as my clustered primary key - much more efficient than anything else!

Searching table by Guid faster when the Guid is the clustered index?

Assuming MS SQL Server. This may or may not apply to other RDBMSs:

If you have a clustered index then it will be fastest, although if you're searching for a single row then the difference between that and a non-clustered index will be negligible. When you use a non-clustered index the server needs to first find the right value in the index and then go fetch the full record from the table storage. The table storage is the clustered index, so searching by a clustered index eliminates that step (called a Bookmark Lookup), but that step is almost imperceptible for a single row.

Clustered indexes tend to provide a bigger advantage for reading when they are on a column that is selected by range (for example, transaction date and you want to find all transactions for the past month). In that case the server can find the start and just read off the data in one quick, sequential sweep.

Having a non-clustered index on an INT (all other things being equal) will be slightly faster than using a GUID because the index itself will be smaller (because INTs are much smaller than GUIDs) which means that the server has to traverse fewer pages to find the value that it's looking to get. In the case of a clustered index I don't think that you'll see much of a difference if your row sizes are already large compared to the difference between a GUID and an INT, but I haven't done any testing on that.

Should I get rid of clustered indexes on Guid columns

A big reason for a clustered index is when you often want to retrieve rows for a range of values for a given column. Because the data is physically arranged in that order, the rows can be extracted very efficiently.

Something like a GUID, while excellent for a primary key, could be positively detrimental to performance, as there will be additional cost for inserts and no perceptible benefit on selects.

So yes, don't cluster an index on GUID.

As to why it's not offered as a recommendation, I'd suggest the tuner is aware of this fact.

Guid as PK low performance

GUID Data type is most probably one of the worst candidates for a Primary key column for two main reasons.

  1. It is a random value, column values being random means SQL Server will have to insert new rows somewhere in between the existing rows, which leads to a lot of page splits and fragmented indexes.

  2. It being a 16 bytes data, 4 x times bigger than Integer data type means SQL Server will be processing 4 times more data just to do the same operation if it were deal with an integer value.

  3. the only good thing about GUID is , it is Unique , but is it really worth paying the price you are paying for having a GUID column? I will let you decide this :)

Stick to Integer as your primary key column, make it Identity for auto-generated values.

How does the indexing of GUID keys and int keys in SQL Server work?

The confusion stems from the multiple meanings of the word "key" in the answer; to wit, the answer states:

if you want to have your PKGUID column as your primary key (but not
your clustering key), and another column MYINT (INT IDENTITY) as your
clustering key...

I would change it to say (changes bolded):

if you want to have your PKGUID column as your primary key (but not
your clustered index), and another column MYINT (INT IDENTITY) as your
clustered index...

The point is that by default, a PRIMARY KEY is buit on a clustered index (unless you specify otherwise); clustered indexes are then included in every other index, which, in the case of a GUID as a clustered PK, can be a significant performance bottleneck. The code you posted is a compromise; it satisfies the "need" to have a GUID for a Primary Key, while clustering on a smaller column value (which can lead to a performance boost).

It's not ideal, but it can be a very useful method. If you'd like to read up more on the differences between keys and indexes, here are some useful links:

What is the difference between a primary key and a index key
https://itknowledgeexchange.techtarget.com/sql-server/difference-between-an-index-and-a-primary-key/
When should I use primary key or index?

Should primary keys be always assigned as clustered index

The ideal clustered index key is:

  1. Sequential
  2. Selective (no dupes, unique for each record)
  3. Narrow
  4. Used in Queries

In general it is a very bad idea to use a GUID as a clustered index key, since it leads to mucho fragmentation as rows are added.

EDIT FOR CLARITY:

PK and Clustered key are indeed separate concepts. Your PK does not need to be your clustered index key.

In practical applications in my own experience, the same field that is your PK should/would be your clustered key since it meets the same criteria listed above.

Clustered primary key on unique identifier ID column in SQL Server

GUIDs as they are are terrible for performance since they are effectively random values (this "breaks" clustered index), and they are awful for indexes, since less entries fit on a single page/extent (SQL Server terms). SQL Server 2005 introduces newsequentialid() which helps solving first problem.



Related Topics



Leave a reply



Submit