Is It a Bad Idea to Use Guids as Primary Keys in Ms SQL

Is it a bad idea to use GUIDs as primary keys in MS SQL?

There are pros and cons:

This article covers everything.

GUID Pros

  • Unique across every table, every database, every server
  • Allows easy merging of records from different databases
  • Allows easy distribution of databases across multiple servers
  • You can generate IDs anywhere, instead of having to roundtrip to the database
  • Most replication scenarios require GUID columns anyway

GUID Cons

  • It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
  • Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
  • The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes

What are the best practices for using a GUID as a primary key, specifically regarding performance?

GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.

You really need to keep two issues apart:

  1. the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.

  2. the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.

By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based Primary / Clustered Key into two separate key - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.

As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.

Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.

Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.

Quick calculation - using INT vs. GUID as Primary and Clustering Key:

  • Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
  • 6 nonclustered indexes (22.89 MB vs. 91.55 MB)

TOTAL: 25 MB vs. 106 MB - and that's just on a single table!

Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.

  • GUIDs as PRIMARY KEY and/or clustered key
  • The clustered index debate continues
  • Ever-increasing clustering key - the Clustered Index Debate..........again!
  • Disk space is cheap - that's not the point!

PS: of course, if you're dealing with just a few hundred or a few thousand rows - most of these arguments won't really have much of an impact on you. However: if you get into the tens or hundreds of thousands of rows, or you start counting in millions - then those points become very crucial and very important to understand.

Update: if you want to have your PKGUID column as your primary key (but not your clustering key), and another column MYINT (INT IDENTITY) as your clustering key - use this:

CREATE TABLE dbo.MyTable
(PKGUID UNIQUEIDENTIFIER NOT NULL,
MyINT INT IDENTITY(1,1) NOT NULL,
.... add more columns as needed ...... )

ALTER TABLE dbo.MyTable
ADD CONSTRAINT PK_MyTable
PRIMARY KEY NONCLUSTERED (PKGUID)

CREATE UNIQUE CLUSTERED INDEX CIX_MyTable ON dbo.MyTable(MyINT)

Basically: you just have to explicitly tell the PRIMARY KEY constraint that it's NONCLUSTERED (otherwise it's created as your clustered index, by default) - and then you create a second index that's defined as CLUSTERED

This will work - and it's a valid option if you have an existing system that needs to be "re-engineered" for performance. For a new system, if you start from scratch, and you're not in a replication scenario, then I'd always pick ID INT IDENTITY(1,1) as my clustered primary key - much more efficient than anything else!

When would you use GUIDs as primary keys?

You would use guids as a key if you needed multiple databases synchronising via replication.

Another reason to use guids is if you wanted to create rows on some remote client eg a winforms app and then submit those to the server via web services etc.

If you do this I would strongly suggest that you make sure that you specify your own clustered index based on an auto incrementing int that is not unique. It can be a considerable overhead inserting rows into a table where the clustered index is a guid.

Update: Here is an example of how to set up a table like this:

CREATE TABLE [dbo].[myTable](
[intId] [int] IDENTITY(1,1) NOT NULL,
[realGuidId] [uniqueidentifier] NOT NULL,
[someData] [varchar](50) NULL,
CONSTRAINT [PK_myTable] UNIQUE NONCLUSTERED
(
[realGuidId] ASC
)
)

CREATE CLUSTERED INDEX [IX_myTable] ON [dbo].[myTable]
(
[intId] ASC
)

You would insert into the table as normal e.g.:

INSERT INTO myTable VALUES(NEWID(), 'Some useful data goes here')

Update: I listened to a really good dotnetrocks episode that talks about this its worth a listen - Show #447

What are the reasons *not* to use a GUID for a primary key?

Jeff Atwood talks about this in great detail:

http://www.codinghorror.com/blog/2007/03/primary-keys-ids-versus-guids.html

Guid Pros:

Unique across every table, every database, every server

Allows easy merging of records from different databases

Allows easy distribution of databases across multiple servers

You can generate IDs anywhere, instead of having to roundtrip to the database

Most replication scenarios require GUID columns anyway

Guid Cons:

It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful

Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')

The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes

SQL primary key, INT or GUID or..?

The advantage of using GUID primkey is that it should be unique in the world, such as whether to move data from one database to another. So you know that the row is unique.

But if we are talking about a small db, so I prefer integer.

Edit:

If you using SQL Server 2005++, can you also use NEWSEQUENTIALID(),
this generates a GUID based on the row above.Allows the index problem with newid() is not there anymore.

Why not always use GUIDs instead of Integer IDs?

Integers join a lot faster, for one. This is especially important when dealing with millions of rows.

For two, GUIDs take up more space than integers do. Again, very important when dealing with millions of rows.

For three, GUIDs sometimes take different formats that can cause hiccups in applications, etc. An integer is an integer, through and through.

A more in depth look can be found here, and on Jeff's blog.

Advantages and disadvantages of GUID / UUID database keys

Advantages:

  • Can generate them offline.
  • Makes replication trivial (as opposed to int's, which makes it REALLY hard)
  • ORM's usually like them
  • Unique across applications. So We can use the PK's from our CMS (guid) in our app (also guid) and know we are NEVER going to get a clash.

Disadvantages:

  • Larger space use, but space is cheap(er)
  • Can't order by ID to get the insert order.
  • Can look ugly in a URL, but really, WTF are you doing putting a REAL DB key in a URL!? (This point disputed in comments below)
  • Harder to do manual debugging, but not that hard.

Personally, I use them for most PK's in any system of a decent size, but I got "trained" on a system which was replicated all over the place, so we HAD to have them. YMMV.

I think the duplicate data thing is rubbish - you can get duplicate data however you do it. Surrogate keys are usually frowned upon where ever I've been working. We DO use the WordPress-like system though:

  • unique ID for the row (GUID/whatever). Never visible to the user.
  • public ID is generated ONCE from some field (e.g. the title - make it the-title-of-the-article)

UPDATE:
So this one gets +1'ed a lot, and I thought I should point out a big downside of GUID PK's: Clustered Indexes.

If you have a lot of records, and a clustered index on a GUID, your insert performance will SUCK, as you get inserts in random places in the list of items (that's the point), not at the end (which is quick).

So if you need insert performance, maybe use a auto-inc INT, and generate a GUID if you want to share it with someone else (e.g., showing it to a user in a URL).

Why is IDENTITY preferred over GUID as primary key for data warehousing?

IDENTITY fields create small, pretty indexes. They are also SEQUENTIAL which means that indexes created for them are less fragmented than regular GUID key indexes. Using SEQUENTIAL GUID's will get you closer to this behavior, but it still has its drawbacks. One advantage a GUID has is that it tends to be unique even across databases, but it's a performance and space hit in most applications.

GUID Pros
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway

GUID Cons
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes

Also, to specifically answer your question :
I don't think the article you're referencing says that "GUIDs are a bad idea for data warehousing" as much as it says "Identity" fields are more useful in data warehousing than natural keys. However, if you're storing huge amounts of records in a data warehouse, you will get better performance and smaller storage requirements from using IDENTITY columns rather than GUIDs due to the indexing complaint above, I would say that is the primary drawback.

GUIDs as Primary Keys - Offline OLTP

Using Guids as primary keys is acceptable and is considered a fairly standard practice for the same reasons that you are considering them. They can be overused which can make things a bit tedious to debug and manage, so try to keep them out of code tables and other reference data if at all possible.

The thing that you have to concern yourself with is the human readable identifier. Guids cannot be exchanged by people - can you imagine trying to confirm your order number over the phone if it is a guid? So in an offline scenario you may still have to generate something - like a publisher (workstation/user) id and some sequence number, so the order number may be 123-5678 -.

However this may not satisfy business requirements of having a sequential number. In fact regulatory requirements can be and influence - some regulations (SOX maybe) require that invoice numbers are sequential. In such cases it may be neccessary to generate a sort of proforma number which is fixed up later when the systems synchronise. You may land up with tables having OrderId (Guid), OrderNo (int), ProformaOrderNo (varchar) - some complexity may creep in.

At least having guids as primary keys means that you don't have to do a whole lot of cascading updates when the sync does eventually happen - you simply update the human readable number.

Should I get rid of clustered indexes on Guid columns

A big reason for a clustered index is when you often want to retrieve rows for a range of values for a given column. Because the data is physically arranged in that order, the rows can be extracted very efficiently.

Something like a GUID, while excellent for a primary key, could be positively detrimental to performance, as there will be additional cost for inserts and no perceptible benefit on selects.

So yes, don't cluster an index on GUID.

As to why it's not offered as a recommendation, I'd suggest the tuner is aware of this fact.



Related Topics



Leave a reply



Submit