What Are the Reasons *Not* to Use a Guid for a Primary Key

What are the reasons *not* to use a GUID for a primary key?

Jeff Atwood talks about this in great detail:

http://www.codinghorror.com/blog/2007/03/primary-keys-ids-versus-guids.html

Guid Pros:

Unique across every table, every database, every server

Allows easy merging of records from different databases

Allows easy distribution of databases across multiple servers

You can generate IDs anywhere, instead of having to roundtrip to the database

Most replication scenarios require GUID columns anyway

Guid Cons:

It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful

Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')

The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes
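The "generate IDs anywhere" point can be illustrated with a minimal sketch. It assumes a client written in Python and uses the standard library's `uuid.uuid4()` (a random version-4 GUID); the `new_record` helper and its fields are hypothetical, not part of any of the answers here.

```python
import uuid

def new_record(name):
    """Build a row with a client-generated GUID key (no database round trip)."""
    return {"id": uuid.uuid4(), "name": name}

a = new_record("alice")
b = new_record("bob")

# Each key is 16 bytes, four times the size of a 4-byte INT key.
print(a["id"], len(a["id"].bytes))
print(b["id"], len(b["id"].bytes))
```

The key exists before the row ever reaches the server, which is what makes offline creation and later merging straightforward.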

What are the best practices for using a GUID as a primary key, specifically regarding performance?

GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue for using one as the PRIMARY KEY of the table. What I'd strongly recommend against is using the GUID column as the clustering key, which SQL Server does by default unless you specifically tell it not to.

You really need to keep two issues apart:

  1. the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.

  2. the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.

By default, the primary key on a SQL Server table is also used as the clustering key - but it doesn't have to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based Primary / Clustered Key into two separate keys - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.

As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.

Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.

Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.

Quick calculation - using INT vs. GUID as Primary and Clustering Key:

  • Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
  • 6 nonclustered indexes (22.89 MB vs. 91.55 MB)

TOTAL: about 26.7 MB vs. 106.8 MB - and that's just on a single table!
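The arithmetic behind these figures can be reproduced with a short sketch. It counts key bytes only (4 bytes per INT, 16 per GUID) and assumes each nonclustered index stores one copy of the clustering key per row; row and page overhead are ignored, so real tables will be larger. The function name is made up for illustration.

```python
BYTES_PER_MB = 1024 * 1024

def key_storage_mb(key_bytes, rows, nonclustered_indexes):
    """Return (base table, all nonclustered indexes, total) key storage in MB."""
    base = key_bytes * rows / BYTES_PER_MB
    indexes = base * nonclustered_indexes
    return base, indexes, base + indexes

int_sizes = key_storage_mb(4, 1_000_000, 6)
guid_sizes = key_storage_mb(16, 1_000_000, 6)

for label, (base, idx, total) in (("INT", int_sizes), ("GUID", guid_sizes)):
    print(f"{label}: base {base:.2f} MB, indexes {idx:.2f} MB, total {total:.2f} MB")
```

Under these assumptions the key bytes sum to roughly 26.7 MB for INT and 106.8 MB for GUID, a fourfold difference that scales linearly with rows and index count.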

Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.

  • GUIDs as PRIMARY KEY and/or clustered key
  • The clustered index debate continues
  • Ever-increasing clustering key - the Clustered Index Debate..........again!
  • Disk space is cheap - that's not the point!

PS: of course, if you're dealing with just a few hundred or a few thousand rows - most of these arguments won't really have much of an impact on you. However: if you get into the tens or hundreds of thousands of rows, or you start counting in millions - then those points become very crucial and very important to understand.

Update: if you want to have your PKGUID column as your primary key (but not your clustering key), and another column MYINT (INT IDENTITY) as your clustering key - use this:

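CREATE TABLE dbo.MyTable
(PKGUID UNIQUEIDENTIFIER NOT NULL,
MyINT INT IDENTITY(1,1) NOT NULL
-- add more columns as needed (each preceded by a comma)
);

-- The primary key is explicitly NONCLUSTERED, so it does not become the clustering key
ALTER TABLE dbo.MyTable
ADD CONSTRAINT PK_MyTable
PRIMARY KEY NONCLUSTERED (PKGUID);

-- The clustered index goes on the small, ever-increasing INT column
CREATE UNIQUE CLUSTERED INDEX CIX_MyTable ON dbo.MyTable(MyINT);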

Basically: you just have to explicitly tell the PRIMARY KEY constraint that it's NONCLUSTERED (otherwise it's created as your clustered index, by default) - and then you create a second index that's defined as CLUSTERED.

This will work - and it's a valid option if you have an existing system that needs to be "re-engineered" for performance. For a new system, if you start from scratch, and you're not in a replication scenario, then I'd always pick ID INT IDENTITY(1,1) as my clustered primary key - much more efficient than anything else!

Advantages and disadvantages of GUID / UUID database keys

Advantages:

  • Can generate them offline.
  • Makes replication trivial (as opposed to ints, which make it REALLY hard)
  • ORMs usually like them
  • Unique across applications. So we can use the PKs from our CMS (GUID) in our app (also GUID) and know we are NEVER going to get a clash.

Disadvantages:

  • Larger space use, but space is cheap(er)
  • Can't order by ID to get the insert order.
  • Can look ugly in a URL, but really, WTF are you doing putting a REAL DB key in a URL!? (This point disputed in comments below)
  • Harder to do manual debugging, but not that hard.

Personally, I use them for most PKs in any system of a decent size, but I got "trained" on a system which was replicated all over the place, so we HAD to have them. YMMV.

I think the duplicate data thing is rubbish - you can get duplicate data however you do it. Surrogate keys are usually frowned upon wherever I've worked. We DO use the WordPress-like system though:

  • unique ID for the row (GUID/whatever). Never visible to the user.
  • public ID is generated ONCE from some field (e.g. the title - make it the-title-of-the-article)
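The two-identifier scheme can be sketched roughly as follows. This is an illustration under assumptions, not the system the answer describes: `make_slug` and `new_article` are hypothetical helpers, and the slug rule (lowercase, runs of letters and digits joined by hyphens) is just one plausible choice.

```python
import re
import uuid

def make_slug(title):
    """Lowercase the title and join runs of letters/digits with hyphens."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)

def new_article(title):
    """Create a row with an internal GUID and a public ID derived once from the title."""
    return {"id": uuid.uuid4(), "public_id": make_slug(title), "title": title}

article = new_article("The Title of the Article")
print(article["public_id"])  # the-title-of-the-article
```

Because the public ID is stored rather than recomputed, later edits to the title do not change URLs that already point at the article.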

UPDATE:
So this one gets +1'ed a lot, and I thought I should point out a big downside of GUID PKs: Clustered Indexes.

If you have a lot of records, and a clustered index on a GUID, your insert performance will SUCK, as you get inserts in random places in the list of items (that's the point), not at the end (which is quick).

So if you need insert performance, maybe use an auto-increment INT, and generate a GUID if you want to share it with someone else (e.g., showing it to a user in a URL).
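The "random places versus the end" effect can be seen in a toy simulation. It models the clustered index as a plain sorted Python list, which is a large simplification of SQL Server's B-tree pages, and uses seeded random 128-bit integers as stand-ins for GUIDs; all names are made up for the sketch.

```python
import bisect
import random

def count_mid_inserts(keys):
    """Insert keys into a sorted list; count those that do not land at the end."""
    ordered = []
    mid_inserts = 0
    for key in keys:
        pos = bisect.bisect_left(ordered, key)
        if pos != len(ordered):
            mid_inserts += 1  # this insert has to shift existing entries
        ordered.insert(pos, key)
    return mid_inserts

n = 10_000
rng = random.Random(42)  # seeded so the run is repeatable

sequential_keys = list(range(1, n + 1))                 # like INT IDENTITY
random_keys = [rng.getrandbits(128) for _ in range(n)]  # stand-in for random GUIDs

seq_mid = count_mid_inserts(sequential_keys)
rand_mid = count_mid_inserts(random_keys)
print(f"sequential keys: {seq_mid} of {n} inserts landed before the end")
print(f"random keys:     {rand_mid} of {n} inserts landed before the end")
```

Sequential keys always append, while almost every random key lands somewhere in the middle. In a real clustered index those mid-list inserts translate into page splits and fragmentation.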

Is it a bad idea to use GUIDs as primary keys in MS SQL?

There are pros and cons:

This article covers everything.

GUID Pros

  • Unique across every table, every database, every server
  • Allows easy merging of records from different databases
  • Allows easy distribution of databases across multiple servers
  • You can generate IDs anywhere, instead of having to roundtrip to the database
  • Most replication scenarios require GUID columns anyway

GUID Cons

  • It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
  • Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
  • The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes

Why not always use GUIDs instead of Integer IDs?

Integers join a lot faster, for one. This is especially important when dealing with millions of rows.

For two, GUIDs take up more space than integers do. Again, very important when dealing with millions of rows.

For three, GUIDs sometimes take different formats that can cause hiccups in applications, etc. An integer is an integer, through and through.

A more in-depth look can be found here, and on Jeff's blog.

When would you use GUIDs as primary keys?

You would use GUIDs as a key if you needed multiple databases synchronising via replication.

Another reason to use GUIDs is if you wanted to create rows on some remote client, e.g. a WinForms app, and then submit those to the server via web services etc.

If you do this, I would strongly suggest that you specify your own clustered index on an auto-incrementing int column (the index does not have to be declared unique) instead of clustering on the GUID. Inserting rows into a table whose clustered index is a GUID can carry considerable overhead.

Update: Here is an example of how to set up a table like this:

CREATE TABLE [dbo].[myTable](
[intId] [int] IDENTITY(1,1) NOT NULL,
[realGuidId] [uniqueidentifier] NOT NULL,
[someData] [varchar](50) NULL,
CONSTRAINT [PK_myTable] UNIQUE NONCLUSTERED
(
[realGuidId] ASC
)
)

CREATE CLUSTERED INDEX [IX_myTable] ON [dbo].[myTable]
(
[intId] ASC
)

You would insert into the table as normal e.g.:

INSERT INTO dbo.myTable (realGuidId, someData) VALUES (NEWID(), 'Some useful data goes here')

Update: I listened to a really good dotnetrocks episode that talks about this; it's worth a listen - Show #447

Why is IDENTITY preferred over GUID as primary key for data warehousing?

IDENTITY fields create small, compact indexes. They are also SEQUENTIAL, which means that indexes created for them are less fragmented than regular GUID key indexes. Using SEQUENTIAL GUIDs will get you closer to this behavior, but it still has its drawbacks. One advantage a GUID has is that it tends to be unique even across databases, but it's a performance and space hit in most applications.

GUID Pros

  • Unique across every table, every database, every server
  • Allows easy merging of records from different databases
  • Allows easy distribution of databases across multiple servers
  • You can generate IDs anywhere, instead of having to roundtrip to the database
  • Most replication scenarios require GUID columns anyway

GUID Cons

  • It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
  • Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
  • The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes

Also, to specifically answer your question:
I don't think the article you're referencing says that "GUIDs are a bad idea for data warehousing" so much as it says that "Identity" fields are more useful in data warehousing than natural keys. However, if you're storing huge numbers of records in a data warehouse, you will get better performance and smaller storage requirements from IDENTITY columns than from GUIDs, because of the indexing issue described above. I would say that is the primary drawback.

SQL primary key, INT or GUID or..?

The advantage of using a GUID primary key is that it should be unique in the world, for example when you move data from one database to another. So you know that the row is unique.

But if we are talking about a small database, I prefer an integer.

Edit:

If you are using SQL Server 2005 or later, you can also use NEWSEQUENTIALID(), which generates each GUID greater than the previously generated one, so the index problem seen with NEWID() is largely avoided.

Which Database can I Safely use a GUID as Primary Key besides SQL Server?

As others have said, you can use GUIDs/UUIDs in pretty much any modern DB. The algorithm for generating a GUID is pretty straightforward, and you can be reasonably sure that you won't get dupes; however, there are some considerations.

+) Although GUIDs are generally representations of 128-bit values, the actual format used differs from implementation to implementation - you may want to consider normalizing them by removing non-significant characters (usually dashes or spaces).

+) To absolutely ensure uniqueness you can also append a value to the GUID. For example, if you're worried about MS and Oracle GUIDs colliding, add "MS" to the former and "Or" to the latter - now even if the GUIDs themselves do collide, the keys won't.
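Both points can be sketched in a few lines. The example assumes Python's standard `uuid.UUID` parser, which accepts braces, hyphens and mixed case; `normalize_guid` and `tagged_key` are hypothetical helpers, and the colon separator in the tagged key is an arbitrary choice.

```python
import uuid

def normalize_guid(text):
    """Reduce any common textual GUID form to 32 lowercase hex characters."""
    return uuid.UUID(text.strip()).hex

def tagged_key(source, text):
    """Prefix a normalized GUID with a short source tag such as 'MS' or 'Or'."""
    return f"{source}:{normalize_guid(text)}"

forms = [
    "{6F9619FF-8B86-D011-B42D-00C04FC964FF}",
    "6f9619ff-8b86-d011-b42d-00c04fc964ff",
    " 6F9619FF8B86D011B42D00C04FC964FF ",
]

normalized = [normalize_guid(f) for f in forms]
print(normalized)
print(tagged_key("MS", forms[0]), tagged_key("Or", forms[0]))
```

All three spellings collapse to the same string, so comparisons and lookups no longer depend on which system produced the text. The tagged keys differ even when the underlying GUID is identical.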

As others have mentioned however there is a potentially severe price to pay here: your keys will be large (128 bits) and won't index very well (although this is somewhat dependent on the implementation).

The technique works very well for small databases (especially those where the entire dataset can fit in memory), but as DBs grow you'll definitely have to accept a performance trade-off.

One thing you might consider is a hybrid approach. Without more information it's hard to really know what you're trying to do so these might not help:

1) Remember that primary keys don't have to be a single column - you can have a simple numeric key to identify your rows and another column, containing a single value, that identifies the database that hosts the data or created the key. Creating the primary key as a composite of both columns lets the index work with simpler values and should be significantly faster.

2) You can "fake it" by constructing the key as a concatenated field (as in the above idea to append a DB identifier to the key). So your key would be a simple number followed by some DB identifier (perhaps a guid for each DB).

Indexing such a value (since the values would still be sequential) should be much faster.

In both cases you'll have some manual work to do if you ever do split the DB(s) - you'll have to update some keys with a new DB ID, but this would be a one-time, infrequent event. In exchange you can tune your DB much better.
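The composite-key variant can be modelled with a minimal sketch. It assumes each database is given a small integer ID by hand and hands out its own sequential local numbers; the `KeyAllocator` class is invented for illustration and stands in for whatever identity or sequence mechanism the DBMS provides.

```python
from itertools import count

class KeyAllocator:
    """Hands out (database_id, local_id) composite keys for one database."""

    def __init__(self, database_id):
        self.database_id = database_id
        self._next_local = count(1)  # sequential local numbers: 1, 2, 3, ...

    def next_key(self):
        return (self.database_id, next(self._next_local))

db_a = KeyAllocator(1)
db_b = KeyAllocator(2)

keys_a = [db_a.next_key() for _ in range(3)]
keys_b = [db_b.next_key() for _ in range(3)]

# Merging rows from both databases: local numbers repeat, composite keys do not.
merged = sorted(keys_a + keys_b)
print(merged)
```

Within each database the keys stay small and sequential, while the database ID keeps them distinct after a merge.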

There are definitely other ways to ensure data integrity across multiple databases. Many enterprise DBMSs have tools built in for clustering data across multiple servers or databases, some have special tools or design patterns that make it easier, etc.

In short I would say that guids are nice and simple and do what you want, but that you should only consider them if either a) the dataset is small or b) the DBMS has specific features to optimize their use as keys (for example sequential guids). If the datasets are going to be very large or if you're trying to limit DBMS-specific dependencies I would play around more with optimizing a "key + identifier" strategy.
