Will UUID as Primary Key in PostgreSQL Give Bad Index Performance?

Will UUID as primary key in PostgreSQL give bad index performance?

(I work on Heroku Postgres)

We use UUIDs as primary keys on a few systems and it works great.

I recommend you use the uuid-ossp extension, and even have postgres generate UUIDs for you:

heroku pg:psql
psql (9.1.4, server 9.1.6)
SSL connection (cipher: DHE-RSA-AES256-SHA, bits: 256)
Type "help" for help.

dcvgo3fvfmbl44=> CREATE EXTENSION "uuid-ossp";
CREATE EXTENSION
dcvgo3fvfmbl44=> CREATE TABLE test (id uuid primary key default uuid_generate_v4(), name text);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "test_pkey" for table "test"
CREATE TABLE
dcvgo3fvfmbl44=> \d test
Table "public.test"
Column | Type | Modifiers
--------+------+-------------------------------------
id | uuid | not null default uuid_generate_v4() name | text |
Indexes:
"test_pkey" PRIMARY KEY, btree (id)

dcvgo3fvfmbl44=> insert into test (name) values ('hgmnz');
INSERT 0 1
dcvgo3fvfmbl44=> select * from test;
id | name
--------------------------------------+-------
e535d271-91be-4291-832f-f7883a2d374f | hgmnz
(1 row)

EDIT: performance implications

It will always depend on your workload.

The integer primary key has the advantage of locality, where similar data sits close together. This can be helpful for, e.g., range queries such as WHERE id BETWEEN 1 AND 10000, although lock contention is worse.

If your read workload is totally random in that you always make primary key lookups, there shouldn't be any measurable performance degradation: you only pay for the larger data type.

Do you write a lot to this table, and is this table very big? It's possible, although I haven't measured this, that there are implications in maintaining that index. For lots of datasets UUIDs are just fine though, and using UUIDs as identifiers has some nice properties.

Finally, I may not be the most qualified person to discuss or advise on this, as I have never run a table large enough with a UUID PK for it to become a problem. YMMV. (Having said that, I'd love to hear from people who have run into problems with the approach!)

What is the performance hit of using a string type vs a uuid type for a UUID primary key?

This is a micro-optimization and is unlikely to cause a real performance problem until you get to enormous scales. Use the key that best fits your design. That said, here are the details...

UUID is a built-in PostgreSQL type. It's basically a 128-bit integer. It should perform as an index just as well as any other large integer. For a long time Postgres had no built-in UUID-generating function (PostgreSQL 13 finally added gen_random_uuid()): you can install various modules to generate them in the database, or you can do it on the client. Generating the UUID on the client distributes the extra work (not much extra work) away from the server.
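For illustration, here is a minimal sketch of both approaches (the things table is hypothetical; uuid_generate_v4() comes from the uuid-ossp extension):

-- Server-side: the database generates the value
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
INSERT INTO things (id, name) VALUES (uuid_generate_v4(), 'server-side');

-- Client-side: the application generates the UUID and sends it as a literal,
-- so the server does no generation work at all
INSERT INTO things (id, name)
VALUES ('1f0e8f0a-6a2b-4c3d-9e4f-5a6b7c8d9e0f', 'client-side');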

MySQL does not have a built-in UUID type. Instead, there's a UUID() function that generates a UUID as a string of hex digits. Because it's a string, UUID keys may carry a performance and storage penalty. It may also interfere with statement-based replication, since UUID() is non-deterministic.

The string UUID will be longer; hex characters only encode 4 bits of data per byte so a hex string UUID needs 256 bits to store 128 bits of information. This means more storage and memory per column which can impact performance.
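In PostgreSQL you can see the size difference directly with pg_column_size() (the value here is arbitrary; the text form is 37 bytes because the usual hyphenated representation is 36 characters plus a 1-byte header):

-- 16 bytes as a native uuid vs. 37 bytes as text
SELECT pg_column_size('755ad490-9a34-4c9f-8027-45fa37632b04'::uuid) AS as_uuid,
       pg_column_size('755ad490-9a34-4c9f-8027-45fa37632b04'::text) AS as_text;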

Normally this would mean comparisons take twice as long, since the key being compared is twice as long. However, UUIDs are normally unique in the first few bytes, so the whole UUID does not need to be compared to know that two differ. Long story short: comparing string vs binary UUIDs shouldn't cause a noticeable performance difference in a real application... though the fact that MySQL UUID strings are UTF-8 encoded might add cost.

Using UUIDs on PostgreSQL is fine, it's a built-in type. MySQL's implementation of UUID keys is pretty incomplete, I'd steer away from it. Steer away from MySQL while you're at it.

UUID Primary Key in Postgres, What Insert Performance Impact?

As I understand it, Postgres does not maintain row clustering on inserts

Correct at the moment. Unfortunately.

so I imagine that in Postgres using a UUID PK does not hurt the performance of that insert.

It still does have a performance cost because of the need to maintain the PK, and because the inserted tuple is bigger.

  • The uuid is 4 times as wide as a typical 32-bit integer synthetic key, so the row to write is 12 bytes bigger and you can fit fewer rows into a given amount of RAM

  • The b-tree index that implements the primary key will be 4x as large (vs a 32-bit key), taking longer to search and requiring more memory to cache. It also needs more frequent page splits. (A sketch for measuring the size difference follows this list.)

  • Writes will tend to be random within indexes, not appends to hot, recently accessed rows
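If you want to measure the index-size difference yourself, here is a minimal sketch with hypothetical table names (gen_random_uuid() needs PostgreSQL 13+, or the pgcrypto extension before that). The uuid index won't be exactly 4x in practice, but the gap is plain:

CREATE TABLE t_int  (id serial PRIMARY KEY, payload text);
CREATE TABLE t_uuid (id uuid   PRIMARY KEY DEFAULT gen_random_uuid(), payload text);

INSERT INTO t_int  (payload) SELECT 'x' FROM generate_series(1, 1000000);
INSERT INTO t_uuid (payload) SELECT 'x' FROM generate_series(1, 1000000);

-- Compare the on-disk size of the two primary key indexes
SELECT pg_size_pretty(pg_relation_size('t_int_pkey'))  AS int_pk,
       pg_size_pretty(pg_relation_size('t_uuid_pkey')) AS uuid_pk;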

is there any way to remedy [the performance impact on the index] or are UUIDs simply not a good PK on a large, un-partitioned table?

If you need a UUID key, you need a UUID key. You shouldn't use one if you don't require one, but if you cannot rely on a central source of synthetic keys and there is no suitable natural key to use, it's still the way to go.

Partitioning won't help much unless you can confine writes to one partition, and even then you won't be able to use constraint exclusion on searches for the key, so you'll still have to search every partition's index for a key lookup. I can only see it being useful if your UUID forms part of a composite key and you can partition on the other part of the composite key.
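On a modern PostgreSQL (declarative partitioning since 10, primary keys on partitioned tables since 11), that composite-key idea looks roughly like this; table and column names are hypothetical:

CREATE TABLE events (
    tenant_id int  NOT NULL,
    id        uuid NOT NULL DEFAULT gen_random_uuid(),
    payload   text,
    PRIMARY KEY (tenant_id, id)  -- the partition key must be part of the PK
) PARTITION BY LIST (tenant_id);

CREATE TABLE events_t1 PARTITION OF events FOR VALUES IN (1);
CREATE TABLE events_t2 PARTITION OF events FOR VALUES IN (2);

-- A lookup that includes tenant_id is pruned to a single partition:
SELECT * FROM events
WHERE tenant_id = 1
  AND id = '755ad490-9a34-4c9f-8027-45fa37632b04';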

PostgreSQL UUID type performance

We had a table with about 30k rows that (for a specific, unrelated architectural reason) had UUIDs stored in a text field and indexed. I noticed that query performance was slower than I'd have expected. I created a new uuid column, copied in the values of the text primary key, and compared the two plans below: 2.652 ms vs 0.029 ms. Quite a difference!

-- With text index
QUERY PLAN
Index Scan using tmptable_pkey on tmptable (cost=0.41..1024.34 rows=1 width=1797) (actual time=0.183..2.632 rows=1 loops=1)
Index Cond: (primarykey = '755ad490-9a34-4c9f-8027-45fa37632b04'::text)
Planning time: 0.121 ms
Execution time: 2.652 ms

-- With a uuid index
QUERY PLAN
Index Scan using idx_tmptable on tmptable (cost=0.29..2.51 rows=1 width=1797) (actual time=0.012..0.013 rows=1 loops=1)
Index Cond: (uuidkey = '755ad490-9a34-4c9f-8027-45fa37632b04'::uuid)
Planning time: 0.109 ms
Execution time: 0.029 ms
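The comparison itself was nothing fancy; reconstructed roughly, using the column and index names that appear in the plans above:

ALTER TABLE tmptable ADD COLUMN uuidkey uuid;
UPDATE tmptable SET uuidkey = primarykey::uuid;
CREATE INDEX idx_tmptable ON tmptable (uuidkey);

EXPLAIN ANALYZE SELECT * FROM tmptable
WHERE primarykey = '755ad490-9a34-4c9f-8027-45fa37632b04';

EXPLAIN ANALYZE SELECT * FROM tmptable
WHERE uuidkey = '755ad490-9a34-4c9f-8027-45fa37632b04'::uuid;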

PostgreSQL using UUID vs Text as primary key

As @Kevin mentioned, the only way to know for sure with your exact data would be to compare and contrast both methods, but from what you've described, I don't see why this would be different from any other case where a string was either the primary key in a table or part of a unique index.

What can be said up front is that your indexes will probably be larger, since they have to store larger string values, and in theory the comparisons for the index will take a bit longer, but I wouldn't advocate premature optimization if doing so would be painful.

In my experience, I have seen very good performance on a unique index of md5sums on a table with billions of rows. I have found that it tends to be other characteristics of a query that cause performance issues. For example, when you end up needing to query over a very large swath of the table, say hundreds of thousands of rows, a sequential scan becomes the better choice, so that's what the query planner chooses, and it can take much longer.

There are other mitigating strategies for that type of situation, such as chunking the query and then UNIONing the results (e.g. a manual simulation of the sort of thing that would be done in Hive or Impala in the Hadoop sphere).

Re: your concern about indexing of text: while I'm sure there are some cases where a dataset produces a key distribution that performs terribly, GUIDs, much like md5sums, SHA-1s, etc., should index quite well in general and not require sequential scans (unless, as mentioned above, you query a huge swath of the table).

One of the big factors in how an index performs is how many unique values it contains. For that reason, a boolean index on a table with a large number of rows isn't likely to help, since it will end up with a huge number of row collisions for each of the values (true, false, and potentially NULL). A GUID index, on the other hand, is likely to have a huge number of values with no collisions (in theory by definition, since they are GUIDs).

Edit in response to comment from OP:

So are you saying that a UUID guid is the same thing as a Text guid as far as the indexing goes? Our entire table structure is using Text fields with a guid-like string, but I'm not sure Postgres recognizes it as a GUID. Just a string that happens to be unique.

Not literally the same, no. However, I am saying that they should have very similar performance for this particular case, and I don't see why optimizing up front is worth doing, especially given that you say to do so would be a very involved task.

You can always change things later if, in your specific environment, you run into performance problems. However, as I mentioned earlier, I think if you hit that scenario, there are other things that would likely yield better performance than changing the PK data types.

A UUID is a 128-bit data type (so, 16 bytes), whereas text has 1 or 4 bytes of overhead plus the actual length of the string. For a GUID, that would mean a minimum of 33 bytes, but could vary significantly depending on the encoding used.

So, with that in mind, indexes on text-based UUIDs will certainly be larger, since the values are larger, and comparing two strings is in theory less efficient than comparing two numeric values, but it's not something that's likely to make a huge difference in this case, at least not in the usual cases.

I would not optimize up front when doing so would carry a significant cost and will likely never be needed. That bridge can be crossed if the time comes (although I would pursue other query optimizations first, as mentioned above).

Regarding whether Postgres knows the string is a GUID, it definitely does not by default. As far as it's concerned, it's just a unique string. But that should be fine for most cases, e.g. matching rows and such. If you find yourself needing some behavior that specifically requires a GUID (for example, some non-equality based comparisons where a GUID comparison may differ from a purely lexical one), then you can always cast the string to a UUID, and Postgres will treat the value as such during that query.

e.g. for a text column foo, you can do foo::uuid to cast it to a uuid.
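For instance (table and column names hypothetical); note that casting the column in a WHERE clause bypasses a plain text index, so if you filter this way often, create an expression index on the cast:

-- A plain text comparison uses an ordinary index on foo:
SELECT * FROM mytable WHERE foo = '755ad490-9a34-4c9f-8027-45fa37632b04';

-- Casting gives real uuid semantics, but wants an expression index:
CREATE INDEX idx_mytable_foo_uuid ON mytable ((foo::uuid));
SELECT * FROM mytable
WHERE foo::uuid = '755ad490-9a34-4c9f-8027-45fa37632b04'::uuid;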

There's also a module available for generating uuids, uuid-ossp.

UUID or SEQUENCE for primary key?

A sequence in PostgreSQL serves the same purpose as AUTO_INCREMENT in MySQL. A sequence is more efficient than a uuid because its values are 8 bytes instead of the uuid's 16. That said, you can use a uuid as a primary key, just like almost any other data type.
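A minimal example of a sequence-backed key, with a hypothetical table:

CREATE TABLE items (
    id   bigserial PRIMARY KEY,  -- 8-byte integer backed by a sequence Postgres creates for you
    name text
);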

However, I don't see how this relates to masking a user ID. If you want to hide the ID of a certain user from other users, you should carefully manage the table privileges and/or hash the ID using, for instance, md5().
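For example, exposing a hash of the key instead of the key itself (table and column hypothetical):

SELECT md5(id::text) AS public_id FROM users;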

If you want to protect a table with user data from snooping hackers who try to guess other IDs, then the uuid type is an excellent choice. The uuid-ossp package offers several flavours; version 4 is the best choice here because it has 122 random bits (the other 6 identify the version). You can create a primary key like this:

id uuid PRIMARY KEY DEFAULT uuid_generate_v4()

and then you will never have to worry about it anymore.


PostgreSQL 13+

You can now use the built-in function gen_random_uuid() to get a version 4 random UUID.
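So on 13+ the equivalent of the earlier example needs no extension at all (table name hypothetical):

CREATE TABLE accounts (
    id   uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    name text
);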

Best practices on primary key, auto-increment, and UUID in SQL databases

It's really a matter of choice, and this question tends to attract opinion-based answers. What I always do, even if it's redundant, is create the primary key on an auto-increment column (I call it the technical key). That keeps things consistent within the database, allows the "real" primary key to change if something went wrong at the design phase, and consumes less space wherever the key is referenced by a foreign key constraint in another table. I also make the candidate key unique and not null.

The technical key is something you don't normally show to end users, unless you decide to. The same goes for other technical columns you keep only at the database level for whatever purpose you need: modification date, creation date, version, the user who changed the record, and so on.

In this case I would go for your second option, but slightly modified (shown here in PostgreSQL syntax):

CREATE TABLE users(
    pk int GENERATED ALWAYS AS IDENTITY,
    id uuid NOT NULL DEFAULT gen_random_uuid(),
    .....
    PRIMARY KEY(pk),
    UNIQUE(id)
);

Advantages and disadvantages of GUID / UUID database keys

Advantages:

  • Can generate them offline.
  • Makes replication trivial (as opposed to ints, which make it REALLY hard).
  • ORMs usually like them.
  • Unique across applications, so we can use the PKs from our CMS (guid) in our app (also guid) and know we are NEVER going to get a clash.

Disadvantages:

  • Larger space use, but space is cheap(er)
  • Can't order by ID to get the insert order.
  • Can look ugly in a URL, but really, WTF are you doing putting a REAL DB key in a URL!?
  • Harder to do manual debugging, but not that hard.

Personally, I use them for most PKs in any system of a decent size, but I got "trained" on a system that was replicated all over the place, so we HAD to have them. YMMV.

I think the duplicate-data thing is rubbish; you can get duplicate data however you do it. Surrogate keys are usually frowned upon wherever I've worked. We DO use a WordPress-like system, though:

  • unique ID for the row (GUID/whatever). Never visible to the user.
  • public ID is generated ONCE from some field (e.g. the title - make it the-title-of-the-article)

UPDATE:
So this one gets +1'ed a lot, and I thought I should point out a big downside of GUID PKs: clustered indexes.

If you have a lot of records and a clustered index on a GUID, your insert performance will SUCK, because each insert lands at a random place in the list of items (that's the point), not at the end (which is quick).

So if you need insert performance, maybe use an auto-inc INT and generate a GUID when you want to share the row with someone else (e.g., showing it to a user in a URL).

Does UUID primary key worsen read queries if table contains 'hot' and 'cold' data in postgresql?

It does not matter what data type you choose for your primary key; it will just be a couple of bytes on disk.

What I'd look into is partitioning. If you normally access new entries, you could partition by date. But this will only help if you can add a clause like WHERE creationdate > '....' to the queries that access the entries, because then the search will be limited to those partitions that match the condition. Partitioning would also make it easy to remove old data.
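A sketch of that idea with declarative partitioning (PostgreSQL 13+ as written, since it uses gen_random_uuid() and a primary key on a partitioned table; names are hypothetical, and note that the partition key must be part of the primary key):

CREATE TABLE entries (
    id           uuid NOT NULL DEFAULT gen_random_uuid(),
    creationdate date NOT NULL,
    payload      text,
    PRIMARY KEY (id, creationdate)
) PARTITION BY RANGE (creationdate);

CREATE TABLE entries_2024 PARTITION OF entries
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- WHERE creationdate > '2024-01-01' lets the planner skip cold partitions,
-- and removing old data becomes a cheap DROP TABLE of the old partition.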

When this answer was written, declarative partitioning was not yet built into PostgreSQL and took a lot of hand-rolling; PostgreSQL 10 added it, though certain things are still lacking, like global indexes. But if you use UUIDs as the primary key (to complete the circle and come back to your question), you wouldn't have to worry about duplicate entries across partitions anyway.


