Pros and Cons of Autoincrement Keys on "Every Table"

I'm assuming that almost all tables will have a primary key - and it's just a question of whether that key consists of one or more natural keys or a single auto-incrementing surrogate key. If you aren't using primary keys, you will generally gain a lot by adding them to almost all tables.

So, here are some pros & cons of surrogate keys. First off, the pros:

  • Most importantly: they allow the natural keys to change. Trivial example, a table of persons should have a primary key of person_id rather than last_name, first_name.
  • Read performance - very small indexes are faster to scan. However, this is only helpful if you're actually constraining your query by the surrogate key. So, good for lookup tables, not so good for primary tables.
  • Simplicity - if named appropriately, it makes the database easy to learn & use.
  • Capacity - if you're designing something like a data warehouse fact table - surrogate keys on your dimensions allow you to keep a very narrow fact table - which results in huge capacity improvements.

And cons:

  • They don't prevent duplicates of the natural values. So, you'll still usually want a unique constraint (index) on the logical key.
  • Write performance. With an extra index you're going to slow down inserts, updates and deletes that much more.
  • Simplicity - for small tables of data that almost never changes they are unnecessary. For example, if you need a list of countries you can use the ISO list of countries. It includes meaningful abbreviations. This is better than a surrogate key because it's both small and useful.
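The first con above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the table names `person_loose` and `person_strict` and the `email` column are hypothetical, and SQLite's `INTEGER PRIMARY KEY AUTOINCREMENT` stands in for whatever auto-increment mechanism your database offers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Surrogate key only: nothing stops the same natural value appearing twice.
conn.execute("""
    CREATE TABLE person_loose (
        person_id INTEGER PRIMARY KEY AUTOINCREMENT,
        email     TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO person_loose (email) VALUES ('ann@example.com')")
conn.execute("INSERT INTO person_loose (email) VALUES ('ann@example.com')")
loose_count = conn.execute("SELECT COUNT(*) FROM person_loose").fetchone()[0]

# Surrogate key plus a unique constraint on the logical key.
conn.execute("""
    CREATE TABLE person_strict (
        person_id INTEGER PRIMARY KEY AUTOINCREMENT,
        email     TEXT NOT NULL UNIQUE
    )
""")
conn.execute("INSERT INTO person_strict (email) VALUES ('ann@example.com')")
try:
    conn.execute("INSERT INTO person_strict (email) VALUES ('ann@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(loose_count, duplicate_rejected)
```

The loose table happily stores the duplicate, while the strict one rejects it, at the price of maintaining the extra unique index mentioned under write performance.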

In general, surrogate keys are useful, just keep in mind the cons and don't hesitate to use natural keys when appropriate.

Does every table really need an auto-incrementing artificial primary key?

No.

In most cases, having a surrogate INT IDENTITY key is an easy option: it can be guaranteed to be NOT NULL and 100% unique, something a lot of "natural" keys don't offer - names can change, and so can SSNs and other items of information.

In the case of state abbreviations and names - if anything, I'd use the two-letter state abbreviation as a key.

A primary key must be:

  • unique (100% guaranteed! Not just "almost" unique)
  • NOT NULL

A primary key should be:

  • stable if at all possible (it should not change, or at least not too frequently)

State two-letter codes definitely would offer this - that might be a candidate for a natural key. A key should also be small - an INT of 4 bytes is perfect, a two-letter CHAR(2) column just the same. I would not ever use a VARCHAR(100) field or something like that as a key - it's just too clunky, most likely will change all the time - not a good key candidate.

So while you don't have to have an auto-incrementing "artificial" (surrogate) primary key, it's often quite a good choice, since naturally occurring data is rarely up to the task of being a primary key, and you want to avoid having huge primary keys with several columns - those are just too clunky and inefficient.

Why do we use primary key auto increment, and not just auto_increment?

Because of constraints, because SQL is a declarative language, and because it is self-documenting.

AUTO_INCREMENT and PRIMARY KEY do two very different things.

AUTO_INCREMENT is not standard SQL, but most databases have something like it. It usually makes the column's default value an incrementing integer. That's the mechanistic element of a primary key: an auto-incrementing integer is a decent unique id.

PRIMARY KEY does a number of things.

  1. The columns are not null.
  2. The columns are unique.
  3. (Non-standard, but almost always) The columns are indexed.
  4. The columns are declared the primary key.
  5. No other columns can be declared the primary key.

Most are constraints on the value. It must exist (1), it must be unique (2), and it must be the only primary key (5). Constraints guarantee the data is as you expect. Without them you might accidentally insert multiple rows with the same primary key, or with no primary key. Or the table might have multiple "primary" keys, and then which one do you use? With constraints you don't have to check the data every time you fetch it; you know it will be within its declared constraints.
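Points 1, 2 and 5 can be tried out with Python's built-in sqlite3 module. This is a sketch, not a statement about every database: the `states` and `bad` tables are hypothetical, and SQLite has a documented quirk where a non-INTEGER PRIMARY KEY column accepts NULL unless NOT NULL is written out, so it is spelled explicitly here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite quirk: for historical reasons a non-INTEGER PRIMARY KEY column may
# hold NULL unless NOT NULL is spelled out, so it is written explicitly here.
conn.execute("""
    CREATE TABLE states (
        code TEXT NOT NULL PRIMARY KEY,
        name TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO states VALUES ('NY', 'New York')")


def is_rejected(sql):
    """Return True if the statement violates a declared constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# Point 2: a second row with the same key is refused.
duplicate_rejected = is_rejected("INSERT INTO states VALUES ('NY', 'Another New York')")
# Point 1: a row without a key is refused.
null_rejected = is_rejected("INSERT INTO states VALUES (NULL, 'Nowhere')")

# Point 5: a second PRIMARY KEY declaration is refused when the table is created.
try:
    conn.execute("CREATE TABLE bad (a INTEGER PRIMARY KEY, b INTEGER PRIMARY KEY)")
    second_pk_rejected = False
except sqlite3.OperationalError:
    second_pk_rejected = True

print(duplicate_rejected, null_rejected, second_pk_rejected)
```

None of these checks had to be written by hand in application code; declaring the key was enough.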

Indexing the primary key (3) is an optimization. You're probably going to be searching by primary key a lot. Indexes aren't part of the SQL standard. That might be surprising. They're a detail of implementing SQL. Indexes don't affect the result of the query, and the SQL standard avoids telling databases how they should do things. This is part of why SQL is so ubiquitous, it is declarative.

In a declarative language you don't say how to do it, you say what you want. A SQL query says what result you want and the database figures it out for you. You could implement most of the above as constraints and triggers, but if you did, the database wouldn't understand why you're doing that and would not be able to use that to help you out. By telling the database "this is the primary key", you let it handle that as it sees fit. The database adds the necessary constraints. The database adds its optimizations, which can be more than indexing. The database can use this information in queries to identify unique rows. You get the benefit of 50 years of database theory.

Finally, it lets people know the purpose of the column. Anyone can read the schema and know that column is the primary key. That anyone can be you six months from now.

Should primary key be auto_increment?

When designing a primary key, is it necessary to set auto_increment?

No, it's not strictly necessary. There are cases when a natural key is fine.

If done, what's the benefit?

Advantages of using an auto-increment surrogate key:

  • Surrogate keys never need to change, even if all other columns in your table are possible to change.
  • It's easier for the RDBMS to ensure uniqueness of an auto-increment key without locking and without race conditions, when you have multiple users inserting concurrently.
  • An integer is one of the most compact data types you can use for a primary key, so it results in a smaller index than, for example, a long string.
  • Efficiency of inserting into B-tree indexes (see below).
  • It's a little easier and tidier to reference a row with a single column than multiple columns, when the only other candidate key consisted of several columns.

Advantages of using a natural key:

  • The column has some meaning for the entity, for example a phone number. You don't need to store an extra column for the surrogate key.
  • Other tables using foreign keys to reference a natural primary key get a meaningful value, so they can avoid a join. For example, a table of shoes referencing colors would need to do a join if you wanted to get the color name. But if you use the color name as the primary key of colors, then that value would already be part of the shoes table.
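The shoes and colors example can be sketched with Python's built-in sqlite3 module. The column names (`shoe_id`, `model`, `color`, `name`) and the sample rows are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
    -- The color name itself is the (natural) primary key.
    CREATE TABLE colors (
        name TEXT NOT NULL PRIMARY KEY
    );
    -- shoes stores the meaningful value directly in its foreign key column.
    CREATE TABLE shoes (
        shoe_id INTEGER PRIMARY KEY,
        model   TEXT NOT NULL,
        color   TEXT NOT NULL REFERENCES colors (name)
    );
    INSERT INTO colors VALUES ('red'), ('blue');
    INSERT INTO shoes (model, color) VALUES ('runner', 'red'), ('loafer', 'blue');
""")

# No join needed: the color name is already part of the shoes row.
rows = conn.execute("SELECT model, color FROM shoes ORDER BY shoe_id").fetchall()
print(rows)
```

With a surrogate `color_id` instead, the same report would need a join from shoes to colors to recover the name.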

Other cases when a surrogate auto-increment key is not needed:

  • You already have a combination of other columns (whether they are surrogate keys or natural keys) that provides a candidate key for the table. A good example is found in many-to-many tables. If a table maps movies to actors, even if both movies and actors are referenced by primary keys, then you already have a candidate key over those two columns, and you don't need yet another auto-increment column.
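A minimal sketch of such a many-to-many table, using Python's built-in sqlite3 module. The table name `movie_actors` and the sample rows are hypothetical; the point is only that the composite primary key already identifies each link.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE movies (movie_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE actors (actor_id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
    -- The pair of foreign keys is the primary key; no extra id column.
    CREATE TABLE movie_actors (
        movie_id INTEGER NOT NULL REFERENCES movies (movie_id),
        actor_id INTEGER NOT NULL REFERENCES actors (actor_id),
        PRIMARY KEY (movie_id, actor_id)
    );
    INSERT INTO movies VALUES (1, 'Example Movie');
    INSERT INTO actors VALUES (1, 'Example Actor');
    INSERT INTO movie_actors VALUES (1, 1);
""")

# Recording the same link twice violates the composite primary key.
try:
    conn.execute("INSERT INTO movie_actors VALUES (1, 1)")
    duplicate_link_rejected = False
except sqlite3.IntegrityError:
    duplicate_link_rejected = True

link_count = conn.execute("SELECT COUNT(*) FROM movie_actors").fetchone()[0]
print(duplicate_link_rejected, link_count)
```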

I have heard that this can keep the B-tree stable, but I don't know why?

Inserting a value into an arbitrary place in the middle of a B-tree may cause a costly restructuring of the index.

There's an animated example here: http://www.bluerwhite.org/btree/

Look at the example "Inserting Key 33 into a B-Tree (w/ Split)" where it shows the steps of inserting a value into a B-tree node that overfills it, and what the B-tree does in response.

Now imagine that the example illustration only shows the bottom part of a B-tree that is much deeper (as would be the case for an index B-tree that has millions of entries), and that filling the parent node can itself cause an overflow and force the splitting operation to continue up to the next higher level in the tree. This can continue all the way to the very top of the tree if all the ancestor nodes to the top of the tree were already filled.

As the nodes split and have to be restructured, they may require more space, but they're stored on some page of the database file where there's no spare space. So the storage engine has to relocate parts of the index to another part of the file, and potentially re-write a lot of pages of index just for a single INSERT.

Auto-increment values are naturally always inserted at the very rightmost edge of the B-tree. As @BrankoDimitrijevic points out in a comment, this does not make it less likely that they'll cause such laborious node-splitting and restructuring of the index. But the B-tree implementation code can optimize for this case in other ways, and some do.
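To make the "rightmost edge" observation concrete, here is a deliberately simplified Python sketch. It uses a flat sorted list as a stand-in for the ordered leaf level of a B-tree (an assumption: it models neither pages nor node splits) and only measures where each new key lands. The function name `rightmost_insert_fraction` is made up for this illustration.

```python
import bisect
import random


def rightmost_insert_fraction(keys):
    """Insert keys one by one into a sorted list and return the fraction
    of inserts that landed at the right edge (the end of the list)."""
    ordered = []
    at_right_edge = 0
    for key in keys:
        position = bisect.bisect_right(ordered, key)
        if position == len(ordered):
            at_right_edge += 1
        ordered.insert(position, key)
    return at_right_edge / len(keys)


sequential_keys = list(range(1, 1001))          # what an auto-increment column produces
reversed_keys = sequential_keys[::-1]           # worst case: always the left edge
shuffled_keys = sequential_keys[:]
random.Random(0).shuffle(shuffled_keys)         # arbitrary arrival order, e.g. natural keys

sequential_fraction = rightmost_insert_fraction(sequential_keys)
reversed_fraction = rightmost_insert_fraction(reversed_keys)
shuffled_fraction = rightmost_insert_fraction(shuffled_keys)

print(sequential_fraction, reversed_fraction, shuffled_fraction)
```

Every sequential key lands at the right edge, which is the predictable pattern a storage engine can optimize for; keys arriving in arbitrary order land anywhere in the structure.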

If a table has a unique column, which is better: setting the unique column as the primary key, or adding a new column 'id' as an auto_increment primary key?

If the unique column is also non-nullable, then you can use it as a primary key. Primary keys require that all of their columns are non-nullable.

Should each and every table have a primary key?

Short answer: yes.

Long answer:

  • You need your table to be joinable on something
  • If you want your table to be clustered, you need some kind of a primary key.
  • If your table design does not need a primary key, rethink your design: most probably, you are missing something. Why keep identical records?

In MySQL, the InnoDB storage engine always builds a clustered index: if you declare neither a primary key nor a unique index on NOT NULL columns, it generates one on a hidden row ID column that you don't have access to.

Note that a primary key can be composite.

If you have a many-to-many link table, you create the primary key on all fields involved in the link. Thus you ensure that you don't have two or more records describing one link.

Besides the logical consistency issues, most RDBMS engines will benefit from including these fields in a unique index.

And since any primary key involves creating a unique index, you should declare it and get both logical consistency and performance.

See this article in my blog for why you should always create a unique index on unique data:

  • Making an index UNIQUE

P.S. There are some very, very special cases where you don't need a primary key.

Mostly they include log tables which don't have any indexes for performance reasons.

Reasons not to use an auto-incrementing number for a primary key

An auto generated ID can cause problems in situations where you are using replication (as I'm sure the techniques you've found can!). In these cases, I generally opt for a GUID.

If you are not likely to use replication, then an auto-incrementing PK will most likely work just fine.
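The replication problem can be illustrated with a small Python sketch. It assumes a setup in which two nodes accept inserts independently and each runs its own counter; the helper names `auto_increment_ids` and `guid_ids` are hypothetical, and no real database is involved.

```python
import itertools
import uuid


def auto_increment_ids(count):
    """IDs a single node would hand out from its own counter starting at 1."""
    counter = itertools.count(1)
    return [next(counter) for _ in range(count)]


def guid_ids(count):
    """Random version 4 UUIDs generated without any shared counter."""
    return [str(uuid.uuid4()) for _ in range(count)]


# Two nodes that number their rows independently produce the same keys.
node_a_ints, node_b_ints = auto_increment_ids(3), auto_increment_ids(3)
int_collisions = set(node_a_ints) & set(node_b_ints)

# GUIDs need no coordination between the nodes.
node_a_guids, node_b_guids = guid_ids(3), guid_ids(3)
guid_collisions = set(node_a_guids) & set(node_b_guids)

print(sorted(int_collisions), len(guid_collisions))
```

The trade-off is size: a GUID takes 16 bytes in binary form against 4 for an INT, which matters for the index size points raised elsewhere on this page.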

Should every MySQL table have an auto-incremented primary key?

I am not a huge fan of surrogate keys. I have yet to see a scenario where I would prefer to use one for every table of a database.

I would say No.

Read up on this answer: surrogate-vs-natural-business-keys


The above may be seen as sarcastic or flaming (despite the surprisingly many upvotes) so it's deleted.

In the general case, there have been many questions and answers on surrogate and natural keys so I felt this question is more like a duplicate. My view is that surrogate keys are fine and very useful, mainly because natural keys can lead to very big primary keys in the low end of a chain of connected tables - and this is not handled well by many RDBMS, clustered indexes get big, etc. But saying that "every MySQL table should have an auto-incremented primary key" is a very absolute statement and I think there are cases when they really offer little or nothing.

Since the OP updated the question, I'll try to comment on that specific topic.

I think this is exactly a case where an autoincrementing primary key is not only useless but adds negative value. Supposing that table1 and table2 are in 1:1 relationship, the memberid can be both the Primary Key and a Foreign Key to table1.

Adding an autoincrementing id column adds one index and, if it's a clustered one (like InnoDB PK indexes), increases the size of the memberid index. Moreover, if you have such an auto-incrementing id, some JOINs of table2 to other tables will have to be done using this id (the JOINs to tables in 1:n relation to table2) and some using memberid (the JOINs to tables in 1:n relation to table1). If you only have memberid, both these types of JOINs can be done using memberid.
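A sketch of that 1:1 layout using Python's built-in sqlite3 module. Only `table1`, `table2` and `memberid` come from the discussion; the `name` and `bio` columns and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
    CREATE TABLE table1 (
        memberid INTEGER PRIMARY KEY,
        name     TEXT NOT NULL          -- hypothetical column
    );
    -- memberid is both the primary key and a foreign key to table1.
    CREATE TABLE table2 (
        memberid INTEGER PRIMARY KEY REFERENCES table1 (memberid),
        bio      TEXT                   -- hypothetical column
    );
    INSERT INTO table1 VALUES (1, 'Ann');
    INSERT INTO table2 VALUES (1, 'First member');
""")


def is_rejected(sql):
    """Return True if the statement violates a declared constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# The primary key keeps the relationship 1:1 ...
second_row_rejected = is_rejected("INSERT INTO table2 VALUES (1, 'Duplicate')")
# ... and the foreign key keeps every table2 row attached to a member.
orphan_rejected = is_rejected("INSERT INTO table2 VALUES (99, 'No such member')")

print(second_row_rejected, orphan_rejected)
```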

Natural key vs auto_increment key as the primary key

In my opinion, whether a natural key or an auto-increment key is more convenient depends on the solution you are providing. Both methods have pros and cons. So the best approach is to understand both key types properly, analyze what kind of business solution you are trying to provide, and select the appropriate primary key type.

A natural key is a column or a set of columns which can be used to uniquely identify a record in a table. These columns contain real data which has a relationship with the rest of the columns of the table.

An auto-incremented key, also called a surrogate key, is a single table column which contains unique numeric values that can be used to uniquely identify a single row of data in a table. These values are generated at run-time when a record is inserted into the table and have no relationship with the rest of the data in the row.

The main advantage of a natural key is that it has its own meaning and requires fewer joins with other tables, whereas with a surrogate key we would have to join to the referenced table to get the results we got with the natural key.

But say we cannot get all the data required from a single table and have to join with another table anyway. Then it is convenient to use a surrogate key instead of a natural key, because most of the time natural keys are strings and larger in size than surrogate keys, and it will take more time to join tables using larger values.

A natural key has its own meaning. So when it comes to searching records it is more advantageous to use natural keys over surrogate keys. But say that with time our program logic changes and we have to change the natural key value. This will be difficult and will cause a cascade effect over all foreign key relationships. We can overcome this problem using a surrogate key. Since a surrogate key does not have a relationship with the rest of the values of a row, changes of the logic won't have an effect on the surrogate key.
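The cascade effect can be observed with Python's built-in sqlite3 module. The `products` and `order_lines` tables, the `sku` column and the sample values are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
    -- The natural key is the primary key and is referenced by child rows.
    CREATE TABLE products (
        sku TEXT NOT NULL PRIMARY KEY
    );
    CREATE TABLE order_lines (
        line_id INTEGER PRIMARY KEY,
        sku     TEXT NOT NULL REFERENCES products (sku) ON UPDATE CASCADE
    );
    INSERT INTO products VALUES ('OLD-1');
    INSERT INTO order_lines (sku) VALUES ('OLD-1'), ('OLD-1'), ('OLD-1');
""")

# Changing one natural key value forces every referencing row to be rewritten.
conn.execute("UPDATE products SET sku = 'NEW-1' WHERE sku = 'OLD-1'")
rewritten = conn.execute(
    "SELECT COUNT(*) FROM order_lines WHERE sku = 'NEW-1'"
).fetchone()[0]

print(rewritten)
```

Had order_lines referenced a surrogate product id instead, the same rename would have touched a single row in products.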

So, as I see it, the convenience or inconvenience of using a surrogate key or a natural key depends entirely on the solution you are providing.

Best practices on primary key, auto-increment, and UUID in SQL databases

It's a matter of choice actually, and from my point of view this question can attract opinion-based answers. What I always do, even if it's redundant, is create the primary key on an auto-increment column (I call it a technical key). This keeps things consistent within the database, allows the "primary key" to change in case something went wrong at the design phase, and consumes less space when that key is referenced by a foreign key constraint in another table. I also make the candidate key unique and not null.

Technical key is something you don't normally show to end users, unless you decide to. This can be the same for other technical columns that you're keeping only at database level for any purpose you may need like modify date, create date, version, user who changed the record and more.

In this case I would go for your second option, but slightly modified:

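CREATE TABLE users(
    pk INT NOT NULL AUTO_INCREMENT,  -- technical key, not normally shown to end users
    id UUID NOT NULL,                -- candidate key; UUID is a native type in MariaDB 10.7+, MySQL would need e.g. BINARY(16) or CHAR(36)
    .....
    PRIMARY KEY(pk),
    UNIQUE(id)
);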
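For experimenting with this layout without a MySQL server, here is a rough equivalent using Python's built-in sqlite3 module. It is a sketch under assumptions: SQLite has no UUID type, so the UUID is stored as text, `INTEGER PRIMARY KEY AUTOINCREMENT` replaces `AUTO_INCREMENT`, and the helper `add_user` is hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        pk INTEGER PRIMARY KEY AUTOINCREMENT,  -- technical key, internal only
        id TEXT NOT NULL UNIQUE                -- UUID stored as text, exposed to clients
    )
""")


def add_user():
    """Insert a user with a fresh UUID; return (technical key, public id)."""
    public_id = str(uuid.uuid4())
    cursor = conn.execute("INSERT INTO users (id) VALUES (?)", (public_id,))
    return cursor.lastrowid, public_id


first_pk, first_id = add_user()
second_pk, second_id = add_user()
print(first_pk, second_pk)
```

Foreign keys in other tables would point at the small `pk`, while the UUID in `id` is what leaves the database.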

