When to Use an Auto-Incremented Primary Key and When Not To

When to use an auto-incremented primary key and when not to?

Many already-answered questions on Stack Overflow cover this topic and can help you.

The term you should be looking for: surrogate keys.

Hope it helps.

Why do we use primary key auto increment, and not just auto_increment?

Because of constraints, because SQL is a declarative language, and because it is self-documenting.

AUTO_INCREMENT and PRIMARY KEY do two very different things.

AUTO_INCREMENT is not standard SQL, but most databases have something like it. It usually makes the column's default value an incrementing integer. That's the mechanistic element of a primary key: an auto-incrementing integer is a decent unique id.

PRIMARY KEY does a number of things.

  1. The columns are not null.
  2. The columns are unique.
  3. (Non-standard, but almost always) The columns are indexed.
  4. The columns are declared the primary key.
  5. No other columns can be declared the primary key.

Most are constraints on the value. It must exist (1), it must be unique (2), and it must be the only primary key (5). Constraints guarantee the data is as you expect. Without them you might accidentally insert multiple rows with the same primary key, or with no primary key. Or the table might have multiple "primary" keys, and then which one do you use? With constraints you don't have to check the data every time you fetch it; you know it will be within its declared constraints.
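A minimal sketch of constraints (2) and (5), assuming Python's built-in sqlite3 module as a stand-in for whatever database you actually use. The table and column names are hypothetical. SQLite treats NULL in an INTEGER PRIMARY KEY column specially, so constraint (1) is deliberately not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table; "users", "id" and "name" are illustrative names only.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")

# Constraint (2): a second row with the same primary key value is rejected.
try:
    conn.execute("INSERT INTO users (id, name) VALUES (1, 'bob')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Constraint (5): declaring two separate primary keys on one table is an error.
try:
    conn.execute("CREATE TABLE bad (a INTEGER PRIMARY KEY, b INTEGER PRIMARY KEY)")
    second_pk_rejected = False
except sqlite3.OperationalError:
    second_pk_rejected = True

# Only the first insert survived.
row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(duplicate_rejected, second_pk_rejected, row_count)
```

The point is that the database enforces these rules itself, so the application never has to re-check them on every read.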

Indexing the primary key (3) is an optimization. You're probably going to be searching by primary key a lot. Indexes aren't part of the SQL standard. That might be surprising. They're a detail of implementing SQL. Indexes don't affect the result of the query, and the SQL standard avoids telling databases how they should do things. This is part of why SQL is so ubiquitous, it is declarative.

In a declarative language you don't say how to do it, you say what you want. A SQL query says what result you want and the database figures it out for you. You could implement most of the above as constraints and triggers, but if you did, the database wouldn't understand why you're doing that and would not be able to use that to help you out. By stating to the database "this is the primary key", you let it handle that as it sees fit. The database adds the necessary constraints. The database adds its optimizations, which can be more than indexing. The database can use this information in queries to identify unique rows. You get the benefit of 50 years of database theory.

Finally, it lets people know the purpose of the column. Anyone can read the schema and know that column is the primary key. That anyone can be you six months from now.

Should primary key be auto_increment?

When designing a primary key, is it necessary to set auto_increment?

No, it's not strictly necessary. There are cases when a natural key is fine.

If done, what's the benefit?

Advantages of using an auto-increment surrogate key:

  • Surrogate keys never need to change, even if every other column in your table can change.
  • It's easier for the RDBMS to ensure uniqueness of an auto-increment key without locking and without race conditions, when you have multiple users inserting concurrently.
  • An integer is one of the most compact data types you can use for a primary key, so it results in a smaller index than using a long string, for example.
  • Efficiency of inserting into B-tree indexes (see below).
  • It's a little easier and tidier to reference a row with a single column than multiple columns, when the only other candidate key consisted of several columns.
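As a rough illustration of the first point, here is a sketch using Python's built-in sqlite3 module, where INTEGER PRIMARY KEY AUTOINCREMENT plays the role of MySQL's AUTO_INCREMENT primary key. The customers and orders tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: orders reference customers by the surrogate key.
conn.execute(
    "CREATE TABLE customers ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " email TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " item TEXT)"
)

# The database assigns the key; the application only reads it back.
cur = conn.execute("INSERT INTO customers (email) VALUES ('old@example.com')")
customer_id = cur.lastrowid
conn.execute(
    "INSERT INTO orders (customer_id, item) VALUES (?, 'book')", (customer_id,)
)

# The email changes, the key does not, so the orders table needs no update.
conn.execute(
    "UPDATE customers SET email = 'new@example.com' WHERE id = ?", (customer_id,)
)

joined = conn.execute(
    "SELECT c.email, o.item FROM orders o JOIN customers c ON c.id = o.customer_id"
).fetchall()
print(customer_id, joined)
```

Had the email been the primary key, every referencing row in orders would have needed the same update.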

Advantages of using a natural key:

  • The column has some meaning for the entity, for example a phone number. You don't need to store an extra column for the surrogate key.
  • Other tables using foreign keys to reference a natural primary key get a meaningful value, so they can avoid a join. For example, a table of shoes referencing colors would need to do a join if you wanted to get the color name. But if you use the color name as the primary key of colors, then that value would already be part of the shoes table.
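The shoes and colors example could look roughly like this. It is a sketch using Python's built-in sqlite3 module, and the column names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

# The color name itself is the (natural) primary key.
conn.execute("CREATE TABLE colors (name TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE shoes ("
    " id INTEGER PRIMARY KEY,"
    " model TEXT,"
    " color TEXT REFERENCES colors(name))"
)
conn.executemany("INSERT INTO colors (name) VALUES (?)", [("red",), ("blue",)])
conn.execute("INSERT INTO shoes (model, color) VALUES ('runner', 'red')")

# No join needed: the readable color name is already stored in shoes.
shoe_colors = conn.execute("SELECT model, color FROM shoes").fetchall()
print(shoe_colors)
```

With a surrogate color_id instead, the same query would have to join colors to get the name.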

Other cases when a surrogate auto-increment key is not needed:

  • You already have a combination of other columns (whether they are surrogate keys or natural keys) that provides a candidate key for the table. A good example is found in many-to-many tables. If a table maps movies to actors, even if both movies and actors are referenced by primary keys, then you already have a candidate key over those two columns, and you don't need yet another auto-increment column.
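A sketch of such a many-to-many table, again using Python's built-in sqlite3 module. The table name movie_actors and the id values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The pair (movie_id, actor_id) is the primary key; no extra id column.
conn.execute(
    "CREATE TABLE movie_actors ("
    " movie_id INTEGER NOT NULL,"
    " actor_id INTEGER NOT NULL,"
    " PRIMARY KEY (movie_id, actor_id))"
)
conn.execute("INSERT INTO movie_actors VALUES (1, 10)")
conn.execute("INSERT INTO movie_actors VALUES (1, 11)")  # same movie, other actor
conn.execute("INSERT INTO movie_actors VALUES (2, 10)")  # same actor, other movie

# Repeating an existing pair violates the composite primary key.
try:
    conn.execute("INSERT INTO movie_actors VALUES (1, 10)")
    pair_rejected = False
except sqlite3.IntegrityError:
    pair_rejected = True

pair_count = conn.execute("SELECT COUNT(*) FROM movie_actors").fetchone()[0]
print(pair_rejected, pair_count)
```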

I've heard that it can keep the B-tree stable, but I don't know why.

Inserting a value into an arbitrary place in the middle of a B-tree may cause a costly restructuring of the index.

There's an animated example here: http://www.bluerwhite.org/btree/

Look at the example "Inserting Key 33 into a B-Tree (w/ Split)" where it shows the steps of inserting a value into a B-tree node that overfills it, and what the B-tree does in response.

Now imagine that the example illustration only shows the bottom part of a B-tree that is much deeper (as would be the case for an index B-tree that has millions of entries). Filling the parent node can itself cause an overflow and force the splitting operation to continue up to the next higher level in the tree. This can continue all the way to the very top of the tree if all the ancestor nodes to the top of the tree were already filled.

As the nodes split and have to be restructured, they may require more space, but they're stored on some page of the database file where there's no spare space. So the storage engine has to relocate parts of the index to another part of the file, and potentially re-write a lot of pages of index just for a single INSERT.

Auto-increment values are naturally always inserted at the very rightmost edge of the B-tree. As @BrankoDimitrijevic points out in a comment, this does not make it less likely that they'll cause such laborious node-splitting and restructuring of the index. But the B-tree implementation code can optimize for this case in other ways, and some do.
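To get a feel for page splits, here is a deliberately simplified toy model in Python. It is an assumption-laden sketch, not how any real storage engine works: it models only the leaf level as a chain of sorted pages with a tiny hypothetical capacity, splits an overfull page into two halves, and ignores parent nodes and on-disk layout entirely. It simply counts splits for sequential versus shuffled insert order.

```python
import bisect
import random

def insert_all(keys, capacity=4):
    """Insert keys into a toy chain of sorted leaf pages.

    Returns (number_of_splits, pages). Only the leaf level is modelled.
    """
    pages = [[]]
    splits = 0
    for key in keys:
        # First key of every page after the first, used to find the target page.
        firsts = [page[0] for page in pages[1:]]
        i = bisect.bisect_right(firsts, key)
        bisect.insort(pages[i], key)
        if len(pages[i]) > capacity:
            # Overfull page: split it into two halves.
            mid = len(pages[i]) // 2
            pages[i:i + 1] = [pages[i][:mid], pages[i][mid:]]
            splits += 1
    return splits, pages

sequential = list(range(1, 1001))                                 # auto-increment order
shuffled = random.Random(0).sample(sequential, len(sequential))  # arbitrary order

seq_splits, seq_pages = insert_all(sequential)
rnd_splits, rnd_pages = insert_all(shuffled)
print("sequential:", seq_splits, "splits,", len(seq_pages), "pages")
print("shuffled:  ", rnd_splits, "splits,", len(rnd_pages), "pages")
```

In this naive model, sequential inserts still trigger splits regularly, only always on the rightmost page, which is the one place an implementation can special-case.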

If a table has a unique column, which is better: setting the unique column as the primary key, or adding a new column 'id' as an auto_increment primary key?

If the unique column is also non-nullable, then you can use it as a primary key. Primary keys require that all of their columns are non-nullable.

What's the point of using AUTO_INCREMENT and PRIMARY KEY at the same time in MySQL?

The primary key has three properties:

  1. It is unique.
  2. It is non-null.
  3. There is only one per table.

Defining the key as a primary key means that it should also be used for foreign key references.

In addition, MySQL (with the InnoDB storage engine) clusters the data by the primary key. So with an auto-incremented primary key, new rows go at the "end" of the table -- meaning adjacent to the most recent inserts on the data pages.

In addition, duplicate values for the auto-incremented id could be created in various ways. One way is that the increment counter can be reset, causing duplicates. MySQL should be pretty thread-safe on duplicates for concurrent updates, but bugs have been reported. As a primary key, no duplicates will be allowed into the table.

Is primary key auto increment always needed in a table?

It is not obligatory for a table to have a primary key constraint. Where a table does have a primary key, it is not obligatory for that key to be automatically generated. In some cases, there is no meaningful sense in which a given primary key even could be automatically generated.

You should be able to remove your existing primary key column from the database like so:

alter table my_table drop column id;

or perhaps you can avoid creating it in the first place.

Whether this is a wise thing to do depends on your circumstances.

Why use an auto-incrementing primary key when other unique fields exist?

Auto-incrementing primary keys are useful for several reasons:

  • They allow duplicate user names as on Stack Overflow
  • They allow the user name (or email address, if that's used to login) to be changed (easily)
  • Selects, joins and inserts are faster than with varchar primary keys, as it's much faster to maintain a numeric index
  • As you mentioned, validation becomes very simple: if ((int)$id > 0) { ... }
  • Sanitization of input is trivial: $id = (int)$_GET['id']
  • There is far less overhead as foreign keys don't have to duplicate potentially large string values
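The validation and sanitization bullets above use PHP casts. A comparable sketch in Python might look like the following, where parse_id is a hypothetical helper name. Note that Python's int() is stricter than PHP's (int) cast: it rejects input such as "12abc" instead of truncating it.

```python
def parse_id(raw):
    """Return raw as a positive integer, or None if it is not one."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return None
    return value if value > 0 else None

print(parse_id("42"))                    # a usable id
print(parse_id("1; DROP TABLE users"))   # rejected
print(parse_id("-5"))                    # rejected: not positive
```

Anything that survives parse_id is a plain integer, which can then be passed to a parameterized query.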

I would say trying to use any piece of string information as a unique identifier for a record is a bad idea when an auto-incrementing numeric key is so readily available.

Systems with unique user names are fine for very small numbers of users, but the Internet has rendered them fundamentally broken. When you consider the sheer number of people named "john" that might have to interact with a website, it's ridiculous to require each of them to use a unique display name. It leads to the awful system we see so frequently with random digits and letters decorating a username.

However, even in a system where you enforced unique usernames, it's still a poor choice for a primary key. Imagine a user with 500 posts: The foreign key in the posts table is going to contain the username, duplicated 500 times. The overhead is prohibitive even before you consider that somebody might eventually need to change their username.

Reasons not to use an auto-incrementing number for a primary key

An auto-generated ID can cause problems in situations where you are using replication (as I'm sure the techniques you've found can!). In these cases, I generally opt for a GUID.

If you are not likely to use replication, then an auto-incrementing PK will most likely work just fine.
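A small sketch of why independent counters are a problem, assuming two hypothetical nodes that each generate ids on their own without coordination. Python's uuid.uuid4() stands in for whatever GUID generator your database offers.

```python
import itertools
import uuid

# Two hypothetical nodes, each with its own auto-increment counter starting at 1.
node_a_counter = itertools.count(1)
node_b_counter = itertools.count(1)
node_a_ids = [next(node_a_counter) for _ in range(3)]
node_b_ids = [next(node_b_counter) for _ in range(3)]

# Both nodes hand out the same ids, so merging their rows would conflict.
counter_collisions = set(node_a_ids) & set(node_b_ids)

# Random GUIDs generated independently are, for practical purposes, distinct.
node_a_guids = [uuid.uuid4() for _ in range(3)]
node_b_guids = [uuid.uuid4() for _ in range(3)]
guid_collisions = set(node_a_guids) & set(node_b_guids)

print(sorted(counter_collisions), len(guid_collisions))
```

The trade-off is that a GUID is larger than an integer, which affects the index size point raised earlier.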

Pros and Cons of autoincrement keys on every table

I'm assuming that almost all tables will have a primary key - and it's just a question of whether that key consists of one or more natural keys or a single auto-incrementing surrogate key. If you aren't using primary keys, then you would generally gain a lot by adding them to almost all tables.

So, here are some pros & cons of surrogate keys. First off, the pros:

  • Most importantly: they allow the natural keys to change. Trivial example, a table of persons should have a primary key of person_id rather than last_name, first_name.
  • Read performance - very small indexes are faster to scan. However, this is only helpful if you're actually constraining your query by the surrogate key. So, good for lookup tables, not so good for primary tables.
  • Simplicity - if named appropriately, it makes the database easy to learn & use.
  • Capacity - if you're designing something like a data warehouse fact table - surrogate keys on your dimensions allow you to keep a very narrow fact table - which results in huge capacity improvements.

And cons:

  • They don't prevent duplicates of the natural values. So, you'll still usually want a unique constraint (index) on the logical key.
  • Write performance. With an extra index you're going to slow down inserts, updates and deletes that much more.
  • Simplicity - for small tables of data that almost never changes they are unnecessary. For example, if you need a list of countries you can use the ISO list of countries. It includes meaningful abbreviations. This is better than a surrogate key because it's both small and useful.
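The first con can be sketched as follows, using Python's built-in sqlite3 module. The two people tables are hypothetical and differ only in whether the logical key (email) carries a UNIQUE constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Surrogate key only: nothing stops the same email from appearing twice.
conn.execute(
    "CREATE TABLE people_loose ("
    " person_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " email TEXT NOT NULL)"
)
# Surrogate key plus a unique constraint on the logical key.
conn.execute(
    "CREATE TABLE people_strict ("
    " person_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " email TEXT NOT NULL UNIQUE)"
)

for _ in range(2):
    conn.execute("INSERT INTO people_loose (email) VALUES ('a@example.com')")
loose_count = conn.execute("SELECT COUNT(*) FROM people_loose").fetchone()[0]

conn.execute("INSERT INTO people_strict (email) VALUES ('a@example.com')")
try:
    conn.execute("INSERT INTO people_strict (email) VALUES ('a@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
strict_count = conn.execute("SELECT COUNT(*) FROM people_strict").fetchone()[0]

print(loose_count, duplicate_rejected, strict_count)
```

The unique constraint is backed by an additional index, which is the extra write cost mentioned in the second con.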

In general, surrogate keys are useful, just keep in mind the cons and don't hesitate to use natural keys when appropriate.


