SQL: Do You Need an Auto-Incremental Primary Key for Many-Many Tables

SQL: Do you need an auto-incremental primary key for Many-Many tables?

ArtistFans
    ArtistID (PK)
    UserID (PK)

The use of an auto incremental PK has no advantages here, even if the parent tables have them.

I'd also create a "reverse PK" index on (UserID, ArtistID) as a matter of course: you will need it because you'll query the table by both columns.

Autonumber/ID columns have their place. You'd choose them to improve certain things after the normalisation process based on the physical platform. But not for link tables: if your braindead ORM insists, then change ORMs...

Edit, Oct 2012

It's important to note that, even with an auto-increment column, you'd still need unique (UserID, ArtistID) and (ArtistID, UserID) indexes. Adding an auto-increment column just uses more space (in memory, not just on disk) that shouldn't be used.
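A minimal sketch of the link table described above, using Python's built-in sqlite3 module as a stand-in for whatever RDBMS you run. The index name and the sample IDs are made up for illustration; the composite primary key and the reverse unique index are the parts taken from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ArtistFans (
    ArtistID INTEGER NOT NULL,
    UserID   INTEGER NOT NULL,
    PRIMARY KEY (ArtistID, UserID)          -- no separate auto-increment column
);
-- The "reverse PK" index, for lookups that start from the user side.
CREATE UNIQUE INDEX IX_ArtistFans_User_Artist ON ArtistFans (UserID, ArtistID);
""")
conn.executemany(
    "INSERT INTO ArtistFans (ArtistID, UserID) VALUES (?, ?)",
    [(1, 10), (1, 11), (2, 10)],
)

# The composite key already rejects a repeated (artist, user) pair.
try:
    conn.execute("INSERT INTO ArtistFans (ArtistID, UserID) VALUES (1, 10)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# The table is queried from both directions.
fans_of_artist_1 = [row[0] for row in conn.execute(
    "SELECT UserID FROM ArtistFans WHERE ArtistID = 1 ORDER BY UserID")]
artists_of_user_10 = [row[0] for row in conn.execute(
    "SELECT ArtistID FROM ArtistFans WHERE UserID = 10 ORDER BY ArtistID")]
```

An extra ID column would add nothing here: the pair of columns is already the key, and both access paths are already indexed.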

Auto incrementing ID as primary key on all tables

I have created a lot of tables in my time. I have used AUTO_INCREMENT in only 1/3 of them. The rest had what seemed like a "perfectly good 'natural' PK", so I went that way.

"Normal Form" is a textbook way to get you started. In real life (in my opinion), NF later takes a back seat to performance and other considerations.

For InnoDB tables, you really should have an explicit PRIMARY KEY (either auto_inc or natural).

A generic pattern where auto_inc slows things down is a many:many mapping table, as Renzo points out, and which I discuss here: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table

In InnoDB, the PRIMARY KEY is stored (clustered) with the data, so the index structure (a BTree) occupies virtually no extra space. Each secondary index occupies a separate BTree that implicitly includes the PK column(s).

Why do we use primary key auto increment, and not just auto_increment?

Because of constraints, because SQL is a declarative language, and because it is self-documenting.

AUTO_INCREMENT and PRIMARY KEY do two very different things.

AUTO_INCREMENT is not standard SQL, but most databases have something like it. It usually makes the column's default value an incrementing integer. That's the mechanistic element of a primary key: an auto-incrementing integer is a decent unique id.

PRIMARY KEY does a number of things.

1. The columns are not null.
2. The columns are unique.
3. (Non-standard, but almost always) The columns are indexed.
4. The columns are declared the primary key.
5. No other columns can be declared the primary key.

Most are constraints on the value. It must exist (1), it must be unique (2), and it must be the only primary key (5). Constraints guarantee the data is as you expect. Without them you might accidentally insert multiple rows with the same primary key, or with no primary key. Or the table might have multiple "primary" keys, and then which do you use? With constraints you don't have to check the data every time you fetch it; you know it will be within its declared constraints.
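The constraints listed above can be observed directly. This is a sketch using Python's sqlite3 module, with a hypothetical states table; other databases report these errors differently. One SQLite-specific assumption is labeled in the code: SQLite only enforces NOT NULL on a non-integer primary key when the table is declared WITHOUT ROWID (or the column is explicitly NOT NULL), which is a historical quirk rather than standard behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# WITHOUT ROWID is SQLite-specific; it makes SQLite enforce NOT NULL on the
# primary key, as the SQL standard expects. Table and data are hypothetical.
conn.execute(
    "CREATE TABLE states (code TEXT PRIMARY KEY, name TEXT) WITHOUT ROWID")
conn.execute("INSERT INTO states VALUES ('NY', 'New York')")

def rejected(sql):
    """Run a statement; return the error class name if the database refuses it."""
    try:
        conn.execute(sql)
        return None
    except sqlite3.Error as exc:
        return type(exc).__name__

# (1) the key must exist
null_key = rejected("INSERT INTO states VALUES (NULL, 'Nowhere')")
# (2) the key must be unique
duplicate_key = rejected("INSERT INTO states VALUES ('NY', 'Duplicate')")
# (5) a table cannot declare two primary keys
two_primary_keys = rejected(
    "CREATE TABLE t (a INTEGER PRIMARY KEY, b INTEGER PRIMARY KEY)")
```

All three statements are refused by the database itself, so application code never has to re-check these properties when reading rows back.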

Indexing the primary key (3) is an optimization. You're probably going to be searching by primary key a lot. Indexes aren't part of the SQL standard. That might be surprising. They're a detail of implementing SQL. Indexes don't affect the result of the query, and the SQL standard avoids telling databases how they should do things. This is part of why SQL is so ubiquitous: it is declarative.

In a declarative language you don't say how to do it, you say what you want. A SQL query says what result you want and the database figures it out for you. You could implement most of the above as constraints and triggers, but if you did, the database wouldn't understand why you're doing that and would not be able to use that to help you out. By stating to the database "this is the primary key", you let it handle that as it sees fit. The database adds the necessary constraints. The database adds its optimizations, which can be more than indexing. The database can use this information in queries to identify unique rows. You get the benefit of 50 years of database theory.

Finally, it lets people know the purpose of the column. Anyone can read the schema and know that column is the primary key. That anyone can be you six months from now.

Should primary key be auto_increment?

When designing a primary key, is it necessary to set auto_increment?

No, it's not strictly necessary. There are cases when a natural key is fine.

If done, what's the benefit?

Advantages of using an auto-increment surrogate key:

Surrogate keys never need to change, even if all other columns in your table are possible to change.
It's easier for the RDBMS to ensure uniqueness of an auto-increment key without locking and without race conditions, when you have multiple users inserting concurrently.
An integer is the most compact data type you can use for a primary key, so it results in a smaller index than a long string, for example.
Efficiency of inserting into B-tree indexes (see below).
It's a little easier and tidier to reference a row with a single column than multiple columns, when the only other candidate key consisted of several columns.
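The first advantage in the list, that a surrogate key never needs to change, can be sketched as follows. This uses Python's sqlite3 module; the persons and orders tables and their columns are hypothetical, and SQLite's AUTOINCREMENT stands in for MySQL's AUTO_INCREMENT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite needs this to enforce FKs
conn.executescript("""
CREATE TABLE persons (
    person_id INTEGER PRIMARY KEY AUTOINCREMENT,   -- surrogate key
    last_name TEXT NOT NULL                        -- natural attribute, may change
);
CREATE TABLE orders (
    order_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    person_id INTEGER NOT NULL REFERENCES persons (person_id),
    item      TEXT NOT NULL
);
""")

# The database generates the key; the application only reads it back.
person_id = conn.execute(
    "INSERT INTO persons (last_name) VALUES ('Smith')").lastrowid
conn.execute(
    "INSERT INTO orders (person_id, item) VALUES (?, 'boots')", (person_id,))

# The natural attribute changes; the reference in orders is untouched.
conn.execute(
    "UPDATE persons SET last_name = 'Jones' WHERE person_id = ?", (person_id,))

row = conn.execute("""
    SELECT p.last_name, o.item
    FROM orders AS o JOIN persons AS p ON p.person_id = o.person_id
""").fetchone()
```

Had last_name been the primary key, the rename would have had to be propagated to every referencing row in orders.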

Advantages of using a natural key:

The column has some meaning for the entity, for example a phone number. You don't need to store an extra column for the surrogate key.
Other tables using foreign keys to reference a natural primary key get a meaningful value, so they can avoid a join. For example, a table of shoes referencing colors would need to do a join if you wanted to get the color name. But if you use the color name as the primary key of colors, then that value would already be part of the shoes table.
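The shoes and colors example can be sketched like this, again with Python's sqlite3 module. The column names and sample rows are assumptions made for illustration; only the idea of using the color name itself as the key comes from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite needs this to enforce FKs
conn.executescript("""
CREATE TABLE colors (
    name TEXT PRIMARY KEY                   -- natural key: the color name itself
);
CREATE TABLE shoes (
    shoe_id INTEGER PRIMARY KEY,
    model   TEXT NOT NULL,
    color   TEXT NOT NULL REFERENCES colors (name)
);
INSERT INTO colors (name) VALUES ('black'), ('red');
INSERT INTO shoes (shoe_id, model, color) VALUES
    (1, 'runner', 'red'),
    (2, 'loafer', 'black');
""")

# No join needed: the meaningful value is already stored in shoes.
shoes_with_color = conn.execute(
    "SELECT model, color FROM shoes ORDER BY shoe_id").fetchall()

# The foreign key still restricts shoes.color to known colors.
try:
    conn.execute("INSERT INTO shoes (shoe_id, model, color) VALUES (3, 'clog', 'plaid')")
    unknown_color_rejected = False
except sqlite3.IntegrityError:
    unknown_color_rejected = True
```

The trade-off is that renaming a color now means updating every shoe that references it, which is the stability concern listed under the surrogate key advantages.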

Other cases when a surrogate auto-increment key is not needed:

You already have a combination of other columns (whether they are surrogate keys or natural keys) that provides a candidate key for the table. A good example is found in many-to-many tables. If a table maps movies to actors, even if both movies and actors are referenced by primary keys, then you already have a candidate key over those two columns, and you don't need yet another auto-increment column.

I've heard that it can keep the B-tree stable, but I don't know why.

Inserting a value into an arbitrary place in the middle of a B-tree may cause a costly restructuring of the index.

There's an animated example here: http://www.bluerwhite.org/btree/

Look at the example "Inserting Key 33 into a B-Tree (w/ Split)" where it shows the steps of inserting a value into a B-tree node that overfills it, and what the B-tree does in response.

Now imagine that the example illustration only shows the bottom part of a B-tree that is much deeper (as would be the case for an index B-tree that has millions of entries). Filling the parent node can itself cause an overflow, and force the splitting operation to continue up to the next higher level in the tree. This can continue all the way to the very top of the tree if all the ancestor nodes up to the root were already full.

As the nodes split and have to be restructured, they may require more space, but they're stored on some page of the database file where there's no spare space. So the storage engine has to relocate parts of the index to another part of the file, and potentially re-write a lot of pages of index just for a single INSERT.

Auto-increment values are naturally always inserted at the very rightmost edge of the B-tree. As @BrankoDimitrijevic points out in a comment below, this does not make it less likely that they'll cause such laborious node-splitting and restructuring to the index. But the B-tree implementation code can optimize for this case in other ways, and some do.
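To make the last point concrete, here is a toy model in Python of the leaf level only, not a real B-tree (no parent nodes, no disk pages). It inserts ascending keys, as an auto-increment column would, and compares a plain half-and-half split with one possible right-edge optimization that starts a fresh page instead. The function name, the page capacity of 10, and the specific optimization are assumptions for illustration; the text does not say which optimizations any particular engine applies.

```python
def fill_leaf_pages(n_keys, capacity, rightmost_optimization):
    """Insert keys 0..n_keys-1 in ascending order into a chain of leaf pages.

    Returns (number_of_pages, number_of_splits).
    """
    pages = [[]]
    splits = 0
    for key in range(n_keys):
        page = pages[-1]  # an ascending key always lands in the rightmost page
        page.append(key)
        if len(page) > capacity:
            splits += 1
            if rightmost_optimization:
                # Leave the full page alone; start a new page with only the new key.
                pages.append([page.pop()])
            else:
                # Plain split: move the upper half of the page to a new page.
                mid = len(page) // 2
                pages.append(page[mid:])
                del page[mid:]
    return len(pages), splits

naive = fill_leaf_pages(1000, 10, rightmost_optimization=False)
optimized = fill_leaf_pages(1000, 10, rightmost_optimization=True)
```

In this model the ascending inserts still trigger splits either way, which matches the comment cited above. The optimized variant simply moves less data per split and leaves the pages full, so it ends up with roughly half as many pages.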

If a table has a unique column, which is better: setting the unique column as the primary key, or adding a new 'id' column as an auto_increment primary key?

If the unique column is also non-nullable, then you can use it as a primary key. Primary keys require that all of their columns are non-nullable.

Does every table really need an auto-incrementing artificial primary key?

No.

In most cases, having a surrogate INT IDENTITY key is an easy option: it can be guaranteed to be NOT NULL and 100% unique, something a lot of "natural" keys don't offer - names can change, and so can SSNs and other items of information.

In the case of state abbreviations and names - if anything, I'd use the two-letter state abbreviation as a key.

A primary key must be:

unique (100% guaranteed! Not just "almost" unique)
NOT NULL

A primary key should be:

stable if at all possible (it should not change, or at least not too frequently)

Two-letter state codes would definitely offer this - that might be a candidate for a natural key. A key should also be small - an INT of 4 bytes is perfect, and a two-letter CHAR(2) column just the same. I would never use a VARCHAR(100) field or something like that as a key - it's just too clunky and will most likely change all the time, so it's not a good key candidate.

So while you don't have to have an auto-incrementing "artificial" (surrogate) primary key, it's often quite a good choice, since naturally occurring data is rarely up to the task of being a primary key, and you want to avoid having huge primary keys with several columns - those are just too clunky and inefficient.

Auto Increment Composite Primary Key

You shouldn't have the ProjectRef column at all. This violates basic rules of database normalization. If you want your front end to display the ProjectRef then just calculate it from the columns that you have.

Is primary key auto increment always needed in a table?

It is not obligatory for a table to have a primary key constraint. Where a table does have a primary key, it is not obligatory for that key to be automatically generated. In some cases, there is no meaningful sense in which a given primary key even could be automatically generated.

You should be able to remove your existing primary key column from the database like so:

alter table my_table drop column id;

or perhaps you can avoid creating it in the first place.

Whether this is a wise thing to do depends on your circumstances.

Pros and Cons of autoincrement keys on every table

I'm assuming that almost all tables will have a primary key - and it's just a question of whether that key consists of one or more natural keys or a single auto-incrementing surrogate key. If you aren't using primary keys, you will generally gain a lot by adding them to almost all tables.

So, here are some pros & cons of surrogate keys. First off, the pros:

Most importantly: they allow the natural keys to change. Trivial example, a table of persons should have a primary key of person_id rather than last_name, first_name.
Read performance - very small indexes are faster to scan. However, this is only helpful if you're actually constraining your query by the surrogate key. So, good for lookup tables, not so good for primary tables.
Simplicity - if named appropriately, it makes the database easy to learn & use.
Capacity - if you're designing something like a data warehouse fact table - surrogate keys on your dimensions allow you to keep a very narrow fact table - which results in huge capacity improvements.

And cons:

They don't prevent duplicates of the natural values. So, you'll still usually want a unique constraint (index) on the logical key.
Write performance. With an extra index you're going to slow down inserts, updates and deletes that much more.
Simplicity - for small tables of data that almost never changes they are unnecessary. For example, if you need a list of countries you can use the ISO list of countries. It includes meaningful abbreviations. This is better than a surrogate key because it's both small and useful.
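The first con above, that a surrogate key does not prevent duplicates of the natural values, is easy to demonstrate. This is a sketch using Python's sqlite3 module; the two table names and the email column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Surrogate key only: nothing guards the natural value.
CREATE TABLE persons_no_unique (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,
    email TEXT NOT NULL
);
-- Surrogate key plus a unique constraint on the logical key.
CREATE TABLE persons_unique (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,
    email TEXT NOT NULL UNIQUE
);
""")

# Both rows are accepted; each simply gets a fresh id.
for _ in range(2):
    conn.execute("INSERT INTO persons_no_unique (email) VALUES ('a@example.com')")
duplicates_stored = conn.execute(
    "SELECT COUNT(*) FROM persons_no_unique WHERE email = 'a@example.com'"
).fetchone()[0]

# With the unique constraint the second insert is refused.
conn.execute("INSERT INTO persons_unique (email) VALUES ('a@example.com')")
try:
    conn.execute("INSERT INTO persons_unique (email) VALUES ('a@example.com')")
    second_insert_rejected = False
except sqlite3.IntegrityError:
    second_insert_rejected = True
```

The unique constraint is backed by its own index, which is the extra write cost mentioned in the second con.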

In general, surrogate keys are useful, just keep in mind the cons and don't hesitate to use natural keys when appropriate.
