Some Sort of "Different Auto-Increment Indexes" per Primary Key Value


MySQL's MyISAM engine can do this. See the manual, in the section Using AUTO_INCREMENT:

For MyISAM tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index. In this case, the generated value for the AUTO_INCREMENT column is calculated as MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is useful when you want to put data into ordered groups.

The docs go on after that paragraph, showing an example.

The InnoDB engine in MySQL does not support this feature, which is unfortunate because it's better to use InnoDB in almost all cases.

You can't emulate this behavior using triggers (or any SQL statements limited to transaction scope) without locking tables on INSERT. Consider this sequence of actions:

  1. Mario starts a transaction and inserts a new row for user 4.
  2. Bill starts a transaction and inserts a new row for user 4.
  3. Mario's session fires a trigger to compute MAX(id)+1 for user 4. He gets 3.
  4. Bill's session fires a trigger to compute MAX(id)+1 for user 4. He also gets 3.
  5. Bill's session finishes his INSERT and commits.
  6. Mario's session tries to finish his INSERT, but the row with (userid=4, id=3) now exists, so Mario gets a primary key conflict.

In general, you can't control the order of execution of these steps without some kind of synchronization.

The solutions to this are either:

  • Get an exclusive table lock. Before trying an INSERT, lock the table. This is necessary to prevent concurrent INSERTs from creating a race condition like the one in the example above. It's necessary to lock the whole table: since you're trying to restrict INSERT, there's no specific row to lock (if you were trying to govern access to a given row with UPDATE, you could lock just that row). But locking the table makes access to it serial, which limits your throughput.

  • Do it outside transaction scope. Generate the id number in a way that won't be hidden from two concurrent transactions. By the way, this is what AUTO_INCREMENT does. Two concurrent sessions will each get a unique id value, regardless of their order of execution or order of commit. But tracking the last generated id per userid requires access to the database, or a duplicate data store. For example, a memcached key per userid, which can be incremented atomically.
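
One way to implement that second approach inside the database is a separate per-user counter table driven by MySQL's LAST_INSERT_ID(expr) trick. The sketch below is illustrative only; the table and column names are invented here. If the counter update runs as its own autocommit statement, outside the inserting transaction, concurrent sessions get distinct values, though gaps can still appear if the subsequent INSERT fails or rolls back.

CREATE TABLE user_counters (
    userid INT NOT NULL PRIMARY KEY,
    last_id INT NOT NULL DEFAULT 0
);

-- Seed one counter row per user (user 4 in this example).
INSERT INTO user_counters (userid) VALUES (4);

-- Allocate the next id for user 4; run this with autocommit on,
-- outside the transaction that performs the real INSERT.
UPDATE user_counters
SET last_id = LAST_INSERT_ID(last_id + 1)
WHERE userid = 4;

SELECT LAST_INSERT_ID();  -- the value just allocated to this session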

It's relatively easy to ensure that inserts get unique values. But it's hard to ensure they will get consecutive ordinal values. Also consider:

  • What happens if you INSERT in a transaction but then roll back? Suppose one transaction allocated id value 3 and a concurrent one allocated 4; if the first rolls back and the second commits, there's now a gap.
  • What happens if an INSERT fails because of other constraints on the table (e.g. another column is NOT NULL)? You could get gaps this way too.
  • If you ever DELETE a row, do you need to renumber all the following rows for the same userid? What does that do to your memcached entries if you use that solution?

Why do we use primary key auto increment, and not just auto_increment?

Because of constraints, because SQL is a declarative language, and because it is self-documenting.

AUTO_INCREMENT and PRIMARY KEY do two very different things.

AUTO_INCREMENT is not standard SQL, but most databases have something like it. It usually makes the column's default value an incrementing integer. That's the mechanistic element of a primary key: an auto-incrementing integer is a decent unique id.

PRIMARY KEY does a number of things.

  1. The columns are not null.
  2. The columns are unique.
  3. (Non-standard, but almost always) The columns are indexed.
  4. The columns are declared the primary key.
  5. No other columns can be declared the primary key.

Most of these are constraints on the value. It must exist (1), it must be unique (2), and it must be the only primary key (5). Constraints guarantee the data is as you expect. Without them, you might accidentally insert multiple rows with the same primary key, or rows with no primary key. Or the table might have multiple "primary" keys; which one do you use? With constraints, you don't have to check the data every time you fetch it; you know it will be within its declared constraints.

Indexing the primary key (3) is an optimization; you're probably going to be searching by primary key a lot. Indexes aren't part of the SQL standard. That might be surprising, but they're a detail of implementing SQL. Indexes don't affect the result of a query, and the SQL standard avoids telling databases how they should do things. This is part of why SQL is so ubiquitous: it is declarative.

In a declarative language you don't say how to do it, you say what you want. A SQL query says what result you want and the database figures out how to produce it. You could implement most of the above with constraints and triggers, but if you did, the database wouldn't understand why you're doing it and couldn't use that knowledge to help you out. By telling the database "this is the primary key", you let it handle that as it sees fit. The database adds the necessary constraints. It adds its own optimizations, which can go beyond indexing. It can use this information in queries to identify unique rows. You get the benefit of 50 years of database theory.
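
To make the contrast concrete, here is a minimal sketch (the table and columns are invented for the example) of declaring a primary key versus hand-assembling its mechanical ingredients. Both give you a non-nullable, unique, indexed, auto-incrementing column, but only the first tells the database your intent.

-- Declarative: the database knows this column identifies rows.
CREATE TABLE users (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    email VARCHAR(255) NOT NULL
);

-- Hand-rolled approximation: same mechanics, but the database is
-- never told that id is the primary key.
CREATE TABLE users_no_pk (
    id INT NOT NULL AUTO_INCREMENT,
    email VARCHAR(255) NOT NULL,
    UNIQUE KEY (id)
);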

Finally, it lets people know the purpose of the column. Anyone can read the schema and know that column is the primary key. That anyone can be you six months from now.

Auto-increment a primary key in MySQL

I'm pretty sure this is by design. If you had IDs up to 6 in your table and you deleted ID 2, would you want the next inserted row to get an ID of 2? That doesn't seem to follow the ACID properties. Also, if other data depends on those IDs (for example, if they identify users), reusing them would invalidate pre-existing information: if user X were deleted and the same ID were later assigned to user Y, dependent systems that still reference that ID could run into integrity problems.

Also, imagine a table with 50 billion rows. Should the table run an O(n) search for the smallest missing ID every time you're trying to insert a new record? I can see that getting out of hand really quickly.
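
For a sense of what that lookup would involve, here is one common form of a smallest-gap query (a sketch against a hypothetical table named users with an integer id column). It has to walk the existing ids until it finds a hole, so on a huge table it is expensive to run on every insert.

SELECT t1.id + 1 AS candidate_id
FROM users AS t1
LEFT JOIN users AS t2 ON t2.id = t1.id + 1
WHERE t2.id IS NULL      -- no row holds the next id, so there is a gap here
ORDER BY t1.id
LIMIT 1;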

Some links you might like to read:

  • Principles of Transaction-Oriented Database Recovery (1983)
  • How can we re-use the deleted id from any MySQL-DB table?

Auto increment MySQL decimal number problems

The MySQL auto-increment mechanism only increments by whole integers. Sorry, that's the way it is implemented.

The best way to design your Case table in MySQL is this:

CREATE TABLE Cases (
    case_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    client_id INT NOT NULL,
    ...other attributes of the case...
    FOREIGN KEY (client_id) REFERENCES Client (client_id)
);

It will have one auto-increment counter for the table, and all clients will need to share this number. This means the case numbers won't always be consecutive for a given client, and they won't start at 1 for each client. Sorry, that's the way auto-increment works in MySQL.

The question has been asked many times, with some variation of "how can I make an auto-increment that renumbers for each group?" You could read the MAX(case_id) for the given client for which you need to insert a case, and then use that max case_id + 1 in your INSERT. In other words, forget about the auto-increment feature and calculate the id yourself.

You have to lock the table while doing this to avoid race conditions; two concurrent users could be inserting at the same time, read the same value for MAX(case_id), and try to insert the same value.
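
If you do go that route, a minimal sketch looks like the following. It assumes you replace the auto-increment case_id with a composite primary key (client_id, case_num); case_num is an invented column name for the per-client number.

LOCK TABLES Cases WRITE;

-- Read the current maximum for this client into a session variable.
SELECT COALESCE(MAX(case_num), 0) + 1 INTO @next_case
FROM Cases
WHERE client_id = 3;

INSERT INTO Cases (client_id, case_num) VALUES (3, @next_case);

UNLOCK TABLES;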

Your plan of using decimal numbers will lead to problems.

  • What if one day you have a client with more than 999 cases? You'd have to reformat all your case ids, not only for the client with 1000 cases, but for all clients. Any references to the case ids that you had sent out in paper statements and reports would become invalid.

  • How would you do an SQL query to search for all cases for a given client? If you had client_id in its own column, it would be a query like SELECT ... FROM Case WHERE client_id = 3 but if you have to do a query like ... WHERE case_id BETWEEN 3.000 AND 3.999 it's less clear and harder to optimize. It's also harder to explain to a new programmer you hire for the project. If you end up extending the id format to 4 digits past the decimal, you'd have to rewrite all these SQL queries.

InnoDB clustered index performance when using random values as primary key

InnoDB:

With an AUTO_INCREMENT PRIMARY KEY, the "next" row will be put at the "end" of the BTree that holds the data for the table. This is efficient, and the "last" block will be updated a lot.

Note: Blocks are kept in the buffer_pool, to be eventually written to disk.

With a "random" PK, such as GUID, UUID, MD5, SHA1, etc, the "next" row to be inserted needs to go into some 'random' place in the BTree that holds the data. If the buffer_pool is big enough, then the necessary block will still be sitting in it. So the efficiency is not much different than with AI.

On the other hand, if the data is too big to fit in the buffer_pool (or other activity keeps bumping the blocks out), then an insert will need to fetch the block before modifying it.

If, for example, the table is 20 times as big as can be held in the buffer_pool, then the next random write will have a chance of 1 in 20 of the block being cached. That is, 95% of the time an INSERT has to wait for a disk read.
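
As a hypothetical illustration, here are two InnoDB tables that differ only in how the clustered primary key value is generated (names invented; UUID_TO_BIN() assumes MySQL 8.0). Inserts into the first always land at the right edge of the clustered index; inserts into the second land at effectively random points.

-- Sequential PK: each insert appends at the "end" of the BTree.
CREATE TABLE events_seq (
    id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    payload VARCHAR(255)
) ENGINE=InnoDB;

-- Random PK: each insert lands at an arbitrary point in the BTree.
CREATE TABLE events_rand (
    id BINARY(16) NOT NULL PRIMARY KEY,
    payload VARCHAR(255)
) ENGINE=InnoDB;

INSERT INTO events_seq (payload) VALUES ('example');
INSERT INTO events_rand (id, payload) VALUES (UUID_TO_BIN(UUID()), 'example');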

But... you asked about INSERTs. What about SELECTs? What pattern, if any, is there to the SELECTs? If access is 'random' anyway, then the type of PK does not matter. If, on the other hand, the SELECTs tend to reach for "recent" items (e.g., news articles), then AI wins for large tables because of the increased likelihood of the desired block being cached.

Cluster

A Comment implies some confusion over "cluster/ed/ing". Some definitions (in a MySQL/MariaDB context):

  • A group of servers with identical data, working together: NDB Cluster vs. Galera Cluster vs. Clustrix (a third-party offering).
  • A "clustered index" is when the data is attached to the index. In InnoDB, the PK is always clustered with the data. (Note: MyISAM, and other vendors do not necessarily do this.)
  • When records to be fetched are next to each other in the on-disk layout (by the ordering of the PK or a secondary index), those rows are "clustered together". This is worth noting because fetching one block gets several rows that you need.

So, back to the Comment:

  • Jumping around in the PRIMARY KEY (because of what I called a random PK, or simply because rows are not fetched in any relevant order) means jumping around in the table.
  • A UUID has a "sorted order", but it is not useful for much of anything.

Updating a table to increment a value per user id

Fixing the table once can be done with an UPDATE:

-- @n tracks the previous row's node_id; @p tracks the running position within that node.
SET @n = 0, @p = 1;

UPDATE node_items
SET position = (@p := IF(node_id = @n, @p + 1, 1)),  -- restart at 1 when node_id changes, otherwise increment
    node_id = (@n := node_id)                         -- remember this row's node_id for the next row
ORDER BY node_id, id;

Making the table maintain the position values as you insert/update/delete data is harder. Basically, you can't do it while allowing concurrent updates. You have to lock the table in every session that writes to the table, which makes concurrent sessions run serially.

You can read my old post about this here: Some sort of “different auto-increment indexes” per a primary key values

Should primary key be auto_increment?

I want to know: when designing a primary key, is it necessary to set auto_increment?

No, it's not strictly necessary. There are cases when a natural key is fine.

If done, what's the benefit?

Advantages of using an auto-increment surrogate key:

  • Surrogate keys never need to change, even if every other column in your table is subject to change.
  • It's easier for the RDBMS to ensure uniqueness of an auto-increment key without locking and without race conditions, when you have multiple users inserting concurrently.
  • An integer is the most compact data type you can use for a primary key, so it results in a smaller index than a long string would, for example.
  • Efficiency of inserting into B-tree indexes (see below).
  • It's a little easier and tidier to reference a row with a single column than with multiple columns, when the only other candidate key consists of several columns.

Advantages of using a natural key:

  • The column has some meaning for the entity, for example a phone number. You don't need to store an extra column for the surrogate key.
  • Other tables using foreign keys to reference a natural primary key get a meaningful value, so they can avoid a join. For example, a table of shoes referencing colors would need to do a join if you wanted to get the color name. But if you use the color name as the primary key of colors, then that value would already be part of the shoes table.
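
A small sketch of that shoes/colors example (schema invented here) shows the difference: because the color name is the primary key of colors, the shoes table already carries a readable value and no join is needed to display it.

CREATE TABLE colors (
    color_name VARCHAR(20) NOT NULL PRIMARY KEY
);

CREATE TABLE shoes (
    shoe_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    color_name VARCHAR(20) NOT NULL,
    FOREIGN KEY (color_name) REFERENCES colors (color_name)
);

-- The color name is already in the shoes table; no join required.
SELECT shoe_id, color_name FROM shoes;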

Other cases when a surrogate auto-increment key is not needed:

  • You already have a combination of other columns (whether they are surrogate keys or natural keys) that provides a candidate key for the table. A good example is found in many-to-many tables. If a table maps movies to actors, even if both movies and actors are referenced by primary keys, then you already have a candidate key over those two columns, and you don't need yet another auto-increment column.
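
A minimal sketch of such a mapping table (names invented for the example): the pair of foreign keys is itself the primary key, and no extra auto-increment column is needed.

CREATE TABLE movies (
    movie_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL
);

CREATE TABLE actors (
    actor_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL
);

CREATE TABLE movie_actors (
    movie_id INT NOT NULL,
    actor_id INT NOT NULL,
    PRIMARY KEY (movie_id, actor_id),
    FOREIGN KEY (movie_id) REFERENCES movies (movie_id),
    FOREIGN KEY (actor_id) REFERENCES actors (actor_id)
);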

I have heard that this can keep B-trees stable, but I don't know why?

Inserting a value into an arbitrary place in the middle of a B-tree may cause a costly restructuring of the index.

There's an animated example here: http://www.bluerwhite.org/btree/

Look at the example "Inserting Key 33 into a B-Tree (w/ Split)" where it shows the steps of inserting a value into a B-tree node that overfills it, and what the B-tree does in response.

Now imagine that the example illustration only shows the bottom part of a B-tree that is much deeper (as would be the case for an index B-tree with millions of entries). Filling the parent node can itself cause an overflow, forcing the splitting operation to continue up to the next higher level of the tree. This can continue all the way to the very top of the tree if all the ancestor nodes were already full.

As the nodes split and have to be restructured, they may require more space, but they're stored on some page of the database file where there's no spare space. So the storage engine has to relocate parts of the index to another part of the file, and potentially re-write a lot of pages of index just for a single INSERT.

Auto-increment values are naturally always inserted at the very rightmost edge of the B-tree. As @BrankoDimitrijevic points out in a comment, this does not make it less likely that they'll cause such laborious node-splitting and restructuring of the index. But the B-tree implementation code can optimize for this case in other ways, and some do.

If a table has a unique column, which is better: set the unique column as the primary key, or add a new column 'id' as an auto_increment primary key?

If the unique column is also non-nullable, then you can use it as a primary key. Primary keys require that all of their columns are non-nullable.


