Is Id Column Position in PostgreSQL Important

Is id column position in PostgreSQL important?

In theory column position should not matter, but code that depends on the order of columns can fail when the schema changes.

For example:

a) Blind insert:

INSERT INTO tab_name
VALUES (1, 'b', 'c');

A blind insert is when an INSERT query doesn’t specify which columns receive the inserted data.

Why is this a bad thing?

Because the database schema may change. Columns may be moved, renamed,
added, or deleted. And when they are, one of at least three things can
happen:

  1. The query fails. This is the best-case scenario. Someone deleted a column from the target table, and now there aren’t enough columns for the insert to go into, or someone changed a data type and the inserted type isn’t compatible, and so on. But at least your data isn’t getting corrupted, and you may even know the problem exists because of an error message.

  2. The query continues to work, and nothing is wrong. This is the middle-case scenario. Your data isn’t corrupt, but the monster is still hiding under the bed.

  3. The query continues to work, but now some data is being inserted somewhere it doesn’t belong. Your data is getting corrupted.
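As a rough sketch of the safer alternative, the insert can name its target columns explicitly. The column names id, col_b, and col_c below are assumptions made for illustration; the original does not say which columns tab_name has.

```sql
-- Hypothetical table definition; the column names are assumed for illustration.
CREATE TABLE tab_name (
    id    integer PRIMARY KEY,
    col_b text,
    col_c text
);

-- Naming the target columns ties each value to a column by name,
-- so the statement stays correct if columns are later added or reordered.
INSERT INTO tab_name (id, col_b, col_c)
VALUES (1, 'b', 'c');
```

If a named column is later dropped or renamed, this statement fails with an error instead of silently writing values into the wrong column.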

b) ORDER BY ordinal:

SELECT *
FROM tab
ORDER BY 1;
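ORDER BY 1 sorts by the first column of the select list, and with SELECT * that is whichever column happens to come first in the table. A minimal sketch of the position-independent form, assuming the table has a column named id (the original does not say):

```sql
-- Sorting by name does not depend on where the column sits in the table.
-- "id" is an assumed column name.
SELECT *
FROM tab
ORDER BY id;
```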

How do I alter the position of a column in a PostgreSQL database table?

"Alter column position" in the PostgreSQL Wiki says:

PostgreSQL currently defines column
order based on the attnum column of
the pg_attribute table. The only way
to change column order is either by
recreating the table, or by adding
columns and rotating data until you
reach the desired layout.
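To see the order the wiki is describing, the catalog can be queried directly. This is a sketch that assumes a table named tab_name exists in the current search path:

```sql
-- List the columns of tab_name in their stored order.
-- attnum > 0 skips system columns; attisdropped skips dropped columns.
SELECT attname, attnum
FROM pg_attribute
WHERE attrelid = 'tab_name'::regclass
  AND attnum > 0
  AND NOT attisdropped
ORDER BY attnum;
```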

That's pretty weak, but in their defense, in standard SQL, there is no solution for repositioning a column either. Database brands that support changing the ordinal position of a column are defining an extension to SQL syntax.
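The first option the wiki mentions, recreating the table, might look roughly like the sketch below. The table and column names (tab_name, tab_name_new, id, col_b, col_c) are hypothetical, and the sketch assumes nothing else depends on the old table.

```sql
BEGIN;

-- New table with the columns in the desired order.
CREATE TABLE tab_name_new (
    id    integer PRIMARY KEY,
    col_c text,
    col_b text
);

-- Copy the data, naming columns on both sides so the mapping is explicit.
INSERT INTO tab_name_new (id, col_c, col_b)
SELECT id, col_c, col_b
FROM tab_name;

-- DROP TABLE fails if views or foreign keys still reference tab_name.
DROP TABLE tab_name;
ALTER TABLE tab_name_new RENAME TO tab_name;

COMMIT;
```

Indexes, constraints, privileges, and triggers defined on the old table are not carried over by this approach and would have to be recreated by hand.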

One other idea occurs to me: you can define a VIEW that specifies the order of columns how you like it, without changing the physical position of the column in the base table.
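A minimal sketch of the view approach, assuming a hypothetical base table tab_name with columns id, col_b, and col_c:

```sql
-- The view presents the columns in a different order;
-- the base table itself is left unchanged.
CREATE VIEW tab_name_ordered AS
SELECT col_c, col_b, id
FROM tab_name;

-- Columns come back in the order the view defines.
SELECT *
FROM tab_name_ordered;
```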

Is it necessary to create id column in SQL table?

No, it is not necessary, but it is recommended for anything other than an association table.

An identity column provides a unique and unchanging identifier for each row, which makes setting up foreign key relations quite easy.

An association table usually does not have an identity column because it holds no data of its own; it generally consists of two or more foreign key columns.
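As an illustration of the difference, here is a sketch with two ordinary tables and one association table. All table and column names are hypothetical, and the GENERATED ... AS IDENTITY syntax assumes PostgreSQL 10 or later.

```sql
-- Ordinary tables: each row gets its own identity column.
CREATE TABLE author (
    id   integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE book (
    id    integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title text NOT NULL
);

-- Association table: no id column of its own.
-- The pair of foreign keys is the primary key.
CREATE TABLE book_author (
    book_id   integer NOT NULL REFERENCES book (id),
    author_id integer NOT NULL REFERENCES author (id),
    PRIMARY KEY (book_id, author_id)
);
```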

Is ID column required in SQL?

If you really do have some pre-existing column in your data set that already does uniquely identify your row - then no, there's no need for an extra ID column. The primary key however must be unique (in ALL circumstances) and cannot be empty (must be NOT NULL).

In my 20+ years of experience in database design, however, this is almost never truly the case. Most "natural" IDs that appear to be unique ultimately aren't. US Social Security Numbers aren't guaranteed to be unique, and most other "natural" keys end up being almost unique - and that's just not good enough for a database system.

So if you really do have a proper, unique key in your data already - use it! But most of the time, it's easier and more convenient to have just a single surrogate ID that you can guarantee will be unique over all rows.
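One way to combine the two ideas is sketched below, with a hypothetical customer table: the surrogate ID is the primary key, and the "natural" candidate keeps its own UNIQUE constraint. The names are assumptions, and the identity syntax assumes PostgreSQL 10 or later.

```sql
-- Surrogate key as the primary key; the natural candidate (email)
-- is still checked for uniqueness, but nothing references it.
CREATE TABLE customer (
    customer_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email       text NOT NULL UNIQUE,
    full_name   text NOT NULL
);
```

If the natural value later turns out not to be unique after all, the UNIQUE constraint can be dropped without touching the primary key or any foreign keys that point at it.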

Is there any reason to worry about the column order in a table?

Column order had a big performance impact on some of the databases I've tuned, spanning SQL Server, Oracle, and MySQL. This post has good rules of thumb:

  • Primary key columns first
  • Foreign key columns next
  • Frequently searched columns next
  • Frequently updated columns later
  • Nullable columns last
  • Least used nullable columns after more frequently used nullable columns

An example of the difference in performance is an index lookup. The database engine finds a row based on some conditions in the index, and gets back a row address. Now say you are looking for SomeValue, and it's in this table:

SomeId int,
SomeString varchar(100),
SomeValue int

The engine cannot know in advance where SomeValue starts, because SomeString has a variable length; it has to work out that length first. However, if you change the order to:

SomeId int,
SomeValue int,
SomeString varchar(100)

Now the engine knows that SomeValue can be found 4 bytes after the start of the row. So column order can have a considerable performance impact.

EDIT: SQL Server 2005 stores fixed-length fields at the start of the row, and each row has a reference to the start of a varchar. This completely negates the effect I've listed above. So for recent databases, column order no longer has this impact.

PostgreSQL table with one ID column, sorted index, with duplicate primary key

You said

Many parts of my system would mark documents as dirty and therefore
insert IDs to process into that table. Therefore duplicates must be
possible.

and

5 rows with the same ID mean the same thing as 1 or 10 rows with that
same ID: They mean that the document with that ID is dirty.

You don't need duplicates for that. If the only purpose of this table is to identify dirty documents, a single row containing the document's id number is sufficient. There's no compelling reason to allow duplicates.

A single row for each ID number is not sufficient if you need to track which process inserted that row, or order rows by the time they were inserted, but a single column isn't sufficient for that in the first place. So I'm sure a primary key constraint or unique constraint would work fine for you.

Other processes have to ignore duplicate key errors, but that's simple. Those processes have to trap errors anyway--there are a lot of things besides a duplicate key that can prevent an insert statement from succeeding.
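On PostgreSQL 9.5 and later, the duplicate can also be skipped in the statement itself rather than trapped as an error. The sketch below uses a separate, hypothetical table name (dirty_documents_demo) so that it stands alone:

```sql
-- Hypothetical table with a primary key on the document ID.
CREATE TABLE dirty_documents_demo (
    document_id integer PRIMARY KEY
);

-- The first insert adds the row.
INSERT INTO dirty_documents_demo (document_id)
VALUES (42)
ON CONFLICT (document_id) DO NOTHING;

-- The second insert hits the existing key and is silently skipped
-- instead of raising a duplicate key error.
INSERT INTO dirty_documents_demo (document_id)
VALUES (42)
ON CONFLICT (document_id) DO NOTHING;
```

Other errors, such as a lost connection, still have to be handled by the inserting process.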


An implementation that allows duplicates . . .

create table dirty_documents (
document_id integer not null
);

create index on dirty_documents (document_id);

Insert 100k ID numbers into that table for testing. This will necessarily require updating the index. (Duh.) Include a bunch of duplicates.

insert into dirty_documents 
select generate_series(1,100000);

insert into dirty_documents
select generate_series(1, 100);

insert into dirty_documents
select generate_series(1, 50);

insert into dirty_documents
select generate_series(88000, 93245);

insert into dirty_documents
select generate_series(83000, 87245);

Took less than a second on my desktop, which isn't anything special, and which is running three different database servers, two web servers, and playing a Rammstein CD.

Pick the first dirty document ID number for cleaning up.

select min(document_id) 
from dirty_documents;

document_id
--
1

Took only 0.136 ms. Now let's delete every row that has document ID 1.

delete from dirty_documents
where document_id = 1;

Took 0.272 ms.

Let's start over.

drop table dirty_documents;
create table dirty_documents (
document_id integer primary key
);

insert into dirty_documents
select generate_series(1,100000);

Took 500 ms. Let's find the first one again.

select min(document_id) 
from dirty_documents;

Took 0.054 ms. That's less than half the time it took using a table that allowed duplicates.

delete from dirty_documents
where document_id = 1;

Also took 0.054 ms. That's roughly 5 times faster than the other table (0.272 ms).

Let's start over again, and try an unindexed table.

drop table dirty_documents;
create table dirty_documents (
document_id integer not null
);

insert into dirty_documents
select generate_series(1,100000);

insert into dirty_documents
select generate_series(1, 100);

insert into dirty_documents
select generate_series(1, 50);

insert into dirty_documents
select generate_series(88000, 93245);

insert into dirty_documents
select generate_series(83000, 87245);

Get the first document.

select min(document_id) 
from dirty_documents;

Took 32.5 ms. Delete those documents . . .

delete from dirty_documents
where document_id = 1;

Took 12 ms.

All of this took me 12 minutes. (I used a stopwatch.) If you want to know what performance will be, build tables and write tests.
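One way to collect timings like the ones above is EXPLAIN ANALYZE, sketched here against the dirty_documents table from the tests. The numbers will differ from machine to machine.

```sql
-- In psql, "\timing on" prints the elapsed time of every statement.

-- EXPLAIN ANALYZE runs the statement and reports its actual execution time.
EXPLAIN ANALYZE
SELECT min(document_id)
FROM dirty_documents;

-- EXPLAIN ANALYZE really executes the DELETE, so wrap it in a
-- transaction and roll back to keep the test data intact.
BEGIN;
EXPLAIN ANALYZE
DELETE FROM dirty_documents
WHERE document_id = 1;
ROLLBACK;
```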

Does column order matter when defining unique constraints

The order matters if you expect queries to filter on only the leading column(s) of the index. For example, suppose you had a unique index on (col1, col2), and you wanted to optimize the following query:

SELECT col1, col2 FROM foo WHERE col1 = 'stack';

The index on (col1, col2) could still be used here, because col1, which appears in the WHERE clause, is the leftmost portion of the index. Had you defined the unique constraint on (col2, col1), the index could not be used efficiently for this query, because the WHERE clause does not constrain its leading column.
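A small sketch for checking this on your own data. The table definition is an assumption, since the original only names foo, col1, and col2.

```sql
-- Hypothetical definition of foo; the UNIQUE constraint creates
-- an index on (col1, col2), in that column order.
CREATE TABLE foo (
    col1 text NOT NULL,
    col2 text NOT NULL,
    UNIQUE (col1, col2)
);

-- EXPLAIN shows the plan without running the query.
-- Whether the planner picks the index depends on table size and statistics;
-- on a tiny or empty table it may choose a sequential scan anyway.
EXPLAIN
SELECT col1, col2
FROM foo
WHERE col1 = 'stack';
```

Repeating the experiment with the constraint declared as UNIQUE (col2, col1) shows how the plan changes when col1 is no longer the leading column.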


