Is It Better to Create an Index Before Filling a Table with Data, or After the Data Is in Place

Is it better to create an index before filling a table with data, or after the data is in place?

Creating the index after inserting the data is the more efficient way (it is even often recommended to drop the index before a batch import and recreate it after the import).

Synthetic example (PostgreSQL 9.1, slow development machine, one million rows):

CREATE TABLE test1(id serial, x integer);
INSERT INTO test1(id, x) SELECT x.id, x.id*100 FROM generate_series(1,1000000) AS x(id);
-- Time: 7816.561 ms
CREATE INDEX test1_x ON test1 (x);
-- Time: 4183.614 ms

Insert and then create index - about 12 sec

CREATE TABLE test2(id serial, x integer);
CREATE INDEX test2_x ON test2 (x);
-- Time: 2.315 ms
INSERT INTO test2(id, x) SELECT x.id, x.id*100 FROM generate_series(1,1000000) AS x(id);
-- Time: 25399.460 ms

Create index and then insert - about 25.5 sec (more than two times slower)
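The same comparison can be tried without a PostgreSQL server. What follows is only a minimal sketch using Python's built-in sqlite3 module as a stand-in. SQLite is an assumption here, not the engine measured above, and the function name time_load and the default row count are made up for illustration. Absolute timings will not match the PostgreSQL figures, and the size of the gap depends on the engine and on the order in which values arrive.

```python
import sqlite3
import time


def time_load(index_first: bool, n_rows: int = 200_000) -> tuple[float, int]:
    """Load n_rows into a fresh in-memory table; return (seconds, row count)."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in, not PostgreSQL
    conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, x INTEGER)")
    rows = ((i, i * 100) for i in range(1, n_rows + 1))

    start = time.perf_counter()
    if index_first:
        # The index exists during the load, so it is maintained row by row.
        conn.execute("CREATE INDEX test_x ON test (x)")
    with conn:  # one transaction for the whole load
        conn.executemany("INSERT INTO test (id, x) VALUES (?, ?)", rows)
    if not index_first:
        # The index is built once, over data that is already in place.
        conn.execute("CREATE INDEX test_x ON test (x)")
    elapsed = time.perf_counter() - start

    count = conn.execute("SELECT count(*) FROM test").fetchone()[0]
    conn.close()
    return elapsed, count


if __name__ == "__main__":
    for index_first in (False, True):
        seconds, count = time_load(index_first)
        label = "index, then insert" if index_first else "insert, then index"
        print(f"{label}: {count} rows in {seconds:.3f} s")
```

Both variants do the same work inside the timed region (load plus index creation), so the two numbers are directly comparable.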

Creating indexes after data load vs. before data load in a large table

Given that your table is very wide, and the indexes very narrow, creating non-clustered indexes on the table following the load should be preferred.

In this instance I would:

  1. Create the new table with the clustered index in place. Converting a heap into a clustered index afterwards is computationally expensive.
  2. Load the data into the table in the order of the clustered index key, SwapData_ID
  3. Use BULK INSERT for that load, ensuring the operation is minimally logged
  4. Create the non-clustered indexes

The above approach should be optimal given your scenario.

There are, of course, other questions around:

Data drift (will the source data change during your load process? Do these changes need to be carried across?)

DR (is log shipping enabled? In this case the recovery model may need to be changed to bulk-logged)

Log file sizing (you'll need to ensure your log file is big enough to accommodate the non-clustered index creations)

Presizing the database (ensuring it doesn't auto-grow during the load)

but these all seem to be slightly outside the context of what you're asking.

Is it better to create Oracle SQL indexes before or after data loading?

Creating indexes after loading the data is much faster. If you load data into a table with indexes, the loading will be very slow because of the constant index updates. If you create the index later, it can be efficiently populated just once (which may of course take some time, but the grand total should be smaller).

Similar logic applies to constraints. Enable those later as well (unless you expect data to fail the constraints and want to know that early on).
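The trade-off for constraints can be illustrated with a small sketch. It does not use Oracle syntax. As an assumption for the sake of a runnable example, it uses Python's built-in sqlite3 module, where foreign-key enforcement can be switched off for the load and the data validated once afterwards. The table names and the function load_then_check are hypothetical.

```python
import sqlite3


def load_then_check(children: list[tuple[int, int]]) -> list[tuple]:
    """Load child rows with foreign-key enforcement off, then validate once."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in, not Oracle
    conn.execute("PRAGMA foreign_keys = OFF")  # no per-row checks during load
    conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE child ("
        " id INTEGER PRIMARY KEY,"
        " parent_id INTEGER REFERENCES parent(id))"
    )
    with conn:  # one transaction for the whole load
        conn.executemany("INSERT INTO parent (id) VALUES (?)", [(1,), (2,)])
        conn.executemany(
            "INSERT INTO child (id, parent_id) VALUES (?, ?)", children
        )

    # Validate after the load: one row per violation, empty if the data is clean.
    violations = conn.execute("PRAGMA foreign_key_check").fetchall()
    conn.execute("PRAGMA foreign_keys = ON")  # enforce from here on
    conn.close()
    return violations
```

The cost of deferring the check is visible here: a bad row is only reported after the whole load has finished, which is the reason to keep constraints enabled when failures are expected.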

Insertion of data after creating index on empty table or creating unique index after inserting data on Oracle?

Insert your data first, then create your index.

Every time you do an UPDATE, INSERT or DELETE operation, any indexes on the table have to be updated as well. So if you create the index first, and then insert 10M rows, the index will have to be updated 10M times as well (unless you're doing bulk operations).

Most Efficient Way to Create an Index in Postgres

Your observation is correct: it is much more efficient to load the data first and only then create the index. The reason is that index updates during inserts are expensive. Building the index once, after all the data is there, is much faster.

It goes even further: if you need to import a large amount of data into an existing indexed table, it is often more efficient to drop the existing index first, import the data, and then re-create the index.
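The drop, import, re-create pattern can be sketched as a small helper. This is an illustration under stated assumptions, not PostgreSQL code: it uses Python's built-in sqlite3 module, reads the saved index definitions from SQLite's sqlite_master catalog, and the function name bulk_load_without_indexes is hypothetical. It also assumes the table name comes from trusted code, since it is interpolated into the SQL text.

```python
import sqlite3


def bulk_load_without_indexes(conn, table, rows):
    """Drop the table's explicit indexes, insert rows, then rebuild the indexes.

    Returns the names of the indexes that were dropped and re-created.
    """
    # Save the CREATE INDEX statements. sql is NULL for indexes that SQLite
    # creates automatically for PRIMARY KEY / UNIQUE constraints; those
    # cannot be dropped with DROP INDEX, so they are left alone.
    saved = conn.execute(
        "SELECT name, sql FROM sqlite_master"
        " WHERE type = 'index' AND tbl_name = ? AND sql IS NOT NULL",
        (table,),
    ).fetchall()

    for name, _ in saved:
        conn.execute(f'DROP INDEX "{name}"')
    try:
        with conn:  # one transaction for the whole import
            conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?)', rows)
    finally:
        # Re-create the indexes even if the import failed and was rolled back.
        for _, create_sql in saved:
            conn.execute(create_sql)
    return [name for name, _ in saved]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test2 (id INTEGER PRIMARY KEY, x INTEGER)")
    conn.execute("CREATE INDEX test2_x ON test2 (x)")
    rebuilt = bulk_load_without_indexes(
        conn, "test2", ((i, i * 100) for i in range(1, 1001))
    )
    print("rebuilt indexes:", rebuilt)
```

Re-creating the indexes in a finally block is a design choice for the sketch: a failed import should not leave the table silently unindexed.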

One downside of creating the index after the import is that a plain CREATE INDEX locks the table against writes for the duration of the build, and that may take a long time (there is no such lock in the opposite scenario). But in PostgreSQL 8.2 and later you can use CREATE INDEX CONCURRENTLY, which does not block writes during indexing (with some caveats).
