Fastest way to update 120 Million records
The only sane way to update a table of 120M records is with a SELECT
statement that populates a second table. You have to take care when doing this. Instructions below.
Simple Case
For a table w/out a clustered index, during a time w/out concurrent DML:
SELECT *, new_col = 1 INTO clone.BaseTable FROM dbo.BaseTable
- recreate indexes, constraints, etc on new table
- switch old and new w/ ALTER SCHEMA ... TRANSFER.
- drop old table
If you can't create a clone schema, a different table name in the same schema will do. Remember to rename all your constraints and triggers (if applicable) after the switch.
Non-simple Case
First, recreate your BaseTable
with the same name under a different schema, eg clone.BaseTable
. Using a separate schema will simplify the rename process later.
- Include the clustered index, if applicable. Remember that primary keys and unique constraints may be clustered, but not necessarily so.
- Include identity columns and computed columns, if applicable.
- Include your new INT column, wherever it belongs.
- Do not include any of the following:
- triggers
- foreign key constraints
- non-clustered indexes/primary keys/unique constraints
- check constraints or default constraints. Defaults don't make much of difference, but we're trying to keep
things minimal.
Then, test your insert w/ 1000 rows:
-- assuming an IDENTITY column in BaseTable
SET IDENTITY_INSERT clone.BaseTable ON
GO
INSERT clone.BaseTable WITH (TABLOCK) (Col1, Col2, Col3)
SELECT TOP 1000 Col1, Col2, Col3 = -1
FROM dbo.BaseTable
GO
SET IDENTITY_INSERT clone.BaseTable OFF
Examine the results. If everything appears in order:
- truncate the clone table
- make sure the database in in bulk-logged or simple recovery model
- perform the full insert.
This will take a while, but not nearly as long as an update. Once it completes, check the data in the clone table to make sure it everything is correct.
Then, recreate all non-clustered primary keys/unique constraints/indexes and foreign key constraints (in that order). Recreate default and check constraints, if applicable. Recreate all triggers. Recreate each constraint, index or trigger in a separate batch. eg:
ALTER TABLE clone.BaseTable ADD CONSTRAINT UQ_BaseTable UNIQUE (Col2)
GO
-- next constraint/index/trigger definition here
Finally, move dbo.BaseTable
to a backup schema and clone.BaseTable
to the dbo schema (or wherever your table is supposed to live).
-- -- perform first true-up operation here, if necessary
-- EXEC clone.BaseTable_TrueUp
-- GO
-- -- create a backup schema, if necessary
-- CREATE SCHEMA backup_20100914
-- GO
BEGIN TRY
BEGIN TRANSACTION
ALTER SCHEMA backup_20100914 TRANSFER dbo.BaseTable
-- -- perform second true-up operation here, if necessary
-- EXEC clone.BaseTable_TrueUp
ALTER SCHEMA dbo TRANSFER clone.BaseTable
COMMIT TRANSACTION
END TRY
BEGIN CATCH
SELECT ERROR_MESSAGE() -- add more info here if necessary
ROLLBACK TRANSACTION
END CATCH
GO
If you need to free-up disk space, you may drop your original table at this time, though it may be prudent to keep it around a while longer.
Needless to say, this is ideally an offline operation. If you have people modifying data while you perform this operation, you will have to perform a true-up operation with the schema switch. I recommend creating a trigger on dbo.BaseTable
to log all DML to a separate table. Enable this trigger before you start the insert. Then in the same transaction that you perform the schema transfer, use the log table to perform a true-up. Test this first on a subset of the data! Deltas are easy to screw up.
Best way to update 40 million rows in batch
Declare @Rowcount INT = 1;
WHILE (@Rowcount > 0)
BEGIN
UPDATE TOP (100000) [table] --<-- define Batch Size in TOP Clause
SET [New_ID] = [Old_ID]
WHERE [New_ID] <> [Old_ID]
SET @Rowcount = @@ROWCOUNT;
CHECKPOINT; --<-- to commit the changes with each batch
END
How to speed up simple UPDATE query with millions of rows?
In my experience, looping your update so that it acts on small a numbers of rows each iteration is a good way to go. The ideal number of rows to update each iteration is largely dependent on your environment and the tables you're working with. I usually stick around 1,000 - 10,000 rows per iteration.
Example
SET ROWCOUNT 1000 -- Set the batch size (number of rows to affect each time through the loop).
WHILE (1=1) BEGIN
UPDATE PF
SET NewSourceId = 1
FROM AA..Pub_ArticleFaculty PF WITH (NOLOCK)
INNER JOIN AA2..ArticleAuthorOldNew AAON WITH (NOLOCK)
ON AAON.OldFullSourceId = PF.SourceId
WHERE NewSourceId IS NULL -- Only update rows that haven't yet been updated.
-- When no rows are affected, we're done!
IF @@ROWCOUNT = 0
BREAK
END
SET ROWCOUNT 0 -- Reset the batch size to the default (i.e. all rows).
GO
fastest way to get process and upload 42 million rows
There is an Aerospike-loader tool. If you can dump your data in a csv file format, the loader can load the data into aerospike. It can read multiple CSV files in parallel and load the data into aerospike in parallel. In the internal benchmarks, on decent hardware, we could load upto 200,000 records per second. Read the docs & examples for details.
The impact of Availability of data to users while doing large updates on table
It is unclear what you need to do.
- Replace the entire table -- populate new table, then swap
- Change one column for all rows -- Sounds like sloppy design. Please elaborate on what you are doing.
- Change one column for some rows -- ditto.
- Adding a new column and initializing it -- Consider creating a parallel table, etc. This will have zero blockage but adds some complexity to your code.
- The values are computed from other columns -- consider a "generated" column. (What version of MySQL are you using?)
Here is a discussion of how to walk through a table using the PRIMARY KEY
and have minimal impact on other queries: http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks (It is written with DELETE
in mind, but the principle applies to UPDATE
, too.)
Table availability
When any operation occurs, the rows involved are "locked" to prevent other queries from modifying them at the same time. ("Locking involves multi-version control, etc, etc.) They need to stay locked until the entire "transaction" is completed. Meanwhile, any changes need to be recorded in case the server crashes or the user decides to "roll back" the changes.
So, if there are millions of rows are being changed, then millions of locks are being held. That takes time.
My blog recommends doing only 1000 rows at a time; this is usually a small enough number to have very little interference with other tasks, yet large enough to get the task finished in a reasonable amount of time.
Stock Split
Assuming the desired query (against a huge table) is something like
UPDATE t
SET price = 2 * price
WHERE date < '...'
AND ticker = '...'
You need an index (or possibly the PRIMARY KEY
) to be (ticker, date)
. Most writes are date-oriented, but most reads are ticker-oriented? Given this, the following may be optimal:
PRIMARY KEY(ticker, date),
INDEX(date, ticker)
With that the rows that need modifying by the UPDATE
are 'clustered' (consecutive) in the data's BTree. Hence there is some degree of efficiency. If, however, that is not "good enough", then it should be pretty easy to write code something like:
date_a = SELECT MIN(date) FROM t WHERE ticker = ?
SET AUTOCOMMIT=ON
Loop
date_z = date_a + 1 month
UPDATE t
SET price = 2 * price
WHERE date >= ? -- put date_a here
AND date < ? -- put date_z here
AND ticker = '...'
check for deadlock; if found, re-run the UPDATE
set date_a = date_z
exit loop when finished
End Loop
This will be reasonably fast, and have little impact on other queries. However, is someone looks at that ticker over a range of days, the prices may not be consistently updated. (If this concerns you; we can discuss further.)
Update statement running for too long or not
IF you updated 20M rows in one single transaction, then your time was entirely driven by your IO subsystem: what kind of drives you have, what disk files layout etc. If you have 40 spindles in raid 10 with 4 balanced files and a separate similar battery for the log then the result is worryingly slow. If you tested this with one single MDF that shares the spindle with the LDF on a single consumer quality 5000rpm HDD then your time is amazingly fast.
Is there a better way to split work for simple updates?
Use Update Top
You can update data as chunks using a while loop and Update Top option:
WHILE 1 = 1
BEGIN
UPDATE top (1000) tableToUpdate
SET Column1 = 'something new'
WHERE
Column1 = 'something old';
if @@ROWCOUNT < 1000 BREAK
END
When @@ROWCOUNT
is less than 1000
which is the chunk size it implies that all rows are updated.
Note That, based on the official documentation:
The rows referenced in the TOP expression used with INSERT, UPDATE, or DELETE are not arranged in any order.
Update using TOP and Order BY
If you are looking to update sorted data based on a timestamp, in the official documentation they mentioned that:
If you must use TOP to apply updates in a meaningful chronology, you must use TOP together with ORDER BY in a subselect statement.
As Example:
WHILE 1 = 1
BEGIN
UPDATE tableToUpdate
SET Column1 = 'something new'
FROM (SELECT TOP 1000 IDColumn FROM tableToUpdate WHERE tableToUpdate.Column1 = 'something old' ORDER BY TimeStamp DESC) tto
WHERE
tableToUpdate.ID = tto.ID;
if @@ROWCOUNT < 1000 BREAK
END
Other helpful links
- UPDATE (Transact-SQL) - official documentation
- How can I create a loop on an UPDATE statement that works until there is no row left to update?
- Fastest way to update 120 Million records
- How to update large table with millions of rows in SQL Server?
- Updating rows in a large table in sql server
Related Topics
How to Create Multiple One to One's
Generate_Series() Equivalent in MySQL
Required to Join 2 Tables With Their Fks in a 3Rd Table
Fast Way to Discover the Row Count of a Table in Postgresql
What Is the Simplest SQL Query to Find the Second Largest Value
What Is Self Join and When Would You Use It
How to Query a Value in SQL Server Xml Column
How to Return the Column Names of a Table
SQL - How to Store and Navigate Hierarchies
How to List the Primary Key of a SQL Server Table
Convert Varchar into Datetime in SQL Server
Insert into Multiple Tables in One Query
Line Count with in the Text Files Having Multiple Lines and Single Lines
What Is the Best Free SQL Gui for Linux for Various Dbms Systems
Postgres Not Allowing Localhost But Works with 127.0.0.1
Log Every SQL Query to Database in Rails
Rails Query Through Association Limited to Most Recent Record