Best Practices for Inserting/Updating Large Amount of Data in SQL Server 2008

Best practices for inserting/updating large amount of data in SQL Server 2008

Seeing that you're using SQL Server 2008, I would recommend this approach:

  • first, bulk-copy your CSV files into a staging table
  • update your target table from that staging table using the MERGE command

Check out the MSDN docs and a great blog post on how to use the MERGE command.

Basically, you create a link between your actual data table and the staging table on a common criterion (e.g. a shared primary key), and then you can define what to do when

  • the rows match, e.g. the row exists in both the source and the target table --> typically you'd either update some fields, or just ignore it altogether
  • the row from the source doesn't exist in the target --> typically a case for an INSERT

You would have a MERGE statement something like this:

MERGE TargetTable AS t
USING SourceTable AS src
    ON t.PrimaryKey = src.PrimaryKey

WHEN NOT MATCHED THEN
    INSERT (list of fields)
    VALUES (list of values)

WHEN MATCHED THEN
    UPDATE
        SET (list of SET statements)
;

Of course, the ON clause can be much more involved if needed, and your WHEN clauses can also be more complex, e.g.

WHEN MATCHED AND (some other condition) THEN ......

and so forth.
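
For instance, a minimal sketch of that conditional form (all table and column names here are hypothetical placeholders): only update when the staging row is newer, and remove target rows that no longer exist in the source:

MERGE dbo.Customer AS t
USING dbo.Customer_Staging AS src
    ON t.CustomerID = src.CustomerID
WHEN MATCHED AND src.ModifiedDate > t.ModifiedDate THEN
    UPDATE SET t.Name = src.Name,
               t.ModifiedDate = src.ModifiedDate
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, Name, ModifiedDate)
    VALUES (src.CustomerID, src.Name, src.ModifiedDate)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;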

MERGE is a very powerful and very useful new command in SQL Server 2008 - use it, if you can!

How to update large table with millions of rows in SQL Server?

  1. You should not be updating 10k rows in a set unless you are certain that the operation is getting Page Locks (due to multiple rows per page being part of the UPDATE operation). The issue is that Lock Escalation (from either Row or Page to Table locks) occurs at 5000 locks. So it is safest to keep it just below 5000, just in case the operation is using Row Locks.

  2. You should not be using SET ROWCOUNT to limit the number of rows that will be modified. There are two issues here:

    1. It has been deprecated since SQL Server 2005 was released (11 years ago):

      Using SET ROWCOUNT will not affect DELETE, INSERT, and UPDATE statements in a future release of SQL Server. Avoid using SET ROWCOUNT with DELETE, INSERT, and UPDATE statements in new development work, and plan to modify applications that currently use it. For a similar behavior, use the TOP syntax

    2. It can affect more than just the statement you are dealing with:

      Setting the SET ROWCOUNT option causes most Transact-SQL statements to stop processing when they have been affected by the specified number of rows. This includes triggers. The ROWCOUNT option does not affect dynamic cursors, but it does limit the rowset of keyset and insensitive cursors. This option should be used with caution.

    Instead, use the TOP () clause (a short before-and-after example follows this list).

  3. There is no purpose in having an explicit transaction here. It complicates the code and you have no handling for a ROLLBACK, which isn't even needed since each statement is its own transaction (i.e. auto-commit).

  4. Assuming you find a reason to keep the explicit transaction, then you do not have a TRY / CATCH structure. Please see my answer on DBA.StackExchange for a TRY / CATCH template that handles transactions:

    Are we required to handle Transaction in C# Code as well as in Store procedure
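
That before-and-after looks roughly like this (dbo.SomeTable and its WHERE clause are placeholders for illustration only):

-- Deprecated pattern:
-- SET ROWCOUNT 2000;
-- DELETE FROM dbo.SomeTable WHERE SomeColumn = 'abc';
-- SET ROWCOUNT 0;

-- Preferred pattern:
DELETE TOP (2000)
FROM dbo.SomeTable
WHERE SomeColumn = 'abc';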

I suspect that the real WHERE clause is not being shown in the example code in the Question, so simply relying upon what has been shown, a better model (please see note below regarding performance) would be:

DECLARE @Rows INT,
        @BatchSize INT; -- keep below 5000 to be safe

SET @BatchSize = 2000;

SET @Rows = @BatchSize; -- initialize just to enter the loop

BEGIN TRY
    WHILE (@Rows = @BatchSize)
    BEGIN
        UPDATE TOP (@BatchSize) tab
        SET    tab.Value = 'abc1'
        FROM   TableName tab
        WHERE  tab.Parameter1 = 'abc'
        AND    tab.Parameter2 = 123
        AND    tab.Value <> 'abc1' COLLATE Latin1_General_100_BIN2;
        -- Use a binary Collation (ending in _BIN2, not _BIN) to make sure
        -- that you don't skip differences that compare the same due to
        -- insensitivity of case, accent, etc, or linguistic equivalence.

        SET @Rows = @@ROWCOUNT;
    END;
END TRY
BEGIN CATCH
    RAISERROR(stuff);
    RETURN;
END CATCH;

By testing @Rows against @BatchSize, you can avoid that final UPDATE query (in most cases) because the final set is typically some number of rows less than @BatchSize, in which case we know that there are no more to process (which is what you see in the output shown in your answer). Only in those cases where the final set of rows is equal to @BatchSize will this code run a final UPDATE affecting 0 rows.

I also added a condition to the WHERE clause to prevent rows that have already been updated from being updated again.

NOTE REGARDING PERFORMANCE

I emphasized "better" above (as in, "this is a better model") because this has several improvements over the O.P.'s original code, and works fine in many cases, but is not perfect for all cases. For tables of at least a certain size (which varies due to several factors so I can't be more specific), performance will degrade as there are fewer rows to fix if either:

  1. there is no index to support the query, or
  2. there is an index, but at least one column in the WHERE clause is a string data type that does not use a binary collation; the COLLATE clause added to the query here to force a binary collation then invalidates the index (for this particular query).

This is the situation that @mikesigs encountered, thus requiring a different approach. The updated method copies the IDs for all rows to be updated into a temporary table, then uses that temp table to INNER JOIN to the table being updated on the clustered index key column(s). (It's important to capture and join on the clustered index columns, whether or not those are the primary key columns!).

Please see @mikesigs answer below for details. The approach shown in that answer is a very effective pattern that I have used myself on many occasions. The only changes I would make are:

  1. Explicitly create the #targetIds table rather than using SELECT INTO...
  2. For the #targetIds table, declare a clustered primary key on the column(s).
  3. For the #batchIds table, declare a clustered primary key on the column(s).
  4. For inserting into #targetIds, use INSERT INTO #targetIds (column_name(s)) SELECT and remove the ORDER BY as it's unnecessary.

So, if you don't have an index that can be used for this operation, and can't temporarily create one that will actually work (a filtered index might work, depending on your WHERE clause for the UPDATE query), then try the approach shown in @mikesigs answer (and if you use that solution, please up-vote it).
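
For reference, here is a minimal sketch of that temp-table pattern with the changes listed above applied. This is one way to realize the idea, not the exact code from that answer; the table name, the ID column, and the assumption that the clustered index is on ID are all placeholders:

-- Capture the key(s) of every row that needs fixing (clustered index column(s) only).
CREATE TABLE #targetIds (ID INT NOT NULL PRIMARY KEY CLUSTERED);

INSERT INTO #targetIds (ID)
SELECT tab.ID
FROM   dbo.TableName tab
WHERE  tab.Parameter1 = 'abc'
AND    tab.Parameter2 = 123
AND    tab.Value <> 'abc1' COLLATE Latin1_General_100_BIN2;

CREATE TABLE #batchIds (ID INT NOT NULL PRIMARY KEY CLUSTERED);

DECLARE @Rows INT = 1;

WHILE (@Rows > 0)
BEGIN
    TRUNCATE TABLE #batchIds;

    -- peel off the next batch of keys
    DELETE TOP (2000)
    FROM   #targetIds
    OUTPUT Deleted.ID INTO #batchIds (ID);

    SET @Rows = @@ROWCOUNT;

    IF (@Rows > 0)
    BEGIN
        -- join back to the base table on the clustered index column(s)
        UPDATE tab
        SET    tab.Value = 'abc1'
        FROM   dbo.TableName tab
        INNER JOIN #batchIds b
                ON b.ID = tab.ID;
    END;
END;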

How to do very fast inserts to SQL Server 2008

ExecuteNonQuery with an INSERT statement, or even a stored procedure, will get you into the thousands-of-inserts-per-second range on Express. 4000-5000 inserts/sec are easily achievable; I know this for a fact.

What usually slows down individual updates is the wait time for the log flush, and you need to account for that. The easiest solution is to simply batch your commits, e.g. commit every 1000 inserts, or every second. This will fill up the log pages and amortize the cost of the log-flush wait over all the inserts in a transaction.
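
To make the batching concrete, here is a minimal T-SQL sketch (dbo.Orders and its columns are made-up names; the same principle applies when the loop lives in client code calling ExecuteNonQuery): committing every 1000 inserts lets all of them share a single log flush at each COMMIT.

DECLARE @i INT = 0;

BEGIN TRANSACTION;

WHILE (@i < 100000)
BEGIN
    INSERT INTO dbo.Orders (OrderId, CreatedAt)
    VALUES (@i, SYSDATETIME());

    SET @i += 1;

    IF (@i % 1000 = 0)  -- batch boundary: one log flush covers the last 1000 rows
    BEGIN
        COMMIT TRANSACTION;
        BEGIN TRANSACTION;
    END;
END;

IF (@@TRANCOUNT > 0)
    COMMIT TRANSACTION;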

With batch commits you'll probably bottleneck on disk log write performance, and there is nothing you can do about that short of changing the hardware (going RAID 0 stripe on the log).

If you hit earlier bottlenecks (unlikely), then you can look into batching statements, i.e. sending one single T-SQL batch with multiple inserts in it. But this seldom pays off.

Of course, you'll need to reduce the size of your writes to a minimum, meaning reduce the width of your table to the minimally needed columns, eliminate non-clustered indexes, eliminate unneeded constraints. If possible, use a Heap instead of a clustered index, since Heap inserts are significantly faster than clustered index ones.

There is little need to use the fast insert interface (i.e. SqlBulkCopy). Using ordinary INSERTs and ExecuteNonQuery with batch commits, you'll saturate the drive's sequential write throughput long before you need to deploy bulk insert. Bulk insert is needed on fast, SAN-connected machines, and you mention Express, so that's probably not the case here. There is a perception to the contrary out there, but it exists simply because people don't realize that bulk insert gives them batch commit, and it's the batch commit that speeds things up, not the bulk insert itself.

As with any performance test, make sure you eliminate randomness and pre-allocate the database and the log; you don't want to hit a database or log growth event during test measurements or in production, that is sooo amateurish.

Speed up update of 185k rows in SQL Server 2008?

Use SqlBulkCopy (to a temporary table) followed by MERGE.

The bulk-copy method can efficiently transfer data using a "push" from a .NET client. Using BULK INSERT requires a "pull" from the server (along with the required local-file access).

Then the MERGE command (in SQL Server 2008+) can be used to insert/update/upsert the data from the temporary table to the target table according to the desired rules. Since the data is entirely in the database at this point, this operation will be as fast as can be. Using MERGE may also result in performance advantages over many individual commands, even those within the same transaction.
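
A rough sketch of the SQL side of that flow (the table and column names are placeholders, and the temporary table is assumed to be filled from the client via SqlBulkCopy on the same connection):

-- 1) Create a staging table shaped like the incoming data.
CREATE TABLE #Staging
(
    Id    INT           NOT NULL PRIMARY KEY CLUSTERED,
    Value NVARCHAR(100) NOT NULL
);

-- 2) Bulk-load #Staging from the client (SqlBulkCopy with DestinationTableName = '#Staging').

-- 3) Upsert from the staging table into the target in one set-based statement.
MERGE dbo.TargetTable AS t
USING #Staging AS s
    ON t.Id = s.Id
WHEN MATCHED THEN
    UPDATE SET t.Value = s.Value
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Value)
    VALUES (s.Id, s.Value);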

See also:

  • Any way to SQLBulkCopy "insert or update if exists"?
  • SQL Server 2008 INSERT Optimization

Best practice for storing millions of rows with TSQL (Sql Server 2008)

Based on your comment, I would put a clustered index on (Sensor, Timestamp).

This will always cover searches on Sensor alone, and it will also cover both fields checked in combination.

If you want to ever search for Timestamp alone, you can add a nonclustered index there as well.

One issue you will have with this design is the need to rebuild the table (i.e. the clustered index) periodically, since you are going to be inserting rows non-sequentially: the new rows won't always belong at the end of the index.

Also, please do not name a field timestamp - this is a keyword in SQL Server and can cause you all kinds of issues if you don't delimit it everywhere.
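
A minimal sketch of that layout (the table name, column names, and data types below are assumptions made for illustration):

CREATE TABLE dbo.SensorReadings
(
    SensorId    INT            NOT NULL,
    ReadingTime DATETIME2(3)   NOT NULL,  -- deliberately not named "timestamp"
    Value       DECIMAL(18, 4) NOT NULL
);

CREATE CLUSTERED INDEX IX_SensorReadings_Sensor_Time
    ON dbo.SensorReadings (SensorId, ReadingTime);

-- Only if you also need to search by time alone:
CREATE NONCLUSTERED INDEX IX_SensorReadings_Time
    ON dbo.SensorReadings (ReadingTime);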

Fast batch insert/update with SQL Server 2008 and BCP

In C#, look at SqlBulkCopy. It's what SSIS uses in the background.

For true bcp/BULK INSERT, you'd need bulkadmin rights, which may not be allowed.
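
If that route is open to you, a hedged example of the server-side load looks like this (the file path, target table, and CSV layout are assumptions; the file must be readable by the SQL Server service account):

BULK INSERT dbo.TargetTable
FROM 'C:\data\input.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,   -- skip the header row
    TABLOCK                -- helps qualify for minimal logging
);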

SQL Server 2008 - performance tuning features for inserting large amounts of data

I don't know if this is feasible for your problem, but if you can I would really try to develop this in code.

I had a similar problem on a big project in the past that needed to import 15 years' worth of production data into a new schema (in SQL Server 2005).
System.Data.SqlClient.SqlBulkCopy was by far the fastest option.

If you do go this way, I suggest doing the inserts in batches of roughly 1 GB at a time, then manually calling the .NET GC to free up the in-memory table. I was forced to do both of these things to avoid running into memory errors (32-bit system, though).

Edit: pseudocode for my solution was something like this:

// Namespaces assumed: System, System.Data, System.Data.SqlClient.
// Connection strings, table names, and column ordinals are placeholders.
var dataToInsert = new DataTable("NewTable");
dataToInsert.Columns.Add("Id", typeof(int));
dataToInsert.Columns.Add("ReadingTime", typeof(DateTime));
dataToInsert.Columns.Add("Value", typeof(int));

using (var oldConnection = new SqlConnection(oldDbConnectionString))
using (var sqlCommand = new SqlCommand("SELECT * FROM OldTable", oldConnection))
{
    oldConnection.Open();

    using (SqlDataReader dataFromOldSystem = sqlCommand.ExecuteReader())
    {
        while (dataFromOldSystem.Read())
        {
            // I had to modify/transpose each old row into several new rows
            dataToInsert.Rows.Add(dataFromOldSystem.GetInt32(0), dataFromOldSystem.GetDateTime(1), dataFromOldSystem.GetInt32(2));
            dataToInsert.Rows.Add(dataFromOldSystem.GetInt32(0), dataFromOldSystem.GetDateTime(1), dataFromOldSystem.GetInt32(3));
            dataToInsert.Rows.Add(dataFromOldSystem.GetInt32(0), dataFromOldSystem.GetDateTime(1), dataFromOldSystem.GetInt32(4));

            // check if the number of rows is over some magic number that stays below the memory limit;
            // watching the private bytes used by your app helps you guess this number
            if (dataToInsert.Rows.Count > 1000000)
            {
                using (var bulkCopier = new SqlBulkCopy(newDbConnectionString))
                {
                    bulkCopier.DestinationTableName = "dbo.NewTable";
                    bulkCopier.WriteToServer(dataToInsert);
                }

                // release the in-memory batch and let the GC reclaim it before building the next one
                dataToInsert.Clear();
                GC.Collect();
            }
        }
    }
}

// flush whatever is left over after the loop
if (dataToInsert.Rows.Count > 0)
{
    using (var bulkCopier = new SqlBulkCopy(newDbConnectionString))
    {
        bulkCopier.DestinationTableName = "dbo.NewTable";
        bulkCopier.WriteToServer(dataToInsert);
    }
}

