How to Insert 10 Million Records in the Shortest Time Possible

How can I insert 10 million records in the shortest time possible?

Please do not create a DataTable to load via BulkCopy. That is an ok solution for smaller sets of data, but there is absolutely no reason to load all 10 million rows into memory before calling the database.

Your best bet (outside of BCP / BULK INSERT / OPENROWSET(BULK...)) is to stream the contents from the file into the database via a Table-Valued Parameter (TVP). By using a TVP you can open the file, read a row & send a row until done, and then close the file. This method has a memory footprint of just a single row. I wrote an article, Streaming Data Into SQL Server 2008 From an Application, which has an example of this very scenario.

A simplistic overview of the structure is as follows. I am assuming the same import table and field name as shown in the question above.

Required database objects:

-- First: You need a User-Defined Table Type
CREATE TYPE ImportStructure AS TABLE (Field VARCHAR(MAX));
GO

-- Second: Use the UDTT as an input param to an import proc.
-- Hence "Table-Valued Parameter" (TVP)
CREATE PROCEDURE dbo.ImportData (
   @ImportTable dbo.ImportStructure READONLY
)
AS
SET NOCOUNT ON;

-- maybe clear out the table first?
TRUNCATE TABLE dbo.DATAs;

INSERT INTO dbo.DATAs (DatasField)
    SELECT Field
    FROM   @ImportTable;

GO

C# app code that makes use of the above SQL objects is below. Notice how, rather than filling up an object (e.g. a DataTable) and then executing the Stored Procedure, in this method it is the execution of the Stored Procedure that initiates the reading of the file contents. The input parameter of the Stored Proc isn't a variable; it is the return value of a method, GetFileContents. That method is called when the SqlCommand calls ExecuteNonQuery; it opens the file, reads a row and sends the row to SQL Server via the IEnumerable<SqlDataRecord> and yield return constructs, and then closes the file. The Stored Procedure just sees a Table Variable, @ImportTable, that can be accessed as soon as the data starts coming over (note: the data does persist for a short time, even if not the full contents, in tempdb).

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.IO;
using Microsoft.SqlServer.Server;

private static IEnumerable<SqlDataRecord> GetFileContents()
{
    SqlMetaData[] _TvpSchema = new SqlMetaData[] {
        new SqlMetaData("Field", SqlDbType.VarChar, SqlMetaData.Max)
    };
    SqlDataRecord _DataRecord = new SqlDataRecord(_TvpSchema);
    StreamReader _FileReader = null;

    try
    {
        _FileReader = new StreamReader("{filePath}");

        // read a row, send a row
        while (!_FileReader.EndOfStream)
        {
            // You shouldn't need to call "_DataRecord = new SqlDataRecord" as
            // SQL Server already received the row when "yield return" was called.
            // Unlike BCP and BULK INSERT, you have the option here to create a string,
            // call ReadLine() into the string, do manipulation(s) / validation(s) on
            // the string, then pass that string into SetString(), or discard it if invalid.
            _DataRecord.SetString(0, _FileReader.ReadLine());
            yield return _DataRecord;
        }
    }
    finally
    {
        if (_FileReader != null)
        {
            _FileReader.Close();
        }
    }
}

The GetFileContents method above is used as the input parameter value for the Stored Procedure as shown below:

public static void test()
{
    SqlConnection _Connection = new SqlConnection("{connection string}");
    SqlCommand _Command = new SqlCommand("ImportData", _Connection);
    _Command.CommandType = CommandType.StoredProcedure;

    SqlParameter _TVParam = new SqlParameter();
    _TVParam.ParameterName = "@ImportTable";
    _TVParam.TypeName = "dbo.ImportStructure";
    _TVParam.SqlDbType = SqlDbType.Structured;
    _TVParam.Value = GetFileContents(); // return value of the method is streamed data
    _Command.Parameters.Add(_TVParam);

    try
    {
        _Connection.Open();

        _Command.ExecuteNonQuery();
    }
    finally
    {
        _Connection.Close();
    }

    return;
}

Additional notes:

  1. With some modification, the above C# code can be adapted to batch the data in.
  2. With minor modification, the above C# code can be adapted to send in multiple fields (the example shown in the "Streaming Data..." article linked above passes in 2 fields).
  3. You can also manipulate the value of each record in the SELECT statement in the proc.
  4. You can also filter out rows by using a WHERE condition in the proc.
  5. You can access the TVP Table Variable multiple times; it is READONLY but not "forward only".
  6. Advantages over SqlBulkCopy:
    1. SqlBulkCopy is INSERT-only whereas using a TVP allows the data to be used in any fashion: you can call MERGE; you can DELETE based on some condition; you can split the data into multiple tables; and so on.
    2. Due to a TVP not being INSERT-only, you don't need a separate staging table to dump the data into.
    3. You can get data back from the database by calling ExecuteReader instead of ExecuteNonQuery. For example, if there was an IDENTITY field on the DATAs import table, you could add an OUTPUT clause to the INSERT to pass back INSERTED.[ID] (assuming ID is the name of the IDENTITY field). Or you can pass back the results of a completely different query, or both, since multiple result sets can be sent and accessed via Reader.NextResult(). Getting info back from the database is not possible when using SqlBulkCopy, yet there are several questions here on S.O. from people wanting to do exactly that (at least with regards to the newly created IDENTITY values). A minimal sketch of this ExecuteReader variant is shown just after this list.
    4. For more info on why it is sometimes faster for the overall process, even if slightly slower on getting the data from disk into SQL Server, please see this whitepaper from the SQL Server Customer Advisory Team: Maximizing Throughput with TVP
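
To make note 6.3 concrete, here is a minimal sketch of the ExecuteReader variant. It assumes two things that are not part of the original objects above: that dbo.DATAs has an INT IDENTITY column named [ID], and that the proc's INSERT has been changed to include OUTPUT INSERTED.[ID]; adjust the names and types to your actual schema.

private static void ImportAndGetNewIds()
{
    using (SqlConnection _Connection = new SqlConnection("{connection string}"))
    using (SqlCommand _Command = new SqlCommand("ImportData", _Connection))
    {
        _Command.CommandType = CommandType.StoredProcedure;

        SqlParameter _TVParam = new SqlParameter();
        _TVParam.ParameterName = "@ImportTable";
        _TVParam.TypeName = "dbo.ImportStructure";
        _TVParam.SqlDbType = SqlDbType.Structured;
        _TVParam.Value = GetFileContents(); // same streaming method as above
        _Command.Parameters.Add(_TVParam);

        _Connection.Open();

        // ExecuteReader instead of ExecuteNonQuery: the OUTPUT INSERTED.[ID] clause
        // in the proc sends the newly generated IDENTITY values back as a result set.
        using (SqlDataReader _Reader = _Command.ExecuteReader())
        {
            while (_Reader.Read())
            {
                int newId = _Reader.GetInt32(0); // assumes [ID] is an INT IDENTITY
                // use newId as needed (e.g. correlate it with the imported row)
            }
        }
    }
}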

Fastest method to fill a database table with 10 Million rows

Using SQL to load a lot of data into a database will usually result in poor performance. In order to do things quickly, you need to go around the SQL engine. Most databases (including Firebird, I think) have the ability to back up all the data into a text (or maybe XML) file and to restore the entire database from such a dump file. Since the restoration process doesn't need to be transaction aware and the data isn't represented as SQL, it is usually very quick.

I would write a script that generates a dump file by hand, and then use the database's restore utility to load the data.

After a bit of searching I found FBExport, which seems to be able to do exactly that - you'll just need to generate a CSV file and then use the FBExport tool to import that data into your database.
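
For the CSV-generation step, here is a minimal sketch in C# (to stay consistent with the rest of this page); the output path and the two-column row layout are illustrative assumptions, so match them to whatever FBExport expects for your target table:

using System.IO;

private static void WriteCsvDump()
{
    // Hypothetical output file and row layout; one CSV line per row.
    using (StreamWriter writer = new StreamWriter(@"C:\temp\dump.csv"))
    {
        for (int i = 1; i <= 10000000; i++)
        {
            // quote/escape values as needed for your data
            writer.WriteLine(i + ",\"Value " + i + "\"");
        }
    }
}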

How to insert large amount of data into SQL Server 2008 with c# code?

Is there a db.closeconn(); somewhere after the try block that was pasted into the question? If not, that is a huge issue (i.e. connections keep being opened but never closed, which could explain why it freezes after opening 200+ of them). If there is a close-connection method being called then great, but still, opening and closing the connection for each INSERT is unnecessary, let alone horribly inefficient.

At the very least you can:

  • define the query string, SqlParameters, and SqlCommand once
  • in the loop, set the parameter values and call ExecuteNonQuery();
  • (it is also preferred to not use AddWithValue() anyway)

Example:

// this should be in a try block

strSQL = "INSERT...";
db.openconn("MOMT_Report", "Report");
cmd = new SqlCommand(strSQL, db.cn);

SqlParameter _Rptdate = new SqlParameter("@Rptdate", SqlDbType.Int);
cmd.Parameters.Add(_Rptdate);

...{repeat for remaining params}...

// optional begin transaction

for / while loop
{
    _Rptdate.Value = Rptdate;
    // set other param values
    cmd.ExecuteNonQuery();
}

// if optional transaction was started, do commit

db.closeconn(); // this should be in a finally block

However, the fastest and cleanest way to get this data inserted is to use Table-Valued Parameters (TVPs), which were introduced in SQL Server 2008. You need to create a User-Defined Table Type (one time) to define the structure, and then you can use it either in an ad hoc insert like you currently have, or pass it to a stored procedure. But this way you don't need to export to a file just to import it again; there is no need for those additional steps. (A minimal sketch of the ad hoc variant is shown after the links below.)

Rather than copy/paste a large code block, I have noted three links below where I have posted the code to do this. The first two links are the full code (SQL and C#) to accomplish this. Each is a slight variation on the theme (which shows the flexibility of using TVPs). The third is another variation, but not the full code, as it just shows the differences from one of the first two in order to fit that particular situation. But in all 3 cases, the data is streamed from the app into SQL Server. There is no creation of any additional collection or external file; you use what you currently have and only need to duplicate the values of a single row at a time to be sent over. And on the SQL Server side, it all comes through as a populated Table Variable. This is far more efficient than taking data you already have in memory and converting it to a file (which takes time and disk space), or to XML (which takes CPU and memory), or to a DataTable (for SqlBulkCopy; which takes CPU and memory), or something else, only to then rely on an external factor such as the filesystem (the files will need to be cleaned up, right?) or to need to parse out of XML.

  • How can I insert 10 million records in the shortest time possible?
  • Pass Dictionary<string,int> to Stored Procedure T-SQL
  • Storing a Dictionary<int,string> or KeyValuePair in a database
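
As a complement to the links above, here is a minimal sketch of the ad hoc variant mentioned earlier: the TVP is passed to a parameterized INSERT ... SELECT rather than to a stored procedure. The dbo.ImportStructure type and dbo.DATAs table from the first answer on this page are used as stand-ins for your actual table type and table, and GetRecords is a hypothetical method that streams whatever collection you already have (same IEnumerable<SqlDataRecord> / yield return pattern as GetFileContents above).

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using Microsoft.SqlServer.Server;

// Hypothetical source method: stream whatever you already have in memory,
// one SqlDataRecord at a time.
private static IEnumerable<SqlDataRecord> GetRecords(IEnumerable<string> source)
{
    SqlMetaData[] schema = { new SqlMetaData("Field", SqlDbType.VarChar, SqlMetaData.Max) };
    SqlDataRecord record = new SqlDataRecord(schema);

    foreach (string value in source)
    {
        record.SetString(0, value);
        yield return record;
    }
}

private static void AdHocTvpInsert(IEnumerable<string> source)
{
    // Ad hoc parameterized INSERT ... SELECT from the TVP; no stored procedure needed.
    using (SqlConnection connection = new SqlConnection("{connection string}"))
    using (SqlCommand command = new SqlCommand(
               "INSERT INTO dbo.DATAs (DatasField) SELECT Field FROM @ImportTable;",
               connection))
    {
        SqlParameter tvpParam = command.Parameters.Add("@ImportTable", SqlDbType.Structured);
        tvpParam.TypeName = "dbo.ImportStructure"; // required for ad hoc SQL (optional for procs)
        tvpParam.Value = GetRecords(source);

        connection.Open();
        command.ExecuteNonQuery();
    }
}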

Fastest way to bulk insert in SQL Table from C# SQLCLR Stored Procedure

You have a few options:

  1. Create a User-Defined Table Type (UDTT) and a T-SQL stored procedure that accepts that UDTT as a parameter (i.e. a Table-Valued Parameter (TVP)). Since you already have the data in a collection, absolutely do NOT create a separate DataTable to transfer that data into as that is just a waste of time, memory, and CPU. You can stream the data in its native collection directly into the TVP by creating a method that returns IEnumerable<SqlDataRecord>, uses yield return;, and is used as the value for the SqlParameter that represents the UDTT. I have two examples of this here on SO:

    • How can I insert 10 million records in the shortest time possible?
    • Pass Dictionary<string,int> to Stored Procedure T-SQL
  2. If there is only 1 table to populate (hence 1 result set), then you can output the collection as a result set from the stored procedure and use it as follows:

    INSERT INTO schema.table (column_list)
    EXEC schema.SQLCLR_stored_procedure @param1, ....;
  3. If there is only 1 table to populate (hence 1 result set), and if you aren't doing anything over the Context Connection that violates the rules of scalar functions, then you can try changing the SQLCLR stored procedure into a TVF, returning IEnumerable and streaming the rows out via yield return;. I wrote an article years ago with an example of this on SQL Server Central: CLR Table-Valued Function Example with Full Streaming (STVF / TVF). A minimal sketch of the streaming TVF pattern is shown after this list.
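
Purely as an illustration of option 3 (not the code from the linked article), here is a minimal sketch of a streaming SQLCLR TVF. The column definition, the names, and the GetSourceCollection placeholder are all assumptions; swap in your actual collection and schema:

using System.Collections;
using System.Collections.Generic;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public class StreamingExport
{
    // Streaming SQLCLR TVF: the method returns IEnumerable and SQL Server pulls
    // rows as they are yielded, calling FillRow once per item.
    [SqlFunction(
        FillRowMethodName = "FillRow",
        TableDefinition = "Value NVARCHAR(4000)")]
    public static IEnumerable ExportValues()
    {
        foreach (string item in GetSourceCollection())
        {
            yield return item; // streamed out one row at a time
        }
    }

    public static void FillRow(object row, out SqlString value)
    {
        value = new SqlString((string)row);
    }

    // Placeholder for the collection you already have in memory.
    private static IEnumerable<string> GetSourceCollection()
    {
        yield return "example row 1";
        yield return "example row 2";
    }
}

On the SQL Server side (after CREATE FUNCTION ... EXTERNAL NAME), this would be consumed with something like INSERT INTO schema.table (column_list) SELECT Value FROM dbo.ExportValues();.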

Adding a large amount of records using LINQ

First of all, let's separate the code into two parts. The first part creates a list of valid User records to be inserted. The second part inserts those records into the database (the last two lines of your code).

Assuming you are using Entity Framework as your ORM, the second part can be optimized by bulk inserting the records; there are many existing solutions for this that can easily be found (example). A minimal plain-EF sketch of this second part is shown after the code block below.

Here are some suggestions concerning the first part.

  1. Load the user ids into a HashSet or Dictionary. These data structures are optimized for searching: var userDbIds = new HashSet<int>(db.User.Select(x => x.Id));. You can then quickly check whether an id exists without making a request to the DB.

  2. Do the same for serialNumber. var serialNumbers = new HashSet<string>(db.SerialNumber.Select(x => x.SerialNumber)); assuming that the type of the SerialNumber property is string.

  3. Change the type of your recordToAdd variable to be Dictionary<int, User> for the same reason.

Then the check would look like this:

bool exists = userDbIds.Contains(newRecord.Id) || recordsToAdd.ContainsKey(newRecord.Id);
if (!exists)
{
    bool isSerialNumberExist = serialNumbers.Contains(newRecord.SerialNumber);
    if (isSerialNumberExist)
    {
        recordsToAdd[newRecord.Id] = newRecord;
    }
    else
    {
        resultMessages.Add("SerialNumber doesn't exist");
    }
}
else
{
    resultMessages.Add("Record already exists");
}
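
As referenced above, once recordsToAdd has been built, a minimal sketch of the second (insert) part using plain Entity Framework 6 could look like this; db.Users is an assumed DbSet name, and the configuration flags are simply switched off for the duration of the large batch:

// Minimal sketch, assuming EF6 and a DbSet named Users.
db.Configuration.AutoDetectChangesEnabled = false;
db.Configuration.ValidateOnSaveEnabled = false;
try
{
    db.Users.AddRange(recordsToAdd.Values); // recordsToAdd is the Dictionary<int, User> from point 3
    db.SaveChanges();
}
finally
{
    db.Configuration.AutoDetectChangesEnabled = true;
    db.Configuration.ValidateOnSaveEnabled = true;
}

For very large batches, a dedicated bulk-insert library (as in the example linked above) will still be faster than SaveChanges.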

Read 5 million records and finally update a column

Since you need to transfer the data to a COM object for some operation, this is what I would do:

Use a machine with lots of memory - load the data in chunks (for example 5000 or 50000 rows at a time) into memory, process it, and do the UPDATEs on the SQL Server...

For the UPDATE part use transactions and put 5000 - 20000 UPDATEs into one transaction...

[EDIT]: by partitioning the work properly and assigning 500000 or 1000000 rows to each "worker machine", you can speed this up to the maximum limit of your SQL Server... [/EDIT]
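
Here is a minimal sketch of that chunk-and-batch idea. The table name, column names, key-range paging, and the ProcessWithComObject call are all assumptions for illustration; the point is simply: read a chunk, process it, then apply the UPDATEs for that chunk inside one transaction.

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

private static void UpdateInChunks()
{
    const int chunkSize = 5000; // tune between 5000 and 50000 as suggested above
    long lastId = 0;

    using (SqlConnection connection = new SqlConnection("{connection string}"))
    {
        connection.Open();

        while (true)
        {
            // 1) load the next chunk by key range
            List<long> ids = new List<long>();
            List<string> sourceValues = new List<string>();

            using (SqlCommand select = new SqlCommand(
                "SELECT TOP (@ChunkSize) Id, SourceValue FROM dbo.BigTable " +
                "WHERE Id > @LastId ORDER BY Id;", connection))
            {
                select.Parameters.Add("@ChunkSize", SqlDbType.Int).Value = chunkSize;
                select.Parameters.Add("@LastId", SqlDbType.BigInt).Value = lastId;

                using (SqlDataReader reader = select.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        ids.Add(reader.GetInt64(0));
                        sourceValues.Add(reader.GetString(1));
                    }
                }
            }

            if (ids.Count == 0) break; // no more rows
            lastId = ids[ids.Count - 1];

            // 2) process the chunk and 3) apply all of its UPDATEs in one transaction
            using (SqlTransaction transaction = connection.BeginTransaction())
            using (SqlCommand update = new SqlCommand(
                "UPDATE dbo.BigTable SET TargetValue = @TargetValue WHERE Id = @Id;",
                connection, transaction))
            {
                SqlParameter pValue = update.Parameters.Add("@TargetValue", SqlDbType.NVarChar, 4000);
                SqlParameter pId = update.Parameters.Add("@Id", SqlDbType.BigInt);

                for (int i = 0; i < ids.Count; i++)
                {
                    pValue.Value = ProcessWithComObject(sourceValues[i]); // hypothetical COM call
                    pId.Value = ids[i];
                    update.ExecuteNonQuery();
                }

                transaction.Commit();
            }
        }
    }
}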

Another option - though not recommended (only because of theoretically possible security and/or stability issues introduced by the COM object in this specific case):

Though this is a description regarding SQL Server, something similar is possible with Oracle on Windows too.

You can put the logic of this transformation into your SQL Server by writing and installing a .NET assembly which exposes a Stored Procedure that you can call to do the transformation... the .NET assembly in turn accesses that COM object... for a how-to see http://www.sqlteam.com/article/writing-clr-stored-procedures-in-charp-introduction-to-charp-part-1

The MSDN reference link to this is http://msdn.microsoft.com/en-us/library/ms131094.aspx


