How to Do a Batch Insert in MySQL

How to do a batch insert in MySQL

From the MySQL manual:

INSERT statements that use VALUES syntax can insert multiple rows. To do this, include multiple lists of column values, each enclosed within parentheses and separated by commas. Example:

INSERT INTO tbl_name (a,b,c) VALUES(1,2,3),(4,5,6),(7,8,9);

MySQL Insert 20K rows in single insert

If you are inserting the rows from some other table, you can use the INSERT ... SELECT pattern.
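For example (the staging_tbl name and its columns are only illustrative):

-- Copy rows from another table in a single statement;
-- the target column list and the SELECT list must line up.
INSERT INTO tbl_name (a, b, c)
SELECT a, b, c
FROM staging_tbl
WHERE c > 0;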

However, if you are inserting the values with the INSERT ... VALUES pattern, the total statement size is limited by max_allowed_packet.
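If you hit that limit, you can inspect it and, with the appropriate privilege, raise it on the server; a quick sketch (the client library usually has its own max_allowed_packet setting as well):

-- Show the current limit in bytes
SHOW VARIABLES LIKE 'max_allowed_packet';

-- Raise it for new connections
SET GLOBAL max_allowed_packet = 64 * 1024 * 1024;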

Also from the docs:

To optimize insert speed, combine many small operations into a single large operation. Ideally, you make a single connection, send the data for many new rows at once, and delay all index updates and consistency checking until the very end.

Example:

INSERT INTO `table1` (`column1`, `column2`) VALUES
('d1', 'd2'),
('d1', 'd2'),
('d1', 'd2'),
('d1', 'd2'),
('d1', 'd2');

What will happen if there are errors within these 20,000 rows?

If an error occurs while inserting the records (a duplicate key, for example), the statement is aborted; on a transactional engine such as InnoDB the partial insert is rolled back, so none of that statement's rows are kept.
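If you would rather skip the offending rows than abort the whole statement, INSERT IGNORE downgrades most errors (such as duplicate keys) to warnings; for example:

-- Rows that would raise an error are skipped; the rest are inserted
INSERT IGNORE INTO tbl_name (a, b, c) VALUES (1,2,3), (4,5,6), (7,8,9);
SHOW WARNINGS;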

BULK INSERT in MySQL

In MySQL, the equivalent would be LOAD DATA INFILE:

http://dev.mysql.com/doc/refman/5.1/en/load-data.html

LOAD DATA INFILE 'C:/MyTextFile'
INTO TABLE myDatabase.MyTable
FIELDS TERMINATED BY ',';
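For a typical CSV with quoted fields and a header row, a fuller form might look like this (the file path is only illustrative; note that the server's secure_file_priv setting restricts where files may be read from):

LOAD DATA INFILE 'C:/MyTextFile.csv'
INTO TABLE myDatabase.MyTable
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;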

MySQL: What is the best way to do these multiple batch INSERTs?

then what's the point of START TRANSACTION/COMMIT? Surely that was invented to take care of the thing I'm describing, no?

Yes, exactly. In InnoDB, thanks to its MVCC architecture, writers never block readers. You don't have to worry about bulk inserts blocking readers.

The exception is if you're doing locking reads with SELECT...FOR UPDATE or SELECT...LOCK IN SHARE MODE. Those might conflict with INSERTs, depending on the data you're selecting, and whether it requires gap locks where the new data is being inserted.
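For reference, a locking read looks like this (the table and column names are just for illustration):

START TRANSACTION;
-- Takes record locks (and possibly gap locks) on the matching index range,
-- so it can conflict with concurrent INSERTs into that range
SELECT * FROM orders WHERE customer_id = 42 FOR UPDATE;
-- ... modify the selected rows ...
COMMIT;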

Likewise LOAD DATA INFILE does not block non-locking readers of the table.

You might like to see the results I got for bulk loading data in my presentation, Load Data Fast!

There's only a tiny window where something might be wrong, which is the time it takes to rename the tables.

It's not necessary to do the table-swapping for bulk INSERT, but for what it's worth, if you ever do need to do that, you can do multiple table renames in one statement. The operation is atomic, so there's no chance any concurrent transaction can sneak in between.

RENAME TABLE my_table TO my_table_old, my_table_temp TO my_table;

Re your comments:

what if I have indexes?

Let the indexes be updated incrementally as you do the INSERT or LOAD DATA INFILE. InnoDB will do this while other concurrent reads are using the index.

There is overhead to updating an index during INSERTs, but it's usually preferable to let the INSERT take a little longer instead of disabling the index.

If you disable the index, then all concurrent clients cannot use it. Other queries will slow down. Also, when you re-enable the index, this will lock the table and block other queries while it rebuilds the index. Avoid this.

why do I need to wrap the thing in "START TRANSACTION/COMMIT"?

The primary purpose of a transaction is to group changes that should be committed as one change, so that no other concurrent query sees the change in a partially-complete state. Ideally, we'd do all your INSERTs for your bulk-load in one transaction.

The secondary purpose of the transaction is to reduce overhead. If you rely on autocommit instead of explicitly starting and committing, you're still using transactions—but autocommit implicitly starts and commits one transaction for every INSERT statement. The overhead of starting and committing is small, but it adds up if you do it 1 million times.

There's also a practical, physical reason to reduce the number of individual transactions. InnoDB by default does a filesystem sync after each commit, to ensure data is safely stored on disk. This is important to prevent data loss if you have a crash. But a filesystem sync isn't free. You can only do a finite number of syncs per second (this varies based on what type of disk you use). So if you are trying to do 1 million syncs for individual transactions, but your disk can only physically do 100 syncs per second (typical for a single non-SSD hard disk), then your bulk load will take a minimum of 10,000 seconds. This is a good reason to group your bulk INSERT into batches.

So for both logical reasons of atomic updates, and physical reasons of being kind to your hardware, use transactions when you have some bulk work to do.
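Putting that into practice, the usual pattern is to send multi-row INSERTs inside explicit transactions so each batch costs one commit and one sync (a sketch, with a hypothetical events table):

START TRANSACTION;
-- Each INSERT carries many rows; the whole batch is committed once
INSERT INTO events (user_id, action) VALUES (1,'login'), (2,'login'), (3,'logout');
INSERT INTO events (user_id, action) VALUES (4,'login'), (5,'logout'), (6,'login');
COMMIT;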

However, I don't want to scare you into using transactions to group things inappropriately. Do commit your work promptly after you do some other type of UPDATE. Leaving a transaction hanging open for an unbounded amount of time is not a good idea either. MySQL can handle the rate of commits of ordinary day-to-day work. I am suggesting batching work when you need to do a bunch of bulk changes in rapid succession.

How do I do a bulk insert in MySQL using node.js

Bulk inserts are possible by using a nested array; see the GitHub page:

Nested arrays are turned into grouped lists (for bulk inserts), e.g.
[['a', 'b'], ['c', 'd']] turns into ('a', 'b'), ('c', 'd')

You just insert a nested array of elements.

An example is given here:

var mysql = require('mysql');
var conn = mysql.createConnection({
  ...
});

var sql = "INSERT INTO Test (name, email, n) VALUES ?";
var values = [
  ['demian', 'demian@gmail.com', 1],
  ['john', 'john@gmail.com', 2],
  ['mark', 'mark@gmail.com', 3],
  ['pete', 'pete@gmail.com', 4]
];
conn.query(sql, [values], function (err) {
  if (err) throw err;
  conn.end();
});

Note: values is an array of arrays wrapped in an array

[ [ [...], [...], [...] ] ]

There is also a totally different node-msql package for bulk insertion

Batch insert SQL statement

INSERT INTO tbl (col1, col2) VALUES ('val1', 'val2'), ('val3', 'val4'), ...

And please read the documentation first next time.

MySQL bulk insert on multiple tables

Finally, I've used a strategy that relies on the MySQL function LAST_INSERT_ID(), like @sticky-bit said, but using bulk inserts (one INSERT for many products), which is much faster.
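For context, after a multi-row INSERT, LAST_INSERT_ID() returns the auto-increment value generated for the first inserted row, and with innodb_autoinc_lock_mode 0 or 1 the remaining rows of that statement get consecutive ids; the script below relies on this. A small illustration:

INSERT INTO products (name) VALUES ('p1'), ('p2'), ('p3');
-- Returns the id generated for 'p1'; 'p2' and 'p3' follow it consecutively
SELECT LAST_INSERT_ID();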

I attach a simple Ruby script that performs the bulk insertions. Everything seems to work well, including with concurrent insertions.

I've run the script with innodb_autoinc_lock_mode = 2 and all seems good, but I don't know whether it is necessary to set it to 1:

require 'active_record'
require 'benchmark'
require 'mysql2'
require 'securerandom'

ActiveRecord::Base.establish_connection(
  adapter: 'mysql2',
  host: 'localhost',
  username: 'root',
  database: 'test',
  pool: 200
)

class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
end

class Product < ApplicationRecord
  has_many :product_variants
end

class ProductVariant < ApplicationRecord
  belongs_to :product
  COLORS = %w[red blue green yellow pink orange].freeze
end

def migrate
  ActiveRecord::Schema.define do
    create_table(:products) do |t|
      t.string :name
    end

    create_table(:product_variants) do |t|
      t.references :product, null: false, foreign_key: true
      t.string :color
    end
  end
end

def generate_data
  d = []
  100_000.times do
    d << {
      name: SecureRandom.alphanumeric(8),
      product_variants: Array.new(rand(1..3)).map do
        { color: ProductVariant::COLORS.sample }
      end
    }
  end
  d
end

DATA = generate_data.freeze

def bulk_insert
  # All inside a transaction
  ActiveRecord::Base.transaction do
    # Insert products
    values = DATA.map { |row| "('#{row[:name]}')" }.join(',')
    q = "INSERT INTO products (name) VALUES #{values}"
    ActiveRecord::Base.connection.execute(q)

    # Get last insert id
    q = 'SELECT LAST_INSERT_ID()'
    last_id, = ActiveRecord::Base.connection.execute(q).first

    # Insert product variants
    i = -1
    values = DATA.map do |row|
      i += 1
      row[:product_variants].map { |subrow| "(#{last_id + i},'#{subrow[:color]}')" }
    end.flatten.join(',')
    q = "INSERT INTO product_variants (product_id,color) VALUES #{values}"
    ActiveRecord::Base.connection.execute(q)
  end
end

migrate

threads = []

# Spawn 100 threads that perform 200 single inserts each
100.times do
  threads << Thread.new do
    200.times do
      Product.create(name: 'CONCURRENCY NOISE')
    end
  end
end

threads << Thread.new do
  Benchmark.bm do |benchmark|
    benchmark.report('Bulk') do
      bulk_insert
    end
  end
end

threads.map(&:join)

After running the script I've checked that all products have associated variants with the query

SELECT * 
FROM products
LEFT OUTER JOIN product_variants
ON (products.id = product_variants.product_id)
WHERE product_variants.product_id IS NULL
AND name != "CONCURRENCY NOISE";

and, as expected, I get no rows.

Go sqlmock test MySQL batch insert

The answer is, at least for MySQL, that sqlmock doesn't distinguish between rows and columns: you just list all the values from all the rows, one after the other, comma-separated, in .WithArgs.

e.g.

mock.ExpectExec("INSERT INTO product_viewers").
    WithArgs(row1col1, row1col2, row2col1, row2col2, ...).
    WillReturnResult(sqlmock.NewResult(<lastInsertID>, <numInsertedRows>))


