How to Update Thousands of Records in a MySQL DB in Milliseconds

How do I update thousands of records in a MySQL DB in milliseconds?

Rather than batching the individual UPDATEs, you could batch INSERTs into a temporary table (with rewriteBatchedStatements=true set on the connection) and then use a single UPDATE joined against that table to update the main table. On my machine, with a local MySQL instance, the following code takes about 2.5 seconds ...

long t0 = System.nanoTime();
conn.setAutoCommit(false);

String sql = "UPDATE personal_info SET result=?, score=?, uploadState=? WHERE CandidateID=?";
PreparedStatement ps = conn.prepareStatement(sql);
String tag = "X";
for (int i = 1; i <= 10000; i++) {
    ps.setString(1, String.format("result_%s_%d", tag, i));
    ps.setInt(2, 200000 + i);
    ps.setString(3, String.format("state_%s_%d", tag, i));
    ps.setInt(4, i);
    ps.addBatch();
}
ps.executeBatch();
conn.commit();
System.out.printf("%d ms%n", (System.nanoTime() - t0) / 1000000);

... while this version takes about 1.3 seconds:

long t0 = System.nanoTime();
conn.setAutoCommit(false);

Statement st = conn.createStatement();
st.execute("CREATE TEMPORARY TABLE tmp (CandidateID INT, result VARCHAR(255), score INT, uploadState VARCHAR(255))");

String sql = "INSERT INTO tmp (result, score, uploadState, CandidateID) VALUES (?,?,?,?)";
PreparedStatement ps = conn.prepareStatement(sql);
String tag = "Y";
for (int i = 1; i <= 10000; i++) {
    ps.setString(1, String.format("result_%s_%d", tag, i));
    ps.setInt(2, 400000 + i);
    ps.setString(3, String.format("state_%s_%d", tag, i));
    ps.setInt(4, i);
    ps.addBatch();
}
ps.executeBatch();
sql = "UPDATE personal_info pi INNER JOIN tmp ON tmp.CandidateID=pi.CandidateID "
        + "SET pi.result=tmp.result, pi.score=tmp.score, pi.uploadState=tmp.uploadState";
st.execute(sql);
conn.commit();
System.out.printf("%d ms%n", (System.nanoTime() - t0) / 1000000);
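
For the INSERT batching to be rewritten into multi-row statements, rewriteBatchedStatements=true must be set as a Connector/J connection property, typically on the JDBC URL. A minimal sketch of the connection setup (host, schema name, and credentials are placeholders):

// rewriteBatchedStatements=true lets Connector/J collapse the addBatch()ed
// INSERTs into multi-row INSERT statements on the wire.
Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/test?rewriteBatchedStatements=true",
        "user", "password");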

Maximum records a table can handle to get results in 0 seconds while searching by primary key

Let's change the question to "How can I speed up my key-value table with the following specs?".

You can get about 1 disk hit per 10ms on a rotating HDD (as opposed to an SSD).

So the answer is:

  • If neither the index block nor the data block is cached: 0 rows in 10ms, because MyISAM needs 2 disk hits per row (50 rows/sec).
  • If either, but not both, of those blocks is cached: 1 row in 10ms (100 rows/sec).
  • As long as both blocks are cached for the desired rows: hundreds of rows (thousands/sec).

In 50ms, MyISAM can deliver only 2-3 rows in the worst case. Perhaps we should switch to "How many rows per second?" as the metric.

Next clarification. Are you talking about many connections, each asking for a single row? Or are you talking about one connection asking for consecutive (according to PRIMARY KEY) rows?

For more speed:

  • Don't use CHAR unless the strings are actually fixed length. Use VARCHAR. This shrinks the index and data, thereby making them more cacheable.

  • Change from MyISAM to InnoDB. InnoDB "clusters" the PK with the data. That is, when you find the PK, the data is right there. This eliminates the first case, above. Now the "worst case" is 5 rows in 50ms. The best case is perhaps better than MyISAM. (There are a lot of benchmarks out there; probably none exactly matches your situation.)
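
A minimal sketch of those two changes in MySQL; the table name, column name, and VARCHAR length are hypothetical and must match your actual schema:

-- CHAR -> VARCHAR: shrinks data and index, making them more cacheable
ALTER TABLE kv MODIFY k VARCHAR(64) NOT NULL;
-- MyISAM -> InnoDB: clusters the PK with the data, saving one lookup
ALTER TABLE kv ENGINE=InnoDB;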

Another clarification needed: Will you be updating rows? Deleting rows? Adding new rows? These matter when it comes to fragmentation (which MyISAM w/VARCHAR is severely subject to and InnoDB is only slightly subject to).

Size analysis:

  • MyISAM w/CHAR: Data: 75 bytes/row * 903M rows + Index: ~60*903M = ~ 120GB. Too much to be fully cached, even too much to keep the index in RAM (key_buffer_size).

  • MyISAM w/VARCHAR: Not knowing the typical size, nor the churn, I hesitate to compute. But I suspect it would still be too much for 64GB of RAM.

  • InnoDB w/VARCHAR: No index needed other than PRIMARY KEY(key). Still the footprint might be ~120GB. So, again, it cannot be fully cached (innodb_buffer_pool_size).

Next clarification: How 'random' are the key values used? Will you be repeating the same ones a lot? Or are they like UUIDs/MD5s (very random), so that you will be bouncing around a lot?

If very random, then let's analyze the likelihood of something being in cache. Let's say the lookup is in an index that is twice as big as what is cached in RAM. This means that only half the time you will find the item in cache. Now my answer is...

  • With 50% cached, the other 50% of lookups cost a disk hit, so InnoDB averages about 10 rows per 50ms (roughly 200 rows per second, on average, on HDD).
  • If the index were 20 times as big as the cache, 95% of lookups would be cache misses; that is, you would be seriously I/O-bound (about 105 rows/sec for InnoDB).

Another clarification... Can the key and/or the value be compressed in any way?

  • Hex can be UNHEXed to half its size.
  • Is CHAR defaulting to utf8 when ascii would suffice? (Factor of 3 in space for CHAR!! I did not consider this in the computation above.)
  • Are the strings the sort that would benefit from COMPRESS()? In some studies, I have seen "almost any text longer than 10 bytes can benefit from COMPRESS". English text or code or XML typically shrinks by about 3x. Your 45 chars might shrink to 20.
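
A minimal sketch of both ideas, assuming a hypothetical kv(k, v) table whose keys are 32 hex characters:

CREATE TABLE kv (
  k BINARY(16) NOT NULL PRIMARY KEY,   -- 32 hex chars stored via UNHEX(): half the bytes
  v BLOB                               -- value stored via COMPRESS()
) ENGINE=InnoDB;

INSERT INTO kv (k, v)
VALUES (UNHEX('00112233445566778899aabbccddeeff'), COMPRESS('some longish text value'));

SELECT UNCOMPRESS(v)
FROM kv
WHERE k = UNHEX('00112233445566778899aabbccddeeff');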

If you could cut the table size in half, now you have a chance of caching everything. This would take you from 200 rows/second (InnoDB) to thousands.

Move milliseconds by one position value

Best option: reload the data from the original source, after you fix your code.

If you no longer have access to the original data, and you must fix it all in place, use the UPDATE statement below (shown in context):

create table tbl ( ts timestamp );

insert into tbl ( ts ) values ( timestamp '2018-06-26 11:15:43.0950' );
commit;

select ts from tbl;

TS
----------------------------------------
2018-06-26 11.15.43.095000000

update tbl set ts = ts + 9 * (ts - cast(ts as timestamp(0)));

1 row updated.

commit;

select ts from tbl;

TS
----------------------------------------
2018-06-26 11.15.43.950000000

Explanation:

If your original timestamp was of the form X + w, where X is the timestamp truncated to whole seconds and w is the fractional-second part, then the current (wrong) value is X + z, where z = w/10. (You added an errant 0 right after the decimal point, which means you divided w by ten.) So: you currently have X + z but you want X + w, or in other words X + 10 * z. Therefore, you must add 9 * z to what you already have.

To get z (the fractional part of the timestamp) you need to subtract X (the integral part) from the timestamp. X itself is the truncation of the timestamp to whole seconds. There is no TRUNC() function to truncate to whole seconds, but the CAST function to TIMESTAMP(0) does that job quite well.

To use your sample data: X is the timestamp '2018-06-26 11:15:43.000000'. This is also the result of cast(ts as timestamp(0)). w is .9500 and z is what made it into your table, .0950. Your current value is X + z, but you want X + w. That is X + 10 * z = (X + z) + 9 * z, and now remember that (X + z) is just ts (the value you have in the table currently) so you only need to add nine times the value z, which is the difference (ts - X).
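
Before running the UPDATE against the real table, you can preview the correction with a SELECT that uses the same expression:

select ts,
       ts + 9 * (ts - cast(ts as timestamp(0))) as fixed_ts
from tbl;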

Fastest way to update 120 Million records

The only sane way to update a table of 120M records is with a SELECT statement that populates a second table. You have to take care when doing this. Instructions below.


Simple Case

For a table w/out a clustered index, during a time w/out concurrent DML:

  • SELECT *, new_col = 1 INTO clone.BaseTable FROM dbo.BaseTable
  • recreate indexes, constraints, etc on new table
  • switch old and new w/ ALTER SCHEMA ... TRANSFER.
  • drop old table

If you can't create a clone schema, a different table name in the same schema will do. Remember to rename all your constraints and triggers (if applicable) after the switch.
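
Put together, the simple case might look like the following sketch; the clone and backup_20100914 schemas are assumed to exist already, and new_col stands in for your real new column:

SELECT *, new_col = 1
INTO clone.BaseTable
FROM dbo.BaseTable
GO
-- recreate indexes, constraints, triggers etc on clone.BaseTable here
GO
ALTER SCHEMA backup_20100914 TRANSFER dbo.BaseTable
ALTER SCHEMA dbo TRANSFER clone.BaseTable
GO
DROP TABLE backup_20100914.BaseTable -- only once you have verified the new table
GO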


Non-simple Case

First, recreate your BaseTable with the same name under a different schema, e.g. clone.BaseTable. Using a separate schema will simplify the rename process later.

  • Include the clustered index, if applicable. Remember that primary keys and unique constraints may be clustered, but not necessarily so.
  • Include identity columns and computed columns, if applicable.
  • Include your new INT column, wherever it belongs.
  • Do not include any of the following:

    • triggers
    • foreign key constraints
    • non-clustered indexes/primary keys/unique constraints
    • check constraints or default constraints. Defaults don't make much of a difference, but we're trying to keep things minimal.

Then, test your insert w/ 1000 rows:

-- assuming an IDENTITY column in BaseTable
SET IDENTITY_INSERT clone.BaseTable ON
GO
INSERT clone.BaseTable WITH (TABLOCK) (Col1, Col2, Col3)
SELECT TOP 1000 Col1, Col2, Col3 = -1
FROM dbo.BaseTable
GO
SET IDENTITY_INSERT clone.BaseTable OFF

Examine the results. If everything appears in order:

  • truncate the clone table
  • make sure the database is in the bulk-logged or simple recovery model
  • perform the full insert.
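
The full load is the same statement as the 1000-row test, without the TOP (a sketch, reusing the placeholder column names from above):

SET IDENTITY_INSERT clone.BaseTable ON
GO
INSERT clone.BaseTable WITH (TABLOCK) (Col1, Col2, Col3)
SELECT Col1, Col2, Col3 = -1
FROM dbo.BaseTable
GO
SET IDENTITY_INSERT clone.BaseTable OFF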

This will take a while, but not nearly as long as an update. Once it completes, check the data in the clone table to make sure everything is correct.

Then, recreate all non-clustered primary keys/unique constraints/indexes and foreign key constraints (in that order). Recreate default and check constraints, if applicable. Recreate all triggers. Recreate each constraint, index, or trigger in a separate batch, e.g.:

ALTER TABLE clone.BaseTable ADD CONSTRAINT UQ_BaseTable UNIQUE (Col2)
GO
-- next constraint/index/trigger definition here

Finally, move dbo.BaseTable to a backup schema and clone.BaseTable to the dbo schema (or wherever your table is supposed to live).

-- -- perform first true-up operation here, if necessary
-- EXEC clone.BaseTable_TrueUp
-- GO
-- -- create a backup schema, if necessary
-- CREATE SCHEMA backup_20100914
-- GO
BEGIN TRY
    BEGIN TRANSACTION
    ALTER SCHEMA backup_20100914 TRANSFER dbo.BaseTable
    -- -- perform second true-up operation here, if necessary
    -- EXEC clone.BaseTable_TrueUp
    ALTER SCHEMA dbo TRANSFER clone.BaseTable
    COMMIT TRANSACTION
END TRY
BEGIN CATCH
    SELECT ERROR_MESSAGE() -- add more info here if necessary
    ROLLBACK TRANSACTION
END CATCH
GO

If you need to free up disk space, you may drop your original table at this time, though it may be prudent to keep it around a while longer.

Needless to say, this is ideally an offline operation. If you have people modifying data while you perform this operation, you will have to perform a true-up operation with the schema switch. I recommend creating a trigger on dbo.BaseTable to log all DML to a separate table. Enable this trigger before you start the insert. Then in the same transaction that you perform the schema transfer, use the log table to perform a true-up. Test this first on a subset of the data! Deltas are easy to screw up.
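
A minimal sketch of such a logging trigger; the ID key column and the log table layout are hypothetical, and a real true-up would need whatever columns the delta replay requires:

CREATE TABLE dbo.BaseTable_DmlLog (
    ID       INT      NOT NULL,                  -- assumed key column of BaseTable
    Action   CHAR(1)  NOT NULL,                  -- 'I', 'U' or 'D'
    LoggedAt DATETIME NOT NULL DEFAULT GETDATE()
)
GO
CREATE TRIGGER dbo.BaseTable_LogDml
ON dbo.BaseTable
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON

    -- inserted holds new rows for INSERTs and the new image of UPDATEd rows
    INSERT dbo.BaseTable_DmlLog (ID, Action)
    SELECT i.ID, CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END
    FROM inserted i

    -- rows only in deleted are genuine DELETEs
    INSERT dbo.BaseTable_DmlLog (ID, Action)
    SELECT d.ID, 'D'
    FROM deleted d
    WHERE NOT EXISTS (SELECT 1 FROM inserted)
END
GO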

2273 msec to insert 7679 records: is this fast or slow?

0.3 milliseconds per row (2273 ms / 7679 rows) is respectable performance, especially if you haven't yet done anything to make your code run fast. If you have any indexes on your table, the insertion rate may slow down as you get up to many thousands of rows already in the database. Then you'll need to look into disabling constraints, loading the table, and re-enabling the constraints. But you can cross that bridge if you come to it.
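
The exact mechanism depends on your database. Assuming MySQL, for example, the disable/load/re-enable pattern might look like this sketch (the table name is hypothetical, and ALTER TABLE ... DISABLE KEYS only defers non-unique MyISAM index maintenance):

SET unique_checks = 0;
SET foreign_key_checks = 0;
ALTER TABLE my_table DISABLE KEYS;

-- run the bulk INSERTs here

ALTER TABLE my_table ENABLE KEYS;
SET foreign_key_checks = 1;
SET unique_checks = 1;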

Is bulk update faster than single update in DB2?

In general, a "bulk" update will be faster, regardless of database. Of course, you can test the performance of the two, and report back.

Each individual UPDATE call carries a fixed amount of overhead: processing the query and setting up locks on tables/pages/rows. A single bulk UPDATE pays that overhead once.

The downside of the bulk UPDATE is that, while it may be faster overall, it can lock the underlying resources for longer periods of time. For instance, the individual updates might take 10 milliseconds each, for an elapsed time of 10 seconds for 1,000 of them; however, no resource is locked for more than 10 milliseconds at a time. The bulk UPDATE might take only 5 seconds, but the resources would be locked for more of that period.
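
As an illustration, with a hypothetical table t(id, val), the two shapes might look like this; on DB2, one way to express the bulk form is a MERGE against a list of values:

-- many individual updates, each paying its own overhead
UPDATE t SET val = 'a' WHERE id = 1;
UPDATE t SET val = 'b' WHERE id = 2;

-- one bulk statement that pays the overhead once
MERGE INTO t
USING (VALUES (1, 'a'), (2, 'b')) AS src (id, val)
ON t.id = src.id
WHEN MATCHED THEN UPDATE SET val = src.val;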

To speed these updates, be sure that id is indexed.

I should note that this is a general principle; I have not specifically tested single versus multiple update performance on DB2.


