Best Practices for Multithreaded Processing of Database Records

Best practices for multithreaded processing of database records

The pattern I'd use is as follows:

  • Create columns "lockedby" and "locktime", which hold a thread/process/machine ID and a timestamp respectively (you'll need the machine ID when you split the processing between several machines)
  • Each task would do a query such as:

    UPDATE taskstable SET lockedby=(my id), locktime=now() WHERE lockedby IS NULL ORDER BY ID LIMIT 10

Where 10 is the "batch size".

  • Then each task does a SELECT to find out which rows it has "locked" for processing, and processes those
  • After each row is complete, set lockedby and locktime back to NULL (and mark the row as done, or delete it, so it isn't picked up again)
  • All this is done in a loop for as many batches as exist.
  • A cron job or scheduled task periodically resets the "lockedby" of any row whose locktime is too long ago, as those rows were presumably claimed by a task that has since hung or crashed. Another worker will then pick them up

The LIMIT 10 is MySQL-specific, but other databases have equivalents. The ORDER BY is important to avoid the query being nondeterministic.
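
Putting the pattern together, a minimal sketch in MySQL (the :myid placeholder, the column names, and the 10-minute timeout are illustrative):

    -- claim a batch of unclaimed rows
    UPDATE taskstable SET lockedby = :myid, locktime = NOW()
    WHERE lockedby IS NULL ORDER BY id LIMIT 10;

    -- find out which rows this worker just locked
    SELECT * FROM taskstable WHERE lockedby = :myid;

    -- cron job / scheduled task: free rows claimed by hung or crashed workers
    UPDATE taskstable SET lockedby = NULL, locktime = NULL
    WHERE lockedby IS NOT NULL AND locktime < NOW() - INTERVAL 10 MINUTE;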

Best Practice to process multiple rows from database in different thread

Right, since no-one seems to be picking this one up, I'll continue here what was started in the comments:

There are lots of solutions to this one. In your case, with just one processing thread, you might want, for example, to store just the record IDs in the queue. Then ThreadB can fetch the row itself to make sure the status is indeed NEW. Or use optimistic locking with update table set status='IN_PROGRESS' where id=rowId and status='NEW', and quit processing the row when the update affects no rows (or throws an optimistic-lock exception).
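
In code, the "did I get it" check is just the affected-row count; a hypothetical sketch (table and column names are made up):

    UPDATE tasks SET status = 'IN_PROGRESS'
    WHERE id = :rowId AND status = 'NEW';
    -- 1 row affected: this thread owns the record
    -- 0 rows affected: another thread claimed it first, so skip it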

Optimistic locking is fun, and you could also use it to get rid of the producer thread altogether. Imagine a few threads processing records from the database. Each could pick up a record and try to set the status with optimistic locking as in the first example. It's quite possible to get a lot of contention for records this way, so each thread could fetch N rows, where N is the number of threads, or twice that much, and then try to process the first row that it succeeds in setting to IN_PROGRESS. This makes for a less complicated system, and one less thing to take care of / synchronize with.

And you can have each thread pick up not only records that are NEW, but also those which are IN_PROGRESS with started_date older than some timeout; that would include records that were not processed because of a system failure (for example, a thread managed to set one row to IN_PROGRESS and then your system went down). So you get some resilience here.
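
A hypothetical claim query covering both the NEW and the timed-out IN_PROGRESS cases (names and the timeout arithmetic are illustrative):

    UPDATE tasks
    SET status = 'IN_PROGRESS', started_date = SYSDATE
    WHERE id = :rowId
      AND (status = 'NEW'
           OR (status = 'IN_PROGRESS' AND started_date < SYSDATE - :timeout));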

Options to use multithreading to process a group of database records?

The most straightforward way to implement this requirement is to use the Task Parallel Library's Parallel.ForEach (or Parallel.For), and allow it to manage the individual worker threads.
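
A minimal sketch, assuming hypothetical GetPendingRecords, ProcessRecord, and MarkCompleted helpers:

    using System.Threading.Tasks;

    var records = GetPendingRecords();   // hypothetical: rows already marked "Processing"

    // The TPL partitions the collection and manages the worker threads for you.
    Parallel.ForEach(records, record =>
    {
        ProcessRecord(record);           // hypothetical per-record handler
        MarkCompleted(record.Id);        // hypothetical status update
    });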

From experience, I would recommend the following:

  • Have an additional status "Processing"
  • Have a column in the database that indicates when a record was picked up for processing, and a cleanup task / process that runs periodically looking for records that have been "Processing" for far too long, resetting their status to "ready for processing" (see the sketch after this list).
  • Even though you don't want it, "being processed" will be essential to crash recovery scenarios (unless you can tolerate the same record being processed twice).
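
A hypothetical cleanup query for that reset, in T-SQL (table, columns, and the 30-minute window are made up):

    UPDATE Records
    SET Status = 'ReadyForProcessing', PickedUpAt = NULL
    WHERE Status = 'Processing'
      AND PickedUpAt < DATEADD(MINUTE, -30, GETUTCDATE());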

Alternatively

Consider using a transactional queue (MSMQ or RabbitMQ come to mind). They are optimized for this very problem.

That would be my clear choice, having done both at massive scale.

Optimizing

If it takes a non-trivial amount of time to retrieve data from the database, you can consider a Producer/Consumer pattern, which is quite straightforward to implement with a BlockingCollection. That pattern allows one thread (the producer) to populate a queue with DB records to be processed, and multiple other threads (the consumers) to process items off that queue.
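
A minimal sketch of the pattern, assuming hypothetical ReadRecordsFromDb and ProcessRecord helpers and a Record type:

    using System;
    using System.Collections.Concurrent;
    using System.Linq;
    using System.Threading.Tasks;

    var queue = new BlockingCollection<Record>(boundedCapacity: 100);

    // Producer: a single thread reads DB records into the queue.
    var producer = Task.Run(() =>
    {
        foreach (var record in ReadRecordsFromDb())   // hypothetical reader
            queue.Add(record);
        queue.CompleteAdding();                       // tell consumers no more items are coming
    });

    // Consumers: several threads process items as they become available.
    var consumers = Enumerable.Range(0, 4)
        .Select(_ => Task.Run(() =>
        {
            foreach (var record in queue.GetConsumingEnumerable())
                ProcessRecord(record);                // hypothetical handler
        }))
        .ToArray();

    Task.WaitAll(consumers.Append(producer).ToArray());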

A New Alternative

Given that several processing steps touch the record before it is considered complete, have a look at Windows Workflow Foundation as a possible alternative.

Multithreaded application with database read - each thread unique records

I find this problem interesting partly because I'm attempting to do something similar in principle but also because I haven't seen a super intuitive industry standard solution to it. Yet.

What you are proposing to do would work if you write your SQL query correctly.
Using ROW_NUMBER / BETWEEN it should be achievable; a sketch follows below.
I'll also write up and document some other alternatives here, along with their benefits and caveats.
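
For illustration, a hypothetical slicing query (table, columns, and parameter names are made up); each thread queries only its own @SliceStart .. @SliceEnd range:

    ;WITH numbered AS
    (
        SELECT ID, Payload,
               ROW_NUMBER() OVER (ORDER BY ID) AS rn
        FROM dbo.Records
    )
    SELECT ID, Payload
    FROM numbered
    WHERE rn BETWEEN @SliceStart AND @SliceEnd;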

Parallel processing

I understand that you want to do this in SQL Server, but just as a reference, Oracle implements this as a keyword (hint) with which you can run a query in parallel.

Documentation: https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm
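
For reference, the hint looks like this (the table alias and degree of parallelism are illustrative):

    SELECT /*+ PARALLEL(t, 4) */ *
    FROM taskstable t;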

SQL Server implements this differently: you have to turn it on explicitly through a more involved hint, and you have to be on a certain version.

A nice article on this is here: https://www.mssqltips.com/sqlservertip/4939/how-to-force-a-parallel-execution-plan-in-sql-server-2016/

You can combine the parallel processing with SQL CLR integration, which would effectively do what you're trying to do inside SQL Server, with SQL Server managing the data chunks rather than you managing them in your threads.

SQL CLR integration

One nice feature that you might look into is executing .NET code in SQL Server. Documentation here: https://learn.microsoft.com/en-us/dotnet/framework/data/adonet/sql/introduction-to-sql-server-clr-integration

This would basically allow you to run C# code in your SQL Server, saving you the read / process / write roundtrip. The integration services around this have improved as well - documentation here: https://learn.microsoft.com/en-us/sql/integration-services/sql-server-integration-services?view=sql-server-2017
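
A minimal sketch of a CLR stored procedure (the class, method, and processing logic are hypothetical; "context connection=true" is the standard way to run on the calling session's own connection):

    using System.Data.SqlClient;
    using Microsoft.SqlServer.Server;

    public class Processing
    {
        [SqlProcedure]
        public static void ProcessBatch()
        {
            // Runs inside the calling session - no network roundtrip to the server.
            using (var conn = new SqlConnection("context connection=true"))
            {
                conn.Open();
                // ... read rows, process them, write results back ...
            }
        }
    }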

Reviewing the QoS / getting at the logs when something goes wrong is unfortunately not as easy as it would be if you handled this in a worker job, though.

Use a single thread (if you're reading from an external source)

Parallelism is only good for you if certain conditions are met. Below is from Oracle's documentation but it also applies to MSSQL: https://docs.oracle.com/cd/B19306_01/server.102/b14223/usingpe.htm#DWHSG024

Parallel execution improves processing for:

  • Queries requiring large table scans, joins, or partitioned index scans
  • Creation of large indexes
  • Creation of large tables (including materialized views)
  • Bulk inserts, updates, merges, and deletes

There are also setup / environment requirements:

Parallel execution benefits systems with all of the following
characteristics:

  • Symmetric multiprocessors (SMPs), clusters, or massively parallel
    systems
  • Sufficient I/O bandwidth
  • Underutilized or intermittently used CPUs (for example, systems where
    CPU usage is typically less than 30%)
  • Sufficient memory to support additional memory-intensive processes,
    such as sorts, hashing, and I/O buffers

There are other constraints as well. When you are using multiple threads to do the operation that you propose, if one of those threads gets killed, fails to do something, throws an exception, etc., you will absolutely need to handle that - in a way that keeps track of the last index you've processed - so you can retry the rest of the records.
With a single thread that becomes much simpler.

Conclusion

Assuming that the DB is modeled correctly and couldn't be optimized any further, I'd say the simplest solution, a single thread, is the best one. It is easier to log and track errors and easier to implement retry logic, and I'd say those benefits far outweigh what you would see from parallel processing. You might look into parallel processing for the batch updates that you'll make to the DB, but unless you're going to have a CLR DLL in SQL Server whose methods you invoke in parallel, I don't see compelling benefits. Your system would also have to behave in a certain way while the parallel query runs for it to be more efficient.

You can of course design your worker role to be async and not block on each record's processing. You'll still be multi-threaded, but your querying will happen in a single thread; a sketch follows below.
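
A minimal sketch of that shape, assuming hypothetical ReadRecordsSequentially and ProcessRecord helpers:

    using System.Collections.Generic;
    using System.Threading.Tasks;

    var inFlight = new List<Task>();

    // One thread reads from the DB; processing is handed off without blocking the reader.
    foreach (var record in ReadRecordsSequentially())        // hypothetical single-threaded reader
        inFlight.Add(Task.Run(() => ProcessRecord(record))); // hypothetical handler

    await Task.WhenAll(inFlight);   // surfaces any processing failures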

Edit to conclusion

After talking to my colleague about this today, it's worth adding that even with the single-thread approach you'd have to be able to recover from failure, so in principle the requirement of recovery / graceful failure and remembering what you processed doesn't change between multiple threads and a single thread. How you recover would change, though, given that you'd have to write more complex code to track your multiple threads and their states.

How to prevent multi threaded application to read this same Sql Server record twice

I think you should be able to do this atomically using a common table expression. (I'm not 100% certain about this, and I haven't tested, so you'll need to verify that it works for you in your situation.)

-- @ExportedIDs must be a declared table variable to receive the OUTPUT rows
-- (the column type here is an assumption; @ArrayCount, @Now, @Expire, and
-- @ExportSession are supplied by the caller):
DECLARE @ExportedIDs TABLE (ID int);

;WITH cte AS
(
    SELECT TOP(@ArrayCount)
        ID, Exported, ExportExpires, ExportSession
    FROM dbo.GameData WITH (READPAST)   -- skip rows locked by other sessions
    WHERE Exported IS NULL
    ORDER BY ID
)
UPDATE cte
SET Exported = @Now,
    ExportExpires = @Expire,
    ExportSession = @ExportSession
OUTPUT INSERTED.ID INTO @ExportedIDs;

what is the best practice for multiple threads writing to one file

For logging (for future questions, make sure you put that information into the question rather than just a comment), there's a strong preference for not having the threads do any file access they don't have to, since logging would otherwise negatively impact the performance of the rest of that thread.

For that reason, NathanOliver's suggestion of having the threads write to a shared container, with one dedicated thread dumping that container to the file, would probably be the best option for you.
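
A minimal sketch of that arrangement (the file name and Log helper are illustrative):

    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;

    var messages = new BlockingCollection<string>();

    // Worker threads only enqueue; they never touch the file.
    void Log(string line) => messages.Add(line);

    // One dedicated thread drains the container to the file.
    var writer = Task.Run(() =>
    {
        using var file = new StreamWriter("app.log");
        foreach (var line in messages.GetConsumingEnumerable())
            file.WriteLine(line);
    });

    // At shutdown, once all workers are done logging:
    messages.CompleteAdding();
    writer.Wait();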

Multiprocessing/multithreading for database query in Python

The link below helped me:
Multiprocessing with JDBC connection and pooling
I can get around a 25% gain on my local machine.


