Multi Threading C# Application with SQL Server Database Calls

Multi threading C# application with SQL Server database calls

Here's my take on the problem:

  • When using multiple threads to insert/update/query data in SQL Server, or any database, deadlocks are a fact of life. You have to assume they will occur and handle them appropriately.

  • That's not to say we shouldn't attempt to limit the occurrence of deadlocks. However, while it's easy to read up on the basic causes of deadlocks and take steps to prevent them, SQL Server will always surprise you :-)

Some reasons for deadlocks:

  • Too many threads - try to keep the number of threads to a minimum, although of course we want more threads for maximum performance.

  • Not enough indexes. If selects and updates aren't selective enough, SQL Server will take out larger range locks than is healthy. Try to specify appropriate indexes.

  • Too many indexes. Updating indexes takes locks and increases the chance of deadlocks, so try to reduce indexes to the minimum required.

  • Transaction isolation level too high. The default isolation level when using a .NET TransactionScope is 'Serializable', whereas SQL Server's default is 'Read Committed'. Reducing the isolation level can help a lot (if appropriate, of course).
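Since the last point is easy to get wrong, here's a minimal sketch of explicitly requesting Read Committed from a TransactionScope (the 30-second timeout is just an illustrative value, not something from the question):

```csharp
using System;
using System.Transactions;

class IsolationLevelSketch
{
    static void Main()
    {
        // TransactionScope defaults to Serializable; SQL Server's own
        // default is Read Committed, so state what you want explicitly.
        var options = new TransactionOptions
        {
            IsolationLevel = IsolationLevel.ReadCommitted,
            Timeout = TimeSpan.FromSeconds(30) // illustrative value
        };

        using (var scope = new TransactionScope(TransactionScopeOption.Required, options))
        {
            // Open connections and do the work here; any connection
            // opened inside the scope enlists in this transaction.
            scope.Complete();
        }
    }
}
```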

This is how I might tackle your problem:

  • I wouldn't roll my own threading solution; I would use the Task Parallel Library. My main method would look something like this:

    using (var dc = new TestDataContext())
    {
        // Get all the ids of interest.
        // I assume you mark successfully updated rows in some way
        // in the update transaction.
        List<int> ids = dc.TestItems.Where(...).Select(item => item.Id).ToList();

        var problematicIds = new List<ErrorType>();

        // Either allow the TaskParallel library to select what it considers
        // the optimum degree of parallelism by omitting the
        // ParallelOptions parameter, or specify what you want.
        Parallel.ForEach(ids, new ParallelOptions {MaxDegreeOfParallelism = 8},
                         id => CalculateDetails(id, problematicIds));
    }
  • Execute the CalculateDetails method with retries for deadlock failures

    private static void CalculateDetails(int id, List<ErrorType> problematicIds)
    {
        try
        {
            // Handle deadlocks
            DeadlockRetryHelper.Execute(() => CalculateDetails(id));
        }
        catch (Exception e)
        {
            // Too many deadlock retries (or other exception).
            // Record so we can diagnose the problem or retry later.
            // Lock because List<T> is not thread-safe and this method
            // runs concurrently on multiple threads.
            lock (problematicIds)
            {
                problematicIds.Add(new ErrorType(id, e));
            }
        }
    }
  • The core CalculateDetails method

    private static void CalculateDetails(int id)
    {
        // Creating a new DataContext is not expensive.
        // No need to create it outside of this method.
        using (var dc = new TestDataContext())
        {
            // TODO: adjust IsolationLevel to minimize deadlocks.
            // If you don't need to change the isolation level
            // then you can remove the TransactionScope altogether.
            using (var scope = new TransactionScope(
                TransactionScopeOption.Required,
                new TransactionOptions {IsolationLevel = IsolationLevel.Serializable}))
            {
                TestItem item = dc.TestItems.Single(i => i.Id == id);

                // work done here

                dc.SubmitChanges();
                scope.Complete();
            }
        }
    }
  • And of course my implementation of a deadlock retry helper

    public static class DeadlockRetryHelper
    {
        private const int MaxRetries = 4;
        private const int SqlDeadlock = 1205;

        public static void Execute(Action action, int maxRetries = MaxRetries)
        {
            if (HasAmbientTransaction())
            {
                // A deadlock blows away the containing transaction,
                // so there's no point retrying if we're already in one.
                action();
                return;
            }

            int retries = 0;

            while (retries < maxRetries)
            {
                try
                {
                    action();
                    return;
                }
                catch (Exception e)
                {
                    if (IsSqlDeadlock(e))
                    {
                        retries++;
                        // Delay subsequent retries - not sure if this helps or not
                        Thread.Sleep(100 * retries);
                    }
                    else
                    {
                        throw;
                    }
                }
            }

            // Final attempt - if this one deadlocks, the exception propagates.
            action();
        }

        private static bool HasAmbientTransaction()
        {
            return Transaction.Current != null;
        }

        private static bool IsSqlDeadlock(Exception exception)
        {
            if (exception == null)
            {
                return false;
            }

            var sqlException = exception as SqlException;

            if (sqlException != null && sqlException.Number == SqlDeadlock)
            {
                return true;
            }

            if (exception.InnerException != null)
            {
                return IsSqlDeadlock(exception.InnerException);
            }

            return false;
        }
    }
  • One further possibility is to use a partitioning strategy

If your tables can naturally be partitioned into several distinct sets of data, then you can either use SQL Server's partitioned tables and indexes, or you could manually split your existing tables into several sets of tables. I would recommend using SQL Server's partitioning, since the second option would be messy. Note, however, that built-in partitioning has historically been an Enterprise Edition feature (it became available in the other editions from SQL Server 2016 SP1).

If partitioning is possible for you, you could choose a partition scheme that breaks your data into, let's say, 8 distinct sets. You could then use your original single-threaded code, but have 8 threads, each targeting a separate partition. This way there won't be any deadlocks (or at least a minimal number of them).
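As a minimal in-memory sketch of the manual flavour of this idea - the modulo bucketing and the empty processing placeholder are my own assumptions, not the asker's schema:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class PartitionSketch
{
    public const int PartitionCount = 8;

    // Deal the ids into disjoint buckets; each worker thread only ever
    // touches rows in its own bucket, so the workers don't contend.
    public static List<List<int>> Partition(IEnumerable<int> ids)
    {
        return Enumerable.Range(0, PartitionCount)
            .Select(p => ids.Where(id => id % PartitionCount == p).ToList())
            .ToList();
    }

    static void Main()
    {
        var partitions = Partition(Enumerable.Range(1, 1000));

        Parallel.ForEach(partitions, partition =>
        {
            foreach (int id in partition)
            {
                // run the original single-threaded processing for this id
            }
        });

        Console.WriteLine(partitions.Sum(p => p.Count)); // prints 1000
    }
}
```

In a real system you'd partition on whatever key SQL Server's partition scheme uses, so each thread's reads and writes stay inside one partition's locks.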

I hope that makes sense.

Multithreaded application with database read - each thread unique records

I find this problem interesting, partly because I'm attempting to do something similar in principle, but also because I haven't seen a super intuitive industry standard solution to it. Yet.

What you are proposing to do would work if you write your SQL query correctly.
Using ROW_NUMBER / BETWEEN it should be achievable.
I'll document some alternatives here along with their benefits and caveats.
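As a rough sketch of the ROW_NUMBER / BETWEEN approach - the range-splitting helper and the table/column names in the comment are my own illustrative assumptions:

```csharp
using System;
using System.Collections.Generic;

class RowRangeSketch
{
    // Split totalRows into threadCount contiguous, inclusive 1-based
    // ranges, suitable for "WHERE rn BETWEEN @from AND @to" against
    // a ROW_NUMBER()-numbered result set.
    public static List<(long From, long To)> SplitIntoRanges(long totalRows, int threadCount)
    {
        var ranges = new List<(long From, long To)>();
        long baseSize = totalRows / threadCount;
        long remainder = totalRows % threadCount;
        long from = 1;
        for (int i = 0; i < threadCount && from <= totalRows; i++)
        {
            long size = baseSize + (i < remainder ? 1 : 0);
            ranges.Add((from, from + size - 1));
            from += size;
        }
        return ranges;
    }

    static void Main()
    {
        // Each thread would then run something like (table/column names
        // are placeholders):
        //   WITH numbered AS (
        //     SELECT *, ROW_NUMBER() OVER (ORDER BY SomeKey) AS rn
        //     FROM MyTable
        //   )
        //   SELECT * FROM numbered WHERE rn BETWEEN @from AND @to;
        foreach (var r in SplitIntoRanges(100, 3))
        {
            Console.WriteLine($"{r.From}-{r.To}");
        }
        // prints 1-34, 35-67, 68-100
    }
}
```

Note that ROW_NUMBER() is only stable across threads if the ORDER BY key is unique and the underlying data doesn't change between the threads' queries.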

Parallel processing

I understand that you want to do this in SQL Server, but just as a reference, Oracle implements this as a keyword with which you can query stuff in parallel.

Documentation: https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm

SQL Server implements this differently: you have to explicitly turn it on through a more complex set of hints, and you have to be on a certain version:

A nice article on this is here: https://www.mssqltips.com/sqlservertip/4939/how-to-force-a-parallel-execution-plan-in-sql-server-2016/

You can combine parallel processing with SQL CLR integration, which would effectively do what you're trying to do inside SQL Server, with SQL Server managing the data chunks rather than you in your threads.

SQL CLR integration

One nice feature that you might look into is executing .NET code inside SQL Server. Documentation here: https://learn.microsoft.com/en-us/dotnet/framework/data/adonet/sql/introduction-to-sql-server-clr-integration

This would basically allow you to run C# code inside your SQL Server instance - saving you the read / process / write roundtrip. The surrounding tooling has improved as well - documentation here: https://learn.microsoft.com/en-us/sql/integration-services/sql-server-integration-services?view=sql-server-2017

Unfortunately, though, reviewing the QoS / getting at the logs when something goes wrong is not as easy as it is when you handle this in a worker job.

Use a single thread (if you're reading from an external source)

Parallelism is only good for you if certain conditions are met. Below is from Oracle's documentation but it also applies to MSSQL: https://docs.oracle.com/cd/B19306_01/server.102/b14223/usingpe.htm#DWHSG024

Parallel execution improves processing for:

  • Queries requiring large table scans, joins, or partitioned index scans
  • Creation of large indexes
  • Creation of large tables (including materialized views)
  • Bulk inserts, updates, merges, and deletes

There are also setup / environment requirements

Parallel execution benefits systems with all of the following
characteristics:

  • Symmetric multiprocessors (SMPs), clusters, or massively parallel
    systems
  • Sufficient I/O bandwidth
  • Underutilized or intermittently used CPUs (for example, systems where
    CPU usage is typically less than 30%)
  • Sufficient memory to support additional memory-intensive processes,
    such as sorts, hashing, and I/O buffers

There are other constraints. When you use multiple threads for the operation you propose, if one of those threads gets killed / fails to do something / throws an exception, etc., you will absolutely need to handle that - in a way that records the last index you've processed - so you can retry the rest of the records.
With a single thread that becomes much simpler.

Conclusion

Assuming that the DB is modeled correctly and couldn't be optimized any further, I'd say the simplest solution - a single thread - is the best one. It's easier to log and track errors and easier to implement retry logic, and I'd say those advantages far outweigh the benefits you would see from parallel processing. You might look into parallel processing for the batch updates that you'll do to the DB, but unless you're going to have a CLR DLL in SQL Server whose methods you invoke in parallel, I don't see compelling benefits. Your system would also have to behave in a certain way at the times you're running the parallel query for it to be more efficient.

You can of course design your worker role to be async and not block on each record being processed. You'll still be multi-threaded, but your querying will happen in a single thread.
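A minimal sketch of that shape, using a BlockingCollection so only one thread "queries" while several threads process - the in-memory record source here stands in for the real database read:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class SingleReaderSketch
{
    // One producer (the only code that would touch the database) feeds
    // a bounded queue; several consumers process records in parallel.
    public static int Process(int recordCount, int consumerCount)
    {
        var queue = new BlockingCollection<int>(boundedCapacity: 100);
        int processed = 0;

        var consumers = new Task[consumerCount];
        for (int i = 0; i < consumerCount; i++)
        {
            consumers[i] = Task.Run(() =>
            {
                foreach (int record in queue.GetConsumingEnumerable())
                {
                    // real per-record work goes here
                    Interlocked.Increment(ref processed);
                }
            });
        }

        // Stand-in for the single-threaded database read.
        for (int record = 1; record <= recordCount; record++)
        {
            queue.Add(record);
        }
        queue.CompleteAdding();
        Task.WaitAll(consumers);
        return processed;
    }

    static void Main()
    {
        Console.WriteLine(Process(1000, 4)); // prints 1000
    }
}
```

The bounded capacity keeps the reader from racing far ahead of the workers, so memory use stays flat however large the result set is.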

Edit to conclusion

After talking to my colleague about this today, it's worth adding that even with the single-thread approach, you'd have to be able to recover from failure; so in principle, the requirement for recovery / graceful failure and remembering what you processed doesn't change between multiple threads and a single thread. How you recover does, though, given that with multiple threads you'd have to write more complex code to track the threads and their states.

Multithreading with ADO.NET

You need a separate connection for each concurrent operation. Since connections are pooled and reused, feel free to create a new one in the scope where you need it, and close it at the end (it is returned to the pool, not actually closed).

This is a limitation of the on-the-wire protocols for most (all?) database servers.

.NET, the SqlConnection object, and multi-threading

The obvious solution is to just re-create the SqlConnection object every time a database call requires one - in this case, it would never be shared. Is there any reason not to do this?

On the contrary, that's absolutely what you should do. That's the behaviour SqlConnection was designed for. You should use a using statement to automatically close the connection at the end of the block you're using it in, and the connection pool mechanism will automatically handle the real underlying connections to the database.
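A sketch of that pattern, assuming System.Data.SqlClient - the table name, query, and connection string are placeholders, not anything from the question:

```csharp
using System.Data.SqlClient;

static class ConnectionPerCallSketch
{
    // A fresh SqlConnection per operation: cheap, because the pool
    // reuses the underlying physical connection behind the scenes.
    public static int CountOrders(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("SELECT COUNT(*) FROM Orders", connection))
        {
            connection.Open();
            return (int)command.ExecuteScalar();
        } // Dispose here returns the connection to the pool, not closes it.
    }
}
```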

c# multithread database fill table

Here is a basic boilerplate you can use.
Fill in the part where I've left a comment:

// Set your connectionstring and execute the query and fill your data here

This is basic - I've used threads instead of the thread pool because, compared to the amount of work being done, the overhead of spawning a new thread is minimal. You can extend this if you want by keeping track of the threads and using thread signals etc. to implement more advanced behavior.

Additionally, if you want to pass any extra parameters to the piece of code that does the work, add them to the work item definition class.

Note: This does not support multiple parallel executions of the main RunParallel method but you could easily extend it to do this.

public static class RunParallel
{
    // Tune this for your DB server performance characteristics
    const int NumThreadsToRunInParallel = 8;

    public static void FillDataSet(ref DataSet Source)
    {
        foreach (DataRow r in Source.Tables["queries"].Rows)
        {
            var Work = new WorkItemDefinition();
            Work.Query = r["QueryStatement"].ToString();
            Work.QSource = r["QuerySource"].ToString();
            Work.TableName = r["TableName"].ToString();
            EnQueueWork(Work);
        }

        var NewThreadStart = new System.Threading.ThreadStart(ProcessPendingWork);
        for (int I = 0; I < NumThreadsToRunInParallel; I++)
        {
            var NewThread = new System.Threading.Thread(NewThreadStart);
            // Set IsBackground if you want to allow the application to quit
            // before these threads finish all their work and exit:
            //NewThread.IsBackground = true;
            ThreadCounterInc();
            NewThread.Start();
        }

        while (ThreadCounterValue > 0)
        {
            System.Threading.Thread.Sleep(1000);
        }
    }

    private static void ProcessPendingWork()
    {
        try
        {
            // Process each dequeued item before fetching the next one.
            WorkItemDefinition Work = DeQueueWork();
            while (Work != null)
            {
                using (DbConnection db = new OdbcConnection())
                {
                    // Set your connectionstring and execute the query and fill your data here
                }
                Work = DeQueueWork();
            }
        }
        finally
        {
            ThreadCounterDec();
        }
    }

    private static int ThreadCounter = 0;

    private static void ThreadCounterInc()
    {
        lock (SyncRoot)
        {
            ThreadCounter += 1;
        }
    }

    private static void ThreadCounterDec()
    {
        lock (SyncRoot)
        {
            ThreadCounter -= 1;
        }
    }

    private static int ThreadCounterValue
    {
        get
        {
            lock (SyncRoot)
            {
                return ThreadCounter;
            }
        }
    }

    private static readonly object SyncRoot = new object();
    private static readonly Queue<WorkItemDefinition> m_PendingWork = new Queue<WorkItemDefinition>();

    private static Queue<WorkItemDefinition> PendingWork
    {
        get
        {
            return m_PendingWork;
        }
    }

    private static WorkItemDefinition DeQueueWork()
    {
        lock (SyncRoot)
        {
            // Checking Count avoids the overhead of catching an
            // exception on an empty queue.
            if (PendingWork.Count > 0)
            {
                return PendingWork.Dequeue();
            }
        }
        return null;
    }

    private static void EnQueueWork(WorkItemDefinition Work)
    {
        lock (SyncRoot)
        {
            PendingWork.Enqueue(Work);
        }
    }

    public class WorkItemDefinition
    {
        public string Query { get; set; }
        public string QSource { get; set; }
        public string TableName { get; set; }
    }
}

read SQL data from multiple threads

If you want to read data in multiple threads without overlapping, then you might try paging. You could have each thread read a different page. For example, the first thread reads the first 20 records, processes each of those records, and then sets processed=true for each, while the second thread does the same for the next 20 records, and so on.

Check out this link for more info on paging in multiple threads.
http://ericniemiec.wordpress.com/2010/06/10/paging-records-in-sql-server-2008-for-processing-database-records-on-different-threads-in-c/
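A rough sketch of handing each thread its own page - the offset arithmetic and the SQL in the comment (which uses the OFFSET/FETCH syntax available from SQL Server 2012) are my own illustrative assumptions; on SQL Server 2008 you'd express the same page with ROW_NUMBER ... BETWEEN as in the linked article:

```csharp
using System;

class PagingSketch
{
    public const int PageSize = 20;

    // Pages are assigned round-robin: thread 0 takes pages 0, T, 2T, ...;
    // thread 1 takes pages 1, T+1, ... (T = threadCount), so no two
    // threads ever read the same rows.
    public static int OffsetFor(int threadIndex, int round, int threadCount)
        => (round * threadCount + threadIndex) * PageSize;

    static void Main()
    {
        // Each thread would run a query like this with its own offset,
        // ordering by a stable unique key so pages never overlap
        // (MyTable / Id are placeholder names):
        //   SELECT * FROM MyTable
        //   ORDER BY Id
        //   OFFSET @offset ROWS FETCH NEXT @pageSize ROWS ONLY;
        for (int thread = 0; thread < 2; thread++)
        {
            Console.WriteLine($"thread {thread}, round 0: offset {OffsetFor(thread, 0, 2)}");
        }
        // prints offsets 0 and 20
    }
}
```

One caveat: if the query also filters on processed=0, rows disappear from the result set as threads mark them, shifting every later page; paging over a stable key (or snapshotting the ids up front) avoids that.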


