Program With Threads for Matrix Multiplication

Multi-threaded matrix multiplication performance issue

People think that using multiple threads will automatically (magically!) make any computation go faster. This is not so¹.

There are a number of factors that can make multi-threading speedup less than you expect, or indeed result in a slowdown.

  1. A computer with N cores (or hyperthreads) can do computations at most N times as fast as a computer with 1 core. This means that when you have T threads where T > N, the computational performance will be capped at N. (Beyond that, the threads make progress because of time slicing.)

  2. A computer has a certain amount of memory bandwidth; i.e. it can only perform a certain number of read/write operations per second on main memory. If you have an application where the demand exceeds what the memory subsystem can achieve, it will stall (for a few nanoseconds). If there are many cores executing many threads at the same time, then it is the aggregate demand that matters.

  3. A typical multi-threaded application working on shared variables or data structures will use either volatile variables or explicit synchronization to coordinate access. Both of these increase the demand on the memory system.

  4. When explicit synchronization is used and two threads want to hold a lock at the same time, one of them will be blocked. This lock contention slows down the computation. Indeed, the computation can be slowed down even by past contention on a lock.

  5. Thread creation is expensive. Even acquiring an existing thread from a thread pool can be relatively expensive. If the task that you perform with the thread is too small, the setup costs can outweigh the possible speedup.

There is also the issue that you may be running into problems with a poorly written benchmark; e.g. the JVM may not be properly warmed up before taking the timing measurements.
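To get a feel for point 5, here is a small sketch (the function name is my own) that times creating and joining a single do-nothing thread. On typical desktop hardware this costs on the order of tens of microseconds, which dwarfs the few nanoseconds a handful of multiply-adds would take:

```cpp
#include <chrono>
#include <thread>

// Times the creation and joining of one thread that does no work at all.
// The exact number is platform-dependent, but it illustrates why spawning
// a thread per tiny task is a losing proposition.
long long MeasureThreadSpawnNs()
{
    auto start = std::chrono::steady_clock::now();
    std::thread([] {}).join();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start)
        .count();
}
```

Compare the number this reports against the cost of the actual work each thread would do; if the work is smaller, threading is pure overhead.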


There is insufficient detail in your question to be sure which of the above factors is likely to affect your application's performance. But it is likely to be a combination of 1, 2 and 5 ... depending on how many cores are used, how big the CPU's memory caches are, how big the matrix is, and other factors.


¹ Indeed, if this were true then we would not need to buy computers with lots of cores. We could just use more and more threads. Provided you had enough memory, you could do an infinite amount of computation on a single machine. Bitcoin mining would be a doddle. Of course, it isn't true.

Matrix multiplication using multiple threads

What I am having trouble figuring out is how should I sum up all the resulting values in the correct order.

They can be summed out of order, which is why this is a good problem to solve with multithreading. If ordering matters to a specific problem, you can't improve it with multithreading (to be clear: any sub-problem that can be solved out of order is a potential candidate for multithreading).
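To make the out-of-order point concrete, here is a sketch (names are my own) that sums an integer vector in parallel chunks. Each thread writes its partial sum to its own slot, so no synchronization is needed, and the partials can be combined at the end in any order because integer addition is associative and commutative:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums 'data' using roughly one thread per hardware core; each thread
// gets a contiguous chunk and writes its partial sum to its own slot.
long ChunkedSum(const std::vector<int>& data)
{
    const std::size_t n = data.size();
    const std::size_t workers =
        std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunk = (n + workers - 1) / workers;

    std::vector<long> partial(workers, 0);
    std::vector<std::thread> threads;

    for (std::size_t w = 0; w < workers; ++w)
    {
        const std::size_t begin = std::min(w * chunk, n);
        const std::size_t end = std::min(begin + chunk, n);
        threads.emplace_back([&, w, begin, end] {
            partial[w] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0L);
        });
    }

    for (auto& t : threads)
        t.join();

    // The partials can be combined in any order.
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

(With floating-point elements the combination order can change the rounding slightly, since floating-point addition is not associative; for integers the total is identical regardless of order.)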

One solution to your problem is to set up a solution vector at the call site, then pass the corresponding element by reference (the MatrixMultiply function also needs to know which row it is computing):

void MatrixMultiply(const Array2d& matrix,
                    const vector<int>& vec, int row, int& solution);

// ...

vector<int> result(height);
vector<thread> threads(height);

for (int i = 0; i < height; i++)
{
    // std::ref is required here: std::thread copies its arguments by
    // value, so passing result[i] directly would not update the caller's
    // vector (and in fact fails to compile, since int& can't bind to the
    // copy).
    threads[i] = thread(MatrixMultiply, cref(array2d), cref(array1d),
                        i, ref(result[i]));
}

for (thread& t : threads)
    t.join();

Your 2D array should really provide info on its height and width without having to pass these values explicitly.
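Putting the pieces together, a complete, compilable sketch might look like this (here Array2d is assumed to be a vector of rows, and the Multiply wrapper is my own name; your real class will differ):

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Illustrative stand-in for the Array2d in the question.
using Array2d = std::vector<std::vector<int>>;

// Computes one row of the matrix-vector product into 'solution'.
void MatrixMultiply(const Array2d& matrix, const std::vector<int>& vec,
                    int row, int& solution)
{
    solution = 0;
    for (std::size_t c = 0; c < vec.size(); ++c)
        solution += matrix[row][c] * vec[c];
}

// Spawns one thread per row. std::ref/std::cref are needed because
// std::thread copies its arguments by value.
std::vector<int> Multiply(const Array2d& matrix, const std::vector<int>& vec)
{
    const int height = static_cast<int>(matrix.size());
    std::vector<int> result(height);
    std::vector<std::thread> threads;

    for (int i = 0; i < height; i++)
        threads.emplace_back(MatrixMultiply, std::cref(matrix),
                             std::cref(vec), i, std::ref(result[i]));

    for (auto& t : threads)
        t.join();

    return result;
}
```

Compile with -pthread on gcc/clang. (As point 5 in the other answer notes, one thread per row only pays off for large rows; for small matrices a plain loop will be faster.)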


BONUS INFO:

We could make this solution much more OOP, in a way that you'll want to reuse for future problems (a trick for working with arrays that even some experienced programmers seem to miss):

The MatrixMultiply function is really just a dot-product function:

template <typename V1, typename V2>
auto DotProduct(const V1& vec1, const V2& vec2)
{
    auto result = vec1[0] * vec2[0];

    for (size_t i = 1; i < vec1.size(); ++i)
        result += vec1[i] * vec2[i];

    return result;
}

template <typename V1, typename V2, typename T>
void DotProduct(const V1& vec1, const V2& vec2, T& result)
{
    result = DotProduct(vec1, vec2);
}

(The above allows the vectors to be any objects that provide size() and [] as expected.)
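As a quick sanity check, the template happily mixes container types (it is repeated below so the snippet stands alone; Demo is my own name):

```cpp
#include <array>
#include <cstddef>
#include <vector>

// The DotProduct template from above, repeated so this compiles on its own.
template <typename V1, typename V2>
auto DotProduct(const V1& vec1, const V2& vec2)
{
    auto result = vec1[0] * vec2[0];

    for (std::size_t i = 1; i < vec1.size(); ++i)
        result += vec1[i] * vec2[i];

    return result;
}

// Mixing std::vector and std::array works: both provide size() and [].
int Demo()
{
    std::vector<int> v{1, 2, 3};
    std::array<int, 3> a{4, 5, 6};
    return DotProduct(v, a);  // 1*4 + 2*5 + 3*6
}
```

The result type is deduced from the element types, so e.g. a double vector against an int vector yields a double.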

We can write a wrapper class around std::vector that can be used by our array class to handle all the indexing for us; like this:

template <typename T, typename A>
class SubVector
{
    const typename std::vector<T,A>::iterator m_it;
    const size_t m_size, m_interval_size;

public:

    SubVector (std::vector<T,A>& v, size_t start, size_t sub_size, size_t i_size = 1)
        : m_it(v.begin() + start), m_size(sub_size), m_interval_size(i_size)
    {}

    auto size () const
    {
        return m_size;
    }

    const T& operator [] (size_t i) const
    {
        return m_it[i * m_interval_size];
    }

    T& operator [] (size_t i)
    {
        return m_it[i * m_interval_size];
    }
};

Then you could use this in some kind of Vectorise method in your array; like this:

template <typename T, typename A = std::allocator<T>>
class Array2D
{
    std::vector<T,A> m_data;

    size_t m_width, m_height;

public:

    // your normal methods

    // Note: these can't be const, because SubVector stores a mutable
    // iterator into m_data.
    auto VectoriseRow(int r)
    {
        return SubVector(m_data, r * m_width, m_width);
    }

    auto VectoriseColumn(int c)
    {
        return SubVector(m_data, c, m_height, m_width);
    }
};

(Note: We could add the Vectorise feature to std::array or boost::multi_array by just writing a wrapper around them, which makes our array class more generic and saves us from having to do all the work. boost actually has this sort of feature inbuilt with array_view.)

Now our call site can be like so:

vector<int> result(height);
vector<thread> threads(height);

for (int i = 0; i < height; i++)
{
    // A lambda is needed here: DotProduct is a function template, so its
    // name can't be passed straight to std::thread.
    threads[i] = thread([&, i] { DotProduct(array2d.VectoriseRow(i), array1d, result[i]); });
}

for (thread& t : threads)
    t.join();

This might seem like a more verbose way of solving the original problem (because it is), but if you use multi-dimensional arrays in your coding you'll find you no longer have to write multi-array-specific functions, or handle ugly indices for sub-problems (even in 1D problems, like Mean of Means). When dealing with those sorts of problems, you'll invariably want to reuse something like the above code.
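For completeness, here is the whole bonus design as one compilable sketch. The constructor, the At accessor and the Multiply wrapper are my own additions so the snippet can stand alone; as noted, VectoriseRow/VectoriseColumn are non-const because SubVector stores a mutable iterator:

```cpp
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

template <typename V1, typename V2>
auto DotProduct(const V1& vec1, const V2& vec2)
{
    auto result = vec1[0] * vec2[0];
    for (std::size_t i = 1; i < vec1.size(); ++i)
        result += vec1[i] * vec2[i];
    return result;
}

// A strided view over part of a std::vector (row or column of a matrix).
template <typename T, typename A>
class SubVector
{
    const typename std::vector<T, A>::iterator m_it;
    const std::size_t m_size, m_interval_size;

public:
    SubVector(std::vector<T, A>& v, std::size_t start,
              std::size_t sub_size, std::size_t i_size = 1)
        : m_it(v.begin() + start), m_size(sub_size), m_interval_size(i_size)
    {}

    std::size_t size() const { return m_size; }
    const T& operator[](std::size_t i) const { return m_it[i * m_interval_size]; }
    T& operator[](std::size_t i) { return m_it[i * m_interval_size]; }
};

// Row-major 2D array backed by a flat vector.
template <typename T, typename A = std::allocator<T>>
class Array2D
{
    std::vector<T, A> m_data;
    std::size_t m_width, m_height;

public:
    Array2D(std::size_t w, std::size_t h)
        : m_data(w * h), m_width(w), m_height(h) {}

    T& At(std::size_t r, std::size_t c) { return m_data[r * m_width + c]; }
    std::size_t Height() const { return m_height; }

    auto VectoriseRow(std::size_t r) { return SubVector(m_data, r * m_width, m_width); }
    auto VectoriseColumn(std::size_t c) { return SubVector(m_data, c, m_height, m_width); }
};

// Matrix-vector product with one thread per row, as in the call site above.
std::vector<int> Multiply(Array2D<int>& m, const std::vector<int>& v)
{
    std::vector<int> result(m.Height());
    std::vector<std::thread> threads;

    for (std::size_t i = 0; i < m.Height(); ++i)
        threads.emplace_back([&m, &v, &result, i] {
            result[i] = DotProduct(m.VectoriseRow(i), v);
        });

    for (auto& t : threads)
        t.join();

    return result;
}
```

This needs C++17 (for class template argument deduction in `SubVector(m_data, ...)`) and -pthread on gcc/clang.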


