How to Speed Up Matrix Multiplication in C++

How to speed up matrix multiplication in C++?

Pass the parameters by const reference to start with:

matrix mult_std(matrix const& a, matrix const& b) {
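For contrast, here are the two signatures side by side (a minimal sketch, not the poster's full code; passing by value copies every element of the matrix on each call):

matrix mult_std(matrix a, matrix b);                  // copies both operands
matrix mult_std(matrix const& a, matrix const& b);    // no copies, read-only access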

To give you more detail, we need to know the details of the other methods you used.

And to answer why the original method is 4 times faster we would need to see the original method.

The problem is almost certainly in your code, as this problem has been solved a million times before.

Also, when asking this type of question, ALWAYS provide compilable source with appropriate inputs so we can actually build and run the code and see what is happening.

Without the code we are just guessing.

Edit

After fixing the main bug in the original C code (a buffer over-run), I have updated the code to run the two tests side by side in a fair comparison:

// INCLUDES -------------------------------------------------------------------
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

// DEFINES -------------------------------------------------------------------
// The original problem was here. MAXDIM was 500, but the arrays had a size of
// 512 in each dimension. This caused a buffer overrun that overwrote the dim
// variable and reset it to 0, which made the multiplication loop fall out
// before it had finished (as the loop was controlled by this global variable).
//
// Everything now uses the MAXDIM variable directly.
// This of course gives the C code an advantage, as the compiler can optimize
// the loop explicitly for the fixed-size arrays and thus unroll loops more
// efficiently.
#define MAXDIM 512
#define RUNS   10

// MATRIX FUNCTIONS ----------------------------------------------------------
class matrix
{
    public:
        matrix(int dim)
            : dim_(dim)
        {
            data_ = new int[dim_ * dim_];
        }
        ~matrix()
        {
            delete[] data_;   // release the storage
        }

        inline int dim() const {
            return dim_;
        }
        inline int& operator()(unsigned row, unsigned col) {
            return data_[dim_ * row + col];
        }
        inline int operator()(unsigned row, unsigned col) const {
            return data_[dim_ * row + col];
        }

    private:
        int  dim_;
        int* data_;
};

// ---------------------------------------------------
void random_matrix(int (&matrix)[MAXDIM][MAXDIM]) {
    for (int r = 0; r < MAXDIM; r++)
        for (int c = 0; c < MAXDIM; c++)
            matrix[r][c] = rand() % 100;
}
void random_matrix_class(matrix& matrix) {
    for (int r = 0; r < matrix.dim(); r++)
        for (int c = 0; c < matrix.dim(); c++)
            matrix(r, c) = rand() % 100;
}

template<typename T, typename M>
float run(T f, M const& a, M const& b, M& c)
{
    float time = 0;

    for (int i = 0; i < RUNS; i++) {
        struct timeval start, end;
        gettimeofday(&start, NULL);
        f(a, b, c);
        gettimeofday(&end, NULL);

        long s = start.tv_sec * 1000 + start.tv_usec / 1000;
        long e = end.tv_sec   * 1000 + end.tv_usec   / 1000;

        time += e - s;
    }
    return time / RUNS;
}

// SEQ MULTIPLICATION ----------------------------------------------------------
void mult_seq(int const (&a)[MAXDIM][MAXDIM], int const (&b)[MAXDIM][MAXDIM], int (&z)[MAXDIM][MAXDIM]) {
    for (int r = 0; r < MAXDIM; r++) {
        for (int c = 0; c < MAXDIM; c++) {
            z[r][c] = 0;
            for (int i = 0; i < MAXDIM; i++)
                z[r][c] += a[r][i] * b[i][c];
        }
    }
}
void mult_std(matrix const& a, matrix const& b, matrix& z) {
    for (int r = 0; r < a.dim(); r++) {
        for (int c = 0; c < a.dim(); c++) {
            z(r, c) = 0;
            for (int i = 0; i < a.dim(); i++)
                z(r, c) += a(r, i) * b(i, c);
        }
    }
}

// MAIN ------------------------------------------------------------------------
using namespace std;
int main(int argc, char* argv[]) {
    srand(time(NULL));

    // Note: three 512x512 int arrays use 3 MB of stack space.
    int matrix_a[MAXDIM][MAXDIM];
    int matrix_b[MAXDIM][MAXDIM];
    int matrix_c[MAXDIM][MAXDIM];
    random_matrix(matrix_a);
    random_matrix(matrix_b);
    printf("%d ", MAXDIM);
    printf("%f \n", run(mult_seq, matrix_a, matrix_b, matrix_c));

    matrix a(MAXDIM);
    matrix b(MAXDIM);
    matrix c(MAXDIM);
    random_matrix_class(a);
    random_matrix_class(b);
    printf("%d ", MAXDIM);
    printf("%f \n", run(mult_std, a, b, c));

    return 0;
}

The results now:

$ g++ t1.cpp
$ ./a.exe
512 1270.900000
512 3308.800000

$ g++ -O3 t1.cpp
$ ./a.exe
512 284.900000
512 622.000000

From this we see that the C code is about twice as fast as the C++ code when fully optimized. I cannot see the reason for the remaining gap in the code itself.

Improving the performance of Matrix Multiplication

There are many, many things you can do to improve the efficiency of matrix multiplication.

To examine how to improve the basic algorithm, let's first take a look at our current options. The naive implementation has 3 nested loops and a time complexity on the order of O(n^3). There is another method, Strassen's method, which achieves an asymptotic improvement to roughly O(n^2.807), but in practice it rarely pays off, since its overhead and irregular memory access offer little room for the low-level optimizations that actually matter.

This is in theory. Now consider how matrices are stored in memory. Row-major is the standard in C and C++, but you will find column-major too (in Fortran, for example). Depending on the scheme, transposing one of your matrices can improve speed by reducing cache misses, since strided column walks become sequential row walks. Matrix multiplication, in the end, is just a bunch of vector dot products and additions; the same row is reused against many columns, so it makes sense to keep that vector in cache for faster access.
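As a minimal sketch of the transposition idea (plain row-major float arrays are assumed here, not any particular class from the code above):

// The naive inner loop walks b down a column (stride n), which is
// cache-hostile. Transposing b first makes both inner-loop reads sequential.
void mult_transposed(const float* a, const float* b, float* c,
                     float* bt /* n*n scratch buffer */, int n) {
    for (int r = 0; r < n; r++)                  // bt = transpose(b)
        for (int col = 0; col < n; col++)
            bt[col * n + r] = b[r * n + col];

    for (int r = 0; r < n; r++)
        for (int col = 0; col < n; col++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)          // both reads are unit-stride
                sum += a[r * n + k] * bt[col * n + k];
            c[r * n + col] = sum;
        }
}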

Now, with the introduction of multiple cores, parallelism, and the cache hierarchy, the game changes. If we look a little more closely, we see that a dot product is nothing but a bunch of multiplications followed by summations. These multiplications can be done in parallel, so we can now look at loading and multiplying several numbers at once.

Now let's make things a little more complicated. Single-precision and double-precision floating point differ in size: typically 32 bits for the former and 64 bits for the latter. A CPU has only a fixed number of registers, so the bigger your numbers, the fewer of them fit at once. The moral of the story: stick to single-precision floating point unless you really need double.
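To make the register argument concrete (a small sketch using SSE2 types; it assumes an x86 target):

#include <emmintrin.h>   // SSE2 intrinsics

void width_demo(void) {
    // A 128-bit SSE register holds 4 floats but only 2 doubles, so each
    // vector instruction does twice as much work in single precision.
    __m128  four_floats = _mm_set1_ps(1.0f);   // 4 x 32-bit float lanes
    __m128d two_doubles = _mm_set1_pd(1.0);    // 2 x 64-bit double lanes
    (void)four_floats;
    (void)two_doubles;
}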

Now that we've gone through the basics of tuning matrix multiplication, worry not: you need to do none of this yourself, since there are already subroutines that do it. As mentioned in the comments, there's GotoBLAS, OpenBLAS, Intel's MKL, and Apple's Accelerate framework. MKL and Accelerate are proprietary, but OpenBLAS is a very competitive alternative.

Here's a nice little example that multiplies two 8k x 8k matrices in a few seconds on my Macintosh (the program prints its timings in seconds):

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <Accelerate/Accelerate.h>

int SIZE = 8192;

typedef float point_t;

point_t* transpose(point_t* A) {
    point_t* At = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));
    vDSP_mtrans(A, 1, At, 1, SIZE, SIZE);
    return At;
}

point_t* dot(point_t* A, point_t* B) {
    point_t* C = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));
    int step = (SIZE * SIZE / 4);

    // The product is computed in four row blocks of SIZE/4 rows each.
    cblas_sgemm(CblasRowMajor,
                CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
                1.0, &A[0], SIZE, B, SIZE, 0.0, &C[0], SIZE);

    cblas_sgemm(CblasRowMajor,
                CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
                1.0, &A[step], SIZE, B, SIZE, 0.0, &C[step], SIZE);

    cblas_sgemm(CblasRowMajor,
                CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
                1.0, &A[step * 2], SIZE, B, SIZE, 0.0, &C[step * 2], SIZE);

    cblas_sgemm(CblasRowMajor,
                CblasNoTrans, CblasNoTrans, SIZE/4, SIZE, SIZE,
                1.0, &A[step * 3], SIZE, B, SIZE, 0.0, &C[step * 3], SIZE);

    return C;
}

void print(point_t* A) {
    int i, j;
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            printf("%f ", A[i * SIZE + j]);
        }
        printf("\n");
    }
}

int main() {
    for (; SIZE <= 8192; SIZE *= 2) {
        point_t* A = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));
        point_t* B = (point_t*) calloc(SIZE * SIZE, sizeof(point_t));

        srand(getpid());

        int i;
        for (i = 0; i < SIZE * SIZE; i++) {
            A[i] = ((point_t)rand() / (double)RAND_MAX);
            B[i] = ((point_t)rand() / (double)RAND_MAX);
        }

        struct timeval t1, t2;
        double elapsed_time;

        gettimeofday(&t1, NULL);
        point_t* C = dot(A, B);
        gettimeofday(&t2, NULL);

        elapsed_time  = (t2.tv_sec  - t1.tv_sec)  * 1000.0;  // sec to ms
        elapsed_time += (t2.tv_usec - t1.tv_usec) / 1000.0;  // us to ms

        printf("Time taken for %d size matrix multiplication: %lf seconds\n",
               SIZE, elapsed_time / 1000.0);

        free(A);
        free(B);
        free(C);
    }
    return 0;
}

At this point I should also mention SSE (Streaming SIMD Extensions), which is something you probably shouldn't hand-roll unless you've worked with assembly. Essentially, you vectorise your C code to work on small vectors of values instead of single scalars, so you operate on blocks of data at a time. When you use intrinsics, the compiler translates them more or less directly instead of devising its own vectorization. Done right, this can speed up your code like nothing else, by a constant factor of roughly the SIMD width (it does not change the O(n^3) complexity, though). But it is easy to misuse SSE, and most people unfortunately do, making the end result worse than before.
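For a taste of what this looks like (a sketch with SSE intrinsics; it assumes unit-stride rows, e.g. the transposed-B layout from earlier, and a length divisible by 4):

#include <xmmintrin.h>   // SSE intrinsics

// Dot product of two float rows of length n using 128-bit registers:
// four multiplies and four adds per loop iteration instead of one.
float dot_sse(const float* x, const float* y, int n) {
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(vx, vy));
    }
    // Horizontal sum of the four partial sums.
    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}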

I hope this motivates you to dig deeper. The world of matrix multiplication is a large and fascinating one. Below, I attach links for further reading.

  1. OpenBLAS
  2. More about SSE
  3. Intel Intrinsics

Speeding up matrix multiplication using SIMD and OpenMP

Parallelism can only give you so much. Furthermore, the more the sequential code is optimized, the smaller the relative gains from parallelism are likely to be. Nevertheless, one thing you can improve -- I have done it in the past and it helped a lot -- is to divide the matrix multiplication into smaller blocks, so that it becomes a series of multiplications of smaller sub-matrices (tiles).

By dividing the matrix multiplication into smaller blocks, where you perform the multiplication of smaller sub-matrices, you improve the use of the cache: both temporal locality and spatial locality. You need to divide the matrices according to the sizes of the cache levels (e.g., L1, L2, and L3) of the architecture you are using. You can look in more detail at these slides about cache blocking and matrix multiplication; What Every Programmer Should Know About Memory also has a vectorized cache-blocked matmul in an appendix.

For example, if you have an L3 cache, which is shared among cores, you can load multiple columns of matrix B into the L3 cache and reuse those values to perform the matrix multiplication of smaller tiles that fit in the L1 and L2 caches. You can go even further and divide those tiles into even smaller tiles to take advantage of the registers.
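A minimal sketch of the tiling idea (row-major float matrices are assumed, and BLOCK is a placeholder you would tune to your cache sizes):

// Cache-blocked (tiled) multiply: each BLOCK x BLOCK tile of A, B and C is
// small enough to stay resident in a fast cache level while it is reused.
const int BLOCK = 64;   // placeholder; tune to the L1/L2 sizes

void mult_blocked(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n * n; i++)
        c[i] = 0.0f;
    for (int rb = 0; rb < n; rb += BLOCK)
        for (int kb = 0; kb < n; kb += BLOCK)
            for (int cb = 0; cb < n; cb += BLOCK)
                // multiply one tile pair, accumulating into the C tile
                for (int r = rb; r < rb + BLOCK && r < n; r++)
                    for (int k = kb; k < kb + BLOCK && k < n; k++) {
                        float ark = a[r * n + k];
                        for (int col = cb; col < cb + BLOCK && col < n; col++)
                            c[r * n + col] += ark * b[k * n + col];
                    }
}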

After having optimized the memory usage of the matrix multiplication, you can try to get extra speedup from multithreading. If you have a cluster of multicore machines, you can parallelize using MPI + OpenMP; of course, at this point you will encounter another bottleneck: communication among processes.
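On a single machine, the OpenMP part can be as simple as parallelizing the outer row loop (a sketch; compile with -fopenmp on GCC or Clang):

#include <omp.h>

// Each thread gets its own chunk of output rows, so no synchronization is
// needed on the writes to c. Assumes square n x n row-major matrices.
void mult_omp(const float* a, const float* b, float* c, int n) {
    #pragma omp parallel for
    for (int r = 0; r < n; r++) {
        for (int col = 0; col < n; col++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += a[r * n + k] * b[k * n + col];
            c[r * n + col] = sum;
        }
    }
}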

This all depends on the architecture your code runs on. On a NUMA architecture, for instance, you also have to factor in local versus non-local memory. You can also explore the GPU route: Matrix-Matrix Multiplication on the GPU with Nvidia CUDA.

Have a look at BLAS to get a good insight into efficient code.

Initialising two large 2D arrays of 401x401 and doing fast matrix multiplication in C++

This can be sped up considerably.

First, to eliminate the stack usage issue, declare the matrix like so:

std::vector<std::array<double,401>> matrix(401);

This keeps all the matrix memory contiguous, which makes the accesses cache-friendly.

Next, transpose one of the matrices. This will make the multiplication proceed with sequential memory accesses.

And finally, this lends itself to further speedup through multithreading. For instance, create 4 threads that each process 100, 100, 100, and 101 rows. No thread synchronization is required, since all writes are specific to each thread. Just join them all at the end and you're done.

I was curious and hacked some code to time 4 different conditions.

  • Simple matrix multiply
  • Matrix multiply with transpose for memory efficiency
  • Matrix multiply with 4 threads
  • Matrix multiply with 4 threads and transpose

The results for the vector<array> vs. vector<vector> structures are:

vector<array> structure
Basic multiply            0.0955 secs.
Multiply with transpose   0.0738 secs.
4 Threads                 0.0268 secs.
4 Threads w transpose     0.0164 secs.

vector<vector> structure
Basic multiply            0.1263 secs.
Multiply with transpose   0.0739 secs.
4 Threads                 0.0346 secs.
4 Threads w transpose     0.0167 secs.

// Supporting definitions assumed by the snippet below -- the original answer
// did not show them, so this preamble is a plausible reconstruction (C++17).
#include <array>
#include <chrono>
#include <cstddef>
#include <future>
#include <iomanip>
#include <iostream>
#include <random>
#include <vector>

using std::vector;

constexpr size_t N = 401;                       // matrix dimension
constexpr size_t THREADS = 4;                   // worker count
constexpr auto launchType = std::launch::async; // run tasks on real threads
using Float = double;
using Matrix = std::vector<std::array<Float, N>>;

Matrix create_matrix() { return Matrix(N); }    // N x N, zero-initialized

// Minimal stopwatch: starts on construction, converts to elapsed seconds.
struct Timer {
    std::chrono::steady_clock::time_point start_ = std::chrono::steady_clock::now();
    operator double() const {
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - start_).count();
    }
};

Matrix product(Matrix const& v0, Matrix const& v1, double& time, int mode = 0)
{
    Matrix ret = create_matrix();
    Matrix v2 = create_matrix();
    Timer timer;
    if (mode == 0)          // basic matrix multiply
    {
        for (size_t row = 0; row < N; row++)
            for (size_t col = 0; col < N; col++)
                for (size_t col_t = 0; col_t < N; col_t++)
                    ret[row][col] += v0[row][col_t] * v1[col_t][col];
    }
    else if (mode == 1)     // multiply after transposing v1 for better memory access
    {
        for (size_t r = 0; r < N; r++)      // transpose first
            for (size_t c = 0; c < N; c++)
                v2[c][r] = v1[r][c];
        for (size_t row = 0; row < N; row++)
            for (size_t col = 0; col < N; col++)
                for (size_t col_t = 0; col_t < N; col_t++)
                    ret[row][col] += v0[row][col_t] * v2[col][col_t];
    }
    else if (mode == 2)     // matrix multiply using threads
    {
        // lambda to process sets of rows in separate threads
        auto do_row_set = [&v0, &v1, &ret](size_t start, size_t last) {
            for (size_t row = start; row < last; row++)
                for (size_t col = 0; col < N; col++)
                    for (size_t col_t = 0; col_t < N; col_t++)
                        ret[row][col] += v0[row][col_t] * v1[col_t][col];
        };

        vector<size_t> seq;
        const size_t NN = N / THREADS;
        // build the row boundaries each worker will process
        for (size_t thread_n = 0; thread_n < N; thread_n += NN)
            seq.push_back(thread_n);
        seq.push_back(N);
        vector<std::future<void>> results;
        results.reserve(seq.size() - 1);
        for (size_t i = 0; i < seq.size() - 1; i++)
            results.emplace_back(std::async(launchType, do_row_set, seq[i], seq[i + 1]));
        for (auto& x : results)
            x.get();
    }
    else                    // threads plus transpose
    {
        for (size_t r = 0; r < N; r++)      // transpose first
            for (size_t c = 0; c < N; c++)
                v2[c][r] = v1[r][c];
        // lambda to process sets of rows in separate threads
        auto do_row_set = [&v0, &v2, &ret](size_t start, size_t last) {
            for (size_t row = start; row < last; row++)
                for (size_t col = 0; col < N; col++)
                    for (size_t col_t = 0; col_t < N; col_t++)
                        ret[row][col] += v0[row][col_t] * v2[col][col_t];
        };

        vector<size_t> seq;
        const size_t NN = N / THREADS;
        // build the row boundaries each worker will process
        for (size_t thread_n = 0; thread_n < N; thread_n += NN)
            seq.push_back(thread_n);
        seq.push_back(N);
        vector<std::future<void>> results;
        results.reserve(seq.size() - 1);
        for (size_t i = 0; i < seq.size() - 1; i++)
            results.emplace_back(std::async(launchType, do_row_set, seq[i], seq[i + 1]));
        for (auto& x : results)
            x.get();
    }
    time = timer;
    return ret;
}

bool operator==(Matrix const& v0, Matrix const& v1)
{
    for (size_t r = 0; r < N; r++)
        for (size_t c = 0; c < N; c++)
            if (v0[r][c] != v1[r][c])
                return false;
    return true;
}

int main()
{
    auto fill = [](Matrix& v) {
        std::mt19937_64 r(1);
        std::normal_distribution dist(1.);
        for (size_t row = 0; row < N; row++)
            for (size_t col = 0; col < N; col++)
                v[row][col] = Float(dist(r));
    };
    Matrix m1 = create_matrix(), m2 = create_matrix(), m3 = create_matrix(),
           m4 = create_matrix(), m5 = create_matrix(), m6 = create_matrix();
    fill(m1); fill(m2);

    auto process_test = [&m1, &m2](int mode, Matrix& out) {
        const int rpt_count = 50;
        double sum = 0;
        for (int i = 0; i < rpt_count; i++)
        {
            double time;
            out = product(m1, m2, time, mode);
            sum += time / rpt_count;
        }
        return sum;
    };

    std::cout << std::fixed << std::setprecision(4);
    double time{};
    time = process_test(0, m3);
    std::cout << "Basic multiply " << time << " secs.\n";
    time = process_test(1, m4);
    std::cout << "Multiply with transpose " << time << " secs.\n";
    time = process_test(2, m5);
    std::cout << "4 Threads " << time << " secs.\n";
    time = process_test(3, m6);
    std::cout << "4 Threads w transpose " << time << " secs.\n";
    if (!(m3 == m4 && m3 == m5 && m3 == m6))
        std::cout << "Error\n";
}

