Performance Tradeoff - When Is MATLAB Faster/Slower Than C/C++

Why is MATLAB faster than C++ at generating random numbers?

When you call rand(5000,5000) in MATLAB, MATLAB executes the command by calling the Intel MKL library, a highly optimized library written in C/C++ with lots of hand-coded assembly.

MKL should be faster than any straightforward C++ implementation, but there is an overhead for MATLAB to call an external library. The net result is that for random number generation at small sizes (less than 1K elements, for instance), a plain C/C++ implementation will be faster, but at larger sizes MATLAB benefits from the highly optimized MKL.
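
For comparison, here is a minimal sketch of the kind of straightforward, single-threaded C++ baseline the answer has in mind. The generator choice (std::mt19937_64) and the seed are illustrative assumptions; MKL's generators, by contrast, are vectorized and multithreaded:

#include <chrono>
#include <iostream>
#include <random>
#include <vector>

int main()
{
    const std::size_t n = 5000;
    std::vector<double> a(n * n);          // ~200 MB of doubles

    std::mt19937_64 gen(42);               // fixed seed for reproducibility
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    auto start = std::chrono::high_resolution_clock::now();
    for (double& v : a)
        v = dist(gen);                     // one value at a time: no SIMD, one thread
    auto finish = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double> elapsed = finish - start;
    std::cout << "5000x5000 fill: " << elapsed.count() << " s\n";
    return 0;
}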

Performance: MATLAB vs C++ matrix-vector multiplication

As said in the comments, MATLAB relies on Intel's MKL library for matrix products, which is the fastest library for this kind of operation. Nonetheless, Eigen alone should be able to deliver similar performance. To that end, make sure to use a recent Eigen (e.g. 3.4) and the proper compilation flags to enable AVX/FMA (if available) and multithreading:

-O3 -DNDEBUG -march=native

Since charges_ is a vector, it is better to use a VectorXd so that Eigen knows you want a matrix-vector product and not a matrix-matrix one.
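
For illustration, a minimal sketch (the name charges_ comes from the question; the size n here is hypothetical):

#include <Eigen/Dense>
using Eigen::MatrixXd;
using Eigen::VectorXd;

int main()
{
    const int n = 2000;                        // hypothetical size
    MatrixXd kernel   = MatrixXd::Random(n, n);
    VectorXd charges_ = VectorXd::Random(n);

    // Declaring charges_ as VectorXd (rather than an n-by-1 MatrixXd) lets
    // Eigen dispatch this expression to its specialized matrix-vector kernel.
    VectorXd result = kernel * charges_;
    return 0;
}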

If you have Intel's MKL, then you can also let Eigen use it, to get exactly the same performance as MATLAB for this precise operation.
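
A sketch of how that is enabled (the EIGEN_USE_MKL_ALL macro is documented by Eigen; setting up the link against MKL, e.g. via its single dynamic library, is assumed to be done separately):

// Define before including any Eigen header to route supported
// dense operations (including matrix-vector products) through MKL.
#define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>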

Regarding the assembly of the kernel matrix, it is better to swap the two loops to enable vectorization, then enable multithreading with OpenMP (add -fopenmp to the compiler flags) so that the outermost loop runs in parallel; finally, you can simplify your code using Eigen:

#include <Eigen/Dense>
using Eigen::ArrayXd;
using Eigen::MatrixXd;

// Builds the M-by-N matrix of pairwise Euclidean distances between the
// 2-D points stored in x (as [x0..., x1...]) and y (as [y0..., y1...]).
void kernel_2D(const unsigned long M, double* x,
               const unsigned long N, double* y, MatrixXd& kernel)
{
    kernel.resize(M, N);
    auto x0 = ArrayXd::Map(x, M);
    auto x1 = ArrayXd::Map(x + M, M);
    auto y0 = ArrayXd::Map(y, N);
    auto y1 = ArrayXd::Map(y + N, N);
    #pragma omp parallel for
    for (unsigned long j = 0; j < N; ++j)    // columns are independent
        kernel.col(j) = sqrt((x0 - y0(j)).abs2() + (x1 - y1(j)).abs2());
}
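
A possible compile line for this snippet, combining the flags suggested above (Eigen's headers are assumed to be on the include path):

g++ -O3 -DNDEBUG -march=native -fopenmp -c kernel_2D.cpp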

With multithreading you need to measure the wall-clock time. Here (a Haswell with 4 physical cores running at 2.6 GHz) the assembly time drops to 0.36 s for N=20000, and the matrix-vector product takes 0.24 s, so 0.6 s in total, which is faster than MATLAB even though my CPU seems to be slower than yours.
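
For reference, a minimal sketch of such a wall-clock measurement around a call to kernel_2D (omp_get_wtime is part of the OpenMP runtime and measures wall-clock, not CPU, time; the input data here is a dummy placeholder):

#include <omp.h>
#include <iostream>
#include <vector>
#include <Eigen/Dense>

// kernel_2D as defined above is assumed to be in scope.

int main()
{
    const unsigned long M = 20000, N = 20000;  // sizes from the measurement above
    std::vector<double> x(2 * M, 1.0);         // dummy point data: [x0..., x1...]
    std::vector<double> y(2 * N, 2.0);         // dummy point data: [y0..., y1...]
    Eigen::MatrixXd kernel;                    // ~3.2 GB for 20000 x 20000 doubles

    double t0 = omp_get_wtime();               // wall clock before
    kernel_2D(M, x.data(), N, y.data(), kernel);
    double t1 = omp_get_wtime();               // wall clock after

    std::cout << "assembly: " << (t1 - t0) << " s\n";
    return 0;
}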

Why is MATLAB so fast in matrix multiplication?

Here are my results using MATLAB R2011a + Parallel Computing Toolbox on a machine with a Tesla C2070:

>> A = rand(1024); gA = gpuArray(A);
% warm up by executing the operations a couple of times, and then:
>> tic, C = A * A; toc
Elapsed time is 0.075396 seconds.
>> tic, gC = gA * gA; toc
Elapsed time is 0.008621 seconds.

MATLAB uses highly optimized libraries for matrix multiplication, which is why plain MATLAB matrix multiplication is so fast. The gpuArray version uses MAGMA.

Update using R2014a on a machine with a Tesla K20c, and the new timeit and gputimeit functions:

>> A = rand(1024); gA = gpuArray(A);
>> timeit(@()A*A)
ans =
0.0324
>> gputimeit(@()gA*gA)
ans =
0.0022

Update using R2018b on a WIN64 machine with 16 physical cores and a Tesla V100:

>> timeit(@()A*A)
ans =
0.0229
>> gputimeit(@()gA*gA)
ans =
4.8019e-04

(NB: at some point, I forget exactly when, gpuArray switched from MAGMA to cuBLAS; MAGMA is still used for some gpuArray operations, though.)

Update using R2022a on a WIN64 machine with 32 physical cores and an A100 GPU:

>> timeit(@()A*A)
ans =
0.0076
>> gputimeit(@()gA*gA)
ans =
2.5344e-04

How can I evaluate the execution time of a specific portion of MATLAB code?

The most convenient way is to use the GUI profiler tool. You can find it in the dropdown menus (Desktop -> Profiler), or you can start it from the command line by typing profile viewer. Then you enter the name of the function at the top of the window, hit "run", and wait until the code has finished running. Clicking on the links brings you into the respective function, where you can see the runtime line by line.

Note that timing code that runs very fast and for only a handful of iterations can be tricky; for these cases you may want to use the timeit function, originally from the MATLAB File Exchange (it later became a built-in function).


Does the Armadillo library slow down the execution of matrix operations?

First of all, make sure that BLAS and LAPACK are enabled; there are instructions in the Armadillo documentation.
The second possibility is that Armadillo performs more extensive memory allocation. If you restructure your code to do the memory allocation first, as in:

#include <iostream>
#include <chrono>
#include <armadillo>

using namespace std;
using namespace arma;

int main()
{
    // Plain C arrays and Armadillo matrices, all allocated up front,
    // so that only the additions themselves are timed.
    double a[100][100];
    double b[100][100];
    double c[100][100];
    mat a1 = ones(100, 100);
    mat b1 = ones(100, 100);
    mat c1(100, 100);

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 100; i++)
    {
        for (int j = 0; j < 100; j++)
        {
            a[i][j] = 1;
            b[i][j] = 1;
            c[i][j] = a[i][j] + b[i][j];
        }
    }
    auto finish = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = finish - start;
    std::cout << "Elapsed time: " << elapsed.count() << " s\n";

    auto start1 = std::chrono::high_resolution_clock::now();
    c1 = a1 + b1;
    auto finish1 = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed1 = finish1 - start1;
    std::cout << "Elapsed time: " << elapsed1.count() << " s\n";

    return 0;
}

With this I got the result:

Elapsed time: 0.000647521 s
Elapsed time: 0.000353198 s

I compiled it with (in Ubuntu 17.10):
g++ prog.cpp -larmadillo
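
Note that this compile line uses no optimization flags; for a meaningful benchmark you would normally add at least -O2. If your Armadillo installation does not pull in BLAS/LAPACK through its runtime library, you may also need to link them explicitly. A hypothetical example:

g++ -O2 prog.cpp -larmadillo -lblas -llapack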


