Why is MATLAB faster than C++ at creating random numbers?
When you call rand(5000,5000) in MATLAB, MATLAB executes the command by calling the Intel MKL library, a highly optimized library written in C/C++ with lots of hand-coded assembly.
MKL should be faster than any straightforward C++ implementation, but there is overhead when MATLAB calls an external library. The net result is that for random number generation at smaller sizes (less than 1K, for instance), a plain C/C++ implementation will be faster, but at larger sizes MATLAB benefits from the highly optimized MKL.
Performance: MATLAB vs C++ matrix-vector multiplication
As said in the comments, MATLAB relies on Intel's MKL library for matrix products, which is the fastest library for this kind of operation. Nonetheless, Eigen alone should be able to deliver similar performance. To this end, make sure to use a recent Eigen (e.g. 3.4) and the proper compilation flags to enable AVX/FMA if available:
-O3 -DNDEBUG -march=native
Since charges_ is a vector, better use a VectorXd so that Eigen knows you want a matrix-vector product and not a matrix-matrix one.
If you have Intel's MKL, then you can also let Eigen use it to get exactly the same performance as MATLAB for this precise operation.
Regarding the assembly, it is better to swap the two loops to enable vectorization, then enable multithreading with OpenMP (add -fopenmp to the compiler flags) to make the outermost loop run in parallel; finally, you can simplify your code using Eigen:
void kernel_2D(const unsigned long M, double* x, const unsigned long N, double* y, MatrixXd& kernel) {
    kernel.resize(M,N);
    auto x0 = ArrayXd::Map(x,M);
    auto x1 = ArrayXd::Map(x+M,M);
    auto y0 = ArrayXd::Map(y,N);
    auto y1 = ArrayXd::Map(y+N,N);
    #pragma omp parallel for
    for(unsigned long j=0;j<N;++j)
        kernel.col(j) = sqrt((x0-y0(j)).abs2() + (x1-y1(j)).abs2());
}
With multithreading you need to measure wall-clock time. Here (Haswell with 4 physical cores running at 2.6GHz) the assembly time drops to 0.36s for N=20000, and the matrix-vector products take 0.24s, so 0.6s in total, which is faster than MATLAB even though my CPU seems to be slower than yours.
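The loop-swap point can also be shown without Eigen. A plain-C++ sketch of the same distance-kernel assembly (illustrative only; it assumes the same packed layout as above, with x holding the two coordinate arrays x0 and x1 back to back, and column-major storage for the result so the inner loop walks contiguous memory):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Column-major M x N kernel matrix. Putting the j loop outermost means the
// inner i loop writes kernel[i + j*M] contiguously, which the compiler can
// vectorize; the outer loop is the one OpenMP would parallelize
// ("#pragma omp parallel for").
std::vector<double> kernel_2D(std::size_t M, const double *x,
                              std::size_t N, const double *y) {
    const double *x0 = x, *x1 = x + M;   // packed coordinate arrays
    const double *y0 = y, *y1 = y + N;
    std::vector<double> kernel(M * N);
    for (std::size_t j = 0; j < N; ++j)
        for (std::size_t i = 0; i < M; ++i) {
            double dx = x0[i] - y0[j], dy = x1[i] - y1[j];
            kernel[i + j * M] = std::sqrt(dx * dx + dy * dy);
        }
    return kernel;
}
```

With the loops in the other order the inner loop would stride by M through memory, defeating both vectorization and the cache.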
Why is MATLAB so fast in matrix multiplication?
Here are my results using MATLAB R2011a + Parallel Computing Toolbox on a machine with a Tesla C2070:
>> A = rand(1024); gA = gpuArray(A);
% warm up by executing the operations a couple of times, and then:
>> tic, C = A * A; toc
Elapsed time is 0.075396 seconds.
>> tic, gC = gA * gA; toc
Elapsed time is 0.008621 seconds.
MATLAB uses highly optimized libraries for matrix multiplication, which is why plain MATLAB matrix multiplication is so fast. The gpuArray version uses MAGMA.
Update using R2014a on a machine with a Tesla K20c, and the new timeit and gputimeit functions:
>> A = rand(1024); gA = gpuArray(A);
>> timeit(@()A*A)
ans =
0.0324
>> gputimeit(@()gA*gA)
ans =
0.0022
Update using R2018b on a WIN64 machine with 16 physical cores and a Tesla V100:
>> timeit(@()A*A)
ans =
0.0229
>> gputimeit(@()gA*gA)
ans =
4.8019e-04
(NB: at some point (I forget when exactly) gpuArray switched from MAGMA to cuBLAS - MAGMA is still used for some gpuArray operations though)
Update using R2022a on a WIN64 machine with 32 physical cores and an A100 GPU:
>> timeit(@()A*A)
ans =
0.0076
>> gputimeit(@()gA*gA)
ans =
2.5344e-04
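To see what "highly optimized libraries" are being compared against, here is a naive triple-loop multiply as a hedged baseline (not what MATLAB does; BLAS implementations add cache blocking, SIMD, and multithreading on top of this):

```cpp
#include <cstddef>
#include <vector>

// Naive C = A * B for row-major n x n matrices. The i-k-j loop order keeps
// the inner loop contiguous in both B and C, which is already friendlier to
// the cache than the textbook i-j-k order -- yet still typically an order of
// magnitude or more slower than MKL or cuBLAS at large n.
std::vector<double> matmul(const std::vector<double> &A,
                           const std::vector<double> &B, std::size_t n) {
    std::vector<double> C(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            double aik = A[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += aik * B[k * n + j];
        }
    return C;
}
```

Timing this at n = 1024 against the numbers above makes the gap concrete: the library versions win not by a better algorithm asymptotically, but by exploiting the memory hierarchy and the hardware's parallelism.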
Evaluating the runtime of a specific portion of MATLAB code?
The most convenient way is to use the GUI profiler tool. You can find it in the dropdown menus (Desktop -> Profiler), or you can start it from the command line by typing profile viewer. Then you enter the name of the function at the top of the window, hit "Run", and wait until the code finishes running. Clicking on the links takes you into the respective function, where you can see the runtime line by line.
Note that timing code that runs very fast and for only a handful of iterations can be tricky; for these cases you may want to use the timeit function from the MATLAB File Exchange.
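The idea behind timeit -- run the code several times and report a robust statistic instead of a single tic/toc -- can be sketched in C++ as well (an illustrative helper, not MATLAB's actual implementation):

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <vector>

// Run f() `reps` times and return the median wall-clock time in seconds.
// Taking the median over repeats smooths out warm-up effects and scheduler
// noise, which is why it is more reliable than timing a single run of a
// fast function.
double median_time(const std::function<void()> &f, int reps = 11) {
    std::vector<double> samples;
    for (int r = 0; r < reps; ++r) {
        auto t0 = std::chrono::high_resolution_clock::now();
        f();
        auto t1 = std::chrono::high_resolution_clock::now();
        samples.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

For truly tiny workloads you would additionally run the body in an inner loop and divide, since a single call can be shorter than the clock's resolution.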
Does the Armadillo library slow down the execution of matrix operations?
First of all, make sure that the blas and lapack libraries are enabled; there are instructions in the Armadillo documentation.
The second thing is that Armadillo may be doing more memory allocation. If you restructure your code to do the memory initialisation first, as in
#include <iostream>
#include <math.h>
#include <chrono>
#include <armadillo>

using namespace std;
using namespace arma;

int main()
{
    double a[100][100];
    double b[100][100];
    double c[100][100];

    mat a1 = ones(100,100);
    mat b1 = ones(100,100);
    mat c1(100,100);

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 100; i++)
    {
        for (int j = 0; j < 100; j++)
        {
            a[i][j] = 1;
            b[i][j] = 1;
            c[i][j] = a[i][j] + b[i][j];
        }
    }
    auto finish = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = finish - start;
    std::cout << "Elapsed time: " << elapsed.count() << " s\n";

    auto start1 = std::chrono::high_resolution_clock::now();
    c1 = a1 + b1;
    auto finish1 = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed1 = finish1 - start1;
    std::cout << "Elapsed time: " << elapsed1.count() << " s\n";
    return 0;
}
With this I got the result:
Elapsed time: 0.000647521 s
Elapsed time: 0.000353198 s
I compiled it (on Ubuntu 17.10) with: g++ prog.cpp -larmadillo
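The allocation point generalizes beyond Armadillo. A stdlib-only sketch of the two patterns being contrasted (hypothetical helper names; allocation inside the timed region versus a preallocated result buffer):

```cpp
#include <cstddef>
#include <vector>

// Pattern 1: allocate a fresh result on every call -- the allocation cost
// lands inside whatever region you are timing.
std::vector<double> add_alloc(const std::vector<double> &a,
                              const std::vector<double> &b) {
    std::vector<double> c(a.size());   // allocation happens here
    for (std::size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i];
    return c;
}

// Pattern 2: write into a buffer the caller allocated up front, so the
// timed region contains only the arithmetic.
void add_prealloc(const std::vector<double> &a, const std::vector<double> &b,
                  std::vector<double> &c) {
    for (std::size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i];
}
```

This is the same restructuring the answer applies to the Armadillo benchmark: constructing a1, b1, and c1 before the timed section keeps the comparison about the arithmetic rather than the allocator.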