Why Are NumPy Arrays So Fast?

Why are NumPy arrays so fast?

NumPy arrays are densely packed arrays of homogeneous type. Python lists, by contrast, are arrays of pointers to objects, even when all of the objects are of the same type. So with NumPy you get the benefits of locality of reference.

Also, many NumPy operations are implemented in C, avoiding the general cost of loops in Python, pointer indirection, and per-element dynamic type checking. The speed boost depends on which operations you're performing, but a few orders of magnitude isn't uncommon in number-crunching programs.
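As a rough illustration of both effects, here is a minimal timing sketch (my addition, not from the original answer; absolute numbers will vary by machine and NumPy version):

import timeit

setup = 'import numpy as np; lst = list(range(1000000)); arr = np.arange(1000000)'

# Pure-Python loop: pointer chasing plus a dynamic type check per element.
t_list = timeit.timeit('[x * 2 for x in lst]', setup=setup, number=10)

# One C-level loop over a contiguous buffer of machine integers.
t_arr = timeit.timeit('arr * 2', setup=setup, number=10)

print('list comprehension: {:.4f} s'.format(t_list))
print('numpy vectorized:   {:.4f} s'.format(t_arr))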

Why is my Python NumPy code faster than C++?

I found this question interesting, because every time I encountered a similar topic about the speed of NumPy (compared to C/C++), the answers were always along the lines of "it's a thin wrapper, its core is written in C, so it's fast" -- but that doesn't explain why C should be slower than C with an additional layer (even a thin one).

The answer is: your C++ code is not slower than your Python code when properly compiled.

I've done some benchmarks, and at first it seemed that NumPy was surprisingly faster. But I had forgotten to enable optimization when compiling with GCC.

I've computed everything again and also compared the results with a pure C version of your code. I am using GCC version 4.9.2 and Python 2.7.9 (compiled from source with the same GCC). To compile your C++ code I used g++ -O3 main.cpp -o main; to compile my C code I used gcc -O3 main.c -lm -o main. In all examples I filled the data variables with some numbers (0.1, 0.4), since that affects the results. I also changed the np.arrays to use doubles (dtype=np.float64), because the C++ example uses doubles. My pure C version of your code (it's similar):

#include <math.h>
#include <stdio.h>
#include <time.h>

const int k_max = 100000;
const int N = 10000;

int main(void)
{
    clock_t t_start, t_end;
    double data1[N], data2[N], coefs1[k_max], coefs2[k_max], seconds;
    int z;
    for (z = 0; z < N; z++)
    {
        data1[z] = 0.1;
        data2[z] = 0.4;
    }

    int i, j;
    t_start = clock();
    for (i = 0; i < k_max; i++)
    {
        coefs1[i] = 0.0;  /* accumulators must start at zero */
        coefs2[i] = 0.0;
        for (j = 0; j < N-1; j++)
        {
            coefs1[i] += data2[j] * (cos((i+1) * data1[j]) - cos((i+1) * data1[j+1]));
            coefs2[i] += data2[j] * (sin((i+1) * data1[j]) - sin((i+1) * data1[j+1]));
        }
    }
    t_end = clock();

    seconds = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("Time: %f s\n", seconds);
    return (int)coefs1[0];  /* use the result so the loops are not optimized away */
}

For k_max = 100000, N = 10000, the results were as follows:

  • Python 70.284362 s
  • C++ 69.133199 s
  • C 61.638186 s

Python and C++ take basically the same time, but note that there is a Python loop of length k_max, which should be much slower than its C/C++ counterpart. And it is.

For k_max = 1000000, N = 1000 we have:

  • Python 115.42766 s
  • C++ 70.781380 s

For k_max = 1000000, N = 100:

  • Python 52.86826 s
  • C++ 7.050597 s

So the difference increases with the ratio k_max/N, but Python is not faster even for N much bigger than k_max, e.g. k_max = 100, N = 100000:

  • Python 0.651587 s
  • C++ 0.568518 s

Obviously, the main speed difference between C/C++ and Python is in the for loop. But I wanted to find out the difference between simple operations on arrays in NumPy and in C. The advantages of using NumPy in your code consist of: 1. multiplying the whole array by a number, 2. calculating sin/cos of the whole array, 3. summing all elements of the array, instead of doing those operations on every single item separately. So I prepared two scripts to compare only these operations.

Python script:

import numpy as np
from time import time

N = 10000
x_len = 100000

def main():
    x = np.ones(x_len, dtype=np.float64) * 1.2345

    start = time()
    for i in xrange(N):
        y1 = np.cos(x, dtype=np.float64)
    end = time()
    print('cos: {} s'.format(end - start))

    start = time()
    for i in xrange(N):
        y2 = x * 7.9463
    end = time()
    print('multi: {} s'.format(end - start))

    start = time()
    for i in xrange(N):
        res = np.sum(x, dtype=np.float64)
    end = time()
    print('sum: {} s'.format(end - start))

    return y1, y2, res

if __name__ == '__main__':
    main()

C script:

#include <math.h>
#include <stdio.h>
#include <time.h>

const int N = 10000;
const int x_len = 100000;

int main()
{
    clock_t t_start, t_end;
    double x[x_len], y1[x_len], y2[x_len], res, seconds;
    int i, j;
    for (i = 0; i < x_len; i++)
    {
        x[i] = 1.2345;
    }

    t_start = clock();
    for (j = 0; j < N; j++)
    {
        for (i = 0; i < x_len; i++)
        {
            y1[i] = cos(x[i]);
        }
    }
    t_end = clock();
    seconds = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("cos: %f s\n", seconds);

    t_start = clock();
    for (j = 0; j < N; j++)
    {
        for (i = 0; i < x_len; i++)
        {
            y2[i] = x[i] * 7.9463;
        }
    }
    t_end = clock();
    seconds = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("multi: %f s\n", seconds);

    t_start = clock();
    for (j = 0; j < N; j++)
    {
        res = 0.0;
        for (i = 0; i < x_len; i++)
        {
            res += x[i];
        }
    }
    t_end = clock();
    seconds = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("sum: %f s\n", seconds);

    /* use the results so the compiler cannot optimize the loops away */
    return (int)(y1[0] + y2[0] + res);
}

Python results:

  • cos: 22.7199969292 s
  • multi: 0.841291189194 s
  • sum: 1.15971088409 s

C results:

  • cos: 20.910590 s
  • multi: 0.633281 s
  • sum: 1.153001 s

As you can see, NumPy is incredibly fast, but always a bit slower than pure C.

Are numpy arrays faster than lists?

Introducing NumPy into your code means introducing another way of thinking about problems.

NumPy arrays are in general more expensive in the creation phase, but blazing fast for vectorized computation.

I think this is more or less what you are searching for:

from timeit import timeit

timeit(stmt='list(map(math.sqrt, a_list))', setup='import math; a_list = list(range(1, 100000))', number=1000)
# 8.64

vs:

timeit(stmt='numpy.sqrt(a_list)', setup='import numpy; a_list = numpy.arange(100000)', number=1000)
# 0.4

These are the results of timeit, applied only to the computation part, i.e. computing the sqrt while ignoring the time needed to import the libraries.
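One caveat (my addition): the second snippet times only the vectorized sqrt on an array that already exists. If the data starts life as a Python list, the conversion belongs to the "creation phase" mentioned above, and you can charge it to NumPy by moving it inside the timed statement:

from timeit import timeit

# Array creation included in the measurement: the list-to-array
# conversion is paid on every one of the 1000 repetitions.
timeit(stmt='numpy.sqrt(numpy.array(a_list))',
       setup='import numpy; a_list = list(range(1, 100000))',
       number=1000)

# Computation only, as in the snippet above.
timeit(stmt='numpy.sqrt(a_arr)',
       setup='import numpy; a_arr = numpy.arange(1, 100000)',
       number=1000)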

How is numpy so fast?

This is a complement to the answer posted by @celakev. I think I finally understand exactly what the issue was: it was not about allocating the memory in the main function that does the computation.

What was actually taking time was accessing new (fresh) memory. I believe the malloc call returns pages of memory that are virtual, i.e. that do not correspond to actual physical memory until they are explicitly accessed. What actually takes time is the OS-level process of mapping in physical memory on the fly when a page is first touched in the function code.

Here is a demonstration. Consider the following two trivial functions:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

float* just_alloc( size_t N )
{
    return (float*) aligned_alloc( 32, sizeof(float)*N );
}

void just_fill( float* _arr, size_t N )
{
    for (size_t i = 0; i < N; i++)
        _arr[i] = 1;
}

#define Time( code_to_benchmark, cleanup_code ) \
    do { \
        double best = 9e9; \
        for( int i = 0; i < 5; i++) { \
            struct timespec start, stop; \
            clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start); \
            code_to_benchmark; \
            clock_gettime(CLOCK_THREAD_CPUTIME_ID, &stop); \
            double t = (stop.tv_sec - start.tv_sec) * 1e3 + (stop.tv_nsec - start.tv_nsec) / 1e6; \
            printf("Time[%d] = %f ms\n", i, t); \
            if (t < best) best = t; \
            cleanup_code; \
        } \
        printf("Best of 5 for '" #code_to_benchmark "' = %f ms\n\n", best); \
    } while(0)

int main()
{
    const size_t N = 512;

    /* allocation alone: virtual pages are handed out but never touched */
    Time( float* arr = just_alloc(N*N*N), free(arr) );

    /* first fill faults the pages in; later fills reuse mapped memory */
    float* arr = just_alloc(N*N*N);
    Time( just_fill(arr, N*N*N), ; );
    free(arr);

    return 0;
}

I get the following timings, which I now detail for each of the calls:

Time[0] = 0.000931 ms
Time[1] = 0.000540 ms
Time[2] = 0.000523 ms
Time[3] = 0.000524 ms
Time[4] = 0.000521 ms
Best of 5 for 'float* arr = just_alloc(N*N*N)' = 0.000521 ms

Time[0] = 189.822237 ms
Time[1] = 45.041083 ms
Time[2] = 46.331428 ms
Time[3] = 44.729433 ms
Time[4] = 42.241279 ms
Best of 5 for 'just_fill(arr, N*N*N)' = 42.241279 ms

As you can see, allocating memory is blazingly fast, but the first time the memory is accessed it is four to five times slower than all the subsequent times. So, basically, the reason my code was slow was that each time it was reallocating fresh memory that had no physical address yet. (Correct me if I'm wrong, but I think that's the gist of it!)
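A practical consequence (my own sketch, not part of the benchmark above): if you reuse one preallocated buffer instead of requesting fresh memory on every iteration, the page-faulting cost is paid only once. In NumPy, most ufuncs accept an out= argument for exactly this:

import numpy as np

n = 10_000_000  # smaller than the C example, but large enough to see the effect
x = np.full(n, 1.2345)

# Fresh allocation on every call: each result buffer consists of
# untouched virtual pages that the OS must map in on first write.
for _ in range(5):
    y = np.cos(x)

# Reused buffer: pages are faulted in once, then stay mapped.
out = np.empty_like(x)
for _ in range(5):
    np.cos(x, out=out)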

Why numpy arrays are slower than lists with for loops?

NumPy arrays are much faster when you use vectorized functions, or, to put it another way, when you are not iterating over the elements in Python.

Here is an example. We will add 1 to each number in the ordered list you created.

The setup is similar except I use lists instead of ranges so I can operate on the values.

import numpy as np

length = 150000000
my_list = list(range(length))
my_array = np.array(my_list)

Using the %%timeit magic function in Jupyter notebooks:

%%timeit
for i in range(len(my_list)):
    my_list[i] += 1

gives

38.6 s ± 3.71 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

That's about 38 seconds per loop.

On the other hand

%%timeit
new_array = my_array + 1

gives

772 ms ± 58.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The operation on the NumPy array took under one second (~772 ms) per loop, while iterating through the list and adding one to each element took around 38 seconds per loop.
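If you also want to avoid allocating a new result array, NumPy can do the addition in place (a small sketch, my addition; it reuses the existing buffer instead of creating new_array):

import numpy as np

my_array = np.arange(150000000)

# In-place add: no new array is allocated for the result.
my_array += 1    # same as np.add(my_array, 1, out=my_array)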

When do numpy arrays become more efficient than python lists for access operations? (considering number of elements as variable)

Let's do some simple list and array comparisons.

Make a list of 0s (as you do):

In [108]: timeit [0]*1000
2.83 µs ± 0.399 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Make an array from that list - a lot more time:

In [109]: timeit np.array([0]*1000)
84.9 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Make the array of 0s in the correct numpy way:

In [110]: timeit np.zeros(1000)
735 ns ± 0.727 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Now sum all the elements of the list:

In [111]: %%timeit alist = [0]*1000
...: sum(alist)
5.74 µs ± 215 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Sum of the array:

In [112]: %%timeit arr = np.zeros(1000)
...: arr.sum()
6.41 µs ± 26.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In this case the list is a bit faster; but make a much larger list/array:

In [113]: %%timeit alist = [0]*100000
...: sum(alist)
545 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [114]: %%timeit arr = np.zeros(100000)
...: arr.sum()
56.7 µs ± 37 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

The list sum scales as O(n); the array sum scales much better (it has a higher setup cost, but a much lower per-element cost). The iteration is done in C code.

Add 1 to all elements of the list:

In [115]: %%timeit alist = [0]*100000
...: [i+1 for i in alist]
4.64 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

A similar list comprehension on the array is slower:

In [116]: %%timeit arr = np.zeros(100000)
...: np.array([i+1 for i in arr])
24.2 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

But doing a whole-array operation (where the iteration and addition are done in C code):

In [117]: %%timeit arr = np.zeros(100000)
...: arr+1
64.1 µs ± 85.3 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

For smaller sizes the [115] list comprehension will be faster; there's no single clear threshold.
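To locate the break-even point empirically on your own machine, you can sweep the size (a rough sketch; the crossover differs between machines and NumPy versions):

import timeit
import numpy as np

for n in (10, 100, 1000, 10000, 100000):
    alist = [0] * n
    arr = np.zeros(n)
    # Compare the list comprehension from [115] with the whole-array op from [117].
    t_list = timeit.timeit(lambda: [i + 1 for i in alist], number=1000)
    t_arr = timeit.timeit(lambda: arr + 1, number=1000)
    print('n={:>6}  list: {:.6f} s  array: {:.6f} s'.format(n, t_list, t_arr))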

Why are variable assignments faster than calling from arrays in python?

Python operations and memory allocations are generally much slower than NumPy's highly optimized, vectorized array operations. Since you are looping over the array and allocating memory, you don't get any of the benefits that NumPy offers. It's especially bad in your first version because it causes an undue number of allocations of small arrays.

Compare your code to one that offloads all the operations to Numpy instead of having Python do the operations one by one:

def test3(coords):
    mins = (coords - 1)**2
    results = np.sqrt(np.sum(mins, axis=1))
    return results

On my system, this results in:

Test 1 speed: 4.995761550962925
Test 2 speed: 1.3881473205983639
Test 3 speed: 0.05562112480401993
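The original test1/test2 are not shown here; a hypothetical per-element version in the same spirit might look like the sketch below. Every iteration goes through the interpreter and creates small temporary arrays, which is exactly the overhead test3 avoids:

import numpy as np

def test1_like(coords):
    # Hypothetical slow version: one interpreted loop iteration per row,
    # with small intermediate arrays allocated inside the loop.
    results = np.empty(len(coords))
    for k, row in enumerate(coords):
        results[k] = np.sqrt(np.sum((row - 1) ** 2))
    return results

coords = np.random.rand(100000, 3)
# test3(coords) computes the same values in a few whole-array passes.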

What are the advantages of NumPy over regular Python lists?

NumPy's arrays are more compact than Python lists -- a list of lists as you describe, in Python, would take at least 20 MB or so, while a NumPy 3D array with single-precision floats in the cells would fit in 4 MB. Access in reading and writing items is also faster with NumPy.

Maybe you don't care that much for just a million cells, but you definitely would for a billion cells -- neither approach would fit in a 32-bit architecture, but with 64-bit builds NumPy would get away with 4 GB or so, Python alone would need at least about 12 GB (lots of pointers which double in size) -- a much costlier piece of hardware!

The difference is mostly due to "indirectness" -- a Python list is an array of pointers to Python objects, at least 4 bytes per pointer plus 16 bytes for even the smallest Python object (4 for the type pointer, 4 for the reference count, 4 for the value -- and the memory allocator rounds up to 16). A NumPy array is an array of uniform values -- single-precision numbers take 4 bytes each, double-precision ones 8 bytes. Less flexible, but you pay substantially for the flexibility of standard Python lists!
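You can check these footprint figures directly (a small sketch, my addition; exact numbers depend on the Python build and pointer width):

import sys
import numpy as np

n = 1000 * 1000  # one million cells

lst = [float(i) for i in range(n)]
arr = np.zeros(n, dtype=np.float32)

print(sys.getsizeof(lst))    # the pointer array alone: ~8 MB on 64-bit builds
print(sys.getsizeof(1.0))    # each float object adds ~24 more bytes on 64-bit
print(arr.nbytes)            # 4 bytes per cell: 4,000,000 bytes total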


