NumPy-Style Arrays for C++

NumPy style arrays for C++?

Here are several free software libraries that may suit your needs.

  1. The GNU Scientific Library (GSL) is GPL software written in C, so it comes with C-style allocation and programming idioms (raw pointers, etc.). With the GSLwrap wrapper, you can program in a C++ style while still using the GSL. GSL ships with a BLAS implementation, but you can use ATLAS instead of the default CBLAS if you want even more performance.

  2. The boost/uBLAS library is distributed under the Boost Software License, written in C++, and shipped as a Boost package. It is a C++ take on the BLAS standard. uBLAS comes with a few linear algebra functions, and there is an experimental binding to ATLAS.

  3. Eigen is a linear algebra library written in C++, distributed under the MPL2 license (starting from version 3.1.1) or LGPL3/GPL2 (older versions). It offers a C++ style of programming and is more self-contained than the two libraries above (more algorithms and data structures are available). Eigen claims to be faster than the BLAS implementations above, though it does not follow the de facto standard BLAS API. Eigen does not seem to put much effort into parallel implementations.

  4. Armadillo is an LGPL3 library for C++. It has bindings to LAPACK (the library used by NumPy). It uses recursive templates and template metaprogramming, which is a good point (I don't know whether the other libraries do this as well).

  5. xtensor is a C++ library under the BSD license. It offers a C++ API very similar to that of NumPy; see https://xtensor.readthedocs.io/en/latest/numpy.html for a cheat sheet, and the short sketch that follows this list.
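
For a taste of the resemblance, here is a minimal, hedged sketch using xtensor (the headers and functions shown are the ones the xtensor documentation lists; treat it as illustrative rather than authoritative):

#include <iostream>
#include <xtensor/xarray.hpp>
#include <xtensor/xio.hpp>
#include <xtensor/xmath.hpp>

int main() {
    // NumPy: a = np.array([[1., 2.], [3., 4.]])
    xt::xarray<double> a = {{1., 2.}, {3., 4.}};
    // NumPy: b = a * 2 + 1 (elementwise, with broadcasting)
    xt::xarray<double> b = a * 2.0 + 1.0;
    // NumPy: a.sum()
    std::cout << xt::sum(a) << "\n" << b << "\n";
    return 0;
}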

These alternatives are really good if you just want data structures and basic linear algebra. Depending on your taste regarding style, licensing, or sysadmin challenges (installing big libraries like LAPACK can be difficult), you can choose the one that best suits your needs.

C++ library with array operations similar to Python/NumPy

I think Eigen is probably a good fit for you. If you have not tried it yet (I can't tell from your question), you should take a look at it. Apart from being very fast (among the fastest, from what I've gathered), it also has a pretty natural syntax. It does not get much better than that, as far as pure C++ is concerned. A small example follows the link below.

http://eigen.tuxfamily.org
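
To give an idea of the syntax, here is a minimal sketch using Eigen's Array and Matrix types (the names shown are part of Eigen's documented API, but consider this a sketch rather than a tutorial):

#include <iostream>
#include <Eigen/Dense>

int main() {
    // NumPy: a = np.linspace(0.0, 1.0, 5)
    Eigen::ArrayXd a = Eigen::ArrayXd::LinSpaced(5, 0.0, 1.0);
    // NumPy: b = a**2 + 1.0 (elementwise on Array types)
    Eigen::ArrayXd b = a.square() + 1.0;
    std::cout << b.sum() << " " << b.mean() << "\n";

    // NumPy: p = m @ m (matrix product on Matrix types)
    Eigen::MatrixXd m = Eigen::MatrixXd::Random(3, 3);
    Eigen::MatrixXd p = m * m;
    std::cout << p << "\n";
    return 0;
}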

Can't correctly convert Numpy array to c array for Python Extension (c++)

Specify the data type in your Python declaration of a, b, and c as dtype=np.float64. A double in C parlance is a 64-bit float. Using np.array the way you have usually returns np.int64; using it as follows returns an array of np.float64:

a = np.array([1., 2., 3.])
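
On the C++ side, here is a hedged sketch of pulling the data out of a float64 array in an extension (sum_array is a made-up name for illustration; PyArray_FROM_OTF, PyArray_DATA, and PyArray_SIZE are documented NumPy C API calls, and import_array() must have been called in the module init):

#include <Python.h>
#include <numpy/arrayobject.h>

// Hypothetical extension function: sums a float64 array.
static PyObject* sum_array(PyObject* self, PyObject* arg) {
    // Get a contiguous float64 array (converts if needed, fails cleanly).
    PyArrayObject* arr = (PyArrayObject*)PyArray_FROM_OTF(
        arg, NPY_DOUBLE, NPY_ARRAY_IN_ARRAY);
    if (arr == NULL) return NULL;
    double* data = (double*)PyArray_DATA(arr);  // raw C array of doubles
    npy_intp n = PyArray_SIZE(arr);
    double tot = 0.0;
    for (npy_intp i = 0; i < n; ++i) tot += data[i];
    Py_DECREF(arr);
    return PyFloat_FromDouble(tot);
}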

C array vs NumPy array

My knowledge on this is still imperfect, but this may be helpful.
I ran some informal benchmarks to show what each array type is good for and was intrigued by what I found.

Though these array types are different in many ways, if you are doing heavy computation with large arrays, you should be able to get similar performance out of any of them since item-by-item access should be roughly the same across the board.

A NumPy array is a Python object implemented using Python's C API.
NumPy arrays do provide an API at the C level, but they cannot be created independently of the Python interpreter.
They are especially useful because of all the different array manipulation routines available in NumPy and SciPy.

A Cython memory view is also a Python object, but it is made as a Cython extension type.
It does not appear to be designed for use in pure Python since it isn't a part of Cython that can be imported directly from Python, but you can return a view to Python from a Cython function.
You can look at the implementation at https://github.com/cython/cython/blob/master/Cython/Utility/MemoryView.pyx

A C array is a native type in the C language.
It is indexed like a pointer, but arrays and pointers are different.
There is some good discussion on this at http://c-faq.com/aryptr/index.html
They can be allocated on the stack and are easier for the C compiler to optimize, but they will be more difficult to access outside of Cython.
I know you can make a NumPy array from memory that has been dynamically allocated elsewhere, but doing it that way seems a lot more difficult.
Travis Oliphant posted an example of this at http://blog.enthought.com/python/numpy-arrays-with-pre-allocated-memory/ (a rough sketch of the idea follows this paragraph).
If you are using C arrays or pointers for temporary storage within your program they should work very well for you.
They will not be as convenient for slicing or for any other sort of vectorized computation since you will have to do everything yourself with explicit looping, but they should allocate and deallocate faster and ought to provide a good baseline for speed.
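
As a rough sketch of the idea in that post (wrap_buffer is a made-up name; PyArray_SimpleNewFromData and PyArray_SetBaseObject are documented NumPy C API calls, and import_array() must have been called first):

#include <Python.h>
#include <numpy/arrayobject.h>

// Wrap an existing C buffer as a 1-D NumPy array without copying.
static PyObject* wrap_buffer(double* buf, npy_intp n) {
    npy_intp dims[1] = {n};
    PyObject* arr = PyArray_SimpleNewFromData(1, dims, NPY_DOUBLE, buf);
    // NumPy does not take ownership of buf: it must outlive arr, or a
    // base object with a destructor (e.g. a PyCapsule) must be attached
    // via PyArray_SetBaseObject so buf is freed along with arr.
    return arr;
}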

Cython also provides an array class.
It looks like it is designed for internal use.
Instances are created when a memoryview is copied.
See http://docs.cython.org/src/userguide/memoryviews.html#view-cython-arrays

In Cython, you can also allocate memory and index a pointer to treat the allocated memory somewhat like an array.
See http://docs.cython.org/src/tutorial/memory_allocation.html

Here are some benchmarks that show somewhat similar performance for indexing large arrays.
This is the Cython file.

from numpy cimport ndarray as ar, uint64_t
cimport cython
import numpy as np

# Indexing a typed NumPy ndarray buffer.
@cython.boundscheck(False)
@cython.wraparound(False)
def ndarr_time(uint64_t n=1000000, uint64_t size=10000):
    cdef:
        ar[uint64_t] A = np.empty(n, dtype=np.uint64)
        uint64_t i, j
    for i in range(n):
        for j in range(size):
            A[j] = n

# Indexing a stack-allocated C array. No directives needed here:
# indexing a C array is not bounds-checked anyway, and its size
# must be known at compile time.
def carr_time(uint64_t n=1000000):
    cdef:
        ar[uint64_t] A = np.empty(n, dtype=np.uint64)
        uint64_t AC[10000]
        uint64_t a
        int i, j
    for i in range(n):
        for j in range(10000):
            AC[j] = n

# Indexing through a raw pointer into the array's buffer.
@cython.boundscheck(False)
@cython.wraparound(False)
def ptr_time(uint64_t n=1000000, uint64_t size=10000):
    cdef:
        ar[uint64_t] A = np.empty(n, dtype=np.uint64)
        uint64_t* AP = &A[0]
        uint64_t a
        int i, j
    for i in range(n):
        for j in range(size):
            AP[j] = n

# Indexing a typed memoryview over the same buffer.
@cython.boundscheck(False)
@cython.wraparound(False)
def view_time(uint64_t n=1000000, uint64_t size=10000):
    cdef:
        ar[uint64_t] A = np.empty(n, dtype=np.uint64)
        uint64_t[:] AV = A
        uint64_t i, j
    for i in range(n):
        for j in range(size):
            AV[j] = n

Timing these using IPython we obtain

%timeit -n 10 ndarr_time()
%timeit -n 10 carr_time()
%timeit -n 10 ptr_time()
%timeit -n 10 view_time()

10 loops, best of 3: 6.33 s per loop
10 loops, best of 3: 3.12 s per loop
10 loops, best of 3: 6.26 s per loop
10 loops, best of 3: 3.74 s per loop

These results struck me as a little odd, considering that, as per Efficiency: arrays vs pointers, arrays are unlikely to be significantly faster than pointers.
It appears that some sort of compiler optimization is making the pure C arrays and the typed memory views faster.
I tried turning off all the optimization flags on my C compiler and got the timings

1 loops, best of 3: 25.1 s per loop
1 loops, best of 3: 25.5 s per loop
1 loops, best of 3: 32 s per loop
1 loops, best of 3: 28.4 s per loop

It looks to me like the item-by-item access is pretty much the same across the board, except that C arrays and Cython memory views seem to be easier for the compiler to optimize.

More commentary on this can be seen in these two blog posts I found some time ago:
http://jakevdp.github.io/blog/2012/08/08/memoryview-benchmarks/
http://jakevdp.github.io/blog/2012/08/16/memoryview-benchmarks-2/

In the second blog post he comments on how, if memoryview slices are inlined, they can provide speeds similar to those of pointer arithmetic.
I have noticed in some of my own tests that explicitly inlining functions that use memoryview slices isn't always necessary.
As an example of this, I'll compute the inner product of every combination of two rows of an array.

from numpy cimport ndarray as ar
cimport cython
from numpy import empty

# An inlined dot product on memoryview slices
@cython.boundscheck(False)
@cython.wraparound(False)
cdef inline double dot_product(double[:] a, double[:] b, int size):
    cdef int i
    cdef double tot = 0.
    for i in range(size):
        tot += a[i] * b[i]
    return tot

# A non-inlined dot product on memoryview slices
@cython.boundscheck(False)
@cython.wraparound(False)
cdef double dot_product_no_inline(double[:] a, double[:] b, int size):
    cdef int i
    cdef double tot = 0.
    for i in range(size):
        tot += a[i] * b[i]
    return tot

# Function calling the inlined dot product
@cython.boundscheck(False)
@cython.wraparound(False)
def dot_rows_slicing(ar[double, ndim=2] A):
    cdef:
        double[:,:] Aview = A
        ar[double, ndim=2] res = empty((A.shape[0], A.shape[0]))
        int i, j
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            res[i,j] = dot_product(Aview[i], Aview[j], A.shape[1])
    return res

# Function calling the non-inlined version
@cython.boundscheck(False)
@cython.wraparound(False)
def dot_rows_slicing_no_inline(ar[double, ndim=2] A):
    cdef:
        double[:,:] Aview = A
        ar[double, ndim=2] res = empty((A.shape[0], A.shape[0]))
        int i, j
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            res[i,j] = dot_product_no_inline(Aview[i], Aview[j], A.shape[1])
    return res

# Inlined dot product using NumPy array slices
@cython.boundscheck(False)
@cython.wraparound(False)
cdef inline double ndarr_dot_product(ar[double] a, ar[double] b):
    cdef int i
    cdef double tot = 0.
    for i in range(a.size):
        tot += a[i] * b[i]
    return tot

# Non-inlined dot product using NumPy array slices
@cython.boundscheck(False)
@cython.wraparound(False)
cdef double ndarr_dot_product_no_inline(ar[double] a, ar[double] b):
    cdef int i
    cdef double tot = 0.
    for i in range(a.size):
        tot += a[i] * b[i]
    return tot

# Function calling the inlined NumPy array dot product
@cython.boundscheck(False)
@cython.wraparound(False)
def ndarr_dot_rows_slicing(ar[double, ndim=2] A):
    cdef:
        ar[double, ndim=2] res = empty((A.shape[0], A.shape[0]))
        int i, j
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            res[i,j] = ndarr_dot_product(A[i], A[j])
    return res

# Function calling the non-inlined version for NumPy arrays
@cython.boundscheck(False)
@cython.wraparound(False)
def ndarr_dot_rows_slicing_no_inline(ar[double, ndim=2] A):
    cdef:
        ar[double, ndim=2] res = empty((A.shape[0], A.shape[0]))
        int i, j
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            res[i,j] = ndarr_dot_product_no_inline(A[i], A[j])
    return res

# Version with explicit looping and item-by-item access.
@cython.boundscheck(False)
@cython.wraparound(False)
def dot_rows_loops(ar[double, ndim=2] A):
    cdef:
        ar[double, ndim=2] res = empty((A.shape[0], A.shape[0]))
        int i, j, k
        double tot
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            tot = 0.
            for k in range(A.shape[1]):
                tot += A[i,k] * A[j,k]
            res[i,j] = tot
    return res

Timing these we see

from numpy.random import rand
A = rand(1000, 1000)
%timeit dot_rows_slicing(A)
%timeit dot_rows_slicing_no_inline(A)
%timeit ndarr_dot_rows_slicing(A)
%timeit ndarr_dot_rows_slicing_no_inline(A)
%timeit dot_rows_loops(A)

1 loops, best of 3: 1.02 s per loop
1 loops, best of 3: 1.02 s per loop
1 loops, best of 3: 3.65 s per loop
1 loops, best of 3: 3.66 s per loop
1 loops, best of 3: 1.04 s per loop

The results were as fast with explicit inlining as they were without it.
In both cases, the typed memory views were comparable to a version of the function that was written without slicing.

In the blog post, he had to write a specific example to force the compiler to not inline a function.
It appears that a decent C compiler (I'm using MinGW) is able to take care of these optimizations without being told to inline certain functions.
Memoryviews can be faster for passing array slices between functions within a Cython module, even without explicit inlining.

In this particular case, however, even pushing the loops to C doesn't really reach a speed anywhere near what can be achieved through proper use of matrix multiplication.
The BLAS is still the best way to do things like this.

%timeit A.dot(A.T)
10 loops, best of 3: 25.7 ms per loop

There is also automatic conversion from NumPy arrays to memoryviews as in

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cysum(double[:] A):
    cdef double tot = 0.
    cdef int i
    for i in range(A.size):
        tot += A[i]
    return tot

The one catch is that, if you want a function to return a NumPy array, you will have to use np.asarray to convert the memoryview object back to a NumPy array.
This is a relatively inexpensive operation since memoryviews comply with PEP 3118 (http://www.python.org/dev/peps/pep-3118/).

Conclusion

Typed memory views seem to be a viable alternative to NumPy arrays for internal use in a Cython module.
Array slicing will be faster with memory views, but there are not as many functions and methods written for memory views as there are for NumPy arrays.
If you don't need to call a bunch of the NumPy array methods and want easy array slicing, you can use memory views in place of NumPy arrays.
If you need both the array slicing and the NumPy functionality for a given array, you can make a memory view that points to the same memory as the NumPy array.
You can then use the view for passing slices between functions and the array for calling NumPy functions.
That approach is still somewhat limited, but it will work well if you are doing most of your processing with a single array.

C arrays and/or dynamically allocated blocks of memory could be useful for intermediate calculations, but they are not as easy to pass back to Python for use there.
In my opinion, it is also more cumbersome to dynamically allocate multidimensional C arrays.
The best approach I am aware of is to allocate a large block of memory and then use integer arithmetic to index it as if it were a multidimensional array (see the sketch after this paragraph).
This could be an issue if you want easy allocation of arrays on the fly.
On the other hand, allocation times are probably a good bit faster for C arrays.
The other array types are designed to be nearly as fast and much more convenient, so I would recommend using them unless there is a compelling reason to do otherwise.
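
As a hedged illustration of that indexing trick in plain C/C++ (all names here are made up for the example):

#include <cstdio>
#include <cstdlib>

int main() {
    const size_t rows = 3, cols = 4;
    // One flat block standing in for a rows x cols matrix.
    double* mat = (double*)malloc(rows * cols * sizeof(double));
    // Row-major convention: element (i, j) lives at mat[i * cols + j].
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            mat[i * cols + j] = (double)(i + j);
    printf("%f\n", mat[2 * cols + 3]);  // element (2, 3)
    free(mat);
    return 0;
}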

Update: as mentioned in the answer by @Veedrac, you can still pass Cython memoryviews to most NumPy functions.
When you do this, NumPy will usually have to create a new NumPy array object to work with the memory view anyway, so this will be somewhat slower.
For large arrays the effect will be negligible.
A call to np.asarray for a memory view will be relatively fast regardless of array size.
However, to demonstrate this effect, here is another benchmark:

Cython file:

from numpy cimport ndarray as ar

def npy_call_on_view(npy_func, double[:] A, int n):
    cdef int i
    for i in range(n):
        npy_func(A)

def npy_call_on_arr(npy_func, ar[double] A, int n):
    cdef int i
    for i in range(n):
        npy_func(A)

in IPython:

import numpy as np
from numpy.random import rand
A = rand(1)
%timeit npy_call_on_view(np.amin, A, 10000)
%timeit npy_call_on_arr(np.amin, A, 10000)

output:

10 loops, best of 3: 282 ms per loop
10 loops, best of 3: 35.9 ms per loop

I tried to choose an example that would show this effect well.
Unless many NumPy function calls on relatively small arrays are involved, this shouldn't change the time a whole lot.
Keep in mind that, regardless of which way we are calling NumPy, a Python function call still occurs.

This applies only to the functions in NumPy.
Most of the array methods are not available for memoryviews (though some of the attributes still are, like size, shape, and T).
For example, A.dot(A.T) with NumPy arrays would become np.dot(A, A.T).

Halide with C layout numpy arrays

The problem is that PyArray_SimpleNewFromData made a C-style (row-major) ndarray from the data, whereas in the host C++ code the arrays are Fortran-style (column-major). One solution is to convert the ndarrays just after they are created, which can be done with code like:

def swap(img):
    (sh1, sh2) = img.shape
    (st1, st2) = img.strides
    img.shape = (sh2, sh1)
    img.strides = (st2, st1)

After this, we can vectorize normally within Halide in the zeroth (x) dimension.


