What is the recommended way of allocating memory for a typed memory view?
Look here for an answer.
The basic idea is that you want cpython.array.array and cpython.array.clone (not cython.array.*):
from cpython.array cimport array, clone
# This type is what you want and can be cast to things of
# the "double[:]" syntax, so no problems there
cdef array[double] armv, templatemv
templatemv = array('d')
# This is fast
armv = clone(templatemv, L, False)
EDIT
It turns out that the benchmarks in that thread were rubbish. Here's my set, with my timings:
# cython: language_level=3
# cython: boundscheck=False
# cython: wraparound=False
import time
import sys
from cpython.array cimport array, clone
from cython.view cimport array as cvarray
from libc.stdlib cimport malloc, free
import numpy as numpy
cimport numpy as numpy
cdef int loops
def timefunc(name):
    def timedecorator(f):
        cdef int L, i

        print("Running", name)
        for L in [1, 10, 100, 1000, 10000, 100000, 1000000]:
            # time.clock() was removed in Python 3.8; use time.perf_counter() on modern Python
            start = time.clock()
            f(L)
            end = time.clock()
            print(format((end-start) / loops * 1e6, "2f"), end=" ")
            sys.stdout.flush()
        print("μs")
    return timedecorator
print()
print("INITIALISATIONS")
loops = 100000
@timefunc("cpython.array buffer")
def _(int L):
    cdef int i
    cdef array[double] arr, template = array('d')

    for i in range(loops):
        arr = clone(template, L, False)

    # Prevents dead code elimination
    str(arr[0])

@timefunc("cpython.array memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr
    cdef array template = array('d')

    for i in range(loops):
        arr = clone(template, L, False)

    # Prevents dead code elimination
    str(arr[0])

@timefunc("cpython.array raw C type")
def _(int L):
    cdef int i
    cdef array arr, template = array('d')

    for i in range(loops):
        arr = clone(template, L, False)

    # Prevents dead code elimination
    str(arr[0])

@timefunc("numpy.empty_like memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr
    template = numpy.empty((L,), dtype='double')

    for i in range(loops):
        arr = numpy.empty_like(template)

    # Prevents dead code elimination
    str(arr[0])

@timefunc("malloc")
def _(int L):
    cdef int i
    cdef double* arrptr

    for i in range(loops):
        arrptr = <double*> malloc(sizeof(double) * L)
        free(arrptr)

    # Prevents dead code elimination
    str(arrptr[0])

@timefunc("malloc memoryview")
def _(int L):
    cdef int i
    cdef double* arrptr
    cdef double[::1] arr

    for i in range(loops):
        arrptr = <double*> malloc(sizeof(double) * L)
        arr = <double[:L]>arrptr
        free(arrptr)

    # Prevents dead code elimination
    str(arr[0])

@timefunc("cvarray memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr

    for i in range(loops):
        arr = cvarray((L,), sizeof(double), 'd')

    # Prevents dead code elimination
    str(arr[0])
print()
print("ITERATING")
loops = 1000
@timefunc("cpython.array buffer")
def _(int L):
    cdef int i
    cdef array[double] arr = clone(array('d'), L, False)

    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arr[i]

    # Prevents dead-code elimination
    str(d)

@timefunc("cpython.array memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr = clone(array('d'), L, False)

    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arr[i]

    # Prevents dead-code elimination
    str(d)

@timefunc("cpython.array raw C type")
def _(int L):
    cdef int i
    cdef array arr = clone(array('d'), L, False)

    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arr[i]

    # Prevents dead-code elimination
    str(d)

@timefunc("numpy.empty_like memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr = numpy.empty((L,), dtype='double')

    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arr[i]

    # Prevents dead-code elimination
    str(d)

@timefunc("malloc")
def _(int L):
    cdef int i
    cdef double* arrptr = <double*> malloc(sizeof(double) * L)

    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arrptr[i]

    free(arrptr)

    # Prevents dead-code elimination
    str(d)

@timefunc("malloc memoryview")
def _(int L):
    cdef int i
    cdef double* arrptr = <double*> malloc(sizeof(double) * L)
    cdef double[::1] arr = <double[:L]>arrptr

    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arr[i]

    free(arrptr)

    # Prevents dead-code elimination
    str(d)

@timefunc("cvarray memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr = cvarray((L,), sizeof(double), 'd')

    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arr[i]

    # Prevents dead-code elimination
    str(d)
Output:
INITIALISATIONS
Running cpython.array buffer
0.100040 0.097140 0.133110 0.121820 0.131630 0.108420 0.112160 μs
Running cpython.array memoryview
0.339480 0.333240 0.378790 0.445720 0.449800 0.414280 0.414060 μs
Running cpython.array raw C type
0.048270 0.049250 0.069770 0.074140 0.076300 0.060980 0.060270 μs
Running numpy.empty_like memoryview
1.006200 1.012160 1.128540 1.212350 1.250270 1.235710 1.241050 μs
Running malloc
0.021850 0.022430 0.037240 0.046260 0.039570 0.043690 0.030720 μs
Running malloc memoryview
1.640200 1.648000 1.681310 1.769610 1.755540 1.804950 1.758150 μs
Running cvarray memoryview
1.332330 1.353910 1.358160 1.481150 1.517690 1.485600 1.490790 μs
ITERATING
Running cpython.array buffer
0.010000 0.027000 0.091000 0.669000 6.314000 64.389000 635.171000 μs
Running cpython.array memoryview
0.013000 0.015000 0.058000 0.354000 3.186000 33.062000 338.300000 μs
Running cpython.array raw C type
0.014000 0.146000 0.979000 9.501000 94.160000 916.073000 9287.079000 μs
Running numpy.empty_like memoryview
0.042000 0.020000 0.057000 0.352000 3.193000 34.474000 333.089000 μs
Running malloc
0.002000 0.004000 0.064000 0.367000 3.599000 32.712000 323.858000 μs
Running malloc memoryview
0.019000 0.032000 0.070000 0.356000 3.194000 32.100000 327.929000 μs
Running cvarray memoryview
0.014000 0.026000 0.063000 0.351000 3.209000 32.013000 327.890000 μs
(The reason for the "iterations" benchmark is that some methods have surprisingly different characteristics in this respect.)
In order of initialisation speed:
malloc
: This is a harsh world, but it's fast. If you need to allocate a lot of things and have unhindered iteration and indexing performance, this has to be it. But normally you're better off with...
cpython.array raw C type
: Well damn, it's fast. And it's safe. Unfortunately it goes through Python to access its data fields. You can avoid that by using a wonderful trick:
arr.data.as_doubles[i]
which brings it up to the standard speed while removing safety! This makes it a wonderful replacement for malloc, being basically a reference-counted version of it!
cpython.array buffer
: Coming in at only three to four times the setup time of malloc, this looks like a wonderful bet. Unfortunately it has significant overhead (albeit small compared to that of the boundscheck and wraparound directives). That means it only really competes against full-safety variants, but it is the fastest of those to initialise. Your choice.
cpython.array memoryview
: This is now an order of magnitude slower than malloc to initialise. That's a shame, but it iterates just as fast. This is the standard solution that I would suggest unless boundscheck or wraparound are on (in which case cpython.array buffer might be a more compelling tradeoff).
The rest. The only one worth anything is numpy's, due to the many fun methods attached to the objects. That's it, though.
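The cpython.array approach above can be sketched at the Python level too: a plain array.array of a given length is cheap to create in one step. This is only a rough analogue of clone with zeroing enabled, not the actual clone implementation:

```python
from array import array

L = 1000
template = array('d')

# Zero-initialised allocation of L doubles in one step, loosely analogous
# to clone(template, L, True) at the Cython level
arr = array(template.typecode, bytes(template.itemsize * L))

assert len(arr) == L
assert arr[0] == 0.0
```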
Cython: understanding a typed memoryview with an indirect_contiguous memory layout
Let's set the record straight: typed memory views can only be used with objects which implement the buffer protocol.
Raw C pointers obviously don't implement the buffer protocol. But you might ask why something like the following quick-and-dirty code works:
%%cython
from libc.stdlib cimport calloc

def f():
    cdef int* v = <int *>calloc(4, sizeof(int))
    cdef int[:] b = <int[:4]>v
    return b[0]  # leaks memory, so what?
Here, a pointer (v) is used to construct a typed memory view (b). There is, however, more going on under the hood (as can be seen in the cythonized C file): a cython-array (i.e. cython.view.array) is constructed, which wraps the raw pointer and can expose it via the buffer protocol; this array is then used to create the typed memory view.
Your understanding of what view.indirect_contiguous is used for is correct: it is exactly what you want. The problem, however, is view.array, which just cannot handle this type of data layout.

view.indirect and view.indirect_contiguous correspond to PyBUF_INDIRECT in buffer-protocol parlance, and for these the field suboffsets must contain some meaningful values (i.e. >= 0 for some dimensions). However, as can be seen in the source code, view.array doesn't have this member at all, so there is no way it can represent this complex memory layout.
Where does that leave us? As pointed out by @chrisb and @DavidW in your other question, you will have to implement a wrapper which can expose your data structure via the buffer protocol.

There are data structures in Python which use the indirect memory layout, most prominently the PIL arrays. A good starting point for understanding how suboffsets are supposed to work is this piece of documentation:
void *get_item_pointer(int ndim, void *buf, Py_ssize_t *strides,
                       Py_ssize_t *suboffsets, Py_ssize_t *indices) {
    char *pointer = (char*)buf;                        // A
    int i;
    for (i = 0; i < ndim; i++) {
        pointer += strides[i] * indices[i];            // B
        if (suboffsets[i] >= 0) {
            pointer = *((char**)pointer) + suboffsets[i];  // C
        }
    }
    return (void*)pointer;                             // D
}
In your case, strides and suboffsets would be:
- strides = [sizeof(int*), sizeof(int)] (i.e. [8, 4] on usual x86_64 machines)
- suboffsets = [0, -1], i.e. only the first dimension is indirect.
Getting the address of element [x, y] would then happen as follows:
- In line A, pointer is set to buf; let's call it BUF.
- First dimension: in line B, pointer becomes BUF + x*8 and points to the location of the pointer to the x-th row. Because suboffsets[0] >= 0, we dereference the pointer in line C, so it now points to ROW_X, the start of the x-th row.
- Second dimension: in line B we get the address of the y-th element using strides, i.e. pointer = ROW_X + 4*y. The second dimension is direct (signaled by suboffsets[1] < 0), so no dereferencing is needed.
- We are done: pointer points to the desired address and is returned in line D.
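The walkthrough above can be simulated in pure Python with ctypes. The layout, strides and suboffsets below mirror the int** example; the helper names are illustrative, not part of any API:

```python
import ctypes

# Build a 2x3 int** layout: an array of two row pointers
Row = ctypes.c_int * 3
rows = [Row(1, 2, 3), Row(4, 5, 6)]            # keep the rows alive
RowPtrs = ctypes.POINTER(ctypes.c_int) * 2
buf = RowPtrs(*[ctypes.cast(r, ctypes.POINTER(ctypes.c_int)) for r in rows])

strides = [ctypes.sizeof(ctypes.c_void_p), ctypes.sizeof(ctypes.c_int)]
suboffsets = [0, -1]                           # only the first dimension is indirect

def get_item(x, y):
    pointer = ctypes.addressof(buf)            # A
    for stride, suboffset, index in zip(strides, suboffsets, (x, y)):
        pointer += stride * index              # B
        if suboffset >= 0:                     # C: dereference the row pointer
            pointer = ctypes.cast(pointer, ctypes.POINTER(ctypes.c_void_p))[0] + suboffset
    return ctypes.cast(pointer, ctypes.POINTER(ctypes.c_int))[0]  # D

assert get_item(0, 1) == 2
assert get_item(1, 2) == 6
```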
FWIW, I have implemented a library which is able to export int** and similar memory layouts via the buffer protocol: https://github.com/realead/indirect_buffer.
Cython: are typed memoryviews the modern way to type numpy arrays?
I will quote from the docs:
Memoryviews are similar to the current NumPy array buffer support (
np.ndarray[np.float64_t, ndim=2]
), but they have more features and cleaner syntax.
This indicates that the developers of Cython consider memory views to be the modern way.
Memory views offer some big advantages over the np.ndarray notation, primarily in elegance and interoperability; however, they are not superior in performance.
Performance:
First it should be noted that boundscheck sometimes fails to work with memory views, resulting in artificially fast figures for memoryviews with boundscheck=True (i.e. you get fast, unsafe indexing). If you're relying on boundscheck to catch bugs, this could be a nasty surprise.
For the most part once compiler optimizations have been applied, memory views and numpy array notation are equal in performance, often precisely so. When there is a difference it is normally no more than 10-30%.
Performance benchmark
The number is the time in seconds to perform 100,000,000 operations. Smaller is faster.
ACCESS+ASSIGNMENT on small array (10000 elements, 10000 times)
Results for `uint8`
1) memory view: 0.0415 +/- 0.0017
2) np.ndarray : 0.0531 +/- 0.0012
3) pointer : 0.0333 +/- 0.0017
Results for `uint16`
1) memory view: 0.0479 +/- 0.0032
2) np.ndarray : 0.0480 +/- 0.0034
3) pointer : 0.0329 +/- 0.0008
Results for `uint32`
1) memory view: 0.0499 +/- 0.0021
2) np.ndarray : 0.0413 +/- 0.0005
3) pointer : 0.0332 +/- 0.0010
Results for `uint64`
1) memory view: 0.0489 +/- 0.0019
2) np.ndarray : 0.0417 +/- 0.0010
3) pointer : 0.0353 +/- 0.0017
Results for `float32`
1) memory view: 0.0398 +/- 0.0027
2) np.ndarray : 0.0418 +/- 0.0019
3) pointer : 0.0330 +/- 0.0006
Results for `float64`
1) memory view: 0.0439 +/- 0.0037
2) np.ndarray : 0.0422 +/- 0.0013
3) pointer : 0.0353 +/- 0.0013
ACCESS PERFORMANCE (100,000,000 element array):
Results for `uint8`
1) memory view: 0.0576 +/- 0.0006
2) np.ndarray : 0.0570 +/- 0.0009
3) pointer : 0.0061 +/- 0.0004
Results for `uint16`
1) memory view: 0.0806 +/- 0.0002
2) np.ndarray : 0.0882 +/- 0.0005
3) pointer : 0.0121 +/- 0.0003
Results for `uint32`
1) memory view: 0.0572 +/- 0.0016
2) np.ndarray : 0.0571 +/- 0.0021
3) pointer : 0.0248 +/- 0.0008
Results for `uint64`
1) memory view: 0.0618 +/- 0.0007
2) np.ndarray : 0.0621 +/- 0.0014
3) pointer : 0.0481 +/- 0.0006
Results for `float32`
1) memory view: 0.0945 +/- 0.0013
2) np.ndarray : 0.0947 +/- 0.0018
3) pointer : 0.0942 +/- 0.0020
Results for `float64`
1) memory view: 0.0981 +/- 0.0026
2) np.ndarray : 0.0982 +/- 0.0026
3) pointer : 0.0968 +/- 0.0016
ASSIGNMENT PERFORMANCE (100,000,000 element array):
Results for `uint8`
1) memory view: 0.0341 +/- 0.0010
2) np.ndarray : 0.0476 +/- 0.0007
3) pointer : 0.0402 +/- 0.0001
Results for `uint16`
1) memory view: 0.0368 +/- 0.0020
2) np.ndarray : 0.0368 +/- 0.0019
3) pointer : 0.0279 +/- 0.0009
Results for `uint32`
1) memory view: 0.0429 +/- 0.0022
2) np.ndarray : 0.0427 +/- 0.0005
3) pointer : 0.0418 +/- 0.0007
Results for `uint64`
1) memory view: 0.0833 +/- 0.0004
2) np.ndarray : 0.0835 +/- 0.0011
3) pointer : 0.0832 +/- 0.0003
Results for `float32`
1) memory view: 0.0648 +/- 0.0061
2) np.ndarray : 0.0644 +/- 0.0044
3) pointer : 0.0639 +/- 0.0005
Results for `float64`
1) memory view: 0.0854 +/- 0.0056
2) np.ndarray : 0.0849 +/- 0.0043
3) pointer : 0.0847 +/- 0.0056
Benchmark Code (Shown only for access+assignment)
# cython: boundscheck=False
# cython: wraparound=False
# cython: nonecheck=False
import numpy as np
cimport numpy as np
cimport cython

# Change these as desired.
data_type = np.uint64
ctypedef np.uint64_t data_type_t

cpdef test_memory_view(data_type_t [:] view):
    cdef Py_ssize_t i, j, n = view.shape[0]

    for j in range(0, n):
        for i in range(0, n):
            view[i] = view[j]

cpdef test_ndarray(np.ndarray[data_type_t, ndim=1] view):
    cdef Py_ssize_t i, j, n = view.shape[0]

    for j in range(0, n):
        for i in range(0, n):
            view[i] = view[j]

cpdef test_pointer(data_type_t [:] view):
    cdef Py_ssize_t i, j, n = view.shape[0]
    cdef data_type_t * data_ptr = &view[0]

    for j in range(0, n):
        for i in range(0, n):
            (data_ptr + i)[0] = (data_ptr + j)[0]

def run_test():
    import time
    from statistics import stdev, mean

    n = 10000
    repeats = 100
    a = np.arange(0, n, dtype=data_type)
    funcs = [('1) memory view', test_memory_view),
             ('2) np.ndarray', test_ndarray),
             ('3) pointer', test_pointer)]
    results = {label: [] for label, func in funcs}

    for r in range(0, repeats):
        for label, func in funcs:
            start = time.time()
            func(a)
            results[label].append(time.time() - start)

    print('Results for `{}`'.format(data_type.__name__))
    for label, times in sorted(results.items()):
        print('{: <14}: {:.4f} +/- {:.4f}'.format(label, mean(times), stdev(times)))
These benchmarks indicate that on the whole there is not much difference in performance. Sometimes the np.ndarray notation is a little faster, and sometimes vice versa.
One thing to watch out for with benchmarks is that when the code is made a little more complicated or 'realistic', the difference suddenly vanishes, as if the compiler loses confidence to apply some very clever optimization. This can be seen in the float results, where there is no difference whatsoever, presumably because some fancy integer optimizations can't be used.
Ease of use
Memory views offer significant advantages: for example, you can use a memory view on a numpy array, a CPython array, a Cython array, a C array and more, both present and future. There is also the simple parallel syntax for casting anything to a memory view:
cdef double [:, :] data_view = <double[:256, :256]>data
Memory views are great in this regard, because if you type a function as taking a memory view then it can take any of those things. This means you can write a module that doesn't have a dependency on numpy, but which can still take numpy arrays.
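The same flexibility exists with Python-level memoryviews, which make a decent mental model here: one function, many buffer-providing containers (the helper name is just for illustration):

```python
from array import array

def first_element(buf):
    # Accepts anything that exports the buffer protocol
    return memoryview(buf)[0]

# The one function handles bytes, bytearray, array.array, and more
assert first_element(bytearray(b'\x07abc')) == 7
assert first_element(array('B', [7, 8, 9])) == 7
assert first_element(b'\x07') == 7
```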
On the other hand, np.ndarray
notation results in something that is still a numpy array and you can call all the numpy array methods on it. It's not a big deal to have both a numpy array and a view on the array though:
def dostuff(arr):
    cdef double [:] arr_view = arr
    # Now you can use 'arr' if you want array functions,
    # and arr_view if you want fast indexing
Having both the array and the array view works fine in practice, and I quite like the style, as it makes a clear distinction between Python-level methods and C-level methods.
Conclusion
Performance is very nearly equal and there is certainly not enough difference for that to be a deciding factor.
The numpy array notation comes closer to the ideal of accelerating python code without changing it much, as you can continue to use the same variable, while gaining full-speed array indexing.
On the other hand, the memory view notation probably is the future. If you like the elegance of it, and use different kinds of data containers than just numpy arrays, there is very good reason for using memory views for consistency's sake.
How to access the typed-memory view element of a class declared in cython?
Your first issue is just that a memoryview doesn't have a useful __str__
function for print to use. You can either convert it to an object that does print nicely:
print(list(b0.moves))
print(np.asarray(b0.moves))
Or you can iterate through it yourself:
for i in range(b0.moves.shape[0]):
    print(b0.moves[i], end=' ')  # need to have Cython set to use Python 3 syntax for this line
print()
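Python-level memoryviews behave the same way, for what it's worth: their default string form is uninformative, and converting to a list gives a readable display:

```python
m = memoryview(bytearray([1, 2, 3]))

s = str(m)
assert s.startswith('<memory at')   # the default repr is not informative

assert list(m) == [1, 2, 3]         # convert for a readable display
```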
Your second problem is harder to solve since you don't tell us which line the error comes from. I think it's the constructor of cy, which expects a memoryview but to which you pass an integer. (I get a slightly different error message though.)
Initialise Cython Memoryview efficiently
See this answer for a comparison of different ways of allocating memory. If your needs are simple (just indexing), pay particular attention to 'cpython.array raw C type': you can create a cpython array for fast creation, then use arr.data.as_ints[i] for fast unsafe indexing. Or, if you really do need a memory view, a memory view on a cpython array is 3x faster to create than one on a numpy array.
Without having a bigger picture of what your code does it's difficult to offer more specific advice. For example if possible you would be better served using a two-dimensional array as it tends to be much more efficient to allocate one large chunk of memory than lots of little ones, and for instance it's much quicker to make lots of little memory view slices of one big memory view with a big hunk of allocated memory, than to create a bunch of little memory views each with their own little piece of allocated memory.
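The last point can be illustrated with Python-level memoryviews: slicing one large buffer is cheap and copy-free, whereas each separate allocation pays its own overhead:

```python
big = bytearray(64 * 128)         # one allocation
view = memoryview(big)

# 128 small "arrays" carved out of it; slicing a memoryview never copies
chunks = [view[i * 64:(i + 1) * 64] for i in range(128)]

chunks[0][0] = 255
assert big[0] == 255              # the slices alias the original buffer
assert len(chunks) == 128
```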
C++ - safety of allocating memory for an array, then returning a pointer to be deleted externally
You must use delete[] p; the other one leads to undefined behaviour. Typically the compiler stores some number before the array so it knows how big it is; the details depend on the implementation. Also, main should return int, but that is not as serious.
EDIT: since you are interested in a possible implementation, here's the debugger output (gcc 4.8, linux x86_64):
(gdb) l AllocateSomething
1 int* AllocateSomething()
2 {
3 int *arr1 = new int[100];
4 int *arr2 = new int[200];
5 int *arr3 = new int[300];
6 int *arr4 = new int[400];
7 arr1[0] = arr2[0] = arr3[0] = arr4[0] = ~0;
8
9
10
(gdb) x/20 arr1-8
0x601ff0: 0 0 0 0
0x602000: 0 0 417 0
0x602010: -1 0 0 0
0x602020: 0 0 0 0
0x602030: 0 0 0 0
(gdb) x/20 arr2-8
0x602190: 0 0 0 0
0x6021a0: 0 0 817 0
0x6021b0: -1 0 0 0
0x6021c0: 0 0 0 0
0x6021d0: 0 0 0 0
(gdb) x/20 arr3-8
0x6024c0: 0 0 0 0
0x6024d0: 0 0 1217 0
0x6024e0: -1 0 0 0
0x6024f0: 0 0 0 0
0x602500: 0 0 0 0
(gdb) x/20 arr4-8
0x602980: 0 0 0 0
0x602990: 0 0 1617 0
0x6029a0: -1 0 0 0
0x6029b0: 0 0 0 0
0x6029c0: 0 0 0 0
Memory dumps for each array. -1 marks the start of the array. I'm printing from the 8th element (4-byte ints) before the start. You can clearly see the size of the array (sizeof(int) * the element count, plus 17 auxiliary bytes) stored in a word (8 bytes) before the array.
What and where are the stack and heap?
The stack is the memory set aside as scratch space for a thread of execution. When a function is called, a block is reserved on the top of the stack for local variables and some bookkeeping data. When that function returns, the block becomes unused and can be used the next time a function is called. The stack is always reserved in a LIFO (last in first out) order; the most recently reserved block is always the next block to be freed. This makes it really simple to keep track of the stack; freeing a block from the stack is nothing more than adjusting one pointer.
The heap is memory set aside for dynamic allocation. Unlike the stack, there's no enforced pattern to the allocation and deallocation of blocks from the heap; you can allocate a block at any time and free it at any time. This makes it much more complex to keep track of which parts of the heap are allocated or free at any given time; there are many custom heap allocators available to tune heap performance for different usage patterns.
Each thread gets a stack, while there's typically only one heap for the application (although it isn't uncommon to have multiple heaps for different types of allocation).
To answer your questions directly:
To what extent are they controlled by the OS or language runtime?
The OS allocates the stack for each system-level thread when the thread is created. Typically the OS is called by the language runtime to allocate the heap for the application.
What is their scope?
The stack is attached to a thread, so when the thread exits the stack is reclaimed. The heap is typically allocated at application startup by the runtime, and is reclaimed when the application (technically process) exits.
What determines the size of each of them?
The size of the stack is set when a thread is created. The size of the heap is set on application startup, but can grow as space is needed (the allocator requests more memory from the operating system).
What makes one faster?
The stack is faster because the access pattern makes it trivial to allocate and deallocate memory from it (a pointer/integer is simply incremented or decremented), while the heap has much more complex bookkeeping involved in an allocation or deallocation. Also, each byte in the stack tends to be reused very frequently which means it tends to be mapped to the processor's cache, making it very fast. Another performance hit for the heap is that the heap, being mostly a global resource, typically has to be multi-threading safe, i.e. each allocation and deallocation needs to be - typically - synchronized with "all" other heap accesses in the program.
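Python hides both regions, but the stack's LIFO discipline still shows through: call depth is bounded, while heap objects can be created and dropped in any order. A sketch (the recursion limit stands in for the real stack bound; this illustrates the constraint, it doesn't measure speed):

```python
import sys

def recurse(n):
    # Each call consumes one stack frame
    return 0 if n == 0 else 1 + recurse(n - 1)

old_limit = sys.getrecursionlimit()
sys.setrecursionlimit(1000)
try:
    recurse(10 ** 6)              # far deeper than the (guarded) stack allows
    overflowed = False
except RecursionError:
    overflowed = True
finally:
    sys.setrecursionlimit(old_limit)
assert overflowed

# Heap allocations have no such nesting constraint: allocate many blocks
# and release them in arbitrary order
blocks = [bytearray(1024) for _ in range(100)]
del blocks[0], blocks[50]         # free out of order; the allocator copes
assert len(blocks) == 98
```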
A clear demonstration is the stack/heap diagram at vikashazrati.wordpress.com (image omitted here).