NumPy np.apply_along_axis Function Speed-Up

Improving the speed of code that uses numpy.apply_along_axis

np.sin and * are vectorized operations, so you can apply them over whole arrays:

np.sin(data[:, 0]) * np.cos(data[:, 1])

data[:, 0] is the first column and data[:, 1] is the second.

Note that this should go really fast :)
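
As a minimal sketch (my own example, assuming data is an (N, 2) array and the original row-wise function was something like np.sin(row[0]) * np.cos(row[1])):

import numpy as np

data = np.random.rand(100000, 2)

# Slow: calls the Python-level function once per row.
slow = np.apply_along_axis(lambda row: np.sin(row[0]) * np.cos(row[1]), 1, data)

# Fast: operates on whole columns at once.
fast = np.sin(data[:, 0]) * np.cos(data[:, 1])

assert np.allclose(slow, fast)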


Here is a notebook that tests the speed of each method: notebook.

Average run time:

  • Method 1 (using numpy.apply_along_axis): 2.08s
  • Method 2 (loop applying function to rows): 1.14s
  • Method 3 (this answer): 17.3ms

numpy apply_along_axis vectorisation

Here's one vectorized approach that sets the zeros to NaN, which lets us use np.nanmax and np.nanstd to compute the max and std values while ignoring the zeros, like so -

imgn = np.where(img==0, np.nan, img)
mx = np.nanmax(imgn,0) # np.max(img,0) if all are positive numbers
st = np.nanstd(imgn,0)
mask = img > mx - 1.5*st
out = np.arange(mask.shape[0]).dot(mask)/mask.sum(0)
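
The get_y used in the timings below is not reproduced in this answer; a plausible per-column version (a hypothetical reconstruction, matching what the vectorized code computes) would be:

def get_y(col):
    # Hypothetical per-column function: threshold at max - 1.5*std of the
    # nonzero values, then return the mean row index of the pixels above it.
    vals = col[col != 0]
    mx, st = vals.max(), vals.std()
    mask = col > mx - 1.5 * st
    return np.flatnonzero(mask).mean()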

Runtime test -

In [94]: img = np.random.randint(-100,100,(2000,50))

In [95]: %timeit np.apply_along_axis(get_y, 0, img)
100 loops, best of 3: 4.36 ms per loop

In [96]: %%timeit
...: imgn = np.where(img==0, np.nan, img)
...: mx = np.nanmax(imgn,0)
...: st = np.nanstd(imgn,0)
...: mask = img > mx - 1.5*st
...: out = np.arange(mask.shape[0]).dot(mask)/mask.sum(0)
1000 loops, best of 3: 1.33 ms per loop

Thus, we are seeing a 3x+ speedup.

Why does numpy.apply_along_axis seem to be slower than a Python loop?

np.sum takes an axis parameter, so you could compute the sum simply using

sums3 = np.sum(x, axis=1)

This is much faster than the two methods you posted.

$ python -m timeit -n 1 -r 1 -s "import numpy as np;x=np.ones([100000,3])" "np.apply_along_axis(np.sum, 1, x)"
1 loops, best of 1: 3.21 sec per loop

$ python -m timeit -n 1 -r 1 -s "import numpy as np;x=np.ones([100000,3])" "np.array([np.sum(x[i,:]) for i in range(x.shape[0])])"
1 loops, best of 1: 712 msec per loop

$ python -m timeit -n 1 -r 1 -s "import numpy as np;x=np.ones([100000,3])" "np.sum(x, axis=1)"
1 loops, best of 1: 1.81 msec per loop

(As for why apply_along_axis is slower: I don't know for certain, but it is written in pure Python and is much more generic, so it has fewer optimization opportunities than the dedicated array version.)
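
For completeness, a quick sanity check (my own snippet, using random data instead of ones so the comparison is non-trivial) that all three approaches agree:

import numpy as np

x = np.random.rand(100000, 3)   # random data instead of np.ones
sums1 = np.apply_along_axis(np.sum, 1, x)
sums2 = np.array([np.sum(x[i, :]) for i in range(x.shape[0])])
sums3 = np.sum(x, axis=1)
assert np.allclose(sums1, sums2) and np.allclose(sums2, sums3)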

Easy parallelization of numpy.apply_along_axis()?

Alright, I worked it out: an idea is to use the standard multiprocessing module and split the original array into just a few chunks (so as to limit communication overhead with the workers). This can be done relatively easily as follows:

import multiprocessing

import numpy as np

def parallel_apply_along_axis(func1d, axis, arr, *args, **kwargs):
    """
    Like numpy.apply_along_axis(), but takes advantage of multiple
    cores.
    """
    # Effective axis where apply_along_axis() will be applied by each
    # worker (any non-zero axis number would work, so as to allow the use
    # of `np.array_split()`, which is only done on axis 0):
    effective_axis = 1 if axis == 0 else axis
    if effective_axis != axis:
        arr = arr.swapaxes(axis, effective_axis)

    # Chunks for the mapping (only a few chunks):
    chunks = [(func1d, effective_axis, sub_arr, args, kwargs)
              for sub_arr in np.array_split(arr, multiprocessing.cpu_count())]

    pool = multiprocessing.Pool()
    individual_results = pool.map(unpacking_apply_along_axis, chunks)
    # Freeing the workers:
    pool.close()
    pool.join()

    return np.concatenate(individual_results)

where the function unpacking_apply_along_axis() being applied in Pool.map() is defined separately, as it should be (so that subprocesses can import it), and is simply a thin wrapper that handles the fact that Pool.map() only takes a single argument:

def unpacking_apply_along_axis((func1d, axis, arr, args, kwargs)):
    """
    Like numpy.apply_along_axis(), but with arguments in a tuple
    instead.

    This function is useful with multiprocessing.Pool().map(): (1)
    map() only handles functions that take a single argument, and (2)
    this function can generally be imported from a module, as required
    by map().
    """
    return np.apply_along_axis(func1d, axis, arr, *args, **kwargs)

(In Python 3, this should be written as

def unpacking_apply_along_axis(all_args):
    (func1d, axis, arr, args, kwargs) = all_args
    return np.apply_along_axis(func1d, axis, arr, *args, **kwargs)

because tuple parameter unpacking in function signatures was removed.)

In my particular case, this resulted in a 2x speedup on 2 cores with hyper-threading. A factor closer to 4x would have been nicer, but the speedup is already welcome for just a few lines of code, and it should be better on machines with more cores (which are quite common). Maybe there is a way of avoiding data copies and using shared memory (perhaps through the multiprocessing module itself)?
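
For example, calling it could look like this (my own usage sketch; the __main__ guard matters because multiprocessing may re-import the module in worker processes, and np.sum just stands in for a more expensive row-wise function):

if __name__ == '__main__':
    arr = np.random.rand(1000, 1000)
    # Parallel row-wise sums:
    result = parallel_apply_along_axis(np.sum, 1, arr)
    # Same values as the serial version:
    assert np.allclose(result, np.apply_along_axis(np.sum, 1, arr))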

Numpy.apply_along_axis works unexpectedly when applying a function with if else condition

Your function f returns an integer (0) for the first row, and np.apply_along_axis infers the output dtype from that first result, so the later float results get truncated. Cast the return value to float:

def f(arr):
    return float(0 if arr[-1] == arr[0] else abs(arr[-1]-arr[0]))

With that change, the example data and the resulting output look like:

[[0 1 2 3]
 [1 2 3 4]
 [2 3 4 5]]
[[27.75 27.71 28.05 27.75]
 [27.71 28.05 27.75 26.55]
 [28.05 27.75 26.55 27.18]]
[0.   1.16 0.87]

P.S.

Your function f can be simplified to just return abs(arr[-1]-arr[0]), as that already covers the 0 case; you don't need the if statement.
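
Here is a minimal sketch of the underlying issue (my own example): np.apply_along_axis infers the output dtype from the first call, so an integer first result silently truncates later float values:

import numpy as np

a = np.array([[1.0, 1.0],    # first row: the un-cast f returns the int 0
              [1.0, 2.5]])   # second row: f should return 1.5

def f_int(arr):
    return 0 if arr[-1] == arr[0] else abs(arr[-1] - arr[0])

def f_float(arr):
    return float(0 if arr[-1] == arr[0] else abs(arr[-1] - arr[0]))

print(np.apply_along_axis(f_int, 1, a))    # [0 1]  -- 1.5 truncated to int
print(np.apply_along_axis(f_float, 1, a))  # [0.  1.5]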

Numpy apply along axis based on row index

When iterating over an array, either directly or with apply_along_axis, the subarray does not carry a .index attribute, so we have to pass an explicit index value to your function:

In [248]: def func(i,x):
     ...:     if i//2==0:
     ...:         x = x+10
     ...:     else:
     ...:         x = x+50
     ...:     return x
     ...:
In [249]: arr = np.arange(10).reshape(5,2)

apply_along_axis doesn't have a way to supply this index, so instead we have to use explicit iteration.

In [250]: np.array([func(i,v) for i,v in enumerate(arr)])
Out[250]:
array([[10, 11],
       [12, 13],
       [54, 55],
       [56, 57],
       [58, 59]])

Replacing // with %:

In [251]: def func(i,x):
     ...:     if i%2==0:
     ...:         x = x+10
     ...:     else:
     ...:         x = x+50
     ...:     return x
     ...:
In [252]: np.array([func(i,v) for i,v in enumerate(arr)])
Out[252]:
array([[10, 11],
       [52, 53],
       [14, 15],
       [56, 57],
       [18, 19]])

But a better way is to skip the iteration entirely:

Make an array of the row additions:

In [253]: np.where(np.arange(5)%2,50,10)
Out[253]: array([10, 50, 10, 50, 10])

Apply it via broadcasting:

In [256]: arr+np.where(np.arange(5)%2,50,10)[:,None]
Out[256]:
array([[10, 11],
       [52, 53],
       [14, 15],
       [56, 57],
       [18, 19]])
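
Putting the broadcasting version together as a single runnable snippet (my own consolidation of the session above):

import numpy as np

arr = np.arange(10).reshape(5, 2)
n_rows = arr.shape[0]

# +10 for even row indices, +50 for odd ones:
offsets = np.where(np.arange(n_rows) % 2, 50, 10)   # array([10, 50, 10, 50, 10])
out = arr + offsets[:, None]                        # broadcast over columns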

