Most Efficient Way to Find Mode in Numpy Array

Check scipy.stats.mode() (inspired by @tom10's comment):

import numpy as np
from scipy import stats

a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])

m = stats.mode(a)
print(m)

Output:

ModeResult(mode=array([[1, 3, 2, 2, 1, 1]]), count=array([[1, 2, 2, 2, 1, 2]]))

As you can see, it returns both the modes and their counts. You can select the modes directly via m[0] (or, equivalently, m.mode):

print(m[0])

Output:

[[1 3 2 2 1 1]]

What is the fastest way to get the mode of a numpy array

The implementation of scipy.stats.mode has a Python loop for handling the axis argument with multidimensional arrays. The following simple implementation, for one-dimensional arrays only, is faster:

def mode1(x):
    values, counts = np.unique(x, return_counts=True)
    m = counts.argmax()
    return values[m], counts[m]

Here's an example. First, make an array of integers with length 1000000.

In [40]: x = np.random.randint(0, 1000, size=(2, 1000000)).sum(axis=0)

In [41]: x.shape
Out[41]: (1000000,)

Check that scipy.stats.mode and mode1 give the same result.

In [42]: from scipy.stats import mode

In [43]: mode(x)
Out[43]: ModeResult(mode=array([1009]), count=array([1066]))

In [44]: mode1(x)
Out[44]: (1009, 1066)

Now check the performance.

In [45]: %timeit mode(x)
2.91 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [46]: %timeit mode1(x)
39.6 ms ± 83.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

That is 2.91 seconds for mode(x) versus only 39.6 milliseconds for mode1(x), roughly 73 times faster.
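
When the data are non-negative integers with a modest maximum value, as in the example above, np.bincount can count occurrences without the sort that np.unique performs. The following is a minimal sketch under that assumption; mode_bincount is a hypothetical name, not part of NumPy or SciPy, and it has not been timed against mode1 here.

```python
import numpy as np

def mode_bincount(x):
    """Mode of a 1-D array of non-negative integers (assumes a small max value)."""
    x = np.asarray(x)
    counts = np.bincount(x)   # counts[v] = number of occurrences of v
    m = counts.argmax()       # first index with the highest count (smallest value on ties)
    return m, counts[m]

def mode1(x):
    # Reference implementation from the answer above
    values, counts = np.unique(x, return_counts=True)
    m = counts.argmax()
    return values[m], counts[m]

rng = np.random.default_rng(0)
x = rng.integers(0, 1000, size=(2, 100_000)).sum(axis=0)
print(mode_bincount(x) == mode1(x))
```

Both functions resolve ties the same way (the smallest of the most frequent values), so their results should agree on any valid input.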

Python & Numpy - Finding the Mode of Values in an Array that aren't Zero

You can just mask the array and use np.histogram:

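# 257 edges give one bin per integer value 0..255 (assuming 8-bit data in mR);
# with np.arange(256) the last bin would merge the values 254 and 255
counts, bins = np.histogram(mR[mR>0], bins=np.arange(257))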

# mode
modeR = np.argmax(counts)
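
As a quick check of the masking idea, here is a small self-contained sketch. It assumes mR holds 8-bit values (0 to 255) and uses np.bincount in place of np.histogram, which gives exactly one count per integer value. The sample array is made up for illustration.

```python
import numpy as np

# Hypothetical 8-bit channel; zeros are the background to ignore
mR = np.array([[0, 0, 0, 12],
               [12, 7, 0, 7],
               [12, 0, 0, 255]], dtype=np.uint8)

nonzero = mR[mR > 0]                          # 1-D array of the non-zero values
counts = np.bincount(nonzero, minlength=256)  # counts[v] = occurrences of value v
modeR = counts.argmax()                       # most frequent non-zero value

print(modeR)  # 12
```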

Best way to find modes of an array along the column

If you look at the numpy.unique documentation, this function returns the values and the associated counts (because you specified return_counts=True). A slight modification of your code is necessary to give the correct result. What you are trying to do is find the value associated with the highest count:

import numpy as np
a = np.array([[1,5,3,4],[1,5,3,3],[1,5,3,3]])
result = np.zeros(a.shape[1])
for i in range(len(result)):
    values, counts = np.unique(a[:,i], return_counts = True)
    result[i] = values[np.argmax(counts)]
print(result)

Output:

% python3 script.py
[1. 5. 3. 3.]

Here is code that compares your solution with the scipy.stats.mode function:

import numpy as np
import scipy.stats as sps
import time

a = np.random.randint(1,100,(100,100))

t_start = time.time()
result = np.zeros(a.shape[1])
for i in range(len(result)):
    values, counts = np.unique(a[:,i], return_counts = True)
    result[i] = values[np.argmax(counts)]
print('Timer 1: ', (time.time()-t_start), 's')

t_start = time.time()
result_2 = sps.mode(a, axis=0).mode
print('Timer 2: ', (time.time()-t_start), 's')

print('Matrices are equal!' if np.allclose(result, result_2) else 'Matrices differ!')

Output:

% python3 script.py
Timer 1: 0.002721071243286133 s
Timer 2: 0.003339052200317383 s
Matrices are equal!

I tried several parameter values, and your code is actually faster than the scipy.stats.mode function, so it is probably close to optimal.
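
If the values are non-negative integers from a small, known range, the Python loop over columns can be avoided by counting all columns in a single np.bincount call. This is only a sketch: column_modes and its n_values parameter are hypothetical names, and whether it beats the loop depends on the array shape and value range.

```python
import numpy as np

def column_modes(a, n_values):
    """Per-column mode of a 2-D array of integers in [0, n_values)."""
    a = np.asarray(a)
    n_cols = a.shape[1]
    # Shift column j into its own block [j*n_values, (j+1)*n_values)
    shifted = a + np.arange(n_cols) * n_values
    counts = np.bincount(shifted.ravel(), minlength=n_cols * n_values)
    # Row j of the reshaped table holds the value counts of column j
    return counts.reshape(n_cols, n_values).argmax(axis=1)

a = np.array([[1, 5, 3, 4], [1, 5, 3, 3], [1, 5, 3, 3]])
print(column_modes(a, n_values=6))  # [1 5 3 3]
```

As with np.unique followed by argmax, ties are resolved in favor of the smallest value.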

Iterate through rows of numpy array to find mode

  • use np.unique with the return_counts parameter.
  • use argmax on the counts array to pick the corresponding value from the unique array.
  • use np.apply_along_axis for a custom function mode

def mode(a):
    u, c = np.unique(a, return_counts=True)
    return u[c.argmax()]

a = np.array([
    [1, 2, 3],
    [2, 3, 4],
    [3, 4, 5],
    [2, 5, 6],
    [4, 1, 7],
    [5, 4, 8],
    [6, 6, 3]
])

np.apply_along_axis(mode, 0, a)

array([2, 4, 3])
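
The second argument of np.apply_along_axis selects the axis along which the 1-D slices are taken: 0 passes each column to the function, 1 passes each row. A minimal sketch with a made-up array (demo is just an illustrative name):

```python
import numpy as np

def mode(a):
    u, c = np.unique(a, return_counts=True)
    return u[c.argmax()]

demo = np.array([[1, 1, 2],
                 [3, 4, 4],
                 [5, 5, 5]])

col_modes = np.apply_along_axis(mode, 0, demo)  # one mode per column
row_modes = np.apply_along_axis(mode, 1, demo)  # one mode per row

print(col_modes)  # [1 1 2]
print(row_modes)  # [1 4 5]
```

Every column of demo has three distinct values, so the column results show the tie-breaking rule: the smallest value wins.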

Finding a numpy mode vector

We are dealing with one-hot vectors as rows of the 2D input array, so the argmax of each row uniquely identifies that row's one-hot vector. Get those, then get their counts. Any row whose argmax has the highest count is a valid mode row. Let's pick the first of those with one more use of argmax and finally index into the 2D input.

Hence, one implementation -

idx = np.argmax(x,1)
count = np.bincount(idx)
out = x[(idx==count.argmax()).argmax()]
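
To make the three steps concrete, here is the same code run on a small made-up input in which the vector [0, 1, 0] appears twice:

```python
import numpy as np

# Made-up input: each row is a one-hot vector
x = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])

idx = np.argmax(x, 1)       # position of the 1 in each row -> [1 0 1 2]
count = np.bincount(idx)    # occurrences of each position  -> [1 2 1]
# First row whose position is the most common one
out = x[(idx == count.argmax()).argmax()]

print(out)  # [0 1 0]
```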

What is an efficient way to extract eigenvalues from 3x3 matrix elements stored in a 6xN `numpy` array?

np.linalg.eig and np.sort are slow for small matrices. While there might be a vectorized way to do that, I do not expect it to be fast, because the implementation of np.linalg.eig would not be efficient anyway in this specific case.

One efficient solution is to implement this in Numba or Cython based on the analytic formula for the eigenvalues of a 3x3 symmetric matrix. The sort can be done efficiently using a sorting network. The analytic formula can be computed with mathematical tools like Wolfram Alpha. The expression can then be factorized manually, though compilers are able to apply common sub-expression elimination. Note that the intermediate variables are complex numbers. I assumed the eigenvalues are real, because you want to sort them, which requires real values (and a real symmetric matrix always has real eigenvalues). Here is the resulting implementation:

import numba as nb
import numpy as np

@nb.njit
def sorted_eigenvalues(v):
    assert v.shape == (3,3)
    a, b, c, d, e, f = v[0,0], v[1,1], v[2,2], v[0,1], v[0,2], v[1,2]
    assert v[1,0] == d and v[2,0] == e and v[2,1] == f

    # Analytic eigenvalues solution of the 3x3 input matrix
    tmp1 = -a**2 + a*b + a*c - b**2 + b*c - c**2 - 3*d**2 - 3*e**2 - 3*f**2
    tmp2 = 2*a**3 - 3*a**2*b - 3*a**2*c - 3*a*b**2 + 12*a*b*c - 3*a*c**2 + 9*a*d**2 + 9*a*e**2 - 18*a*f**2 + 2*b**3 - 3*b**2*c - 3*b*c**2 + 9*b*d**2 - 18*b*e**2 + 9*b*f**2 + 2*c**3 - 18*c*d**2 + 9*c*e**2 + 9*c*f**2 + 54*d*e*f
    tmp3 = np.sqrt((4*tmp1**3 + tmp2**2) + 0j)
    tmp4 = (tmp2 + tmp3) ** (1/3)
    tmp5 = 1/3*(a + b + c)
    tmp6 = 1 + 1j*np.sqrt(3)
    tmp7 = 1 - 1j*np.sqrt(3)
    eigv1 = tmp4/(3*2**(1/3)) - (2**(1/3)*tmp1)/(3*tmp4) + tmp5
    eigv2 = (tmp6*tmp1)/(3*2**(2/3)*tmp4) - (tmp7*tmp4)/(6*2**(1/3)) + tmp5
    eigv3 = (tmp7*tmp1)/(3*2**(2/3)*tmp4) - (tmp6*tmp4)/(6*2**(1/3)) + tmp5

    # Assume the values are real ones and remove the FP rounding errors
    eigv1 = np.real(eigv1)
    eigv2 = np.real(eigv2)
    eigv3 = np.real(eigv3)

    # Sort the eigenvalues using a fast sorting network
    eigv1, eigv2 = min(eigv1, eigv2), max(eigv1, eigv2)
    eigv2, eigv3 = min(eigv2, eigv3), max(eigv2, eigv3)
    eigv1, eigv2 = min(eigv1, eigv2), max(eigv1, eigv2)

    return eigv1, eigv2, eigv3

This function is about 50 times faster for computing one 3x3 symmetric matrix on my machine (it takes only 0.5 µs to do the computation in Numba, while the overhead of calling a Numba function is about 0.2-0.3 µs).

The second step is to loop over the columns of the S_by_poi array in a Numba function and call the above function. For better performance, I advise you to pass the parameters a, b, c, d, e, f directly to the sorted_eigenvalues function, so as to avoid the expensive creation of NumPy arrays (or even memory loads/stores). I expect this to be 100 times faster in a loop (since the Numba function call overhead is removed). You can even parallelize the computation using prange and parallel=True in Numba, resulting in an even faster computation (several hundred times faster, if not several thousand times faster, on high-end computing servers)!
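
To validate a Numba version, it helps to have a slow but trustworthy reference. The sketch below is not the analytic method described above: it rebuilds the N symmetric 3x3 matrices and calls np.linalg.eigvalsh, which accepts stacked matrices and returns the eigenvalues of each in ascending order. The row order of the 6xN array (a, b, c, d, e, f, as in sorted_eigenvalues) is an assumption, and reference_eigenvalues is a hypothetical name.

```python
import numpy as np

def reference_eigenvalues(S):
    """Sorted eigenvalues for N symmetric 3x3 matrices packed as a 6xN array.

    Assumed row order: S[0..5] = m00, m11, m22, m01, m02, m12.
    Returns an array of shape (N, 3), ascending along the last axis.
    """
    a, b, c, d, e, f = S
    mats = np.empty((S.shape[1], 3, 3), dtype=np.float64)
    mats[:, 0, 0], mats[:, 1, 1], mats[:, 2, 2] = a, b, c
    mats[:, 0, 1] = mats[:, 1, 0] = d
    mats[:, 0, 2] = mats[:, 2, 0] = e
    mats[:, 1, 2] = mats[:, 2, 1] = f
    return np.linalg.eigvalsh(mats)

rng = np.random.default_rng(0)
S = rng.standard_normal((6, 5))        # made-up data standing in for S_by_poi
print(reference_eigenvalues(S).shape)  # (5, 3)
```

Comparing the output of the compiled loop against this reference with np.allclose on a few thousand random columns is a cheap way to catch a wrong sign or a swapped off-diagonal entry.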


