Most efficient way to find mode in numpy array
Check scipy.stats.mode() (inspired by @tom10's comment):
import numpy as np
from scipy import stats
a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])
m = stats.mode(a)
print(m)
Output:
ModeResult(mode=array([[1, 3, 2, 2, 1, 1]]), count=array([[1, 2, 2, 2, 1, 2]]))
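As you can see, it returns both the modes and the counts. The output shown is from an older SciPy release; since SciPy 1.11, keepdims defaults to False, so mode and count come back as 1-D arrays unless you pass keepdims=True. You can select the modes directly via m[0]: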
print(m[0])
Output:
[[1 3 2 2 1 1]]
what is the fastest way to get the mode of a numpy array
The implementation of scipy.stats.mode has a Python loop for handling the axis argument with multidimensional arrays. The following simple implementation, for one-dimensional arrays only, is faster:
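def mode1(x):
    # Unique values (sorted) and how often each one occurs
    values, counts = np.unique(x, return_counts=True)
    # Index of the highest count; ties resolve to the smallest value
    m = counts.argmax()
    return values[m], counts[m]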
Here's an example. First, make an array of integers with length 1000000.
In [40]: x = np.random.randint(0, 1000, size=(2, 1000000)).sum(axis=0)
In [41]: x.shape
Out[41]: (1000000,)
Check that scipy.stats.mode and mode1 give the same result.
In [42]: from scipy.stats import mode
In [43]: mode(x)
Out[43]: ModeResult(mode=array([1009]), count=array([1066]))
In [44]: mode1(x)
Out[44]: (1009, 1066)
Now check the performance.
In [45]: %timeit mode(x)
2.91 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [46]: %timeit mode1(x)
39.6 ms ± 83.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.91 seconds for mode(x) and only 39.6 milliseconds for mode1(x).
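When the data are non-negative integers, as in the benchmark above, np.bincount is another option worth timing. The following is a minimal sketch, not part of the original answer; the name mode_bincount is hypothetical, and the approach assumes the largest value is small enough that an array of that length fits comfortably in memory.

```python
import numpy as np

def mode_bincount(x):
    """Return (mode, count) for a 1-D array of non-negative integers."""
    counts = np.bincount(x)      # counts[v] = number of occurrences of v
    m = counts.argmax()          # smallest value with the highest count
    return m, counts[m]

rng = np.random.default_rng(0)
x = rng.integers(0, 1000, size=(2, 100_000)).sum(axis=0)

# Cross-check against the np.unique approach used by mode1
values, counts = np.unique(x, return_counts=True)
i = counts.argmax()
assert mode_bincount(x) == (values[i], counts[i])
```

Both approaches break ties the same way (the smallest value wins), so the results should agree; which one is faster depends on the value range and is best measured with %timeit on your own data.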
Python & Numpy - Finding the Mode of Values in an Array that aren't Zero
You can just mask the array and use np.histogram:
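# Assuming mR holds 8-bit values (0-255): 257 edges give one bin per value.
# With np.arange(256) the last bin would cover both 254 and 255.
counts, bins = np.histogram(mR[mR>0], bins=np.arange(257))
# mode
modeR = np.argmax(counts)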
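For integer data the same masking idea can be written with np.bincount, which avoids having to think about bin edges. This is a small self-contained sketch; the array mR below is a made-up stand-in for one 8-bit image channel, not data from the question.

```python
import numpy as np

# Hypothetical 8-bit channel; zeros are the values to ignore
mR = np.array([[0,   0, 12,  12],
               [7,  12,  0, 255],
               [255, 7,  0,   0]], dtype=np.uint8)

nonzero = mR[mR > 0]                           # 1-D array of the non-zero pixels
counts = np.bincount(nonzero, minlength=256)   # counts[v] = occurrences of value v
modeR = int(counts.argmax())                   # most frequent non-zero value

print(modeR)
```

If the mask can remove every element, check nonzero.size first, because argmax on an all-zero counts array returns 0.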
Best way to find modes of an array along the column
If you look at the numpy.unique documentation, this function returns the values and the associated counts (because you specified return_counts=True). A slight modification of your code is necessary to give the correct result. What you are trying to do is find the value associated with the highest count:
import numpy as np
a = np.array([[1,5,3,4],[1,5,3,3],[1,5,3,3]])
result = np.zeros(a.shape[1])
for i in range(len(result)):
    values, counts = np.unique(a[:,i], return_counts = True)
    result[i] = values[np.argmax(counts)]
print(result)
Output:
% python3 script.py
[1. 5. 3. 3.]
Here is code that compares your solution with the scipy.stats.mode function:
import numpy as np
import scipy.stats as sps
import time
a = np.random.randint(1,100,(100,100))
t_start = time.time()
result = np.zeros(a.shape[1])
for i in range(len(result)):
    values, counts = np.unique(a[:,i], return_counts = True)
    result[i] = values[np.argmax(counts)]
print('Timer 1: ', (time.time()-t_start), 's')
t_start = time.time()
result_2 = sps.mode(a, axis=0).mode
print('Timer 2: ', (time.time()-t_start), 's')
print('Matrices are equal!' if np.allclose(result, result_2) else 'Matrices differ!')
Output:
% python3 script.py
Timer 1: 0.002721071243286133 s
Timer 2: 0.003339052200317383 s
Matrices are equal!
I tried several parameter values, and your code is actually faster than the scipy.stats.mode function, so it is probably close to optimal.
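If the Python loop over columns ever becomes the bottleneck, one possible way to remove it for non-negative integer data is to shift each column into its own value range and call np.bincount once. This is a sketch under that assumption, not something from the original answer; column_modes is a hypothetical name.

```python
import numpy as np

def column_modes(a):
    """Mode of each column of a 2-D array of non-negative integers."""
    n_rows, n_cols = a.shape
    k = int(a.max()) + 1                     # number of bins needed per column
    # Shift column j into the range [j*k, (j+1)*k) so one bincount covers all columns
    shifted = a + np.arange(n_cols) * k
    counts = np.bincount(shifted.ravel(), minlength=n_cols * k).reshape(n_cols, k)
    return counts.argmax(axis=1)             # ties resolve to the smallest value

a = np.array([[1, 5, 3, 4], [1, 5, 3, 3], [1, 5, 3, 3]])
print(column_modes(a))

# Cross-check against the loop-based version on random data
rng = np.random.default_rng(0)
b = rng.integers(1, 100, size=(100, 100))
expected = np.empty(b.shape[1], dtype=np.int64)
for i in range(b.shape[1]):
    values, counts = np.unique(b[:, i], return_counts=True)
    expected[i] = values[np.argmax(counts)]
assert np.array_equal(column_modes(b), expected)
```

The temporary counts array has n_cols * (max + 1) entries, so this trade of memory for speed only makes sense when the value range is modest.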
Iterate through rows of numpy array to find mode
- use np.unique with the return_counts parameter.
- use argmax on the counts array to get the value from the unique array.
- use np.apply_along_axis for a custom function mode
def mode(a):
    u, c = np.unique(a, return_counts=True)
    return u[c.argmax()]
a = np.array([
    [1, 2, 3],
    [2, 3, 4],
    [3, 4, 5],
    [2, 5, 6],
    [4, 1, 7],
    [5, 4, 8],
    [6, 6, 3]
])
np.apply_along_axis(mode, 0, a)
array([2, 4, 3])
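Note that axis 0 in the call above yields one mode per column. To get one mode per row, pass axis 1 instead. The small array b below is made up for illustration so that the difference between the two axes is easy to see.

```python
import numpy as np

def mode(a):
    u, c = np.unique(a, return_counts=True)
    return u[c.argmax()]

b = np.array([[1, 1, 2],
              [3, 4, 4],
              [5, 5, 5]])

col_modes = np.apply_along_axis(mode, 0, b)   # function sees each column
row_modes = np.apply_along_axis(mode, 1, b)   # function sees each row

print(col_modes)
print(row_modes)
```

When every value in a slice occurs once, as in the columns of b, argmax returns the first maximum, so the smallest value is reported as the mode.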
Finding a numpy mode vector
We are dealing with one-hot vectors as rows of the 2D input array. So, the argmax of each row is unique to each one-hot vector. Get those. Then, get their counts. Any one of the rows with the max argmax count would be the desired mode row output. Let's pick the first of those with one more use of argmax and finally index into the 2D input.
Hence, one implementation -
idx = np.argmax(x,1)
count = np.bincount(idx)
out = x[(idx==count.argmax()).argmax()]
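To see the three steps at work, here is a small worked example. The input x is a made-up set of one-hot rows, and the intermediate name first is introduced only for readability.

```python
import numpy as np

# Hypothetical input: five one-hot rows, three of which are [0, 1, 0]
x = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])

idx = np.argmax(x, axis=1)                  # position of the 1 in each row
count = np.bincount(idx)                    # how often each position occurs
first = (idx == count.argmax()).argmax()    # first row holding the most common position
out = x[first]

print(idx, count, first, out)
```

The comparison idx == count.argmax() gives a boolean mask, and argmax on a boolean array returns the index of the first True, which is why the first matching row is picked.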
What is an efficient way to extract eigenvalues from 3x3 matrix elements stored in a 6xN `numpy` array?
np.linalg.eig and np.sort are slow for computing small matrices. While there might be a vectorized way to do that, I do not expect it to be fast because the implementation of np.linalg.eig would not be efficient anyway in this specific case.
One efficient solution is to implement this in Numba or Cython based on the analytic formula for the eigenvalues of a 3x3 symmetric matrix. The sort can be done efficiently using sorting networks. The analytic formula can be computed by mathematical tools like Wolfram Alpha. The expression can then be factored by hand, although compilers can apply common sub-expression elimination on their own. Note that the intermediate variables are complex numbers. The eigenvalues of a real symmetric matrix are always real, so the final imaginary parts are only floating-point rounding noise and can be discarded before sorting. Here is the resulting implementation:
import numpy as np
import numba as nb

@nb.njit
def sorted_eigenvalues(v):
    assert v.shape == (3,3)
    a, b, c, d, e, f = v[0,0], v[1,1], v[2,2], v[0,1], v[0,2], v[1,2]
    assert v[1,0] == d and v[2,0] == e and v[2,1] == f

    # Analytic eigenvalues solution of the 3x3 input matrix
    tmp1 = -a**2 + a*b + a*c - b**2 + b*c - c**2 - 3*d**2 - 3*e**2 - 3*f**2
    tmp2 = 2*a**3 - 3*a**2*b - 3*a**2*c - 3*a*b**2 + 12*a*b*c - 3*a*c**2 + 9*a*d**2 + 9*a*e**2 - 18*a*f**2 + 2*b**3 - 3*b**2*c - 3*b*c**2 + 9*b*d**2 - 18*b*e**2 + 9*b*f**2 + 2*c**3 - 18*c*d**2 + 9*c*e**2 + 9*c*f**2 + 54*d*e*f
    tmp3 = np.sqrt((4*tmp1**3 + tmp2**2) + 0j)
    tmp4 = (tmp2 + tmp3) ** (1/3)
    tmp5 = 1/3*(a + b + c)
    tmp6 = 1 + 1j*np.sqrt(3)
    tmp7 = 1 - 1j*np.sqrt(3)
    eigv1 = tmp4/(3*2**(1/3)) - (2**(1/3)*tmp1)/(3*tmp4) + tmp5
    eigv2 = (tmp6*tmp1)/(3*2**(2/3)*tmp4) - (tmp7*tmp4)/(6*2**(1/3)) + tmp5
    eigv3 = (tmp7*tmp1)/(3*2**(2/3)*tmp4) - (tmp6*tmp4)/(6*2**(1/3)) + tmp5

    # Assume the values are real ones and remove the FP rounding errors
    eigv1 = np.real(eigv1)
    eigv2 = np.real(eigv2)
    eigv3 = np.real(eigv3)

    # Sort the eigenvalues using a fast sorting network
    eigv1, eigv2 = min(eigv1, eigv2), max(eigv1, eigv2)
    eigv2, eigv3 = min(eigv2, eigv3), max(eigv2, eigv3)
    eigv1, eigv2 = min(eigv1, eigv2), max(eigv1, eigv2)

    return eigv1, eigv2, eigv3
This function is about 50 times faster for computing one 3x3 symmetric matrix on my machine (it takes only 0.5 µs to do the computation in Numba, while the overhead of calling a Numba function is about 0.2-0.3 µs).
The second step is to loop over the columns of the S_by_poi array in a Numba function and call the above function. For better performance, I advise you to pass the parameters a, b, c, d, e, f directly to the sorted_eigenvalues function to avoid the expensive creation of NumPy arrays (or even memory loads/stores). I expect this to be 100 times faster in a loop (since the Numba function call overhead is removed). You can even parallelize the computation using prange and parallel=True in Numba, resulting in an even faster computation (several hundred times faster, if not several thousand times faster, on high-end computing servers)!
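Whatever kernel you end up writing, it helps to have a plain NumPy baseline to validate it against. The sketch below is not the fast path described above; it simply unpacks the 6xN array into N full 3x3 matrices and calls np.linalg.eigvalsh, which accepts stacked symmetric matrices and returns eigenvalues in ascending order. The function name and the row order (a, b, c, d, e, f, matching the unpacking in sorted_eigenvalues) are assumptions, since the layout of S_by_poi is not shown here.

```python
import numpy as np

def sorted_eigenvalues_reference(S):
    """Ascending eigenvalues of N symmetric 3x3 matrices packed as a (6, N) array.

    Assumed row order: a=m00, b=m11, c=m22, d=m01, e=m02, f=m12.
    """
    a, b, c, d, e, f = S
    n = S.shape[1]
    M = np.empty((n, 3, 3), dtype=np.float64)
    M[:, 0, 0], M[:, 1, 1], M[:, 2, 2] = a, b, c   # diagonal entries
    M[:, 0, 1] = M[:, 1, 0] = d                    # symmetric off-diagonal pairs
    M[:, 0, 2] = M[:, 2, 0] = e
    M[:, 1, 2] = M[:, 2, 1] = f
    return np.linalg.eigvalsh(M)                   # shape (N, 3), ascending per row

# Two test matrices: diag(3, 1, 2) and [[2, 1, 0], [1, 2, 0], [0, 0, 2]]
S = np.array([[3.0, 2.0],
              [1.0, 2.0],
              [2.0, 2.0],
              [0.0, 1.0],
              [0.0, 0.0],
              [0.0, 0.0]])
eigs = sorted_eigenvalues_reference(S)
print(eigs)
```

Comparing a Numba kernel against this baseline with np.allclose on random symmetric inputs is a cheap way to catch sign or ordering mistakes in the analytic formula.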