Numpy: find first index of value fast
There is a feature request for this scheduled for Numpy 2.0.0: https://github.com/numpy/numpy/issues/2269
Is there a NumPy function to return the first index of something in an array?
Yes, given an array, array, and a value, item, to search for, you can use np.where as:
itemindex = numpy.where(array == item)
The result is a tuple with first all the row indices, then all the column indices.
For example, if the array is two-dimensional and contains your item at two locations, then
array[itemindex[0][0]][itemindex[1][0]]
would be equal to your item and so would be:
array[itemindex[0][1]][itemindex[1][1]]
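As a rough illustration of the tuple that np.where returns, here is a minimal sketch on a small made-up array (the data and the names rows, cols, first and second are assumptions for the example, not part of the original answer):

```python
import numpy as np

# Hypothetical sample data: the value 7 appears at (0, 2) and (2, 1).
array = np.array([[1, 2, 7],
                  [4, 5, 6],
                  [3, 7, 9]])
item = 7

# np.where returns one index array per dimension.
rows, cols = np.where(array == item)

# Matches are reported in row-major order, so position 0 is the first match.
first = (int(rows[0]), int(cols[0]))
second = (int(rows[1]), int(cols[1]))

print(first, second)
```

If the item is absent, rows and cols are empty, so indexing them with [0] raises an IndexError; checking rows.size first avoids that.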
Numpy first occurrence of value greater than existing value
This is a little faster (and looks nicer)
np.argmax(aa>5)
Since argmax will stop at the first True ("In case of multiple occurrences of the maximum values, the indices corresponding to the first occurrence are returned.") and doesn't save another list.
In [2]: N = 10000
In [3]: aa = np.arange(-N,N)
In [4]: timeit np.argmax(aa>N/2)
100000 loops, best of 3: 52.3 us per loop
In [5]: timeit np.where(aa>N/2)[0][0]
10000 loops, best of 3: 141 us per loop
In [6]: timeit np.nonzero(aa>N/2)[0][0]
10000 loops, best of 3: 142 us per loop
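One thing to watch with the argmax approach: when no element satisfies the condition, the mask is all False and argmax still returns 0. A minimal sketch of a guard, assuming a non-empty 1-D array (the helper name first_index_greater is made up for this example):

```python
import numpy as np

def first_index_greater(a, threshold):
    """Return the first index where a > threshold, or -1 if there is none."""
    mask = a > threshold
    idx = int(mask.argmax())
    # argmax returns 0 for an all-False mask, so confirm the hit is real.
    return idx if mask[idx] else -1

aa = np.arange(-10, 10)
print(first_index_greater(aa, 5))    # 16, since aa[16] == 6
print(first_index_greater(aa, 100))  # -1, no element exceeds 100
```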
Find the index of the first occurrence of some value that is not X or Y in a numpy array
np.unique returns the first index of each number if you specify return_index=True. You can filter the result pretty easily using, e.g., np.isin:
u, i = np.unique(vec, return_index=True)
result = i[np.isin(u, [51, 52], invert=True)]
The advantage of doing it this way is that u is a significantly reduced search space compared to the original data. Using invert=True also speeds things up a little compared to explicitly negating the resulting mask.
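To make the intermediate arrays concrete, here is a small sketch on made-up data (vec and the final first variable are assumptions for illustration). Note that result holds the first index of every remaining value, so taking its minimum gives the single first occurrence:

```python
import numpy as np

vec = np.array([51, 52, 51, 7, 52, 3, 7])   # hypothetical data

u, i = np.unique(vec, return_index=True)
# u is the sorted unique values, i the first index of each one in vec.

result = i[np.isin(u, [51, 52], invert=True)]   # first index of each value that is not 51 or 52

# The earliest of those indices is the first element that is neither 51 nor 52.
first = int(result.min()) if result.size else -1
print(first)
```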
A version of np.isin that relies on the fact that the data is already sorted could be made using np.searchsorted like this:
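def isin_sorted(a, i, invert=False):
    i = np.asarray(i)
    ind = np.searchsorted(a, i)
    # searchsorted can return a.size for values above a.max();
    # clip to the last valid index so the lookup stays in bounds
    ind = ind[a[ind.clip(max=a.size - 1)] == i]
    if invert:
        mask = np.ones(a.size, dtype=bool)
        mask[ind] = False
    else:
        mask = np.zeros(a.size, dtype=bool)
        mask[ind] = True
    return mask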
You could use this version in place of np.isin, after calling np.unique, which always returns a sorted array. For sufficiently large vec and exclusion lists, it will be more efficient:
result = i[isin_sorted(u, [51, 52], invert=True)]
numpy - return first index of element in array
You can use np.argwhere to get the matching indices packed as a 2D array, with each row holding the indices of one match, and then index into the first row, like so:
np.argwhere(zArray==match)[0]
Alternatively, a faster option is to use argmax to get the index of the first match on a flattened version, and np.unravel_index to convert it into a tuple of per-dimension indices:
np.unravel_index((zArray==match).argmax(), zArray.shape)
Sample run -
In [100]: zArray
Out[100]:
array([[   0, 1200, 5000],   # different from sample for a generic one
       [1320,   24, 5000],
       [5000,  234, 5230]])
In [101]: match
Out[101]: 5000
In [102]: np.argwhere(zArray==match)[0]
Out[102]: array([0, 2])
In [103]: np.unravel_index((zArray==match).argmax(), zArray.shape)
Out[103]: (0, 2)
Runtime test -
In [104]: a = np.random.randint(0,100,(1000,1000))
In [105]: %timeit np.argwhere(a==50)[0]
100 loops, best of 3: 2.41 ms per loop
In [106]: %timeit np.unravel_index((a==50).argmax(), a.shape)
1000 loops, best of 3: 493 µs per loop
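Both variants misbehave when there is no match: argwhere(...)[0] raises an IndexError, while the argmax version silently reports position 0 in every dimension. A minimal sketch of a guarded version, assuming a non-empty array (the helper name first_match_index is hypothetical):

```python
import numpy as np

def first_match_index(arr, value):
    """Index tuple of the first element equal to value (C order), or None."""
    mask = (arr == value)
    flat = int(mask.argmax())          # flat index of the first True, or 0 if none
    if not mask.flat[flat]:            # all-False mask: there was no match
        return None
    return tuple(int(k) for k in np.unravel_index(flat, arr.shape))

zArray = np.array([[0, 1200, 5000],
                   [1320, 24, 5000],
                   [5000, 234, 5230]])
print(first_match_index(zArray, 5000))   # (0, 2)
print(first_match_index(zArray, 9999))   # None
```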
Python/NumPy: find the first index of zero, then replace all elements with zero after that for each row
One way to accomplish question 1 is to use numpy.cumprod:
>>> np.cumprod(a, axis=1)
array([[1, 0, 0, 0, 0],
[1, 1, 1, 1, 0],
[1, 0, 0, 0, 0],
[1, 0, 0, 0, 0]])
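The output above suggests the input already consisted of 0s and 1s. For arbitrary values, one possible adaptation (a sketch under that assumption, not the original answer's code) is to take the cumulative product of the a != 0 mask and multiply it back into the array; the sample array below is made up:

```python
import numpy as np

a = np.array([[3, 0, 5, 2],
              [4, 1, 7, 9],
              [0, 6, 0, 8]])   # hypothetical data with arbitrary values

# Cumulative product of the nonzero mask: nonzero until the first zero
# in each row, then 0 for the rest of that row.
keep = np.cumprod(a != 0, axis=1)

# Multiplying zeroes out everything from the first zero onwards.
zeroed = a * keep

# First index of zero per row, with -1 for rows that contain no zero.
has_zero = (a == 0).any(axis=1)
first_zero = np.where(has_zero, (a == 0).argmax(axis=1), -1)

print(zeroed)
print(first_zero)
```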
Efficiently return the index of the first value satisfying condition in array
numba
With numba it's possible to optimise both scenarios. Syntactically, you need only construct a function with a simple for loop:
from numba import njit

@njit
def get_first_index_nb(A, k):
    for i in range(len(A)):
        if A[i] > k:
            return i
    return -1

idx = get_first_index_nb(A, 0.9)
Numba improves performance by JIT ("Just In Time") compiling code and leveraging CPU-level optimisations. A regular for loop without the @njit decorator would typically be slower than the methods you've already tried for the case where the condition is met late.
For a Pandas numeric series df['data'], you can simply feed the NumPy representation to the JIT-compiled function:
idx = get_first_index_nb(df['data'].values, 0.9)
Generalisation
Since numba permits functions as arguments, and assuming the passed function can also be JIT-compiled, you can arrive at a method to calculate the nth index where a condition is met for an arbitrary func.
@njit
def get_nth_index_count(A, func, count):
    c = 0
    for i in range(len(A)):
        if func(A[i]):
            c += 1
            if c == count:
                return i
    return -1

@njit
def func(val):
    return val > 0.9

# get index of 3rd value where func evaluates to True
idx = get_nth_index_count(arr, func, 3)
For the 3rd last value, you can feed in the reversed array, arr[::-1], and subtract the result from len(arr) - 1; the - 1 is necessary to account for 0-indexing.
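The index arithmetic for counting from the end can be sketched as follows. To keep the example runnable without numba, the loop is left undecorated here; the wrapper nth_last_index and the sample data are assumptions made for illustration:

```python
import numpy as np

def get_nth_index_count(A, func, count):
    # Same loop as above, left undecorated so it runs without numba.
    c = 0
    for i in range(len(A)):
        if func(A[i]):
            c += 1
            if c == count:
                return i
    return -1

def nth_last_index(A, func, count):
    """Index (in the original order) of the count-th match counted from the end."""
    rev = get_nth_index_count(A[::-1], func, count)
    # Preserve the -1 "not found" sentinel instead of mapping it to len(A).
    return -1 if rev == -1 else len(A) - 1 - rev

arr = np.array([0.95, 0.1, 0.99, 0.2, 0.91, 0.93, 0.5])
print(nth_last_index(arr, lambda v: v > 0.9, 3))   # 2
```

Matches sit at indices 0, 2, 4 and 5, so the third one from the end is index 2. The explicit check on -1 matters: without it, a missing match would come back as len(arr), which looks like a valid-ish index.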
Performance benchmarking
# Python 3.6.5, NumPy 1.14.3, Numba 0.38.0
np.random.seed(0)
arr = np.random.rand(10**7)
m = 0.9
n = 0.999999
@njit
def get_first_index_nb(A, k):
    for i in range(len(A)):
        if A[i] > k:
            return i
    return -1

# same loop without @njit, i.e. run by the regular Python interpreter
def get_first_index_np(A, k):
    for i in range(len(A)):
        if A[i] > k:
            return i
    return -1

%timeit get_first_index_nb(arr, m) # 375 ns
%timeit get_first_index_np(arr, m) # 2.71 µs
%timeit next(iter(np.where(arr > m)[0]), -1) # 43.5 ms
%timeit next((idx for idx, val in enumerate(arr) if val > m), -1) # 2.5 µs
%timeit get_first_index_nb(arr, n) # 204 µs
%timeit get_first_index_np(arr, n) # 44.8 ms
%timeit next(iter(np.where(arr > n)[0]), -1) # 21.4 ms
%timeit next((idx for idx, val in enumerate(arr) if val > n), -1) # 39.2 ms
numpy find slice along an axes where the first and last occurring value occurs
You can probably optimize this to be faster, but here is a vectorized version of what you are looking for:
axis = 1
mask = np.where(x==val)[axis]
first, last = np.amin(mask), np.amax(mask)
It first finds the element val in your array using np.where and returns the min and max of the indices along the desired axis.
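As a small worked example of the snippet above (the array x and the final slicing step are assumptions added for illustration), note that np.amin and np.amax raise a ValueError on an empty array, so it is worth checking that val was found at all:

```python
import numpy as np

x = np.array([[0, 5, 0, 0],
              [0, 0, 5, 0],
              [0, 0, 0, 0]])   # hypothetical data
val = 5
axis = 1

# Indices along the chosen axis (here: columns) of every match.
idx = np.where(x == val)[axis]

if idx.size:                                   # guard against "val not present"
    first, last = int(idx.min()), int(idx.max())
    sliced = x[:, first:last + 1]              # columns from first to last match, inclusive
    print(first, last)
    print(sliced)
```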