Why does numpy.any have no short-circuit mechanism?
It's an unfixed performance regression, NumPy issue 3446. There actually is short-circuiting logic, but a change to the ufunc.reduce machinery introduced an unnecessary chunk-based outer loop around it, and that outer loop doesn't know how to short-circuit. You can see some explanation of the chunking machinery here.
The short-circuiting effects wouldn't have shown up in your test even without the regression, though. First, you're timing the array creation, and second, I don't think the short-circuit logic was ever put in for any input dtype but boolean. From the discussion, it sounds like the details of the ufunc reduction machinery behind numpy.any would have made that difficult.
The discussion does bring up the surprising point that the argmin and argmax methods appear to short-circuit for boolean input. A quick test shows that as of NumPy 1.12 (not quite the most recent version, but the version currently on Ideone), x[x.argmax()] short-circuits, and it outcompetes x.any() and x.max() for 1-dimensional boolean input no matter whether the input is small or large and no matter whether the short-circuiting pays off. Weird!
Short circuit numpy logical_and on pandas series
One idea is:
mask = (csv_df['time'].dt.hour.isin(hours_set) &
        csv_df['time'].dt.strftime('%a').isin(days_set))
Another idea, if most values do not match, is to filter by the first condition and then the second:
csv_df1 = csv_df.loc[csv_df['time'].dt.strftime('%a').isin(days_set)]
csv_df2 = csv_df1.loc[csv_df1['time'].dt.hour.isin(hours_set)]
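As a runnable sketch of the two-step filter, where the DataFrame contents, hours_set, and days_set are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data; only the 'time' column matters here.
csv_df = pd.DataFrame({'time': pd.to_datetime([
    '2021-01-04 09:00',   # Monday, hour 9
    '2021-01-05 22:00',   # Tuesday, hour 22
    '2021-01-09 09:00',   # Saturday, hour 9
])})
hours_set = {9, 10, 11}
days_set = {'Mon', 'Tue', 'Wed', 'Thu', 'Fri'}

# Filter on the (presumably more selective) day condition first,
# so the second pass only sees the surviving rows.
csv_df1 = csv_df.loc[csv_df['time'].dt.strftime('%a').isin(days_set)]
csv_df2 = csv_df1.loc[csv_df1['time'].dt.hour.isin(hours_set)]
print(len(csv_df2))  # 1 -- only the Monday 09:00 row matches both
```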
Performance of numpy all/any vs testing a single element
When you write a == 0, numpy creates a new boolean array, compares each element in a with 0, and stores the result in that array. This allocation, initialization, and subsequent deallocation is the reason for the high cost.
Note that you don't need the explicit a == 0 in the first place. Integers that are zero always evaluate to False, nonzero integers to True. np.all(a) is equivalent to np.all(a != 0), so np.all(a == 0) is equivalent to not np.any(a).
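A quick check of those equivalences on a couple of small arrays:

```python
import numpy as np

a = np.array([0, 0, 0])   # all zeros
b = np.array([0, 3, 0])   # one nonzero element

# np.all(x) treats zero as False and nonzero as True.
assert np.all(b) == np.all(b != 0)

# "is the array all zeros?" two equivalent spellings:
assert np.all(a == 0) == (not np.any(a))   # True for a
assert np.all(b == 0) == (not np.any(b))   # False for b
```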
Why any sometimes works much faster, and sometimes much slower than max on Boolean values in python?
Because any evaluation is lazy, the any function will stop at the first True boolean element.
max, however, can't do so, because it is required to inspect every element in a sequence to be sure it hasn't missed a greater one.
That's why max will always inspect all elements, while any inspects only the elements up to and including the first True.
The cases where max works faster are probably cases with type coercion: since all values in numpy are stored in their own types and formats, mathematical operations on them may be faster than Python's any.
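Python's built-in any shows this laziness directly; a small probe (the helper name here is made up for illustration) records which elements actually get inspected:

```python
seen = []

def probe(x):
    # record every element that any() actually looks at
    seen.append(x)
    return x

# any() consumes the generator lazily and stops at the first truthy value.
any(probe(v) for v in [0, 0, 1, 0, 0])
print(seen)  # [0, 0, 1] -- elements after the first truthy one are never touched

# max() has no such option: it must see the whole sequence.
assert max([0, 0, 1, 0, 0]) == 1
```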
Why is numpy.any so slow over large arrays?
As has been guessed in the comments, I can confirm that the processing of the array is being done in chunks. First, I will show you where things are in the code and then I will show you how you can change the chunk size and the effect that doing so has on your benchmark.
Where to find the reduction processing in the Numpy source files
np.all(x) is the same as x.all(). all() really calls np.core.umath.logical_and.reduce(x).
If you want to dig into the numpy source, I will try to guide you through finding that a buffer/chunk size is used. The folder with all of the code we will be looking at is numpy/core/src/umath/.
PyUFunc_Reduce() in ufunc_object.c is the C function that handles the reduce. In PyUFunc_Reduce(), the chunk, or buffer, size is found by looking up the value for reduce in some global dictionary via the PyUFunc_GetPyValues() function (ufunc_object.c). On my machine and compiling from the development branch, the chunk size is 8192. PyUFunc_ReduceWrapper() in reduction.c is called to set up the iterator (with a stride equal to the chunk size) and it calls the passed in loop function which is reduce_loop() in ufunc_object.c.
reduce_loop() basically just uses the iterator and calls another innerloop() function for each chunk. The innerloop function is found in loops.c.src. For a boolean array and our case of all/logical_and, the appropriate innerloop function is BOOL_logical_and. You can find the right function by searching for BOOLEAN LOOPS and then it is the second function below that (it is hard to find due to the template-like programming used here). There you will find that short circuiting is in fact being done for each chunk.
How to change the buffer size used in ufunctions (and thus in any/all)
You can get the chunk/buffer size with np.getbufsize(). For me, that returns 8192 without manually setting it, which matches what I found by printing out the buffer size in the code. You can use np.setbufsize() to change the chunk size.
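A small sketch of reading and changing the buffer size. Note these helpers lived in the top-level NumPy namespace in 1.x and may have been removed from it in newer releases, hence the guard:

```python
import numpy as np

if hasattr(np, "setbufsize"):
    # setbufsize returns the previous size, so it can be restored later.
    old = np.setbufsize(8192 * 4)   # enlarge the ufunc buffer
    assert np.getbufsize() == 8192 * 4
    np.setbufsize(old)              # restore the previous value
```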
Results using a bigger buffer size
I changed your benchmark code to the following:
import timeit
import numpy as np
print 'Numpy v%s' %np.version.full_version
stmt = "np.all(x)"
for ii in xrange(9):
    setup = "import numpy as np; x = np.zeros(%d,dtype=np.bool); np.setbufsize(%d)" %(10**ii, max(8192, min(10**ii, 10**7)))
    timer = timeit.Timer(stmt,setup)
    n,r = 1,3
    t = np.min(timer.repeat(r,n))
    while t < 0.2:
        n *= 10
        t = np.min(timer.repeat(r,n))
    t /= n
    if t < 1E-3:
        timestr = "%1.3f us" %(t*1E6)
    elif t < 1:
        timestr = "%1.3f ms" %(t*1E3)
    else:
        timestr = "%1.3f s" %t
    print "Array size: 1E%i, %i loops, best of %i: %s/loop" %(ii,n,r,timestr)
Numpy doesn't like the buffer size being too small or too big, so I made sure that it didn't get smaller than 8192 or larger than 1E7, because Numpy didn't like a buffer size of 1E8. Otherwise, I was setting the buffer size to the size of the array being processed. I only went up to 1E8 because my machine only has 4GB of memory at the moment. Here are the results:
Numpy v1.8.0.dev-2a5c2c8
Array size: 1E0, 100000 loops, best of 3: 5.351 us/loop
Array size: 1E1, 100000 loops, best of 3: 5.390 us/loop
Array size: 1E2, 100000 loops, best of 3: 5.366 us/loop
Array size: 1E3, 100000 loops, best of 3: 5.360 us/loop
Array size: 1E4, 100000 loops, best of 3: 5.433 us/loop
Array size: 1E5, 100000 loops, best of 3: 5.400 us/loop
Array size: 1E6, 100000 loops, best of 3: 5.397 us/loop
Array size: 1E7, 100000 loops, best of 3: 5.381 us/loop
Array size: 1E8, 100000 loops, best of 3: 6.126 us/loop
There is a small uptick in the last timing because multiple chunks are being processed due to the limitations on how big the buffer size can be.
Test if numpy array contains only zeros
Check out numpy.count_nonzero.
>>> np.count_nonzero(np.eye(4))
4
>>> np.count_nonzero([[0,1,7,0,0],[3,0,0,2,19]])
5
Invalid value encountered in greater - numpy array logic
This can do it, with some boolean indexing to avoid the nan values and a where statement:
y[np.isfinite(y)] = np.where(y[np.isfinite(y)] > 5, np.nan, y[np.isfinite(y)])
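A worked example of that masked assignment on a small array, replacing finite values greater than 5 with nan while leaving existing nans untouched:

```python
import numpy as np

y = np.array([1.0, 7.0, np.nan, 3.0, 9.0])

# Only compare the finite entries, so no "invalid value encountered
# in greater" warning is raised for the nan.
finite = np.isfinite(y)
y[finite] = np.where(y[finite] > 5, np.nan, y[finite])

print(y)  # [ 1. nan nan  3. nan]
```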
what is the most efficient way to find the position of the first np.nan value?
I'll nominate a.argmax().
With @fuglede's test array:
In [1]: a = np.array([np.nan if i % 10000 == 9999 else 3 for i in range(100000)])
In [2]: np.isnan(a).argmax()
Out[2]: 9999
In [3]: np.argmax(a)
Out[3]: 9999
In [4]: a.argmax()
Out[4]: 9999
In [5]: timeit a.argmax()
The slowest run took 29.94 ....
10000 loops, best of 3: 20.3 µs per loop
In [6]: timeit np.isnan(a).argmax()
The slowest run took 7.82 ...
1000 loops, best of 3: 462 µs per loop
I don't have numba installed, so I can't compare that. But my speedup relative to short is greater than @fuglede's 6x.
I'm testing in Py3, which accepts <np.nan, while Py2 raises a runtime warning. But the code search suggests this isn't dependent on that comparison.
In /numpy/core/src/multiarray/calculation.c, PyArray_ArgMax plays with axes (moving the one of interest to the end), and delegates the action to arg_func = PyArray_DESCR(ap)->f->argmax, a function that depends on the dtype.
In numpy/core/src/multiarray/arraytypes.c.src it looks like BOOL_argmax short-circuits, returning as soon as it encounters a True.
for (; i < n; i++) {
    if (ip[i]) {
        *max_ind = i;
        return 0;
    }
}
And @fname@_argmax also short-circuits on a maximal nan. np.nan is 'maximal' in argmin as well.
#if @isfloat@
    if (@isnan@(mp)) {
        /* nan encountered; it's maximal */
        return 0;
    }
#endif
Comments from experienced C coders are welcomed, but it appears to me that, at least for np.nan, a plain argmax will be as fast as we can get.
Playing with the 9999 in generating a shows that the a.argmax time depends on that value, consistent with short-circuiting.
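A quick sanity check of the claim, reusing the same test array: argmax treats nan as maximal, so for an array whose only non-finite values are nan, a.argmax() lands on the first nan, matching np.isnan(a).argmax():

```python
import numpy as np

a = np.array([np.nan if i % 10000 == 9999 else 3.0 for i in range(100000)])

# Both spellings locate the first nan; a.argmax() avoids building
# the intermediate boolean array that np.isnan(a) allocates.
assert a.argmax() == 9999
assert np.isnan(a).argmax() == 9999
```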
Collapse mask array along axis - Numpy in Python
You can use ndarray.any:
all_masks = np.array([[False, False, False, False, False, False],
[False, False, False, False, False, False],
[False, False, False, False, False, False],
[False, True, False, False, True, False],
[False, False, False, False, False, False],
[False, True, False, False, True, False]])
all_masks.any(axis=0)
Output:
array([False,  True, False, False,  True, False])
Why is this version of logical AND in C not showing short-circuit behavior?
This is a trick question. b is an input argument to the sc_and method, and so will always be evaluated. In other words, sc_and(a(), b()) will call a() and call b() (order not guaranteed), then call sc_and with the results of a(), b(), which are passed to a?b:0. It has nothing to do with the ternary operator itself, which would absolutely short-circuit.
UPDATE
With regards to why I called this a 'trick question': it's because of the lack of well-defined context for where to consider 'short-circuiting' (at least as reproduced by the OP). Many people, when given just a function definition, assume that the context of the question is asking about the body of the function; they often do not consider the function as an expression in and of itself. This is the 'trick' of the question: to remind you that in programming in general, but especially in C-like languages that often have many exceptions to rules, you can't do that. For example, if the question were asked as such:
Consider the following code. Will sc_and exhibit short-circuit behavior when called from main?
#include <iostream>
using namespace std;

int sc_and(int a, int b){
    return a?b:0;
}

int a(){
    cout<<"called a!"<<endl;
    return 0;
}

int b(){
    cout<<"called b!"<<endl;
    return 1;
}

int main(int argc, char** argv){
    int x = sc_and(a(), b());
    return 0;
}
It would be immediately clear that you're supposed to be thinking of sc_and as an operator in and of itself, in your own domain-specific language, and evaluating whether the call to sc_and exhibits short-circuit behavior like a regular && would. I would not consider that to be a trick question at all, because it's clear you're not supposed to focus on the ternary operator, and are instead supposed to focus on C/C++'s function-call mechanics (and, I would guess, it would lead nicely into a follow-up question to write an sc_and that does short-circuit, which would involve using a #define rather than a function).
Whether or not you call what the ternary operator itself does short-circuiting (or something else, like 'conditional evaluation') depends on your definition of short-circuiting, and you can read the various comments for thoughts on that. By mine it does, but it's not terribly relevant to the actual question or why I called it a 'trick'.
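The same argument-evaluation rule can be reproduced in Python (an illustrative analogue, not from the original answer): arguments are evaluated eagerly before the call, while the native and operator short-circuits:

```python
calls = []

def a():
    calls.append('a')
    return 0

def b():
    calls.append('b')
    return 1

def sc_and(x, y):
    # the conditional expression itself is lazy...
    return y if x else 0

sc_and(a(), b())   # ...but both arguments are evaluated before the call
print(calls)       # ['a', 'b'] -- b() ran even though a() returned 0

calls.clear()
a() and b()        # the real short-circuit operator
print(calls)       # ['a'] -- b() never ran
```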