Why "Numpy.Any" Has No Short-Circuit Mechanism

Why does numpy.any have no short-circuit mechanism?

It's an unfixed performance regression, NumPy issue 3446. There actually is short-circuiting logic, but a change to the ufunc.reduce machinery introduced an unnecessary chunk-based outer loop around it, and that outer loop doesn't know how to short-circuit. Some explanation of the chunking machinery appears in the answer to "Why is numpy.any so slow over large arrays?" below.

The short-circuiting effects wouldn't have shown up in your test even without the regression, though. First, you're timing the array creation, and second, I don't think the short-circuit logic was ever put in for any input dtype but boolean. From the discussion, it sounds like the details of the ufunc reduction machinery behind numpy.any would have made that difficult.

The discussion does bring up the surprising point that the argmin and argmax methods appear to short-circuit for boolean input. A quick test shows that as of NumPy 1.12 (not quite the most recent version, but the version currently on Ideone), x[x.argmax()] short-circuits, and it outcompetes x.any() and x.max() for 1-dimensional boolean input no matter whether the input is small or large and no matter whether the short-circuiting pays off. Weird!
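
A minimal sketch of that idiom for 1-dimensional boolean input (the array contents here are made up for illustration):

import numpy as np

x = np.zeros(10**7, dtype=bool)
x[100] = True

# argmax on a boolean array returns the index of the first True (or 0 if
# there is none) and stops scanning as soon as it finds one, so this is a
# short-circuiting equivalent of x.any():
print(x[x.argmax()])  # True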

Short circuit numpy logical_and on pandas series

One idea is:

mask = (csv_df['time'].dt.hour.isin(hours_set) &
        csv_df['time'].dt.strftime('%a').isin(days_set))

Another idea, if most values do not match, is to filter by the first condition and then by the second:

csv_df1 = csv_df.loc[csv_df['time'].dt.strftime('%a').isin(days_set)]
csv_df2 = csv_df1.loc[csv_df1['time'].dt.hour.isin(hours_set)]
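
For reference, a runnable setup for either approach (the DataFrame, hours_set, and days_set below are hypothetical stand-ins for the question's data):

import pandas as pd

# Hypothetical sample data; the question's csv_df comes from a CSV file.
csv_df = pd.DataFrame({'time': pd.date_range('2023-01-02', periods=8, freq='7h')})
hours_set = {9, 10, 11}
days_set = {'Mon', 'Tue'}

mask = (csv_df['time'].dt.hour.isin(hours_set) &
        csv_df['time'].dt.strftime('%a').isin(days_set))
print(csv_df.loc[mask])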

Performance of numpy all/any vs testing a single element

When you write a == 0, numpy creates a new boolean array, compares each element in a with 0, and stores the result in that array. This allocation, initialization, and subsequent deallocation are the reason for the high cost.

Note that you don't need the explicit a == 0 in the first place. Integers that are zero always evaluate to False, nonzero integers to True; np.all(a) is equivalent to np.all(a != 0). So np.all(a == 0) is equivalent to not np.any(a).
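
A short sketch of the equivalence, which avoids the temporary comparison array (the array size is arbitrary):

import numpy as np

a = np.zeros(10**6, dtype=np.int64)

# Allocates a temporary boolean array before reducing:
print(np.all(a == 0))   # True

# Equivalent result without the temporary array:
print(not np.any(a))    # True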

Why does any sometimes work much faster, and sometimes much slower, than max on Boolean values in Python?

Because any is evaluated lazily, which means that the any function will stop at the first True boolean element.

max, however, can't do so, because it is required to inspect every element in a sequence to be sure it hasn't missed a greater one.

That's why max will always inspect every element, while any inspects only the elements up to and including the first True.

The cases where max works faster are probably the ones involving type coercion: NumPy stores all values in its own types and formats, so its vectorized mathematical operations can be faster than Python's any iterating over them.
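
A rough illustration, assuming the built-in any and max are iterating over a NumPy boolean array (exact timings will vary by machine):

import numpy as np
from timeit import timeit

x = np.zeros(10**6, dtype=bool)
x[10] = True

# any() can stop after 11 elements; max() must visit all 10**6.
print(timeit(lambda: any(x), number=100))
print(timeit(lambda: max(x), number=100))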

Why is numpy.any so slow over large arrays?

As has been guessed in the comments, I can confirm that the processing of the array is being done in chunks. First, I will show you where things are in the code and then I will show you how you can change the chunk size and the effect that doing so has on your benchmark.

Where to find the reduction processing in the Numpy source files

np.all(x) is the same as x.all(). all() really calls np.core.umath.logical_and.reduce(x).
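
A quick way to see that chain for yourself (np.core.umath is an internal module, so this is for exploration only):

import numpy as np

x = np.ones(10, dtype=bool)
print(np.all(x))                             # True
print(x.all())                               # same thing
print(np.core.umath.logical_and.reduce(x))   # what all() ends up calling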

If you want to dig into the numpy source, I will try to guide you through finding that a buffer/chunk size is used. The folder with all of the code we will be looking at is numpy/core/src/umath/.

PyUFunc_Reduce() in ufunc_object.c is the C function that handles the reduce. In PyUFunc_Reduce(), the chunk, or buffer, size is found by looking up the value for reduce in some global dictionary via the PyUFunc_GetPyValues() function (ufunc_object.c). On my machine, compiling from the development branch, the chunk size is 8192. PyUFunc_ReduceWrapper() in reduction.c is called to set up the iterator (with a stride equal to the chunk size), and it calls the passed-in loop function, which is reduce_loop() in ufunc_object.c.

reduce_loop() basically just uses the iterator and calls another innerloop() function for each chunk. The innerloop function is found in loops.c.src. For a boolean array and our case of all/logical_and, the appropriate innerloop function is BOOL_logical_and. You can find the right function by searching for BOOLEAN LOOPS; it is the second function below that (it is hard to find due to the template-like programming used here). There you will find that short-circuiting is in fact being done for each chunk.
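
As a rough Python model of that structure (not NumPy's actual code, just a sketch of the control flow): the inner loop short-circuits within a chunk, but the outer loop still visits every chunk.

def chunked_all(x, bufsize=8192):
    result = True
    for start in range(0, len(x), bufsize):   # outer loop: visits every chunk
        for v in x[start:start + bufsize]:    # inner loop: short-circuits
            if not v:
                result = False
                break
        # note: nothing stops the outer loop once result is False
    return result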

How to change the buffer size used in ufunctions (and thus in any/all)

You can get the chunk/buffer size with np.getbufsize(). For me, that returns 8192 without manually setting it, which matches what I found by printing out the buffer size in the code. You can use np.setbufsize() to change the chunk size.
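
For example (the multiple-of-16 and upper-bound constraints below are enforced by recent NumPy versions; treat them as an assumption for older ones):

import numpy as np

print(np.getbufsize())   # typically 8192 by default
np.setbufsize(8192 * 4)  # must be a multiple of 16; sizes around 1E8
                         # are rejected as too big, as noted below
print(np.getbufsize())   # 32768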

Results using a bigger buffer size

I changed your benchmark code to the following:

import timeit
import numpy as np
print 'Numpy v%s' %np.version.full_version
stmt = "np.all(x)"
for ii in xrange(9):
    setup = "import numpy as np; x = np.zeros(%d,dtype=np.bool); np.setbufsize(%d)" %(10**ii, max(8192, min(10**ii, 10**7)))
    timer = timeit.Timer(stmt,setup)
    n,r = 1,3
    t = np.min(timer.repeat(r,n))
    while t < 0.2:
        n *= 10
        t = np.min(timer.repeat(r,n))
    t /= n
    if t < 1E-3:
        timestr = "%1.3f us" %(t*1E6)
    elif t < 1:
        timestr = "%1.3f ms" %(t*1E3)
    else:
        timestr = "%1.3f s" %t
    print "Array size: 1E%i, %i loops, best of %i: %s/loop" %(ii,n,r,timestr)

Numpy doesn't like the buffer size being too small or too big, so I made sure that it didn't get smaller than 8192 or larger than 1E7, because Numpy didn't like a buffer size of 1E8. Otherwise, I was setting the buffer size to the size of the array being processed. I only went up to arrays of 1E8 elements because my machine only has 4GB of memory at the moment. Here are the results:

Numpy v1.8.0.dev-2a5c2c8
Array size: 1E0, 100000 loops, best of 3: 5.351 us/loop
Array size: 1E1, 100000 loops, best of 3: 5.390 us/loop
Array size: 1E2, 100000 loops, best of 3: 5.366 us/loop
Array size: 1E3, 100000 loops, best of 3: 5.360 us/loop
Array size: 1E4, 100000 loops, best of 3: 5.433 us/loop
Array size: 1E5, 100000 loops, best of 3: 5.400 us/loop
Array size: 1E6, 100000 loops, best of 3: 5.397 us/loop
Array size: 1E7, 100000 loops, best of 3: 5.381 us/loop
Array size: 1E8, 100000 loops, best of 3: 6.126 us/loop

There is a small uptick in the last timing because there are multiple chunks being processed due to the limitations on how big the buffer size can be.

Test if numpy array contains only zeros

Check out numpy.count_nonzero.

>>> np.count_nonzero(np.eye(4))
4
>>> np.count_nonzero([[0,1,7,0,0],[3,0,0,2,19]])
5
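
To answer the question directly: an array contains only zeros exactly when it has no nonzero entries.

>>> a = np.zeros((3, 4))
>>> np.count_nonzero(a) == 0
True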

Invalid value encountered in greater - numpy array logic

This can do it, using some boolean indexing to avoid the NaNs together with a where statement:

y[np.isfinite(y)] = np.where(y[np.isfinite(y)] > 5, np.nan, y[np.isfinite(y)])
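
The same idea reads a little better if the finite mask is computed once (the sample data here is made up):

import numpy as np

y = np.array([1.0, 7.2, np.nan, 3.0, 9.5])

finite = np.isfinite(y)
y[finite] = np.where(y[finite] > 5, np.nan, y[finite])
print(y)  # [ 1. nan nan  3. nan]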

What is the most efficient way to find the position of the first np.nan value?

I'll nominate

a.argmax()

With @fuglede's test array:

In [1]: a = np.array([np.nan if i % 10000 == 9999 else 3 for i in range(100000)])
In [2]: np.isnan(a).argmax()
Out[2]: 9999
In [3]: np.argmax(a)
Out[3]: 9999
In [4]: a.argmax()
Out[4]: 9999

In [5]: timeit a.argmax()
The slowest run took 29.94 ....
10000 loops, best of 3: 20.3 µs per loop

In [6]: timeit np.isnan(a).argmax()
The slowest run took 7.82 ...
1000 loops, best of 3: 462 µs per loop

I don't have numba installed, so I can't compare that. But my speedup relative to short is greater than @fuglede's 6x.

I'm testing in Py3, which accepts comparisons like < np.nan, while Py2 raises a runtime warning. But the code search suggests this isn't dependent on that comparison.

In numpy/core/src/multiarray/calculation.c, PyArray_ArgMax plays with axes (moving the one of interest to the end) and delegates the action to arg_func = PyArray_DESCR(ap)->f->argmax, a function that depends on the dtype.

In numpy/core/src/multiarray/arraytypes.c.src it looks like BOOL_argmax short-circuits, returning as soon as it encounters a True.

for (; i < n; i++) {
    if (ip[i]) {
        *max_ind = i;
        return 0;
    }
}

And @fname@_argmax also short-circuits on a maximal nan; np.nan is 'maximal' in argmin as well.

#if @isfloat@
    if (@isnan@(mp)) {
        /* nan encountered; it's maximal */
        return 0;
    }
#endif

Comments from experienced C coders are welcome, but it appears to me that, at least for np.nan, a plain argmax is as fast as we can get.

Playing with the 9999 used when generating a shows that the a.argmax time depends on that value, consistent with short-circuiting.
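
A quick sketch of that check (the positions and sizes are arbitrary):

import numpy as np
from timeit import timeit

for pos in (10, 10**4, 10**5 - 1):
    a = np.full(10**5, 3.0)
    a[pos] = np.nan
    # the time should grow with the position of the first nan
    print(pos, timeit(lambda: a.argmax(), number=1000))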

Collapse mask array along axis - Numpy in Python

You can use ndarray.any:

import numpy as np

all_masks = np.array([[False, False, False, False, False, False],
                      [False, False, False, False, False, False],
                      [False, False, False, False, False, False],
                      [False, True,  False, False, True,  False],
                      [False, False, False, False, False, False],
                      [False, True,  False, False, True,  False]])

all_masks.any(axis=0)

Output:

array([False,  True, False, False,  True, False])
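
Similarly, axis=1 would collapse each row instead of each column:

all_masks.any(axis=1)
# array([False, False, False,  True, False,  True])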

Why is this version of logical AND in C not showing short-circuit behavior?

This is a trick question. b is an input argument to the sc_and function, and so it will always be evaluated. In other words, sc_and(a(), b()) will call a() and call b() (order not guaranteed), then call sc_and with the results of a() and b(), which it passes to a?b:0. It has nothing to do with the ternary operator itself, which absolutely does short-circuit.

UPDATE

With regards to why I called this a 'trick question': it's because of the lack of a well-defined context for where to consider 'short-circuiting' (at least as reproduced by the OP). Many people, when given just a function definition, assume that the context of the question is the body of the function; they often do not consider the function as an expression in and of itself. This is the 'trick' of the question: to remind you that in programming in general, but especially in C-like languages that often have many exceptions to rules, you can't do that. For example, if the question were asked like this:

Consider the following code. Will sc_and exhibit short-circuit behavior when called from main?

#include <iostream>
using namespace std;

int sc_and(int a, int b) {
    return a ? b : 0;
}

int a() {
    cout << "called a!" << endl;
    return 0;
}

int b() {
    cout << "called b!" << endl;
    return 1;
}

int main(int argc, char** argv) {
    int x = sc_and(a(), b());  // prints both "called a!" and "called b!"
    return 0;
}

It would be immediately clear that you're supposed to think of sc_and as an operator in and of itself, in your own domain-specific language, and to evaluate whether the call to sc_and exhibits short-circuit behavior like a regular && would. I would not consider that a trick question at all, because it's clear you're not supposed to focus on the ternary operator, and are instead supposed to focus on C/C++'s function-call mechanics (and, I would guess, it leads nicely into a follow-up question to write an sc_and that does short-circuit, which would involve using a #define rather than a function).
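
A minimal sketch of that follow-up (the macro name SC_AND is hypothetical): because the macro expands textually at the call site, the conditional operator only evaluates its middle operand when the condition is nonzero, so b() is never called here.

#include <stdio.h>

/* Hypothetical macro version: (b) is only evaluated when (a) is nonzero. */
#define SC_AND(a, b) ((a) ? (b) : 0)

int a(void) { puts("called a!"); return 0; }
int b(void) { puts("called b!"); return 1; }

int main(void) {
    int x = SC_AND(a(), b());  /* prints only "called a!" */
    return x;
}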

Whether or not you call what the ternary operator itself does short-circuiting (or something else, like 'conditional evaluation') depends on your definition of short-circuiting, and you can read the various comments for thoughts on that. By mine it does, but it's not terribly relevant to the actual question or why I called it a 'trick'.


