Faster Way to Find the First True Value in a Vector

Faster way to find the first TRUE value in a vector

Base R provides Position and Find for locating the first index and the first value, respectively, for which a predicate returns TRUE. These higher-order functions return immediately upon the first hit.
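For example, a quick illustration of that early-exit semantics (not part of the original post):

Position(function(v) v > 3, c(1, 5, 2, 9))  # 2 -- index of the first element > 3
Find(function(v) v > 3, c(1, 5, 2, 9))      # 5 -- value of the first element > 3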

f <- function(x) {
  r <- vector("list", 3)
  r[[1]] <- which(x == 1)[1]  # index of the first element equal to 1
  r[[2]] <- which(x > 1)[1]   # index of the first element greater than 1
  r[[3]] <- x[x > 10][1]      # first value greater than 10
  return(r)
}

p <- function(f, b) function(a) f(a, b)  # partial application: fix the second argument

g <- function(x) {
  r <- vector("list", 3)
  r[[1]] <- Position(p(`==`, 1), x)  # index of the first element equal to 1
  r[[2]] <- Position(p(`>`, 1), x)   # index of the first element greater than 1
  r[[3]] <- Find(p(`>`, 10), x)      # first value greater than 10
  return(r)
}

The relative performance depends greatly on the probability of an early hit, the cost of the predicate, and the per-element overhead of Position/Find.

library(microbenchmark)
set.seed(1)
x <- sample(1:100, 1e5, replace = TRUE)
microbenchmark(f(x), g(x))

Unit: microseconds
expr min lq mean median uq max neval cld
f(x) 5034.283 5410.1205 6313.861 5798.4780 6948.5675 26735.52 100 b
g(x) 587.463 650.4795 1013.183 734.6375 950.9845 20285.33 100 a

y <- rep(0, 1e5)
microbenchmark(f(y), g(y))

Unit: milliseconds
expr min lq mean median uq max neval cld
f(y) 3.470179 3.604831 3.791592 3.718752 3.866952 4.831073 100 a
g(y) 131.250981 133.687454 137.199230 134.846369 136.193307 177.082128 100 b

Find position of first value greater than X in a vector

# Randomly generate a suitable vector
set.seed(0)
v <- sample(50:150, size = 50, replace = TRUE)

min(which(v > 100))
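One caveat worth adding (not from the original answer): if no element exceeds the threshold, min(which(...)) warns and returns Inf, whereas indexing the result of which() yields NA:

min(which(v > 1000))   # Inf, plus a warning (min() over an empty set)
which(v > 1000)[1]     # NA, silently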

Extract the first TRUE per row in logical matrix efficiently in R

We can use max.col

colnames(m)[max.col(m, "first")]
#[1] "A" "B" "C" "B" "B" "A"

If a row contains no TRUE, we can change the result to NA (if needed). Here NA^!rowSums(m) evaluates to 1 for rows with at least one TRUE (because NA^0 is 1) and to NA otherwise, so multiplying the column indices by it turns those rows into NA:

colnames(m)[max.col(m, "first") * NA^!rowSums(m)]

Or with ifelse

colnames(m)[ifelse(rowSums(m) == 0, NA, max.col(m, "first"))]

How do these four methods for finding the first element of a vector that matches a predicate compare?

Despite your request, I suspect a lot of this is opinion-based; nonetheless, here is my take.

It really depends on the use case. Even your question might hide ambiguity; here I see two different tasks:

  1. find (the position of) the first element of a list that obeys a condition
  2. find (the position of) the first TRUE value inside a logical vector.

Of course, the second task also solves the first, but at the cost of applying the predicate to every element of the list.

A lot has to do with vectorization in R: for many tasks, even if it is "algorithmically" wasteful, it is more convenient to evaluate the predicate for every element of the list and then find the position of the first TRUE. That is certainly the case in your examples: given how isFour and the others are defined, match is clearly preferable. We don't mind checking every element of data, since the operation is very fast, even though one could stop before the end to get the answer. If we "devectorize" your function, on average we slow things down a lot, since subsetting and checking a single element at a time is far slower. Note that here I'm using "list" not in the R sense (a list object), but simply as a collection of values.
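To make the devectorization point concrete, here is a toy sketch (isFour mirrors the question's predicate; the surrounding names are mine):

isFour <- function(v) v == 4

# vectorized: apply the predicate to everything, then locate the first hit
first_vectorized <- function(data) match(TRUE, isFour(data))

# devectorized: check one element at a time in an R-level loop, stopping early
first_devectorized <- function(data) {
  for (i in seq_along(data)) if (isFour(data[i])) return(i)
  NA_integer_
}

For a cheap predicate like this, first_vectorized typically wins by a wide margin despite doing the "unnecessary" extra work.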

On the other hand, Position is intended for cases where your data is a list and/or the function f is not vectorized and is very expensive. Imagine, for instance, that f consists of training a machine learning model, evaluating it against some data, and collecting performance statistics. We have a list of algorithms we want to train, and we want to stop as soon as one reaches a satisfactory performance. In this case, we don't want to train every possible model in the list (super expensive), but to stop as soon as we can. So we use Position (see also its source code to understand why).
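A toy sketch of that scenario (Sys.sleep stands in for the expensive training step; all names here are illustrative):

good_enough <- function(candidate) {
  Sys.sleep(0.1)   # pretend this trains and evaluates a model
  candidate == 2   # pretend candidate 2 reaches the target performance
}

Position(good_enough, 1:10)  # evaluates only two candidates, then stops -> 2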

Regarding the two tasks I outlined at the beginning: all your solutions address the second task, while only Position solves exactly the first.

So, in general:

  • if the predicate is vectorized and efficient, go with match;
  • if the predicate is very expensive, go with Position.

I don't see much reason to use the other two solutions in any case: which and which.max are meant mainly for other tasks, not for determining the position of a TRUE value (even though they can, of course).

To outline the differences between the match solution and the Position one more clearly, here is a recap.

  • For match, the isFour function is applied to every element of the input, and only afterwards does match act on the result. Of course, for the specific task of the example a better way is match(4, data), since match stops internally as soon as a 4 is found. But the important point is that in your implementation isFour is applied to the whole input.
  • For Position, you instead pass the function itself, not the result of applying it. The function is then applied element by element, and as soon as the condition is met it exits, without necessarily processing the whole input. (See the sketch below.)
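A minimal side-by-side sketch of the two (isFour as in the question; data is a made-up vector):

isFour <- function(v) v == 4
data <- c(7, 3, 4, 9)

match(TRUE, isFour(data))  # isFour evaluated on the whole vector first -> 3
Position(isFour, data)     # isFour evaluated element by element, stops at index 3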

Now it should be clear which to prefer: it depends on the cost of "devectorizing" the function weighed against the gain of not processing the whole input.

R: Fastest way to obtain the first and last location of each unique value in a vector?

And a data.table version could be something like

library(data.table)
DT <- data.table(vals)
DT[, .(first = min(.I), last = max(.I)), by = vals]

Or dplyr, where it could be done with

library(dplyr)
tibble(vals) %>%
  mutate(row = row_number()) %>%
  group_by(vals) %>%
  summarise(first = min(row), last = max(row))
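Neither snippet shows vals, which comes from the original question; with a hypothetical stand-in, rerunning the data.table version above gives:

vals <- c("a", "b", "a", "c", "b")  # made-up data
DT <- data.table(vals)
DT[, .(first = min(.I), last = max(.I)), by = vals]
#    vals first last
# 1:    a     1     3
# 2:    b     2     5
# 3:    c     4     4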

The timings are quite similar to what @akrun gets with the elegant base R split call, though, so not a lot is gained there.

How to obtain only the first True value from each row of a numpy array?

You can use cumsum and find the first True value by comparing the result with 1.

all_bools.cumsum(axis=1).cumsum(axis=1) == 1
array([[False,  True, False],
       [ True, False, False],
       [False, False,  True],
       [False, False, False]])

This also accounts for the issue @a_guest pointed out. The second cumsum call is needed to avoid matching all False values between the first and second True value.


If performance is important, use argmax and set values:

y = np.zeros_like(all_bools, dtype=bool)
idx = np.arange(len(all_bools)), all_bools.argmax(axis=1)
y[idx] = all_bools[idx]

y
array([[False,  True, False],
       [ True, False, False],
       [False, False,  True],
       [False, False, False]])

Perfplot Performance Timings
I'll take this opportunity to show off perfplot, with some timings, since it is good to see how our solutions vary with different sized inputs.

import numpy as np
import perfplot

def cs1(x):
    return x.cumsum(axis=1).cumsum(axis=1) == 1

def cs2(x):
    y = np.zeros_like(x, dtype=bool)
    idx = np.arange(len(x)), x.argmax(axis=1)
    y[idx] = x[idx]
    return y

def a_guest(x):
    b = np.zeros_like(x, dtype=bool)
    i = np.argmax(x, axis=1)
    b[np.arange(i.size), i] = np.logical_or.reduce(x, axis=1)
    return b

perfplot.show(
    setup=lambda n: np.random.randint(0, 2, size=(n, n)).astype(bool),
    kernels=[cs1, cs2, a_guest],
    labels=['cs1', 'cs2', 'a_guest'],
    n_range=[2**k for k in range(1, 8)],
    xlabel='N'
)

[Figure: perfplot timings of cs1, cs2, and a_guest as N grows]

The trend carries forward to larger N. cumsum is very expensive, while there is a roughly constant time difference between my second solution and @a_guest's.

Fastest way to keep only first occurence of true; set rest to false

This tweak of your vec_repl() gives a small speedup for larger examples:

vec_repl2 <- function(x) {
  tmp <- logical(length(x))    # all FALSE to start
  tmp[match(TRUE, x)] <- TRUE  # match() stops scanning at the first TRUE
  tmp
}

For example:

bigExample <- c(logical(10000), TRUE, logical(10000))
microbenchmark(vec_repl(bigExample), vec_repl2(bigExample))
Unit: microseconds
expr min lq mean median uq max neval
vec_repl(bigExample) 34.204 47.428 157.2569 95.383 102.7885 6130.591 100
vec_repl2(bigExample) 18.336 28.386 116.0537 78.282 85.6865 5439.463 100

Beyond that, you could perhaps look into Rcpp.

On Edit: Here is an Rcpp experiment:

library(Rcpp)
cppFunction('LogicalVector vec_repl3(LogicalVector x) {
  int n = x.size();
  LogicalVector v(n);
  for (int i = 0; i < n; i++) {
    if (x[i]) {
      v[i] = TRUE;
      return v;  // stop scanning at the first TRUE
    }
  }
  return v;  // if you get here -- x had no TRUE to begin with
}')

Comparison:

microbenchmark(vec_repl(bigExample), vec_repl2(bigExample), vec_repl3(bigExample))
Unit: microseconds
expr min lq mean median uq max neval
vec_repl(bigExample) 69.113 70.8765 323.53679 76.166 167.3170 5882.35 100
vec_repl2(bigExample) 33.499 36.6725 136.80877 38.084 135.4055 6405.28 100
vec_repl3(bigExample) 31.031 33.3230 69.85751 35.263 80.3975 1836.78 100

As you can see, Rcpp provides a speed boost (in this case) but given that the resulting code would be harder to distribute, it might not be worth it. To really get a good feel for it, the benchmarking should probably involve a wider range of vector sizes as well as distributions of TRUE in those vectors.
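For example, one way to widen the benchmark along those lines (the sizes and the placement of TRUE are arbitrary choices of mine):

for (n in c(1e3, 1e5, 1e7)) {
  v <- c(logical(n / 2), TRUE, logical(n / 2))  # a single TRUE in the middle
  print(microbenchmark(vec_repl2(v), vec_repl3(v), times = 20))
}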

Find the *first* longest sequence of TRUE in a boolean vector

One option could be:

with(rle(bool), rep(seq_along(values) == which.max(lengths * values), lengths))

Here rle() compresses bool into runs; lengths * values zeroes out the lengths of the FALSE runs, which.max() picks the first run with the maximal TRUE length, and rep() expands that run-level comparison back to the original element positions.

Results for the first vector:

[1] FALSE  TRUE FALSE FALSE

For the second:

[1] FALSE FALSE FALSE  TRUE  TRUE

For the third:

[1] FALSE  TRUE  TRUE FALSE FALSE FALSE
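To see the mechanics on a concrete input (a made-up vector, not one from the original question):

bool <- c(TRUE, FALSE, FALSE, TRUE, TRUE)
r <- rle(bool)
r$lengths                         # 1 2 2
r$values                          # TRUE FALSE TRUE
r$lengths * r$values              # 1 0 2 -- FALSE runs zeroed out
which.max(r$lengths * r$values)   # 3 -- the first longest run of TRUE
rep(seq_along(r$values) == 3, r$lengths)
# [1] FALSE FALSE FALSE  TRUE  TRUE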

Get index of first true in vector

I'm not aware of a built-in function that does exactly what you want, but I have some ideas on how you could do it. There might be a simpler solution, but I've only had one cup of coffee so far. The idea is to leverage the "count leading zeros" function "clz". You just need to convert the results of your conditional into bit positions in an integer.

  1. Create a boolean vector with true/false state set by the comparison
  2. Do a dot product of that against an integer vector with pre-defined values that correspond to bit positions.
  3. The first bit set will correspond to the index you're asking for. Use clz() or a bithack to find that bit index.

In code, something like this (untested and might need adjusting):

float4 f = (float4)(1, 2, 3, 4);
int4 greater = (f > 2);
int4 bits = (int4)(8, 4, 2, 1);
int sum = dot(greater, bits); // maybe this needs to use float
int index = clz(sum); // might need offset applied

You'll need to offset or invert the result from clz to get 0,1,2,3 but that's just addition or subtraction.

Working Code

int firstTrue(int4 v) {
    // clz(0) == 32 for a 32-bit int; the masks pack lanes x..w into bits 3..0,
    // so the first true lane maps to 0, 1, 2, or 3 (4 means no lane was true)
    return 4 - (clz(0) - clz((v.x & 8) | (v.y & 4) | (v.z & 2) | (v.w & 1)));
}

