findInterval() with Right-Closed Intervals

findInterval() with right-closed intervals

EDIT: Major clean-up in all aisles.

You might look at cut. By default, cut makes left-open, right-closed intervals, and that can be changed with the right argument. To use your example:

x <- c(3, 6, 7, 7, 29, 37, 52)
vec <- c(2, 5, 6, 35)
cutVec <- c(vec, max(x)) # for cut, range of vec should cover all of x
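
To see what the right argument changes, here is a small sketch on the same numbers. The names x_demo and breaks_demo are made up for this illustration so that nothing defined above is overwritten; labels = FALSE just returns the bin numbers.

```r
# Hypothetical names, used only for this illustration
x_demo      <- c(3, 6, 7, 7, 29, 37, 52)
breaks_demo <- c(2, 5, 6, 35, 52)

# Default (right = TRUE): bins are (2,5], (5,6], (6,35], (35,52]
right_closed <- cut(x_demo, breaks_demo, labels = FALSE)

# right = FALSE: bins are [2,5), [5,6), [6,35), [35,52)
left_closed <- cut(x_demo, breaks_demo, right = FALSE, labels = FALSE)

right_closed  # 1 2 3 3 3 4 4
left_closed   # 1 3 3 3 3 4 NA  (52 falls outside [35,52))
```

Note how 6 moves from bin 2 to bin 3 once the intervals are closed on the left instead.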

Now create four functions that should do the same thing: Two from the OP, one from Josh O'Brien, and then cut. Two arguments to cut have been changed from default settings: include.lowest = TRUE will create an interval closed on both sides for the smallest (leftmost) interval. labels = FALSE will cause cut to return simply the integer values for the bins instead of creating a factor, which it otherwise does.

# OP's first version: shift down by one where x sits exactly on a break
findInterval.rightClosed <- function(x, vec, ...) {
  fi <- findInterval(x, vec, ...)
  fi - (x == vec[fi])
}
# OP's second version: mirror the problem by negating and reversing
findInterval.rightClosed2 <- function(x, vec, ...) {
  length(vec) - findInterval(-x, -rev(vec), ...)
}
cutFun <- function(x, vec) {
  cut(x, vec, include.lowest = TRUE, labels = FALSE)
}
# The body of fiFun is a contribution by Josh O'Brien that got fed to the ether.
fiFun <- function(x, vec) {
  findInterval(x, vec * (1 + .Machine$double.eps))
}

Do all functions return the same result? Yup. (notice the use of cutVec for cutFun)

mapply(identical, list(findInterval.rightClosed(x, vec)),
list(findInterval.rightClosed2(x, vec), cutFun(x, cutVec), fiFun(x, vec)))
# [1] TRUE TRUE TRUE
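
As an aside, findInterval() itself has a left.open argument in newer versions of R (3.3.0 and later, as far as I know), which should give right-closed intervals without any wrapper. A minimal sketch, assuming such an R version; x_demo and vec_demo are hypothetical names that copy the example values:

```r
# Hypothetical names; same values as the example above
x_demo   <- c(3, 6, 7, 7, 29, 37, 52)
vec_demo <- c(2, 5, 6, 35)

# The OP's first function, repeated so this block runs on its own
findInterval.rightClosed <- function(x, vec, ...) {
  fi <- findInterval(x, vec, ...)
  fi - (x == vec[fi])
}

# left.open = TRUE makes the intervals (vec[i], vec[i+1]]
built_in <- findInterval(x_demo, vec_demo, left.open = TRUE)

built_in                                                         # 1 2 3 3 3 4 4
identical(built_in, findInterval.rightClosed(x_demo, vec_demo))  # TRUE
```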

Now a more demanding vector to bin:

x <- rpois(2e6, 10)
vec <- c(-Inf, quantile(x, seq(.2, 1, .2)))

Test whether identical (note use of unname)

mapply(identical, list(unname(findInterval.rightClosed(x, vec))),
list(findInterval.rightClosed2(x, vec), cutFun(x, vec), fiFun(x, vec)))
# [1] TRUE TRUE TRUE

And benchmark:

library(microbenchmark)
microbenchmark(findInterval.rightClosed(x, vec), findInterval.rightClosed2(x, vec),
cutFun(x, vec), fiFun(x, vec), times = 50)
# Unit: milliseconds
#                                expr       min        lq    median        uq       max
# 1                    cutFun(x, vec)  35.46261  35.63435  35.81233  36.68036  53.52078
# 2                     fiFun(x, vec)  51.30158  51.69391  52.24277  53.69253  67.09433
# 3  findInterval.rightClosed(x, vec) 124.57110 133.99315 142.06567 155.68592 176.43291
# 4 findInterval.rightClosed2(x, vec)  79.81685  82.01025  86.20182  95.65368 108.51624

From this run, cut seems to be the fastest.

Find which interval row in a data frame that each element of a vector belongs in

Here's a possible solution using the new "non-equi" joins in data.table (v>=1.9.8). While I doubt you'll like the syntax, it should be a very efficient solution.

Also, regarding findInterval: this function assumes that your intervals are contiguous, which isn't the case here, so I doubt there is a straightforward solution using it.

library(data.table) #v1.10.0
setDT(intervals)[data.table(elements), on = .(start <= elements, end >= elements)]
# phase start end
# 1: a 0.1 0.1
# 2: a 0.2 0.2
# 3: a 0.5 0.5
# 4: NA 0.9 0.9
# 5: b 1.1 1.1
# 6: b 1.9 1.9
# 7: c 2.1 2.1

Regarding the above code, I find it pretty self-explanatory: join intervals and elements by the condition specified in the on argument. That's pretty much it.

There is one caveat, though: start, end and elements should all be of the same type, so if one of them is integer, it should be converted to numeric first.
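
If data.table is not an option, the same lookup can be sketched in base R. This is not the join above, just a slower illustration of the same condition. The data are hypothetical, invented only to resemble the output shown, and lookup_phase is a made-up helper name.

```r
# Hypothetical data: three non-contiguous intervals and some values to place
intervals_demo <- data.frame(
  phase = c("a", "b", "c"),
  start = c(0, 1, 2),
  end   = c(0.5, 1.9, 2.5),
  stringsAsFactors = FALSE
)
elements_demo <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)

# For each element, return the phase of the first interval with
# start <= element <= end, or NA if it falls into a gap
lookup_phase <- function(x, iv) {
  vapply(x, function(e) {
    hit <- which(iv$start <= e & e <= iv$end)
    if (length(hit) > 0) iv$phase[hit[1]] else NA_character_
  }, character(1))
}

lookup_phase(elements_demo, intervals_demo)
# "a" "a" "a" NA "b" "b" "c"
```

This loops over the elements in R, so expect it to be much slower than the join on large inputs.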

R: Is cut the right function to do this?

cut should be the correct function, but you're doing things wrong.

First, there are typos in your code. labels = c(...) would be the correct version.

Second, think about what you're doing: creating intervals. How many? Try cut without the labels to see:

cut(1.2, breaks=c(0.6,0.8,1.0,1.2,1.4))
# [1] (1,1.2]
# Levels: (0.6,0.8] (0.8,1] (1,1.2] (1.2,1.4]

There are only 4 levels created the way you're doing it, so you only need to provide 4 labels (or redefine your break points).
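
A minimal sketch of the matching-lengths rule, with label names that are purely hypothetical (the question's own labels are not reproduced here):

```r
breaks_demo <- c(0.6, 0.8, 1.0, 1.2, 1.4)      # 5 break points
labs_demo   <- c("low", "mid", "high", "top")  # 4 intervals, so 4 labels

# One label per interval: length(labels) must be length(breaks) - 1
length(labs_demo) == length(breaks_demo) - 1   # TRUE

binned <- cut(c(0.7, 1.2, 1.3), breaks = breaks_demo, labels = labs_demo)
as.character(binned)                           # "low" "high" "top"
```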

Extracting breakpoints with intervals closed on the left

Just use something like the following for your pattern, and use gsub instead: "\\[|\\]|\\(|\\)".

An example.

out <- levels(cut(aaa, 3, dig.lab = 4, ordered = TRUE, right = FALSE))
gsub("\\[|\\]|\\(|\\)", "", out)
# [1] "0.994,2.998" "2.998,5.002" "5.002,7.006"

And, here's a quick way to read that data in:

read.csv(text = gsub("\\[|\\]|\\(|\\)", "", out), header = FALSE)
# V1 V2
# 1 0.994 2.998
# 2 2.998 5.002
# 3 5.002 7.006

FYI: The same pattern would work whether the intervals are closed on the left or on the right. Using your original example:

labs <- levels(cut(aaa, 3))
labs
# [1] "(0.994,3]" "(3,5]" "(5,7.01]"
read.csv(text = gsub("\\[|\\]|\\(|\\)", "", labs), header = FALSE)
# V1 V2
# 1 0.994 3.00
# 2 3.000 5.00
# 3 5.000 7.01

As for alternatives, since you just need to strip out the first and last character before you can use read.csv, you can also easily use substr without having to fuss with regular expressions (if that's not your thing):

substr(labs, 2, nchar(labs)-1)
# [1] "0.994,3" "3,5" "5,7.01"
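
If the alternation in the gsub pattern feels noisy, a bracket expression should match the same four characters. A small sketch on hypothetical labels (labs_demo is made up, mixing both closure styles):

```r
# Hypothetical labels in both styles: right-closed and left-closed
labs_demo <- c("(0.994,3]", "[2.998,5.002)")

# "[][()]" is a bracket expression: a "]" placed first is taken literally,
# followed by "[", "(" and ")"
stripped <- gsub("[][()]", "", labs_demo)
stripped   # "0.994,3" "2.998,5.002"
```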

Update: A totally different alternative

Since it is obvious that R has to calculate these values and store them as part of the function in order to generate the output you see, it is not too difficult to manipulate the function to get it to output different things.

Looking at the code for cut.default, you'll find the following as the last few lines:

if (codes.only) 
code
else factor(code, seq_along(labels), labels, ordered = ordered_result)

It's really easy to change the last few lines to output a list that contains the output of cut as the first item and the calculated ranges as the second (taken from the cut function directly, not extracted from the pasted-together factor labels).

For instance, in the Gist I've posted at this link, I've changed those lines as follows:

if (codes.only) 
FIN <- code
else FIN <- factor(code, seq_along(labels), labels, ordered = ordered_result)
list(output = FIN, ranges = data.frame(lower = ch.br[-nb], upper = ch.br[-1L]))

Now, compare:

cut(aaa, 3)
# [1] (0.994,3] (0.994,3] (3,5] (3,5] (3,5] (0.994,3] (3,5] (3,5]
# [9] (3,5] (5,7.01] (5,7.01]
# Levels: (0.994,3] (3,5] (5,7.01]
CUT(aaa, 3)
# $output
# [1] (0.994,3] (0.994,3] (3,5] (3,5] (3,5] (0.994,3] (3,5] (3,5]
# [9] (3,5] (5,7.01] (5,7.01]
# Levels: (0.994,3] (3,5] (5,7.01]
#
# $ranges
# lower upper
# 1 0.994 3
# 2 3 5
# 3 5 7.01

And, right = FALSE:

cut(aaa, 3, dig.lab = 4, ordered = TRUE, right = FALSE)
# [1] [0.994,2.998) [0.994,2.998) [2.998,5.002) [2.998,5.002) [2.998,5.002)
# [6] [0.994,2.998) [2.998,5.002) [2.998,5.002) [2.998,5.002) [5.002,7.006)
# [11] [5.002,7.006)
# Levels: [0.994,2.998) < [2.998,5.002) < [5.002,7.006)
CUT(aaa, 3, dig.lab = 4, ordered = TRUE, right = FALSE)
# $output
# [1] [0.994,2.998) [0.994,2.998) [2.998,5.002) [2.998,5.002) [2.998,5.002)
# [6] [0.994,2.998) [2.998,5.002) [2.998,5.002) [2.998,5.002) [5.002,7.006)
# [11] [5.002,7.006)
# Levels: [0.994,2.998) < [2.998,5.002) < [5.002,7.006)

# $ranges
# lower upper
# 1 0.994 2.998
# 2 2.998 5.002
# 3 5.002 7.006

How to efficiently check whether a vector of numbers is in an interval defined by a data frame

Function findInterval has good performance and can get the job done.

See first how it works with just the first element of n1:

i <- findInterval(n1[1], c(df.int$Upper.Limit, Inf))
j <- findInterval(n1[1], c(-Inf, df.int$Upper.Limit))

df.int$Upper.Limit[i]
#[1] 189000
n1[1]
#[1] 189041
df.int$Upper.Limit[j]
#[1] 189200

df.int$Upper.Limit[i] < n1[1] & n1[1] <= df.int$Upper.Limit[j]
#[1] TRUE

Now a general purpose solution.

subsInterval <- function(x, DF, colLimits = 1, colValues = 2, lower = TRUE){
vec <- if(lower)
c(DF[[colLimits]], Inf)
else
c(-Inf, DF[[colLimits]])
i <- findInterval(x, vec, left.open = TRUE)
DF[[colValues]][i]
}

system.time(
n2 <- subsInterval(n1, df.int)
)
# user system elapsed
# 0.000 0.000 0.001

system.time(
n3 <- subsInterval(n1, df.int, lower = FALSE)
)
# user system elapsed
# 0 0 0
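
Since df.int and n1 come from the question, here is a small self-contained sketch of what the lower switch does. The data frame df_demo, its Value column and the vector n_demo are hypothetical; the function is repeated so the block runs on its own.

```r
subsInterval <- function(x, DF, colLimits = 1, colValues = 2, lower = TRUE){
  vec <- if(lower)
    c(DF[[colLimits]], Inf)
  else
    c(-Inf, DF[[colLimits]])
  i <- findInterval(x, vec, left.open = TRUE)
  DF[[colValues]][i]
}

# Hypothetical limits and the values attached to them
df_demo <- data.frame(Upper.Limit = c(100, 200, 300),
                      Value = c("A", "B", "C"),
                      stringsAsFactors = FALSE)
n_demo <- c(150, 200, 250)

# lower = TRUE: value of the largest limit strictly below x
as_lower <- subsInterval(n_demo, df_demo)
# lower = FALSE: value of the smallest limit at or above x
as_upper <- subsInterval(n_demo, df_demo, lower = FALSE)

as_lower  # "A" "A" "B"
as_upper  # "B" "B" "C"
```

One thing to watch: with lower = TRUE, any x at or below the smallest limit gets index 0, and indexing with 0 drops that element, so the result can come back shorter than x.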

How to find which interval/range a variable falls under in R

Try using the code below and see whether it will give you the expected results:

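dataframe <- data.frame(Col1 = seq(0, 24, by = 4), x = rnorm(7), y = rnorm(7, 50))
# Return the row whose Col1 is at or below x, plus the row after it
funfun <- function(x) { v <- findInterval(x, dataframe$Col1); c(v, v + 1) }
# x and y are random, so your values will differ from the ones shown
dataframe[funfun(2), ]
#   Col1        x        y
# 1    0 0.831266 50.28246
# 2    4 1.751892 48.78810
dataframe[funfun(10), ]
#   Col1          x        y
# 3    8  0.2624929 48.33945
# 4   12 -0.2243066 51.11304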
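
One edge case to keep in mind: at or beyond the last limit, v + 1 points past the end of the data frame and produces a row of NAs. A hedged sketch of a guarded variant; bracket_rows and limits_demo are made-up names, not part of the code above:

```r
limits_demo <- seq(0, 24, by = 4)   # same limits as Col1 above

# Return the indices of the limits that bracket x, dropping any index
# that falls outside 1..length(limits)
bracket_rows <- function(x, limits) {
  v <- findInterval(x, limits)
  rows <- c(v, v + 1)
  rows[rows >= 1 & rows <= length(limits)]
}

bracket_rows(2, limits_demo)    # 1 2
bracket_rows(10, limits_demo)   # 3 4
bracket_rows(30, limits_demo)   # 7  (no upper neighbour exists)
```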

If this helps, please let us know. Thank you.

Finding a value in an interval

If you plan to re-classify new values, it's best to explicitly set the breaks= parameter with a vector rather than a number of bins. Note that had those values been in the set originally, you might have had different breaks, and your new values may fall outside all the levels of your existing data, which can be troublesome.

So first, I will generate some sample data.

set.seed(18)
x <- runif(50)

Now I will show two different ways to calculate breaks. Here are b1() and b2():

b1<-function(x, n=nclass.Sturges(x)) {
#like default cut()
nb <- as.integer(n + 1)
dx <- diff(rx <- range(x, na.rm = TRUE))
if (dx == 0)
dx <- abs(rx[1L])
seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000,
length.out = nb)
}
b2<-function(x, n=nclass.Sturges(x)) {
#like default hist()
pretty(range(x), n=n)
}

So each of these functions will give break points similar to the default behavior of either cut() or hist(). Rather than just a single number of breaks, they each return a vector with all the break points explicitly stated. This allows you to use cut() to create your factor:

mybreaks <- b1(x)
factorx <- cut(x, breaks = mybreaks)

(Note that you don't have to wrap cut() in factor(), as cut() already returns a factor.) Now, if you get new values, you can classify them using findInterval() and the breaks vector you've already prepared:

nv <- runif(5)
grp <- findInterval(nv,mybreaks)

And we can check the results with

data.frame(grp=levels(factorx)[grp], x=nv)
# grp x
# 1 (0.831,0.969] 0.8769438
# 2 (0.00131,0.14] 0.1188054
# 3 (0.416,0.554] 0.5467373
# 4 (0.14,0.278] 0.2327532
# 5 (0.554,0.693] 0.6022678

and everything looks pretty good. In this case, findInterval() tells you which level of the previously created factor each item belongs to. It returns 0 if the number is smaller than the smallest value of mybreaks, and length(mybreaks), which is one past the last level, for anything at or above the largest value of mybreaks. This behavior is somewhat different from cut(), which returns NA in both cases. Also, cut() intervals are right-closed by default, whereas findInterval() leaves the right end open unless left.open = TRUE is set.
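
The edge behavior is easy to check on a small hypothetical example (breaks_demo and new_vals are made up for this purpose):

```r
breaks_demo <- c(0, 1, 2, 3)
new_vals    <- c(-0.5, 0.5, 1, 3, 3.5)

# findInterval(): intervals are [b[i], b[i+1]); 0 below the range,
# length(breaks) at or above the last break
fi_codes <- findInterval(new_vals, breaks_demo)

# cut(): intervals are (b[i], b[i+1]] by default; NA outside the range
cut_codes <- cut(new_vals, breaks_demo, labels = FALSE)

fi_codes    # 0 1 2 4 4
cut_codes   # NA 1 1 3 NA
```

The value 1 lands in different bins because it sits exactly on a break, and 3 gets the out-of-range code 4 from findInterval() while cut() still places it in the last bin.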


