findInterval() with right-closed intervals
EDIT: Major clean-up in all aisles.
You might look at cut. By default, cut makes left-open, right-closed intervals, and that can be changed using the appropriate argument (right). To use your example:
x <- c(3, 6, 7, 7, 29, 37, 52)
vec <- c(2, 5, 6, 35)
cutVec <- c(vec, max(x)) # for cut, range of vec should cover all of x
Now create four functions that should do the same thing: two from the OP, one from Josh O'Brien, and then cut. Two arguments to cut have been changed from their default settings: include.lowest = TRUE will create an interval closed on both sides for the smallest (leftmost) interval. labels = FALSE will cause cut to return simply the integer values for the bins instead of creating a factor, which it otherwise does.
findInterval.rightClosed <- function(x, vec, ...) {
  fi <- findInterval(x, vec, ...)
  fi - (x == vec[fi])
}
findInterval.rightClosed2 <- function(x, vec, ...) {
  length(vec) - findInterval(-x, -rev(vec), ...)
}
cutFun <- function(x, vec) {
  cut(x, vec, include.lowest = TRUE, labels = FALSE)
}
# The body of fiFun is a contribution by Josh O'Brien that got fed to the ether.
fiFun <- function(x, vec) {
  findInterval(x, vec * (1 + .Machine$double.eps))
}
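One caveat with the fiFun trick, as a minimal sketch: scaling the breaks by (1 + .Machine$double.eps) only nudges positive breaks upward. A zero break is unchanged and a negative break moves further down, so those breaks stay left-closed. The vectors vec0 and x0 below are made up for illustration, and left.open = TRUE assumes R 3.3.0 or later.

```r
# fiFun as in the answer: nudge each break by one relative epsilon
fiFun <- function(x, vec) {
  findInterval(x, vec * (1 + .Machine$double.eps))
}

# Hypothetical breaks that include a negative value and zero
vec0 <- c(-2, 0, 3)
x0   <- c(-2, 0, 3)   # values sitting exactly on the breaks

shifted      <- fiFun(x0, vec0)                           # epsilon trick
right_closed <- findInterval(x0, vec0, left.open = TRUE)  # reference result

print(shifted)
print(right_closed)
```

The two results differ at -2 and 0, so the trick is best kept for strictly positive breaks.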
Do all functions return the same result? Yup. (Notice the use of cutVec for cutFun.)
mapply(identical, list(findInterval.rightClosed(x, vec)),
list(findInterval.rightClosed2(x, vec), cutFun(x, cutVec), fiFun(x, vec)))
# [1] TRUE TRUE TRUE
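Why the extended break vector matters, as a small sketch reusing the example values: cut returns NA for anything outside the range of its breaks, and with the default include.lowest = FALSE a value equal to the lowest break is dropped as well. The variable names below are only for illustration.

```r
x   <- c(3, 6, 7, 7, 29, 37, 52)
vec <- c(2, 5, 6, 35)

# Breaks stop at 35, so 37 and 52 fall outside every interval
uncovered <- cut(x, vec, labels = FALSE)

# Extending the breaks to max(x) gives those values a bin
covered <- cut(x, c(vec, max(x)), include.lowest = TRUE, labels = FALSE)

# A value equal to the lowest break needs include.lowest = TRUE
lowest_default <- cut(2, vec, labels = FALSE)
lowest_closed  <- cut(2, vec, include.lowest = TRUE, labels = FALSE)

print(uncovered)
print(covered)
```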
Now a more demanding vector to bin:
x <- rpois(2e6, 10)
vec <- c(-Inf, quantile(x, seq(.2, 1, .2)))
Test whether the results are identical (note the use of unname, since vec now carries the names produced by quantile):
mapply(identical, list(unname(findInterval.rightClosed(x, vec))),
list(findInterval.rightClosed2(x, vec), cutFun(x, vec), fiFun(x, vec)))
# [1] TRUE TRUE TRUE
And benchmark:
library(microbenchmark)
microbenchmark(findInterval.rightClosed(x, vec), findInterval.rightClosed2(x, vec),
cutFun(x, vec), fiFun(x, vec), times = 50)
# Unit: milliseconds
#                                expr       min        lq    median        uq       max
# 1                    cutFun(x, vec)  35.46261  35.63435  35.81233  36.68036  53.52078
# 2                     fiFun(x, vec)  51.30158  51.69391  52.24277  53.69253  67.09433
# 3  findInterval.rightClosed(x, vec) 124.57110 133.99315 142.06567 155.68592 176.43291
# 4 findInterval.rightClosed2(x, vec)  79.81685  82.01025  86.20182  95.65368 108.51624
From this run, cut seems to be the fastest.
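For completeness, a hedged sketch of one more option that is not part of the comparison above: since R 3.3.0, findInterval itself accepts left.open = TRUE, which makes every interval open on the left and closed on the right. It was not included in the benchmark, so no timing claim is made here.

```r
x   <- c(3, 6, 7, 7, 29, 37, 52)
vec <- c(2, 5, 6, 35)

# Default: intervals are closed on the left, so 6 lands in the third bin
fi_default <- findInterval(x, vec)

# left.open = TRUE: intervals are (lower, upper], so 6 lands in the second bin
fi_right <- findInterval(x, vec, left.open = TRUE)

# Reference result from cut, using the extended break vector
cut_ref <- cut(x, c(vec, max(x)), include.lowest = TRUE, labels = FALSE)

print(fi_default)
print(fi_right)
print(identical(fi_right, cut_ref))
```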
Find which interval row in a data frame that each element of a vector belongs in
Here's a possible solution using the new "non-equi" joins in data.table (v>=1.9.8). While I doubt you'll like the syntax, it should be a very efficient solution.
Also, regarding findInterval, this function assumes continuity in your intervals, while this isn't the case here, so I doubt there is a straightforward solution using it.
library(data.table) #v1.10.0
setDT(intervals)[data.table(elements), on = .(start <= elements, end >= elements)]
# phase start end
# 1: a 0.1 0.1
# 2: a 0.2 0.2
# 3: a 0.5 0.5
# 4: NA 0.9 0.9
# 5: b 1.1 1.1
# 6: b 1.9 1.9
# 7: c 2.1 2.1
Regarding the above code, I find it pretty self-explanatory: join intervals and elements by the condition specified in the on argument. That's pretty much it.
There is a certain caveat here though: start, end and elements should all be of the same type, so if one of them is integer, it should be converted to numeric first.
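A self-contained sketch of the join above. The intervals and elements objects below are made-up stand-ins for the question's data, chosen only so that the result has the same shape as the output shown above. All columns are numeric, so the type caveat does not bite here.

```r
library(data.table)

# Hypothetical, non-contiguous intervals (note the gap between 0.5 and 1)
intervals <- data.table(
  phase = c("a", "b", "c"),
  start = c(0, 1, 2),
  end   = c(0.5, 1.9, 2.5)
)
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)

# Non-equi join: keep the interval with start <= element <= end, NA if none
res <- intervals[data.table(elements), on = .(start <= elements, end >= elements)]
print(res)

# If a bound were stored as integer, convert it before joining, e.g.
# intervals[, start := as.numeric(start)]
```

The element 0.9 falls in the gap between the intervals, so its phase comes back as NA.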
R: Is cut the right function to do this?
cut should be the correct function, but you're doing things wrong.
First, there are typos in your code. labels = c(...) would be the correct version.
Second, think about what you're doing: creating intervals. How many? Try cut without the labels to see:
cut(1.2, breaks=c(0.6,0.8,1.0,1.2,1.4))
# [1] (1,1.2]
# Levels: (0.6,0.8] (0.8,1] (1,1.2] (1.2,1.4]
There are only 4 levels created the way you're doing it, so you only need to provide 4 labels (or redefine your break points).
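As a sketch of the fix: five break points define four intervals, so labels needs exactly four entries. The label names and the input values here are invented for illustration.

```r
breaks <- c(0.6, 0.8, 1.0, 1.2, 1.4)
labs   <- c("low", "medium", "high", "very high")   # length(breaks) - 1 labels

vals   <- c(0.7, 0.9, 1.1, 1.3)
binned <- cut(vals, breaks = breaks, labels = labs)
print(binned)

# Supplying one label too many is an error
too_many <- tryCatch(
  cut(vals, breaks = breaks, labels = c(labs, "extra")),
  error = function(e) "error"
)
print(too_many)
```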
Extracting breakpoints with intervals closed on the left
Just use something like the following for your pattern, and use gsub instead: "\\[|\\]|\\(|\\)".
An example.
out <- levels(cut(aaa, 3, dig.lab = 4, ordered = TRUE, right = FALSE))
gsub("\\[|\\]|\\(|\\)", "", out)
# [1] "0.994,2.998" "2.998,5.002" "5.002,7.006"
And, here's a quick way to read that data in:
read.csv(text = gsub("\\[|\\]|\\(|\\)", "", out), header = FALSE)
# V1 V2
# 1 0.994 2.998
# 2 2.998 5.002
# 3 5.002 7.006
FYI: The same pattern would work whether the intervals are closed on the left or on the right. Using your original example:
labs <- levels(cut(aaa, 3))
labs
# [1] "(0.994,3]" "(3,5]" "(5,7.01]"
read.csv(text = gsub("\\[|\\]|\\(|\\)", "", labs), header = FALSE)
# V1 V2
# 1 0.994 3.00
# 2 3.000 5.00
# 3 5.000 7.01
As for alternatives, since you just need to strip out the first and last character before you can use read.csv, you can also easily use substr without having to fuss with regular expressions (if that's not your thing):
substr(labs, 2, nchar(labs)-1)
# [1] "0.994,3" "3,5" "5,7.01"
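Since aaa is not defined in this answer, here is a self-contained sketch using an assumed vector (chosen only because it reproduces the levels shown above). It turns the labels into numeric lower and upper bounds. The column names lower and upper are also just illustrative.

```r
# Assumed input, not taken from the original post
aaa <- c(1, 2, 3, 4, 5, 2, 3, 4, 5, 6, 7)

labs <- levels(cut(aaa, 3))

# Strip the brackets and parentheses, then parse "lower,upper" pairs
stripped <- gsub("\\[|\\]|\\(|\\)", "", labs)
bounds <- read.csv(text = stripped, header = FALSE,
                   col.names = c("lower", "upper"))
print(bounds)
```

Keep in mind that the parsed numbers are the rounded values from the labels, not the exact break points cut used internally.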
Update: A totally different alternative
Since it is obvious that R has to calculate these values and store them as part of the function in order to generate the output you see, it is not too difficult to manipulate the function to get it to output different things.
Looking at the code for cut.default, you'll find the following as the last few lines:
if (codes.only)
code
else factor(code, seq_along(labels), labels, ordered = ordered_result)
It's really easy to change the last few lines to output a list that contains the output of cut as the first item, and the calculated ranges (taken from the cut function directly, not extracted from the pasted-together factor labels) as the second item.
For instance, in the Gist I've posted at this link, I've changed those lines as follows:
if (codes.only)
FIN <- code
else FIN <- factor(code, seq_along(labels), labels, ordered = ordered_result)
list(output = FIN, ranges = data.frame(lower = ch.br[-nb], upper = ch.br[-1L]))
Now, compare:
cut(aaa, 3)
# [1] (0.994,3] (0.994,3] (3,5] (3,5] (3,5] (0.994,3] (3,5] (3,5]
# [9] (3,5] (5,7.01] (5,7.01]
# Levels: (0.994,3] (3,5] (5,7.01]
CUT(aaa, 3)
# $output
# [1] (0.994,3] (0.994,3] (3,5] (3,5] (3,5] (0.994,3] (3,5] (3,5]
# [9] (3,5] (5,7.01] (5,7.01]
# Levels: (0.994,3] (3,5] (5,7.01]
#
# $ranges
# lower upper
# 1 0.994 3
# 2 3 5
# 3 5 7.01
And, with right = FALSE:
cut(aaa, 3, dig.lab = 4, ordered = TRUE, right = FALSE)
# [1] [0.994,2.998) [0.994,2.998) [2.998,5.002) [2.998,5.002) [2.998,5.002)
# [6] [0.994,2.998) [2.998,5.002) [2.998,5.002) [2.998,5.002) [5.002,7.006)
# [11] [5.002,7.006)
# Levels: [0.994,2.998) < [2.998,5.002) < [5.002,7.006)
CUT(aaa, 3, dig.lab = 4, ordered = TRUE, right = FALSE)
# $output
# [1] [0.994,2.998) [0.994,2.998) [2.998,5.002) [2.998,5.002) [2.998,5.002)
# [6] [0.994,2.998) [2.998,5.002) [2.998,5.002) [2.998,5.002) [5.002,7.006)
# [11] [5.002,7.006)
# Levels: [0.994,2.998) < [2.998,5.002) < [5.002,7.006)
# $ranges
# lower upper
# 1 0.994 2.998
# 2 2.998 5.002
# 3 5.002 7.006
How to efficiently check whether a vector of numbers is in an interval defined by a data frame
Function findInterval has good performance and can get the job done.
See first how it works with just the first element of n1:
i <- findInterval(n1[1], c(df.int$Upper.Limit, Inf))
j <- findInterval(n1[1], c(-Inf, df.int$Upper.Limit))
df.int$Upper.Limit[i]
#[1] 189000
n1[1]
#[1] 189041
df.int$Upper.Limit[j]
#[1] 189200
df.int$Upper.Limit[i] < n1[1] & n1[1] <= df.int$Upper.Limit[j]
#[1] TRUE
Now a general purpose solution.
subsInterval <- function(x, DF, colLimits = 1, colValues = 2, lower = TRUE){
  vec <- if(lower)
    c(DF[[colLimits]], Inf)
  else
    c(-Inf, DF[[colLimits]])
  i <- findInterval(x, vec, left.open = TRUE)
  DF[[colValues]][i]
}
system.time(
n2 <- subsInterval(n1, df.int)
)
# user system elapsed
# 0.000 0.000 0.001
system.time(
n3 <- subsInterval(n1, df.int, lower = FALSE)
)
# user system elapsed
# 0 0 0
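Since n1 and df.int are not defined in this answer, here is a small self-contained sketch with made-up data. The column name Value and all numbers are assumptions for illustration; the function is the one from the answer (left.open needs R 3.3.0 or later).

```r
subsInterval <- function(x, DF, colLimits = 1, colValues = 2, lower = TRUE) {
  vec <- if (lower) c(DF[[colLimits]], Inf) else c(-Inf, DF[[colLimits]])
  i <- findInterval(x, vec, left.open = TRUE)
  DF[[colValues]][i]
}

# Hypothetical lookup table and query values
df.int <- data.frame(Upper.Limit = c(189000, 189200, 189400),
                     Value = c("A", "B", "C"),
                     stringsAsFactors = FALSE)
n1 <- c(189041, 189200, 189350)

n2 <- subsInterval(n1, df.int)                 # value of the limit below x
n3 <- subsInterval(n1, df.int, lower = FALSE)  # value of the limit at or above x
print(n2)
print(n3)

# Caveat: with lower = TRUE, a value at or below the first limit gets index 0
# and is silently dropped, so the result can be shorter than the input.
short <- subsInterval(c(188000, 189041), df.int)
print(length(short))
```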
How to find which interval/range a variable falls under in R
Try using the code below and see whether it will give you the expected results:
dataframe <- data.frame(Col1 = seq(0, 24, by = 4), x = rnorm(7), y = rnorm(7, 50))
# Return the row at or just below x, plus the row right after it
funfun <- function(x) {
  v <- findInterval(x, dataframe$Col1)
  c(v, v + 1)
}
dataframe[funfun(2), ]
#   Col1        x        y
# 1    0 0.831266 50.28246
# 2    4 1.751892 48.78810
dataframe[funfun(10), ]
#   Col1          x        y
# 3    8  0.2624929 48.33945
# 4   12 -0.2243066 51.11304
If this helps, please let us know. Thank you.
Finding a value in an interval
If you plan to re-classify new values, it's best to explicitly set the breaks= parameter with a vector rather than a size. Note that had those values been in the set originally, you might have had different breaks, and your new values may fall outside all the levels of your existing data, which can be troublesome.
So first, I will generate some sample data.
set.seed(18)
x <- runif(50)
Now I will show two different ways to calculate breaks. Here are b1() and b2():
b1 <- function(x, n = nclass.Sturges(x)) {
  # like default cut()
  nb <- as.integer(n + 1)
  dx <- diff(rx <- range(x, na.rm = TRUE))
  if (dx == 0)
    dx <- abs(rx[1L])
  seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000,
          length.out = nb)
}
b2 <- function(x, n = nclass.Sturges(x)) {
  # like default hist()
  pretty(range(x), n = n)
}
So each of these functions will give break points similar to the default behavior of either cut() or hist(). Rather than just a single number of breaks, they each return a vector with all the break points explicitly stated. This allows you to use cut() to create your factor:
mybreaks <- b1(x)
factorx <- cut(x, breaks = mybreaks)
(Note that you don't have to wrap cut() in factor(), as cut() already returns a factor.) Now, if you get new values, you can classify them using findInterval() and the special breaks vector you've already prepared:
nv <- runif(5)
grp <- findInterval(nv,mybreaks)
And we can check the results with
data.frame(grp=levels(factorx)[grp], x=nv)
# grp x
# 1 (0.831,0.969] 0.8769438
# 2 (0.00131,0.14] 0.1188054
# 3 (0.416,0.554] 0.5467373
# 4 (0.14,0.278] 0.2327532
# 5 (0.554,0.693] 0.6022678
and everything looks pretty good. In this case, findInterval() will tell you which level of the previous factor each item belongs to. It will return 0 if the number is smaller than the smallest break, and it will return length(mybreaks), an index one past the last level, for anything at or above the largest break. This behavior is somewhat different from cut(), which returns NA in both cases. Also note that cut() intervals are right-closed by default, whereas findInterval() intervals are left-closed and right-open unless left.open = TRUE is set.
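The boundary and out-of-range differences can be seen in a short sketch. The breaks and values are invented so that some values land exactly on a break or outside the range; left.open = TRUE assumes R 3.3.0 or later.

```r
breaks <- c(0, 0.25, 0.5, 0.75, 1)
nv     <- c(0.1, 0.25, 0.6, 1, 1.2)   # 0.25 and 1 sit on a break, 1.2 is out of range

by_cut     <- cut(nv, breaks = breaks, labels = FALSE)    # right-closed bins
by_fi      <- findInterval(nv, breaks)                    # left-closed bins
by_fi_open <- findInterval(nv, breaks, left.open = TRUE)  # right-closed bins

print(by_cut)
print(by_fi)
print(by_fi_open)
```

With left.open = TRUE the in-range results agree with cut(), but the out-of-range value still gets index length(breaks) rather than NA, so it needs separate handling before being used to index factor levels.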