Is Cut() Style Binning Available in Dplyr

Inconsistency in the binning of the cut function in RStudio

You think the two ways of cutting the vector are equivalent, but they are not. The issue has nothing to do with RStudio or knitr; it is easy to reproduce in a plain R session:

problem = function() {
  library(ISLR)
  set.seed(NULL)  # reinitialize random seed
  Wage$age.jittered = jitter(Wage$age)
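  # recover numeric break points by parsing the (possibly rounded) level labels
  get_breaks = function(cutted) {
    labels = levels(cutted)
    lower = as.numeric(sub("\\((.+),.*", "\\1", labels))
    upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", labels[length(labels)]))
    c(lower, upper)
  }
  age_groups = cut(Wage$age.jittered, 4)  # let cut() pick 4 equal-width bins
  age_groups1 = cut(Wage$age.jittered, get_breaks(age_groups))  # re-cut at the parsed breaks
  all(levels(age_groups) == levels(age_groups1))  # compares the labels; the value is not returned
  idx = which(age_groups != age_groups1)  # observations assigned to different bins
  length(idx)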
}

res = replicate(1000, problem())
barplot(table(res))

[Figure: bar plot of the frequency of length(idx)]

If the two cuts were equivalent, the barplot would have a single bar at 0, but length(idx) is non-zero in quite a few of the runs.

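Back to your question: the labels you saw are not necessarily the exact endpoints. They may be rounded for display, so breaks parsed back out of the labels can differ slightly from the ones cut actually used. See the argument dig.lab (default 3) in the help page ?cut.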
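
To make the rounding visible, here is a minimal sketch using a small made-up vector (the values of x are an assumption for illustration, not data from the question). It cuts the same vector twice, once with the default dig.lab and once with more digits, and compares the labels and the bin codes:

```r
# Illustrative data (not from the original post)
x <- c(1.23456, 2.34567, 3.45678, 4.56789)

lab_default <- levels(cut(x, 2))               # labels formatted with dig.lab = 3
lab_precise <- levels(cut(x, 2, dig.lab = 6))  # same bins, more digits in the labels

print(lab_default)
print(lab_precise)

# The bin assignment is the same either way; only the printed labels differ
same_bins <- identical(as.integer(cut(x, 2)),
                       as.integer(cut(x, 2, dig.lab = 6)))
print(same_bins)
```

Since the labels are only a display of the breaks, it is probably safer to compute the breaks yourself and pass them to cut than to parse them back out of the labels.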

Issue with case_when statement using & in dplyr?

The column percentile is a factor. We need to convert it to character first and then to numeric:

library(dplyr)
df1 %>%
    mutate(percentile = as.numeric(as.character(percentile))) %>%
    ...

When a factor is coerced directly to numeric/integer, the result is the internal integer codes of the levels rather than the values shown in the labels:

v1 <- factor(c(81.9, 82.7, 81.9, 82.5))
as.numeric(v1)
#[1] 1 3 1 2

which differs from the following:

as.numeric(as.character(v1))
#[1] 81.9 82.7 81.9 82.5

Or, probably faster, convert only the levels and then index by the factor:

as.numeric(levels(v1))[v1]
#[1] 81.9 82.7 81.9 82.5

Splitting a continuous variable into equal sized groups

Try this:

split(das, cut(das$anim, 3))

If you want to split on the value of wt into groups of roughly equal size, then use cut2 with g set to the number of groups:

library(Hmisc) # cut2
split(das, cut2(das$wt, g=3))

Either way, you can do this by combining split with cut or cut2.
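
The difference between the two calls matters when the values are unevenly spread: cut(x, 3) makes three intervals of equal width, while cut2(x, g=3) aims for three groups with equal counts. The sketch below shows the contrast in base R, using quantile breaks as a stand-in for cut2 (that substitution is an assumption made here to avoid the Hmisc dependency, and the labels will not match cut2's). The wt values are copied from the das printout further down in this answer:

```r
# wt values copied from the das printout in this answer
wt <- c(181.0, 179.0, 180.5, 201.0, 201.5, 245.0, 246.4, 189.3,
        301.0, 354.0, 369.0, 205.0, 199.0, 394.0, 231.3)

# Equal-width bins: the range of wt is divided into 3 intervals of the same length
equal_width <- cut(wt, 3)

# Equal-count bins: break at the 0, 1/3, 2/3 and 1 quantiles instead
qs <- quantile(wt, probs = seq(0, 1, length.out = 4))
equal_count <- cut(wt, breaks = qs, include.lowest = TRUE)

width_counts <- as.vector(table(equal_width))
count_counts <- as.vector(table(equal_count))
print(width_counts)  # group sizes are uneven
print(count_counts)  # group sizes are equal
```

include.lowest = TRUE is needed in the quantile version because the smallest value sits exactly on the first break.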

UPDATED

If you want the group as an additional column, then

das$group <- cut(das$anim, 3)

If the column should be an index like 1, 2, ..., then

das$group <- as.numeric(cut(das$anim, 3))

UPDATED AGAIN

Try this:

> das$wt2 <- as.numeric(cut2(das$wt, g=3))
> das
   anim    wt wt2
1     1 181.0   1
2     2 179.0   1
3     3 180.5   1
4     4 201.0   2
5     5 201.5   2
6     6 245.0   2
7     7 246.4   3
8     8 189.3   1
9     9 301.0   3
10   10 354.0   3
11   11 369.0   3
12   12 205.0   2
13   13 199.0   1
14   14 394.0   3
15   15 231.3   2

Create groups based on percent_rank in dplyr

Perhaps cut will serve your needs:

library(dplyr)
n <- 100
set.seed(42)
df1 <- data.frame(idx = 1:n, x = rnorm(n))
df1 <- df1 %>%
    arrange(x) %>%
    mutate(pc_x = percent_rank(x))

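I use -1e9 as the first break because cut builds left-open, right-closed intervals by default. With breaks <- c(0, ...) the first row (pc_x equal to 0) would fall outside (0, 0.3] and get NA instead of 1. Keeping 0 as the first break and passing include.lowest=TRUE would also work.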

breaks <- c(-1e9, 0.3, 0.7, 1)
df1 %>%
    mutate(grp = cut(pc_x, breaks=breaks, labels=FALSE)) %>%
    group_by(grp)
## Source: local data frame [100 x 4]
## Groups: grp [3]
##      idx          x       pc_x   grp
##    (int)      (dbl)      (dbl) (int)
## 1     59 -2.9930901 0.00000000     1
## 2     18 -2.6564554 0.01010101     1
## 3     19 -2.4404669 0.02020202     1
## 4     39 -2.4142076 0.03030303     1
## 5     22 -1.7813084 0.04040404     1
## ..   ...        ...        ...   ...
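
As a variation on the -1e9 sentinel, the sketch below keeps 0 as the first break and relies on include.lowest = TRUE. To stay self-contained it computes the percent rank in base R as (rank - 1) / (n - 1); treating that as equivalent to dplyr's percent_rank for data without ties or missing values is an assumption of this sketch:

```r
set.seed(42)
x <- rnorm(100)

# Base-R stand-in for percent_rank(x): ranks rescaled to [0, 1]
pc_x <- (rank(x, ties.method = "min") - 1) / (length(x) - 1)

breaks <- c(0, 0.3, 0.7, 1)

# Default intervals are (0, 0.3], (0.3, 0.7], (0.7, 1]: the minimum (pc_x == 0) is left out
grp_open <- cut(pc_x, breaks = breaks, labels = FALSE)

# include.lowest = TRUE closes the first interval, giving [0, 0.3]
grp_closed <- cut(pc_x, breaks = breaks, labels = FALSE, include.lowest = TRUE)

# The sentinel version from the answer, for comparison
grp_sentinel <- cut(pc_x, breaks = c(-1e9, 0.3, 0.7, 1), labels = FALSE)

print(sum(is.na(grp_open)))    # number of rows without a group
print(sum(is.na(grp_closed)))
print(table(grp_closed))
```

Both fixes assign the same groups here; include.lowest just avoids the arbitrary sentinel value.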

Binning ages in R

cut() is probably the right function here. You only need to supply the break points between the ranges, not a start and an end for each interval: n + 1 breaks define n adjacent intervals. The variable is treated as continuous.
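
As a small illustration of break points versus intervals (the ages and breaks below are made-up values, not the data from the question), four breaks give three bins, and a value sitting exactly on a break goes to the bin on its left unless right = FALSE is set:

```r
ages   <- c(5, 10, 15, 25)    # illustrative values
breaks <- c(0, 10, 20, 30)    # 4 break points define 3 intervals

# Default: intervals are open on the left and closed on the right
binned <- cut(ages, breaks)
print(levels(binned))
print(as.character(binned))

# right = FALSE flips this: closed on the left, open on the right
left_closed <- cut(ages, breaks, right = FALSE)
print(as.character(left_closed))
```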

#input data
birthyear <- c(1987, 1995, 1994, 1981, 1994, 1989, 1985, 1987, 1996, 1981, 
    1980, 1994, 1996, 1983, 1949, 1988, 1998, 1977, 1967, 1968)
agebreaks <- c(1864, 1929, 1939,1949,1954,1959,1969,1979,1989,1994,2000)

#cut
a <- cut(birthyear, agebreaks, include.lowest=TRUE)
#rename
levels(a) <- rev(c("14 to 19 years","20 to 24 years","25 to 34 years",
    "35 to 44 years","45 to 54 years","55 to 59 years","60 to 64 years",
    "65 to 74 years","75 to 84 years","85 years and over"))

#table
as.data.frame(table(a))

#result
                   a Freq
1  85 years and over    0
2     75 to 84 years    0
3     65 to 74 years    1
4     60 to 64 years    0
5     55 to 59 years    0
6     45 to 54 years    2
7     35 to 44 years    1
8     25 to 34 years    9
9     20 to 24 years    3
10    14 to 19 years    4