Inconsistency in the binning of the cut function in RStudio
You think the two ways of cutting the vector are equivalent, but they are not. This issue is unrelated to RStudio or knitr; it is easy to reproduce in a plain R session:
problem = function() {
  library(ISLR)
  set.seed(NULL) # reinitialize random seed
  Wage$age.jittered = jitter(Wage$age)
  get_breaks = function(cutted) {
    labels = levels(cutted)
    lower = as.numeric(sub("\\((.+),.*", "\\1", labels))
    upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", labels[length(labels)]))
    c(lower, upper)
  }
  age_groups = cut(Wage$age.jittered, 4)
  age_groups1 = cut(Wage$age.jittered, get_breaks(age_groups))
  all(levels(age_groups) == levels(age_groups1))
  idx = which(age_groups != age_groups1)
  length(idx)
}
res = replicate(1000, problem())
barplot(table(res))
You would expect the barplot to show a non-zero frequency only at 0, but the length of idx is greater than zero in quite a few of the runs.
Back to your question: the labels that you saw are not necessarily the exact endpoints. They may be rounded; see the argument dig.lab in the help page ?cut.
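As a rough illustration of the rounding, the sketch below cuts a small made-up vector (x is hypothetical data, not the Wage data) twice: once with the default dig.lab = 3 and once with dig.lab = 10. The observations land in the same bins either way, but only the second set of labels shows the break points at close to full precision, so numbers parsed back out of the default labels can be slightly off.

```r
# Hypothetical data, chosen so the computed break points are not round numbers
x <- c(1.234567, 2.345678, 3.456789, 4.567891)

default_groups <- cut(x, 4)                # labels formatted with dig.lab = 3
precise_groups <- cut(x, 4, dig.lab = 10)  # labels formatted with up to 10 digits

levels(default_groups)
levels(precise_groups)

# Same bin membership, different label text
all(as.integer(default_groups) == as.integer(precise_groups))
identical(levels(default_groups), levels(precise_groups))
```

If the exact break points are needed later, it is probably safer to compute them yourself and pass them to cut() as a vector than to recover them from the labels.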
Issue with case_when statement using & in dplyr?
The column percentile is a factor. We need to convert it to character class first and then to numeric:
library(dplyr)
df1 %>%
  mutate(percentile = as.numeric(as.character(percentile))) %>%
  ...
What happens is that when we coerce a factor directly to numeric/integer, we get the underlying integer codes (the positions of the levels) instead of the actual values:
v1 <- factor(c(81.9, 82.7, 81.9, 82.5))
as.numeric(v1)
#[1] 1 3 1 2
which is different from the following:
as.numeric(as.character(v1))
#[1] 81.9 82.7 81.9 82.5
Or, probably faster, by indexing the levels:
as.numeric(levels(v1)[v1])
#[1] 81.9 82.7 81.9 82.5
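To show why this matters for conditions joined with &, here is a minimal base R sketch. The df1 below is a hypothetical stand-in with made-up values, and the thresholds 82 and 83 are only for illustration. Comparing the factor directly gives NA (with a warning), while the converted column compares as expected.

```r
# Hypothetical stand-in for df1: percentile stored as a factor
df1 <- data.frame(percentile = factor(c("81.9", "82.7", "81.9", "82.5")))

# Comparing an unordered factor with > is not meaningful: R warns and returns NA
direct <- suppressWarnings(df1$percentile > 82)

# Convert via character first, then combine the conditions with &
percentile_num <- as.numeric(as.character(df1$percentile))
in_range <- percentile_num > 82 & percentile_num < 83

direct    # NA NA NA NA
in_range  # FALSE TRUE FALSE TRUE
```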
Splitting a continuous variable into equal sized groups
Try this:
split(das, cut(das$anim, 3))
If you want to split based on the value of wt, then:
library(Hmisc) # cut2
split(das, cut2(das$wt, g=3))
Either way, you can do this by combining cut, cut2 and split.
UPDATED
If you want a group index as an additional column, then:
das$group <- cut(das$anim, 3)
If the column should be an index like 1, 2, ..., then:
das$group <- as.numeric(cut(das$anim, 3))
UPDATED AGAIN
Try this:
> das$wt2 <- as.numeric(cut2(das$wt, g=3))
> das
   anim    wt wt2
1     1 181.0   1
2     2 179.0   1
3     3 180.5   1
4     4 201.0   2
5     5 201.5   2
6     6 245.0   2
7     7 246.4   3
8     8 189.3   1
9     9 301.0   3
10   10 354.0   3
11   11 369.0   3
12   12 205.0   2
13   13 199.0   1
14   14 394.0   3
15   15 231.3   2
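If installing Hmisc is not an option, a similar equal-sized grouping can be sketched in base R by passing quantiles of wt to cut as break points. This is an alternative to cut2, not what cut2 does internally; the das data frame is rebuilt from the printed output above, and wt_breaks and wt_group are names made up for this example.

```r
# das rebuilt from the printed output above
das <- data.frame(
  anim = 1:15,
  wt = c(181.0, 179.0, 180.5, 201.0, 201.5, 245.0, 246.4, 189.3,
         301.0, 354.0, 369.0, 205.0, 199.0, 394.0, 231.3)
)

# Break points at the 0%, 33.3%, 66.7% and 100% quantiles of wt
wt_breaks <- quantile(das$wt, probs = seq(0, 1, length.out = 4))

# include.lowest = TRUE keeps the minimum value in the first group;
# labels = FALSE returns integer group codes instead of a factor
das$wt_group <- cut(das$wt, breaks = wt_breaks, include.lowest = TRUE, labels = FALSE)

table(das$wt_group)  # five rows in each of the three groups
```

For this data the codes agree with the wt2 column shown above. With heavily tied values the quantiles can coincide, in which case cut stops with an error because the breaks are not unique.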
Create groups based on percent_rank in dplyr
Perhaps cut will serve your needs:
library(dplyr)
n <- 100
set.seed(42)
df1 <- data.frame(idx = 1:n, x = rnorm(n))
df1 <- df1 %>%
  arrange(x) %>%
  mutate(pc_x = percent_rank(x))
I use -1e9 in breaks because cut is "left-open" by default (each interval excludes its lower endpoint), so if I used breaks <- c(0, ...) then the first row, whose pc_x is exactly 0, would be NA instead of 1.
breaks <- c(-1e9, 0.3, 0.7, 1)
df1 %>%
  mutate(grp = cut(pc_x, breaks=breaks, labels=FALSE)) %>%
  group_by(grp)
## Source: local data frame [100 x 4]
## Groups: grp [3]
##      idx          x       pc_x   grp
##    (int)      (dbl)      (dbl) (int)
## 1     59 -2.9930901 0.00000000     1
## 2     18 -2.6564554 0.01010101     1
## 3     19 -2.4404669 0.02020202     1
## 4     39 -2.4142076 0.03030303     1
## 5     22 -1.7813084 0.04040404     1
## ..   ...        ...        ...   ...
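An alternative to the -1e9 sentinel is the include.lowest argument of cut, which closes the first interval on the left. The sketch below is base R only, so the percent rank is computed by hand as (rank - 1) / (n - 1); treat that formula as an assumed stand-in for dplyr's percent_rank rather than its exact implementation.

```r
set.seed(42)
x <- rnorm(100)

# Hand-rolled percent rank in [0, 1] (assumed equivalent of percent_rank here)
pc_x <- (rank(x, ties.method = "min") - 1) / (length(x) - 1)

breaks <- c(0, 0.3, 0.7, 1)

# Default: intervals are (0, 0.3], (0.3, 0.7], (0.7, 1], so pc_x == 0 is dropped
grp_open <- cut(pc_x, breaks = breaks, labels = FALSE)

# include.lowest = TRUE makes the first interval [0, 0.3]
grp_closed <- cut(pc_x, breaks = breaks, labels = FALSE, include.lowest = TRUE)

sum(is.na(grp_open))    # 1: the smallest observation
sum(is.na(grp_closed))  # 0
table(grp_closed)       # 30, 40 and 30 rows
```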
Binning ages in R
cut() is probably the correct function here. You just need to specify the break points between the ranges, not the start and end of every interval. The measure is assumed to be continuous.
#input data
birthyear <- c(1987, 1995, 1994, 1981, 1994, 1989, 1985, 1987, 1996, 1981,
1980, 1994, 1996, 1983, 1949, 1988, 1998, 1977, 1967, 1968)
agebreaks <- c(1864, 1929, 1939,1949,1954,1959,1969,1979,1989,1994,2000)
#cut
a <- cut(birthyear, agebreaks, include.lowest=TRUE)
#rename
levels(a) <- rev(c("14 to 19 years","20 to 24 years","25 to 34 years",
"35 to 44 years","45 to 54 years","55 to 59 years","60 to 64 years",
"65 to 74 years","75 to 84 years","85 years and over"))
#table
as.data.frame(table(a))
#result
                   a Freq
1  85 years and over    0
2     75 to 84 years    0
3     65 to 74 years    1
4     60 to 64 years    0
5     55 to 59 years    0
6     45 to 54 years    2
7     35 to 44 years    1
8     25 to 34 years    9
9     20 to 24 years    3
10    14 to 19 years    4
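A variation on the same idea is to convert birth years to ages first and cut the ages directly, which lets the labels be written in their natural order without rev. This is only a sketch: reference_year = 2014 is an assumption (the original does not state it), and age_breaks and age_labels are names introduced for this example.

```r
birthyear <- c(1987, 1995, 1994, 1981, 1994, 1989, 1985, 1987, 1996, 1981,
               1980, 1994, 1996, 1983, 1949, 1988, 1998, 1977, 1967, 1968)

reference_year <- 2014  # assumption, not stated in the original
age <- reference_year - birthyear

age_breaks <- c(14, 20, 25, 35, 45, 55, 60, 65, 75, 85, Inf)
age_labels <- c("14 to 19 years", "20 to 24 years", "25 to 34 years",
                "35 to 44 years", "45 to 54 years", "55 to 59 years",
                "60 to 64 years", "65 to 74 years", "75 to 84 years",
                "85 years and over")

# right = FALSE makes each interval [lower, upper), so age 20 falls in "20 to 24 years"
age_group <- cut(age, breaks = age_breaks, labels = age_labels, right = FALSE)

as.data.frame(table(age_group))
```

With the assumed reference year, the count for each label matches the table above, listed from youngest to oldest.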