Categorize Numeric Variable into Group/ Bins/ Breaks

I would use findInterval() here:

First, make up some sample data:

set.seed(1)
ages <- floor(runif(20, min = 20, max = 50))
ages
# [1] 27 31 37 47 26 46 48 39 38 21 26 25 40 31 43 34 41 49 31 43

Use findInterval() to categorize your "ages" vector.

findInterval(ages, c(20, 30, 40))
# [1] 1 2 2 3 1 3 3 2 2 1 1 1 3 2 3 2 3 3 2 3
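
If readable labels are more useful than bare indices, one option is to index a label vector with the result of findInterval(). This is only a minimal sketch: the breaks are the ones used above, while the label strings and the shortened ages vector are made up for illustration.

```r
# First five values of the sample data above
ages <- c(27, 31, 37, 47, 26)
breaks <- c(20, 30, 40)

# Hypothetical labels, one per interval: [20,30), [30,40), [40,Inf)
age_labels <- c("20-29", "30-39", "40+")

# findInterval() returns the index of the interval each value falls into
idx <- findInterval(ages, breaks)

# Use the indices to look up the matching label
age_group <- age_labels[idx]
age_group
```

Values below the first break get index 0, which is silently dropped when indexing, so it is worth checking that min(idx) is at least 1 before relying on this.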

Alternatively, as recommended in the comments, cut() is also useful here:

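# Returns a factor whose levels are interval labels such as [20,30)
cut(ages, breaks = c(20, 30, 40, 50), right = FALSE)
# Returns the integer bin codes instead of a factor
cut(ages, breaks = c(20, 30, 40, 50), right = FALSE, labels = FALSE)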
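
To see how the two functions relate, the sketch below uses a short hand-picked vector (an assumption, not the sample data above) and checks that the integer codes from cut() agree with findInterval() for values inside the breaks. It also shows one place where they differ: values outside the breaks.

```r
ages <- c(27, 31, 37, 47, 26, 40)
breaks <- c(20, 30, 40, 50)

# Factor with left-closed intervals because right = FALSE
groups <- cut(ages, breaks = breaks, right = FALSE)
levels(groups)

# Integer codes; compare with findInterval() on the lower bounds only
codes <- cut(ages, breaks = breaks, right = FALSE, labels = FALSE)
same_as_find_interval <- all(codes == findInterval(ages, breaks[-length(breaks)]))

# cut() returns NA for values outside the breaks (50 is excluded by [40,50)),
# whereas findInterval(50, c(20, 30, 40)) would return 3
outside <- cut(c(19, 50), breaks = breaks, right = FALSE, labels = FALSE)
```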

Categorize numeric variable with mutate

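library(dplyr)  # provides %>% and mutate()

set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10))

# Bin a into quintile intervals
df %>% mutate(a = cut(a, breaks = quantile(a, probs = seq(0, 1, 0.2))))

giving (row 8 is NA because the minimum of a equals the lowest break, which cut() excludes by default):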

                 a          b
1  (-0.586,-0.316]  1.2240818
2   (-0.316,0.094]  0.3598138
3      (0.68,1.72]  0.4007715
4   (-0.316,0.094]  0.1106827
5     (0.094,0.68] -0.5558411
6      (0.68,1.72]  1.7869131
7     (0.094,0.68]  0.4978505
8             <NA> -1.9666172
9   (-1.27,-0.586]  0.7013559
10 (-0.586,-0.316] -0.4727914
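
If the NA in row 8 of the output above is unwanted, passing include.lowest = TRUE to cut() closes the lowest interval so that the minimum is kept. A minimal base R sketch on the same a vector (the variable names breaks and bins are illustrative):

```r
set.seed(123)
a <- rnorm(10)
breaks <- quantile(a, probs = seq(0, 1, 0.2))

# Default behaviour: the minimum equals the lowest break and is excluded
n_missing_default <- sum(is.na(cut(a, breaks = breaks)))

# include.lowest = TRUE makes the first interval closed on both sides
bins <- cut(a, breaks = breaks, include.lowest = TRUE)

# With ten distinct values and quintile breaks, each bin holds two values
table(bins)
```

The same argument can be added inside the mutate() call.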

R categorize numeric value using case_when

We could use the cut() function:

library(dplyr)

labels <- c("1 km", "10 km", "20 km", "50 km")

data %>%
  mutate(within_km = cut(distance_km,
                         breaks = c(0, 1, 10, 20, 50),
                         labels = labels))

  id    distance_km within_km
  <chr>       <dbl> <fct>
1 1             0.5 1 km
2 2             1.5 10 km
3 3            10.5 20 km
4 4            43   50 km
5 5            20.7 50 km
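
The input data is not shown above, so the sketch below reconstructs it from the printed output (an assumption) and repeats the cut() step in base R. Because the title asks about case_when, a roughly equivalent case_when version is included as well; it is guarded so that the block still runs where dplyr is not installed.

```r
# Sample data reconstructed from the printed output above (assumed)
data <- data.frame(id = as.character(1:5),
                   distance_km = c(0.5, 1.5, 10.5, 43, 20.7),
                   stringsAsFactors = FALSE)

labels <- c("1 km", "10 km", "20 km", "50 km")

# cut() uses right-closed intervals by default: (0,1], (1,10], (10,20], (20,50]
data$within_km <- cut(data$distance_km,
                      breaks = c(0, 1, 10, 20, 50),
                      labels = labels)

# case_when() checks the conditions in order and returns a character vector;
# rows matching no condition become NA
if (requireNamespace("dplyr", quietly = TRUE)) {
  data$within_km_cw <- dplyr::case_when(
    data$distance_km <= 1  ~ "1 km",
    data$distance_km <= 10 ~ "10 km",
    data$distance_km <= 20 ~ "20 km",
    data$distance_km <= 50 ~ "50 km"
  )
}
data
```

Note that the two versions differ at the edges: cut() returns NA for distances of 0 or less, while the case_when conditions as written would label them "1 km".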

Splitting a continuous variable into equal sized groups

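Try this; cut(x, 3) divides the range of x into three equal-width intervals:

split(das, cut(das$anim, 3))

If you want to split based on the value of wt into three groups of roughly equal size, then:

library(Hmisc) # cut2
split(das, cut2(das$wt, g = 3))

Either way, you can do this by combining cut, cut2 and split.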

UPDATED

If you want a group index as an additional column, then:

das$group <- cut(das$anim, 3)

If the column should be an integer index like 1, 2, ..., then:

das$group <- as.numeric(cut(das$anim, 3))

UPDATED AGAIN

Try this:

> das$wt2 <- as.numeric(cut2(das$wt, g = 3))
> das
   anim    wt wt2
1     1 181.0   1
2     2 179.0   1
3     3 180.5   1
4     4 201.0   2
5     5 201.5   2
6     6 245.0   2
7     7 246.4   3
8     8 189.3   1
9     9 301.0   3
10   10 354.0   3
11   11 369.0   3
12   12 205.0   2
13   13 199.0   1
14   14 394.0   3
15   15 231.3   2
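
For comparison, here is a base R sketch that avoids the Hmisc dependency by cutting at quantile breaks. It rebuilds das from the values printed above; the names width_groups and size_groups are illustrative. It also shows why cut(x, 3) alone does not give equal-sized groups for wt: it splits the range, not the observations.

```r
# das rebuilt from the printed output above
das <- data.frame(anim = 1:15,
                  wt = c(181.0, 179.0, 180.5, 201.0, 201.5, 245.0, 246.4, 189.3,
                         301.0, 354.0, 369.0, 205.0, 199.0, 394.0, 231.3))

# Equal-width intervals: group sizes depend on how the values are spread
width_groups <- cut(das$wt, 3, labels = FALSE)
table(width_groups)

# Equal-sized groups: use the tertiles of wt as breaks
tertiles <- quantile(das$wt, probs = seq(0, 1, length.out = 4))
size_groups <- cut(das$wt, breaks = tertiles, include.lowest = TRUE, labels = FALSE)
table(size_groups)
```

On this data the quantile-based codes agree with the wt2 column shown above. With heavily tied values the breaks can coincide, in which case cut() raises an error about non-unique breaks.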

Create a variable with 4 categories

I may be misunderstanding something, but you appear to have overlapping categories: Total >= 2 is Basic, but Total < 3 is Good. You may want to confirm the bounds for your groupings. Once that's sorted, you were actually pretty close to a working solution: you can nest ifelse statements, keeping in mind that they are evaluated in order. If a condition evaluates to TRUE "early" in the chain, the result is the value given for TRUE at that point; otherwise, evaluation moves on to the next ifelse.

Note that I've used 1, 2, and 3 as the 'breaks' for the categories, so the logic reads: "If it's less than 1, it's Limited. Otherwise, if it's less than 2, it's Basic. Otherwise, if it's less than 3, it's Good. Otherwise, it's Full."

set.seed(123)
df <- data.frame(total = runif(n = 15, min = 0, max = 4))
df

df$level <- ifelse(df$total < 1, "Limited",
            ifelse(df$total < 2, "Basic",
            ifelse(df$total < 3, "Good", "Full")))
> df
       total   level
1  0.5691772 Limited
2  2.1971386    Good
3  3.8163650    Full
4  2.3419334    Good
5  1.6180411   Basic
6  2.5915739    Good
7  1.2792825   Basic
8  1.2308800   Basic
9  0.8790705 Limited
10 1.4779555   Basic
11 3.9368768    Full
12 0.6168092 Limited
13 0.3641760 Limited
14 0.5676276 Limited
15 2.7600284    Good

With just four categories, an ifelse block is probably fine. If I were using many more bounds, I'd likely use a different approach. Edit: like thelatemail's; it's far cleaner.
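
One candidate for such a different approach is cut() with open-ended breaks. This is a sketch only, and not necessarily the answer referred to above; the four total values are taken from the output shown earlier, and right = FALSE is chosen so the intervals match the "less than" comparisons in the ifelse chain.

```r
# A few totals from the output above
total <- c(0.5691772, 2.1971386, 3.8163650, 1.6180411)

# Intervals [-Inf,1), [1,2), [2,3), [3,Inf) mirror total < 1, < 2, < 3, else
level <- cut(total,
             breaks = c(-Inf, 1, 2, 3, Inf),
             labels = c("Limited", "Basic", "Good", "Full"),
             right = FALSE)
level
```

Adding a category then only means adding one break and one label, rather than another nested ifelse. The result is a factor; wrap it in as.character() if plain strings are needed.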


