Categorize Continuous Variable with Dplyr

Categorize numeric variable with mutate

library(dplyr)

set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10))

# Bin a into quintiles, using its own quantiles as the break points
df %>% mutate(a = cut(a, breaks = quantile(a, probs = seq(0, 1, 0.2))))

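giving the output below. Row 8 is NA because it holds the minimum of a, which equals the first break; cut() intervals are left-open by default, so that value falls outside every bin unless include.lowest = TRUE is passed:

                 a          b
1  (-0.586,-0.316]  1.2240818
2   (-0.316,0.094]  0.3598138
3      (0.68,1.72]  0.4007715
4   (-0.316,0.094]  0.1106827
5     (0.094,0.68] -0.5558411
6      (0.68,1.72]  1.7869131
7     (0.094,0.68]  0.4978505
8             <NA> -1.9666172
9   (-1.27,-0.586]  0.7013559
10 (-0.586,-0.316] -0.4727914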
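
A minimal sketch of one way to keep the smallest value from turning into NA, assuming every observation should land in a bin. It passes include.lowest = TRUE to cut(), which closes the first interval on the left. The column name a_bin and the intermediate breaks vector are introduced here for illustration only.

```r
library(dplyr)

set.seed(123)
df <- data.frame(a = rnorm(10), b = rnorm(10))

# Quintile breaks: the first break is min(a), the last is max(a)
breaks <- quantile(df$a, probs = seq(0, 1, 0.2))

# include.lowest = TRUE closes the first interval on the left,
# so the minimum of a is binned instead of becoming NA
out <- df %>%
  mutate(a_bin = cut(a, breaks = breaks, include.lowest = TRUE))

# Number of rows per bin
table(out$a_bin, useNA = "ifany")
```

With ten values and five quintile bins, each bin should hold two rows. Note that cut() stops with an error if the quantiles contain duplicate values, which can happen with heavily tied data.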

Categorize a continuous variable based on groups of n in R

You can use the integer division operator %/% to get the whole number part of dividing x by 10, then add 1 to it. This will give you the correct step number. Add this into a paste0 call to glue "step_" onto the front and you've got it:

df %>% mutate(z = paste0("step_", (x %/% 10 + 1)))
#> # A tibble: 13 x 3
#>        x       y z
#>    <dbl>   <dbl> <chr>
#>  1     0  0.595  step_1
#>  2     2  1.44   step_1
#>  3     6 -0.375  step_1
#>  4     9 -0.808  step_1
#>  5    10 -0.298  step_2
#>  6    13 -0.774  step_2
#>  7    14 -0.769  step_2
#>  8    17  0.335  step_2
#>  9    20  0.696  step_3
#> 10    21  0.284  step_3
#> 11    24 -0.568  step_3
#> 12    28 -0.0942 step_3
#> 13    29 -0.547  step_3
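
The same idea can be sketched as a self-contained example with an adjustable bin width. The helper name step_label and its width argument are hypothetical, not part of the original answer; the x values are copied from the output above and y is left out because the labels do not depend on it.

```r
library(dplyr)

# Hypothetical helper: label each value by the width-sized interval it falls in
step_label <- function(x, width = 10) {
  # x %/% width counts how many full widths fit below x; + 1 makes steps 1-based
  paste0("step_", x %/% width + 1)
}

# x values taken from the output above
df <- tibble(x = c(0, 2, 6, 9, 10, 13, 14, 17, 20, 21, 24, 28, 29))

out <- df %>% mutate(z = step_label(x))
out
```

This assumes x is non-negative; a negative x would give a step number of 0 or below.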

Categorize numeric variable into groups/bins/breaks

I would use findInterval() here:

First, make up some sample data:

set.seed(1)
ages <- floor(runif(20, min = 20, max = 50))
ages
# [1] 27 31 37 47 26 46 48 39 38 21 26 25 40 31 43 34 41 49 31 43

Use findInterval() to categorize your "ages" vector.

findInterval(ages, c(20, 30, 40))
# [1] 1 2 2 3 1 3 3 2 2 1 1 1 3 2 3 2 3 3 2 3

Alternatively, as recommended in the comments, cut() is also useful here:

# Factor with levels [20,30), [30,40), [40,50)
cut(ages, breaks = c(20, 30, 40, 50), right = FALSE)
# Integer codes 1 to 3 instead of a factor
cut(ages, breaks = c(20, 30, 40, 50), right = FALSE, labels = FALSE)
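
To make the relationship between the two approaches concrete, here is a small sketch that compares their integer codes and then attaches readable labels. The ages are hard-coded from the printed vector above so the example does not depend on the random number generator; the label strings in age_labels are assumptions chosen for illustration.

```r
# ages as printed above
ages <- c(27, 31, 37, 47, 26, 46, 48, 39, 38, 21,
          26, 25, 40, 31, 43, 34, 41, 49, 31, 43)

# Integer bin codes from each approach
codes_fi  <- findInterval(ages, c(20, 30, 40))
codes_cut <- cut(ages, breaks = c(20, 30, 40, 50), right = FALSE, labels = FALSE)

# TRUE when both approaches assign every age to the same bin
all(codes_fi == codes_cut)

# Hypothetical labels, indexed by bin code
age_labels <- c("20-29", "30-39", "40-49")
age_group  <- factor(age_labels[codes_fi], levels = age_labels)
table(age_group)
```

The two differ outside the break range: findInterval() returns 0 for values below 20 and 3 for values of 50 or more, while cut() returns NA for both. The label lookup above therefore assumes all ages lie between 20 and 49.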

Recode continuous data into categorical data using is.na() and if_else() in R

I have provided a toy example to stand in for the code that you described:

df <- data.frame(x = c(1,2,3,4,NA,NA,NA,NA))

Here we have a data frame with continuous and NA values. Using dplyr, we can combine is.na() and if_else() to categorize "x":

library(dplyr)
df <- df %>%
  mutate(new_data = if_else(is.na(x), "is NA", "is not NA"))

This creates a new column that labels each NA value as "is NA" and every other value as "is not NA".
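
If the non-missing values also need to be split into categories, one option is to nest a second if_else() inside the first. The sketch below assumes a cutoff of 2 and the labels "high" and "low"; both are hypothetical and not taken from the original question.

```r
library(dplyr)

df <- data.frame(x = c(1, 2, 3, 4, NA, NA, NA, NA))

# Hypothetical cutoff, used only for illustration
cutoff <- 2

out <- df %>%
  mutate(new_data = if_else(
    is.na(x),
    "is NA",                            # missing values get their own label
    if_else(x > cutoff, "high", "low")  # non-missing values are split at the cutoff
  ))
out
```

Checking is.na(x) first matters: x > cutoff is itself NA for missing values, so the inner if_else() alone would leave those rows unlabeled.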

Recoding continuous variable into categorical with specific categories, in R using Tidyverse

A tidyverse approach would make use of dplyr::case_when to recode the variable like so:

data %>%
  mutate(age = case_when(
    `Age(Self-report)` < 35 ~ "18-34",
    `Age(Self-report)` >= 35 & `Age(Self-report)` < 55 ~ "35-54",
    `Age(Self-report)` >= 55 ~ "55+"  # >= so that an age of exactly 55 is not left as NA
  ))
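
Boundary ages are where a recode like this most easily goes wrong, so it can help to run it on a few edge values. The sketch below uses made-up data (18, 34, 35, 54, 55, 70 and one NA) and relies on case_when() checking its conditions in order, which means each condition only needs an upper bound. Only the column name comes from the snippet above; everything else is an assumption.

```r
library(dplyr)

# Hypothetical data covering each category boundary plus a missing value
data <- tibble(`Age(Self-report)` = c(18, 34, 35, 54, 55, 70, NA))

out <- data %>%
  mutate(age = case_when(
    # Conditions are checked in order, so each one only needs an upper bound
    `Age(Self-report)` < 35 ~ "18-34",
    `Age(Self-report)` < 55 ~ "35-54",
    `Age(Self-report)` >= 55 ~ "55+"
    # Missing ages match no condition and stay NA
  ))
out
```

Any row that satisfies none of the conditions is returned as NA, so an unexpected NA in the result is a sign that a boundary value was missed.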

