Create Category Based on Range in R

Create category based on range in R

Why didn't cut work? Did you not assign to a new column or something?

> data = data.frame(x = c(3, 4, 6, 12))
> data$group = cut(data$x, c(0, 5, 10, 15))
> data
   x   group
1  3   (0,5]
2  4   (0,5]
3  6  (5,10]
4 12 (10,15]

What you've created there is a factor object in a column of your data frame. The text displayed is the levels of the factor, and you can change them by assignment:

levels(data$group) = c("0-5", "6-10", ">10")
data
   x group
1  3   0-5
2  4   0-5
3  6  6-10
4 12   >10

Read some basic R docs on factors and you'll get it.
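As an aside, cut can also take the labels directly, which avoids the separate levels() assignment. A minimal sketch using the same breaks:

data <- data.frame(x = c(3, 4, 6, 12))
# labels replace the default "(0,5]"-style interval names in one step
data$group <- cut(data$x, c(0, 5, 10, 15), labels = c("0-5", "6-10", ">10"))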

How to categorize numerical ranges in R

We can use findInterval in base R

df[-1] <- lapply(df[-1], findInterval, c(-Inf, -2, 2, Inf))
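To see what that one-liner does, here is a minimal sketch with a made-up numeric data frame (df and its columns are hypothetical, assumed to be an id column plus numeric columns to bin):

# hypothetical data
df <- data.frame(id = 1:4, a = c(-3, -1, 2.5, 4), b = c(-5, 0, 1.9, 7))

# findInterval() returns which interval each value falls into:
# below -2 -> 1, from -2 up to (but not including) 2 -> 2, 2 and above -> 3
df[-1] <- lapply(df[-1], findInterval, c(-Inf, -2, 2, Inf))
df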

Create categorical variable in R based on range

Ian's answer (cut) is the most common way to do this, as far as I know.

I prefer to use shingle, from the lattice package.

The argument that specifies the binning intervals seems a little more intuitive to me.

You use shingle like so:

library(lattice)

# mock some data
data = sample(0:40, 200, replace = TRUE)

a = c(0, 5); b = c(5, 9); c = c(9, 19); d = c(19, 33); e = c(33, 41)

my_bins = matrix(rbind(a, b, c, d, e), ncol = 2)

# returns (the binning intervals I've set):
     [,1] [,2]
[1,]    0    5
[2,]    5    9
[3,]    9   19
[4,]   19   33
[5,]   33   41

shx = shingle(data, intervals = my_bins)

# 'shx' at the interactive prompt will give you a nice frequency table:
# Intervals:
  min max count
1   0   5    23
2   5   9    17
3   9  19    56
4  19  33    76
5  33  41    46
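If I remember correctly, the intervals themselves are stored as the levels of the shingle object, so you can pull them back out later (a quick check at the prompt):

levels(shx)   # the binning intervals used by the shingle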

New category based on sequence of date ranges

We can create a sequence of yearly dates from the min of Date to the max and use that in cut:

period <- seq(min(df$Date), max(df$Date), by = "1 year")
df$period <- cut(df$Date, breaks = c(period, Inf),
                 labels = paste0("period", seq_along(period)))

df
#          Date          AHD  period
#1   2001-08-01 10.792934251 period1
#2   2001-09-01 12.277429242 period1
#3   2001-10-01 13.084441177 period1
#4   2001-11-01  9.654302297 period1
#5   2001-12-01 12.429124689 period1
#6   2002-01-01 12.506055892 period1
#7   2002-02-01 11.425260040 period1
#8   2002-03-01 11.453368144 period1
#9   2002-04-01 11.435548001 period1
#10  2002-05-01 11.109962171 period1
#11  2002-06-01 11.522807300 period1
#12  2002-07-01 11.001613555 period1
#13  2002-08-01 11.223746105 period2
#....

We can also use findInterval, which gives the same result as cut:

df$period <- paste0("period", findInterval(df$Date, period))

data

set.seed(1234)
Date <- seq(as.Date("2001/08/01"), by = "month", length.out = 60)
AHD <- rnorm(60, mean = 12, sd = 1)
df <- data.frame(Date=Date, AHD = AHD)

Categorization based on value ranges in multiple columns using dplyr

One dplyr and purrr solution could be:

library(dplyr)
library(purrr)

criteria_up <- criteria %>%
  group_by(category) %>%
  mutate(min.val = ifelse(!is.na(min.val), paste(trait, min.val, sep = " >= "), NA_character_),
         max.val = ifelse(!is.na(max.val), paste(trait, max.val, sep = " <= "), NA_character_)) %>%
  summarise(val = paste(paste(na.omit(min.val), collapse = " & "),
                        paste(na.omit(max.val), collapse = " & "),
                        sep = " & "))

map2_dfr(.x = criteria_up %>% pull(val),
         .y = criteria_up %>% pull(category),
         ~ mtcars %>%
             filter(!!rlang::parse_expr(.x)) %>%
             mutate(category = !!.y)) %>%
  full_join(mtcars)

mpg cyl disp hp drat wt qsec vs am gear carb category
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 high
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 high
3 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 low
4 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 low
5 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 low
6 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 low
7 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 low
8 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 low
9 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 low
10 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 low
11 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 low
12 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 low
13 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 low
14 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 low
15 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 low
16 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 medium
17 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 medium
18 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 medium
19 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 medium
20 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 medium
21 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 medium
22 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 medium
23 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 medium
24 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 medium
25 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 medium
26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 medium
27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 medium
28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 medium
29 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 medium
30 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 <NA>
31 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 <NA>
32 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 <NA>
33 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 <NA>
34 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 <NA>
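The key step above is rlang::parse_expr(), which turns each pasted condition string into an expression that filter() can then evaluate. A minimal sketch of just that mechanism (the condition string here is made up):

library(dplyr)

cond <- "mpg >= 30 & cyl <= 4"                 # hypothetical condition string
mtcars %>% filter(!!rlang::parse_expr(cond))   # evaluate the string as a filter condition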

Creating a summary table based on range categories in R

First, just create your intervals using cut, and then use table and cbind the results:

intervals <- cut(input$depth, breaks = c(0, 3, 6, 10))

cbind(table(intervals),
      table(intervals, input$region),
      table(intervals, input$geomorph))
#          A R T F P S
# (0,3]  1 1 0 0 1 0 0
# (3,6]  2 0 1 1 1 0 1
# (6,10] 2 0 1 1 0 1 1

The output of the above is a matrix. Use the following if you want a data.frame:

temp <- cbind(table(intervals),
              table(intervals, input$region),
              table(intervals, input$geomorph))

temp <- data.frame(depth.category = rownames(temp),
                   as.data.frame(temp, row.names = 1:nrow(temp)))
names(temp)[2] <- "Total"
temp
#   depth.category Total A R T F P S
# 1          (0,3]     1 1 0 0 1 0 0
# 2          (3,6]     2 0 1 1 1 0 1
# 3         (6,10]     2 0 1 1 0 1 1

Assign a category according to the range of the value as a new column in Python

You can use pandas.cut to organise this into bins in line with your example.

import pandas as pd

df = pd.DataFrame([500, 1000, 1001, 1560, 1500, 2000, 2300, 7, 1499], columns=['INDUST_CODE'])

   INDUST_CODE
0          500
1         1000
2         1001
3         1560
4         1500
5         2000
6         2300
7            7
8         1499

df['Categories'] = pd.cut(df['INDUST_CODE'],
                          [0, 999, 1499, 1999, 100000],
                          labels=['Military_service', 'Public_service', 'Private_sector', 'Others'])

   INDUST_CODE        Categories
0          500  Military_service
1         1000    Public_service
2         1001    Public_service
3         1560    Private_sector
4         1500    Private_sector
5         2000            Others
6         2300            Others
7            7  Military_service
8         1499    Public_service
Categories (4, object): [Military_service < Public_service < Private_sector < Others]
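For comparison, the same binning in base R is a cut call with identical breaks and labels (cut's default right-closed intervals match what pandas.cut does here):

INDUST_CODE <- c(500, 1000, 1001, 1560, 1500, 2000, 2300, 7, 1499)
Categories  <- cut(INDUST_CODE,
                   breaks = c(0, 999, 1499, 1999, 100000),
                   labels = c("Military_service", "Public_service",
                              "Private_sector", "Others"))
data.frame(INDUST_CODE, Categories)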

Creating a new column of categorical variables based on date range

You have a few syntax issues in your ifelse statement.

Since you are using dplyr, you can simplify this with the case_when and between functions.

library(dplyr)

dat %>%
  mutate(new_var = case_when(
    between(date, as.Date("1954-03-13"), as.Date("1958-12-07")) ~ "test1",
    between(date, as.Date("1958-09-14"), as.Date("1964-03-07")) ~ "test2"
  ))

#  record_id       date new_var
#1    111111 1956-10-28   test1
#2    222222 1956-10-28   test1
#3    333333 1956-10-29   test1
#4    444444 1956-10-29   test1
#5    555555 1956-10-30   test1
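For a self-contained run, dat can be reconstructed from the output above (the exact column types are an assumption):

dat <- data.frame(record_id = c(111111, 222222, 333333, 444444, 555555),
                  date = as.Date(c("1956-10-28", "1956-10-28", "1956-10-29",
                                   "1956-10-29", "1956-10-30")))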

Create and fill new columns based on range information from two other columns

Now-tested code; an implementation of the strategy described in my comment:

I'd make a matrix with named columns and assign with row and col indices. You can then either attach it as a matrix or convert it to a data frame.

Mat <- matrix(0, nrow(df), 10)   # 200 for the real case
maxwk <- 10
colnames(Mat) <- paste0("week", 1:maxwk)

# Add an extra column that marks the condition.
# If there are always exactly 3 rows per group, just rep(1:3, ngrps).

# Need to define a value for cond that identifies the three possibilities:

df$cond <- rep(1:3, length.out = nrow(df))  # assume all groups have exactly 3

for (r in 1:nrow(df)) {
  # first row in group
  if (df$cond[r] == 1) {
    Idx <- paste0("week", df$start[r]:df$stop[r])    # start:stop
    Mat[r, Idx] <- 1; next }
  # second
  if (df$cond[r] == 2) {
    Idx <- paste0("week", df$stop[r]:df$unstop[r])   # stop:unstop
    Mat[r, Idx] <- 1; next }
  # third
  if (df$cond[r] == 3) {
    Idx <- paste0("week", df$unstop[r]:maxwk)        # unstop:max
    Mat[r, Idx] <- 1; next }
}
df
  group start stop unstop cond
1     1     2    4      5    1
2     1     2    7      7    2
3     1     2    8     10    3
4     2     7    6      7    1
5     2     7    8      9    2
6     2     7    9     10    3

> Mat
     week1 week2 week3 week4 week5 week6 week7 week8 week9 week10
[1,]     0     1     1     1     0     0     0     0     0      0
[2,]     0     0     0     0     0     0     1     0     0      0
[3,]     0     0     0     0     0     0     0     0     0      1
[4,]     0     0     0     0     0     1     1     0     0      0
[5,]     0     0     0     0     0     0     0     1     1      0
[6,]     0     0     0     0     0     0     0     0     0      1

You could cbind these.
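For instance, one way to attach the week columns to df (a sketch, assuming you want a single wide data frame):

df_wide <- cbind(df, as.data.frame(Mat))   # one week* column per week, 0/1 filled
head(df_wide)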

There might be performance improvements possible. You could use switch(cond, ...) to dispatch to the correct logic rather than the if (cond == .) { ...; next } method. This should be much faster than code that uses ifelse or if_else. If you want to see how that's implemented, endorse the general strategy with a checkmark and I'll spend the time to add the alternate code.

I ran a benchmark after setting up both methods for a 100-week maximum. (The warnings are from the code in the question.)

> perf_results <- microbenchmark(
+   first.method = do_first(df), sec.method = do_second(df), times = 10)
There were 50 or more warnings (use warnings() to see the first 50)
> perf_results
Unit: microseconds
         expr         min        lq         mean       median          uq        max neval
 first.method 4385001.123 4416568.8 4581549.9624 4450691.5455 4615753.753 5350416.80    10
   sec.method     146.432     149.6     181.6137     188.2125     193.307     243.47    10

I wanted to see if a switch method of selecting the proper algorithm for a row would improve performance. It did, to a degree that surprised me. The switch function is analogous to the case statement in Pascal and many other languages. It has two forms whose behavior differs depending on whether the first argument, EXPR, is numeric or character. Here the "dispatch" version is chosen because the cond column is numeric.

do_third <- function(df) {
  Mat <- matrix(0, nrow(df), 100)   # 200 for the real case
  maxwk <- 100
  colnames(Mat) <- paste0("week", 1:maxwk)
  df$cond <- rep(1:3, length.out = nrow(df))  # assume all groups have exactly 3
  for (r in 1:nrow(df)) {
    switch(df[r, "cond"],
      { # first row in each group of 3
        Idx <- paste0("week", df$start[r]:df$stop[r])    # start:stop
        Mat[r, Idx] <- 1 },
      { # second row in group
        Idx <- paste0("week", df$stop[r]:df$unstop[r])   # stop:unstop
        Mat[r, Idx] <- 1 },
      { # third
        Idx <- paste0("week", df$unstop[r]:maxwk)        # unstop:max
        Mat[r, Idx] <- 1 })
  }
  Mat  # return the filled matrix
}

New microbenchmark:

perf_results
Unit: nanoseconds
         expr        min         lq         mean     median         uq        max neval cld
 first.method 4304901359 4351893534 4387626725.8 4372151785 4416247096 4543314742    10   b
   sec.method     162803     173855    2588492.1     215309     216878   24081195    10  a
   third.meth         34         53        610.6        877        940        963    10  a

