Create Category Based on Range in R

Create category based on range in R

Why didn't cut work? Did you not assign to a new column or something?

> data = data.frame(x = c(3, 4, 6, 12))
> data$group = cut(data$x, c(0, 5, 10, 15))
> data
   x   group
1  3   (0,5]
2  4   (0,5]
3  6  (5,10]
4 12 (10,15]

What you've created there is a factor object in a column of your data frame. The text displayed is the levels of the factor, and you can change them by assignment:

levels(data$group) = c("0-5", "6-10", ">10")
data
   x group
1  3   0-5
2  4   0-5
3  6  6-10
4 12   >10

Read some basic R docs on factors and you'll get it.
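As an aside, cut can also take the labels directly, which avoids the separate levels() assignment. A minimal sketch using the same breaks:

data <- data.frame(x = c(3, 4, 6, 12))
# labels replace the default "(0,5]"-style interval names in one step
data$group <- cut(data$x, c(0, 5, 10, 15), labels = c("0-5", "6-10", ">10"))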

How to categorize numerical ranges in R

We can use findInterval in base R

df[-1] <- lapply(df[-1], findInterval, c(-Inf, -2, 2, Inf))
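To see what that one-liner does, here is a minimal sketch with a made-up numeric data frame (df and its columns are hypothetical, assumed to be an id column plus numeric columns to bin):

# hypothetical data
df <- data.frame(id = 1:4, a = c(-3, -1, 2.5, 4), b = c(-5, 0, 1.9, 7))

# findInterval() returns which interval each value falls into:
# below -2 -> 1, from -2 up to (but not including) 2 -> 2, 2 and above -> 3
df[-1] <- lapply(df[-1], findInterval, c(-Inf, -2, 2, Inf))
df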

Create categorical variable in R based on range

Ian's answer (cut) is the most common way to do this, as far as I know.

I prefer to use shingle, from the lattice package.

The argument that specifies the binning intervals seems a little more intuitive to me.

You use shingle like so:

library(lattice)

# mock some data
data = sample(0:40, 200, replace = TRUE)

a = c(0, 5); b = c(5, 9); c = c(9, 19); d = c(19, 33); e = c(33, 41)

my_bins = matrix(rbind(a, b, c, d, e), ncol = 2)

# returns (the binning intervals I've set):
     [,1] [,2]
[1,]    0    5
[2,]    5    9
[3,]    9   19
[4,]   19   33
[5,]   33   41

shx = shingle(data, intervals = my_bins)

# 'shx' at the interactive prompt will give you a nice frequency table:
# Intervals:
  min max count
1   0   5    23
2   5   9    17
3   9  19    56
4  19  33    76
5  33  41    46
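If I remember correctly, the intervals themselves are stored as the levels of the shingle object, so you can pull them back out later (a quick check at the prompt):

levels(shx)   # the binning intervals used by the shingle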

New category based on sequence of date ranges

We can create a sequence of yearly dates from the min of Date to the max and use that in cut:

period <- seq(min(df$Date), max(df$Date), by = "1 year")
df$period <- cut(df$Date, breaks = c(period, Inf),
                 labels = paste0("period", seq_along(period)))

df
#          Date          AHD  period
#1   2001-08-01 10.792934251 period1
#2   2001-09-01 12.277429242 period1
#3   2001-10-01 13.084441177 period1
#4   2001-11-01  9.654302297 period1
#5   2001-12-01 12.429124689 period1
#6   2002-01-01 12.506055892 period1
#7   2002-02-01 11.425260040 period1
#8   2002-03-01 11.453368144 period1
#9   2002-04-01 11.435548001 period1
#10  2002-05-01 11.109962171 period1
#11  2002-06-01 11.522807300 period1
#12  2002-07-01 11.001613555 period1
#13  2002-08-01 11.223746105 period2
#....

We can also use findInterval, which gives the same result as cut:

df$period <- paste0("period", findInterval(df$Date, period))

data

set.seed(1234)
Date <- seq(as.Date("2001/08/01"), by = "month", length.out = 60)
AHD <- rnorm(60, mean = 12, sd = 1)
df <- data.frame(Date=Date, AHD = AHD)

Categorization based on value ranges in multiple columns using dplyr

One dplyr and purrr solution could be:

library(dplyr)
library(purrr)

criteria_up <- criteria %>%
  group_by(category) %>%
  mutate(min.val = ifelse(!is.na(min.val), paste(trait, min.val, sep = " >= "), NA_character_),
         max.val = ifelse(!is.na(max.val), paste(trait, max.val, sep = " <= "), NA_character_)) %>%
  summarise(val = paste(paste(na.omit(min.val), collapse = " & "),
                        paste(na.omit(max.val), collapse = " & "),
                        sep = " & "))

map2_dfr(.x = criteria_up %>% pull(val),
         .y = criteria_up %>% pull(category),
         ~ mtcars %>%
             filter(!!rlang::parse_expr(.x)) %>%
             mutate(category = !!.y)) %>%
  full_join(mtcars)

mpg cyl disp hp drat wt qsec vs am gear carb category
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 high
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 high
3 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 low
4 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 low
5 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 low
6 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 low
7 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 low
8 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 low
9 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 low
10 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 low
11 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 low
12 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 low
13 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 low
14 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 low
15 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 low
16 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 medium
17 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 medium
18 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 medium
19 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 medium
20 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 medium
21 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 medium
22 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 medium
23 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 medium
24 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 medium
25 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 medium
26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 medium
27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 medium
28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 medium
29 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 medium
30 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 <NA>
31 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 <NA>
32 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 <NA>
33 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 <NA>
34 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 <NA>
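The key step above is rlang::parse_expr(), which turns each pasted condition string into an expression that filter() can then evaluate. A minimal sketch of just that mechanism (the condition string here is made up):

library(dplyr)

cond <- "mpg >= 30 & cyl <= 4"                 # hypothetical condition string
mtcars %>% filter(!!rlang::parse_expr(cond))   # evaluate the string as a filter condition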

Creating a summary table based on range categories in R

First, just create your intervals using cut, and then use table and cbind the results:

intervals <- cut(input$depth, breaks = c(0, 3, 6, 10))

cbind(table(intervals),
      table(intervals, input$region),
      table(intervals, input$geomorph))
#          A R T F P S
# (0,3]  1 1 0 0 1 0 0
# (3,6]  2 0 1 1 1 0 1
# (6,10] 2 0 1 1 0 1 1

The output of the above is a matrix. Use the following if you want a data.frame:

temp <- cbind(table(intervals),
              table(intervals, input$region),
              table(intervals, input$geomorph))

temp <- data.frame(depth.category = rownames(temp),
                   as.data.frame(temp, row.names = 1:nrow(temp)))
names(temp)[2] <- "Total"
temp
#   depth.category Total A R T F P S
# 1          (0,3]     1 1 0 0 1 0 0
# 2          (3,6]     2 0 1 1 1 0 1
# 3         (6,10]     2 0 1 1 0 1 1

Assign a category according to the range of the value as a new column in Python

You can use pandas.cut to organise this into bins in line with your example.

import pandas as pd

df = pd.DataFrame([500, 1000, 1001, 1560, 1500, 2000, 2300, 7, 1499], columns=['INDUST_CODE'])

   INDUST_CODE
0          500
1         1000
2         1001
3         1560
4         1500
5         2000
6         2300
7            7
8         1499

df['Categories'] = pd.cut(df['INDUST_CODE'],
                          [0, 999, 1499, 1999, 100000],
                          labels=['Military_service', 'Public_service', 'Private_sector', 'Others'])

   INDUST_CODE        Categories
0          500  Military_service
1         1000    Public_service
2         1001    Public_service
3         1560    Private_sector
4         1500    Private_sector
5         2000            Others
6         2300            Others
7            7  Military_service
8         1499    Public_service
Categories (4, object): [Military_service < Public_service < Private_sector < Others]
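For comparison, the same binning in base R is a cut call with identical breaks and labels (cut's default right-closed intervals match what pandas.cut does here):

INDUST_CODE <- c(500, 1000, 1001, 1560, 1500, 2000, 2300, 7, 1499)
Categories  <- cut(INDUST_CODE,
                   breaks = c(0, 999, 1499, 1999, 100000),
                   labels = c("Military_service", "Public_service",
                              "Private_sector", "Others"))
data.frame(INDUST_CODE, Categories)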

Creating a new column of categorical variables based on date range

You have a few syntax issues in your ifelse statement.

Since you are using dplyr, you can simplify this with the case_when and between functions.

library(dplyr)

dat %>%
  mutate(new_var = case_when(
    between(date, as.Date("1954-03-13"), as.Date("1958-12-07")) ~ "test1",
    between(date, as.Date("1958-09-14"), as.Date("1964-03-07")) ~ "test2"
  ))

#  record_id       date new_var
#1    111111 1956-10-28   test1
#2    222222 1956-10-28   test1
#3    333333 1956-10-29   test1
#4    444444 1956-10-29   test1
#5    555555 1956-10-30   test1
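For a self-contained run, dat can be reconstructed from the output above (the exact column types are an assumption):

dat <- data.frame(record_id = c(111111, 222222, 333333, 444444, 555555),
                  date = as.Date(c("1956-10-28", "1956-10-28", "1956-10-29",
                                   "1956-10-29", "1956-10-30")))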

Create and fill new columns based on range information from two other columns

Now-tested code; an implementation of the strategy described in my comment:

I'd make a matrix with named columns and assign with row and col indices. You can then either attach it as a matrix or convert it to a data frame.

Mat <- matrix(0, nrow(df), 10)   # 200 for the real case
maxwk <- 10
colnames(Mat) <- paste0("week", 1:maxwk)

# Add an extra column that marks the condition.
# If there are always exactly 3 rows per group, just rep(1:3, ngrps).

# Need to define a value for cond that identifies the three possibilities:

df$cond <- rep(1:3, length.out = nrow(df))  # assume all groups have exactly 3

for (r in 1:nrow(df)) {
  # first row in group
  if (df$cond[r] == 1) {
    Idx <- paste0("week", df$start[r]:df$stop[r])    # start:stop
    Mat[r, Idx] <- 1; next }
  # second
  if (df$cond[r] == 2) {
    Idx <- paste0("week", df$stop[r]:df$unstop[r])   # stop:unstop
    Mat[r, Idx] <- 1; next }
  # third
  if (df$cond[r] == 3) {
    Idx <- paste0("week", df$unstop[r]:maxwk)        # unstop:max
    Mat[r, Idx] <- 1; next }
}
df
  group start stop unstop cond
1     1     2    4      5    1
2     1     2    7      7    2
3     1     2    8     10    3
4     2     7    6      7    1
5     2     7    8      9    2
6     2     7    9     10    3

> Mat
     week1 week2 week3 week4 week5 week6 week7 week8 week9 week10
[1,]     0     1     1     1     0     0     0     0     0      0
[2,]     0     0     0     0     0     0     1     0     0      0
[3,]     0     0     0     0     0     0     0     0     0      1
[4,]     0     0     0     0     0     1     1     0     0      0
[5,]     0     0     0     0     0     0     0     1     1      0
[6,]     0     0     0     0     0     0     0     0     0      1

You could cbind these.
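For instance, one way to attach the week columns to df (a sketch, assuming you want a single wide data frame):

df_wide <- cbind(df, as.data.frame(Mat))   # one week* column per week, 0/1 filled
head(df_wide)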

There might be performance improvements possible. You could use switch(cond, ...) to dispatch to the correct logic rather than the if (cond == .) { ...; next } method. This should be much faster than code that uses ifelse or if_else. If you want to see how that's implemented, endorse the general strategy with a checkmark and I'll spend the time to add the alternate code.

I ran a benchmark after setting up both methods for a 100-week maximum. (The warnings are from the code in the question.)

> perf_results <- microbenchmark(
+   first.method = do_first(df), sec.method = do_second(df), times = 10)
There were 50 or more warnings (use warnings() to see the first 50)
> perf_results
Unit: microseconds
         expr         min        lq         mean       median          uq        max neval
 first.method 4385001.123 4416568.8 4581549.9624 4450691.5455 4615753.753 5350416.80    10
   sec.method     146.432     149.6     181.6137     188.2125     193.307     243.47    10

I wanted to see if a switch method of selecting the proper algorithm for a row would improve performance. It did, to a degree that surprised me. The switch function is analogous to the case statement in Pascal and many other languages. It has two forms whose behavior differs depending on whether the first argument, EXPR, is numeric or character. Here the "dispatch" version is chosen because the cond column is numeric.

do_third <- function(df) {
  Mat <- matrix(0, nrow(df), 100)   # 200 for the real case
  maxwk <- 100
  colnames(Mat) <- paste0("week", 1:maxwk)
  df$cond <- rep(1:3, length.out = nrow(df))  # assume all groups have exactly 3
  for (r in 1:nrow(df)) {
    switch(df[r, "cond"],
      { # first row in each group of 3
        Idx <- paste0("week", df$start[r]:df$stop[r])    # start:stop
        Mat[r, Idx] <- 1 },
      { # second row in group
        Idx <- paste0("week", df$stop[r]:df$unstop[r])   # stop:unstop
        Mat[r, Idx] <- 1 },
      { # third
        Idx <- paste0("week", df$unstop[r]:maxwk)        # unstop:max
        Mat[r, Idx] <- 1 })
  }
  Mat  # return the filled matrix
}

New microbenchmark:

perf_results
Unit: nanoseconds
         expr        min         lq         mean     median         uq        max neval cld
 first.method 4304901359 4351893534 4387626725.8 4372151785 4416247096 4543314742    10   b
   sec.method     162803     173855    2588492.1     215309     216878   24081195    10  a
   third.meth         34         53        610.6        877        940        963    10  a

