Binning a Numeric Variable

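How about cut()? Here the breaks are chosen so that each integer from 0 to 9 gets its own bin and everything larger is lumped into '10+':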

binned.x <- cut(x, breaks = c(-1:9, Inf), labels = c(as.character(0:9), '10+'))

Which yields:

# [1] 0   1   3   4   2   4   2   5   10+ 10+ 10+ 2   10+ 2   10+ 3   4   2
# Levels: 0 1 2 3 4 5 6 7 8 9 10+
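To see how the breaks map onto the labels, here is a minimal sketch on a small made-up vector (the x below is an assumption for illustration, not the data from the question). Because cut() uses right-closed intervals by default, 0 falls into (-1, 0], 3 into (2, 3], and anything above 9 into (9, Inf]:

```r
# Hypothetical input; the original x is not shown
x <- c(0, 1, 3, 9, 10, 25)

binned.x <- cut(x, breaks = c(-1:9, Inf), labels = c(as.character(0:9), '10+'))

as.character(binned.x)
# [1] "0"   "1"   "3"   "9"   "10+" "10+"
```

Note that this labelling only makes sense for non-negative integers: a value such as 0.5 would land in (0, 1] and be labelled 1, and anything at or below -1 becomes NA.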

Automatically creating bins for a numeric variable in R

It sounds like you want to plot a histogram of var. That is straightforward in ggplot2 using geom_histogram(). The key point is that ggplot() expects a data frame, so wrap your variable in data.frame() first, which you can do inside the ggplot() call:

library(ggplot2)
ggplot(data.frame(var), aes(var)) + geom_histogram(color='black', alpha=0.2)

Gives you this:

[Figure: histogram of var with the default number of bins]

The default is to use 30 bins, but you can specify either the number of bins via bins= or the width of each bin via binwidth=:

ggplot(data.frame(var), aes(var)) + geom_histogram(bins=10, color='black', alpha=0.2)

[Figure: histogram of var with 10 bins]
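For the binwidth= variant, a minimal sketch might look like the following. The data and the choice of 30 units per bin are made up for illustration, and boundary = 0 is added only so that bin edges fall on multiples of 30:

```r
library(ggplot2)

# Made-up data; the original var is not shown
var <- 0:365

# Fix the width of each bin instead of the number of bins
p <- ggplot(data.frame(var), aes(var)) +
  geom_histogram(binwidth = 30, boundary = 0, color = 'black', alpha = 0.2)

p
```

Fixing binwidth= is usually the better choice when the bins should line up with meaningful units, while bins= is convenient for a quick look.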

If you want the standard bars, geom_histogram() works just fine. If you use stat_bin() instead, it performs the same binning but lets you draw the result with a different geom:

ggplot(data.frame(var), aes(var)) +
  stat_bin(geom='area', bins=10, alpha=0.2, color='black')

[Figure: area plot of var binned into 10 bins]

If you just want the numbers from "binning" a variable rather than a plot, one of the simplest ways is cut(), which is part of base R (no extra package is needed).

Using cut() is pretty simple. You pass the vector and a breaks= argument. breaks can be a vector of the points where you want to "cut" (or "bin") your data, or a single number such as breaks=10, in which case cut() divides the range of the data into 10 equal-width bins. The result is a factor whose levels describe the interval covered by each bin. In the case of var with breaks=10, you get the following:

> var_cut <- cut(var, breaks = 10)
> levels(var_cut)
[1] "(-0.365,36.5]" "(36.5,73]" "(73,110]" "(110,146]" "(146,182]" "(182,219]" "(219,256]"
[8] "(256,292]" "(292,328]" "(328,365]"
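Once the factor exists, the counts per bin are one table() call away. The sketch below uses a made-up var spanning 0 to 365 (an assumption; the original data is not shown) and the column name n is arbitrary:

```r
# Made-up data spanning 0 to 365; the original var is not shown
var <- 0:365

var_cut <- cut(var, breaks = 10)

# Number of observations in each of the 10 bins
bin_counts <- table(var_cut)

# The same information as a data frame, one row per bin
bin_df <- as.data.frame(bin_counts, responseName = "n")
bin_df
```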

Categorize numeric variable into group/bins/breaks

I would use findInterval() here.

First, make up some sample data:

set.seed(1)
ages <- floor(runif(20, min = 20, max = 50))
ages
# [1] 27 31 37 47 26 46 48 39 38 21 26 25 40 31 43 34 41 49 31 43

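Then use findInterval() to categorize the ages vector. Each value gets the index of the interval it falls in: 1 for [20, 30), 2 for [30, 40), and 3 for 40 and above: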

findInterval(ages, c(20, 30, 40))
# [1] 1 2 2 3 1 3 3 2 2 1 1 1 3 2 3 2 3 3 2 3
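Since findInterval() returns plain integer indices, they can be used to look up readable labels. A minimal sketch, where the label strings are hypothetical and the ages are copied from the output above:

```r
ages <- c(27, 31, 37, 47, 26, 46, 48, 39, 38, 21,
          26, 25, 40, 31, 43, 34, 41, 49, 31, 43)

# Hypothetical labels, one per interval defined by the breakpoints 20, 30, 40
age_labels <- c("20-29", "30-39", "40+")

age_group <- age_labels[findInterval(ages, c(20, 30, 40))]

head(age_group, 4)
# [1] "20-29" "30-39" "30-39" "40+"
```

One caveat: a value below 20 would get index 0, and a 0 subscript silently drops that element, so guard against it if your data can fall below the first breakpoint.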

Alternatively, as recommended in the comments, cut() is also useful here:

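# Factor result; right = FALSE makes each bin closed on the left, e.g. [20,30)
cut(ages, breaks=c(20, 30, 40, 50), right = FALSE)
# Same bins, returned as integer codes instead of a factor
cut(ages, breaks=c(20, 30, 40, 50), right = FALSE, labels = FALSE)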
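A short sketch of how the two approaches compare, using the ages from the output above. For these data the integer codes agree, but they differ at the upper edge, because cut() needs an explicit top break while findInterval() treats the last interval as open-ended:

```r
ages <- c(27, 31, 37, 47, 26, 46, 48, 39, 38, 21,
          26, 25, 40, 31, 43, 34, 41, 49, 31, 43)

age_bins <- cut(ages, breaks = c(20, 30, 40, 50), right = FALSE)
levels(age_bins)
# [1] "[20,30)" "[30,40)" "[40,50)"

# For these data the integer codes match findInterval()
codes <- cut(ages, breaks = c(20, 30, 40, 50), right = FALSE, labels = FALSE)
all(codes == findInterval(ages, c(20, 30, 40)))
# [1] TRUE

# At the upper edge they differ: 50 lies outside [40,50) for cut()
cut(50, breaks = c(20, 30, 40, 50), right = FALSE, labels = FALSE)
# [1] NA
findInterval(50, c(20, 30, 40))
# [1] 3
```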

Efficiently Binning Data into specified bins with dplyr

The fuzzyjoin package implements dplyr-style range/interval joins (its interval joins rely on the Bioconductor package IRanges being installed):

library(fuzzyjoin)

# x is the spectra table (it holds Wavelength); y is the table of bins
interval_left_join(
  test_spectra,
  FJX_bins,
  by = c('Wavelength' = 'Lambda_Start', 'Wavelength' = 'Lambda_End')
)
# A tibble: 52 x 5
#    Wavelength    Sigma Bin_Number Lambda_Start Lambda_End
#         <int>    <dbl>      <int>        <dbl>      <dbl>
#  1        289 3.98e-20          1          289       298.
#  2        290 3.89e-20          1          289       298.
#  3        291 3.77e-20          1          289       298.
#  4        292 3.64e-20          1          289       298.
#  5        293 3.54e-20          1          289       298.
#  6        294 3.39e-20          1          289       298.
#  7        295 3.25e-20          1          289       298.
#  8        296 3.09e-20          1          289       298.
#  9        297 2.93e-20          1          289       298.
# 10        298 2.80e-20          1          289       298.
# … with 42 more rows
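If the bins are contiguous and sorted, the same lookup can also be sketched in base R with findInterval(), which avoids the extra dependency. This is a different technique from the interval join, and the two small data frames below are made up for illustration; only the column names follow the output shown above:

```r
# Hypothetical bin table: contiguous, sorted by Lambda_Start
bins <- data.frame(
  Bin_Number   = 1:3,
  Lambda_Start = c(289, 298.25, 307.45),
  Lambda_End   = c(298.25, 307.45, 312.45)
)

# Hypothetical spectra; 320 lies beyond the last bin
spectra <- data.frame(Wavelength = c(289L, 295L, 300L, 310L, 320L))

# Index of the last bin whose start is <= each wavelength (0 if none)
idx <- findInterval(spectra$Wavelength, bins$Lambda_Start)

# Wavelengths before the first bin or past the matched bin's end get NA
outside <- idx == 0 | spectra$Wavelength >= bins$Lambda_End[pmax(idx, 1)]
idx[outside] <- NA

spectra$Bin_Number <- bins$Bin_Number[idx]
spectra
```

This treats each bin as [Lambda_Start, Lambda_End), so a wavelength exactly on a boundary goes to the bin that starts there.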

How do I bin a variable across a number of observations for each specimen?

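You can use cut() to divide Hue into categories, count the observations per category, use complete() to fill in the categories that have no observations, and get the data in wide format using pivot_wider(). Here breaks and labels are the vectors of cut points and bin labels.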

library(dplyr)
library(tidyr)

df %>%
  count(Industry, Logo, Hue = cut(Hue, breaks, labels)) %>%
  complete(Industry, Hue = labels, fill = list(n = 0)) %>%
  fill(Logo) %>%
  arrange(match(Hue, labels)) %>%
  pivot_wider(names_from = Hue, values_from = n)

#   Industry  Logo   `[0-45)` `[45-90)` `[90-135)` `[135-180)` `[180-225)` `[225-270)` `[270-315)` `[315-360)`
#   <chr>     <chr>     <dbl>     <dbl>      <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
# 1 Fossil    Petrox        3         0          0           0           2           0           0           0
# 2 Renewable Windo         1         0          0           0           0           0           1           1
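The pipeline relies on breaks and labels being defined beforehand. One plausible way to build them, inferred only from the column names in the output (so treat both definitions as assumptions), is sketched below in base R with a made-up df. Because the labels read as left-closed intervals, the sketch passes right = FALSE, which is also an assumption:

```r
# Assumed definitions; the original breaks and labels are not shown
breaks <- seq(0, 360, by = 45)
labels <- paste0("[", head(breaks, -1), "-", tail(breaks, -1), ")")

head(labels, 3)
# [1] "[0-45)"   "[45-90)"  "[90-135)"

# Made-up data for illustration
df <- data.frame(
  Industry = c("Fossil", "Fossil", "Renewable"),
  Hue      = c(10, 200, 300)
)

# right = FALSE gives left-closed bins such as [0, 45), matching the labels
hue_bin <- cut(df$Hue, breaks = breaks, labels = labels, right = FALSE)

# table() keeps empty factor levels, so every bin appears as a column
counts <- table(df$Industry, hue_bin)
counts
```

Because cut() returns a factor that carries all eight levels, bins with no observations still show up with a count of 0, which is the same effect complete() achieves in the tidyverse version.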

