Cut by Defined Interval
To cut into predefined intervals, specify a vector of break points using the breaks parameter.
Define some data:
x <- sample(0:20, 100, replace=TRUE)
x
Now cut x at 0, 10 and 20:
cut(x, breaks=c(0, 10, 20), include.lowest=TRUE)
[1] (10,20] [0,10] [0,10] (10,20] (10,20] (10,20] [0,10] (10,20] (10,20]
[10] (10,20] [0,10] (10,20] (10,20] (10,20] [0,10] (10,20] [0,10] [0,10]
[19] [0,10] (10,20] [0,10] [0,10] [0,10] (10,20] [0,10] (10,20] (10,20]
[28] (10,20] (10,20] [0,10] [0,10] [0,10] [0,10] (10,20] [0,10] [0,10]
[37] [0,10] [0,10] (10,20] (10,20] (10,20] (10,20] [0,10] (10,20] [0,10]
[46] (10,20] [0,10] (10,20] (10,20] [0,10] [0,10] (10,20] (10,20] (10,20]
[55] [0,10] [0,10] (10,20] [0,10] [0,10] [0,10] [0,10] (10,20] (10,20]
[64] (10,20] [0,10] [0,10] (10,20] (10,20] (10,20] (10,20] (10,20] (10,20]
[73] (10,20] [0,10] [0,10] [0,10] (10,20] [0,10] (10,20] [0,10] (10,20]
[82] [0,10] [0,10] (10,20] [0,10] [0,10] [0,10] (10,20] (10,20] [0,10]
[91] [0,10] [0,10] (10,20] (10,20] [0,10] [0,10] [0,10] [0,10] (10,20]
[100] (10,20]
Levels: [0,10] (10,20]
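To make the interval rule explicit, here is a minimal Python sketch of how right-closed intervals with include.lowest=TRUE assign a value to a bin. It is not R's implementation; the function name cut_right_closed is made up for illustration, and the breaks are the same 0, 10 and 20 used above.

```python
from bisect import bisect_left


def cut_right_closed(value, breaks, include_lowest=True):
    """Return the label of the right-closed interval holding value, or None.

    Hypothetical helper that only mirrors the labelling rule, not R's cut().
    """
    # Values outside the outermost breaks belong to no interval (NA in R).
    if value < breaks[0] or value > breaks[-1]:
        return None
    if value == breaks[0]:
        # The lowest break is only included when include_lowest is set.
        if not include_lowest:
            return None
        idx = 0
    else:
        # bisect_left finds the first break >= value, i.e. the right edge.
        idx = bisect_left(breaks, value) - 1
    opener = "[" if (idx == 0 and include_lowest) else "("
    return f"{opener}{breaks[idx]},{breaks[idx + 1]}]"


breaks = [0, 10, 20]
labels = [cut_right_closed(v, breaks) for v in (0, 10, 11, 20, 21)]
print(labels)
```

Note that 10 lands in [0,10] rather than (10,20], because each interval is closed on the right.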
cut() function puts all data in a single interval
I think you have two alternatives: use cut(as.numeric(vec), ...) or findInterval.
as.numeric
If you are not concerned about the theoretical precision loss when converting integer64 to numeric (it might be hard to find a case where this happens), then you can convert to numeric:
cut(as.numeric(vec), points, dig.lab = 10)
# [1] (448,672] (0,224] (224,448] (0,224] (224,448] (896,1120] (672,896] (0,224] (224,448] (224,448] (672,896] (0,224] (0,224] (448,672] (448,672] (0,224]
# Levels: (0,224] (224,448] (448,672] (672,896] (896,1120]
findInterval
table(cut(vec, points, dig.lab = 10))
# (0,224] (224,448] (448,672] (672,896] (896,1120]
# 16 0 0 0 0
table(findInterval(vec, points))
# 1 2 3 4 5
# 6 4 3 1 2
You can build on this to produce similarly formatted factors manually:
labels <- sprintf("(%i,%i]", points[-length(points)], points[-1])
labels
# [1] "(0,224]" "(224,448]" "(448,672]" "(672,896]" "(896,1120]"
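# Use levels = (not labels =) so that empty intervals are kept and the order is fixed.
# findInterval() treats intervals as [a, b) by default, so a value sitting exactly
# on a break lands one bin higher than with cut(); pass left.open = TRUE to match cut().
factor(labels[findInterval(vec, points)], levels = labels)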
# [1] (448,672] (0,224] (224,448] (0,224] (224,448] (896,1120] (672,896] (0,224] (224,448] (224,448] (896,1120] (0,224] (0,224] (448,672] (448,672] (0,224]
# Levels: (0,224] (224,448] (448,672] (672,896] (896,1120]
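The two results differ for a value that sits exactly on a break, and the reason is which side of each interval is closed. Here is a small Python sketch of that difference, using bisect as a stand-in for the lookup. The break points are assumed from the labels printed above, and the value 896 is a hypothetical value chosen to lie exactly on a break.

```python
from bisect import bisect_left, bisect_right

# Break points assumed from the labels shown above.
points = [0, 224, 448, 672, 896, 1120]
value = 896  # hypothetical value lying exactly on a break

# Left-closed intervals [a, b): count the breaks that are <= value.
# This mirrors findInterval()'s default behaviour.
left_closed_bin = bisect_right(points, value)

# Right-closed intervals (a, b]: count the breaks that are < value.
# This mirrors cut()'s default and findInterval(left.open = TRUE).
right_closed_bin = bisect_left(points, value)

print(left_closed_bin, right_closed_bin)
```

With left-closed intervals the value goes to bin 5, which is labelled (896,1120] here, while with right-closed intervals it goes to bin 4, (672,896].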
Data
vec <- structure(c(2.55431938899924e-321, 9.88131291682493e-322, 1.93179667523927e-321, 9.18962101264719e-322, 1.29445199210407e-321, 5.03946958758071e-321, 3.90805925860426e-321, 6.12641400843146e-322, 2.15906687232625e-321, 1.17587623710217e-321, 4.42682818673757e-321, 1.04741916918344e-321, 7.11454530011395e-322, 2.61360726650019e-321, 2.58396332774972e-321, 9.38724727098368e-322), class = "integer64")
pandas.cut function gave me negative values when it is supposed to be 0
As explained in the comments, you asked cut to define the bins automatically. By default they are equal-width, and the lowest edge is nudged slightly below the minimum so that the minimum value falls inside the first bin, which means a negative bound is possible.
If you wish to keep the automatic binning, you can modify the intervals manually afterwards. Here is an example for the case where only the first interval is "incorrect", using cat.rename_categories:
import numpy as np
import pandas as pd

np.random.seed(0)
s = pd.Series(np.random.randint(-10,100,size=100)).clip(lower=0)
s_cut = pd.cut(s, bins=10)
print(s_cut.cat.categories)
first_I = s_cut.cat.categories[0]
new_I = pd.Interval(0, first_I.right)
s_cut = s_cut.cat.rename_categories({first_I: new_I})
print(s_cut.cat.categories)
output:
# before
IntervalIndex([(-0.095, 9.5], (9.5, 19.0], (19.0, 28.5], ...)
# after
IntervalIndex([(0.0, 9.5], (9.5, 19.0], (19.0, 28.5], ...)
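Where the -0.095 comes from can be reproduced by hand. The sketch below is my reading of the arithmetic, not pandas' actual code: it assumes the data range is 0 to 95 (inferred from the printed categories) and that the lower edge is moved down by 0.1% of the range when the bins are right-closed. The function name equal_width_edges is made up for illustration.

```python
def equal_width_edges(lo, hi, bins, right=True):
    """Equal-width bin edges with a 0.1% adjustment on the open side.

    Hypothetical helper sketching the arithmetic; not pandas' implementation.
    """
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    adjustment = (hi - lo) * 0.001  # 0.1% of the data range
    if right:
        # Right-closed bins would exclude the minimum, so lower the first edge.
        edges[0] -= adjustment
    else:
        # Left-closed bins would exclude the maximum, so raise the last edge.
        edges[-1] += adjustment
    return edges


# Assumed range: minimum 0 (after clipping) and maximum 95.
edges = equal_width_edges(0, 95, 10)
print([round(e, 3) for e in edges[:3]])
```

The first edge comes out slightly below zero even though no value in the data is negative, which is why renaming the first category is purely cosmetic.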
Defined interval in R by cut() and make a histogram plot
set.seed(50)
months <- sample(50)
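# seq(0, 50, by = 12) stops at 48, so the values 49 and 50 fall outside the breaks and become NA
output <- cut(months, breaks = seq(0, 50, by = 12), labels = c("<12", "12-24", "24-35", "36-50"))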
hist(as.numeric(output))
You'll have to edit the axis values on the histogram manually, since they will be labeled 1 to 4. And, as I mentioned in my comment, the histogram isn't very informative, considering all the bins have the same count.
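Why the bars all have the same height can be checked with a quick count. This is a Python sketch of the same binning, not a port of the R code: Python's random.seed(50) will not reproduce R's sample, but the counts do not depend on the order, since the data is always a permutation of 1 to 50.

```python
import random
from bisect import bisect_left

random.seed(50)
months = random.sample(range(1, 51), 50)  # a permutation of 1..50

breaks = list(range(0, 50, 12))  # [0, 12, 24, 36, 48], like seq(0, 50, by = 12)
counts = [0] * (len(breaks) - 1)
outside = 0

for m in months:
    # Right-closed intervals: find the first break >= m, then step back one.
    idx = bisect_left(breaks, m) - 1
    if 0 <= idx < len(counts):
        counts[idx] += 1
    else:
        outside += 1  # values above the last break (NA in R)

print(counts, outside)
```

Each of the four intervals covers exactly 12 integers, so every bar has height 12, and the two values above 48 are left out.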
Count interval using function cut
You can use .drop = FALSE to include factor levels that are empty.
library(dplyr)
interval %>% group_by(interval, .drop = FALSE) %>% summarise(n = n())
# A tibble: 4 x 2
# interval n
# <fct> <int>
#1 (0,3] 1
#2 (3,5] 1
#3 (5,7] 0
#4 (7,12] 2
Alternatively, you can use count:
interval %>% count(interval, .drop = FALSE)
Note that some of these functions also exist in the plyr library, so if you have plyr loaded, its versions might mask the dplyr ones. In that case, restart R and load only dplyr, or call dplyr::summarise and dplyr::count explicitly.
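The idea behind keeping empty levels is simply to count against the full list of levels rather than only the values that occur. A minimal Python sketch of that idea follows; the observed labels are illustrative, chosen to match the counts in the tibble above, and nothing here is dplyr's implementation.

```python
from collections import Counter

# All interval levels, including ones that may not occur in the data.
levels = ["(0,3]", "(3,5]", "(5,7]", "(7,12]"]

# Illustrative observations, chosen to match the counts shown above.
observed = ["(0,3]", "(3,5]", "(7,12]", "(7,12]"]

seen = Counter(observed)

# Counting only what occurs drops the empty level entirely.
counts_dropped = dict(seen)

# Iterating over the full level list keeps it, with a count of zero.
counts_kept = {level: seen.get(level, 0) for level in levels}

print(counts_dropped)
print(counts_kept)
```

The second dictionary corresponds to the .drop = FALSE result, where (5,7] appears with a count of 0.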
cut function and controlled frequency in the intervals
Updated after some comments:
Since you state that the minimum number of cases in each group would be fine for you, I'd go with Hmisc::cut2
v <- rnorm(10, 0, 1)
Hmisc::cut2(v, m = 3) # minimum of 3 cases per group
The documentation for cut2 states:
m: desired minimum number of observations in a group. The algorithm does not guarantee that all groups will have at least m observations.
The same cuts for separate variables
If the distributions of your variables are very similar, you could extract the exact cutpoints by setting the argument onlycuts = TRUE and reuse them for the other variables. If the distributions differ, though, you will end up with few cases in some intervals.
Using your data:
library(magrittr)
library(Hmisc)
cuts <- cut2(df1$x, g = 20, onlycuts = TRUE) # determine cuts based on df1
cut2(df1$x, cuts = cuts) %>% table
cut2(df2$x, cuts = cuts) %>% table*2 # multiplied by two for better comparison
cut variable with symbol in label interval
Those levels are character values, so you can change them using sub.
var <- cut(variab, breaks = c(0, 1:4, Inf), include.lowest=TRUE)
levels(var) <- sub('Inf', '4+', levels(var))
table(var)
#var
# [0,1] (1,2] (2,3] (3,4] (4,4+]
# 2 1 0 1 96
For data in a data frame, you can do:
library(dplyr)
df %>%
  mutate(var = cut(variab, breaks = c(0, 1:4, Inf), include.lowest = TRUE),
         var = sub('Inf', '4+', var))