Cut by Defined Interval

To cut into pre-defined intervals, specify a vector of breaks with the breaks parameter.

Define some data:

x <- sample(0:20, 100, replace=TRUE)
x

Now cut x at 0, 10 and 20:

cut(x, breaks=c(0, 10, 20), include.lowest=TRUE)

[1] (10,20] [0,10] [0,10] (10,20] (10,20] (10,20] [0,10] (10,20] (10,20]
[10] (10,20] [0,10] (10,20] (10,20] (10,20] [0,10] (10,20] [0,10] [0,10]
[19] [0,10] (10,20] [0,10] [0,10] [0,10] (10,20] [0,10] (10,20] (10,20]
[28] (10,20] (10,20] [0,10] [0,10] [0,10] [0,10] (10,20] [0,10] [0,10]
[37] [0,10] [0,10] (10,20] (10,20] (10,20] (10,20] [0,10] (10,20] [0,10]
[46] (10,20] [0,10] (10,20] (10,20] [0,10] [0,10] (10,20] (10,20] (10,20]
[55] [0,10] [0,10] (10,20] [0,10] [0,10] [0,10] [0,10] (10,20] (10,20]
[64] (10,20] [0,10] [0,10] (10,20] (10,20] (10,20] (10,20] (10,20] (10,20]
[73] (10,20] [0,10] [0,10] [0,10] (10,20] [0,10] (10,20] [0,10] (10,20]
[82] [0,10] [0,10] (10,20] [0,10] [0,10] [0,10] (10,20] (10,20] [0,10]
[91] [0,10] [0,10] (10,20] (10,20] [0,10] [0,10] [0,10] [0,10] (10,20]
[100] (10,20]
Levels: [0,10] (10,20]
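For readers coming from pandas, pd.cut behaves analogously; include_lowest=True mirrors R's include.lowest=TRUE. This is a sketch, not part of the original answer (note that pandas displays the first bin with a slightly adjusted left edge rather than R's closed-bracket notation):

```python
import numpy as np
import pandas as pd

# Reproducible sample of 100 integers in 0..20, mirroring the R example
rng = np.random.default_rng(0)
x = rng.integers(0, 21, size=100)

# Cut at 0, 10 and 20; include_lowest keeps 0 inside the first bin
binned = pd.cut(x, bins=[0, 10, 20], include_lowest=True)
print(binned.categories)            # the two interval categories
print(pd.Series(binned).value_counts(sort=False))
```

Without include_lowest=True, any exact 0 in x would become NaN, just as it would become NA in R without include.lowest=TRUE.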

cut() function puts all data in a single interval

I think you have two alternatives: use cut(as.numeric(vec),...) or findInterval.

as.numeric

If you are not concerned about the theoretical precision loss when converting from integer64 to numeric (it would be hard to trigger in practice), you can convert to numeric:

cut(as.numeric(vec), points, dig.lab = 10)
# [1] (448,672] (0,224] (224,448] (0,224] (224,448] (896,1120] (672,896] (0,224] (224,448] (224,448] (672,896] (0,224] (0,224] (448,672] (448,672] (0,224]
# Levels: (0,224] (224,448] (448,672] (672,896] (896,1120]

findInterval

table(cut(vec, points, dig.lab = 10))
#   (0,224] (224,448] (448,672] (672,896] (896,1120]
#        16         0         0         0          0
table(findInterval(vec, points))
# 1 2 3 4 5
# 6 4 3 1 2

You can mimic cut's labels to produce similarly formatted factors manually:

labels <- sprintf("(%i,%i]", points[-length(points)], points[-1])
labels
# [1] "(0,224]" "(224,448]" "(448,672]" "(672,896]" "(896,1120]"
factor(labels[findInterval(vec, points)], labels = labels)
# [1] (448,672] (0,224] (224,448] (0,224] (224,448] (896,1120] (672,896] (0,224] (224,448] (224,448] (896,1120] (0,224] (0,224] (448,672] (448,672] (0,224]
# Levels: (0,224] (224,448] (448,672] (672,896] (896,1120]
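R's findInterval has a direct NumPy counterpart in np.searchsorted (or np.digitize). The sketch below uses made-up data, not the integer64 vector above, and reproduces the "index into a label vector" trick; side='left' makes the bins right-closed, matching cut's default:

```python
import numpy as np

points = np.array([0, 224, 448, 672, 896, 1120])
vec = np.array([500, 100, 300, 1000, 700])  # illustrative values only

# Build "(lo,hi]" labels from consecutive break pairs, as in the R answer
labels = [f"({points[i]},{points[i+1]}]" for i in range(len(points) - 1)]

# side='left' -> right-closed bins; the returned index is 1-based like findInterval
idx = np.searchsorted(points, vec, side='left')
binned = [labels[i - 1] for i in idx]
print(binned)
# ['(448,672]', '(0,224]', '(224,448]', '(896,1120]', '(672,896]']
```

Note that findInterval's own default is left-closed bins (side='right' in searchsorted terms), so the R mock above can disagree with cut exactly on the break values themselves.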

Data

vec <- structure(c(2.55431938899924e-321, 9.88131291682493e-322, 1.93179667523927e-321, 9.18962101264719e-322, 1.29445199210407e-321, 5.03946958758071e-321, 3.90805925860426e-321, 6.12641400843146e-322, 2.15906687232625e-321, 1.17587623710217e-321, 4.42682818673757e-321, 1.04741916918344e-321, 7.11454530011395e-322, 2.61360726650019e-321, 2.58396332774972e-321, 9.38724727098368e-322), class = "integer64")

pandas.cut function gave me negative values when it is supposed to be 0

As explained in the comments, you asked cut to define the bins automatically; by default they are of equal width, which means a negative bound is possible.

If you wish to keep the automatic binning, you can modify the intervals manually afterwards. Here is an example where only the first interval is "incorrect", using cat.rename_categories:

import numpy as np
import pandas as pd

np.random.seed(0)
s = pd.Series(np.random.randint(-10, 100, size=100)).clip(lower=0)
s_cut = pd.cut(s, bins=10)
print(s_cut.cat.categories)

first_I = s_cut.cat.categories[0]
new_I = pd.Interval(0, first_I.right)
s_cut = s_cut.cat.rename_categories({first_I: new_I})
print(s_cut.cat.categories)

output:

# before
IntervalIndex([(-0.095, 9.5], (9.5, 19.0], (19.0, 28.5], ...)

# after
IntervalIndex([(0.0, 9.5], (9.5, 19.0], (19.0, 28.5], ...)

Defined interval in R by cut() and make a histogram plot

set.seed(50)
months <- sample(50)

output <- cut(months, breaks = c(0, 12, 24, 36, 50), labels = c("<12", "12-24", "24-36", "36-50"))

hist(as.numeric(output))

You'll have to edit the axis labels on the histogram manually, since the bars will be numbered 1 through 4. And, as mentioned in the comments, the histogram isn't very informative here: months is a permutation of 1:50, so every interval holds roughly the same number of values.

Count interval using function cut

You can use .drop = FALSE to include factor levels which are empty.

library(dplyr)
interval %>% group_by(interval, .drop = FALSE) %>% summarise(n = n())

# A tibble: 4 x 2
#   interval     n
#   <fct>    <int>
# 1 (0,3]        1
# 2 (3,5]        1
# 3 (5,7]        0
# 4 (7,12]       2

Alternatively, you can also use count:

interval %>% count(interval, .drop = FALSE)

Note that some of these functions also exist in the plyr package, so if plyr is loaded it may mask the dplyr versions. In that case, restart R and load only dplyr, or call them explicitly as dplyr::summarise and dplyr::count.
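pandas shows the same behaviour out of the box: counting a binned categorical keeps empty bins, which is the counterpart of dplyr's .drop = FALSE. A sketch with made-up values chosen to reproduce the counts above:

```python
import pandas as pd

vals = pd.Series([2, 4, 9, 11])
interval = pd.cut(vals, bins=[0, 3, 5, 7, 12])

# value_counts on a categorical reports every category, including empty ones
counts = interval.value_counts(sort=False)
print(counts)
# (5, 7] appears with a count of 0 even though no value falls in it
```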

cut function and controlled frequency in the intervals

Updated after some comments:

Since you state that a minimum number of cases in each group would be fine for you, I'd go with Hmisc::cut2:

v <- rnorm(10, 0, 1)
Hmisc::cut2(v, m = 3) # minimum of 3 cases per group

The documentation for cut2 states:

m   desired minimum number of observations in a group.
The algorithm does not guarantee that all groups will have at least m observations.
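For comparison, the closest pandas counterpart to quantile-based binning is pd.qcut, which targets equal-sized groups but, much like cut2, cannot always guarantee them when values tie. A sketch with illustrative data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
v = rng.normal(0, 1, size=12)

# Split into 4 quantile-based groups of ~3 observations each
groups = pd.qcut(v, q=4)
print(pd.Series(groups).value_counts(sort=False))
```

With 12 distinct values, each of the 4 quantile groups receives exactly 3 observations; ties at the quantile boundaries are what can break the balance.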

The same cuts for separate variables

If the distributions of your variables are very similar, you can extract the exact cutpoints by setting the argument onlycuts = TRUE and reuse them for the other variables. If the distributions differ, though, you will end up with few cases in some intervals.

Using your data:

library(magrittr)
library(Hmisc)

cuts <- cut2(df1$x, g = 20, onlycuts = TRUE) # determine cuts based on df1

cut2(df1$x, cuts = cuts) %>% table
cut2(df2$x, cuts = cuts) %>% table*2 # multiplied by two for better comparison
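The same "reuse the cutpoints" idea in pandas: compute quantile edges on one variable with pd.qcut(..., retbins=True), then apply them to another variable via pd.cut. A sketch with made-up data, not the df1/df2 from the question:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
x1 = rng.normal(0, 1, size=200)
x2 = rng.normal(0, 1, size=100)

# Determine quantile cutpoints from x1 only (analogue of onlycuts = TRUE)
_, cuts = pd.qcut(x1, q=5, retbins=True)

# Reuse the same edges for both variables
t1 = pd.Series(pd.cut(x1, bins=cuts, include_lowest=True)).value_counts(sort=False)
t2 = pd.Series(pd.cut(x2, bins=cuts, include_lowest=True)).value_counts(sort=False)
print(t1)
print(t2)
```

As with cut2, x2 values outside the range of x1's edges become missing, and dissimilar distributions leave some bins sparse.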

cut variable with symbol in label interval

Those levels are character values; you can change them using sub.

var <- cut(variab, breaks = c(0, 1:4, Inf), include.lowest=TRUE)
levels(var) <- sub('Inf', '4+', levels(var))
table(var)
# var
#  [0,1]  (1,2]  (2,3]  (3,4] (4,4+]
#      2      1      0      1     96

For the data in dataframe, you can do :

df %>% 
  mutate(var = cut(variab, breaks = c(0, 1:4, Inf), include.lowest = TRUE),
         var = sub('Inf', '4+', var))
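The same label-editing idea carries over to pandas: the categories of a binned series can be rewritten with rename_categories, e.g. replacing "inf" with "4+". A sketch with illustrative values (the resulting categories become plain strings, as in the R answer):

```python
import numpy as np
import pandas as pd

s = pd.Series([0.5, 1.5, 3.2, 7.0, 100.0])
binned = pd.cut(s, bins=[0, 1, 2, 3, 4, np.inf], include_lowest=True)

# Rewrite each interval's display label, swapping 'inf' for '4+'
binned = binned.cat.rename_categories(
    [str(c).replace('inf', '4+') for c in binned.cat.categories]
)
print(binned.cat.categories)
```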

