Define and Apply Custom Bins on a Dataframe


Another cut answer that takes into account extrema:

dat <- read.table("clipboard", header=TRUE)

cuts <- apply(dat, 2, cut, c(-Inf,seq(0.5, 1, 0.1), Inf), labels=0:6)
cuts[cuts=="6"] <- "0"
cuts <- as.data.frame(cuts)

  cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1           3         0            0           1         1            0       0
2           0         0            5           0         2            2       0
3           1         0            2           0         0            1       0
4           0         0            3           0         1            1       0
5           1         3            1           0         4            0       0
6           0         0            1           0         0            0       0

Explanation

The cut function splits into bins depending on the cuts you specify. So let's take 1:10 and split it at 3, 5 and 7.

cut(1:10, c(3, 5, 7))
[1] <NA> <NA> <NA> (3,5] (3,5] (5,7] (5,7] <NA> <NA> <NA>
Levels: (3,5] (5,7]

You can see how it has made a factor where the levels are those in between the breaks. Also notice it doesn't include 3 (there's an include.lowest argument which will include it). But these are terrible names for groups, let's call them group 1 and 2.

cut(1:10, c(3, 5, 7), labels=1:2)
[1] <NA> <NA> <NA> 1 1 2 2 <NA> <NA> <NA>

Better, but what's with the NAs? They are outside our boundaries and not counted. To count them, in my solution, I added -infinity and infinity, so all points would be included. Notice that as we have more breaks, we'll need more labels:

x <- cut(1:10, c(-Inf, 3, 5, 7, Inf), labels=1:4)
x
[1] 1 1 1 2 2 3 3 4 4 4
Levels: 1 2 3 4

Ok, but we didn't want 4 (as per your problem). We wanted all the 4s to be in group 1. So let's get rid of the entries which are labelled '4'.

x[x=="4"] <- "1"
x
[1] 1 1 1 2 2 3 3 1 1 1
Levels: 1 2 3 4

This is slightly different from what I did in the solution: there I relabelled the last bin for all columns in one step at the end, whereas here I've done it on a single vector so you can better see how cut works.

Ok, the apply function. So far, we've been using cut on a single vector. But you want it used on a collection of vectors: each column of your data frame. That's what the second argument of apply controls: 1 applies the function to each row, 2 applies it to each column. So we apply cut to each column of your data frame. Everything after cut in the apply call is passed on as arguments to cut, which we discussed above.
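
For readers following the pandas answers further down, here is a rough pandas analog of the same idea. It is not the R code above: the column names and values are made up, and pd.cut with labels=False stands in for cut with labels=0:6. Treat it as a sketch under those assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the asker's similarity scores.
dat = pd.DataFrame({
    "a": [0.2, 0.55, 0.95, 1.3],
    "b": [0.65, 0.75, 0.5, 1.0],
})

# Same edges as c(-Inf, seq(0.5, 1, 0.1), Inf): 8 edges, so 7 bins.
edges = [-np.inf, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, np.inf]

# labels=False returns the integer bin position (0..6) instead of intervals.
codes = dat.apply(lambda col: pd.cut(col, bins=edges, labels=False))

# Fold the overflow bin (values above 1.0) back into group 0.
codes = codes.replace(6, 0)
print(codes)
```

As in R, the infinite edges make sure no value falls outside the bins, so nothing comes back as missing.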

Hope that helps.

Binning Pandas Dataframe by custom and variable length datetime ranges

If you have the start and end times of each period, then you really don't need to create a range. You can just use logic with the datetime objects. Should be easy to generalize to more and more tests if you have that.

import pandas as pd

start_t1 = pd.to_datetime('2017-10-14 00:20:00')
stop_t1 = pd.to_datetime('2017-10-14 00:33:15')
start_t2 = pd.to_datetime('2017-10-14 00:49:15')
stop_t2 = pd.to_datetime('2017-10-14 01:15:15')

df.loc[(df.Timestamp > start_t1) & (df.Timestamp < stop_t1), 'Test'] = 'Test_1'
df.loc[(df.Timestamp > start_t2) & (df.Timestamp < stop_t2), 'Test'] = 'Test_2'

   DeviceID  QuantResult2  QuantResult1            Timestamp    Test
0       15D        387403          7903  2017-10-14 00:28:00  Test_1
1       15D           786       3429734  2017-10-14 00:29:10  Test_1
2       15D           546       2320923  2017-10-14 00:31:15  Test_1
3       15D        435869           232  2017-10-14 00:50:05  Test_2
4       15D            12      34032984  2017-10-14 01:10:07  Test_2
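
One way to generalize this to more tests is to keep the periods in a list and loop over it. The sketch below assumes a Timestamp column as in the answer; the sample rows and the periods list are made up for illustration.

```python
import pandas as pd

# Hypothetical timestamps; the column name follows the answer above.
df = pd.DataFrame({"Timestamp": pd.to_datetime([
    "2017-10-14 00:28:00",
    "2017-10-14 00:40:00",  # falls between the two periods
    "2017-10-14 00:50:05",
])})

# One (start, stop, label) tuple per test period.
periods = [
    ("2017-10-14 00:20:00", "2017-10-14 00:33:15", "Test_1"),
    ("2017-10-14 00:49:15", "2017-10-14 01:15:15", "Test_2"),
]

for start, stop, label in periods:
    # Same strict comparisons as in the two-period version.
    mask = (df["Timestamp"] > pd.Timestamp(start)) & (df["Timestamp"] < pd.Timestamp(stop))
    df.loc[mask, "Test"] = label

print(df)
```

Rows that fall in no period keep a missing value in Test, which makes gaps between periods easy to spot.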

Binning a column with Python Pandas

You can use pandas.cut:

bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = pd.cut(df['percentage'], bins)
print (df)
   percentage     binned
0       46.50   (25, 50]
1       44.20   (25, 50]
2      100.00  (50, 100]
3       42.12   (25, 50]


To use your own labels instead of intervals, pass labels:

bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1,2,3,4,5,6]
df['binned'] = pd.cut(df['percentage'], bins=bins, labels=labels)
print (df)
   percentage  binned
0       46.50       5
1       44.20       5
2      100.00       6
3       42.12       5

Or numpy.searchsorted:

import numpy as np

bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = np.searchsorted(bins, df['percentage'].values)
print (df)
   percentage  binned
0       46.50       5
1       44.20       5
2      100.00       6
3       42.12       5
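
The two approaches agree inside the bin range but differ at the lowest edge. A small sketch, with a made-up extra value of 0.0 added to the sample percentages, shows the relationship:

```python
import numpy as np
import pandas as pd

bins = [0, 1, 5, 10, 25, 50, 100]
values = pd.Series([46.5, 44.2, 100.0, 42.12, 0.0])

# labels=False makes pd.cut return 0-based bin positions (NaN when out of range).
cut_codes = pd.cut(values, bins=bins, labels=False)

# searchsorted (default side="left") returns the insertion index into bins.
ss_codes = np.searchsorted(bins, values.to_numpy())

print(cut_codes.tolist())
print(ss_codes.tolist())
```

For values inside (0, 100] the searchsorted index is the pd.cut position plus one. The value 0.0 sits on the lowest edge, which the default right-closed bins exclude, so pd.cut gives NaN while searchsorted gives 0.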

...and then value_counts or groupby and aggregate size:

s = pd.cut(df['percentage'], bins=bins).value_counts()
print (s)
(25, 50]     3
(50, 100]    1
(10, 25]     0
(5, 10]      0
(1, 5]       0
(0, 1]       0
Name: percentage, dtype: int64


s = df.groupby(pd.cut(df['percentage'], bins=bins)).size()
print (s)
percentage
(0, 1]       0
(1, 5]       0
(5, 10]      0
(10, 25]     0
(25, 50]     3
(50, 100]    1
dtype: int64

By default cut returns a categorical.

Series methods like Series.value_counts() use all categories, even if some categories are not present in the data; see "operations" in the pandas categorical documentation.
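
A short sketch of what that means in practice, using the same four sample percentages. Two choices here are mine rather than the answer's: sort_index to get the counts back in bin order, and passing observed=False to groupby explicitly. The default for observed has been changing across pandas releases, so spelling it out seems the safer option.

```python
import pandas as pd

df = pd.DataFrame({"percentage": [46.50, 44.20, 100.00, 42.12]})
bins = [0, 1, 5, 10, 25, 50, 100]
binned = pd.cut(df["percentage"], bins=bins)

# value_counts sorts by count; sort_index restores the bin order.
counts = binned.value_counts().sort_index()

# observed=False asks groupby to keep categories that have no rows.
sizes = df.groupby(binned, observed=False).size()

print(counts)
print(sizes)
```

Both results list all six bins, including the four empty ones.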

Grouping values into custom bins

Try making the bins fall between the thresholds:

bins = [0.5, 8.5, 11.5, 12.5, 15.5, 16.5]
labels = ['Middle School or less', 'Some High School',
          'High School Grad', 'Some College', 'College Grad']

adult_df_educrace['education_bins'] = pd.cut(x=adult_df_educrace['education'],
                                             bins=bins,
                                             labels=labels)

Test data (create this DataFrame before running the code above):

import numpy as np
import pandas as pd

adult_df_educrace = pd.DataFrame({'education':np.arange(1,17)})

Output:

    education         education_bins
0           1  Middle School or less
1           2  Middle School or less
2           3  Middle School or less
3           4  Middle School or less
4           5  Middle School or less
5           6  Middle School or less
6           7  Middle School or less
7           8  Middle School or less
8           9       Some High School
9          10       Some High School
10         11       Some High School
11         12       High School Grad
12         13           Some College
13         14           Some College
14         15           Some College
15         16           College Grad
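
Because the education codes are whole numbers and pd.cut closes each bin on the right by default, integer edges at the upper end of each group should give the same grouping as the half-step edges. The sketch below checks this on the same test range; the integer edge list is my own, not part of the answer.

```python
import numpy as np
import pandas as pd

labels = ['Middle School or less', 'Some High School',
          'High School Grad', 'Some College', 'College Grad']
education = pd.Series(np.arange(1, 17))

# Edges placed between the integer codes, as in the answer.
half_steps = pd.cut(education, bins=[0.5, 8.5, 11.5, 12.5, 15.5, 16.5], labels=labels)

# Edges placed on the last code of each group; bins are (left, right].
int_edges = pd.cut(education, bins=[0, 8, 11, 12, 15, 16], labels=labels)

same = half_steps.tolist() == int_edges.tolist()
print(same)
```

The half-step version has the advantage that it does not depend on which side of the bin is closed.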

How to create bins and assign labels based on a given condition pandas

The solution involves editing your bins list:

# Same labels as yours
labels = ['0', '1.0 - 1.19', '1.2 – 1.39', '1.4 – 1.59', '1.6 – 1.79',
          '1.8 – 1.99', '2.0 – 2.99', '3.0 – 3.99', '4.0 – 4.99', '5.0 – 5.99',
          '6.0 – 6.99', '7.0 – 7.99', '8.0 – 8.99', '9.0 – 9.99', '10.0+']

# Define the edges between bins
bins = [0, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 3.0, 4.0,
        5.0, 6.0, 7.0, 8.0, 9.0, 10.0, np.inf]

# pd.cut each column, with each bin closed on left and open on right
res = df.apply(lambda x: pd.cut(x, bins=bins, labels=labels, right=False))

# rename columns and print result
res.columns = [f'new{i+1}' for i in range(df.shape[1])]

print(res)

new1 new2 new3 new4 new5 new6 new7 new8 new9 new10
0 10.0+ 5.0 – 5.99 10.0+ 10.0+ 0 1.0 - 1.19 3.0 – 3.99 2.0 – 2.99 0 1.4 – 1.59
1 3.0 – 3.99 7.0 – 7.99 0 3.0 – 3.99 10.0+ 0 8.0 – 8.99 10.0+ 0 0
2 6.0 – 6.99 10.0+ 10.0+ 3.0 – 3.99 10.0+ 6.0 – 6.99 10.0+ 4.0 – 4.99 0 0
3 2.0 – 2.99 5.0 – 5.99 0 8.0 – 8.99 10.0+ 0 6.0 – 6.99 10.0+ 0 0
4 4.0 – 4.99 10.0+ 10.0+ 10.0+ 1.4 – 1.59 10.0+ 10.0+ 10.0+ 0 0
5 7.0 – 7.99 2.0 – 2.99 0 4.0 – 4.99 3.0 – 3.99 0 10.0+ 0 0 0
6 6.0 – 6.99 10.0+ 8.0 – 8.99 10.0+ 6.0 – 6.99 3.0 – 3.99 9.0 – 9.99 10.0+ 0 0
7 6.0 – 6.99 5.0 – 5.99 6.0 – 6.99 10.0+ 10.0+ 10.0+ 4.0 – 4.99 10.0+ 0 0
8 3.0 – 3.99 10.0+ 4.0 – 4.99 8.0 – 8.99 0 10.0+ 10.0+ 3.0 – 3.99 10.0+ 0

Explanation

The sequence of scalars passed as bins to pd.cut() "defines the bin edges allowing for non-uniform width": https://pandas.pydata.org/docs/reference/api/pandas.cut.html.

By default, each bin is open on the left and closed on the right. To switch this, pass right=False (which also closes the left edge of each bin).

For example, with right=False, bins=[0, 1.0, 1.19, 1.2] causes pd.cut to make 3 intervals: [0.0, 1.0) < [1.0, 1.19) < [1.19, 1.2).
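
A quick way to see the left-closed behaviour is to cut a few values that sit exactly on the edges. The sample values here are made up for illustration.

```python
import pandas as pd

edges = [0, 1.0, 1.19, 1.2]
values = [0.0, 0.99, 1.0, 1.19, 1.2]

# right=False makes every bin closed on the left and open on the right.
binned = pd.cut(values, bins=edges, right=False)

print(binned.categories)
print(binned.codes.tolist())
```

Each value equal to an edge lands in the bin that starts at that edge. The last value, 1.2, equals the open right end of the final bin, so it gets no bin (code -1); this is why the answer's bins list ends with np.inf.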

Map numeric data into bins in Pandas dataframe for separate groups using dictionaries

A simpler solution would be to use groupby and apply a custom function on each group. In this case, we can define a function reclass that obtains the correct bins and ids and then uses pd.cut:

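def reclass(group, name):
    # look up the bin edges and ids that belong to this polyid
    bins = bins_dic[name]
    ids = ids_dic[name]
    return pd.cut(group, bins, labels=ids)

# group_keys=False keeps the original row index so the result aligns with df
df['id'] = df.groupby('polyid', group_keys=False)['value'].apply(lambda x: reclass(x, x.name))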

Result:

    polyid  value  id
0        1   0.56   1
1        1   0.59   1
2        1   0.62   2
3        1   0.83   3
4        2   0.85   2
5        2   0.01   1
6        2   0.79   2
7        3   0.37   1
8        3   0.99   3
9        3   0.48   1
10       3   0.55   2
11       3   0.06   1
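
The answer relies on bins_dic and ids_dic from the question, which are not shown here. The self-contained sketch below uses made-up dictionaries and data, and swaps groupby().apply() for an explicit loop plus pd.concat, which avoids depending on how apply arranges the index.

```python
import pandas as pd

# Hypothetical per-group edges and ids; the real ones come from the question.
bins_dic = {1: [0, 0.6, 0.8, 1.0], 2: [0, 0.5, 1.0]}
ids_dic = {1: [1, 2, 3], 2: [1, 2]}

df = pd.DataFrame({
    "polyid": [1, 1, 1, 2, 2],
    "value": [0.56, 0.62, 0.83, 0.85, 0.01],
})

parts = []
for name, group in df.groupby("polyid")["value"]:
    # Cut each group with its own edges and labels.
    binned = pd.cut(group, bins_dic[name], labels=ids_dic[name])
    # Plain integers, so the pieces concatenate to one dtype.
    parts.append(binned.astype(int))

# Each piece keeps its original row index, so assignment aligns by index.
df["id"] = pd.concat(parts)
print(df)
```

The astype(int) step assumes every value falls inside its group's bins; a value outside them would produce a missing label and need separate handling.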

How to define a function that will check any data frame for Age column and return bins?

The problem lies here:

x = input("Enter Dataframe Name: ")  # x is a string
df = x                               # so df is also a string, not a DataFrame
df['Age']                            # indexing a string with 'Age' raises a TypeError

This would resolve your problem: pass the DataFrame object itself to the function instead of reading its name as a string.

def age_range(df):
    bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    labels = ['0-9', '10-19', '20s', '30s', '40s', '50s', '60s', '70s', '80s', '90s']
    result = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
    return result

For example, you can run it like this:

import random

df = pd.DataFrame({'Age' : [random.randint(1, 99) for i in range(500)]})
df["AgeRange"] = age_range(df)

or

df = pd.DataFrame({'Age' : [random.randint(1, 99) for i in range(500)]})
AgeRangeDf = pd.DataFrame({"Age_Range" :age_range(df)})
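
Since the question asks about checking any data frame for an Age column, a variant with an explicit check might look like the sketch below. The function name age_range_checked and the column parameter are hypothetical additions, not part of the answer.

```python
import pandas as pd

def age_range_checked(df, column="Age"):
    """Bin the given age column, failing early if it is missing."""
    if column not in df.columns:
        raise KeyError(f"DataFrame has no {column!r} column")
    bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    labels = ['0-9', '10-19', '20s', '30s', '40s', '50s', '60s', '70s', '80s', '90s']
    # right=False gives left-closed bins such as [20, 30) for '20s'.
    return pd.cut(df[column], bins=bins, labels=labels, right=False)

people = pd.DataFrame({"Age": [5, 15, 25, 99]})
print(age_range_checked(people).tolist())
# ['0-9', '10-19', '20s', '90s']
```

Calling it on a frame without the column raises a KeyError with a readable message instead of failing inside pd.cut.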

