Define and Apply Custom Bins on a Dataframe

Another cut answer that takes into account extrema:

dat <- read.table("clipboard", header=TRUE)

cuts <- apply(dat, 2, cut, c(-Inf,seq(0.5, 1, 0.1), Inf), labels=0:6)
cuts[cuts=="6"] <- "0"
cuts <- as.data.frame(cuts)

  cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1           3         0            0           1         1            0       0
2           0         0            5           0         2            2       0
3           1         0            2           0         0            1       0
4           0         0            3           0         1            1       0
5           1         3            1           0         4            0       0
6           0         0            1           0         0            0       0

Explanation

The cut function splits into bins depending on the cuts you specify. So let's take 1:10 and split it at 3, 5 and 7.

cut(1:10, c(3, 5, 7))
 [1] <NA>  <NA>  <NA>  (3,5] (3,5] (5,7] (5,7] <NA>  <NA>  <NA> 
Levels: (3,5] (5,7]

You can see how it has made a factor where the levels are those in between the breaks. Also notice it doesn't include 3 (there's an include.lowest argument which will include it). But these are terrible names for groups, let's call them group 1 and 2.

cut(1:10, c(3, 5, 7), labels=1:2)
 [1] <NA> <NA> <NA> 1    1    2    2    <NA> <NA> <NA>

Better, but what's with the NAs? They are outside our boundaries and not counted. To count them, in my solution, I added -infinity and infinity, so all points would be included. Notice that as we have more breaks, we'll need more labels:

x <- cut(1:10, c(-Inf, 3, 5, 7, Inf), labels=1:4)
 [1] 1 1 1 2 2 3 3 4 4 4
Levels: 1 2 3 4

Ok, but we didn't want 4 (as per your problem). We wanted all the 4s to be in group 1. So let's relabel the entries which are labelled '4' as '1'.

x[x=="4"] <- "1"
 [1] 1 1 1 2 2 3 3 1 1 1
Levels: 1 2 3 4

This is slightly different from what I did in the solution: there I replaced the last label in one go at the end, after binning every column. I've done it on a single vector here so you can better see how cut works.
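For readers working in pandas rather than R, the same walk-through can be sketched with pd.cut, which also uses right-closed bins by default and returns missing values outside the breaks. This is an illustrative analogue, not part of the original R answer, and the variable names are made up.

```python
import numpy as np
import pandas as pd

x = pd.Series(range(1, 11))

# Breaks at 3, 5 and 7 only: values outside (3, 7] become NaN.
inner = pd.cut(x, bins=[3, 5, 7], labels=[1, 2])

# Adding -inf and +inf as outer breaks puts every value in a bin.
full = pd.cut(x, bins=[-np.inf, 3, 5, 7, np.inf], labels=[1, 2, 3, 4])

# Fold the top group (4) into group 1, as in the R example.
merged = full.astype(int).replace(4, 1)

print(merged.tolist())
```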

Ok, the apply function. So far, we've been using cut on a single vector. But you want it used on a collection of vectors: each column of your data frame. That's what the second argument of apply controls: 1 applies the function to each row, 2 applies it to each column. So we apply the cut function to each column of your data frame. Everything after cut in the apply call is passed on as arguments to cut, which we discussed above.

Hope that helps.
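If you need the same column-wise binning in pandas, a rough equivalent might look like the sketch below. It assumes a small made-up data frame (colA and colB are hypothetical names) in place of the clipboard data, and uses DataFrame.apply to run pd.cut on each column.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the data frame read from the clipboard.
dat = pd.DataFrame({
    "colA": [0.42, 0.55, 0.97, 1.30],
    "colB": [0.61, 0.75, 0.88, 0.50],
})

# Same breaks as c(-Inf, seq(0.5, 1, 0.1), Inf) in the R code.
edges = [-np.inf, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, np.inf]

# labels=False returns the integer bin position (0 to 6) for each value.
cuts = dat.apply(lambda col: pd.cut(col, bins=edges, labels=False))

# Fold the top bin (6) back into bin 0, mirroring cuts[cuts=="6"] <- "0".
cuts = cuts.replace(6, 0)

print(cuts)
```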

Binning Pandas Dataframe by custom and variable length datetime ranges

If you have the start and end times of each period, then you don't need to create a range at all. You can just compare the datetime objects directly. This generalizes easily to more tests: add one start/stop pair and one df.loc line per test.

import pandas as pd

start_t1 = pd.to_datetime('2017-10-14 00:20:00')
stop_t1 = pd.to_datetime('2017-10-14 00:33:15')
start_t2 = pd.to_datetime('2017-10-14 00:49:15')
stop_t2 = pd.to_datetime('2017-10-14 01:15:15')

df.loc[(df.Timestamp > start_t1) & (df.Timestamp < stop_t1), 'Test'] = 'Test_1'
df.loc[(df.Timestamp > start_t2) & (df.Timestamp < stop_t2), 'Test'] = 'Test_2'

  DeviceID  Quant Result2  QuantResult1           Timestamp    Test
0      15D         387403          7903 2017-10-14 00:28:00  Test_1
1      15D            786       3429734 2017-10-14 00:29:10  Test_1
2      15D            546       2320923 2017-10-14 00:31:15  Test_1
3      15D         435869           232 2017-10-14 00:50:05  Test_2
4      15D             12      34032984 2017-10-14 01:10:07  Test_2
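With many tests, the repeated df.loc lines can be driven from a dictionary. The sketch below is one way to do that, not the only one: the periods dictionary is a hypothetical name, the first five timestamps come from the output above, and the sixth is invented to show a row that falls outside every period.

```python
import pandas as pd

df = pd.DataFrame({"Timestamp": pd.to_datetime([
    "2017-10-14 00:28:00", "2017-10-14 00:29:10", "2017-10-14 00:31:15",
    "2017-10-14 00:50:05", "2017-10-14 01:10:07",
    "2017-10-14 02:00:00",  # invented row, outside both periods
])})

# Hypothetical mapping of test name -> (start, stop).
periods = {
    "Test_1": (pd.Timestamp("2017-10-14 00:20:00"), pd.Timestamp("2017-10-14 00:33:15")),
    "Test_2": (pd.Timestamp("2017-10-14 00:49:15"), pd.Timestamp("2017-10-14 01:15:15")),
}

# Start with an empty object column so unmatched rows stay missing.
df["Test"] = pd.Series([None] * len(df), index=df.index, dtype="object")

for name, (start, stop) in periods.items():
    # Strict inequalities, as in the answer above.
    inside = (df["Timestamp"] > start) & (df["Timestamp"] < stop)
    df.loc[inside, "Test"] = name

print(df)
```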

Binning a column with Python Pandas

You can use pandas.cut:

bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = pd.cut(df['percentage'], bins)
print (df)
   percentage     binned
0       46.50   (25, 50]
1       44.20   (25, 50]
2      100.00  (50, 100]
3       42.12   (25, 50]

bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1,2,3,4,5,6]
df['binned'] = pd.cut(df['percentage'], bins=bins, labels=labels)
print (df)
   percentage binned
0       46.50      5
1       44.20      5
2      100.00      6
3       42.12      5

Or numpy.searchsorted:

import numpy as np

bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = np.searchsorted(bins, df['percentage'].values)
print (df)
   percentage  binned
0       46.50       5
1       44.20       5
2      100.00       6
3       42.12       5
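The two approaches agree for values inside the bin range, but they treat out-of-range values differently. The sketch below uses made-up extra values (0.0 and 150.0) to show this: pd.cut returns NaN for them, while numpy.searchsorted returns the insertion positions 0 and len(bins).

```python
import numpy as np
import pandas as pd

bins = [0, 1, 5, 10, 25, 50, 100]

# The last two values are invented to fall outside (0, 100].
values = pd.Series([46.5, 44.2, 100.0, 42.12, 0.0, 150.0])

# pd.cut: right-closed bins, NaN for anything outside them.
cut_labels = pd.cut(values, bins=bins, labels=[1, 2, 3, 4, 5, 6])

# searchsorted: position where each value would be inserted into bins.
positions = np.searchsorted(bins, values.to_numpy())

print(cut_labels.tolist())
print(positions.tolist())
```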

To count the rows in each bin, use value_counts, or groupby and aggregate with size:

s = pd.cut(df['percentage'], bins=bins).value_counts()
print (s)
(25, 50]     3
(50, 100]    1
(10, 25]     0
(5, 10]      0
(1, 5]       0
(0, 1]       0
Name: percentage, dtype: int64

s = df.groupby(pd.cut(df['percentage'], bins=bins)).size()
print (s)
percentage
(0, 1]       0
(1, 5]       0
(5, 10]      0
(10, 25]     0
(25, 50]     3
(50, 100]    1
dtype: int64

By default, cut returns a categorical.

Series methods such as Series.value_counts() use all categories, even those that do not occur in the data. For details, see the section on operations in the pandas categorical data documentation.
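A small sketch of that behaviour, using the four percentages from the example above. The names counts and observed are made up for illustration. Dropping the unused categories first is one way to keep only the bins that actually occur.

```python
import pandas as pd

percentage = pd.Series([46.5, 44.2, 100.0, 42.12])
binned = pd.cut(percentage, bins=[0, 1, 5, 10, 25, 50, 100])

# All six categories are counted, including the empty ones.
counts = binned.value_counts().sort_index()

# Removing unused categories leaves only the bins present in the data.
observed = binned.cat.remove_unused_categories().value_counts()

print(counts)
print(observed)
```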

Grouping values into custom bins

Try making the bins fall between the thresholds:

bins = [0.5, 8.5, 11.5, 12.5, 15.5, 16.5]
labels=['Middle School or less', 'Some High School', 
        'High School Grad', 'Some College', 'College Grad']

adult_df_educrace['education_bins'] = pd.cut(x=adult_df_educrace['education'],
                                             bins=bins,
                                             labels=labels)

Test data (create this before running the code above):

adult_df_educrace = pd.DataFrame({'education':np.arange(1,17)})

Output:

    education         education_bins
0           1  Middle School or less
1           2  Middle School or less
2           3  Middle School or less
3           4  Middle School or less
4           5  Middle School or less
5           6  Middle School or less
6           7  Middle School or less
7           8  Middle School or less
8           9       Some High School
9          10       Some High School
10         11       Some High School
11         12       High School Grad
12         13           Some College
13         14           Some College
14         15           Some College
15         16           College Grad
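Because the education codes are whole numbers and pd.cut closes each bin on the right by default, integer edges placed at the top of each group should give the same grouping as the half-step edges. The sketch below checks that on the same test data. The frame and variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

labels = ['Middle School or less', 'Some High School',
          'High School Grad', 'Some College', 'College Grad']

education = pd.Series(np.arange(1, 17), name='education')

# Edges halfway between the integer codes, as in the answer above.
half_step = pd.cut(education, bins=[0.5, 8.5, 11.5, 12.5, 15.5, 16.5], labels=labels)

# Integer edges: each bin is (lower, upper], so 8 stays in the first group.
integer_edges = pd.cut(education, bins=[0, 8, 11, 12, 15, 16], labels=labels)

print((half_step.astype(str) == integer_edges.astype(str)).all())
```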

How to create bins and assign labels based on a given condition pandas

The solution involves editing your bins list:

# Same labels as yours
labels = ['0', '1.0 - 1.19', '1.2 – 1.39', '1.4 – 1.59', '1.6 – 1.79', 
          '1.8 – 1.99', '2.0 – 2.99', '3.0 – 3.99', '4.0 – 4.99', '5.0 – 5.99', 
          '6.0 – 6.99', '7.0 – 7.99', '8.0 – 8.99', '9.0 – 9.99', '10.0+']

# Define the edges between bins
bins = [0, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 3.0, 4.0, 
        5.0, 6.0, 7.0, 8.0, 9.0, 10.0, np.inf]

# pd.cut each column, with each bin closed on left and open on right
res = df.apply(lambda x: pd.cut(x, bins=bins, labels=labels, right=False))

# rename columns and print result
res.columns = [f'new{i+1}' for i in range(df.shape[1])]

print(res)

         new1        new2        new3        new4        new5        new6        new7        new8   new9       new10
0       10.0+  5.0 – 5.99       10.0+       10.0+           0  1.0 - 1.19  3.0 – 3.99  2.0 – 2.99      0  1.4 – 1.59
1  3.0 – 3.99  7.0 – 7.99           0  3.0 – 3.99       10.0+           0  8.0 – 8.99       10.0+      0           0
2  6.0 – 6.99       10.0+       10.0+  3.0 – 3.99       10.0+  6.0 – 6.99       10.0+  4.0 – 4.99      0           0
3  2.0 – 2.99  5.0 – 5.99           0  8.0 – 8.99       10.0+           0  6.0 – 6.99       10.0+      0           0
4  4.0 – 4.99       10.0+       10.0+       10.0+  1.4 – 1.59       10.0+       10.0+       10.0+      0           0
5  7.0 – 7.99  2.0 – 2.99           0  4.0 – 4.99  3.0 – 3.99           0       10.0+           0      0           0
6  6.0 – 6.99       10.0+  8.0 – 8.99       10.0+  6.0 – 6.99  3.0 – 3.99  9.0 – 9.99       10.0+      0           0
7  6.0 – 6.99  5.0 – 5.99  6.0 – 6.99       10.0+       10.0+       10.0+  4.0 – 4.99       10.0+      0           0
8  3.0 – 3.99       10.0+  4.0 – 4.99  8.0 – 8.99           0       10.0+       10.0+  3.0 – 3.99  10.0+           0

Explanation

The sequence of scalars passed as bins to pd.cut() "defines the bin edges allowing for non-uniform width": https://pandas.pydata.org/docs/reference/api/pandas.cut.html.

By default, each bin is open on the left and closed on the right. To switch this, pass right=False (which also closes the left edge of each bin).

For example, with right=False, bins=[0, 1.0, 1.19, 1.2] causes pd.cut to make 3 intervals: [0.0, 1.0) < [1.0, 1.19) < [1.19, 1.2).
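To see the effect of right=False on values that sit exactly on an edge, here is a short sketch with made-up sample values. With the default setting, 0.0 falls outside the first bin and 1.0 belongs to the bin below it. With right=False, both land in the bin that starts at that edge.

```python
import pandas as pd

edges = [0, 1.0, 1.19, 1.2]
values = pd.Series([0.0, 0.99, 1.0, 1.19])

# Default: bins are (0, 1], (1, 1.19], (1.19, 1.2]; 0.0 is not covered.
default = pd.cut(values, bins=edges)

# right=False: bins are [0, 1), [1, 1.19), [1.19, 1.2).
left_closed = pd.cut(values, bins=edges, right=False)

print(default.tolist())
print(left_closed.tolist())
```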

Map numeric data into bins in Pandas dataframe for separate groups using dictionaries

A simpler solution would be to use groupby and apply a custom function on each group. In this case, we can define a function reclass that obtains the correct bins and ids and then uses pd.cut:

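def reclass(group, name):
    # look up the bin edges and the bin ids defined for this group
    bins = bins_dic[name]
    ids = ids_dic[name]
    return pd.cut(group, bins, labels=ids)

# group_keys=False keeps the original row index so the result aligns with df
# (recent pandas versions otherwise add the group key to the index)
df['id'] = df.groupby('polyid', group_keys=False)['value'].apply(lambda x: reclass(x, x.name))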

Result:

    polyid  value  id
0        1   0.56   1
1        1   0.59   1
2        1   0.62   2
3        1   0.83   3
4        2   0.85   2
5        2   0.01   1
6        2   0.79   2
7        3   0.37   1
8        3   0.99   3
9        3   0.48   1
10       3   0.55   2
11       3   0.06   1

How to define a function that will check any data frame for Age column and return bins?

The problem lies here:

x = input("Enter Dataframe Name: ") # input() returns a string
df = x # so df is a string too, not a DataFrame
df['Age'] # a string can only be indexed with integers or slices, so this raises a TypeError

This would resolve your problem: pass the DataFrame itself to the function.

def age_range(df):
    bins=[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    labels=['0-9', '10-19', '20s', '30s', '40s', '50s', '60s', '70s', '80s', '90s']
    result = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
    return result

For example, you can run it like this, either adding the result as a new column:

import random

df = pd.DataFrame({'Age' : [random.randint(1, 99) for i in range(500)]})
df["AgeRange"] = age_range(df)

or building a separate dataframe from it:

df = pd.DataFrame({'Age' : [random.randint(1, 99) for i in range(500)]})
AgeRangeDf = pd.DataFrame({"Age_Range" :age_range(df)})
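Since the question asks about checking any data frame for an Age column, a guarded variant might look like the sketch below. The function name age_range_if_present and the choice to return None when the column is missing are assumptions for illustration. Raising an error would be an equally reasonable design.

```python
import pandas as pd

def age_range_if_present(df):
    """Return binned ages, or None if df has no 'Age' column."""
    if "Age" not in df.columns:
        return None
    bins = list(range(0, 101, 10))  # 0, 10, ..., 100
    labels = ['0-9', '10-19', '20s', '30s', '40s',
              '50s', '60s', '70s', '80s', '90s']
    # right=False makes each bin [lower, upper), so 10 falls in '10-19'.
    return pd.cut(df["Age"], bins=bins, labels=labels, right=False)

with_age = pd.DataFrame({"Age": [5, 10, 29, 99]})
without_age = pd.DataFrame({"Height": [170, 182]})

print(age_range_if_present(with_age).tolist())
print(age_range_if_present(without_age))
```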