Binning a Column With Python Pandas

You can use pandas.cut:

import pandas as pd

df = pd.DataFrame({'percentage': [46.50, 44.20, 100.00, 42.12]})

bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = pd.cut(df['percentage'], bins)
print(df)
   percentage     binned
0       46.50   (25, 50]
1       44.20   (25, 50]
2      100.00  (50, 100]
3       42.12   (25, 50]


Pass labels to replace the interval notation with your own values:

bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1, 2, 3, 4, 5, 6]
df['binned'] = pd.cut(df['percentage'], bins=bins, labels=labels)
print(df)
   percentage binned
0       46.50      5
1       44.20      5
2      100.00      6
3       42.12      5

Or numpy.searchsorted:

import numpy as np

bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = np.searchsorted(bins, df['percentage'].values)
print(df)
   percentage  binned
0       46.50       5
1       44.20       5
2      100.00       6
3       42.12       5
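
The agreement between the two approaches on this data can be checked directly. The following is a minimal sketch, assuming the same four sample values as above. The match only holds for values above the first edge and at or below the last one: outside that range pd.cut yields NaN, while np.searchsorted returns 0 or len(bins).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"percentage": [46.50, 44.20, 100.00, 42.12]})
bins = [0, 1, 5, 10, 25, 50, 100]

# labels=False makes pd.cut return 0-based bin positions instead of intervals.
cut_codes = pd.cut(df["percentage"], bins=bins, labels=False)

# side="left" (the default) returns i such that bins[i-1] < v <= bins[i],
# which lines up with the right-closed intervals of pd.cut, shifted by one.
ss_codes = np.searchsorted(bins, df["percentage"].to_numpy())

print((cut_codes + 1).tolist())  # [5, 5, 6, 5]
print(ss_codes.tolist())         # [5, 5, 6, 5]

# Values outside the bin range are where the two approaches differ.
outside = [0.0, 150.0]
cut_outside = pd.cut(outside, bins=bins, labels=False)  # NaN for both values
ss_outside = np.searchsorted(bins, outside)             # 0 and len(bins)
print(cut_outside, ss_outside)
```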

...and then value_counts or groupby and aggregate size:

s = pd.cut(df['percentage'], bins=bins).value_counts()
print(s)
(25, 50]     3
(50, 100]    1
(10, 25]     0
(5, 10]      0
(1, 5]       0
(0, 1]       0
Name: percentage, dtype: int64


# observed=False keeps the empty bins in the result
s = df.groupby(pd.cut(df['percentage'], bins=bins), observed=False).size()
print(s)
percentage
(0, 1]       0
(1, 5]       0
(5, 10]      0
(10, 25]     0
(25, 50]     3
(50, 100]    1
dtype: int64

By default, cut returns a categorical Series.

Series methods such as Series.value_counts() report every category, even those that do not occur in the data.
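
The effect of the categorical dtype on counting can be illustrated as follows. This is a small sketch that assumes the same four sample values as above; the counts > 0 filter is just one possible way to drop the empty bins.

```python
import pandas as pd

df = pd.DataFrame({"percentage": [46.50, 44.20, 100.00, 42.12]})
bins = [0, 1, 5, 10, 25, 50, 100]

binned = pd.cut(df["percentage"], bins=bins)
print(binned.dtype)  # a categorical dtype with one category per bin

# value_counts on a categorical Series lists every category,
# including the bins that contain no rows.
counts = binned.value_counts(sort=False)
print(counts)

# Keep only the bins that actually occur in the data.
non_empty = counts[counts > 0]
print(non_empty)
```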

How to bin data in pandas dataframe

Use pd.cut (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html):

import numpy as np
import pandas as pd

bins = [-np.inf, 10000, 100000, 500000, np.inf]
labels = ['amount < 10000', 'amount >=10000 & <100000', 'amount >=100000 & <500000', 'amount >= 500000']
# right=False makes each bin left-closed, e.g. [10000, 100000)
df['bins'] = pd.cut(df.amount, bins=bins, labels=labels, right=False, include_lowest=True)

OUTPUT:

         date   amount  type                      bins
0  2018-09-28   4000.0     D            amount < 10000
1  2018-11-23   2000.0     D            amount < 10000
2  2018-12-27     52.5     D            amount < 10000
3  2018-10-02  20000.0     D  amount >=10000 & <100000
4  2018-11-27   4000.0     C            amount < 10000
5  2018-06-01    500.0     D            amount < 10000
6  2018-07-02   5000.0     D            amount < 10000
7  2018-07-02     52.5     D            amount < 10000
8  2018-10-31    500.0     D            amount < 10000
9  2018-11-26   2000.0     C            amount < 10000

NOTE: labels needs one entry per bin (one fewer than the number of bin edges); edit the entries of the list to rename the bins.
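
To see how right=False treats values that sit exactly on a bin edge, here is a minimal sketch. The amounts are hypothetical values chosen to land on or just below the edges; they are not taken from the data above.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts placed on or next to the bin edges.
amounts = pd.Series([9999.99, 10000.0, 99999.99, 100000.0, 500000.0])

bins = [-np.inf, 10000, 100000, 500000, np.inf]
labels = [
    "amount < 10000",
    "amount >=10000 & <100000",
    "amount >=100000 & <500000",
    "amount >= 500000",
]

# right=False closes each bin on the left: [-inf, 10000), [10000, 100000), ...
# so a value equal to an edge goes into the bin that starts at that edge.
binned = pd.cut(amounts, bins=bins, labels=labels, right=False)

for amount, label in zip(amounts, binned):
    print(amount, "->", label)
```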

Binning multiple columns using two groupby-ed columns pandas

I'm not sure this is exactly what you want, but maybe you could use parts of it.

Provided your base dataframe is named df, you can start with using pd.cut() to bin the columns duration_hours and interval_hours:

bins = range(int(df.duration_hours.max()) + 2)
df["dur"] = pd.cut(df.duration_hours, bins, right=False)
bins = range(0, int(df.interval_hours.max()) + 51, 50)
df["int"] = pd.cut(df.interval_hours, bins, right=False)

Then .melt() the result into a new dataframe df_res

df_res = df.melt(
    id_vars=["Object", "state"], value_vars=["dur", "int"],
    value_name="Bins", var_name="Variable",
)

and groupby() and .sum() over most of it to get the Sample column

group = ["Object", "state", "Variable", "Bins"]
df_res = (
    df_res[group].assign(Sample=1).groupby(group, observed=True).sum()
)

and use it to build the Prob column (by .groupby()-transform().sum() over the first three index levels):

df_res["Prob"] = (
    df_res.Sample / df_res.groupby(level=[0, 1, 2]).Sample.transform('sum')
)

Result for

df =
  Object  state  duration_hours  interval_hours
0      A      1            0.06               0
1      A      1            0.87              34
2      A      1            1.50              80
3      A      2           18.00               0
4      B      1            7.00               0
5      C      1            0.30               0
6      C      2            3.00               0
7      C      2            4.00              12

is

                                 Sample      Prob
Object state Variable Bins
A      1     dur      [0, 1)          2  0.666667
                      [1, 2)          1  0.333333
             int      [0, 50)         2  0.666667
                      [50, 100)       1  0.333333
       2     dur      [18, 19)        1  1.000000
             int      [0, 50)         1  1.000000
B      1     dur      [7, 8)          1  1.000000
             int      [0, 50)         1  1.000000
C      1     dur      [0, 1)          1  1.000000
             int      [0, 50)         1  1.000000
       2     dur      [3, 4)          1  0.500000
                      [4, 5)          1  0.500000
             int      [0, 50)         2  1.000000
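
For convenience, the steps above can be assembled into one script that runs on the sample frame. This is only a sketch put together from the snippets in this answer: the DataFrame construction is reconstructed from the printed input, the names dur_bins and int_bins are introduced here for readability, and wrapping range in list is just a precaution.

```python
import pandas as pd

# Sample frame reconstructed from the printed input above.
df = pd.DataFrame({
    "Object": ["A", "A", "A", "A", "B", "C", "C", "C"],
    "state": [1, 1, 1, 2, 1, 1, 2, 2],
    "duration_hours": [0.06, 0.87, 1.50, 18.00, 7.00, 0.30, 3.00, 4.00],
    "interval_hours": [0, 34, 80, 0, 0, 0, 0, 12],
})

# Left-closed bins: width 1 for durations, width 50 for intervals.
dur_bins = list(range(int(df.duration_hours.max()) + 2))
df["dur"] = pd.cut(df.duration_hours, dur_bins, right=False)
int_bins = list(range(0, int(df.interval_hours.max()) + 51, 50))
df["int"] = pd.cut(df.interval_hours, int_bins, right=False)

# Long format: one row per observation and variable ("dur" or "int").
df_res = df.melt(
    id_vars=["Object", "state"], value_vars=["dur", "int"],
    value_name="Bins", var_name="Variable",
)

# Count observations per bin ...
group = ["Object", "state", "Variable", "Bins"]
df_res = df_res[group].assign(Sample=1).groupby(group, observed=True).sum()

# ... and normalise within each (Object, state, Variable) group.
df_res["Prob"] = (
    df_res.Sample / df_res.groupby(level=[0, 1, 2]).Sample.transform("sum")
)
print(df_res)
```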

Binning Percent Ranges with Labels Python Pandas

You can use pandas.cut, as long as your values are numerical (ordered).

df["Trump2020_bin"] = pd.cut(
    df["Trump2020"],
    [0, 50, 65, 100],
    labels=["lose", "win", "landslide"],
)
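
One detail worth checking with these edges is what happens at the boundaries. The sketch below uses hypothetical percentages (not real results) placed on or near the bin edges, and the column names default and with_lowest are made up for the illustration. With the default right-closed bins a value of exactly 0 falls outside (0, 50] and becomes NaN, unless include_lowest=True is passed.

```python
import pandas as pd

# Hypothetical percentages chosen to sit on or near the bin edges.
df = pd.DataFrame({"Trump2020": [0.0, 50.0, 50.1, 65.0, 70.0, 100.0]})

edges = [0, 50, 65, 100]
labels = ["lose", "win", "landslide"]

# Default bins are right-closed: (0, 50], (50, 65], (65, 100].
df["default"] = pd.cut(df["Trump2020"], edges, labels=labels)

# include_lowest=True also admits the left-most edge, so 0 is binned as "lose".
df["with_lowest"] = pd.cut(
    df["Trump2020"], edges, labels=labels, include_lowest=True
)
print(df)
```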

