Binning a column with Python Pandas
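You can use pandas.cut:
import pandas as pd
# Sample data matching the output shown below
df = pd.DataFrame({'percentage': [46.50, 44.20, 100.00, 42.12]})
bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = pd.cut(df['percentage'], bins)
print(df)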
percentage binned
0 46.50 (25, 50]
1 44.20 (25, 50]
2 100.00 (50, 100]
3 42.12 (25, 50]
# Pass labels to get custom bin names instead of interval notation
bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1, 2, 3, 4, 5, 6]
df['binned'] = pd.cut(df['percentage'], bins=bins, labels=labels)
print(df)
percentage binned
0 46.50 5
1 44.20 5
2 100.00 6
3 42.12 5
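One detail worth checking before relying on the bins above is how values on or outside the edges are treated. The following is a minimal sketch, assuming a few hypothetical sample values that are not part of the original data: by default the intervals are right-closed, so a value equal to the lowest edge is not binned unless include_lowest=True is passed, and values above the last edge become NaN.

```python
import pandas as pd

# Hypothetical values chosen to sit on or outside the bin edges.
s = pd.Series([0.0, 1.0, 46.5, 100.0, 120.0])
bins = [0, 1, 5, 10, 25, 50, 100]

# Default: intervals are right-closed, so 0.0 falls outside (0, 1].
default_cut = pd.cut(s, bins)

# include_lowest=True makes the first interval include its left edge.
with_lowest = pd.cut(s, bins, include_lowest=True)

print(default_cut.isna().tolist())  # [True, False, False, False, True]
print(with_lowest.isna().tolist())  # [False, False, False, False, True]
```

If out-of-range values are expected, extending the edge list (for example with -np.inf and np.inf) is one way to avoid the NaN results.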
Or numpy.searchsorted:
import numpy as np
bins = [0, 1, 5, 10, 25, 50, 100]
# Returns, for each value, the index at which it would be inserted into bins
df['binned'] = np.searchsorted(bins, df['percentage'].values)
print(df)
percentage binned
0 46.50 5
1 44.20 5
2 100.00 6
3 42.12 5
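To see why the searchsorted result lines up with the labels 1 to 6 used with cut, the two can be compared directly. This is a small sketch using the same four percentages; the variable names idx and codes are only illustrative.

```python
import numpy as np
import pandas as pd

bins = [0, 1, 5, 10, 25, 50, 100]
values = np.array([46.50, 44.20, 100.00, 42.12])

# side="left" (the default) returns i such that bins[i-1] < v <= bins[i],
# which matches the right-closed intervals that pd.cut produces.
idx = np.searchsorted(bins, values, side="left")

# pd.cut with labels=False returns zero-based bin positions instead.
codes = pd.cut(values, bins, labels=False)

print(idx.tolist())                      # [5, 5, 6, 5]
print([int(c) + 1 for c in codes])       # [5, 5, 6, 5]
```

One difference to keep in mind: searchsorted never returns NaN. A value at or below the first edge gets index 0 and a value above the last edge gets len(bins), whereas cut marks both as missing.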
...and then value_counts or groupby and aggregate size:
s = pd.cut(df['percentage'], bins=bins).value_counts()
print(s)
(25, 50] 3
(50, 100] 1
(10, 25] 0
(5, 10] 0
(1, 5] 0
(0, 1] 0
Name: percentage, dtype: int64
# observed=False keeps empty bins in the result, as in the output below
s = df.groupby(pd.cut(df['percentage'], bins=bins), observed=False).size()
print(s)
percentage
(0, 1] 0
(1, 5] 0
(5, 10] 0
(10, 25] 0
(25, 50] 3
(50, 100] 1
dtype: int64
By default, cut returns a categorical. Series methods like Series.value_counts() use all categories, even if some categories are not present in the data (see the operations section of the pandas categorical documentation).
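If the zero-count bins are not wanted, one option is to drop the unused categories before counting. The sketch below rebuilds the four-row example and compares both results; remove_unused_categories is a standard method of the .cat accessor, while the variable names are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({"percentage": [46.50, 44.20, 100.00, 42.12]})
bins = [0, 1, 5, 10, 25, 50, 100]
binned = pd.cut(df["percentage"], bins=bins)

# All six categories are counted, including the empty ones.
all_counts = binned.value_counts()

# Dropping unused categories first keeps only bins that contain data.
used_counts = binned.cat.remove_unused_categories().value_counts()

print(len(all_counts), len(used_counts))  # 6 2
```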
How to bin data in pandas dataframe
Use pd.cut (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html):
import numpy as np
bins = [-np.inf,10000,100000,500000,np.inf]
labels = ['amount < 10000' ,'amount >=10000 & <100000','amount >=100000 & <500000', 'amount >= 500000']
df['bins'] = pd.cut(df.amount, bins=bins, labels=labels, right=False, include_lowest=True)
Output:
date amount type bins
0 2018-09-28 4000.0 D amount < 10000
1 2018-11-23 2000.0 D amount < 10000
2 2018-12-27 52.5 D amount < 10000
3 2018-10-02 20000.0 D amount >=10000 & <100000
4 2018-11-27 4000.0 C amount < 10000
5 2018-06-01 500.0 D amount < 10000
6 2018-07-02 5000.0 D amount < 10000
7 2018-07-02 52.5 D amount < 10000
8 2018-10-31 500.0 D amount < 10000
9 2018-11-26 2000.0 C amount < 10000
NOTE: you can manipulate the labels list if required.
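Because the labels promise ">=" on the lower edge and "<" on the upper edge, it can help to confirm that right=False really produces that behaviour. Here is a minimal, self-contained sketch; the amounts are hypothetical values placed on and around the bin edges, not rows from the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts placed on and around the bin edges.
df = pd.DataFrame({"amount": [52.5, 9999.99, 10000.0, 100000.0, 500000.0]})

bins = [-np.inf, 10000, 100000, 500000, np.inf]
labels = ["amount < 10000", "amount >=10000 & <100000",
          "amount >=100000 & <500000", "amount >= 500000"]

# right=False makes every interval left-closed: [low, high).
df["bins"] = pd.cut(df["amount"], bins=bins, labels=labels, right=False)
print(df)
```

With these inputs, 10000.0 lands in the second bin and 500000.0 in the last one, which agrees with the wording of the labels.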
Binning multiple columns using two groupby-ed columns pandas
I'm not sure this is exactly what you want, but maybe you could use parts of it.
Provided your base dataframe is named df, you can start by using pd.cut() to bin the columns duration_hours and interval_hours:
bins = range(int(df.duration_hours.max()) + 2)
df["dur"] = pd.cut(df.duration_hours, bins, right=False)
bins = range(0, int(df.interval_hours.max()) + 51, 50)
df["int"] = pd.cut(df.interval_hours, bins, right=False)
Then .melt() the result into a new dataframe df_res:
df_res = df.melt(
id_vars=["Object", "state"], value_vars=["dur", "int"],
value_name="Bins", var_name="Variable",
)
Next, groupby() and .sum() over most of it to get the Sample column:
group = ["Object", "state", "Variable", "Bins"]
df_res = (
df_res[group].assign(Sample=1).groupby(group, observed=True).sum()
)
Finally, use it to build the Prob column (by .groupby() and .transform('sum') over the first three index levels):
df_res["Prob"] = (
df_res.Sample / df_res.groupby(level=[0, 1, 2]).Sample.transform('sum')
)
Result for this df:
Object state duration_hours interval_hours
0 A 1 0.06 0
1 A 1 0.87 34
2 A 1 1.50 80
3 A 2 18.00 0
4 B 1 7.00 0
5 C 1 0.30 0
6 C 2 3.00 0
7 C 2 4.00 12
is
                                Sample      Prob
Object state Variable Bins
A      1     dur      [0, 1)         2  0.666667
                      [1, 2)         1  0.333333
             int      [0, 50)        2  0.666667
                      [50, 100)      1  0.333333
       2     dur      [18, 19)       1  1.000000
             int      [0, 50)        1  1.000000
B      1     dur      [7, 8)         1  1.000000
             int      [0, 50)        1  1.000000
C      1     dur      [0, 1)         1  1.000000
             int      [0, 50)        1  1.000000
       2     dur      [3, 4)         1  0.500000
                      [4, 5)         1  0.500000
             int      [0, 50)        2  1.000000
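The normalisation step (dividing each count by its group total via transform('sum')) may be easier to follow in isolation. The sketch below uses a small hypothetical Series of counts with a two-level index rather than the full four-level one, so it illustrates the idea only.

```python
import pandas as pd

# Hypothetical counts indexed by (Object, Bins); bin names are plain strings here.
counts = pd.Series(
    [2, 1, 1, 1],
    index=pd.MultiIndex.from_tuples(
        [("A", "[0, 1)"), ("A", "[1, 2)"), ("C", "[3, 4)"), ("C", "[4, 5)")],
        names=["Object", "Bins"],
    ),
    name="Sample",
)

# transform("sum") broadcasts each group's total back onto every row of that group.
totals = counts.groupby(level="Object").transform("sum")

# Dividing row-wise then gives the share of each bin within its group.
prob = counts / totals
print(prob)
```

Because transform returns a result aligned with the original index, the division needs no merge or reindex, and the shares within each group add up to 1.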
Binning Percent Ranges with Labels Python Pandas
You can use pandas.cut, as long as your values are numerical (ordered).
df["Trump2020_bin"] = pd.cut(
df["Trump2020"],
[0, 50, 65, 100],
labels=["lose", "win", "landslide"],
)
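A runnable version of the same call might look like the sketch below. The vote shares are hypothetical numbers made up for illustration, chosen so that some sit exactly on the bin edges.

```python
import pandas as pd

# Hypothetical vote shares in percent; not real results.
df = pd.DataFrame({"Trump2020": [35.0, 50.0, 57.3, 65.0, 80.1]})

df["Trump2020_bin"] = pd.cut(
    df["Trump2020"],
    [0, 50, 65, 100],
    labels=["lose", "win", "landslide"],
)

print(df["Trump2020_bin"].tolist())
# ['lose', 'lose', 'win', 'win', 'landslide']
```

Since the intervals are right-closed, exactly 50 counts as "lose" and exactly 65 as "win"; passing right=False would shift both edge cases into the next label. A share of exactly 0 would fall outside the first interval unless include_lowest=True is added.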