How to map numeric data into categories / bins in Pandas dataframe
With Pandas, you should avoid row-wise operations, as these usually involve an inefficient Python-level loop. Here are a couple of alternatives.
Pandas: pd.cut
As @JonClements suggests, you can use pd.cut for this, the benefit here being that your new column becomes a Categorical.
You only need to define your boundaries (including np.inf) and category names, then apply pd.cut to the desired numeric column.
bins = [0, 2, 18, 35, 65, np.inf]
names = ['<2', '2-18', '18-35', '35-65', '65+']
df['AgeRange'] = pd.cut(df['Age'], bins, labels=names)
print(df.dtypes)
# Age int64
# Age_units object
# AgeRange category
# dtype: object
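One detail worth checking is which side of each bin is closed. The following is a minimal sketch, assuming a small made-up Series of ages (not the asker's data). By default pd.cut uses right-closed intervals such as (0, 2], so an age of exactly 0 falls outside the first bin and becomes NaN, while right=False switches to left-closed intervals such as [0, 2).

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on the bin boundaries.
ages = pd.Series([0, 2, 18, 40, 70], name="Age")
bins = [0, 2, 18, 35, 65, np.inf]
names = ['<2', '2-18', '18-35', '35-65', '65+']

# Default: intervals are (0, 2], (2, 18], ... so 0 is not in any bin.
right_closed = pd.cut(ages, bins, labels=names)

# right=False: intervals are [0, 2), [2, 18), ... so 0 lands in '<2'.
left_closed = pd.cut(ages, bins, labels=names, right=False)

print(pd.DataFrame({"Age": ages,
                    "right_closed": right_closed,
                    "left_closed": left_closed}))
```

With labels like '2-18' and '18-35', right=False is probably the closer match to how the names read, since an 18-year-old then lands in '18-35'.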
NumPy: np.digitize
np.digitize provides another clean solution. The idea is to define your boundaries and names, create a dictionary, then apply np.digitize to your Age column. Finally, use your dictionary to map your category names.
Note that a value equal to a boundary is assigned to the bin for which that boundary is the lower bound, since np.digitize uses right=False by default.
import pandas as pd, numpy as np
df = pd.DataFrame({'Age': [99, 53, 71, 84, 84],
                   'Age_units': ['Y', 'Y', 'Y', 'Y', 'Y']})
bins = [0, 2, 18, 35, 65]
names = ['<2', '2-18', '18-35', '35-65', '65+']
d = dict(enumerate(names, 1))
df['AgeRange'] = np.vectorize(d.get)(np.digitize(df['Age'], bins))
Result
Age Age_units AgeRange
0 99 Y 65+
1 53 Y 35-65
2 71 Y 65+
3 84 Y 65+
4 84 Y 65+
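To make the boundary behaviour concrete, here is a small sketch with made-up ages that sit exactly on the bin edges. It indexes a NumPy array of names directly instead of going through a dictionary, which is an alternative chosen for this illustration rather than part of the answer above.

```python
import numpy as np

bins = [0, 2, 18, 35, 65]
names = np.array(['<2', '2-18', '18-35', '35-65', '65+'])

# Hypothetical ages, several of them exactly on a boundary.
ages = np.array([1, 2, 18, 65, 99])

# Default right=False: bins[i-1] <= x < bins[i], so 2 maps to '2-18'.
idx_left = np.digitize(ages, bins)
# right=True: bins[i-1] < x <= bins[i], so 2 maps to '<2'.
idx_right = np.digitize(ages, bins, right=True)

# np.digitize returns 1 for the first bin, hence the "- 1" when indexing.
print(names[idx_left - 1])
print(names[idx_right - 1])
```

Values below the first boundary get index 0, so the "- 1" would wrap around to the last name; this sketch assumes all ages lie above the first boundary.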
Map numeric data into bins in Pandas dataframe for separate groups using dictionaries
A simpler solution would be to use groupby and apply a custom function on each group. In this case, we can define a function reclass that obtains the correct bins and ids and then uses pd.cut:
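def reclass(group, name):
    # Look up the bin edges and labels that belong to this group.
    bins = bins_dic[name]
    ids = ids_dic[name]
    return pd.cut(group, bins, labels=ids)

# group_keys=False keeps the original row index so the result aligns with df.
df['id'] = df.groupby('polyid', group_keys=False)['value'].apply(lambda x: reclass(x, x.name))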
Result:
polyid value id
0 1 0.56 1
1 1 0.59 1
2 1 0.62 2
3 1 0.83 3
4 2 0.85 2
5 2 0.01 1
6 2 0.79 2
7 3 0.37 1
8 3 0.99 3
9 3 0.48 1
10 3 0.55 2
11 3 0.06 1
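The answer above does not show bins_dic or ids_dic, so here is a self-contained sketch. The dictionary contents and the four-row frame are hypothetical values invented for illustration; only the reclass pattern comes from the answer. group_keys=False is passed so that the result keeps the original row index on recent pandas versions.

```python
import pandas as pd

# Hypothetical data: two polygons, each with its own bin edges and ids.
df = pd.DataFrame({"polyid": [1, 1, 2, 2],
                   "value": [0.56, 0.83, 0.01, 0.85]})
bins_dic = {1: [0, 0.6, 0.8, 1.0], 2: [0, 0.5, 1.0]}
ids_dic = {1: [1, 2, 3], 2: [1, 2]}

def reclass(group, name):
    # Cut one group's values with the edges and labels stored for that group.
    return pd.cut(group, bins_dic[name], labels=ids_dic[name])

df["id"] = (
    df.groupby("polyid", group_keys=False)["value"]
      .apply(lambda x: reclass(x, x.name))
)
print(df)
```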
Mapping ranges of values in pandas dataframe
There are a few alternatives.
Pandas via pd.cut / NumPy via np.digitize
You can construct a list of boundaries, then use specialist library functions. This is described in @EdChum's solution, and also in this answer.
NumPy via np.select
df = pd.DataFrame(data=np.random.randint(1,10,10), columns=['a'])
criteria = [df['a'].between(1, 3), df['a'].between(4, 7), df['a'].between(8, 10)]
values = [1, 2, 3]
df['b'] = np.select(criteria, values, 0)
The elements of criteria are Boolean series, so for lists of values, you can use df['a'].isin([1, 3]), etc.
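As a sketch of mixing the two kinds of condition, the example below combines between with isin on a small fixed column (the values are made up so the output is reproducible). Rows matching no condition receive the default.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed values instead of random ones.
df = pd.DataFrame({"a": [1, 3, 4, 7, 8, 12]})

criteria = [
    df["a"].between(1, 3),     # inclusive on both ends
    df["a"].between(4, 7),
    df["a"].isin([8, 9, 10]),  # explicit list of values
]
values = [1, 2, 3]

# The first matching condition wins; 12 matches nothing and gets 0.
df["b"] = np.select(criteria, values, default=0)
print(df)
```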
Dictionary mapping via range
d = {range(1, 4): 1, range(4, 8): 2, range(8, 11): 3}
df['c'] = df['a'].apply(lambda x: next((v for k, v in d.items() if x in k), 0))
print(df)
a b c
0 1 1 1
1 7 2 2
2 5 2 2
3 1 1 1
4 3 1 1
5 5 2 2
6 4 2 2
7 4 2 2
8 9 3 3
9 3 1 1
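One caveat with range keys is that membership only succeeds for whole numbers. The sketch below wraps the same lookup in a helper (the name lookup is hypothetical) to show that a non-integer value such as 3.5 falls through to the default.

```python
d = {range(1, 4): 1, range(4, 8): 2, range(8, 11): 3}

def lookup(x, default=0):
    # Return the value of the first range that contains x, else the default.
    return next((v for k, v in d.items() if x in k), default)

print(lookup(3))     # inside range(1, 4)
print(lookup(3.0))   # equal to the integer 3, so it still matches
print(lookup(3.5))   # between integers: no range contains it
print(lookup(11))    # outside every range
```

For float data, pd.cut or np.select is therefore likely the safer choice.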
Grouping numerical values in categories
Here's a way using pd.cut:
df = df.sort_values('GPA')
df['bins'] = pd.cut(df['GPA'], bins=3, labels = ['A','B','C'])
Name GPA bins
3 Ramzi 1.75 A
2 Djamel 2.10 A
1 Betty 2.75 B
4 Alexa 3.15 C
0 Adel 3.50 C
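When bins is an integer, pd.cut splits the observed range of the data into that many equal-width intervals. The sketch below reuses the GPA values from the output above and passes retbins=True to show the edges pandas computed.

```python
import pandas as pd

# GPA values taken from the table above, in index order 0..4.
gpa = pd.Series([3.50, 2.75, 2.10, 1.75, 3.15], name="GPA")

# retbins=True also returns the computed bin edges.
binned, edges = pd.cut(gpa, bins=3, labels=['A', 'B', 'C'], retbins=True)

print(edges)            # four edges spanning the min..max of the data
print(binned.tolist())
```

Because the edges depend on the minimum and maximum of the data, the same GPA can land in a different bin if the data changes; explicit edges avoid that.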
Use a dictionary to key a range of values
Assume you have a dataframe like this:
range value
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
Then you can apply the following function to the 'range' column and store the result in 'value':
def get_value(x):
    # Map a number to a text label; the first matching condition wins.
    if x < 5:
        return 'Below 5'
    elif x < 10:
        return 'Between 5 and 10'
    else:
        return 'Above 10'
df['value'] = df['range'].apply(get_value)
This gives the result you want.
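The same three labels can be produced without calling a Python function per row. This is a sketch using pd.cut with left-closed bins, which is a swapped-in technique rather than the approach of the answer above; the sample values are made up to cover every branch.

```python
import numpy as np
import pandas as pd

# Hypothetical values covering all three branches of get_value.
df = pd.DataFrame({"range": [0, 4, 5, 9, 10, 12]})

# right=False gives [-inf, 5), [5, 10), [10, inf), matching the
# "< 5", "< 10" and "else" branches of the function.
df["value"] = pd.cut(
    df["range"],
    bins=[-np.inf, 5, 10, np.inf],
    labels=['Below 5', 'Between 5 and 10', 'Above 10'],
    right=False,
)
print(df)
```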
Binning a column with pandas
You can use pandas.cut:
bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = pd.cut(df['percentage'], bins)
print (df)
percentage binned
0 46.50 (25, 50]
1 44.20 (25, 50]
2 100.00 (50, 100]
3 42.12 (25, 50]
bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1,2,3,4,5,6]
df['binned'] = pd.cut(df['percentage'], bins=bins, labels=labels)
print (df)
percentage binned
0 46.50 5
1 44.20 5
2 100.00 6
3 42.12 5
Or numpy.searchsorted:
bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = np.searchsorted(bins, df['percentage'].values)
print (df)
percentage binned
0 46.50 5
1 44.20 5
2 100.00 6
3 42.12 5
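The two approaches differ for values outside the bins. A small sketch with made-up percentages: np.searchsorted (default side='left') returns 1-based positions and still returns an index for out-of-range values, whereas pd.cut with labels=False returns 0-based codes and NaN for out-of-range values.

```python
import numpy as np
import pandas as pd

bins = [0, 1, 5, 10, 25, 50, 100]
# Hypothetical values, including a boundary (1.0) and an out-of-range 150.
values = np.array([0.5, 1.0, 46.5, 100.0, 150.0])

# side='left' finds i with bins[i-1] < v <= bins[i], i.e. right-closed bins.
idx = np.searchsorted(bins, values)

# labels=False returns integer bin codes instead of Interval objects.
codes = pd.cut(values, bins, labels=False)

print(idx)
print(codes)
```

For in-range values, idx - 1 equals the pd.cut code.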
...and then value_counts or groupby and aggregate size:
s = pd.cut(df['percentage'], bins=bins).value_counts()
print (s)
(25, 50] 3
(50, 100] 1
(10, 25] 0
(5, 10] 0
(1, 5] 0
(0, 1] 0
Name: percentage, dtype: int64
s = df.groupby(pd.cut(df['percentage'], bins=bins)).size()
print (s)
percentage
(0, 1] 0
(1, 5] 0
(5, 10] 0
(10, 25] 0
(25, 50] 3
(50, 100] 1
dtype: int64
By default, cut returns a Categorical.
Series methods like Series.value_counts() will use all categories, even if some categories are not present in the data, as described under operations in categorical in the pandas documentation.
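If the empty bins are not wanted, they can be dropped afterwards. The sketch below reuses the four percentages from the output above and shows two options: filtering the counts, or passing observed=True to groupby.

```python
import pandas as pd

s = pd.Series([46.50, 44.20, 100.00, 42.12], name="percentage")
bins = [0, 1, 5, 10, 25, 50, 100]
binned = pd.cut(s, bins=bins)

# All six categories are counted, including the empty ones.
counts = binned.value_counts()

# Option 1: keep only bins that actually occur.
nonzero = counts[counts > 0]

# Option 2: let groupby skip unobserved categories.
observed_sizes = s.groupby(binned, observed=True).size()

print(nonzero)
print(observed_sizes)
```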
How to efficiently label each value to a bin after I created the bins by pandas.cut() function?
tl;dr: np.digitize is a good solution.
After reading all the comments and answers here and some more Googling, I think I found a solution that I am pretty satisfied with. Thank you to all of you guys!
Setup
import pandas as pd
import numpy as np
np.random.seed(42)
bins = [0, 10, 15, 20, 25, 30, np.inf]
labels = bins[1:]
ages = list(range(5, 90, 5))
df = pd.DataFrame({"user_age": ages})
df["user_age_bin"] = pd.cut(df["user_age"], bins=bins, labels=False)
# sort by age
print(df.sort_values('user_age'))
Output:
user_age user_age_bin
0 5 0
1 10 0
2 15 1
3 20 2
4 25 3
5 30 4
6 35 5
7 40 5
8 45 5
9 50 5
10 55 5
11 60 5
12 65 5
13 70 5
14 75 5
15 80 5
16 85 5
Assign category:
# a new age value
new_age=30
# use this right=True and '-1' trick to make the bins match
print(np.digitize(new_age, bins=bins, right=True) -1)
Output:
4
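As a quick sanity check of the trick, the sketch below compares np.digitize(..., right=True) - 1 against the pd.cut codes for every age in the setup, and also shows where the two diverge: a value at or below the first edge gives -1 from np.digitize, while pd.cut would give NaN.

```python
import numpy as np
import pandas as pd

bins = [0, 10, 15, 20, 25, 30, np.inf]
ages = np.arange(5, 90, 5)

# Integer bin codes from pandas (right-closed intervals by default).
cut_codes = pd.cut(ages, bins=bins, labels=False)

# right=True makes np.digitize right-closed too; "- 1" shifts to 0-based.
digitize_codes = np.digitize(ages, bins=bins, right=True) - 1

print(np.array_equal(cut_codes, digitize_codes))

# Edge case: 0 is not inside (0, 10], so the shifted index is -1.
print(np.digitize(0, bins=bins, right=True) - 1)
```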
How to convert the continuous numbers into categorical using pandas?
One idea is to use integer division (//) by 10, then multiply by 10, and finally convert to strings (with replace if necessary):
s = df['Val'] // 10 * 10
df['new'] = s.replace(0, 1).astype(str) + '-' + (s + 10).astype(str)
print (df)
Val Val_Cat new
0 1 1-10 1-10
1 15 10-20 10-20
2 2 1-10 1-10
3 91 90-100 90-100
4 52 50-60 50-60
5 126 120-130 120-130
Alternative with f-strings:
df['new'] = df['Val'].map(lambda x: f'{x//10*10}-{(x//10*10)+10}')
print (df)
Val Val_Cat new
0 1 1-10 0-10
1 15 10-20 10-20
2 2 1-10 0-10
3 91 90-100 90-100
4 52 50-60 50-60
5 126 120-130 120-130
Your solution with cut can be adapted like this:
bins = np.arange(0, df['Val'].max() // 10 * 10 + 20, 10)
df['new'] = pd.cut(df.Val, bins = bins, right=False)
print (df)
Val Val_Cat new
0 1 1-10 [0, 10)
1 15 10-20 [10, 20)
2 2 1-10 [0, 10)
3 91 90-100 [90, 100)
4 52 50-60 [50, 60)
5 126 120-130 [120, 130)
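If string labels are preferred over Interval objects, they can be generated from the same edges and passed to pd.cut. This is a sketch under the assumption that labels of the form '0-10' are acceptable for the first bin (rather than '1-10'); the Val column repeats the values shown above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Val": [1, 15, 2, 91, 52, 126]})

# Same edges as above: 0, 10, ..., up to one step past the maximum.
edges = np.arange(0, df["Val"].max() // 10 * 10 + 20, 10)

# One label per interval, built from consecutive edge pairs.
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

# right=False keeps the intervals left-closed, e.g. [10, 20).
df["new"] = pd.cut(df["Val"], bins=edges, labels=labels, right=False).astype(str)
print(df)
```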