How to Map Numeric Data into Categories/Bins in Pandas Dataframe

With Pandas, you should avoid row-wise operations, as these usually involve an inefficient Python-level loop. Here are a couple of alternatives.

Pandas: pd.cut

As @JonClements suggests, you can use pd.cut for this, the benefit here being that your new column becomes a Categorical.

You only need to define your boundaries (including np.inf) and category names, then apply pd.cut to the desired numeric column.

bins = [0, 2, 18, 35, 65, np.inf]
names = ['<2', '2-18', '18-35', '35-65', '65+']

df['AgeRange'] = pd.cut(df['Age'], bins, labels=names)

print(df.dtypes)

# Age             int64
# Age_units      object
# AgeRange     category
# dtype: object
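
One detail worth checking is how pd.cut treats values that sit exactly on a boundary. The sketch below uses made-up ages (an assumption, not the data from the question) to show that the default right=True produces right-closed intervals, so an age of 0 falls outside the first bin and becomes NaN, while right=False closes the bins on the left instead.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on or near the bin edges
ages = pd.Series([0, 2, 18, 65, 70])

bins = [0, 2, 18, 35, 65, np.inf]
names = ['<2', '2-18', '18-35', '35-65', '65+']

# Default (right=True): intervals are (0, 2], (2, 18], ... so 0 is not covered
right_closed = pd.cut(ages, bins, labels=names)

# right=False: intervals are [0, 2), [2, 18), ... so 0 is covered
left_closed = pd.cut(ages, bins, labels=names, right=False)

print(pd.DataFrame({'age': ages,
                    'right_closed': right_closed,
                    'left_closed': left_closed}))
```

If the lowest edge should be included while keeping right-closed bins, pd.cut also accepts include_lowest=True.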

NumPy: np.digitize

np.digitize provides another clean solution. The idea is to define your boundaries and names, create a dictionary, then apply np.digitize to your Age column. Finally, use your dictionary to map your category names.

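Note that np.digitize treats each bin as closed on the left by default (right=False), so a value equal to a boundary falls in the bin that starts at that boundary: an age of exactly 18 maps to '18-35'. pd.cut defaults to the opposite (right=True), which would put 18 in '2-18'.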

import pandas as pd, numpy as np

df = pd.DataFrame({'Age': [99, 53, 71, 84, 84],
                   'Age_units': ['Y', 'Y', 'Y', 'Y', 'Y']})

bins = [0, 2, 18, 35, 65]
names = ['<2', '2-18', '18-35', '35-65', '65+']

d = dict(enumerate(names, 1))

df['AgeRange'] = np.vectorize(d.get)(np.digitize(df['Age'], bins))

Result

   Age Age_units AgeRange
0   99         Y      65+
1   53         Y    35-65
2   71         Y      65+
3   84         Y      65+
4   84         Y      65+
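
To make the boundary behaviour of np.digitize concrete, here is a small sketch with hypothetical ages placed on and near the edges. It compares the default right=False with right=True using the same bins and names as above.

```python
import numpy as np

bins = [0, 2, 18, 35, 65]
names = ['<2', '2-18', '18-35', '35-65', '65+']

# Hypothetical ages; all are above 0, so every index is at least 1
ages = np.array([1, 2, 18, 64, 65])

# right=False (default): bins[i-1] <= x < bins[i], an edge value opens a new bin
idx_left = np.digitize(ages, bins)

# right=True: bins[i-1] < x <= bins[i], an edge value closes the previous bin
idx_right = np.digitize(ages, bins, right=True)

labels_left = [names[i - 1] for i in idx_left]
labels_right = [names[i - 1] for i in idx_right]

print(labels_left)
print(labels_right)
```

Values below the first edge return index 0, so real data with such values would need separate handling before the lookup.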

Map numeric data into bins in Pandas dataframe for separate groups using dictionaries

A simpler solution would be to use groupby and apply a custom function on each group. In this case, we can define a function reclass that obtains the correct bins and ids and then uses pd.cut:

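def reclass(group, name):
    # bins_dic and ids_dic are the per-group dictionaries from the question
    bins = bins_dic[name]
    ids = ids_dic[name]
    return pd.cut(group, bins, labels=ids)

# group_keys=False keeps the original row index so the result aligns with df
df['id'] = df.groupby('polyid', group_keys=False)['value'].apply(lambda x: reclass(x, x.name))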

Result:

    polyid  value  id
0        1   0.56   1
1        1   0.59   1
2        1   0.62   2
3        1   0.83   3
4        2   0.85   2
5        2   0.01   1
6        2   0.79   2
7        3   0.37   1
8        3   0.99   3
9        3   0.48   1
10       3   0.55   2
11       3   0.06   1
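
The bins_dic and ids_dic dictionaries are not shown on this page, so the sketch below invents small ones purely for illustration. It is also a variation rather than the answer's exact method: it loops over the groups explicitly and uses labels=False, which sidesteps any index or dtype surprises from combining per-group categoricals.

```python
import pandas as pd

# Hypothetical data and per-group definitions (not the question's real values)
df = pd.DataFrame({'polyid': [1, 1, 1, 2, 2],
                   'value': [0.1, 0.5, 0.9, 0.2, 0.8]})
bins_dic = {1: [0, 0.3, 0.6, 1.0], 2: [0, 0.5, 1.0]}
ids_dic = {1: [10, 20, 30], 2: [100, 200]}

pieces = []
for name, group in df.groupby('polyid')['value']:
    # labels=False returns the 0-based position of the bin for each value
    codes = pd.cut(group, bins_dic[name], labels=False)
    # Translate bin positions into this group's ids
    pieces.append(codes.map(dict(enumerate(ids_dic[name]))))

# The pieces keep the original row index, so assignment aligns row by row
df['id'] = pd.concat(pieces)
print(df)
```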

Mapping ranges of values in pandas dataframe

There are a few alternatives.

Pandas via pd.cut / NumPy via np.digitize

You can construct a list of boundaries, then use specialist library functions. This is described in @EdChum's solution, and also in this answer.

NumPy via np.select

df = pd.DataFrame(data=np.random.randint(1,10,10), columns=['a'])

criteria = [df['a'].between(1, 3), df['a'].between(4, 7), df['a'].between(8, 10)]
values = [1, 2, 3]

df['b'] = np.select(criteria, values, 0)

The elements of criteria are Boolean series, so for lists of values, you can use df['a'].isin([1, 3]), etc.
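
As a minimal sketch of the isin variant, assuming a small made-up column, the conditions can be built from explicit lists of values instead of ranges:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 5, 8, 2]})  # hypothetical values

# Each condition is a Boolean Series; the first matching condition wins
criteria = [df['a'].isin([1, 3]), df['a'].isin([5, 8])]
values = [1, 2]

# Rows matching no condition receive the default, 0
df['b'] = np.select(criteria, values, default=0)
print(df)
```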

Dictionary mapping via range

d = {range(1, 4): 1, range(4, 8): 2, range(8, 11): 3}

df['c'] = df['a'].apply(lambda x: next((v for k, v in d.items() if x in k), 0))

print(df)

   a  b  c
0  1  1  1
1  7  2  2
2  5  2  2
3  1  1  1
4  3  1  1
5  5  2  2
6  4  2  2
7  4  2  2
8  9  3  3
9  3  1  1

Grouping numerical values in categories

Here's a way using pd.cut. Passing an integer (bins=3) splits the observed GPA range into three equal-width bins:

df = df.sort_values('GPA')

df['bins'] = pd.cut(df['GPA'], bins=3, labels=['A', 'B', 'C'])

     Name   GPA bins
3   Ramzi  1.75    A
2  Djamel  2.10    A
1   Betty  2.75    B
4   Alexa  3.15    C
0    Adel  3.50    C
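
To see which edges pd.cut actually chose for bins=3, retbins=True returns them alongside the result. The frame below is reconstructed from the output shown above, which is an assumption about the original input.

```python
import pandas as pd

# Reconstructed from the printed result above (assumed original input)
df = pd.DataFrame({'Name': ['Adel', 'Betty', 'Djamel', 'Ramzi', 'Alexa'],
                   'GPA': [3.50, 2.75, 2.10, 1.75, 3.15]})

# retbins=True also returns the edges that pd.cut computed
df['bins'], edges = pd.cut(df['GPA'], bins=3, labels=['A', 'B', 'C'], retbins=True)

print(df)
print(edges)
```

The first edge is nudged slightly below the minimum GPA so that the smallest value still lands inside the first bin.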

Use a dictionary to key a range of values

Assume you have a dataframe like this:

   range  value
0      0      0
1      1      1
2      2      2
3      3      3
4      4      4
5      5      5
6      6      6
7      7      7
8      8      8
9      9      9

Then you can apply the following function to the 'range' column and store the result in 'value':

def get_value(x):
    if x < 5:
        return 'Below 5'
    elif x < 10:
        return 'Between 5 and 10'
    else:
        return 'Above 10'

# Apply to the single column rather than row by row with axis=1
df['value'] = df['range'].apply(get_value)

This gives the result you want.
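
The same mapping can be expressed without a Python-level function. This is a sketch of one possible vectorised equivalent using pd.cut, with a hypothetical column that extends past 10 so that all three labels appear:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'range': range(12)})  # hypothetical values 0..11

# right=False gives [-inf, 5), [5, 10), [10, inf), matching the < comparisons
df['value'] = pd.cut(df['range'],
                     bins=[-np.inf, 5, 10, np.inf],
                     labels=['Below 5', 'Between 5 and 10', 'Above 10'],
                     right=False)
print(df)
```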

Binning a column with pandas

You can use pandas.cut:

bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = pd.cut(df['percentage'], bins)
print(df)
   percentage     binned
0       46.50   (25, 50]
1       44.20   (25, 50]
2      100.00  (50, 100]
3       42.12   (25, 50]


bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1,2,3,4,5,6]
df['binned'] = pd.cut(df['percentage'], bins=bins, labels=labels)
print(df)
   percentage binned
0       46.50      5
1       44.20      5
2      100.00      6
3       42.12      5

Or numpy.searchsorted:

bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = np.searchsorted(bins, df['percentage'].values)
print(df)
   percentage  binned
0       46.50       5
1       44.20       5
2      100.00       6
3       42.12       5

...and then value_counts or groupby and aggregate size:

s = pd.cut(df['percentage'], bins=bins).value_counts()
print(s)
(25, 50]     3
(50, 100]    1
(10, 25]     0
(5, 10]      0
(1, 5]       0
(0, 1]       0
Name: percentage, dtype: int64


s = df.groupby(pd.cut(df['percentage'], bins=bins)).size()
print(s)
percentage
(0, 1]       0
(1, 5]       0
(5, 10]      0
(10, 25]     0
(25, 50]     3
(50, 100]    1
dtype: int64

By default, cut returns a Categorical.

Series methods such as Series.value_counts() use all categories, even those that do not appear in the data, which is why the empty bins above are listed with a count of 0.
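
If the empty bins are not wanted, one option is to filter the counts afterwards. A minimal sketch, reusing the percentages shown above:

```python
import pandas as pd

df = pd.DataFrame({'percentage': [46.50, 44.20, 100.00, 42.12]})
bins = [0, 1, 5, 10, 25, 50, 100]

binned = pd.cut(df['percentage'], bins=bins)

# sort=False avoids ordering the result by count
counts = binned.value_counts(sort=False)

# Keep only the bins that actually occur in the data
non_empty = counts[counts > 0]
print(non_empty)
```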

How to efficiently label each value to a bin after I created the bins by pandas.cut() function?

tl;dr: np.digitize is a good solution.

After reading all the comments and answers here and some more Googling, I think I have a solution that I am pretty satisfied with. Thank you to all of you!

Setup

import pandas as pd
import numpy as np
np.random.seed(42)

bins = [0, 10, 15, 20, 25, 30, np.inf]
labels = bins[1:]
ages = list(range(5, 90, 5))
df = pd.DataFrame({"user_age": ages})
df["user_age_bin"] = pd.cut(df["user_age"], bins=bins, labels=False)

# sort by age
print(df.sort_values('user_age'))

Output:

    user_age  user_age_bin
0          5             0
1         10             0
2         15             1
3         20             2
4         25             3
5         30             4
6         35             5
7         40             5
8         45             5
9         50             5
10        55             5
11        60             5
12        65             5
13        70             5
14        75             5
15        80             5
16        85             5

Assign category:

# a new age value
new_age = 30

# right=True closes each bin on the right, as pd.cut does by default;
# subtracting 1 turns np.digitize's 1-based index into pd.cut's 0-based label
print(np.digitize(new_age, bins=bins, right=True) - 1)

Output:

4
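
As a quick sanity check, the two approaches can be compared over the whole range of ages used in the setup. This sketch only confirms agreement for these particular values:

```python
import numpy as np
import pandas as pd

bins = [0, 10, 15, 20, 25, 30, np.inf]
ages = np.arange(5, 90, 5)

# Integer bin labels from pd.cut (right-closed bins by default)
via_cut = pd.cut(ages, bins=bins, labels=False)

# Same labels from np.digitize with right=True, shifted to start at 0
via_digitize = np.digitize(ages, bins=bins, right=True) - 1

print(np.array_equal(via_cut, via_digitize))
```

The two differ for values at or below the first edge: pd.cut returns NaN for an age of 0, while this np.digitize expression returns -1.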

How to convert the continuous numbers into categorical using pandas?

One idea is to use integer division by 10 (//), then multiply by 10, and finally convert to strings (with replace if necessary):

s = df['Val'] // 10 * 10
df['new'] = s.replace(0, 1).astype(str) + '-' + (s + 10).astype(str)
print(df)
   Val  Val_Cat      new
0    1     1-10     1-10
1   15    10-20    10-20
2    2     1-10     1-10
3   91   90-100   90-100
4   52    50-60    50-60
5  126  120-130  120-130

Alternative with f-strings (note that the first bin is labelled 0-10 here rather than 1-10):

df['new'] = df['Val'].map(lambda x: f'{x//10*10}-{(x//10*10)+10}')
print(df)
   Val  Val_Cat      new
0    1     1-10     0-10
1   15    10-20    10-20
2    2     1-10     0-10
3   91   90-100   90-100
4   52    50-60    50-60
5  126  120-130  120-130

Your solution with cut can be changed to:

bins = np.arange(0, df['Val'].max() // 10 * 10 + 20, 10)

df['new'] = pd.cut(df.Val, bins = bins, right=False)
print(df)
   Val  Val_Cat         new
0    1     1-10     [0, 10)
1   15    10-20    [10, 20)
2    2     1-10     [0, 10)
3   91   90-100   [90, 100)
4   52    50-60    [50, 60)
5  126  120-130  [120, 130)
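
If string labels such as 10-20 are preferred over interval notation, they can be generated from the bin edges and passed to pd.cut. A minimal sketch, assuming the Val column shown above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Val': [1, 15, 2, 91, 52, 126]})

bins = np.arange(0, df['Val'].max() // 10 * 10 + 20, 10)

# One 'lo-hi' label per pair of neighbouring edges
labels = [f'{int(lo)}-{int(hi)}' for lo, hi in zip(bins[:-1], bins[1:])]

df['new'] = pd.cut(df['Val'], bins=bins, labels=labels, right=False)
print(df)
```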

