How to Map Numeric Data into Categories/Bins in Pandas Dataframe

With Pandas, you should avoid row-wise operations, as these usually involve an inefficient Python-level loop. Here are a couple of alternatives.

Pandas: `pd.cut`

As @JonClements suggests, you can use pd.cut for this, the benefit here being that your new column becomes a Categorical.

You only need to define your boundaries (including np.inf) and category names, then apply pd.cut to the desired numeric column.

bins = [0, 2, 18, 35, 65, np.inf]
names = ['<2', '2-18', '18-35', '35-65', '65+']

df['AgeRange'] = pd.cut(df['Age'], bins, labels=names)

print(df.dtypes)

# Age             int64
# Age_units      object
# AgeRange     category
# dtype: object
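
One thing worth checking is how values that sit exactly on a boundary are assigned. The sketch below uses made-up ages (not the data from the question) to compare the default right-closed intervals with right=False; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit exactly on bin boundaries.
ages = pd.Series([0, 2, 18, 65, 70])

bins = [0, 2, 18, 35, 65, np.inf]
names = ['<2', '2-18', '18-35', '35-65', '65+']

# Default (right=True): intervals are (0, 2], (2, 18], ...
# so 0 falls outside every bin and becomes NaN.
right_closed = pd.cut(ages, bins, labels=names)

# right=False: intervals are [0, 2), [2, 18), ...
# so 0 is kept and each boundary value moves up one bin.
left_closed = pd.cut(ages, bins, labels=names, right=False)

print(pd.DataFrame({'age': ages,
                    'right_closed': right_closed,
                    'left_closed': left_closed}))
```

If your data can contain 0 and you want to keep right-closed bins, passing include_lowest=True to pd.cut is another option.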

NumPy: `np.digitize`

np.digitize provides another clean solution. The idea is to define your boundaries and names, create a dictionary, then apply np.digitize to your Age column. Finally, use your dictionary to map your category names.

Note that with the default right=False, a value equal to a boundary is assigned to the bin that starts at that boundary, so an age of exactly 18 maps to '18-35'.

import pandas as pd, numpy as np

df = pd.DataFrame({'Age': [99, 53, 71, 84, 84],
                   'Age_units': ['Y', 'Y', 'Y', 'Y', 'Y']})

bins = [0, 2, 18, 35, 65]
names = ['<2', '2-18', '18-35', '35-65', '65+']

d = dict(enumerate(names, 1))

df['AgeRange'] = np.vectorize(d.get)(np.digitize(df['Age'], bins))

Result

   Age Age_units AgeRange
0   99         Y      65+
1   53         Y    35-65
2   71         Y      65+
3   84         Y      65+
4   84         Y      65+
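
As a possible variation, the dictionary lookup can be replaced by indexing into a NumPy array of names. This is only a sketch: it assumes every age is at least bins[0], and the last row uses a made-up age of 18 to show the boundary behaviour.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [99, 53, 71, 84, 18]})

bins = [0, 2, 18, 35, 65]
names = np.array(['<2', '2-18', '18-35', '35-65', '65+'])

# np.digitize returns 1-based bin numbers for values >= bins[0],
# so subtract 1 to get positions in the names array.
idx = np.digitize(df['Age'], bins) - 1
df['AgeRange'] = names[idx]

print(df)
```

A value below bins[0] would produce an index of -1, which silently wraps around to the last label, so validate the input first if negative values are possible.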

Map numeric data into bins in Pandas dataframe for separate groups using dictionaries

A simpler solution would be to use groupby and apply a custom function on each group. In this case, we can define a function reclass that obtains the correct bins and ids and then uses pd.cut:

def reclass(group, name):
    bins = bins_dic[name]
    ids = ids_dic[name]
    return pd.cut(group, bins, labels=ids)
    
# group_keys=False keeps the original row index so the result aligns with df
df['id'] = df.groupby('polyid', group_keys=False)['value'].apply(lambda x: reclass(x, x.name))

Result:

    polyid  value  id
0        1   0.56   1
1        1   0.59   1
2        1   0.62   2
3        1   0.83   3
4        2   0.85   2
5        2   0.01   1
6        2   0.79   2
7        3   0.37   1
8        3   0.99   3
9        3   0.48   1
10       3   0.55   2
11       3   0.06   1
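
The answer does not show bins_dic and ids_dic, so here is a self-contained sketch with hypothetical dictionaries and data. It loops over the groups explicitly rather than using apply, which is a swapped-in variation that avoids differences in how pandas versions index the result of groupby.apply.

```python
import pandas as pd

# Hypothetical data and per-group bin definitions (not from the question).
df = pd.DataFrame({'polyid': [1, 1, 2, 2],
                   'value': [0.2, 0.7, 0.2, 0.7]})
bins_dic = {1: [0, 0.5, 1.0], 2: [0, 0.1, 1.0]}
ids_dic = {1: [1, 2], 2: [1, 2]}

def reclass(group, name):
    # Look up the bin edges and labels that belong to this group.
    return pd.cut(group, bins_dic[name], labels=ids_dic[name])

# Bin each group separately, then stitch the pieces back together.
# Each piece keeps its original row index, so assignment aligns with df.
parts = [reclass(g, name) for name, g in df.groupby('polyid')['value']]
df['id'] = pd.concat(parts).astype(int)

print(df)
```

The same value (0.2 or 0.7) gets a different id depending on its polyid, because each group has its own edges.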

Mapping ranges of values in pandas dataframe

There are a few alternatives.

Pandas via `pd.cut` / NumPy via `np.digitize`

You can construct a list of boundaries, then use specialist library functions. This is described in @EdChum's solution, and also in this answer.

NumPy via `np.select`

df = pd.DataFrame(data=np.random.randint(1,10,10), columns=['a'])

criteria = [df['a'].between(1, 3), df['a'].between(4, 7), df['a'].between(8, 10)]
values = [1, 2, 3]

df['b'] = np.select(criteria, values, 0)

The elements of criteria are Boolean series, so for lists of values, you can use df['a'].isin([1, 3]), etc.
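
A small sketch of mixing isin and between in the criteria list, using made-up values. Note that between is inclusive on both ends by default, and np.select takes the first condition that matches, falling back to the default otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration.
df = pd.DataFrame({'a': [1, 2, 3, 5, 9]})

criteria = [df['a'].isin([1, 3]),      # exact membership
            df['a'].between(4, 7)]     # 4 <= a <= 7
values = [10, 20]

# Rows matching no condition receive the default.
df['b'] = np.select(criteria, values, default=0)

print(df)
```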

Dictionary mapping via `range`

d = {range(1, 4): 1, range(4, 8): 2, range(8, 11): 3}

df['c'] = df['a'].apply(lambda x: next((v for k, v in d.items() if x in k), 0))

print(df)

   a  b  c
0  1  1  1
1  7  2  2
2  5  2  2
3  1  1  1
4  3  1  1
5  5  2  2
6  4  2  2
7  4  2  2
8  9  3  3
9  3  1  1

Grouping numerical values in categories

Here's a way using pd.cut:

df = df.sort_values('GPA')

df['bins'] = pd.cut(df['GPA'], bins=3, labels = ['A','B','C'])

     Name   GPA bins
3   Ramzi  1.75    A
2  Djamel  2.10    A
1   Betty  2.75    B
4   Alexa  3.15    C
0    Adel  3.50    C
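
When bins is an integer, pd.cut splits the range of the data into equal-width intervals. To see which edges it picked, you can ask for them with retbins=True. The sketch below rebuilds the GPA values from the output above as a plain Series.

```python
import pandas as pd

gpa = pd.Series([3.50, 2.75, 2.10, 1.75, 3.15])

# retbins=True returns the computed bin edges alongside the binned values.
binned, edges = pd.cut(gpa, bins=3, labels=['A', 'B', 'C'], retbins=True)

print(binned.tolist())
print(edges)
```

The lowest edge is nudged slightly below the minimum so that the smallest value is included in the first bin.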

Use a dictionary to key a range of values

Assume you have a dataframe like this:

  range value
0   0     0
1   1     1
2   2     2
3   3     3
4   4     4
5   5     5
6   6     6
7   7     7
8   8     8
9   9     9

Then you can apply the following function to the column 'range' and store the result in 'value':

def get_value(x):
    # x is a single number from the 'range' column
    if x < 5:
        return 'Below 5'
    elif x < 10:
        return 'Between 5 and 10'
    else:
        return 'Above 10'

df['value'] = df['range'].apply(get_value)

This gives the result you want.
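
If the column is large, a vectorised alternative may be preferable to a row-wise function. The following is a sketch using pd.cut with right=False so that the intervals match the strict "less than" comparisons; the open-ended bins with infinite edges are my own choice for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'range': range(10)})

# Left-closed intervals: [-inf, 5), [5, 10), [10, inf)
bins = [-np.inf, 5, 10, np.inf]
labels = ['Below 5', 'Between 5 and 10', 'Above 10']

df['value'] = pd.cut(df['range'], bins=bins, labels=labels, right=False)

print(df)
```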

Binning a column with pandas

You can use pandas.cut:

bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = pd.cut(df['percentage'], bins)
print (df)
   percentage     binned
0       46.50   (25, 50]
1       44.20   (25, 50]
2      100.00  (50, 100]
3       42.12   (25, 50]

bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1,2,3,4,5,6]
df['binned'] = pd.cut(df['percentage'], bins=bins, labels=labels)
print (df)
   percentage binned
0       46.50      5
1       44.20      5
2      100.00      6
3       42.12      5

Or numpy.searchsorted:

bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = np.searchsorted(bins, df['percentage'].values)
print (df)
   percentage  binned
0       46.50       5
1       44.20       5
2      100.00       6
3       42.12       5
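
To see why the two approaches agree here, this sketch compares np.searchsorted (with its default side='left') against the zero-based codes from pd.cut shifted by one. It rebuilds the percentage column as a standalone Series.

```python
import numpy as np
import pandas as pd

bins = [0, 1, 5, 10, 25, 50, 100]
pct = pd.Series([46.50, 44.20, 100.00, 42.12])

# side='left' returns i such that bins[i-1] < x <= bins[i],
# which lines up with pd.cut's right-closed intervals.
from_searchsorted = np.searchsorted(bins, pct.to_numpy(), side='left')

# labels=False gives zero-based bin codes; add 1 to match the labels 1..6.
from_cut = pd.cut(pct, bins=bins, labels=False) + 1

print(from_searchsorted)
print(from_cut.tolist())
```

The two differ for values outside the bins: pd.cut returns NaN for a value of 0 or anything above 100, while np.searchsorted still returns an integer (0 or len(bins)).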

...and then value_counts or groupby and aggregate size:

s = pd.cut(df['percentage'], bins=bins).value_counts()
print (s)
(25, 50]     3
(50, 100]    1
(10, 25]     0
(5, 10]      0
(1, 5]       0
(0, 1]       0
Name: percentage, dtype: int64

s = df.groupby(pd.cut(df['percentage'], bins=bins)).size()
print (s)
percentage
(0, 1]       0
(1, 5]       0
(5, 10]      0
(10, 25]     0
(25, 50]     3
(50, 100]    1
dtype: int64

By default, cut returns a Categorical.

Series methods like Series.value_counts() use all categories, even those that do not occur in the data (see "operations" in the pandas categorical documentation).
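
A short sketch of that behaviour, again with the percentage values rebuilt as a standalone Series: sorting by index puts the counts in bin order, and a boolean filter drops the empty bins if you do not want them.

```python
import pandas as pd

bins = [0, 1, 5, 10, 25, 50, 100]
pct = pd.Series([46.50, 44.20, 100.00, 42.12])

# All six categories appear, including those with a count of zero.
counts = pd.cut(pct, bins=bins).value_counts().sort_index()

# Keep only the bins that actually occur in the data.
non_empty = counts[counts > 0]

print(counts)
print(non_empty)
```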

How to efficiently label each value to a bin after I created the bins by pandas.cut() function?

tl;dr: np.digitize is a good solution.

After reading all the comments and answers here and doing some more Googling, I think I found a solution that I am pretty satisfied with. Thank you to all of you!

Setup

import pandas as pd
import numpy as np
np.random.seed(42)

bins = [0, 10, 15, 20, 25, 30, np.inf]
labels = bins[1:]
ages = list(range(5, 90, 5))
df = pd.DataFrame({"user_age": ages})
df["user_age_bin"] = pd.cut(df["user_age"], bins=bins, labels=False)

# sort by age 
print(df.sort_values('user_age'))

Output:

    user_age  user_age_bin
0          5             0
1         10             0
2         15             1
3         20             2
4         25             3
5         30             4
6         35             5
7         40             5
8         45             5
9         50             5
10        55             5
11        60             5
12        65             5
13        70             5
14        75             5
15        80             5
16        85             5

Assign category:

# a new age value
new_age=30

# use this right=True and '-1' trick to make the bins match
print(np.digitize(new_age, bins=bins, right=True) -1)

Output:

4

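To check that the right=True and '-1' trick really reproduces the pd.cut codes, here is a sketch that compares the two on a handful of made-up ages, including values that sit exactly on a boundary.

```python
import numpy as np
import pandas as pd

bins = [0, 10, 15, 20, 25, 30, np.inf]

# Hypothetical new ages, some exactly on a bin edge.
new_ages = np.array([5, 10, 12, 30, 31, 85])

# right=True makes np.digitize use right-closed intervals like pd.cut;
# subtracting 1 converts its 1-based result to zero-based codes.
from_digitize = np.digitize(new_ages, bins=bins, right=True) - 1
from_cut = pd.cut(new_ages, bins=bins, labels=False)

print(from_digitize)
print(from_cut)
```

Both print the same codes for these inputs. Values at or below bins[0] are the exception: np.digitize gives -1 where pd.cut gives NaN.
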
How to convert the continuous numbers into categorical using pandas?

One idea is to use integer division by 10 (//), then multiply by 10, and finally convert to strings (with replace if necessary):

s = df['Val'] // 10 * 10
df['new'] = s.replace(0, 1).astype(str) + '-' + (s + 10).astype(str)
print (df)
   Val  Val_Cat      new
0    1     1-10     1-10
1   15    10-20    10-20
2    2     1-10     1-10
3   91   90-100   90-100
4   52    50-60    50-60
5  126  120-130  120-130

Alternative with f-strings:

df['new'] = df['Val'].map(lambda x: f'{x//10*10}-{(x//10*10)+10}')
print (df)
   Val  Val_Cat      new
0    1     1-10     0-10
1   15    10-20    10-20
2    2     1-10     0-10
3   91   90-100   90-100
4   52    50-60    50-60
5  126  120-130  120-130

Your solution with cut can be changed to:

bins = np.arange(0, df['Val'].max() // 10 * 10 + 20, 10)

df['new'] = pd.cut(df.Val, bins = bins, right=False)
print (df)
   Val  Val_Cat         new
0    1     1-10     [0, 10)
1   15    10-20    [10, 20)
2    2     1-10     [0, 10)
3   91   90-100   [90, 100)
4   52    50-60    [50, 60)
5  126  120-130  [120, 130)
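
If you want strings such as '10-20' instead of interval objects, one option is to build the labels from the bin edges and pass them to cut. This is a sketch under that assumption; the names upper and labels are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Val': [1, 15, 2, 91, 52, 126]})

# Edges 0, 10, 20, ... extending one step past the largest value.
upper = int(df['Val'].max()) // 10 * 10 + 20
bins = list(range(0, upper, 10))

# One label per pair of neighbouring edges: '0-10', '10-20', ...
labels = [f'{lo}-{hi}' for lo, hi in zip(bins[:-1], bins[1:])]

# right=False gives left-closed bins, so 10 falls in '10-20'.
df['new'] = pd.cut(df['Val'], bins=bins, labels=labels, right=False).astype(str)

print(df)
```

Dropping the final astype(str) keeps the column as a Categorical, which preserves the bin order for sorting and grouping.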
