Custom Sorting in Pandas Dataframe

Custom sorting in pandas dataframe

Pandas 0.15 introduced Categorical Series, which allows a much clearer way to do this:

First make the month column a categorical and specify the ordering to use.

In [21]: df['m'] = pd.Categorical(df['m'], ["March", "April", "Dec"])

In [22]: df # looks the same!
Out[22]:
a b m
0 1 2 March
1 5 6 Dec
2 3 4 April

Now, when you sort the month column it will sort with respect to that list:

In [23]: df.sort_values("m")
Out[23]:
a b m
0 1 2 March
2 3 4 April
1 5 6 Dec

Note: if a value is not in the list it will be converted to NaN.
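
The note about missing values is easy to check. Below is a minimal sketch, assuming a hypothetical extra month "Jan" that is not in the category list. Passing ordered=True is optional here, since sort_values follows the category order either way, but it makes the intent explicit:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 3, 7], "m": ["March", "Dec", "April", "Jan"]})

# "Jan" (hypothetical value) is not among the categories, so it becomes NaN
df["m"] = pd.Categorical(df["m"], categories=["March", "April", "Dec"], ordered=True)

# NaN values are placed last by default (na_position="last")
result = df.sort_values("m")
print(result)
```

If unknown values should be kept rather than turned into NaN, they need to be added to the category list before converting.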


An older answer for those interested...

You could create an intermediary series, and set_index on that:

df = pd.DataFrame([[1, 2, 'March'],[5, 6, 'Dec'],[3, 4, 'April']], columns=['a','b','m'])
s = df['m'].apply(lambda x: {'March':0, 'April':1, 'Dec':3}[x])
s = s.sort_values()  # sort_values returns a new Series, so keep the result

In [4]: df.set_index(s.index).sort_index()
Out[4]:
a b m
0 1 2 March
1 3 4 April
2 5 6 Dec

As commented, in newer pandas, Series has a replace method to do this more elegantly:

s = df['m'].replace({'March':0, 'April':1, 'Dec':3})

The slight difference is that this won't raise if there is a value outside of the dictionary (it'll just stay the same).
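
To make that difference concrete, here is a small sketch comparing the lookups on a hypothetical series that contains "Jan", a value missing from the dictionary. Series.map is included only for comparison and is not part of the answer above. The object dtype is set explicitly as an assumption to keep the behaviour the same across pandas versions:

```python
import pandas as pd

order = {"March": 0, "April": 1, "Dec": 3}
# "Jan" is deliberately not in the dict
months = pd.Series(["March", "Jan"], dtype=object)

# A direct dict lookup raises KeyError on the unknown value
try:
    months.apply(lambda x: order[x])
    raised = False
except KeyError:
    raised = True

# replace leaves unknown values untouched
replaced = months.replace(order)

# map turns unknown values into NaN
mapped = months.map(order)

print(raised)
print(replaced.tolist())
print(mapped.tolist())
```

Which behaviour is preferable depends on whether an unexpected value should stop the program, pass through unchanged, or be flagged as missing.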

Custom sorting with Pandas

One approach is to use a custom dict to create a 'rank' column, sort on that column, and then drop it after sorting:

In [17]:
custom_dict = {'Critical':0, 'High':1, 'Medium':2, 'Low':3}
df['rank'] = df['Criticality'].map(custom_dict)
df

Out[17]:

Name Criticality rank
0 baz High 1
1 foo Critical 0
2 baz Low 3
3 foo Medium 2
4 bar High 1
5 bar Low 3
6 bar Medium 2

[7 rows x 3 columns]

In [19]:
# now sort by 'Name' first and then by 'rank'
# (DataFrame.sort was removed from pandas; sort_values replaces it)
df.sort_values(by=['Name', 'rank'], inplace=True)
df

Out[19]:

Name Criticality rank
4 bar High 1
6 bar Medium 2
5 bar Low 3
0 baz High 1
2 baz Low 3
1 foo Critical 0
3 foo Medium 2

[7 rows x 3 columns]

In [21]:
# now drop the 'rank' column
df.drop(labels=['rank'],axis=1)

Out[21]:

Name Criticality
4 bar High
6 bar Medium
5 bar Low
0 baz High
2 baz Low
1 foo Critical
3 foo Medium

[7 rows x 2 columns]
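
The same steps can be chained so that the helper column never persists on the original frame. This is a sketch assuming the sample data shown above; assign creates the temporary 'rank' column on a copy:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["baz", "foo", "baz", "foo", "bar", "bar", "bar"],
    "Criticality": ["High", "Critical", "Low", "Medium", "High", "Low", "Medium"],
})
custom_dict = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}

result = (
    df.assign(rank=df["Criticality"].map(custom_dict))  # temporary sort key
      .sort_values(by=["Name", "rank"])                 # Name first, then rank
      .drop(columns="rank")                             # discard the helper column
)
print(result)
```

Because nothing is done in place, df itself keeps its original row order and columns.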

Sort pandas dataframe by column specifying custom order

From pandas>=1.1.0 you can use the key argument of the .sort_values method to write a lambda function which defines the custom order you prefer.

To do so, define a dictionary that maps each value to its position in the desired order, and map the column through it inside key:

custom_dict = {'new': 0, 'fix': 1, 'error': 2}
df.sort_values(by=['col3'], key=lambda x: x.map(custom_dict))
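
For a self-contained version, the sketch below uses a made-up frame; the column names col1 and col3 and the row values are assumptions for illustration. The key function receives the whole column as a Series and must return a Series of the same length:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col3": ["error", "new", "fix", "new"],
})
custom_dict = {"new": 0, "fix": 1, "error": 2}

# kind="stable" keeps rows with the same key in their original relative order
result = df.sort_values(
    by=["col3"],
    key=lambda x: x.map(custom_dict),
    kind="stable",
)
print(result)
```

The original col3 values are left unchanged; the mapped numbers are only used for the comparison.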

How to sort pandas dataframe by custom order on string index

Just reindex

df.reindex(reorderlist)
Out[89]:
Age G Tm Year id
Player
Maurice Baker 25 7 VAN 2004 5335
Adrian Caldwell 31 81 DAL 1997 6169
Ratko Varda 22 60 TOT 2001 13950
Ryan Bowen 34 52 OKC 2009 6141
Cedric Hunter 27 6 CHH 1991 2967

Update: if you have multiple players with the same name, reindex raises an error on the duplicate labels, so sort by position instead:

out = df.iloc[pd.Categorical(df.index,reorderlist).argsort()]
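
A minimal sketch of the duplicate-name case follows. The second "Maurice Baker" row and the shortened reorderlist are hypothetical, added only to show that every row with a repeated label is kept:

```python
import pandas as pd

df = pd.DataFrame(
    {"Tm": ["CHH", "VAN", "TOT", "DAL"]},
    index=pd.Index(
        ["Cedric Hunter", "Maurice Baker", "Ratko Varda", "Maurice Baker"],
        name="Player",
    ),
)
reorderlist = ["Maurice Baker", "Ratko Varda", "Cedric Hunter"]

# Positions that would sort the index by each label's place in reorderlist
positions = pd.Categorical(df.index, categories=reorderlist).argsort()
out = df.iloc[positions]
print(out)
```

Labels that are missing from reorderlist become NaN in the Categorical, so it is worth checking where such rows end up before relying on the result.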

Custom sorting multiple levels of column index in pandas

Add sort_remaining=False to sort_index to prevent sorting levels below specified:

df = df.sort_index(level=0, axis=1, sort_remaining=False)

df:

Week        2021-10-11           2021-10-18           2021-10-25
                Checks Check Avg     Checks Check Avg     Checks Check Avg
11th Street          4         5          2         3          0         1
16th Street         10        11          8         9          6         7
Bala Cynwyd         16        17         14        15         12        13

Sample Data Used:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(18).reshape((-1, 6)),
    index=['11th Street', '16th Street', 'Bala Cynwyd'],
    columns=pd.MultiIndex.from_arrays(
        [np.repeat(pd.to_datetime(['2021-10-25', '2021-10-18', '2021-10-11']), 2),
         ['Checks', 'Check Avg'] * 3],
        names=['Week', None]
    )
)

df:

Week        2021-10-25           2021-10-18           2021-10-11
                Checks Check Avg     Checks Check Avg     Checks Check Avg
11th Street          0         1          2         3          4          5
16th Street          6         7          8         9         10         11
Bala Cynwyd         12        13         14        15         16         17
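
To see what sort_remaining changes, the sketch below compares both settings on a one-row version of the sample data. Using plain date strings instead of datetimes is a simplification assumed here:

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["2021-10-25", "2021-10-18", "2021-10-11"], ["Checks", "Check Avg"]],
    names=["Week", None],
)
df = pd.DataFrame([[0, 1, 2, 3, 4, 5]], index=["11th Street"], columns=columns)

# Only the Week level is sorted; Checks / Check Avg keep their original order
kept = df.sort_index(level=0, axis=1, sort_remaining=False)

# With the default sort_remaining=True the inner level is sorted as well,
# which puts 'Check Avg' before 'Checks'
resorted = df.sort_index(level=0, axis=1)

print(kept.columns.tolist())
print(resorted.columns.tolist())
```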

sorting by a custom list in pandas

Below is an example that performs a lexicographic sort on a dataframe.
The idea is to create a numerical rank from the position of each value in the custom list,
and then to sort numerically on that rank.
A column is added to the dataframe to hold the rank and is removed after sorting.

import pandas as pd

# Create DataFrame
df = pd.DataFrame(
    {'id': [2967, 5335, 13950, 6141, 6169],
     'Player': ['Cedric Hunter', 'Maurice Baker',
                'Ratko Varda', 'Ryan Bowen', 'Adrian Caldwell'],
     'Year': [1991, 2004, 2001, 2009, 1997],
     'Age': [27, 25, 22, 34, 31],
     'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
     'G': [6, 7, 60, 52, 81]})

# Define the sorter
sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
          'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
          'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
          'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
          'WAS', 'WSB']

# Create the dictionary that defines the order for sorting
sorterIndex = dict(zip(sorter, range(len(sorter))))

# Generate a rank column that will be used to sort
# the dataframe numerically
df['Tm_Rank'] = df['Tm'].map(sorterIndex)

# Here is the requested result, using the lexicographic sort.
# It may be hard to read, so a second sort order is
# proposed next
## NOTE:
## Newer versions of pandas use 'sort_values' instead of 'sort'
df.sort_values(['Player', 'Year', 'Tm_Rank'],
               ascending=[True, True, True], inplace=True)
# Newer versions of pandas also require axis/columns to be passed by keyword
df.drop(columns='Tm_Rank', inplace=True)
print(df)

# Here is an example where 'Tm' is sorted first, that will
# give the first row of the DataFrame df to contain TOT as 'Tm'
df['Tm_Rank'] = df['Tm'].map(sorterIndex)
## NOTE:
## Newer versions of pandas use 'sort_values' instead of 'sort'
df.sort_values(['Tm_Rank', 'Player', 'Year'],
               ascending=[True, True, True], inplace=True)
# Newer versions of pandas also require axis/columns to be passed by keyword
df.drop(columns='Tm_Rank', inplace=True)
print(df)
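
One detail the example above does not cover is a team that is missing from the sorter list. The sketch below uses a shortened sorter and a hypothetical row ("Player X" on team "XYZ") to show that such a value gets a NaN rank, which sort_values places last by default:

```python
import pandas as pd

df = pd.DataFrame({
    "Player": ["Cedric Hunter", "Maurice Baker", "Ratko Varda", "Player X"],
    "Tm": ["CHH", "VAN", "TOT", "XYZ"],  # "XYZ" is not in the sorter
})
sorter = ["TOT", "CHH", "DAL", "OKC", "VAN"]  # shortened list for illustration

# Map each team to its position in the sorter list
sorter_index = {team: pos for pos, team in enumerate(sorter)}

result = (
    df.assign(Tm_Rank=df["Tm"].map(sorter_index))  # NaN for unknown teams
      .sort_values(["Tm_Rank", "Player"])
      .drop(columns="Tm_Rank")
)
print(result)
```

Passing na_position="first" to sort_values would move the unknown teams to the top instead.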

custom sorting pandas dataframe

You could compute the row sums and use sort_values to find the desired order of the index (Series.sort has been removed from pandas). Then use df.loc[order.index] to reorder the original DataFrame:

order = df.sum(axis=1).sort_values()
result = df.loc[order.index]

For example,

import pandas as pd

df = pd.DataFrame({
'one.txt': [1, 0, 1, 0, 0, 1, 1, 2, 0],
'third.txt': [1, 0, 1, 0, 1, 0, 1, 0, 0],
'two.txt': [0, 1, 1, 1, 0, 0, 1, 0, 1]},
index=['a', 'i', 'is', 'no', 'not', 'really', 'sentence', 'short', 'think'])

order = df.sum(axis=1).sort_values(ascending=False)
print(df.loc[order.index])

yields (rows with equal sums may come out in a different order)

          one.txt  third.txt  two.txt
sentence 1 1 1
is 1 1 1
short 2 0 0
a 1 1 0
think 0 0 1
really 1 0 0
not 0 1 0
no 0 0 1
i 0 0 1

