Pandas Rank by Multiple Columns

Pandas rank by multiple columns

Another way would be to type-cast both the columns of interest to str and combine them by concatenating them. Convert these back to numerical values so that they could be differentiated based on their magnitude.

In method=dense, ranks of duplicated values would remain unchanged. (Here: 6)

Since you want to rank these in their descending order, specifying ascending=False in Series.rank() would let you achieve the desired result.

col1 = df["SaleCount"].astype(str) 
col2 = df["TotalRevenue"].astype(str)
df['Rank'] = (col1+col2).astype(int).rank(method='dense', ascending=False).astype(int)
df.sort_values('Rank')

enter image description here

Pandas ranking order based on two columns

Your code gives me expected result.

I can get something similar to your current result if I change columns order ["Bucket","Count"] to ["Count","Bucket"] - so maybe you use wrong data.


Minimal working code:

import pandas as pd

data = {
'B': [5,5,4,4,3,2,2,1],
'C': [60,3,10,2,35,2,2,3],
}

df = pd.DataFrame(data)

df['CB'] = df[['C', 'B']].apply(tuple, axis=1).rank(method='dense', ascending=False).astype(int)
df['BC'] = df[['B', 'C']].apply(tuple, axis=1).rank(method='dense', ascending=False).astype(int)

print(df)

Result:

   B   C  CB  BC
0 5 60 1 1
1 5 3 4 2
2 4 10 3 3
3 4 2 6 4
4 3 35 2 5
5 2 2 7 6
6 2 2 7 6
7 1 3 5 7

Pandas - Group by and rank within group based on multiple columns

method1:

df.sort_values(['asset_id', 'method_rank', 'conf_score'], ascending=[True, True, False], inplace=True)
df['overall_rank'] = 1
df['overall_rank'] = df.groupby(['asset_id'])['overall_rank'].cumsum()

df

   asset_id method_id  method_rank  conf_score  overall_rank
2 10 p4 2 0.8 1
1 10 p3 2 0.6 2
0 10 p2 5 0.8 3
5 20 p2 1 0.5 1
3 20 p3 2 0.9 2
4 20 p1 5 0.7 3

method2:

define a function to sort every group:

def handle_group(group):
group.sort_values(['method_rank', 'conf_score'], ascending=[True, False], inplace=True)
group['overall_rank'] = np.arange(1, len(group)+1)
return group

df.groupby('asset_id', as_index=False).apply(handle_group)

performance test:

def run1(df):
df = df.sort_values(['asset_id', 'method_rank', 'conf_score'], ascending=[True, True, False])
df['overall_rank'] = 1
df['overall_rank'] = df.groupby(['asset_id'])['overall_rank'].cumsum()
return df

def handle_group(group):
group.sort_values(['method_rank', 'conf_score'], ascending=[True, False], inplace=True)
group['overall_rank'] = np.arange(1, len(group)+1)
return group

def run2(df):
df = df.groupby('asset_id', as_index=False).apply(handle_group)
return df

dfn = pd.concat([df]*10000, ignore_index=True)

%%timeit
df1 = run1(dfn)
# 8.61 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit
df2 = run2(dfn).droplevel(0)
# 31.6 ms ± 404 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pandas rank based on several columns

sort_values('event_id') prior to grouping then pass method='first' to rank

Also note that if occurred_at isn't already datetime, make it datetime.

# unnecessary if already datetime, but doesn't hurt to do it anyway
df.occurred_at = pd.to_datetime(df.occurred_at)

df['rank'] = df.sort_values('event_id') \
.groupby('user_id').occurred_at \
.rank(method='first')
df

Sample Image



Reference for complete verifiable code

from StringIO import StringIO
import pandas as pd

text = """event_id occurred_at user_id
19148 2015-10-01 1
19693 2015-10-05 2
20589 2015-10-12 1
20996 2015-10-15 1
20998 2015-10-15 1
23301 2015-10-23 2
23630 2015-10-26 1
25172 2015-11-03 1
31699 2015-12-11 1
32186 2015-12-14 2
43426 2016-01-13 1
68300 2016-04-04 2
71926 2016-04-19 1"""

df = pd.read_csv(StringIO(text), delim_whitespace=True)

df.occurred_at = pd.to_datetime(df.occurred_at)

df['rank'] = df.sort_values('event_id').groupby('user_id').occurred_at.rank(method='first')

df

Rank DataFrame based on multiple columns

Here is a one-line approach using sort_values:

In [135]: df['rank'] = df.sort_values(['col_a', 'col_b'])['col_b'].index + 1

In [136]: df
Out[136]:
col_a col_b rank
0 0 5 2
1 0 2 1
2 0 8 3
3 1 3 4
4 1 7 6
5 1 4 5

The logic behind this snippet: Basically, the DataFrame.sort_values function accepts multiple column names and returns a sorted copy of the dataframe based on the order of passed column names. The default sorting order is ascending which is what we want. If you wanted another order you could pass the order as an iterable of booleans to the ascending keyword argument. At the end the new indices of the column_b is what we want (plus one).

Is there a way to rank a value within multiple rows and columns in Pandas Dataframe?

Try this with join and apply:

lst = sorted(df.values.flatten().tolist())[::-1]
print(df.join(df.apply(lambda x: [lst.index(i) + 1 for i in x]), rsuffix='_rank'))

Output:

      Column1  Column2  Column3  Column1_rank  Column2_rank  Column3_rank
Row1 60 20 10 3 8 9
Row2 40 30 80 6 7 1
Row3 70 50 50 2 4 4

How can I rank based on condition in Pandas

Use:

#convert columns to numeric
df[['Ratio','Value']]=df[['Ratio','Value']].apply(lambda x: x.str.strip('%')).astype(float)

Remove row with CPI by condition - test rows if no only CPI per Cluster:

m = df['Group'].eq('CPI')
m1 = ~df['Cluster'].isin(df.loc[m, 'Cluster']) | m
df['RankRatio'] = df[m1].groupby('Cluster')['Ratio'].rank(method='first', ascending=True)
df['RankValue'] = df[m1].groupby('Cluster')['Value'].rank(method='first', ascending=False)


print (df)
Cluster Variable Group Ratio Value RankRatio RankValue
0 1 GDP_M3 GDP 20.0 70.0 1.0 2.0
1 1 HPI_M6 HPI 40.0 80.0 3.0 1.0
2 1 GDP_lg2 GDP 35.0 50.0 2.0 3.0
3 2 CPI_M9 CPI 10.0 50.0 NaN NaN
4 2 HPI_lg6 HPI 15.0 65.0 1.0 1.0
5 3 CPI_lg12 CPI 15.0 90.0 1.0 2.0
6 3 CPI_lg1 CPI 20.0 95.0 2.0 1.0

How it working:

For mask2 are filter all Cluster values if match mask1 and filtered original column Cluster, then invert mask by ~. Last chain both conditions by | for bitwise OR for all rows without CPI if exist with another values per Cluster:

print (df.assign(mask1 = m, mask2 = ~df['Cluster'].isin(df.loc[m, 'Cluster']), both = m1))
Cluster Variable Group Ratio Value mask1 mask2 both
0 1 GDP_M3 GDP 20.0 70.0 False True True
1 1 HPI_M6 HPI 40.0 80.0 False True True
2 1 GDP_lg2 GDP 35.0 50.0 False True True
3 2 CPI_M9 CPI 10.0 50.0 True False True
4 2 HPI_lg6 HPI 15.0 65.0 False False False
5 3 CPI_lg12 CPI 15.0 90.0 True False True
6 3 CPI_lg1 CPI 20.0 95.0 True False True

EDIT:

df[['Ratio','Value']]=df[['Ratio','Value']].apply(lambda x: x.str.strip('%')).astype(float)

m = df['Group'].isin(['CPI','HPI'])
m2 = df.groupby('Cluster')['Group'].transform('nunique').ne(1)
m1 = (~df['Cluster'].isin(df.loc[~m, 'Cluster']) | m) & m2
df['RankRatio'] = df[~m1].groupby('Cluster')['Ratio'].rank(method='first', ascending=True)
df['RankValue'] = df[~m1].groupby('Cluster')['Value'].rank(method='first', ascending=False)
print (df)
Cluster Variable Group Ratio Value RankRatio RankValue
0 1 GDP_M3 GDP 20.0 70.0 1.0 1.0
1 1 HPI_M6 HPI 40.0 80.0 NaN NaN
2 1 GDP_lg2 GDP 35.0 50.0 2.0 2.0
3 2 CPI_M9 CPI 10.0 50.0 NaN NaN
4 2 HPI_lg6 HPI 15.0 65.0 NaN NaN
5 3 CPI_lg12 CPI 15.0 90.0 1.0 2.0
6 3 CPI_lg1 CPI 20.0 95.0 2.0 1.0


print (df.assign(mask1 = m, mask2 = ~df['Cluster'].isin(df.loc[~m, 'Cluster']), m2=m2, all = ~m1))
Cluster Variable Group Ratio Value RankRatio RankValue mask1 mask2 \
0 1 GDP_M3 GDP 20.0 70.0 1.0 1.0 False False
1 1 HPI_M6 HPI 40.0 80.0 NaN NaN True False
2 1 GDP_lg2 GDP 35.0 50.0 2.0 2.0 False False
3 2 CPI_M9 CPI 10.0 50.0 NaN NaN True True
4 2 HPI_lg6 HPI 15.0 65.0 NaN NaN True True
5 3 CPI_lg12 CPI 15.0 90.0 1.0 2.0 True True
6 3 CPI_lg1 CPI 20.0 95.0 2.0 1.0 True True

m2 all
0 True True
1 True False
2 True True
3 True False
4 True False
5 False True
6 False True

Pandas Ranking for String Columns

In your case do

df['new'] = df.groupby(['item_number','location_id'])['Date'].rank(ascending=False)
0 5.0
1 4.0
2 3.0
3 2.0
4 1.0
5 5.0
6 4.0
7 3.0
8 2.0
9 1.0
Name: Date, dtype: float64

How to rank rows in pandas with multiple condition?

USE GROUPBY:

df = df.sort_values(['S','L','C','CTR','IM'], ascending=[True,True,True,False,False])
df['Rank'] = df.groupby(['S','L','C']).cumcount() + 1

OR:

df = df.convert_dtypes()
def rank_count(x):
x = x.sort_values(['CTR','IM'], ascending=False).reset_index(drop=True)
x['Rank'] = x.index + 1
return x
result = df.groupby(['S','L','C'], sort=False).apply(rank_count).reset_index(drop=True)


Related Topics



Leave a reply



Submit