SQL-Like Window Functions in Pandas: Row Numbering in Python Pandas Dataframe

You can do this by using groupby twice along with the rank method:

In [11]: g = df.groupby('key1')

Use method='min' so that rows sharing the same data1 value get the same RN:

In [12]: g['data1'].rank(method='min')
Out[12]:
0    1
1    2
2    2
3    1
4    4
dtype: float64

In [13]: df['RN'] = g['data1'].rank(method='min')

Then group by these results and add the rank with respect to data2:

In [14]: g1 = df.groupby(['key1', 'RN'])

In [15]: g1['data2'].rank(ascending=False) - 1
Out[15]:
0    0
1    0
2    1
3    0
4    0
dtype: float64

In [16]: df['RN'] += g1['data2'].rank(ascending=False) - 1

In [17]: df
Out[17]:
   data1  data2 key1  RN
0      1      1    a   1
1      2     10    a   2
2      2      2    a   3
3      3      3    b   1
4      3     30    a   4

It feels like there ought to be a native way to do this (there may well be!).
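One more native route is cumcount, which numbers the rows of each group after a sort. The sketch below uses the same frame as above and reproduces the RN column in a single pass, mirroring ROW_NUMBER() OVER (PARTITION BY key1 ORDER BY data1, data2 DESC):

```python
import pandas as pd

df = pd.DataFrame({'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30],
                   'key1': ['a', 'a', 'a', 'b', 'a']})

# sort by the window ordering, then number rows within each key1 group;
# the result aligns back to the original index on assignment
df['RN'] = (df.sort_values(['data1', 'data2'], ascending=[True, False])
              .groupby('key1')
              .cumcount() + 1)
```

This yields RN = 1, 2, 3, 1, 4 for rows 0-4, matching the two-step rank approach above.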

Pandas Equivalent for SQL window function and rows range

Try groupby with shift, then reindex back:

df['new'] = (df.groupby(['customer', 'day']).purchase.sum()
               .shift()
               .reindex(pd.MultiIndex.from_frame(df[['customer', 'day']]))
               .values)
df
Out[259]:
  customer  day  purchase   new
0      Joe    1         5   NaN
1      Joe    1        10   NaN
2      Joe    2        10  15.0
3      Joe    2         5  15.0
4      Joe    4        10  15.0

Update

s = df.groupby(['customer', 'day']).apply(
        lambda x: df.loc[df.customer.isin(x['customer'].tolist())
                         & (df.day.isin(x['day'] - 1) | df.day.isin(x['day'] - 2)),
                         'purchase'].sum())
df['new'] = s.reindex(pd.MultiIndex.from_frame(df[['customer','day']])).values
df
Out[271]:
  customer  day  purchase  new
0      Joe    1         5    0
1      Joe    1        10    0
2      Joe    2         5   15
3      Joe    2         5   15
4      Joe    4        10   10
5      Joe    7         5    0
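For the "previous two days" range window, an alternative sketch (on the same data as the updated output above) builds the per-customer daily totals once and then looks up the two preceding days for each row, which avoids scanning the whole frame inside apply:

```python
import pandas as pd

df = pd.DataFrame({'customer': ['Joe'] * 6,
                   'day': [1, 1, 2, 2, 4, 7],
                   'purchase': [5, 10, 5, 5, 10, 5]})

# total purchase per customer per day
daily = df.groupby(['customer', 'day'])['purchase'].sum()

# for each row, sum the totals of the two preceding days (missing days count as 0)
def prev_two_days(row):
    c, d = row['customer'], row['day']
    return daily.reindex([(c, d - 1), (c, d - 2)]).fillna(0).sum()

df['new'] = df.apply(prev_two_days, axis=1)
```

This reproduces the new column above: 0, 0, 15, 15, 10, 0.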

Is there any row number alternative like SQL in python?

Thanks to @ScottBoston, I looked further and think we can use nth() instead of head() to make use of sum(level=0). Another alternative would be to set_index() first, instead of the old solution where I used groupby twice. Anyway, in order of speed, quickest first:

dfout = (df.sort_values(by='amount', ascending=False)
           .groupby('group')
           .head(3)
           .set_index('group')
           .sum(level=0)
           .reset_index())

or

dfout = (df.sort_values(by='amount', ascending=False)
           .groupby('group')
           .nth([0, 1, 2])
           .sum(level=0)
           .reset_index())

or

dfout = (df.groupby('group')
           .apply(lambda x: x['amount'].sort_values(ascending=False).head(3).sum())
           .rename('amount')
           .reset_index())

or a two-step approach to get your temp dataframe as shown in the question:

mid = df.sort_values(by='amount', ascending=False).groupby('group').head(3).sort_index()
final = mid.set_index('group').sum(level=0)

Full example:

import pandas as pd

data = '''\
group,amount
x,12
x,345
x,3
y,1
y,45
z,14
x,4
x,52
y,54
z,23
z,235
z,21
y,57
y,3
z,87'''

from io import StringIO  # pd.compat.StringIO was removed in newer pandas versions

fileobj = StringIO(data)
df = pd.read_csv(fileobj)

dfout = (df.sort_values(by='amount', ascending=False)
           .groupby('group')
           .nth([0, 1, 2])
           .sum(level=0)
           .reset_index())

print(dfout)

Returns:

  group  amount
0     x     409
1     y     156
2     z     345
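Note that sum(level=0) has since been removed from pandas (use groupby(level=0).sum() instead). A sketch of the same top-3-per-group sum that runs on current pandas, using nlargest on the grouped series:

```python
import pandas as pd

df = pd.DataFrame({'group': list('xxxyyzxxyzzzyyz'),
                   'amount': [12, 345, 3, 1, 45, 14, 4, 52,
                              54, 23, 235, 21, 57, 3, 87]})

# take the 3 largest amounts within each group, then sum per group;
# nlargest avoids sorting the whole frame up front
dfout = (df.groupby('group')['amount']
           .nlargest(3)
           .groupby(level=0)
           .sum()
           .reset_index())
```

This returns the same totals as above: x 409, y 156, z 345.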

Pandas equivalent to SQL window functions

For the first SQL:

SELECT state_name,
       state_population,
       SUM(state_population) OVER() AS national_population
FROM population
ORDER BY state_name

Pandas:

df.assign(national_population=df.state_population.sum()).sort_values('state_name')

For the second SQL:

SELECT state_name,
       state_population,
       region,
       SUM(state_population) OVER(PARTITION BY region) AS regional_population
FROM population
ORDER BY state_name

Pandas:

df.assign(regional_population=df.groupby('region')['state_population'].transform('sum')) \
.sort_values('state_name')

DEMO:

In [238]: df
Out[238]:
   region state_name  state_population
0       1        aaa               100
1       1        bbb               110
2       2        ccc               200
3       2        ddd               100
4       2        eee               100
5       3        xxx                55

national_population:

In [246]: df.assign(national_population=df.state_population.sum()).sort_values('state_name')
Out[246]:
   region state_name  state_population  national_population
0       1        aaa               100                  665
1       1        bbb               110                  665
2       2        ccc               200                  665
3       2        ddd               100                  665
4       2        eee               100                  665
5       3        xxx                55                  665

regional_population:

In [239]: df.assign(regional_population=df.groupby('region')['state_population'].transform('sum')) \
     ...:   .sort_values('state_name')
Out[239]:
   region state_name  state_population  regional_population
0       1        aaa               100                  210
1       1        bbb               110                  210
2       2        ccc               200                  400
3       2        ddd               100                  400
4       2        eee               100                  400
5       3        xxx                55                   55

SQL Server or Pandas Rank / Numbering a Window Function by Partition

It sounds like what you need is to order the DENSE_RANK by the minimum LineItem per Category and PartNumber:

SELECT
    Category,
    LineItem,
    PartNumber,
    DENSE_RANK() OVER (PARTITION BY Category ORDER BY MinLineItem)
FROM (
    SELECT *,
           MinLineItem = MIN(LineItem) OVER (PARTITION BY Category, PartNumber)
    FROM [TABLE]
) t

db<>fiddle
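The same two-step logic translates to pandas with transform('min') and a dense rank. A sketch on hypothetical data (the Category, LineItem, and PartNumber column names are taken from the question; the values are made up):

```python
import pandas as pd

df = pd.DataFrame({'Category':   ['A', 'A', 'A', 'A', 'B', 'B'],
                   'LineItem':   [1, 2, 3, 4, 1, 2],
                   'PartNumber': ['p1', 'p1', 'p2', 'p2', 'p3', 'p3']})

# MIN(LineItem) OVER (PARTITION BY Category, PartNumber)
df['MinLineItem'] = df.groupby(['Category', 'PartNumber'])['LineItem'].transform('min')

# DENSE_RANK() OVER (PARTITION BY Category ORDER BY MinLineItem)
df['rank'] = df.groupby('Category')['MinLineItem'].rank(method='dense').astype(int)
```

With this data, both rows of each PartNumber get the same rank: 1, 1, 2, 2 in Category A and 1, 1 in Category B.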

Rank/Row Number Window Function in Python

This will pick one row per ID_Number with the sorting you defined:

df.sort_values(by=['Score_2', 'Score_1'], ascending=[False, True]).groupby(['ID_Number']).head(1)

Output:

   Action  ID_Number  Score_1  Score_2
3  Invest  222037001        9   0.4600
0     Use  207821021        7   0.4525
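To keep the full ranking as a column (the ROW_NUMBER() itself) rather than taking only the top row, cumcount under the same sort works. A sketch on hypothetical data mimicking the question's columns (only the two rows shown above come from the output; the rest are made up):

```python
import pandas as pd

df = pd.DataFrame({'Action':    ['Use', 'Sell', 'Hold', 'Invest'],
                   'ID_Number': [207821021, 207821021, 222037001, 222037001],
                   'Score_1':   [7, 8, 10, 9],
                   'Score_2':   [0.4525, 0.40, 0.30, 0.46]})

# number rows within each ID_Number under the same ordering;
# head(1) above is equivalent to keeping the rows where row_num == 1
df['row_num'] = (df.sort_values(by=['Score_2', 'Score_1'], ascending=[False, True])
                   .groupby('ID_Number')
                   .cumcount() + 1)
```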

How to get dense rank in each partition window in pandas

This is built-in with groupby:

df['dense_rank'] = (df.groupby('Dominant_Topic')['appearance']
                      .rank(method='dense', ascending=False)
                      .astype(int))

Output:

  Dominant_Topic            word  appearance  dense_rank
0        Topic 0         aaaawww          50           3
1        Topic 0            aacn         100           2
2        Topic 0           aaren          20           4
3        Topic 0    aarongoodwin         200           1
4        Topic 1  aaronjfentress          10           3
5        Topic 1     aaronrodger          20           2
6        Topic 1      aasmiitkap          30           1
7        Topic 2      aavqbketmh          10           1
8        Topic 2              ab          10           1
9        Topic 2         abandon           1           2

Pandas DataFrame Window Function

You could return boolean values where second_pass equals the group max, as idxmax only returns the first occurrence of the max:

df['highest'] = df.groupby(['test', 'analysis'])['second_pass'].transform(lambda x: x == np.amax(x)).astype(bool)

and then use np.where to capture all fruit values that have a group max, and merge the result into your DataFrame like so:

highest_fruits = df.groupby(['test', 'analysis']).apply(
    lambda x: [f for f in np.where(x.second_pass == np.amax(x.second_pass),
                                   x.fruit.tolist(), '').tolist() if f != '']).reset_index()
df = df.merge(highest_fruits, on=['test', 'analysis'], how='left').rename(columns={0: 'highest_fruit'})

finally, for your follow up:

first_pass = df.groupby(['test', 'analysis']).apply(
    lambda x: {fruit: x.loc[x.fruit == fruit, 'first_pass']
               for fruit in x.highest_fruit.iloc[0]}).reset_index()
df = df.merge(first_pass, on=['test', 'analysis'], how='left').rename(columns={0: 'first_pass_highest_fruit'})

to get:

  analysis  first_pass   fruit  order  second_pass  test units highest  \
0     full        12.1   apple      2         20.1     1     g    True
1     full         7.1   apple      1         12.0     2     g   False
2  partial        14.3   apple      3         13.1     1     g   False
3     full        19.1  orange      2         20.1     1     g    True
4     full        17.1  orange      1         18.5     2     g    True
5  partial        23.4  orange      3         22.7     1     g    True
6     full        23.1   grape      3         14.1     1     g   False
7     full        17.2   grape      2         17.1     2     g   False
8  partial        19.1   grape      1         19.4     1     g   False

     highest_fruit             first_pass_highest_fruit
0  [apple, orange]  {'orange': [19.1], 'apple': [12.1]}
1         [orange]                   {'orange': [17.1]}
2         [orange]                   {'orange': [23.4]}
3  [apple, orange]  {'orange': [19.1], 'apple': [12.1]}
4         [orange]                   {'orange': [17.1]}
5         [orange]                   {'orange': [23.4]}
6  [apple, orange]  {'orange': [19.1], 'apple': [12.1]}
7         [orange]                   {'orange': [17.1]}
8         [orange]                   {'orange': [23.4]}
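As an aside, the boolean flag from the first step can also be spelled with transform('max'), which may read more directly than the lambda. A sketch on a made-up frame (not the question's full data):

```python
import pandas as pd

df = pd.DataFrame({'test':        [1, 1, 2, 2],
                   'analysis':    ['full', 'full', 'full', 'full'],
                   'fruit':       ['apple', 'orange', 'apple', 'orange'],
                   'second_pass': [20.1, 20.1, 12.0, 18.5]})

# True for every row that ties for its group's maximum second_pass
df['highest'] = df['second_pass'] == df.groupby(['test', 'analysis'])['second_pass'].transform('max')
```

Like the lambda version, this marks all rows sharing the group max, not just the first occurrence.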

