SQL-like window functions in PANDAS: Row Numbering in Python Pandas Dataframe
You can do this by using groupby twice along with the rank method:
In [11]: g = df.groupby('key1')
Use method='min' so that rows sharing the same data1 get the same RN:
In [12]: g['data1'].rank(method='min')
Out[12]:
0 1
1 2
2 2
3 1
4 4
dtype: float64
In [13]: df['RN'] = g['data1'].rank(method='min')
Then group by these results and add the rank with respect to data2:
In [14]: g1 = df.groupby(['key1', 'RN'])
In [15]: g1['data2'].rank(ascending=False) - 1
Out[15]:
0 0
1 0
2 1
3 0
4 0
dtype: float64
In [16]: df['RN'] += g1['data2'].rank(ascending=False) - 1
In [17]: df
Out[17]:
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
It feels like there ought to be a native way to do this (there may well be!...).
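For what it's worth, a single sort followed by groupby(...).cumcount() appears to give the same numbering in one step. This is only a sketch: the frame is rebuilt from the output shown above, and the ordering (data1 ascending, then data2 descending within each key1) is inferred from the two rank calls rather than taken from the original question.

```python
import pandas as pd

# Frame reconstructed from the Out[17] listing above
df = pd.DataFrame({
    "data1": [1, 2, 2, 3, 3],
    "data2": [1, 10, 2, 3, 30],
    "key1": ["a", "a", "a", "b", "a"],
})

# Assumed SQL equivalent:
# ROW_NUMBER() OVER (PARTITION BY key1 ORDER BY data1 ASC, data2 DESC)
df["RN"] = (
    df.sort_values(["data1", "data2"], ascending=[True, False])
      .groupby("key1")
      .cumcount() + 1   # cumcount is zero-based, SQL row numbers start at 1
)
print(df)
```

The assignment aligns on the index, so the row numbers land on the original rows even though they were computed on the sorted frame.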
Pandas Equivalent for SQL window function and rows range
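Try groupby with shift, then reindex back onto the original rows. Shifting within each customer keeps one customer's totals from leaking into the next customer's first day:
df['new'] = df.groupby(['customer','day']).purchase.sum().groupby(level='customer').shift().reindex(pd.MultiIndex.from_frame(df[['customer','day']])).values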
df
Out[259]:
customer day purchase new
0 Joe 1 5 NaN
1 Joe 1 10 NaN
2 Joe 2 10 15.0
3 Joe 2 5 15.0
4 Joe 4 10 15.0
Update
s = df.groupby(['customer','day']).apply(lambda x : df.loc[df.customer.isin(x['customer'].tolist()) & (df.day.isin(x['day']-1)|df.day.isin(x['day']-2)),'purchase'].sum())
df['new'] = s.reindex(pd.MultiIndex.from_frame(df[['customer','day']])).values
df
Out[271]:
customer day purchase new
0 Joe 1 5 0
1 Joe 1 10 0
2 Joe 2 5 15
3 Joe 2 5 15
4 Joe 4 10 10
5 Joe 7 5 0
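The same "sum of the two preceding days" window can also be sketched without apply, by shifting the daily totals forward and merging them back. The frame below is rebuilt from the output above; the names daily, lagged and window are just illustrative, and this is one possible approach rather than the only one.

```python
import pandas as pd

# Frame reconstructed from the Out[271] listing above
df = pd.DataFrame({
    "customer": ["Joe"] * 6,
    "day": [1, 1, 2, 2, 4, 7],
    "purchase": [5, 10, 5, 5, 10, 5],
})

# One total per customer and day
daily = df.groupby(["customer", "day"], as_index=False)["purchase"].sum()

# Move each daily total forward by 1 and by 2 days, so that the total
# for day d is filed under day d+1 and day d+2
lagged = pd.concat([daily.assign(day=daily["day"] + lag) for lag in (1, 2)])

# Adding the shifted totals per day gives the sum over days d-1 and d-2
window = (lagged.groupby(["customer", "day"], as_index=False)["purchase"].sum()
                .rename(columns={"purchase": "new"}))

out = df.merge(window, on=["customer", "day"], how="left")
out["new"] = out["new"].fillna(0).astype(int)  # no earlier purchases means 0
print(out)
```

Widening the window should only require changing the tuple of lags.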
Is there any row number alternative like SQL in python?
Thanks to @ScottBoston I looked further and think we can use nth() instead of head() and then sum per group. Another alternative is to call set_index() first, instead of the old solution where I used groupby twice. The per-group sum is written with groupby rather than sum(level=0), because the level argument was deprecated in pandas 1.3 and removed in 2.0. Anyway, in order of speed, quickest first:
dfout = (df.sort_values(by='amount', ascending=False)
           .groupby('group')
           .head(3)
           .set_index('group')
           .groupby(level=0).sum()
           .reset_index())
or
dfout = (df.sort_values(by='amount', ascending=False)
           .groupby('group')
           .nth([0,1,2])
           .groupby('group')['amount'].sum()
           .reset_index())
or
dfout = (df.groupby('group')
           .apply(lambda x: x['amount'].sort_values(ascending=False).head(3).sum())
           .rename('amount')
           .reset_index())
or a two-step approach to get your temp dataframe as shown in the question:
mid = df.sort_values(by='amount', ascending=False).groupby('group').head(3).sort_index()
final = mid.groupby('group').sum()
Full example:
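import io
import pandas as pd
data = '''\
group,amount
x,12
x,345
x,3
y,1
y,45
z,14
x,4
x,52
y,54
z,23
z,235
z,21
y,57
y,3
z,87'''
# pd.compat.StringIO no longer exists in pandas; io.StringIO does the same job
fileobj = io.StringIO(data)
df = pd.read_csv(fileobj)
# Top three amounts per group, then the per-group total.
# groupby(...).sum() stands in for sum(level=0), which pandas 2.0 removed.
dfout = (df.sort_values(by='amount', ascending=False)
           .groupby('group')
           .nth([0,1,2])
           .groupby('group')['amount'].sum()
           .reset_index())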
print(dfout)
Returns:
group amount
0 x 409
1 y 156
2 z 345
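If you want something closer to an explicit SQL ROW_NUMBER() filter, a rough sketch is to number the rows per group with cumcount() and keep those numbered 1 to 3. The data is the same as in the full example; the rn column and the ranked name are only illustrative.

```python
import pandas as pd

# Same data as the full example, written inline
df = pd.DataFrame({
    "group": list("xxxyyzxxyzzzyyz"),
    "amount": [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87],
})

# ROW_NUMBER() OVER (PARTITION BY "group" ORDER BY amount DESC)
ranked = df.sort_values("amount", ascending=False)
ranked["rn"] = ranked.groupby("group").cumcount() + 1

# WHERE rn <= 3, then SUM(amount) ... GROUP BY "group"
dfout = (ranked[ranked["rn"] <= 3]
         .groupby("group", as_index=False)["amount"].sum())
print(dfout)
```

Keeping rn as a real column makes it easy to inspect which rows were selected before aggregating.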
Pandas equivalent to SQL window functions
For the first SQL:
SELECT state_name,
state_population,
SUM(state_population)
OVER() AS national_population
FROM population
ORDER BY state_name
Pandas:
df.assign(national_population=df.state_population.sum()).sort_values('state_name')
For the second SQL:
SELECT state_name,
state_population,
region,
SUM(state_population)
OVER(PARTITION BY region) AS regional_population
FROM population
ORDER BY state_name
Pandas:
df.assign(regional_population=df.groupby('region')['state_population'].transform('sum')) \
.sort_values('state_name')
DEMO:
In [238]: df
Out[238]:
region state_name state_population
0 1 aaa 100
1 1 bbb 110
2 2 ccc 200
3 2 ddd 100
4 2 eee 100
5 3 xxx 55
national_population:
In [246]: df.assign(national_population=df.state_population.sum()).sort_values('state_name')
Out[246]:
region state_name state_population national_population
0 1 aaa 100 665
1 1 bbb 110 665
2 2 ccc 200 665
3 2 ddd 100 665
4 2 eee 100 665
5 3 xxx 55 665
regional_population:
In [239]: df.assign(regional_population=df.groupby('region')['state_population'].transform('sum')) \
...: .sort_values('state_name')
Out[239]:
region state_name state_population regional_population
0 1 aaa 100 210
1 1 bbb 110 210
2 2 ccc 200 400
3 2 ddd 100 400
4 2 eee 100 400
5 3 xxx 55 55
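To reproduce the demo from scratch, here is a small self-contained sketch that builds the same frame and adds both window sums in one assign call. The variable name result is just illustrative.

```python
import pandas as pd

# Frame reconstructed from the Out[238] listing above
df = pd.DataFrame({
    "region": [1, 1, 2, 2, 2, 3],
    "state_name": ["aaa", "bbb", "ccc", "ddd", "eee", "xxx"],
    "state_population": [100, 110, 200, 100, 100, 55],
})

result = (
    df.assign(
        # SUM(state_population) OVER (): one scalar broadcast to every row
        national_population=df["state_population"].sum(),
        # SUM(state_population) OVER (PARTITION BY region): transform keeps
        # the original row count instead of collapsing to one row per group
        regional_population=df.groupby("region")["state_population"].transform("sum"),
    )
    .sort_values("state_name")
)
print(result)
```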
SQL Server or Pandas Rank / Numbering a Window Function by Partition
It sounds like what you need is to order the DENSE_RANK by the minimum LineItem per Category, PartNumber:
SELECT
Category,
LineItem,
PartNumber,
DENSE_RANK() OVER (PARTITION BY Category ORDER BY MinLineItem)
FROM (
SELECT *,
MinLineItem = MIN(LineItem) OVER (PARTITION BY Category, PartNumber)
FROM [TABLE]
) t
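On the pandas side, the same two-step idea might look like the sketch below: a transform('min') for the inner window, then a dense rank per Category. Only the column names come from the query; the sample rows and the Rank column name are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the query above
df = pd.DataFrame({
    "Category":   ["A", "A", "A", "A", "B", "B"],
    "LineItem":   [1, 2, 3, 4, 1, 2],
    "PartNumber": ["P1", "P2", "P1", "P3", "P9", "P9"],
})

# MIN(LineItem) OVER (PARTITION BY Category, PartNumber)
df["MinLineItem"] = (
    df.groupby(["Category", "PartNumber"])["LineItem"].transform("min")
)

# DENSE_RANK() OVER (PARTITION BY Category ORDER BY MinLineItem)
df["Rank"] = (
    df.groupby("Category")["MinLineItem"]
      .rank(method="dense")
      .astype(int)
)
print(df)
```

Rows that share a PartNumber within a Category end up with the same rank, which is what the SQL subquery is arranging.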
Rank/Row Number Window Function in Python
This will pick one row per ID_Number, using the sorting you defined.
df.sort_values(by=['Score_2', 'Score_1'], ascending=[False, True]).groupby(['ID_Number']).head(1)
Output:
Action ID_Number Score_1 Score_2
3 Invest 222037001 9 0.4600
0 Use 207821021 7 0.4525
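If you would rather keep the row number visible, as with ROW_NUMBER() in SQL, a possible sketch is to add it with cumcount() and filter on it. Rows 0 and 3 below come from the output above; rows 1 and 2, and the rn column name, are made up for illustration.

```python
import pandas as pd

# Rows 0 and 3 are taken from the output above; rows 1 and 2 are hypothetical
df = pd.DataFrame({
    "Action": ["Use", "Sell", "Hold", "Invest"],
    "ID_Number": [207821021, 207821021, 222037001, 222037001],
    "Score_1": [7, 5, 9, 9],
    "Score_2": [0.4525, 0.4000, 0.4500, 0.4600],
})

ordered = df.sort_values(by=["Score_2", "Score_1"], ascending=[False, True])

# ROW_NUMBER() OVER (PARTITION BY ID_Number ORDER BY Score_2 DESC, Score_1 ASC)
ordered["rn"] = ordered.groupby("ID_Number").cumcount() + 1

# WHERE rn = 1 selects the same rows as .groupby(...).head(1)
top = ordered[ordered["rn"] == 1]
print(top)
```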
How to get dense rank in each partition window in pandas
This is built-in with groupby:
df['dense_rank'] = (df.groupby('Dominant_Topic')['appearance']
.rank(method='dense', ascending=False)
.astype(int)
)
Output:
Dominant_Topic word appearance dense_rank
0 Topic 0 aaaawww 50 3
1 Topic 0 aacn 100 2
2 Topic 0 aaren 20 4
3 Topic 0 aarongoodwin 200 1
4 Topic 1 aaronjfentress 10 3
5 Topic 1 aaronrodger 20 2
6 Topic 1 aasmiitkap 30 1
7 Topic 2 aavqbketmh 10 1
8 Topic 2 ab 10 1
9 Topic 2 abandon 1 2
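As a runnable check, the sketch below rebuilds the frame from the output above and also computes method='min' next to method='dense', which should correspond to SQL's RANK() versus DENSE_RANK(). The rank column name is only illustrative.

```python
import pandas as pd

# Frame reconstructed from the output above
df = pd.DataFrame({
    "Dominant_Topic": ["Topic 0"] * 4 + ["Topic 1"] * 3 + ["Topic 2"] * 3,
    "word": ["aaaawww", "aacn", "aaren", "aarongoodwin", "aaronjfentress",
             "aaronrodger", "aasmiitkap", "aavqbketmh", "ab", "abandon"],
    "appearance": [50, 100, 20, 200, 10, 20, 30, 10, 10, 1],
})

by_topic = df.groupby("Dominant_Topic")["appearance"]

# DENSE_RANK(): ties share a rank and no rank is skipped afterwards
df["dense_rank"] = by_topic.rank(method="dense", ascending=False).astype(int)

# RANK(): ties share a rank and the following rank is skipped
df["rank"] = by_topic.rank(method="min", ascending=False).astype(int)
print(df)
```

The two columns differ only in Topic 2, where the tie at 10 pushes "abandon" to rank 3 under method='min' but rank 2 under method='dense'.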
Pandas DataFrame Window Function
You could return boolean values where second_pass equals the group max, as idxmax only returns the first occurrence of the max:
df['highest'] = df.groupby(['test', 'analysis'])['second_pass'].transform(lambda x: x == np.amax(x)).astype(bool)
and then use np.where to capture all fruit values that have a group max, and merge the result into your DataFrame like so:
highest_fruits = df.groupby(['test', 'analysis']).apply(lambda x: [f for f in np.where(x.second_pass == np.amax(x.second_pass), x.fruit.tolist(), '').tolist() if f!='']).reset_index()
df = df.merge(highest_fruits, on=['test', 'analysis'], how='left').rename(columns={0: 'highest_fruit'})
Finally, for your follow-up:
first_pass = df.groupby(['test', 'analysis']).apply(lambda x: {fruit: x.loc[x.fruit==fruit, 'first_pass'] for fruit in x.highest_fruit.iloc[0]}).reset_index()
df = df.merge(first_pass, on=['test', 'analysis'], how='left').rename(columns={0: 'first_pass_highest_fruit'})
to get:
analysis first_pass fruit order second_pass test units highest \
0 full 12.1 apple 2 20.1 1 g True
1 full 7.1 apple 1 12.0 2 g False
2 partial 14.3 apple 3 13.1 1 g False
3 full 19.1 orange 2 20.1 1 g True
4 full 17.1 orange 1 18.5 2 g True
5 partial 23.4 orange 3 22.7 1 g True
6 full 23.1 grape 3 14.1 1 g False
7 full 17.2 grape 2 17.1 2 g False
8 partial 19.1 grape 1 19.4 1 g False
highest_fruit first_pass_highest_fruit
0 [apple, orange] {'orange': [19.1], 'apple': [12.1]}
1 [orange] {'orange': [17.1]}
2 [orange] {'orange': [23.4]}
3 [apple, orange] {'orange': [19.1], 'apple': [12.1]}
4 [orange] {'orange': [17.1]}
5 [orange] {'orange': [23.4]}
6 [apple, orange] {'orange': [19.1], 'apple': [12.1]}
7 [orange] {'orange': [17.1]}
8 [orange] {'orange': [23.4]}
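The first two steps can probably be written more compactly with transform('max') and a list aggregation. This is a sketch under assumptions: the frame holds only the columns needed, rebuilt from the output above, and keys, group_max and winners are illustrative names.

```python
import pandas as pd

# Only the columns needed here, reconstructed from the output above
df = pd.DataFrame({
    "test": [1, 2, 1, 1, 2, 1, 1, 2, 1],
    "analysis": ["full", "full", "partial"] * 3,
    "fruit": ["apple"] * 3 + ["orange"] * 3 + ["grape"] * 3,
    "second_pass": [20.1, 12.0, 13.1, 20.1, 18.5, 22.7, 14.1, 17.1, 19.4],
})
keys = ["test", "analysis"]

# MAX(second_pass) OVER (PARTITION BY test, analysis), compared row by row,
# so every row tied for the group max is flagged (not just the first one)
group_max = df.groupby(keys)["second_pass"].transform("max")
df["highest"] = df["second_pass"].eq(group_max)

# Collect all fruits tied for the max in each group and merge them back
winners = (df[df["highest"]]
           .groupby(keys)["fruit"].agg(list)
           .rename("highest_fruit")
           .reset_index())
df = df.merge(winners, on=keys, how="left")
print(df)
```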