SQL-Like Window Functions in Pandas: Row Numbering in a Python Pandas DataFrame

You can do this by using groupby twice along with the rank method:

In [11]: g = df.groupby('key1')

Use the min method argument to give values which share the same data1 the same RN:

In [12]: g['data1'].rank(method='min')
Out[12]:
0    1
1    2
2    2
3    1
4    4
dtype: float64

In [13]: df['RN'] = g['data1'].rank(method='min')

And then groupby these results and add the rank with respect to data2:

In [14]: g1 = df.groupby(['key1', 'RN'])

In [15]: g1['data2'].rank(ascending=False) - 1
Out[15]:
0    0
1    0
2    1
3    0
4    0
dtype: float64

In [16]: df['RN'] += g1['data2'].rank(ascending=False) - 1

In [17]: df
Out[17]:
   data1  data2 key1  RN
0      1      1    a   1
1      2     10    a   2
2      2      2    a   3
3      3      3    b   1
4      3     30    a   4

It feels like there ought to be a native way to do this (there may well be!...).
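
One candidate for a more direct route, offered as a sketch rather than the canonical answer: sort by the ORDER BY columns and number the rows inside each key1 group with cumcount(), which mirrors ROW_NUMBER() OVER (PARTITION BY key1 ORDER BY data1, data2 DESC). The frame below is rebuilt from the output shown above.

```python
import pandas as pd

# Frame rebuilt from the Out[17] listing above
df = pd.DataFrame({
    "data1": [1, 2, 2, 3, 3],
    "data2": [1, 10, 2, 3, 30],
    "key1": ["a", "a", "a", "b", "a"],
})

# Sort by the ORDER BY columns, then count rows within each partition.
# cumcount() is zero-based, so add 1; assignment realigns on the original index.
df["RN"] = (
    df.sort_values(["data1", "data2"], ascending=[True, False])
      .groupby("key1")
      .cumcount()
    + 1
)

print(df)
```

Because cumcount() returns integers, RN stays an integer column here, whereas the rank-based version yields floats.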

Pandas Equivalent for SQL window function and rows range

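Try groupby with shift, then reindex back onto the original rows. The shift is done per customer so that one customer's total never leaks into the next customer's first day:

df['new'] = df.groupby(['customer','day']).purchase.sum().groupby(level='customer').shift().reindex(pd.MultiIndex.from_frame(df[['customer','day']])).values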
df
Out[259]: 
  customer  day  purchase   new
0      Joe    1         5   NaN
1      Joe    1        10   NaN
2      Joe    2        10  15.0
3      Joe    2         5  15.0
4      Joe    4        10  15.0

Update

s = df.groupby(['customer','day']).apply(lambda x : df.loc[df.customer.isin(x['customer'].tolist()) & (df.day.isin(x['day']-1)|df.day.isin(x['day']-2)),'purchase'].sum())
df['new'] = s.reindex(pd.MultiIndex.from_frame(df[['customer','day']])).values
df
Out[271]: 
  customer  day  purchase  new
0      Joe    1         5    0
1      Joe    1        10    0
2      Joe    2         5   15
3      Joe    2         5   15
4      Joe    4        10   10
5      Joe    7         5    0
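
The look-back in the update (the sum of purchases on the previous two days) can also be written without apply. The following is only a sketch of one alternative: total the purchases per day, move each daily total forward by one and by two days, and merge the result back. The frame is rebuilt from the output above, and the names daily and lagged are introduced here for illustration.

```python
import pandas as pd

# Frame rebuilt from the Out[271] listing above
df = pd.DataFrame({
    "customer": ["Joe"] * 6,
    "day": [1, 1, 2, 2, 4, 7],
    "purchase": [5, 10, 5, 5, 10, 5],
})

# Total purchases per customer and day
daily = df.groupby(["customer", "day"], as_index=False)["purchase"].sum()

# Move each daily total forward by 1 and 2 days, so that a row for day d
# collects the totals of days d-1 and d-2
lagged = (
    pd.concat([daily.assign(day=daily["day"] + k) for k in (1, 2)])
      .groupby(["customer", "day"], as_index=False)["purchase"].sum()
      .rename(columns={"purchase": "new"})
)

# Left merge keeps the original row order; days with no history become 0
out = df.merge(lagged, on=["customer", "day"], how="left")
out["new"] = out["new"].fillna(0).astype(int)

print(out)
```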

Is there any row number alternative like SQL in python?

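Thanks to @ScottBoston I looked further: nth() can be used instead of head(), and calling set_index() first is an alternative to the old solution where I used groupby twice. Two version notes: sum(level=0) was deprecated in pandas 1.3 and removed in 2.0, so the snippets below use groupby(...).sum() instead, and since pandas 2.0 nth() keeps the original row index rather than indexing the result by group key, so that variant groups by 'group' explicitly. In order of speed (as reported for the original sum(level=0) versions), quickest first:

dfout = (df.sort_values(by='amount', ascending=False)
         .groupby('group')
         .head(3)
         .set_index('group')
         .groupby(level=0)   # replaces the removed .sum(level=0)
         .sum()
         .reset_index())

dfout = (df.sort_values(by='amount', ascending=False)
         .groupby('group')
         .nth([0,1,2])
         .groupby('group')['amount']   # works whether 'group' is an index level or a column
         .sum()
         .reset_index())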

dfout = (df.groupby('group')
         .apply(lambda x: x['amount'].sort_values(ascending=False).head(3).sum())
         .rename('amount')
         .reset_index())

or a two-step approach to get your temp dataframe as shown in the question:

mid = df.sort_values(by='amount', ascending=False).groupby('group').head(3).sort_index()
final = mid.set_index('group').groupby(level=0).sum()  # sum(level=0) was removed in pandas 2.0

Full example:

import pandas as pd

data = '''\
group,amount
x,12
x,345
x,3
y,1
y,45
z,14
x,4
x,52
y,54
z,23
z,235
z,21
y,57
y,3
z,87'''

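from io import StringIO  # pd.compat.StringIO no longer exists in current pandas

fileobj = StringIO(data)
df = pd.read_csv(fileobj)

dfout = (df.sort_values(by='amount', ascending=False)
         .groupby('group')
         .nth([0,1,2])
         .groupby('group')['amount']   # replaces .sum(level=0), removed in pandas 2.0
         .sum()
         .reset_index())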

print(dfout)

Returns:

  group  amount
0     x     409
1     y     156
2     z     345
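
For readers coming from SQL, the same top-3-per-group sum can be phrased as a ROW_NUMBER() filter. This is a sketch of that reading, not one of the timed variants above: rank(method='first') plays the role of ROW_NUMBER() OVER (PARTITION BY "group" ORDER BY amount DESC), and rows numbered 1 to 3 are kept before summing. The data is the same as in the full example.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["x", "x", "x", "y", "y", "z", "x", "x",
              "y", "z", "z", "z", "y", "y", "z"],
    "amount": [12, 345, 3, 1, 45, 14, 4, 52,
               54, 23, 235, 21, 57, 3, 87],
})

# Row number within each group, largest amount first; ties are broken
# by order of appearance, as ROW_NUMBER() would do with a stable sort
rn = df.groupby("group")["amount"].rank(method="first", ascending=False)

# Keep the three largest rows per group, then sum them
top3 = df[rn <= 3]
dfout = top3.groupby("group", as_index=False)["amount"].sum()

print(dfout)
```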

Pandas equivalent to SQL window functions

For the first SQL:

SELECT state_name,  
       state_population,
       SUM(state_population)
        OVER() AS national_population
FROM population   
ORDER BY state_name

Pandas:

df.assign(national_population=df.state_population.sum()).sort_values('state_name')

For the second SQL:

SELECT state_name,  
       state_population,
       region,
       SUM(state_population)
        OVER(PARTITION BY region) AS regional_population
FROM population    
ORDER BY state_name

Pandas:

df.assign(regional_population=df.groupby('region')['state_population'].transform('sum')) \
  .sort_values('state_name')

DEMO:

In [238]: df
Out[238]:
   region state_name  state_population
0       1        aaa               100
1       1        bbb               110
2       2        ccc               200
3       2        ddd               100
4       2        eee               100
5       3        xxx                55

national_population:

In [246]: df.assign(national_population=df.state_population.sum()).sort_values('state_name')
Out[246]:
   region state_name  state_population  national_population
0       1        aaa               100                  665
1       1        bbb               110                  665
2       2        ccc               200                  665
3       2        ddd               100                  665
4       2        eee               100                  665
5       3        xxx                55                  665

regional_population:

In [239]: df.assign(regional_population=df.groupby('region')['state_population'].transform('sum')) \
     ...:   .sort_values('state_name')
Out[239]:
   region state_name  state_population  regional_population
0       1        aaa               100                  210
1       1        bbb               110                  210
2       2        ccc               200                  400
3       2        ddd               100                  400
4       2        eee               100                  400
5       3        xxx                55                   55
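
For convenience, here is a self-contained version of the demo that builds the frame from the listing above and adds both window columns in one assign call. Combining the two columns into a single call is a choice made for this sketch. A scalar passed to assign is broadcast to every row, like SUM(...) OVER(), while transform('sum') returns one value per row, like SUM(...) OVER(PARTITION BY region).

```python
import pandas as pd

# Frame rebuilt from the Out[238] listing above
df = pd.DataFrame({
    "region": [1, 1, 2, 2, 2, 3],
    "state_name": ["aaa", "bbb", "ccc", "ddd", "eee", "xxx"],
    "state_population": [100, 110, 200, 100, 100, 55],
})

out = df.assign(
    # SUM(state_population) OVER(): one scalar broadcast to all rows
    national_population=df["state_population"].sum(),
    # SUM(state_population) OVER(PARTITION BY region): group total per row
    regional_population=df.groupby("region")["state_population"].transform("sum"),
).sort_values("state_name")

print(out)
```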

SQL Server or Pandas Rank / Numbering a Window Function by Partition

It sounds like what you need is to order the DENSE_RANK by the minimum LineItem per Category and PartNumber:

SELECT 
  Category,
  LineItem,
  PartNumber,
  DENSE_RANK() OVER (PARTITION BY Category ORDER BY MinLineItem)
FROM (
    SELECT *,
      MinLineItem = MIN(LineItem) OVER (PARTITION BY Category, PartNumber)
    FROM [TABLE]
) t
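
The same two-step idea carries over to pandas. The sketch below uses a small made-up table: the rows are hypothetical and only the column names come from the query. transform('min') stands in for MIN(...) OVER (PARTITION BY Category, PartNumber), and rank(method='dense') stands in for DENSE_RANK() OVER (PARTITION BY Category ORDER BY MinLineItem).

```python
import pandas as pd

# Hypothetical rows; only the column names are taken from the SQL above
df = pd.DataFrame({
    "Category": ["A", "A", "A", "A", "B", "B"],
    "LineItem": [1, 2, 3, 4, 1, 2],
    "PartNumber": ["p1", "p2", "p1", "p3", "p9", "p9"],
})

# Inner query: smallest LineItem for each (Category, PartNumber)
df["MinLineItem"] = (
    df.groupby(["Category", "PartNumber"])["LineItem"].transform("min")
)

# Outer query: dense rank of that minimum within each Category
df["Rank"] = (
    df.groupby("Category")["MinLineItem"].rank(method="dense").astype(int)
)

print(df)
```

Repeated part numbers share a rank, and the rank order follows the first line on which each part appears.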

Rank/Row Number Window Function in Python

This will pick one row per ID_Number, using the sort order you defined:

df.sort_values(by=['Score_2', 'Score_1'], ascending=[False, True]).groupby(['ID_Number']).head(1)

Output:

   Action  ID_Number  Score_1  Score_2
3  Invest  222037001        9   0.4600
0     Use  207821021        7   0.4525

How to get dense rank in each partition window in pandas

This is built-in with groupby:

df['dense_rank'] = (df.groupby('Dominant_Topic')['appearance']
                      .rank(method='dense', ascending=False)
                      .astype(int)
                   )

Output:

  Dominant_Topic            word  appearance  dense_rank
0        Topic 0         aaaawww          50           3
1        Topic 0            aacn         100           2
2        Topic 0           aaren          20           4
3        Topic 0    aarongoodwin         200           1
4        Topic 1  aaronjfentress          10           3
5        Topic 1     aaronrodger          20           2
6        Topic 1      aasmiitkap          30           1
7        Topic 2      aavqbketmh          10           1
8        Topic 2              ab          10           1
9        Topic 2         abandon           1           2
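
The method argument of rank is what selects the SQL flavour. As a short sketch using only the Topic 2 rows from the output above (where two words tie at 10), 'min' behaves like RANK(), 'dense' like DENSE_RANK(), and 'first' like ROW_NUMBER():

```python
import pandas as pd

# Topic 2 rows from the output above; two rows tie at appearance == 10
df = pd.DataFrame({
    "Dominant_Topic": ["Topic 2", "Topic 2", "Topic 2"],
    "word": ["aavqbketmh", "ab", "abandon"],
    "appearance": [10, 10, 1],
})

g = df.groupby("Dominant_Topic")["appearance"]

# RANK(): ties share a rank and leave a gap afterwards
df["rank"] = g.rank(method="min", ascending=False).astype(int)
# DENSE_RANK(): ties share a rank, no gap
df["dense_rank"] = g.rank(method="dense", ascending=False).astype(int)
# ROW_NUMBER(): ties are broken by order of appearance
df["row_number"] = g.rank(method="first", ascending=False).astype(int)

print(df)
```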

Pandas DataFrame Window Function

You could return boolean values where second_pass equals the group max, as idxmax only returns the first occurrence of the max:

import numpy as np

df['highest'] = df.groupby(['test', 'analysis'])['second_pass'].transform(lambda x: x == np.amax(x)).astype(bool)

and then use np.where to capture all fruit values that have a group max, and merge the result into your DataFrame like so:

highest_fruits = df.groupby(['test', 'analysis']).apply(lambda x: [f for f in np.where(x.second_pass == np.amax(x.second_pass), x.fruit.tolist(), '').tolist() if f != '']).reset_index()
df = df.merge(highest_fruits, on=['test', 'analysis'], how='left').rename(columns={0: 'highest_fruit'})

Finally, for your follow-up (tolist() is added so the dict values are plain lists, as in the output below):

first_pass = df.groupby(['test', 'analysis']).apply(lambda x: {fruit: x.loc[x.fruit==fruit, 'first_pass'].tolist() for fruit in x.highest_fruit.iloc[0]}).reset_index()
df = df.merge(first_pass, on=['test', 'analysis'], how='left').rename(columns={0: 'first_pass_highest_fruit'})

to get:

  analysis  first_pass   fruit  order  second_pass  test units highest  \
0     full        12.1   apple      2         20.1     1     g    True   
1     full         7.1   apple      1         12.0     2     g   False   
2  partial        14.3   apple      3         13.1     1     g   False   
3     full        19.1  orange      2         20.1     1     g    True   
4     full        17.1  orange      1         18.5     2     g    True   
5  partial        23.4  orange      3         22.7     1     g    True   
6     full        23.1   grape      3         14.1     1     g   False   
7     full        17.2   grape      2         17.1     2     g   False   
8  partial        19.1   grape      1         19.4     1     g   False   

     highest_fruit             first_pass_highest_fruit  
0  [apple, orange]  {'orange': [19.1], 'apple': [12.1]}  
1         [orange]                   {'orange': [17.1]}  
2         [orange]                   {'orange': [23.4]}  
3  [apple, orange]  {'orange': [19.1], 'apple': [12.1]}  
4         [orange]                   {'orange': [17.1]}  
5         [orange]                   {'orange': [23.4]}  
6  [apple, orange]  {'orange': [19.1], 'apple': [12.1]}  
7         [orange]                   {'orange': [17.1]}  
8         [orange]                   {'orange': [23.4]}
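
A more compact route to the first two steps, offered as a sketch rather than a drop-in replacement: compare each value with its group maximum via transform('max'), then collect the winning fruits with agg(list). The frame is rebuilt from the output above (the order and units columns are left out), and the name winners is introduced here.

```python
import pandas as pd

# Frame rebuilt from the output above, without the order and units columns
df = pd.DataFrame({
    "analysis": ["full", "full", "partial"] * 3,
    "first_pass": [12.1, 7.1, 14.3, 19.1, 17.1, 23.4, 23.1, 17.2, 19.1],
    "fruit": ["apple"] * 3 + ["orange"] * 3 + ["grape"] * 3,
    "second_pass": [20.1, 12.0, 13.1, 20.1, 18.5, 22.7, 14.1, 17.1, 19.4],
    "test": [1, 2, 1] * 3,
})

keys = ["test", "analysis"]

# True wherever a row's second_pass equals its group maximum (ties included)
df["highest"] = df["second_pass"].eq(
    df.groupby(keys)["second_pass"].transform("max")
)

# All fruits that reach the maximum, one list per group
winners = df[df["highest"]].groupby(keys)["fruit"].agg(list)

print(winners)
```

Merging winners back onto df by test and analysis would then reproduce the highest_fruit column.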
