Pandas Number Rows Within Group in Increasing Order

Pandas number rows within group in increasing order

Use groupby/cumcount:

In [25]: df['C'] = df.groupby(['A','B']).cumcount()+1; df
Out[25]:
   A  B  C
0  A  a  1
1  A  a  2
2  A  b  1
3  B  a  1
4  B  a  2
5  B  a  3
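
For reference, a frame like the following reproduces the input (reconstructed from the output above, so the exact construction is an assumption):

import pandas as pd

df = pd.DataFrame({'A': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'B': ['a', 'a', 'b', 'a', 'a', 'a']})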

pandas add row numbers after groupby

Let's do it in two steps; I'll compute both the total count and the cumulative count:

out = df.sort_values(['A', 'B', 'Date'],
                     ascending=[True, True, False])
out['row number'] = out.groupby(['A', 'B']).cumcount()
out['count number'] = out.groupby(['A', 'B'])['Date'].transform('count')
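
The difference between the two new columns: cumcount enumerates rows within each group (0, 1, 2, ...), while transform('count') broadcasts each group's total size back onto every row. A minimal sketch on made-up data (the original question's frame isn't shown, so the values here are assumptions):

import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'],
                   'B': [1, 1, 1],
                   'Date': pd.to_datetime(['2021-01-02', '2021-01-01', '2021-01-03'])})

out = df.sort_values(['A', 'B', 'Date'], ascending=[True, True, False])
out['row number'] = out.groupby(['A', 'B']).cumcount()                    # 0, 1, 0
out['count number'] = out.groupby(['A', 'B'])['Date'].transform('count')  # 2, 2, 1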

Pandas: number rows within group cumulatively and across another group

This is a tricky problem. You want to compute the cumcount within each group, but for every subsequent group you also need to know how far the counter has already advanced, so you know the offset to apply. That offset can be obtained with a max + cumsum of the cumcount over the previous groups. The only complication here is determining the relationship between previous and subsequent group labels, in case there isn't a simple +1 increment between the labels of subsequent groups.

# Cumcount within group
s = df.groupby(['col_1', 'col_2']).cumcount()

# Determine how many cumcounts were within all previous groups of col_1
to_merge = s.add(1).groupby(df['col_1']).max().cumsum().add(1).to_frame('new')

# Link group with prior group label
df1 = df[['col_1']].drop_duplicates()
df1['col_1_shift'] = df1['col_1'].shift(-1)
df1 = pd.concat([to_merge, df1.set_index('col_1')], axis=1)

# Bring the group offset over
df = df.merge(df1, left_on='col_1', right_on='col_1_shift', how='left')

# Add the group offset to the cumulative count within group.
# First group (no previous group) is NaN so fill with 1.
df['new'] = df['new'].fillna(1, downcast='infer') + s

# Clean up merging column
df = df.drop(columns='col_1_shift')


    col_1 col_2  col_3  new
0       1     A      1    1
1       1     B      1    1
2       2     A      3    2
3       2     A      3    3
4       2     A      3    4
5       2     B      3    2
6       2     B      3    3
7       2     B      3    4
8       3     A      2    5
9       3     A      2    6
10      3     C      2    5
11      3     C      2    6
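
If the merge-based bookkeeping feels heavy, the same offsets can be computed directly: take the per-group maximum of the within-group count, form a shifted cumulative sum (the running total contributed by all previous col_1 groups), and map it back onto the rows. This is a sketch of an equivalent approach, not part of the original answer:

# Count within each (col_1, col_2) group, starting at 1
s = df.groupby(['col_1', 'col_2']).cumcount() + 1

# Offset per col_1 group: running total of the maxima of all previous groups
offset = s.groupby(df['col_1']).max().cumsum().shift(fill_value=0)

df['new'] = s + df['col_1'].map(offset)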

Generate a row number for each entry in a pandas grouped dataframe when all rows in each group are the same

You need cumcount:

pet['row_number'] = pet.groupby(['country', 'state', 'city']).cumcount()
print (pet)
   country state         city  counter  row_number
0       US    CA   Los Angles       10           0
1       US    CA   Los Angles       10           1
2       US    CA   Los Angles       10           2
3       US    CA   Los Angles       10           3
4       US    CA   Los Angles       10           4
5       US    CA   Los Angles       10           5
6       US    CA   Los Angles       10           6
7       US    CA   Los Angles       10           7
8       US    CA   Los Angles       10           8
9       US    CA   Los Angles       10           9
10      US    IL  Springfield       20           0
11      US    IL  Springfield       20           1
12      US    IL  Springfield       20           2
13      US    IL  Springfield       20           3
14      US    IL  Springfield       20           4
15      US    IL  Springfield       20           5
16      US    IL  Springfield       20           6
17      US    IL  Springfield       20           7
18      US    IL  Springfield       20           8
19      US    IL  Springfield       20           9
20      US    IL  Springfield       20          10
21      US    IL  Springfield       20          11
22      US    IL  Springfield       20          12
23      US    IL  Springfield       20          13
24      US    IL  Springfield       20          14
25      US    IL  Springfield       20          15
26      US    IL  Springfield       20          16
27      US    IL  Springfield       20          17
28      US    IL  Springfield       20          18
29      US    IL  Springfield       20          19
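
Note that cumcount is zero-based. If a 1-based row number is wanted (as in the first answer above), add 1:

pet['row_number'] = pet.groupby(['country', 'state', 'city']).cumcount() + 1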

Pandas get topmost n records within each group

Did you try

df.groupby('id').head(2)

Output generated:

      id  value
id
1  0   1      1
   1   1      2
2  3   2      1
   4   2      2
3  7   3      1
4  8   4      1

(Keep in mind that you might need to order/sort before, depending on your data)

EDIT: As mentioned by the questioner, use

df.groupby('id').head(2).reset_index(drop=True)

to remove the MultiIndex and flatten the results:

   id  value
0   1      1
1   1      2
2   2      1
3   2      2
4   3      1
5   4      1
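
To make the earlier parenthetical concrete: if you want the top 2 rows per group by some measure, rather than the first 2 in the frame's current order, sort first. A sketch, assuming you rank on the value column:

(df.sort_values('value', ascending=False)
   .groupby('id')
   .head(2)
   .reset_index(drop=True))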

pandas groupby, then sort within groups

What you want to do is actually again a groupby (on the result of the first groupby): sort and take the first three elements per group.

Starting from the result of the first groupby:

In [60]: df_agg = df.groupby(['job','source']).agg({'count':sum})

We group by the first level of the index:

In [63]: g = df_agg['count'].groupby('job', group_keys=False)

Then we want to sort each group (descending) and take the first three elements:

In [64]: res = g.apply(lambda x: x.sort_values(ascending=False).head(3))

However, there is a shortcut function for exactly this, nlargest:

In [65]: g.nlargest(3)
Out[65]:
job     source
market  A         5
        D         4
        B         3
sales   E         7
        C         6
        B         4
dtype: int64

So in one go, this looks like:

df_agg['count'].groupby('job', group_keys=False).nlargest(3)
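
For context, here is the pipeline end to end on assumed raw data (the original df isn't shown; the values below are chosen to reproduce the output above):

import pandas as pd

df = pd.DataFrame({'job': ['market'] * 4 + ['sales'] * 4,
                   'source': ['A', 'B', 'C', 'D', 'B', 'C', 'D', 'E'],
                   'count': [5, 3, 2, 4, 4, 6, 1, 7]})

df_agg = df.groupby(['job', 'source']).agg({'count': 'sum'})
print(df_agg['count'].groupby('job', group_keys=False).nlargest(3))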

Pandas enumerate groups in descending order

Use GroupBy.ngroup with ascending=False:

df.groupby('column', sort=False).ngroup(ascending=False)+1

0    3
1    3
2    2
3    2
4    1
5    1
dtype: int64

For a DataFrame that looks like this,

df = pd.DataFrame({'column': [10, 10, 8, 8, 10, 10]})

...where only consecutive values are to be grouped, you'll need to modify your grouper:

(df.groupby(df['column'].ne(df['column'].shift()).cumsum(), sort=False)
   .ngroup(ascending=False)
   .add(1))

0    3
1    3
2    2
3    2
4    1
5    1
dtype: int64
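
To see why this works, inspect the intermediate grouper: ne compares each value to its predecessor (shift), and cumsum turns every change point into a new label, so only consecutive runs share a group:

df['column'].ne(df['column'].shift()).cumsum()
# For [10, 10, 8, 8, 10, 10] this yields 1, 1, 2, 2, 3, 3: the second
# run of 10s gets its own label, so it is enumerated separately.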

How to sort rows within a group (in descending order) using pandas

There's no need to use groupby here, a simple sort_values on the two columns will suffice:

dummy.sort_values(['col1', 'col3'], ascending=[True, False])

   col1 col2  col3
2     1    c     3
1     1    b     2
0     1    a     1
4     2    e     3
3     2    d     2
5     2    f     1
6     3    g     3
8     3    i     2
7     3    h     1

The order for "col2" is correct, you just need to return it as a list now:

col2_list = (dummy.sort_values(['col1', 'col3'], ascending=[True, False])
                  .get('col2')
                  .tolist())

col2_list
# ['c', 'b', 'a', 'e', 'd', 'f', 'g', 'i', 'h']

In response to a request in the comments:

now I want to combine these col2 values with col1 values, can I
directly fetch col1 from dummy df and sorted col2 to create a new
dataframe?

The output should look like (eg): 1 [c,b,a] 2 [e,d,f] ...

Here we can build on the previous solution, using GroupBy.agg to listify the data:

(dummy.sort_values(['col1', 'col3'], ascending=[True, False])
      .groupby('col1', sort=False)['col2']
      .agg(list)
      .reset_index())

   col1       col2
0     1  [c, b, a]
1     2  [e, d, f]
2     3  [g, i, h]
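
If a plain mapping is preferred over a DataFrame, the same pipeline can end in to_dict() (a small variation, not part of the original answer):

(dummy.sort_values(['col1', 'col3'], ascending=[True, False])
      .groupby('col1', sort=False)['col2']
      .agg(list)
      .to_dict())
# {1: ['c', 'b', 'a'], 2: ['e', 'd', 'f'], 3: ['g', 'i', 'h']}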

Sorting columns and selecting top n rows in each group pandas dataframe

There are 2 solutions:

1. sort_values and aggregate head:

df1 = df.sort_values('score', ascending=False).groupby('pidx').head(2)
print (df1)

    mainid pidx pidy  score
8        2    x    w     12
4        1    a    e      8
2        1    c    a      7
10       2    y    x      6
1        1    a    c      5
7        2    z    y      5
6        2    y    z      3
3        1    c    b      2
5        2    x    y      1

2. set_index and aggregate nlargest:

df = df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index() 
print (df)
  pidx  mainid pidy  score
0    a       1    e      8
1    a       1    c      5
2    c       1    a      7
3    c       1    b      2
4    x       2    w     12
5    x       2    y      1
6    y       2    x      6
7    y       2    z      3
8    z       2    y      5

Timings:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000

L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'mainid': np.random.randint(1000, size=N),
                   'pidx': np.random.randint(10000, size=N),
                   'pidy': np.random.choice(L2, N),
                   'score': np.random.randint(1000, size=N)})
#print (df)

def epat(df):
    grouped = df.groupby('pidx')
    new_df = pd.DataFrame([], columns=df.columns)
    for key, values in grouped:
        new_df = pd.concat([new_df,
                            grouped.get_group(key).sort_values('score', ascending=True)[:2]],
                           axis=0)
    return new_df

print (epat(df))

In [133]: %timeit (df.sort_values('score',ascending = False).groupby('pidx').head(2))
1 loop, best of 3: 309 ms per loop

In [134]: %timeit (df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index())
1 loop, best of 3: 7.11 s per loop

In [147]: %timeit (epat(df))
1 loop, best of 3: 22 s per loop

How to create order in pandas dataframe groups?

Group by user_id and get the rank using received_at:

df['count_n'] = df.groupby('user_id').received_at.apply(pd.Series.rank)

This doesn't require a sorting step and will assign the correct rank even if the data frame is not sorted by received_at within each group.

If the column user_id is set as an index (as your sample data seems to indicate), you could alternatively use the following. Although, in recent versions of pandas, grouping by a named index also works (i.e. the above might work as-is):

df.groupby(level=0).received_at.apply(pd.Series.rank)
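
Note that rank uses method='average' by default, so tied received_at values get fractional ranks. For unique integer row numbers within each group, method='first' breaks ties in order of appearance (a variation on the answer above):

df['count_n'] = (df.groupby('user_id')['received_at']
                   .rank(method='first')
                   .astype(int))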

