Pandas Number Rows Within Group in Increasing Order

Pandas number rows within group in increasing order

Use groupby/cumcount:

In [25]: df['C'] = df.groupby(['A','B']).cumcount()+1; df
Out[25]:
   A  B  C
0  A  a  1
1  A  a  2
2  A  b  1
3  B  a  1
4  B  a  2
5  B  a  3
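
For reference, a frame like the following reproduces the input (reconstructed from the output above, so the exact construction is an assumption):

import pandas as pd

df = pd.DataFrame({'A': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'B': ['a', 'a', 'b', 'a', 'a', 'a']})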

pandas add row numbers after groupby

Let's do it in two steps; I'll compute both the total count and the cumulative count:

out = df.sort_values(['A', 'B', 'Date'],
                     ascending=[True, True, False])
out['row number'] = out.groupby(['A', 'B']).cumcount()
out['count number'] = out.groupby(['A', 'B'])['Date'].transform('count')
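
The difference between the two new columns: cumcount enumerates rows within each group (0, 1, 2, ...), while transform('count') broadcasts each group's total size back onto every row. A minimal sketch on made-up data (the original question's frame isn't shown, so the values here are assumptions):

import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'],
                   'B': [1, 1, 1],
                   'Date': pd.to_datetime(['2021-01-02', '2021-01-01', '2021-01-03'])})

out = df.sort_values(['A', 'B', 'Date'], ascending=[True, True, False])
out['row number'] = out.groupby(['A', 'B']).cumcount()                    # 0, 1, 0
out['count number'] = out.groupby(['A', 'B'])['Date'].transform('count')  # 2, 2, 1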

Pandas: number rows within group cumulatively and across another group

This is a tricky problem. You want to compute the cumcount within each group, but for every subsequent group you also need to know how far the counter has already advanced, so you know the offset to apply. That offset can be obtained with a max + cumsum of the cumcount over the previous groups. The only complication here is determining the relationship between previous and subsequent group labels, in case there isn't a simple +1 increment between the labels of subsequent groups.

# Cumcount within group
s = df.groupby(['col_1', 'col_2']).cumcount()

# Determine how many cumcounts were within all previous groups of col_1
to_merge = s.add(1).groupby(df['col_1']).max().cumsum().add(1).to_frame('new')

# Link group with prior group label
df1 = df[['col_1']].drop_duplicates()
df1['col_1_shift'] = df1['col_1'].shift(-1)
df1 = pd.concat([to_merge, df1.set_index('col_1')], axis=1)

# Bring the group offset over
df = df.merge(df1, left_on='col_1', right_on='col_1_shift', how='left')

# Add the group offset to the cumulative count within group.
# First group (no previous group) is NaN so fill with 1.
df['new'] = df['new'].fillna(1, downcast='infer') + s

# Clean up merging column
df = df.drop(columns='col_1_shift')


    col_1 col_2  col_3  new
0       1     A      1    1
1       1     B      1    1
2       2     A      3    2
3       2     A      3    3
4       2     A      3    4
5       2     B      3    2
6       2     B      3    3
7       2     B      3    4
8       3     A      2    5
9       3     A      2    6
10      3     C      2    5
11      3     C      2    6
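
If the merge-based bookkeeping feels heavy, the same offsets can be computed directly: take the per-group maximum of the within-group count, form a shifted cumulative sum (the running total contributed by all previous col_1 groups), and map it back onto the rows. This is a sketch of an equivalent approach, not part of the original answer:

# Count within each (col_1, col_2) group, starting at 1
s = df.groupby(['col_1', 'col_2']).cumcount() + 1

# Offset per col_1 group: running total of the maxima of all previous groups
offset = s.groupby(df['col_1']).max().cumsum().shift(fill_value=0)

df['new'] = s + df['col_1'].map(offset)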

Generate a row number for each entry in a pandas grouped dataframe when all rows in each group are the same

You need cumcount:

pet['row_number'] = pet.groupby(['country', 'state', 'city']).cumcount()
print (pet)
   country state         city  counter  row_number
0       US    CA   Los Angles       10           0
1       US    CA   Los Angles       10           1
2       US    CA   Los Angles       10           2
3       US    CA   Los Angles       10           3
4       US    CA   Los Angles       10           4
5       US    CA   Los Angles       10           5
6       US    CA   Los Angles       10           6
7       US    CA   Los Angles       10           7
8       US    CA   Los Angles       10           8
9       US    CA   Los Angles       10           9
10      US    IL  Springfield       20           0
11      US    IL  Springfield       20           1
12      US    IL  Springfield       20           2
13      US    IL  Springfield       20           3
14      US    IL  Springfield       20           4
15      US    IL  Springfield       20           5
16      US    IL  Springfield       20           6
17      US    IL  Springfield       20           7
18      US    IL  Springfield       20           8
19      US    IL  Springfield       20           9
20      US    IL  Springfield       20          10
21      US    IL  Springfield       20          11
22      US    IL  Springfield       20          12
23      US    IL  Springfield       20          13
24      US    IL  Springfield       20          14
25      US    IL  Springfield       20          15
26      US    IL  Springfield       20          16
27      US    IL  Springfield       20          17
28      US    IL  Springfield       20          18
29      US    IL  Springfield       20          19
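
Note that cumcount is zero-based. If a 1-based row number is wanted (as in the first answer above), add 1:

pet['row_number'] = pet.groupby(['country', 'state', 'city']).cumcount() + 1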

Pandas get topmost n records within each group

Did you try

df.groupby('id').head(2)

Output generated:

      id  value
id
1  0   1      1
   1   1      2
2  3   2      1
   4   2      2
3  7   3      1
4  8   4      1

(Keep in mind that you might need to order/sort before, depending on your data)

EDIT: As mentioned by the questioner, use

df.groupby('id').head(2).reset_index(drop=True)

to remove the MultiIndex and flatten the results:

   id  value
0   1      1
1   1      2
2   2      1
3   2      2
4   3      1
5   4      1
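
To make the earlier parenthetical concrete: if you want the top 2 rows per group by some measure, rather than the first 2 in the frame's current order, sort first. A sketch, assuming you rank on the value column:

(df.sort_values('value', ascending=False)
   .groupby('id')
   .head(2)
   .reset_index(drop=True))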

pandas groupby, then sort within groups

What you want to do is actually again a groupby (on the result of the first groupby): sort and take the first three elements per group.

Starting from the result of the first groupby:

In [60]: df_agg = df.groupby(['job','source']).agg({'count':sum})

We group by the first level of the index:

In [63]: g = df_agg['count'].groupby('job', group_keys=False)

Then we want to sort each group (descending) and take the first three elements:

In [64]: res = g.apply(lambda x: x.sort_values(ascending=False).head(3))

However, there is a shortcut function for exactly this, nlargest:

In [65]: g.nlargest(3)
Out[65]:
job     source
market  A         5
        D         4
        B         3
sales   E         7
        C         6
        B         4
dtype: int64

So in one go, this looks like:

df_agg['count'].groupby('job', group_keys=False).nlargest(3)
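
For context, here is the pipeline end to end on assumed raw data (the original df isn't shown; the values below are chosen to reproduce the output above):

import pandas as pd

df = pd.DataFrame({'job': ['market'] * 4 + ['sales'] * 4,
                   'source': ['A', 'B', 'C', 'D', 'B', 'C', 'D', 'E'],
                   'count': [5, 3, 2, 4, 4, 6, 1, 7]})

df_agg = df.groupby(['job', 'source']).agg({'count': 'sum'})
print(df_agg['count'].groupby('job', group_keys=False).nlargest(3))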

Pandas enumerate groups in descending order

Use GroupBy.ngroup with ascending=False:

df.groupby('column', sort=False).ngroup(ascending=False)+1

0    3
1    3
2    2
3    2
4    1
5    1
dtype: int64

For a DataFrame that looks like this,

df = pd.DataFrame({'column': [10, 10, 8, 8, 10, 10]})

...where only consecutive values are to be grouped, you'll need to modify your grouper:

(df.groupby(df['column'].ne(df['column'].shift()).cumsum(), sort=False)
   .ngroup(ascending=False)
   .add(1))

0    3
1    3
2    2
3    2
4    1
5    1
dtype: int64
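
To see why this works, inspect the intermediate grouper: ne compares each value to its predecessor (shift), and cumsum turns every change point into a new label, so only consecutive runs share a group:

df['column'].ne(df['column'].shift()).cumsum()
# For [10, 10, 8, 8, 10, 10] this yields 1, 1, 2, 2, 3, 3: the second
# run of 10s gets its own label, so it is enumerated separately.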

How to sort rows within a group (in descending order) using pandas

There's no need to use groupby here, a simple sort_values on the two columns will suffice:

dummy.sort_values(['col1', 'col3'], ascending=[True, False])

   col1 col2  col3
2     1    c     3
1     1    b     2
0     1    a     1
4     2    e     3
3     2    d     2
5     2    f     1
6     3    g     3
8     3    i     2
7     3    h     1

The order for "col2" is correct, you just need to return it as a list now:

col2_list = (dummy.sort_values(['col1', 'col3'], ascending=[True, False])
                  .get('col2')
                  .tolist())

col2_list
# ['c', 'b', 'a', 'e', 'd', 'f', 'g', 'i', 'h']

In response to a request in the comments:

now I want to combine these col2 values with col1 values, can I
directly fetch col1 from dummy df and sorted col2 to create a new
dataframe?

The output should look like (eg): 1 [c,b,a] 2 [e,d,f] ...

Here we can build on the previous solution, using GroupBy.agg to listify the data:

(dummy.sort_values(['col1', 'col3'], ascending=[True, False])
      .groupby('col1', sort=False)['col2']
      .agg(list)
      .reset_index())

   col1       col2
0     1  [c, b, a]
1     2  [e, d, f]
2     3  [g, i, h]
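
If a plain mapping is preferred over a DataFrame, the same pipeline can end in to_dict() (a small variation, not part of the original answer):

(dummy.sort_values(['col1', 'col3'], ascending=[True, False])
      .groupby('col1', sort=False)['col2']
      .agg(list)
      .to_dict())
# {1: ['c', 'b', 'a'], 2: ['e', 'd', 'f'], 3: ['g', 'i', 'h']}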

Sorting columns and selecting top n rows in each group pandas dataframe

There are 2 solutions:

1. sort_values and aggregate head:

df1 = df.sort_values('score', ascending=False).groupby('pidx').head(2)
print (df1)

    mainid pidx pidy  score
8        2    x    w     12
4        1    a    e      8
2        1    c    a      7
10       2    y    x      6
1        1    a    c      5
7        2    z    y      5
6        2    y    z      3
3        1    c    b      2
5        2    x    y      1

2. set_index and aggregate nlargest:

df = df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index() 
print (df)
  pidx  mainid pidy  score
0    a       1    e      8
1    a       1    c      5
2    c       1    a      7
3    c       1    b      2
4    x       2    w     12
5    x       2    y      1
6    y       2    x      6
7    y       2    z      3
8    z       2    y      5

Timings:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000

L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'mainid': np.random.randint(1000, size=N),
                   'pidx': np.random.randint(10000, size=N),
                   'pidy': np.random.choice(L2, N),
                   'score': np.random.randint(1000, size=N)})
#print (df)

def epat(df):
    grouped = df.groupby('pidx')
    new_df = pd.DataFrame([], columns=df.columns)
    for key, values in grouped:
        new_df = pd.concat([new_df,
                            grouped.get_group(key).sort_values('score', ascending=True)[:2]],
                           axis=0)
    return new_df

print (epat(df))

In [133]: %timeit (df.sort_values('score',ascending = False).groupby('pidx').head(2))
1 loop, best of 3: 309 ms per loop

In [134]: %timeit (df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index())
1 loop, best of 3: 7.11 s per loop

In [147]: %timeit (epat(df))
1 loop, best of 3: 22 s per loop

How to create order in pandas dataframe groups?

Group by user_id and get the rank using received_at:

df['count_n'] = df.groupby('user_id').received_at.apply(pd.Series.rank)

This doesn't require a sorting step and will assign the correct rank even if the data frame is not sorted by received_at within each group.

If the column user_id is set as an index (as your sample data seems to indicate), you could alternatively use the following. Although, in recent versions of pandas, grouping by a named index also works (i.e. the above might work as-is):

df.groupby(level=0).received_at.apply(pd.Series.rank)
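
Note that rank uses method='average' by default, so tied received_at values get fractional ranks. For unique integer row numbers within each group, method='first' breaks ties in order of appearance (a variation on the answer above):

df['count_n'] = (df.groupby('user_id')['received_at']
                   .rank(method='first')
                   .astype(int))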

