Pandas number rows within group in increasing order
Use groupby/cumcount:
In [25]: df['C'] = df.groupby(['A','B']).cumcount()+1; df
Out[25]:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
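For reference, here is a self-contained sketch that rebuilds the frame shown above (the values are read off the printed output) and applies the same call. cumcount is zero-based, hence the + 1.

```python
import pandas as pd

# Frame reconstructed from the output above
df = pd.DataFrame({'A': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'B': ['a', 'a', 'b', 'a', 'a', 'a']})

# cumcount() numbers rows 0, 1, 2, ... within each (A, B) group;
# adding 1 turns that into 1-based numbering
df['C'] = df.groupby(['A', 'B']).cumcount() + 1
print(df)
```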
pandas add row numbers after groupby
Do it in two steps: sort first, then add both the cumulative count (row number) and the total count per group.
out = df.sort_values(['A', 'B', 'Date'],
ascending=[True, True, False])
out['row number'] = out.groupby(['A','B']).cumcount()
out['count number'] = out.groupby(['A','B'])['Date'].transform('count')
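A minimal runnable sketch of the two steps. The sample values are made up for illustration; only the column names come from the snippet above.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'A': ['x', 'x', 'x', 'y', 'y'],
    'B': ['p', 'p', 'q', 'p', 'p'],
    'Date': pd.to_datetime(['2021-01-01', '2021-03-01', '2021-02-01',
                            '2021-05-01', '2021-04-01']),
})

# Step 1: order rows so the newest Date comes first within each (A, B) group
out = df.sort_values(['A', 'B', 'Date'], ascending=[True, True, False])

# Step 2: zero-based position within the group, and the group's total size
out['row number'] = out.groupby(['A', 'B']).cumcount()
out['count number'] = out.groupby(['A', 'B'])['Date'].transform('count')
print(out)
```

Because cumcount follows the current row order, the sort in step 1 decides which row gets number 0.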
Pandas: number rows within group cumulatively and across another group
This is a tricky problem. You want the cumcount within each group, but for every subsequent group you also need to know how far the count has already advanced, which is the offset to apply. That offset can be computed with a max + cumsum of the cumcount over the previous groups. The only complication is determining which group precedes which, in case the labels of consecutive groups don't differ by a simple + 1 increment.
# Cumcount within group
s = df.groupby(['col_1', 'col_2']).cumcount()
# Determine how many cumcounts were within all previous groups of `col_1`
to_merge = s.add(1).groupby(df['col_1']).max().cumsum().add(1).to_frame('new')
# Link group with prior group label
df1 = df[['col_1']].drop_duplicates()
df1['col_1_shift'] = df1['col_1'].shift(-1)
df1 = pd.concat([to_merge, df1.set_index('col_1')], axis=1)
# Bring the group offset over
df = df.merge(df1, left_on='col_1', right_on='col_1_shift', how='left')
# Add the group offset to the cumulative count within group.
# First group (no previous group) is NaN so fill with 1.
# (`downcast=` is deprecated in recent pandas, so cast to int explicitly)
df['new'] = (df['new'].fillna(1) + s).astype(int)
# Clean up merging column
df = df.drop(columns='col_1_shift')
col_1 col_2 col_3 new
0 1 A 1 1
1 1 B 1 1
2 2 A 3 2
3 2 A 3 3
4 2 A 3 4
5 2 B 3 2
6 2 B 3 3
7 2 B 3 4
8 3 A 2 5
9 3 A 2 6
10 3 C 2 5
11 3 C 2 6
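The same offsets can also be obtained without the merge. The following is an alternative sketch, not the approach above: it maps each col_1 label to the combined span of the groups before it, assuming groups should be chained in order of first appearance. The names span and offset are introduced here purely for illustration, and the frame is rebuilt from the printed output.

```python
import pandas as pd

# Frame reconstructed from the output above
df = pd.DataFrame({
    'col_1': [1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3],
    'col_2': list('ABAAABBBAACC'),
    'col_3': [1, 1, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2],
})

# Cumcount within each (col_1, col_2) group
s = df.groupby(['col_1', 'col_2']).cumcount()

# Largest within-group count per col_1, in order of first appearance
span = s.add(1).groupby(df['col_1'], sort=False).max()

# Offset for each col_1 label = combined span of all earlier col_1 groups
offset = span.cumsum().shift(fill_value=0)

# Final number = offset of the col_1 group + position within (col_1, col_2) + 1
df['new'] = df['col_1'].map(offset) + s + 1
print(df)
```

Mapping by label avoids having to link each group to its predecessor explicitly, at the cost of relying on the order in which the labels first appear.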
Generate a row number for each entry in a pandas grouped dataframe when all rows in each group are the same
You need cumcount:
pet['row_number'] = pet.groupby(['country', 'state', 'city']).cumcount()
print (pet)
country state city counter row_number
0 US CA Los Angles 10 0
1 US CA Los Angles 10 1
2 US CA Los Angles 10 2
3 US CA Los Angles 10 3
4 US CA Los Angles 10 4
5 US CA Los Angles 10 5
6 US CA Los Angles 10 6
7 US CA Los Angles 10 7
8 US CA Los Angles 10 8
9 US CA Los Angles 10 9
10 US IL Springfield 20 0
11 US IL Springfield 20 1
12 US IL Springfield 20 2
13 US IL Springfield 20 3
14 US IL Springfield 20 4
15 US IL Springfield 20 5
16 US IL Springfield 20 6
17 US IL Springfield 20 7
18 US IL Springfield 20 8
19 US IL Springfield 20 9
20 US IL Springfield 20 10
21 US IL Springfield 20 11
22 US IL Springfield 20 12
23 US IL Springfield 20 13
24 US IL Springfield 20 14
25 US IL Springfield 20 15
26 US IL Springfield 20 16
27 US IL Springfield 20 17
28 US IL Springfield 20 18
29 US IL Springfield 20 19
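cumcount depends only on a row's position within its group, not on the values in the row, so fully duplicated rows are not a problem. A smaller sketch with made-up data follows; the rows_left column is a hypothetical addition that shows the reverse numbering available through ascending=False.

```python
import pandas as pd

# Hypothetical, smaller frame: every row within a group is identical
pet = pd.DataFrame({
    'country': ['US'] * 5,
    'state': ['CA'] * 3 + ['IL'] * 2,
    'city': ['Los Angeles'] * 3 + ['Springfield'] * 2,
    'counter': [10] * 3 + [20] * 2,
})

keys = ['country', 'state', 'city']

# Zero-based row number within each group
pet['row_number'] = pet.groupby(keys).cumcount()

# Same numbering, counting down to 0 instead of up from 0
pet['rows_left'] = pet.groupby(keys).cumcount(ascending=False)
print(pet)
```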
Pandas get topmost n records within each group
Did you try
df.groupby('id').head(2)
Output generated:
       id  value
id
1  0    1      1
   1    1      2
2  3    2      1
   4    2      2
3  7    3      1
4  8    4      1
(Keep in mind that you might need to order/sort before, depending on your data.)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
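To illustrate the note about sorting first, here is a small sketch on made-up, deliberately unsorted data. head(2) simply keeps the first two rows it encounters per group, so the sort decides what "topmost" means.

```python
import pandas as pd

# Hypothetical data, deliberately unsorted
df = pd.DataFrame({'id': [2, 1, 1, 2, 1, 3],
                   'value': [5, 9, 3, 1, 7, 4]})

# Sort so the largest value comes first within each id, then keep two rows per id
top2 = (df.sort_values(['id', 'value'], ascending=[True, False])
          .groupby('id')
          .head(2)
          .reset_index(drop=True))
print(top2)
```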
pandas groupby, then sort within groups
What you want to do is actually again a groupby (on the result of the first groupby): sort and take the first three elements per group.
Starting from the result of the first groupby:
In [60]: df_agg = df.groupby(['job','source']).agg({'count':sum})
We group by the first level of the index:
In [63]: g = df_agg['count'].groupby('job', group_keys=False)
Then we want to sort ('order') each group and take the first three elements:
In [64]: res = g.apply(lambda x: x.sort_values(ascending=False).head(3))
However, there is a shortcut function for this, nlargest:
In [65]: g.nlargest(3)
Out[65]:
job source
market A 5
D 4
B 3
sales E 7
C 6
B 4
dtype: int64
So in one go, this looks like:
df_agg['count'].groupby('job', group_keys=False).nlargest(3)
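A self-contained sketch of the whole pipeline. The input rows are hypothetical, chosen so that the aggregated counts reproduce the values in the output above; the string 'sum' is used as the aggregation name.

```python
import pandas as pd

# Hypothetical input chosen so the top-3 counts match the output above
df = pd.DataFrame({
    'job': ['market'] * 5 + ['sales'] * 5,
    'source': list('ABCDE') * 2,
    'count': [5, 3, 2, 4, 1, 2, 4, 6, 3, 7],
})

# First groupby: total count per (job, source)
df_agg = df.groupby(['job', 'source']).agg({'count': 'sum'})

# Second groupby on the 'job' index level: three largest counts per job
top3 = df_agg['count'].groupby('job', group_keys=False).nlargest(3)
print(top3)
```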
Pandas enumerate groups in descending order
Use GroupBy.ngroup with ascending=False:
df.groupby('column', sort=False).ngroup(ascending=False)+1
0 3
1 3
2 2
3 2
4 1
5 1
dtype: int64
For a DataFrame that looks like this,
df = pd.DataFrame({'column': [10, 10, 8, 8, 10, 10]})
...where only consecutive values are to be grouped, you'll need to modify your grouper:
(df.groupby(df['column'].ne(df['column'].shift()).cumsum(), sort=False)
.ngroup(ascending=False)
.add(1))
0 3
1 3
2 2
3 2
4 1
5 1
dtype: int64
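The difference between the two groupers is easiest to see side by side. This sketch uses the frame from the answer; the names by_value, run_id and by_run are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'column': [10, 10, 8, 8, 10, 10]})

# Grouping on the values alone: both runs of 10 land in the same group
by_value = df.groupby('column', sort=False).ngroup(ascending=False) + 1

# Grouping on run ids: a new id starts whenever the value changes
run_id = df['column'].ne(df['column'].shift()).cumsum()
by_run = df.groupby(run_id, sort=False).ngroup(ascending=False) + 1

print(pd.DataFrame({'column': df['column'],
                    'by_value': by_value,
                    'by_run': by_run}))
```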
How to sort rows within a group (in descending order) using pandas
There's no need to use groupby here; a simple sort_values on the two columns will suffice:
dummy.sort_values(['col1', 'col3'], ascending=[True, False])
col1 col2 col3
2 1 c 3
1 1 b 2
0 1 a 1
4 2 e 3
3 2 d 2
5 2 f 1
6 3 g 3
8 3 i 2
7 3 h 1
The order for "col2" is correct, you just need to return it as a list now:
col2_list = (dummy.sort_values(['col1', 'col3'], ascending=[True, False])
.get('col2')
.tolist())
col2_list
# ['c', 'b', 'a', 'e', 'd', 'f', 'g', 'i', 'h']
In response to a request in the comments:
"now I want to combine these col2 values with col1 values, can I directly fetch col1 from dummy df and sorted col2 to create a new dataframe? The output should look like (eg): 1 [c,b,a] 2 [e,d,f] ..."
Here we can build on the previous solution with GroupBy.agg to listify the data:
(dummy.sort_values(['col1', 'col3'], ascending=[True, False])
.groupby('col1', sort=False)['col2']
.agg(list)
.reset_index())
col1 col2
0 1 [c, b, a]
1 2 [e, d, f]
2 3 [g, i, h]
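Both results can be reproduced from a single sorted frame. In this sketch the dummy frame is rebuilt from the sorted output shown earlier, and the names ordered and per_group are introduced for illustration.

```python
import pandas as pd

# Frame reconstructed from the sorted output above
dummy = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                      'col2': list('abcdefghi'),
                      'col3': [1, 2, 3, 2, 3, 1, 3, 1, 2]})

# Sort once: col1 ascending, col3 descending within each col1
ordered = dummy.sort_values(['col1', 'col3'], ascending=[True, False])

# Flat list of col2 in the sorted order
col2_list = ordered['col2'].tolist()

# One list of col2 values per col1; sort=False keeps the existing row order
per_group = ordered.groupby('col1', sort=False)['col2'].agg(list).reset_index()
print(col2_list)
print(per_group)
```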
Sorting columns and selecting top n rows in each group pandas dataframe
There are 2 solutions:
1. sort_values and aggregate head:
df1 = df.sort_values('score',ascending = False).groupby('pidx').head(2)
print (df1)
mainid pidx pidy score
8 2 x w 12
4 1 a e 8
2 1 c a 7
10 2 y x 6
1 1 a c 5
7 2 z y 5
6 2 y z 3
3 1 c b 2
5 2 x y 1
2. set_index and aggregate nlargest:
df = df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index()
print (df)
pidx mainid pidy score
0 a 1 e 8
1 a 1 c 5
2 c 1 a 7
3 c 1 b 2
4 x 2 w 12
5 x 2 y 1
6 y 2 x 6
7 y 2 z 3
8 z 2 y 5
Timings:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'mainid': np.random.randint(1000, size=N),
                   'pidx': np.random.randint(10000, size=N),
                   'pidy': np.random.choice(L2, N),
                   'score': np.random.randint(1000, size=N)})
#print (df)

# Loop-based approach, included in the timing comparison
def epat(df):
    grouped = df.groupby('pidx')
    new_df = pd.DataFrame([], columns=df.columns)
    for key, values in grouped:
        # axis must be passed by keyword in recent pandas
        new_df = pd.concat([new_df, grouped.get_group(key).sort_values('score', ascending=True)[:2]], axis=0)
    return new_df

print (epat(df))
In [133]: %timeit (df.sort_values('score',ascending = False).groupby('pidx').head(2))
1 loop, best of 3: 309 ms per loop
In [134]: %timeit (df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index())
1 loop, best of 3: 7.11 s per loop
In [147]: %timeit (epat(df))
1 loop, best of 3: 22 s per loop
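As a quick sanity check, the two solutions can be compared on a small frame. The data below is hypothetical (same column names as above, no tied scores), and the comparison looks only at (pidx, score) pairs because the two results differ in row order and column order.

```python
import pandas as pd

# Hypothetical small frame with the same columns as above
df = pd.DataFrame({
    'mainid': [1, 1, 1, 1, 2, 2, 2],
    'pidx': ['a', 'a', 'a', 'c', 'x', 'x', 'x'],
    'pidy': ['c', 'e', 'f', 'a', 'y', 'w', 'z'],
    'score': [5, 8, 1, 7, 1, 12, 4],
})

# Solution 1: sort once, then keep the first two rows of every group
by_head = df.sort_values('score', ascending=False).groupby('pidx').head(2)

# Solution 2: nlargest per group; the other columns travel along in the index
by_nlargest = (df.set_index(['mainid', 'pidy'])
                 .groupby('pidx')['score']
                 .nlargest(2)
                 .reset_index())

# Compare the selections as sorted (pidx, score) pairs
pairs_head = sorted(zip(by_head['pidx'], by_head['score']))
pairs_nlargest = sorted(zip(by_nlargest['pidx'], by_nlargest['score']))
print(pairs_head == pairs_nlargest)
```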
How to create order in pandas dataframe groups?
Group by user_id and get the rank using received_at:
df['count_n'] = df.groupby('user_id').received_at.apply(pd.Series.rank)
This doesn't require a sorting step and will assign the correct rank even if the data frame is not sorted by received_at within each group.
If the column user_id is set as the index (as your sample data seems to indicate), you could alternatively use the following instead, although in recent versions of pandas grouping by a named index level also works (i.e. the above might work):
df.groupby(level=0).received_at.apply(pd.Series.rank)
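A runnable sketch on made-up data that is deliberately not sorted by received_at. It calls GroupBy.rank directly instead of going through apply, which is a swapped-in variant of the call above; method='first' and the cast to int are choices made here so that ties are broken by row order and the result is a whole number.

```python
import pandas as pd

# Hypothetical data, deliberately not sorted by received_at
df = pd.DataFrame({
    'user_id': [1, 1, 2, 1, 2],
    'received_at': pd.to_datetime(['2020-01-03', '2020-01-01', '2020-02-05',
                                   '2020-01-02', '2020-02-01']),
})

# Rank each timestamp within its user_id group; earliest gets rank 1
df['count_n'] = (df.groupby('user_id')['received_at']
                   .rank(method='first')
                   .astype(int))
print(df)
```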