How to Groupby Consecutive Values in Pandas Dataframe

How to groupby consecutive values in pandas DataFrame

You can use groupby with a custom Series:

df = pd.DataFrame({'a': [1, 1, -1, 1, -1, -1]})
print(df)
   a
0  1
1  1
2 -1
3  1
4 -1
5 -1

print((df.a != df.a.shift()).cumsum())
0    1
1    1
2    2
3    3
4    4
5    4
Name: a, dtype: int32
for i, g in df.groupby((df.a != df.a.shift()).cumsum()):
    print(i)
    print(g)
    print(g.a.tolist())

1
   a
0  1
1  1
[1, 1]
2
   a
2 -1
[-1]
3
   a
3  1
[1]
4
   a
4 -1
5 -1
[-1, -1]
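
If you need a summary per run instead of printing each group, the same grouper feeds straight into an aggregation. A minimal sketch (using named aggregation, available since pandas 0.25):

import pandas as pd

df = pd.DataFrame({'a': [1, 1, -1, 1, -1, -1]})

# one id per run of identical values, as above
runs = (df.groupby((df.a != df.a.shift()).cumsum())
          .agg(value=('a', 'first'), length=('a', 'size')))
print(runs)
#    value  length
# a
# 1      1       2
# 2     -1       1
# 3      1       1
# 4     -1       2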

Groupby consecutive identical values in pandas dataframe and cumulative count of the number of occurrences

You need to generate a grouper for the changes in values. To do this, compare each value with the previous one and apply a cumsum. This gives you groups in the itertools.groupby style (e.g. [1, 1, 1, 1, 2, 2, 3, 4]); finally, group on it and apply a cumcount.

df['count'] = (df.groupby(df['col'].ne(df['col'].shift()).cumsum())
                 .cumcount()
              )

Output:

  col  count
0   a      0
1   a      1
2   a      2
3   a      3
4   b      0
5   b      1
6   a      0
7   b      0

Edit: for fun, here is a solution using itertools (much faster):

from itertools import groupby, chain

df['count'] = list(chain(*(list(range(len(list(g))))
                           for _, g in groupby(df['col']))))

NB. this runs much faster (88 µs vs 707 µs on the provided example)
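
As a self-contained check, here are both approaches side by side on the sample data reconstructed from the output above; they produce identical counts:

from itertools import chain, groupby

import pandas as pd

df = pd.DataFrame({'col': list('aaaabbab')})

# pandas: grouper from value changes, then cumcount within each run
pandas_count = (df.groupby(df['col'].ne(df['col'].shift()).cumsum())
                  .cumcount())

# itertools: a fresh 0..n-1 range for every run
itertools_count = list(chain(*(range(len(list(g)))
                               for _, g in groupby(df['col']))))

assert pandas_count.tolist() == itertools_count  # [0, 1, 2, 3, 0, 1, 0, 0]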

Group by consecutive values in one column and select the earliest and latest date for each group

This solution expects the dates to be sorted from the earliest to the latest, as in the provided example data.

from itertools import groupby
from io import StringIO

import pandas as pd

df = pd.read_csv(
    StringIO(
        """col1 col2 col3 col4 col5
a 2021-07-03 17:08 2021-07-04 10:41
b 2021-07-10 04:14 2021-07-11 04:32
c 2021-07-13 02:03 2021-07-14 00:45
d 2021-07-14 21:23 2021-07-15 02:59
d 2021-07-15 04:05 2021-07-15 09:41
e 2021-07-17 13:50 2021-07-18 08:49
a 2021-07-18 10:51 2021-07-18 12:27
a 2021-07-18 13:55 2021-07-19 06:26
f 2021-09-20 22:36 2021-09-20 23:19
f 2021-09-21 23:45 2021-09-23 10:12
"""
    ),
    delim_whitespace=True,
    header=0,
)

# Group by consecutive values in col1
groups = [list(group) for key, group in groupby(df["col1"].values.tolist())]
group_end_indices = pd.Series(len(g) for g in groups).cumsum()
group_start_indices = (group_end_indices - group_end_indices.diff(1)).fillna(0).astype(int)

filtered_df = []
for start_ix, end_ix in zip(group_start_indices, group_end_indices):
    group = df.iloc[start_ix:end_ix]
    # work on a copy of the first row to avoid chained-assignment pitfalls
    row = group.iloc[0].copy()
    if group.shape[0] > 1:
        # take the end date/time of the run from its last row
        row[["col4", "col5"]] = group.iloc[-1][["col4", "col5"]]
    filtered_df.append(row)
filtered_df = pd.DataFrame(filtered_df).reset_index(drop=True)
print(filtered_df)

Output:

  col1        col2   col3        col4   col5
0    a  2021-07-03  17:08  2021-07-04  10:41
1    b  2021-07-10  04:14  2021-07-11  04:32
2    c  2021-07-13  02:03  2021-07-14  00:45
3    d  2021-07-14  21:23  2021-07-15  09:41
4    e  2021-07-17  13:50  2021-07-18  08:49
5    a  2021-07-18  10:51  2021-07-19  06:26
6    f  2021-09-20  22:36  2021-09-23  10:12
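
As an aside, the same result can be obtained without the explicit index bookkeeping by reusing the shift/cumsum grouper from the answers above. A sketch, assuming the same df:

# group consecutive col1 values; keep the first start (col2/col3)
# and the last end (col4/col5) of each run
grp = (df["col1"] != df["col1"].shift()).cumsum()
out = df.groupby(grp, as_index=False).agg(
    col1=("col1", "first"),
    col2=("col2", "first"),
    col3=("col3", "first"),
    col4=("col4", "last"),
    col5=("col5", "last"),
)
print(out)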

Pandas DataFrame group by consecutive same values on multiple columns

Create consecutive groups by comparing the columns from the list with their shifted values, testing for any change per row with DataFrame.any, and then adding the cumulative sum:

cols = ['user', 'group', 'value1', 'value2']

grouped = df.groupby(((df[cols].shift() != df[cols]).any(axis=1)).cumsum())
for k, v in grouped:
    print(f'[group {k}]')
    print(v)


[group 1]
   user       group value1  value2     value3
0  paul  accounting    foo       3  random123
1  paul  accounting    foo       3  random456
2  paul  accounting    foo       3  random789
[group 2]
   user       group value1  value2     value3
3  paul  accounting    foo       5  random789
4  paul  accounting    foo       5  random789
5  paul  accounting    foo       5  random158
[group 3]
   user           group value1  value2     value3
6  jack  administration    foo       5  random487
7  jack  administration    foo       5  random435
[group 4]
   user           group value1  value2     value3
8  jack  administration    bar       3  random483
[group 5]
    user           group value1  value2     value3
9   jack  administration    foo       3  random431
10  jack  administration    foo       3  random478
[group 6]
    user       group value1  value2     value3
11  paul  accounting    foo       5  random759
[group 7]
    user           group value1  value2     value3
12  jack  administration    bar       3  random431
[group 8]
    user           group value1  value2     value3
13  jack  administration    foo       3  random478
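
If you need the group id as a column, for example to aggregate per block rather than print each one, the grouper can be attached directly. A short sketch, assuming the same df and cols as above:

# one id per consecutive block of identical (user, group, value1, value2)
df['block'] = (df[cols].shift() != df[cols]).any(axis=1).cumsum()

# e.g. number of rows in each block
print(df.groupby('block').size())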

Grouping dataframe based on consecutive occurrence of values

Since you're dealing with 0/1s, here's another alternative using diff + cumsum -

df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1    
df

       condition     H    t  group
index
0              1   2.0  1.1      1
1              1   7.0  1.5      1
2              0   1.0  0.9      2
3              0   6.5  1.6      2
4              1   7.0  1.1      3
5              1   9.0  1.8      3
6              1  22.0  2.0      3

If you don't mind floats, this can be made a little faster.

df['group'] = df.condition.diff().abs().cumsum() + 1
df.loc[0, 'group'] = 1
df

   index  condition     H    t  group
0      0          1   2.0  1.1    1.0
1      1          1   7.0  1.5    1.0
2      2          0   1.0  0.9    2.0
3      3          0   6.5  1.6    2.0
4      4          1   7.0  1.1    3.0
5      5          1   9.0  1.8    3.0
6      6          1  22.0  2.0    3.0

Here's the version with numpy equivalents -

import numpy as np

df['group'] = 1
df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
df

       condition     H    t  group
index
0              1   2.0  1.1      1
1              1   7.0  1.5      1
2              0   1.0  0.9      2
3              0   6.5  1.6      2
4              1   7.0  1.1      3
5              1   9.0  1.8      3
6              1  22.0  2.0      3

On my machine, here are the timings -

df = pd.concat([df] * 100000, ignore_index=True)

%timeit df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1
10 loops, best of 3: 25.1 ms per loop

%%timeit
df['group'] = df.condition.diff().abs().cumsum() + 1
df.loc[0, 'group'] = 1

10 loops, best of 3: 23.4 ms per loop

%%timeit
df['group'] = 1
df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1

10 loops, best of 3: 21.4 ms per loop

%timeit df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
100 loops, best of 3: 15.8 ms per loop
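
For completeness, here is the fastest variant as a self-contained snippet, with the sample frame reconstructed from the output above:

import pandas as pd

df = pd.DataFrame({'condition': [1, 1, 0, 0, 1, 1, 1],
                   'H': [2.0, 7.0, 1.0, 6.5, 7.0, 9.0, 22.0],
                   't': [1.1, 1.5, 0.9, 1.6, 1.1, 1.8, 2.0]})

# new group id whenever condition changes
df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
print(df['group'].tolist())  # [1, 1, 2, 2, 3, 3, 3]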

Groupby consecutive occurrences of two column values in pandas

You can compare both columns with DataFrame.ne (the != operator) against their shifted rows, then use DataFrame.any to test whether at least one column changed, and finally add the cumulative sum:

diff = df[["a_cn","b_cn"]].ne(df[["a_cn","b_cn"]].shift()).any(axis=1).cumsum()
#alternative
diff = (df[["a_cn","b_cn"]] != df[["a_cn","b_cn"]].shift()).any(axis=1).cumsum()
print(diff)
0    1
1    1
2    1
3    2
4    2
5    3
6    3
7    4
dtype: int32

Your solution should be changed to use | for bitwise OR:

diff = (
    (df["a_cn"] != df["a_cn"].shift()) |
    (df["b_cn"] != df["b_cn"].shift())
).cumsum()
print(diff)
0    1
1    1
2    1
3    2
4    2
5    3
6    3
7    4
dtype: int32
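
The grouper can then be passed straight to groupby, for example to measure each consecutive run of (a_cn, b_cn) pairs. A sketch, assuming the same df:

# first values and size of each consecutive run
runs = df.groupby(diff).agg(a_cn=("a_cn", "first"),
                            b_cn=("b_cn", "first"),
                            size=("a_cn", "size"))
print(runs)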

Identify consecutive same values in Pandas Dataframe, with a Groupby

You can try this: 1) create an extra group variable with df.value.diff().ne(0).cumsum() to denote the value changes; 2) use transform('size') to calculate the group size and compare it with three; then you get the flag column you need:

df['flag'] = df.value.groupby([df.id, df.value.diff().ne(0).cumsum()]).transform('size').ge(3).astype(int) 
df

Breakdown:

1) diff is not equal to zero (which is literally what df.value.diff().ne(0) means) gives a condition that is True whenever there is a value change:

df.value.diff().ne(0)
#0      True
#1     False
#2      True
#3      True
#4     False
#5     False
#6      True
#7     False
#8     False
#9     False
#10     True
#11     True
#12     True
#13    False
#14    False
#15     True
#16    False
#17     True
#18    False
#19    False
#20    False
#21    False
#Name: value, dtype: bool

2) Then cumsum gives a non-descending sequence of ids, where each id denotes a consecutive chunk of identical values. Note that when summing boolean values, True is counted as one while False is counted as zero:

df.value.diff().ne(0).cumsum()
#0     1
#1     1
#2     2
#3     3
#4     3
#5     3
#6     4
#7     4
#8     4
#9     4
#10    5
#11    6
#12    7
#13    7
#14    7
#15    8
#16    8
#17    9
#18    9
#19    9
#20    9
#21    9
#Name: value, dtype: int64

3) Combined with the id column, you can group the data frame, calculate the group size, and get the flag column.
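
Since the question's data is not fully shown, here is a minimal hypothetical frame demonstrating the same flag logic end to end:

import pandas as pd

# hypothetical data: id 1 has a run of three 3s (flagged),
# id 2 has a run of four 7s (flagged); shorter runs get 0
df = pd.DataFrame({'id':    [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
                   'value': [3, 3, 3, 4, 4, 5, 7, 7, 7, 7]})

df['flag'] = (df.value.groupby([df.id, df.value.diff().ne(0).cumsum()])
                .transform('size').ge(3).astype(int))
print(df.flag.tolist())  # [1, 1, 1, 0, 0, 0, 1, 1, 1, 1]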

pandas groupby and find max no. of consecutive occurrences of 1s in dataframe

You could create a mask with a unique value for each consecutive group of numbers (ne/!= + cumsum), then group by that and the Id, sum the values within each run, and take the max:

(df.groupby([df['Id'],
             df['values'].ne(df.groupby('Id')['values'].shift(1)).cumsum()])
   ['values'].sum()
   .groupby(level=0).max()
   .reset_index())

Output:

>>> df
   Id  values
0   1     3.0
1   2     6.0
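
For reference, here is a hypothetical input consistent with the output above (Id 1 with a longest run of three 1s, Id 2 with a run of six); the data itself is an assumption:

import pandas as pd

df = pd.DataFrame({
    'Id':     [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
    'values': [1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
})

out = (df.groupby([df['Id'],
                   df['values'].ne(df.groupby('Id')['values'].shift(1)).cumsum()])
         ['values'].sum()
         .groupby(level=0).max()
         .reset_index())
print(out)
#    Id  values
# 0   1       3
# 1   2       6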

