How to Groupby Consecutive Values in Pandas Dataframe

How to groupby consecutive values in pandas DataFrame

You can use groupby with a custom Series:

df = pd.DataFrame({'a': [1, 1, -1, 1, -1, -1]})
print(df)
   a
0  1
1  1
2 -1
3  1
4 -1
5 -1

print((df.a != df.a.shift()).cumsum())
0    1
1    1
2    2
3    3
4    4
5    4
Name: a, dtype: int32
for i, g in df.groupby((df.a != df.a.shift()).cumsum()):
    print(i)
    print(g)
    print(g.a.tolist())

1
   a
0  1
1  1
[1, 1]
2
   a
2 -1
[-1]
3
   a
3  1
[1]
4
   a
4 -1
5 -1
[-1, -1]
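
If you need a summary per run instead of printing each group, the same grouper feeds straight into an aggregation. A minimal sketch (using named aggregation, available since pandas 0.25):

import pandas as pd

df = pd.DataFrame({'a': [1, 1, -1, 1, -1, -1]})

# one id per run of identical values, as above
runs = (df.groupby((df.a != df.a.shift()).cumsum())
          .agg(value=('a', 'first'), length=('a', 'size')))
print(runs)
#    value  length
# a
# 1      1       2
# 2     -1       1
# 3      1       1
# 4     -1       2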

Groupby consecutive identical values in pandas dataframe and cumulative count of the number of occurrences

You need to generate a grouper for the changes in values. To do this, compare each value with the previous one and apply a cumsum. This gives you groups in the itertools.groupby style (e.g. [1, 1, 1, 1, 2, 2, 3, 4]); finally, group on it and apply a cumcount.

df['count'] = (df.groupby(df['col'].ne(df['col'].shift()).cumsum())
                 .cumcount()
              )

Output:

  col  count
0   a      0
1   a      1
2   a      2
3   a      3
4   b      0
5   b      1
6   a      0
7   b      0

Edit: for fun, here is a solution using itertools (much faster):

from itertools import groupby, chain

df['count'] = list(chain(*(list(range(len(list(g))))
                           for _, g in groupby(df['col']))))

NB. this runs much faster (88 µs vs 707 µs on the provided example)
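
As a self-contained check, here are both approaches side by side on the sample data reconstructed from the output above; they produce identical counts:

from itertools import chain, groupby

import pandas as pd

df = pd.DataFrame({'col': list('aaaabbab')})

# pandas: grouper from value changes, then cumcount within each run
pandas_count = (df.groupby(df['col'].ne(df['col'].shift()).cumsum())
                  .cumcount())

# itertools: a fresh 0..n-1 range for every run
itertools_count = list(chain(*(range(len(list(g)))
                               for _, g in groupby(df['col']))))

assert pandas_count.tolist() == itertools_count  # [0, 1, 2, 3, 0, 1, 0, 0]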

Group by consecutive values in one column and select the earliest and latest date for each group

This solution expects the dates to be sorted from the earliest to the latest, as in the provided example data.

from itertools import groupby
from io import StringIO

import pandas as pd

df = pd.read_csv(
    StringIO(
        """col1 col2 col3 col4 col5
a 2021-07-03 17:08 2021-07-04 10:41
b 2021-07-10 04:14 2021-07-11 04:32
c 2021-07-13 02:03 2021-07-14 00:45
d 2021-07-14 21:23 2021-07-15 02:59
d 2021-07-15 04:05 2021-07-15 09:41
e 2021-07-17 13:50 2021-07-18 08:49
a 2021-07-18 10:51 2021-07-18 12:27
a 2021-07-18 13:55 2021-07-19 06:26
f 2021-09-20 22:36 2021-09-20 23:19
f 2021-09-21 23:45 2021-09-23 10:12
"""
    ),
    delim_whitespace=True,
    header=0,
)

# Group by consecutive values in col1
groups = [list(group) for key, group in groupby(df["col1"].values.tolist())]
group_end_indices = pd.Series(len(g) for g in groups).cumsum()
group_start_indices = (group_end_indices - group_end_indices.diff(1)).fillna(0).astype(int)

filtered_df = []
for start_ix, end_ix in zip(group_start_indices, group_end_indices):
    group = df.iloc[start_ix:end_ix]
    # work on a copy of the first row to avoid chained-assignment pitfalls
    row = group.iloc[0].copy()
    if group.shape[0] > 1:
        # take the end date/time of the run from its last row
        row[["col4", "col5"]] = group.iloc[-1][["col4", "col5"]]
    filtered_df.append(row)
filtered_df = pd.DataFrame(filtered_df).reset_index(drop=True)
print(filtered_df)

Output:

  col1        col2   col3        col4   col5
0    a  2021-07-03  17:08  2021-07-04  10:41
1    b  2021-07-10  04:14  2021-07-11  04:32
2    c  2021-07-13  02:03  2021-07-14  00:45
3    d  2021-07-14  21:23  2021-07-15  09:41
4    e  2021-07-17  13:50  2021-07-18  08:49
5    a  2021-07-18  10:51  2021-07-19  06:26
6    f  2021-09-20  22:36  2021-09-23  10:12
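
As an aside, the same result can be obtained without the explicit index bookkeeping by reusing the shift/cumsum grouper from the answers above. A sketch, assuming the same df:

# group consecutive col1 values; keep the first start (col2/col3)
# and the last end (col4/col5) of each run
grp = (df["col1"] != df["col1"].shift()).cumsum()
out = df.groupby(grp, as_index=False).agg(
    col1=("col1", "first"),
    col2=("col2", "first"),
    col3=("col3", "first"),
    col4=("col4", "last"),
    col5=("col5", "last"),
)
print(out)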

Pandas DataFrame group by consecutive same values on multiple columns

Create consecutive groups by comparing the columns from the list with their shifted values, testing for any change per row with DataFrame.any, and then adding the cumulative sum:

cols = ['user', 'group', 'value1', 'value2']

grouped = df.groupby(((df[cols].shift() != df[cols]).any(axis=1)).cumsum())
for k, v in grouped:
    print(f'[group {k}]')
    print(v)


[group 1]
   user       group value1  value2     value3
0  paul  accounting    foo       3  random123
1  paul  accounting    foo       3  random456
2  paul  accounting    foo       3  random789
[group 2]
   user       group value1  value2     value3
3  paul  accounting    foo       5  random789
4  paul  accounting    foo       5  random789
5  paul  accounting    foo       5  random158
[group 3]
   user           group value1  value2     value3
6  jack  administration    foo       5  random487
7  jack  administration    foo       5  random435
[group 4]
   user           group value1  value2     value3
8  jack  administration    bar       3  random483
[group 5]
    user           group value1  value2     value3
9   jack  administration    foo       3  random431
10  jack  administration    foo       3  random478
[group 6]
    user       group value1  value2     value3
11  paul  accounting    foo       5  random759
[group 7]
    user           group value1  value2     value3
12  jack  administration    bar       3  random431
[group 8]
    user           group value1  value2     value3
13  jack  administration    foo       3  random478
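
If you need the group id as a column, for example to aggregate per block rather than print each one, the grouper can be attached directly. A short sketch, assuming the same df and cols as above:

# one id per consecutive block of identical (user, group, value1, value2)
df['block'] = (df[cols].shift() != df[cols]).any(axis=1).cumsum()

# e.g. number of rows in each block
print(df.groupby('block').size())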

Grouping dataframe based on consecutive occurrence of values

Since you're dealing with 0/1s, here's another alternative using diff + cumsum -

df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1    
df

       condition     H    t  group
index
0              1   2.0  1.1      1
1              1   7.0  1.5      1
2              0   1.0  0.9      2
3              0   6.5  1.6      2
4              1   7.0  1.1      3
5              1   9.0  1.8      3
6              1  22.0  2.0      3

If you don't mind floats, this can be made a little faster.

df['group'] = df.condition.diff().abs().cumsum() + 1
df.loc[0, 'group'] = 1
df

   index  condition     H    t  group
0      0          1   2.0  1.1    1.0
1      1          1   7.0  1.5    1.0
2      2          0   1.0  0.9    2.0
3      3          0   6.5  1.6    2.0
4      4          1   7.0  1.1    3.0
5      5          1   9.0  1.8    3.0
6      6          1  22.0  2.0    3.0

Here's the version with numpy equivalents -

import numpy as np

df['group'] = 1
df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
df

       condition     H    t  group
index
0              1   2.0  1.1      1
1              1   7.0  1.5      1
2              0   1.0  0.9      2
3              0   6.5  1.6      2
4              1   7.0  1.1      3
5              1   9.0  1.8      3
6              1  22.0  2.0      3

On my machine, here are the timings -

df = pd.concat([df] * 100000, ignore_index=True)

%timeit df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1
10 loops, best of 3: 25.1 ms per loop

%%timeit
df['group'] = df.condition.diff().abs().cumsum() + 1
df.loc[0, 'group'] = 1

10 loops, best of 3: 23.4 ms per loop

%%timeit
df['group'] = 1
df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1

10 loops, best of 3: 21.4 ms per loop

%timeit df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
100 loops, best of 3: 15.8 ms per loop
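
For completeness, here is the fastest variant as a self-contained snippet, with the sample frame reconstructed from the output above:

import pandas as pd

df = pd.DataFrame({'condition': [1, 1, 0, 0, 1, 1, 1],
                   'H': [2.0, 7.0, 1.0, 6.5, 7.0, 9.0, 22.0],
                   't': [1.1, 1.5, 0.9, 1.6, 1.1, 1.8, 2.0]})

# new group id whenever condition changes
df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
print(df['group'].tolist())  # [1, 1, 2, 2, 3, 3, 3]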

Groupby consecutive occurrences of two column values in pandas

You can compare both columns with DataFrame.ne (the != operator) against their shifted rows, then use DataFrame.any to test whether at least one column changed, and finally add the cumulative sum:

diff = df[["a_cn","b_cn"]].ne(df[["a_cn","b_cn"]].shift()).any(axis=1).cumsum()
#alternative
diff = (df[["a_cn","b_cn"]] != df[["a_cn","b_cn"]].shift()).any(axis=1).cumsum()
print(diff)
0    1
1    1
2    1
3    2
4    2
5    3
6    3
7    4
dtype: int32

Your solution should be changed to use | for bitwise OR:

diff = (
    (df["a_cn"] != df["a_cn"].shift()) |
    (df["b_cn"] != df["b_cn"].shift())
).cumsum()
print(diff)
0    1
1    1
2    1
3    2
4    2
5    3
6    3
7    4
dtype: int32
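
The grouper can then be passed straight to groupby, for example to measure each consecutive run of (a_cn, b_cn) pairs. A sketch, assuming the same df:

# first values and size of each consecutive run
runs = df.groupby(diff).agg(a_cn=("a_cn", "first"),
                            b_cn=("b_cn", "first"),
                            size=("a_cn", "size"))
print(runs)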

Identify consecutive same values in Pandas Dataframe, with a Groupby

You can try this: 1) create an extra group variable with df.value.diff().ne(0).cumsum() to denote the value changes; 2) use transform('size') to calculate the group size and compare it with three; then you get the flag column you need:

df['flag'] = df.value.groupby([df.id, df.value.diff().ne(0).cumsum()]).transform('size').ge(3).astype(int) 
df

Breakdown:

1) diff is not equal to zero (which is literally what df.value.diff().ne(0) means) gives a condition that is True whenever there is a value change:

df.value.diff().ne(0)
#0      True
#1     False
#2      True
#3      True
#4     False
#5     False
#6      True
#7     False
#8     False
#9     False
#10     True
#11     True
#12     True
#13    False
#14    False
#15     True
#16    False
#17     True
#18    False
#19    False
#20    False
#21    False
#Name: value, dtype: bool

2) Then cumsum gives a non-descending sequence of ids, where each id denotes a consecutive chunk of identical values. Note that when summing boolean values, True is counted as one while False is counted as zero:

df.value.diff().ne(0).cumsum()
#0     1
#1     1
#2     2
#3     3
#4     3
#5     3
#6     4
#7     4
#8     4
#9     4
#10    5
#11    6
#12    7
#13    7
#14    7
#15    8
#16    8
#17    9
#18    9
#19    9
#20    9
#21    9
#Name: value, dtype: int64

3) Combined with the id column, you can group the data frame, calculate the group size, and get the flag column.
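
Since the question's data is not fully shown, here is a minimal hypothetical frame demonstrating the same flag logic end to end:

import pandas as pd

# hypothetical data: id 1 has a run of three 3s (flagged),
# id 2 has a run of four 7s (flagged); shorter runs get 0
df = pd.DataFrame({'id':    [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
                   'value': [3, 3, 3, 4, 4, 5, 7, 7, 7, 7]})

df['flag'] = (df.value.groupby([df.id, df.value.diff().ne(0).cumsum()])
                .transform('size').ge(3).astype(int))
print(df.flag.tolist())  # [1, 1, 1, 0, 0, 0, 1, 1, 1, 1]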

pandas groupby and find max no. of consecutive occurrences of 1s in dataframe

You could create a mask with a unique value for each consecutive group of numbers (ne/!= + cumsum), then group by that and the Id, sum the values within each run, and take the max:

(df.groupby([df['Id'],
             df['values'].ne(df.groupby('Id')['values'].shift(1)).cumsum()])
   ['values'].sum()
   .groupby(level=0).max()
   .reset_index())

Output:

>>> df
   Id  values
0   1     3.0
1   2     6.0
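
For reference, here is a hypothetical input consistent with the output above (Id 1 with a longest run of three 1s, Id 2 with a run of six); the data itself is an assumption:

import pandas as pd

df = pd.DataFrame({
    'Id':     [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
    'values': [1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
})

out = (df.groupby([df['Id'],
                   df['values'].ne(df.groupby('Id')['values'].shift(1)).cumsum()])
         ['values'].sum()
         .groupby(level=0).max()
         .reset_index())
print(out)
#    Id  values
# 0   1       3
# 1   2       6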

