Group Rows in Data Frame Based on Time Difference Between Consecutive Rows

Group rows in data frame based on time difference between consecutive rows

Here is another possibility, which puts consecutive rows in the same group as long as the time difference between them is no more than 4 days; a gap of more than 4 days starts a new group.

# create date variable
df$date <- with(df, as.Date(paste(YEAR, MONTH, DAY, sep = "-")))

# calculate successive differences between dates
# and flag gaps larger than 4 days
df$gap <- c(0, diff(df$date) > 4)

# cumulative sum of 'gap' variable
df$group <- cumsum(df$gap) + 1

df
#    YEAR MONTH DAY HOUR   LON LAT       date gap group
# 1  1860    10   3   13 -19.5   3 1860-10-03   0     1
# 2  1860    10   3   17 -19.5   4 1860-10-03   0     1
# 3  1860    10   3   21 -19.5   5 1860-10-03   0     1
# 4  1860    10   5    5 -20.5   6 1860-10-05   0     1
# 5  1860    10   5   13 -21.5   7 1860-10-05   0     1
# 6  1860    10   5   17 -21.5   8 1860-10-05   0     1
# 7  1860    10   6    1 -22.5   9 1860-10-06   0     1
# 8  1860    10   6    5 -22.5  10 1860-10-06   0     1
# 9  1860    12   5    9 -22.5  -7 1860-12-05   1     2
# 10 1860    12   5   18 -23.5  -8 1860-12-05   0     2
# 11 1860    12   5   22 -23.5  -9 1860-12-05   0     2
# 12 1860    12   6    6 -24.5 -10 1860-12-06   0     2
# 13 1860    12   6   10 -24.5 -11 1860-12-06   0     2
# 14 1860    12   6   18 -24.5 -12 1860-12-06   0     2

Disclaimer: the diff & cumsum part is inspired by this Q&A: How to partition a vector into groups of regular, consecutive sequences?.

group rows in a pandas data frame when the difference between consecutive rows is less than a value

You can do a named aggregation on groupby:

(df.groupby(df.col1.diff().ge(3).cumsum(), as_index=False)
   .agg(col1=('col1', 'first'),
        col2=('col2', 'sum'),
        col3=('col3', 'sum'),
        col4=('col1', 'last'))
)

Output:

   col1  col2  col3  col4
0     1     7    10     4
1     7     8    15     9
2    15     1    12    15

Update: without named aggregation you can do something like this:

groups = df.groupby(df.col1.diff().ge(3).cumsum())
new_df = groups.agg({'col1':'first', 'col2':'sum','col3':'sum'})
new_df['col4'] = groups['col1'].last()
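
The question's input frame isn't reproduced on this page. As a rough illustration, one hypothetical input that is consistent with the output shown above could look like this (column names taken from the question, values assumed):

import pandas as pd

# hypothetical input; any frame whose col1 jumps by 3 or more at the group
# boundaries behaves the same way
df = pd.DataFrame({
    'col1': [1, 2, 4, 7, 8, 9, 15],
    'col2': [2, 2, 3, 3, 3, 2, 1],
    'col3': [3, 3, 4, 5, 5, 5, 12],
})

# the grouping key: a new group starts whenever col1 increases by 3 or more
print(df.col1.diff().ge(3).cumsum().tolist())  # [0, 0, 0, 1, 1, 1, 2]

Running either aggregation above on this frame should reproduce the output shown.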

calculate time differences between consecutive rows using pandas?

Try this example:

import pandas as pd
import io

s = io.StringIO('''
dates,nums
2017-02-01T00:00:01,1
2017-02-01T00:00:01,2
2017-02-01T00:00:06,3
2017-02-01T00:00:07,4
2017-02-01T00:00:10,5
''')

df = pd.read_csv(s)

Currently the frame looks like this (nums carries no meaning; it is just there as a secondary column):

                 dates  nums
0  2017-02-01T00:00:01     1
1  2017-02-01T00:00:01     2
2  2017-02-01T00:00:06     3
3  2017-02-01T00:00:07     4
4  2017-02-01T00:00:10     5

Carrying on:

# format as datetime
df['dates'] = pd.to_datetime(df['dates'])

# shift the dates up and into a new column
df['dates_shift'] = df['dates'].shift(-1)

# work out the diff
df['time_diff'] = (df['dates_shift'] - df['dates']) / pd.Timedelta(seconds=1)

# remove the temp column
del df['dates_shift']

# see what you've got
print(df)

                dates  nums  time_diff
0 2017-02-01 00:00:01     1        0.0
1 2017-02-01 00:00:01     2        5.0
2 2017-02-01 00:00:06     3        1.0
3 2017-02-01 00:00:07     4        3.0
4 2017-02-01 00:00:10     5        NaN

To get the absolute values, change this line above:

df['time_diff'] = (df['dates_shift'] - df['dates']) / pd.Timedelta(seconds=1)

To:

df['time_diff'] = (df['dates_shift'] - df['dates']).abs() / pd.Timedelta(seconds=1)
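
If what you want is instead the gap from the previous row, a shorter variant (an assumption on my part, not part of the original answer; the column name time_diff_prev is just illustrative) is Series.diff on the datetime column built above:

# seconds since the previous row; the first row becomes NaN
df['time_diff_prev'] = df['dates'].diff().dt.total_seconds()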

computing datetime difference among consecutive rows in groupby DataFrame

Use GroupBy.diff and divide by an approximate one-month timedelta (30.44 days):

df['months'] = df.groupby('name')['date'].diff().div(pd.Timedelta(days=30.44), fill_value=0).round().astype(int)

output

    name       date  months
0   Mark 2018-01-01       0
1   Anne 2018-01-01       0
2   Anne 2018-02-01       1
3   Anne 2018-04-01       2
4   Anne 2018-09-01       5
5   Anne 2019-01-01       4
6   John 2018-02-01       0
7   John 2018-06-01       4
8   John 2019-02-01       8
9  Ethan 2018-03-01       0
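
The input frame isn't shown in the answer; reconstructing it from the name and date columns of the output above, a minimal runnable version (using fillna after the division instead of the fill_value argument, which gives the same result) would be:

import pandas as pd

# input reconstructed from the output above
df = pd.DataFrame({
    'name': ['Mark', 'Anne', 'Anne', 'Anne', 'Anne', 'Anne',
             'John', 'John', 'John', 'Ethan'],
    'date': pd.to_datetime(['2018-01-01', '2018-01-01', '2018-02-01', '2018-04-01',
                            '2018-09-01', '2019-01-01', '2018-02-01', '2018-06-01',
                            '2019-02-01', '2018-03-01']),
})

# same idea as the one-liner above: per-name gap to the previous date,
# expressed in average months of 30.44 days
df['months'] = (df.groupby('name')['date'].diff()
                  .div(pd.Timedelta(days=30.44))
                  .fillna(0).round().astype(int))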

Calculate difference between values in consecutive rows by group

The package data.table can do this fairly quickly, using the shift function.

require(data.table)
df <- data.table(group = rep(c(1, 2), each = 3), value = c(10,20,25,5,10,15))
#setDT(df) #if df is already a data frame

df[ , diff := value - shift(value), by = group]
#   group value diff
#1:     1    10   NA
#2:     1    20   10
#3:     1    25    5
#4:     2     5   NA
#5:     2    10    5
#6:     2    15    5
setDF(df) #if you want to convert back to old data.frame syntax

Or using the lag function in dplyr

library(dplyr)

df %>%
  group_by(group) %>%
  mutate(Diff = value - lag(value))
#   group value  Diff
#   <int> <int> <int>
# 1     1    10    NA
# 2     1    20    10
# 3     1    25     5
# 4     2     5    NA
# 5     2    10     5
# 6     2    15     5

For alternatives pre-data.table::shift and pre-dplyr::lag, see edits.

Group rows based on consecutive line numbers

Convert the numbers to numeric, calculate the difference between consecutive numbers, and increment the group count when the difference is greater than 1.

transform(df, group = cumsum(c(TRUE, diff(as.numeric(line)) > 1)))

#  line group
#1 0001     1
#2 0002     1
#3 0003     1
#4 0011     2
#5 0012     2
#6 0234     3
#7 0235     3
#8 0236     3

If you want to use dplyr:

library(dplyr)
df %>% mutate(group = cumsum(c(TRUE, diff(as.numeric(line)) > 1)))

spread then subtract consecutive rows in pandas data frame

Here's one way: use groupby + cumcount to create pairing groups, then use those groups with groupby + first so that each event pair shares the time of its first occurrence. Then pivot.

Finally, take the difference between "dark" and "light" and assign it to a new "diff" column:

out = (df.assign(time=df.groupby(df.groupby('event').cumcount())['time'].transform('first'))
         .pivot(index='time', columns='event', values='score')
         .reset_index().rename_axis([None], axis=1)
         .assign(diff=lambda x: x['dark'] - x['light']))

Output:

                       time        dark       light      diff
0 2022-03-07 06:45:00+00:00   79.857667   80.066667 -0.209000
1 2022-03-30 06:25:00+00:00  105.324000  107.060833 -1.736833
2 2022-03-30 13:40:00+00:00  105.239750  106.863143 -1.623393
3 2022-04-01 06:25:00+00:00         NaN  101.271867       NaN
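
The question's data isn't included here. A toy frame with the assumed time/event/score layout shows the mechanics (the values below are made up and don't match the output above):

import pandas as pd

# toy data with the assumed column layout; values are illustrative only
df = pd.DataFrame({
    'time': pd.to_datetime(['2022-03-07 06:45', '2022-03-07 06:50',
                            '2022-03-30 06:25', '2022-03-30 06:30'], utc=True),
    'event': ['light', 'dark', 'light', 'dark'],
    'score': [80.0, 79.0, 107.0, 105.0],
})

# same pipeline as above: pair the n-th 'light' with the n-th 'dark',
# give the pair the first time, pivot, and take dark minus light
out = (df.assign(time=df.groupby(df.groupby('event').cumcount())['time'].transform('first'))
         .pivot(index='time', columns='event', values='score')
         .reset_index().rename_axis([None], axis=1)
         .assign(diff=lambda x: x['dark'] - x['light']))
print(out)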

pandas groupby dataframes, calculate diffs between consecutive rows

You can use groupby:

s2 = df.groupby(['cycleID'])['mean'].diff()
s2.dropna(inplace=True)

output

1   -8.453876e-12
3   -1.486037e-11
5    2.482933e-12
7   -3.388330e-12
8    3.000000e-12


UPDATE

d = [[1, 1.5020712104685252e-11],
     [1, 6.56683605063102e-12],
     [2, 1.3993315187144084e-11],
     [2, -8.670502467042485e-13],
     [3, 7.0270625256163566e-12],
     [3, 9.509995221868016e-12],
     [4, 1.2901435995915644e-11],
     [4, 9.513106448422182e-12]]

df = pd.DataFrame(d, columns=['cycleID', 'mean'])

df2 = df.groupby(['cycleID']).diff().dropna().rename(columns={'mean': 'difference'})
df2['mean'] = df['mean'].iloc[df2.index]

     difference          mean
1 -8.453876e-12  6.566836e-12
3 -1.486037e-11 -8.670502e-13
5  2.482933e-12  9.509995e-12
7 -3.388330e-12  9.513106e-12

difference between rows within groups pandas

See if this helps:

# replace any value that can't be parsed as a number with 0
df['col2'] = pd.to_numeric(df.col2, errors='coerce').fillna(0)

# sort both columns ascending, then take the within-group difference of col2
df['diff'] = df.sort_values(['col1', 'col2'], ascending=[1, 1]).groupby('col1')['col2'].diff()

# display the dataframe with col1 ascending and col2 descending
df.sort_values(['col1', 'col2'], ascending=[1, 0])

Out:

(The resulting data frame appears as an image in the original answer.)
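
Since that image isn't reproduced here, a small made-up example (column names from the question, values assumed) shows what the steps produce:

import pandas as pd

# made-up data; not the question's actual frame
df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b'],
                   'col2': [1, 3, 'x', 2, 5]})

# non-numeric values become 0
df['col2'] = pd.to_numeric(df['col2'], errors='coerce').fillna(0)

# within each col1 group, difference between consecutive col2 values in ascending order
df['diff'] = (df.sort_values(['col1', 'col2'], ascending=[True, True])
                .groupby('col1')['col2'].diff())

print(df.sort_values(['col1', 'col2'], ascending=[True, False]))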

How to calculate differences between consecutive rows in pandas data frame?

diff should give the desired result:

>>> df.diff()
            count_a  count_b
2015-01-01      NaN      NaN
2015-01-02    38465      NaN
2015-01-03    36714      NaN
2015-01-04    35137      NaN
2015-01-05    35864      NaN
...
2015-02-07   142390    25552
2015-02-08   126768    22835
2015-02-09   122324    21485
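
For a self-contained illustration of the same call on a date-indexed frame (toy numbers, not the question's data):

import pandas as pd

# toy counts indexed by date
df = pd.DataFrame({'count_a': [100, 150, 175], 'count_b': [10, 30, 25]},
                  index=pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03']))

# each row minus the previous row; the first row has nothing to compare against
print(df.diff())
#             count_a  count_b
# 2015-01-01      NaN      NaN
# 2015-01-02     50.0     20.0
# 2015-01-03     25.0     -5.0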

