Group Data in R for Consecutive Rows

Group rows based on consecutive line numbers

Convert the numbers to numeric, calculate the difference between consecutive numbers, and increment the group count whenever the difference is greater than 1.

transform(df, group = cumsum(c(TRUE, diff(as.numeric(line)) > 1)))

#  line group
#1 0001     1
#2 0002     1
#3 0003     1
#4 0011     2
#5 0012     2
#6 0234     3
#7 0235     3
#8 0236     3

If you want to use dplyr:

library(dplyr)
df %>% mutate(group = cumsum(c(TRUE, diff(as.numeric(line)) > 1)))
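
For a reproducible sketch, the input can be reconstructed from the output shown above; the line column is kept as character so the leading zeros survive:

df <- data.frame(line = c("0001", "0002", "0003", "0011",
                          "0012", "0234", "0235", "0236"))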

Group data frame row by consecutive value in R

We could use diff on the adjacent values of 'time' and check whether the difference is not equal to 1, then convert the logical vector to a numeric group index by taking its cumulative sum (cumsum), so that the index increments by 1 at each TRUE value.

library(dplyr)
df1 %>%
  mutate(grp = cumsum(c(TRUE, diff(time) != 1)))

Output:

# A tibble: 12 x 2
    time   grp
   <dbl> <int>
 1     1     1
 2     2     1
 3     3     1
 4     4     1
 5     5     1
 6    10     2
 7    11     2
 8    20     3
 9    30     4
10    31     4
11    32     4
12    40     5
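
For completeness, the input tibble can be reconstructed from the output above:

library(tibble)
df1 <- tibble(time = c(1, 2, 3, 4, 5, 10, 11, 20, 30, 31, 32, 40))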

Group Data in R for consecutive rows

In dplyr, I would do this by creating another grouping variable for the consecutive rows. This is what cumsum(c(1, diff(weight) != 0)) is doing in the code chunk below.

The group creation can be done within group_by, and you can then make any summaries by group.

library(dplyr)

df_in %>%
  group_by(ID, group_weight = cumsum(c(1, diff(weight) != 0)), weight) %>%
  summarise(start_day = min(start_day), end_day = max(end_day))

Source: local data frame [5 x 5]
Groups: ID, group_weight [?]

     ID group_weight weight start_day end_day
  (dbl)        (dbl)  (dbl)     (dbl)   (dbl)
1     1            1    150         1       7
2     1            2    151         7      10
3     1            3    150        10      30
4     2            4    170         5      20
5     2            5    171        20      30

This approach does leave the extra grouping variable in the dataset; if needed, it can be removed with select(-group_weight) after ungrouping.
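
The question's full df_in is not reproduced here, but a hypothetical input consistent with the summary above would look like this (the intermediate day ranges are made up; only the weights and the overall day ranges are visible in the output):

# hypothetical df_in, for illustration only
df_in <- data.frame(
  ID        = c(1, 1, 1, 1, 1, 2, 2),
  weight    = c(150, 150, 151, 150, 150, 170, 171),
  start_day = c(1, 4, 7, 10, 20, 5, 20),
  end_day   = c(4, 7, 10, 20, 30, 20, 30)
)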

How to group consecutive rows having same event and find average?

We can create a grouping variable with rleid from data.table, use that to get the mean of 'pt', and also return the first value of 'Event'.

library(dplyr)
library(data.table)
group %>%
  group_by(grp = rleid(Event)) %>%
  summarise(Event = first(Event), Value = mean(pt)) %>%
  select(-grp)
# A tibble: 4 x 2
#  Event Value
#  <dbl> <dbl>
#1     1   2.5
#2     2   4
#3     1  12.5
#4     2   4

Or using tapply/rle in base R

with(group, tapply(pt, with(rle(Event),
     rep(seq_along(values), lengths)), FUN = mean))
#    1    2    3    4
#  2.5  4.0 12.5  4.0
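
Both versions assume data along these lines; the values here are hypothetical, chosen only to reproduce the printed means:

# hypothetical input consistent with the output above
group <- data.frame(
  Event = c(1, 1, 2, 1, 1, 2),
  pt    = c(2, 3, 4, 12, 13, 4)
)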

How to group by consecutive rows in a R dataframe?

With R, a dplyr implementation would be to take the cumulative sum of the logical comparison between 'pv_type' and the lag of 'pv_type' as a grouping column, and then get the min and max of 'price' as two new columns.

library(dplyr)
segmentation %>%
  group_by(pv_type_group = cumsum(pv_type != lag(pv_type,
                                  default = first(pv_type)))) %>%
  mutate(min_v = min(price), max_p = max(price))

Update

With the OP's example, the expected output is summarised, so we use summarise instead of mutate. We also use rleid (from data.table) instead of the logical cumulative sum.

library(data.table)
segmentation %>%
  group_by(grp = rleid(types)) %>%
  summarise(types = first(types), expectedvalues = min(values)) %>%
  ungroup %>%
  select(-grp)
# A tibble: 4 x 2
#  types  expectedvalues
#  <fct>           <dbl>
#1 peak              1
#2 valley            0.4
#3 peak              1.2
#4 valley            0.1
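
A hypothetical segmentation consistent with this output (only the per-group minima are visible in the result, so the remaining values are assumptions):

segmentation <- data.frame(
  types  = factor(c("peak", "peak", "valley", "valley", "peak", "valley")),
  values = c(1, 1.5, 0.4, 0.6, 1.2, 0.1)
)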

Select random consecutive rows by group as a proportion of group length

Maybe something like:

library(data.table)

df[df[ , {
  # take 10% of the group's rows, rounded up
  k = ceiling(0.1 * .N)
  # random start index, then the k consecutive row numbers from it
  sample(head(.I, -k), 1L) + (0L:(k - 1L))
}, cell]$V1]

The idea is to pick a random starting index from the group's index vector, but that start must be at least k positions away from the end of the vector so that the k consecutive rows beginning there still fall inside the group. To do this we use head(.I, -k).

head(.I, -k) removes the last k indices. sample(..., 1L) randomly picks one of the remaining indices, and since we need k elements, we take the picked element and the subsequent k - 1 elements.
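
A runnable sketch with made-up data (the column names cell and value and the group sizes are assumptions; rows within each cell are contiguous, which this approach requires):

library(data.table)

set.seed(42)
df <- data.table(cell = rep(c("A", "B"), each = 50), value = rnorm(100))

# one random run of ceiling(0.1 * .N) consecutive rows per cell
df[df[ , {
  k = ceiling(0.1 * .N)
  sample(head(.I, -k), 1L) + (0L:(k - 1L))
}, cell]$V1]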

Assign unique id to consecutive rows within a grouping variable in dplyr

We can use gl, which generates regular patterns of factor levels; converting the result to integer gives an id that increments after every two consecutive rows within each group.

library(dplyr)
df <- df %>%
  group_by(group) %>%
  mutate(id = as.integer(gl(n(), 2, n()))) %>%
  ungroup
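
To see the pattern this produces, here is gl for a group of five rows; each id covers two consecutive rows, with the last id possibly covering a single leftover row:

as.integer(gl(5, 2, 5))
# [1] 1 1 2 2 3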

R - build unique groups based on consecutive rows and factor level

We can do a group by on 'letter' and the run-length id (rleid from data.table) of 'letter', summarise to get the mean of 'time', create the sequence column with row_number(), and remove the 'grp' column with select.

library(dplyr)
library(data.table)
test %>%
  group_by(letter, grp = rleid(letter)) %>%
  summarise(mean_time = mean(time)) %>%
  mutate(id = row_number()) %>%
  ungroup %>%
  select(-grp)
# A tibble: 4 x 3
#  letter mean_time    id
#  <fct>      <dbl> <int>
#1 a              2     1
#2 a              6     2
#3 b              4     1
#4 b              9     2
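
A hypothetical test consistent with this output (the individual time values are assumptions; only their group means are visible in the result):

test <- data.frame(
  letter = factor(c("a", "a", "b", "b", "a", "a", "b", "b")),
  time   = c(1, 3, 3, 5, 5, 7, 8, 10)
)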

Group rows in data frame based on time difference between consecutive rows

Here is another possibility, which groups rows where the time difference between consecutive rows is no more than 4 days.
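
The input, reconstructed from the output printed further below:

df <- data.frame(
  YEAR  = 1860,
  MONTH = c(rep(10, 8), rep(12, 6)),
  DAY   = c(3, 3, 3, 5, 5, 5, 6, 6, 5, 5, 5, 6, 6, 6),
  HOUR  = c(13, 17, 21, 5, 13, 17, 1, 5, 9, 18, 22, 6, 10, 18),
  LON   = c(-19.5, -19.5, -19.5, -20.5, -21.5, -21.5, -22.5, -22.5,
            -22.5, -23.5, -23.5, -24.5, -24.5, -24.5),
  LAT   = c(3, 4, 5, 6, 7, 8, 9, 10, -7, -8, -9, -10, -11, -12)
)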

# create date variable
df$date <- with(df, as.Date(paste(YEAR, MONTH, DAY, sep = "-")))

# calculate successive differences between dates
# and flag gaps larger than 4 days
df$gap <- c(0, diff(df$date) > 4)

# cumulative sum of 'gap' variable
df$group <- cumsum(df$gap) + 1

df
#    YEAR MONTH DAY HOUR   LON LAT       date gap group
# 1  1860    10   3   13 -19.5   3 1860-10-03   0     1
# 2  1860    10   3   17 -19.5   4 1860-10-03   0     1
# 3  1860    10   3   21 -19.5   5 1860-10-03   0     1
# 4  1860    10   5    5 -20.5   6 1860-10-05   0     1
# 5  1860    10   5   13 -21.5   7 1860-10-05   0     1
# 6  1860    10   5   17 -21.5   8 1860-10-05   0     1
# 7  1860    10   6    1 -22.5   9 1860-10-06   0     1
# 8  1860    10   6    5 -22.5  10 1860-10-06   0     1
# 9  1860    12   5    9 -22.5  -7 1860-12-05   1     2
# 10 1860    12   5   18 -23.5  -8 1860-12-05   0     2
# 11 1860    12   5   22 -23.5  -9 1860-12-05   0     2
# 12 1860    12   6    6 -24.5 -10 1860-12-06   0     2
# 13 1860    12   6   10 -24.5 -11 1860-12-06   0     2
# 14 1860    12   6   18 -24.5 -12 1860-12-06   0     2

Disclaimer: the diff & cumsum part is inspired by this Q&A: How to partition a vector into groups of regular, consecutive sequences?.


