Create Grouping Variable For Consecutive Sequences and Split Vector

Create a list of vectors of consecutive values from a vector

One option is to split on a grouping variable built by checking the difference between adjacent elements:

vec <- c(1, 2, 5, 7, 8, 9, 11, 12, 13, 15)  # input vector inferred from the output below
split(vec, cumsum(c(TRUE, diff(vec) != 1)))
#$`1`
#[1] 1 2

#$`2`
#[1] 5

#$`3`
#[1] 7 8 9

#$`4`
#[1] 11 12 13

#$`5`
#[1] 15
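
If this is needed repeatedly, the same idiom can be wrapped in a small helper. A minimal sketch (the name split_consecutive is invented here):

# start a new group whenever the gap to the previous element is not exactly 1
split_consecutive <- function(x) split(x, cumsum(c(TRUE, diff(x) != 1)))

split_consecutive(c(1, 2, 5, 7, 8, 9))
#$`1`
#[1] 1 2

#$`2`
#[1] 5

#$`3`
#[1] 7 8 9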

Finding the number of consecutive days in data
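
For a reproducible setup, assume df is a tibble with a single day column of class Date. The two runs below are taken from the output further down; the isolated dates 2022-01-01 and 2022-01-08 are made up here so that the filter(length > 1) step has something to drop:

library(dplyr)

df <- tibble::tibble(day = as.Date(c(
  "2022-01-01",                              # isolated day (assumed)
  "2022-01-03", "2022-01-04", "2022-01-05",  # run of 3 consecutive days
  "2022-01-08",                              # isolated day (assumed)
  "2022-01-10", "2022-01-11", "2022-01-12", "2022-01-13"  # run of 4 consecutive days
)))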

You could then do the following. The expression cumsum(c(0, diff(day) - 1)) stays constant while the days increase by exactly 1 and jumps whenever there is a gap, so it works as a grouping variable:

df %>%
  group_by(cumsum(c(0, diff(day) - 1))) %>%
  summarise(sequences = paste(first(day), last(day), sep = ' - '),
            length = n()) %>%
  filter(length > 1) %>%
  select(sequences, length)

#> # A tibble: 2 x 2
#> sequences length
#> <chr> <int>
#> 1 2022-01-03 - 2022-01-05 3
#> 2 2022-01-10 - 2022-01-13 4

Group data frame row by consecutive value in R

We could use diff on adjacent values of 'time' and check whether the difference is not equal to 1. Taking the cumulative sum (cumsum) of that logical vector turns it into a numeric group index, since the index increments by 1 at each TRUE value:

library(dplyr)
df1 <- tibble::tibble(time = c(1, 2, 3, 4, 5, 10, 11, 20, 30, 31, 32, 40))  # values from the output below
df1 %>%
  mutate(grp = cumsum(c(TRUE, diff(time) != 1)))

Output:

# A tibble: 12 x 2
time grp
<dbl> <int>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 10 2
7 11 2
8 20 3
9 30 4
10 31 4
11 32 4
12 40 5
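
With the grp column in place, each run can then be summarised, for example (a sketch reusing the df1 defined above):

df1 %>%
  mutate(grp = cumsum(c(TRUE, diff(time) != 1))) %>%
  group_by(grp) %>%
  summarise(start = first(time), end = last(time), length = n())
# grp 1: 1-5 (length 5), grp 2: 10-11 (2), grp 3: 20 (1), grp 4: 30-32 (3), grp 5: 40 (1)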

Group rows based on consecutive line numbers

Convert the numbers to numeric, calculate the difference between consecutive numbers, and increment the group count whenever the difference is greater than 1:

df <- data.frame(line = c("0001", "0002", "0003", "0011", "0012", "0234", "0235", "0236"))
transform(df, group = cumsum(c(TRUE, diff(as.numeric(line)) > 1)))

# line group
#1 0001 1
#2 0002 1
#3 0003 1
#4 0011 2
#5 0012 2
#6 0234 3
#7 0235 3
#8 0236 3

If you want to use dplyr:

library(dplyr)
df %>% mutate(group = cumsum(c(TRUE, diff(as.numeric(line)) > 1)))
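
If you want the rows of each run as separate data frames rather than a group column, the same grouping variable can be passed to split(), as in the first section (a sketch using the df above):

# returns a named list with one data frame per run of consecutive line numbers
split(df, cumsum(c(TRUE, diff(as.numeric(df$line)) > 1)))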

Split a vector by its sequences

x <- c(7, 1, 2, 3, 4, 6, 7, 9)  # input vector inferred from the output below
split(x, cumsum(c(TRUE, diff(x) != 1)))
#$`1`
#[1] 7
#
#$`2`
#[1] 1 2 3 4
#
#$`3`
#[1] 6 7
#
#$`4`
#[1] 9

Create ID for specific sequence of consecutive days based on grouping variable in R
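
The code below assumes mydata has one row per country and event_date. Reconstructed from the non-aggregated output at the end of this section, it looks like this:

mydata <- tibble::tibble(
  country = factor(c(rep("Angola", 5), rep("Benin", 6))),
  event_date = as.Date(c(
    "2017-06-16", "2017-06-17", "2017-06-18", "2017-08-22", "2017-08-23",
    "2019-04-18", "2019-04-19", "2019-04-20",
    "2018-03-15", "2018-03-16", "2016-03-17"
  ))
)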

Try:

library(dplyr)

mydata %>%
  group_by(country) %>%
  distinct(seq.ID = cumsum(event_date != lag(event_date, default = first(event_date)) + 1L))

Output:

# A tibble: 5 x 2
# Groups: country [2]
seq.ID country
<int> <fct>
1 1 Angola
2 2 Angola
3 1 Benin
4 2 Benin
5 3 Benin

You can also use the .keep_all argument in distinct and preserve the first date of each sequence:

mydata %>%
  group_by(country) %>%
  distinct(seq.ID = cumsum(event_date != lag(event_date, default = first(event_date)) + 1L),
           .keep_all = TRUE)

# A tibble: 5 x 3
# Groups: country [2]
country event_date seq.ID
<fct> <date> <int>
1 Angola 2017-06-16 1
2 Angola 2017-08-22 2
3 Benin 2019-04-18 1
4 Benin 2018-03-15 2
5 Benin 2016-03-17 3

If you instead want the non-aggregated output, with sequence IDs that keep increasing across countries, you could do:

mydata %>%
  mutate(
    seq.ID = cumsum(
      (event_date != lag(event_date, default = first(event_date)) + 1L) |
        country != lag(country, default = first(country))
    )
  )

country event_date seq.ID
1 Angola 2017-06-16 1
2 Angola 2017-06-17 1
3 Angola 2017-06-18 1
4 Angola 2017-08-22 2
5 Angola 2017-08-23 2
6 Benin 2019-04-18 3
7 Benin 2019-04-19 3
8 Benin 2019-04-20 3
9 Benin 2018-03-15 4
10 Benin 2018-03-16 4
11 Benin 2016-03-17 5

Note that there is a typo in your last event_date, which is why the outputs don't correspond 100% to your desired output.


