
# Create Grouping Variable For Consecutive Sequences and Split Vector

## Create a list of vectors of consecutive values from a vector

One option is to `split` the vector on a grouping variable built by checking whether the difference (`diff`) between adjacent elements is not 1:

```r
split(vec, cumsum(c(TRUE, diff(vec) != 1)))
#$`1`
#[1] 1 2
#$`2`
#[1] 5
#$`3`
#[1] 7 8 9
#$`4`
#[1] 11 12 13
#$`5`
#[1] 15
```
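To make the one-liner easier to follow, here is a minimal sketch that builds the grouping variable step by step. The input `vec` is not shown in the answer, so the values below are an assumption reconstructed from the printed result.

```r
# Input reconstructed from the printed result above (an assumption).
vec <- c(1, 2, 5, 7, 8, 9, 11, 12, 13, 15)

# Step 1: gaps between neighbours; a gap of exactly 1 means "same run".
gaps <- diff(vec)

# Step 2: TRUE marks the first element of each new run.
# The leading TRUE opens the first run.
starts <- c(TRUE, gaps != 1)

# Step 3: the running total of run starts is the group id.
grp <- cumsum(starts)

runs <- split(vec, grp)
lengths(runs)  # number of elements in each run
```

Keeping `grp` as a separate vector can be handy when the same grouping is needed more than once, for example to compute run lengths as well as the runs themselves.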

## Finding the number of consecutive days in data

You could do:

```r
library(dplyr)

df %>%
  group_by(cumsum(c(0, diff(day) - 1))) %>%
  summarise(sequences = paste(first(day), last(day), sep = ' - '),
            length    = n()) %>%
  filter(length > 1) %>%
  select(sequences, length)
#> # A tibble: 2 x 2
#>   sequences               length
#>   <chr>                    <int>
#> 1 2022-01-03 - 2022-01-05      3
#> 2 2022-01-10 - 2022-01-13      4
```
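The pipeline above appears to rely on `day` being a Date column sorted in ascending order: within a run `diff(day) - 1` is 0, so the cumulative sum only changes at a gap. The group ids it produces are distinct per run but not necessarily 1, 2, 3. The following base R sketch does the same summary with the `!= 1` form of the grouping variable. The original `df` is not shown, so the dates are hypothetical and were chosen to reproduce the printed result.

```r
# Hypothetical input: the original `df` is not shown, so these dates are
# an assumption chosen to match the printed result.
df <- data.frame(day = as.Date(c(
  "2022-01-01", "2022-01-03", "2022-01-04", "2022-01-05",
  "2022-01-07", "2022-01-10", "2022-01-11", "2022-01-12", "2022-01-13"
)))

# A new run starts whenever the gap to the previous day is not exactly 1.
run_id <- cumsum(c(TRUE, diff(as.numeric(df$day)) != 1))

# Summarise each run as "first - last" plus its length in days.
runs <- do.call(rbind, lapply(split(df$day, run_id), function(d) {
  data.frame(sequences = paste(min(d), max(d), sep = " - "),
             length    = length(d),
             stringsAsFactors = FALSE)
}))

# Keep only runs longer than a single day.
runs <- runs[runs$length > 1, ]
rownames(runs) <- NULL
runs
```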

## Group data frame row by consecutive value in R

We could use `diff` on the adjacent values of `time`, check whether each difference is not equal to 1, and then convert the resulting logical vector to a numeric group index with the cumulative sum (`cumsum`), which increments by 1 at each TRUE value.

```r
library(dplyr)

df1 %>%
  mutate(grp = cumsum(c(TRUE, diff(time) != 1)))
```

Output:

```
# A tibble: 12 x 2
    time   grp
   <dbl> <int>
 1     1     1
 2     2     1
 3     3     1
 4     4     1
 5     5     1
 6    10     2
 7    11     2
 8    20     3
 9    30     4
10    31     4
11    32     4
12    40     5
```
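If the same grouping is needed in several places, the idiom could be wrapped in a small helper. The sketch below is only an illustration: `run_group` and its `step` argument are hypothetical names, not part of base R or dplyr, and the `time` values are reconstructed from the output above.

```r
# Hypothetical helper (not part of base R or dplyr): labels runs in which
# each value exceeds the previous one by exactly `step`.
run_group <- function(x, step = 1) {
  # Guard: c(TRUE, diff(x)) would have length 1 for an empty input.
  if (length(x) == 0) return(integer(0))
  cumsum(c(TRUE, diff(x) != step))
}

# `time` values reconstructed from the output above (an assumption).
df1 <- data.frame(time = c(1, 2, 3, 4, 5, 10, 11, 20, 30, 31, 32, 40))
df1$grp <- run_group(df1$time)
df1$grp

# With step = 10, the values 10, 20, 30 form one run instead.
run_group(c(10, 20, 30, 31), step = 10)
```

The empty-input guard matters because `c(TRUE, diff(x))` always has at least one element, which would otherwise give a group vector longer than the input.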

## Group rows based on consecutive line numbers

Convert the `line` values to numeric, calculate the difference between consecutive values, and increment the group count whenever the difference is greater than 1.

```r
transform(df, group = cumsum(c(TRUE, diff(as.numeric(line)) > 1)))
#  line group
#1 0001     1
#2 0002     1
#3 0003     1
#4 0011     2
#5 0012     2
#6 0234     3
#7 0235     3
#8 0236     3
```

If you want to use `dplyr`:

```r
library(dplyr)

df %>% mutate(group = cumsum(c(TRUE, diff(as.numeric(line)) > 1)))
```
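The conversion step is needed because the line numbers are zero-padded strings, and `diff` only works on numbers. Here is a minimal base R sketch, with the `line` values reconstructed from the output above (an assumption about the original `df`). Note that if `line` were a factor rather than a character column, `as.numeric(line)` would return the internal level codes, so `as.numeric(as.character(line))` would be the safer conversion.

```r
# `line` values reconstructed from the output above; stored as character
# so the leading zeros survive.
df <- data.frame(line = c("0001", "0002", "0003", "0011",
                          "0012", "0234", "0235", "0236"),
                 stringsAsFactors = FALSE)

# diff() needs numbers: as.numeric("0011") is 11.
line_num <- as.numeric(df$line)

# A jump of more than 1 between neighbours opens a new group.
df$group <- cumsum(c(TRUE, diff(line_num) > 1))

# The original strings are untouched; only the helper vector is numeric.
df
```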

## Split a vector by its sequences

```r
split(x, cumsum(c(TRUE, diff(x) != 1)))
#$`1`
#[1] 7
#
#$`2`
#[1] 1 2 3 4
#
#$`3`
#[1] 6 7
#
#$`4`
#[1] 9
```
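This example differs from the sorted cases in that the input also steps backwards (from 7 to 1). The sketch below contrasts the `!= 1` test with the `> 1` test used elsewhere on this page. The vector `x` is an assumption reconstructed from the printed result.

```r
# Input reconstructed from the printed result above (an assumption).
x <- c(7, 1, 2, 3, 4, 6, 7, 9)

# `!= 1` breaks on any step that is not exactly +1, including drops.
by_neq <- split(x, cumsum(c(TRUE, diff(x) != 1)))

# `> 1` only breaks on upward jumps, so the drop from 7 to 1 is missed.
by_gt <- split(x, cumsum(c(TRUE, diff(x) > 1)))

lengths(by_neq)  # four runs: 7 | 1 2 3 4 | 6 7 | 9
lengths(by_gt)   # three runs: 7 1 2 3 4 | 6 7 | 9
```

For sorted input without duplicates the two tests give the same groups. For unsorted input, `!= 1` is likely the one you want.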

## Create ID for specific sequence of consecutive days based on grouping variable in R

Try:

```r
library(dplyr)

mydata %>%
  group_by(country) %>%
  distinct(seq.ID = cumsum(event_date != lag(event_date, default = first(event_date)) + 1L))
```

Output:

```
# A tibble: 5 x 2
# Groups:   country [2]
  seq.ID country
   <int> <fct>  
1      1 Angola 
2      2 Angola 
3      1 Benin  
4      2 Benin  
5      3 Benin  
```

You can also use the `.keep_all` argument in `distinct` and preserve the first date of each sequence:

```r
mydata %>%
  group_by(country) %>%
  distinct(seq.ID = cumsum(event_date != lag(event_date, default = first(event_date)) + 1L),
           .keep_all = TRUE)
```

```
# A tibble: 5 x 3
# Groups:   country [2]
  country event_date seq.ID
  <fct>   <date>      <int>
1 Angola  2017-06-16      1
2 Angola  2017-08-22      2
3 Benin   2019-04-18      1
4 Benin   2018-03-15      2
5 Benin   2016-03-17      3
```

If you want non-aggregated output in which every sequence gets its own ID across countries, you could do:

```r
mydata %>%
  mutate(
    seq.ID = cumsum(
      (event_date != lag(event_date, default = first(event_date)) + 1L) |
        country != lag(country, default = first(country))
    )
  )
```

```
   country event_date seq.ID
1   Angola 2017-06-16      1
2   Angola 2017-06-17      1
3   Angola 2017-06-18      1
4   Angola 2017-08-22      2
5   Angola 2017-08-23      2
6    Benin 2019-04-18      3
7    Benin 2019-04-19      3
8    Benin 2019-04-20      3
9    Benin 2018-03-15      4
10   Benin 2018-03-16      4
11   Benin 2016-03-17      5
```

Note that there is a typo in your last `event_date`, which is why the outputs don't correspond exactly to your desired output.
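For a non-aggregated result in which the IDs restart within each country, a base R sketch with `ave` is one possible alternative to the dplyr pipelines. The data frame below is an assumption reconstructed from the printed non-aggregated output (including the last date as printed). `country` is kept as a character column here rather than a factor.

```r
# Data reconstructed from the non-aggregated output above (an assumption
# about the original `mydata`).
mydata <- data.frame(
  country = rep(c("Angola", "Benin"), times = c(5, 6)),
  event_date = as.Date(c(
    "2017-06-16", "2017-06-17", "2017-06-18", "2017-08-22", "2017-08-23",
    "2019-04-18", "2019-04-19", "2019-04-20", "2018-03-15", "2018-03-16",
    "2016-03-17"
  )),
  stringsAsFactors = FALSE
)

# ave() applies the function within each country and returns the results
# in the original row order. Dates are converted to day counts first so
# that diff() yields plain numbers.
mydata$seq.ID <- ave(
  as.numeric(mydata$event_date), mydata$country,
  FUN = function(d) cumsum(c(TRUE, diff(d) != 1))
)
mydata$seq.ID
```

Unlike the `mutate` version, which numbers sequences globally from 1 to 5, this numbers them 1, 2 for Angola and 1, 2, 3 for Benin, matching the grouped `distinct` results.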