dplyr: lead() and lag() wrong when used with group_by()

It seems you have to pass an additional argument to the lag and lead functions. When I run your code without arrange, but with order_by added, everything seems to be OK.

df %>%
  group_by(name) %>%
  mutate(next.score = lead(score, order_by = name),
         before.score = lag(score, order_by = name))

Output:

  name score next.score before.score
1   Al   100         60           NA
2  Jen    80        100           NA
3   Al    60         80          100
4  Jen   100         60           80
5   Al    80         NA           60
6  Jen    60         NA          100
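For comparison, here is a minimal reproducible sketch. The input data frame is an assumption, reconstructed from the output above. I would expect a recent dplyr release to respect the grouping without order_by, but that depends on your installed version, so treat it as a check rather than a guarantee.

```r
library(dplyr)

# Assumed input, reconstructed from the output shown above
df <- data.frame(
  name  = c("Al", "Jen", "Al", "Jen", "Al", "Jen"),
  score = c(100, 80, 60, 100, 80, 60)
)

# lead()/lag() are evaluated separately within each 'name' group
res <- df %>%
  group_by(name) %>%
  mutate(next.score   = lead(score),
         before.score = lag(score)) %>%
  ungroup()

print(res)
```

If the columns do not match the output above, falling back to the order_by version is the safer choice.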

My sessionInfo():

R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250        LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C                   LC_TIME=Polish_Poland.1250    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.4.1

loaded via a namespace (and not attached):
[1] assertthat_0.1  DBI_0.3.1       lazyeval_0.1.10 magrittr_1.5                parallel_3.1.1  Rcpp_0.11.5    
[7] tools_3.1.1

Dplyr Lags on Summarised Grouped Data

You want to first summarise to get the sum and the mean, then you can use a mutate statement to get the lag of each column, then rearrange the columns.

library(tidyverse)

test.df2 <- test.df1 %>%
  group_by(dateidx) %>%
  summarise(sumA = sum(A),
            meanB = mean(B)) %>%
  mutate(sumAlag = lag(sumA),
         meanBlag = lag(meanB)) %>% 
  select(dateidx, starts_with("sum"), starts_with("mean"))

Output

  dateidx     sumA sumAlag meanB meanBlag
  <date>     <dbl>   <dbl> <dbl>    <dbl>
1 2019-01-02   300      NA 1425       NA 
2 2019-01-03   100     300 2000     1425 
3 2019-01-07   700     100  342.    2000 
4 2019-01-10   535     700  200      342.

Is this a bug in group_by and lead/lag?

The first error seems to be version-specific. The second one can be avoided by using the first (or last) observation of 'count' as the default value, so the shifted vector never contains NA.

df %>%
   group_by(hour) %>%
   mutate(diff = count - lag(count, default = first(count)))
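A small sketch of what the default does, using made-up data (the values of hour and count below are hypothetical, only the column names come from the question). With default = first(count), the first row of each group is compared with itself, so its diff is 0 instead of NA.

```r
library(dplyr)

# Hypothetical data for illustration only
df <- data.frame(
  hour  = c(1, 1, 1, 2, 2),
  count = c(5, 8, 6, 10, 7)
)

res <- df %>%
  group_by(hour) %>%
  # the first row of each group gets count - first(count) = 0
  mutate(diff = count - lag(count, default = first(count))) %>%
  ungroup()

print(res)
```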

Computing lags but grouping by two categories with dplyr

Do you mean group_by(ID) and effectively "order by YEAR"?

MyData  %>%
  group_by(ID) %>%
  mutate(var3 = var1 - dplyr::lag(var2)) %>%
  print(n=99)
# # A tibble: 25 x 5
# # Groups:   ID [5]
#     YEAR    ID  var1    var2  var3
#    <int> <int> <dbl>   <dbl> <dbl>
#  1  2010     1 11.1   1.16   NA   
#  2  2011     1 13.5  -0.550  12.4 
#  3  2012     1 10.2   2.11   10.7 
#  4  2013     1  8.57  1.43    6.46
#  5  2014     1 12.6   1.89   11.2 
#  6  2010     2  8.87  1.87   NA   
#  7  2011     2  5.30  1.70    3.43
#  8  2012     2  6.81  0.956   5.11
#  9  2013     2 13.3  -0.0296 12.4 
# 10  2014     2  9.98 -1.27   10.0 
# 11  2010     3  8.62  0.258  NA   
# 12  2011     3 12.4   2.00   12.2 
# 13  2012     3 16.1   2.12   14.1 
# 14  2013     3  8.48  2.83    6.37
# 15  2014     3 10.6   0.190   7.80
# 16  2010     4 12.3   0.887  NA   
# 17  2011     4 10.9   1.07   10.0 
# 18  2012     4  7.99  1.09    6.92
# 19  2013     4 10.1   1.95    9.03
# 20  2014     4 11.1   1.82    9.17
# 21  2010     5 15.1   1.67   NA   
# 22  2011     5 10.4   0.492   8.76
# 23  2012     5 10.0   1.66    9.51
# 24  2013     5 10.6   0.567   8.91
# 25  2014     5  5.32 -0.881   4.76

(I am disregarding the summarize-into-mutate part of your code for now.)
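Note that lag() works on row order, so the code above relies on the rows already being sorted by YEAR within each ID. If that is not guaranteed, sorting first is a cheap safeguard. A minimal sketch with made-up, deliberately shuffled data (the values are hypothetical, only the column names match the question):

```r
library(dplyr)

# Hypothetical data, deliberately not sorted by YEAR
toy <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2),
  YEAR = c(2011, 2010, 2012, 2012, 2010, 2011),
  var1 = c(20, 10, 30, 60, 40, 50),
  var2 = c(2, 1, 3, 6, 4, 5)
)

res <- toy %>%
  arrange(ID, YEAR) %>%        # make "previous row" mean "previous year"
  group_by(ID) %>%
  mutate(var3 = var1 - dplyr::lag(var2)) %>%
  ungroup()

print(res)
```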

Lag / lead by group in R and dplyr

By default, lag offsets by n = 1. However, we have duplicate elements for 'Team' and 'Date'. In order to get the expected output, we need to get the distinct rows of 'Team' and 'Date', create a 'Date_Lagged' column with the lag of 'Date', and right_join (or left_join) the result with the original dataset.

distinct(df, Team, Date) %>%
        mutate(Date_Lagged = lag(Date)) %>%
        right_join(., df) %>%
        select(Team, Date, Points, Date_Lagged)
#   Team       Date Points Date_Lagged
#1     A 2016-05-10      1        <NA>
#2     A 2016-05-10      4        <NA>
#3     A 2016-05-10      3        <NA>
#4     A 2016-05-10      2        <NA>
#5     B 2016-05-12      1  2016-05-10
#6     B 2016-05-12      5  2016-05-10
#7     B 2016-05-12      6  2016-05-10
#8     C 2016-05-15      1  2016-05-12
#9     C 2016-05-15      2  2016-05-12
#10    D 2016-05-30      3  2016-05-15
#11    D 2016-05-30      9  2016-05-15

Or, assuming the rows are already sorted by 'Date', we can also do:

df %>% 
    mutate(Date_Lagged = rep(lag(unique(Date)), table(Date)))
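Here is the distinct-then-join idea as a self-contained sketch, using the left_join variant so the row order of the original data is kept. The data frame is a shortened, assumed version of the one in the question, and the name lagged_dates is only for illustration.

```r
library(dplyr)

# Shortened, assumed version of the question's data
df <- data.frame(
  Team   = c("A", "A", "B", "B", "B", "C"),
  Date   = as.Date(c("2016-05-10", "2016-05-10", "2016-05-12",
                     "2016-05-12", "2016-05-12", "2016-05-15")),
  Points = c(1, 4, 1, 5, 6, 2)
)

# One row per Team/Date, lagged once, then joined back onto every row
lagged_dates <- df %>%
  distinct(Team, Date) %>%
  mutate(Date_Lagged = lag(Date))

res <- left_join(df, lagged_dates, by = c("Team", "Date"))

print(res)
```

Joining on explicit keys avoids relying on how many rows each date has, which is what the rep() and table() shortcut depends on.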

Lagging variable by group does not work in dplyr

Here is an idea. We group by monthvec in order to get the number of rows (cnt) of each group. We ungroup and use the first value of cnt as the size of the lag. We regroup on monthvec and replace the values in each group with the first value of each group.

library(dplyr)

df %>% 
 group_by(monthvec) %>% 
 mutate(cnt = n()) %>% 
 ungroup() %>% 
 mutate(lag.growth = lag(growth, first(cnt))) %>% 
 group_by(monthvec) %>% 
 mutate(lag.growth = first(lag.growth)) %>% 
 select(-cnt)

which gives,

# A tibble: 13 x 3
# Groups:   monthvec [5]
   monthvec growth lag.growth
      <int>  <dbl>      <dbl>
 1        1    0.3         NA
 2        1    0.3         NA
 3        2    0.5        0.3
 4        2    0.5        0.3
 5        3    0.7        0.5
 6        3    0.7        0.5
 7        3    0.7        0.5
 8        4    0.1        0.7
 9        4    0.1        0.7
10        4    0.1        0.7
11        5    0.6        0.1
12        5    0.6        0.1
13        5    0.6        0.1
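A possible alternative, sketched under the assumption that growth is constant within each monthvec (as it is in the output above): lag once over the distinct months and join the result back. The input is reconstructed from that output, and per_month is just an illustrative name.

```r
library(dplyr)

# Assumed input, reconstructed from the output shown above
df <- data.frame(
  monthvec = rep(1:5, times = c(2, 2, 3, 3, 3)),
  growth   = rep(c(0.3, 0.5, 0.7, 0.1, 0.6), times = c(2, 2, 3, 3, 3))
)

# One row per month, so lag() moves exactly one month back
per_month <- df %>%
  distinct(monthvec, growth) %>%
  mutate(lag.growth = lag(growth)) %>%
  select(monthvec, lag.growth)

res <- left_join(df, per_month, by = "monthvec")

print(res)
```

This does not depend on the group sizes, whereas using first(cnt) as the lag size only lines up when the earlier groups have at least that many rows.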

dplyr how to lag by group

This looks like a bug; there might be some unintended masking of the lag function between the dplyr and stats packages. Try this workaround:

df %>% 
    group_by(team) %>% 
    # explicitly specify the source of the lag function here
    mutate(receive = dplyr::lag(order, n=unique(lead_time), default=0))

#Source: local data frame [10 x 4]
#Groups: team [2]

#     team order lead_time receive
#   <fctr> <dbl>     <dbl>   <dbl>
#1       a     2         3       0
#2       a     4         3       0
#3       a     3         3       0
#4       a     5         3       2
#5       a     6         3       4
#6       b     7         2       0
#7       b     8         2       0
#8       b     5         2       7
#9       b     4         2       8
#10      b     5         2       5
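To see why the namespace matters, here is a small sketch comparing the two functions on a plain vector, followed by a cut-down version of the grouped call. The toy data frame is hypothetical; only the column names follow the question.

```r
library(dplyr)

x <- c(2, 4, 3)

# stats::lag() only shifts the time base (a "tsp" attribute); values stay put
stats_version <- as.numeric(stats::lag(x))

# dplyr::lag() shifts the values themselves and pads with NA
dplyr_version <- dplyr::lag(x)

# Hypothetical data: a different lag size per team
toy <- data.frame(
  team      = c("a", "a", "a", "b", "b", "b"),
  order     = c(2, 4, 3, 7, 8, 5),
  lead_time = c(2, 2, 2, 1, 1, 1)
)

res <- toy %>%
  group_by(team) %>%
  # unique(lead_time) must be a single value within each group
  mutate(receive = dplyr::lag(order, n = unique(lead_time), default = 0)) %>%
  ungroup()

print(res)
```

If lead_time varied within a team, unique(lead_time) would return more than one value and the call would no longer be valid, so this only fits data shaped like the example.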

Error in dplyr when using nested group_by

If you want one observation for every unique value of dem_sect, then:

test <- df %>%
    group_by(sector) %>%
    mutate(
        md_area = mean(area),
        md_peso = mean(peso_kg),
        se_loc = sqrt(var(peso_kg))/sqrt(length(peso_kg)),
        cv_loc = sd(peso_kg)/mean(peso_kg)*100
    ) %>%
    ungroup() %>%
    group_by(dem_sect) %>%
    summarize(sum_sect = sum(se_loc * cv_loc))

Note that I changed SEloc and CVloc from your code to se_loc and cv_loc. I don't think the former exist, since the columns created in mutate are se_loc and cv_loc, and the original call also had one closing parenthesis too many.
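A runnable sketch of the same two-level pattern on made-up data. The values are hypothetical and only the column names come from the question; n() is used here in place of length(peso_kg), which gives the same group size.

```r
library(dplyr)

# Hypothetical data; only the column names come from the question
df <- data.frame(
  dem_sect = c("d1", "d1", "d1", "d1", "d2", "d2"),
  sector   = c("s1", "s1", "s2", "s2", "s3", "s3"),
  area     = c(1, 3, 2, 2, 4, 6),
  peso_kg  = c(10, 14, 20, 20, 5, 15)
)

test <- df %>%
  group_by(sector) %>%                       # inner level: per-sector statistics
  mutate(se_loc = sd(peso_kg) / sqrt(n()),
         cv_loc = sd(peso_kg) / mean(peso_kg) * 100) %>%
  group_by(dem_sect) %>%                     # outer level: one row per dem_sect
  summarize(sum_sect = sum(se_loc * cv_loc))

print(test)
```

Because mutate keeps every input row, each sector's product enters the sum once per row of that sector. If you want each sector counted once, you may need a distinct() on sector before the final summarize.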

R lead lag function summarize within group and calculate percent

Is this the output you're looking for?

df %>%
  arrange(tier_1, -sequence_number) %>%
  group_by(tier_1) %>%   # already grouped this way, only including for clarity
  mutate(cuml = cumsum(lag(count_of_sequence_numbers, default = 0)),
         diff = count_of_sequence_numbers - cuml) %>%
  ungroup()


## A tibble: 35 x 6
#   tier_1         sequence_number count_of_sequence_numbers  percent  cuml  diff
#   <chr>                    <int>                     <int>    <dbl> <dbl> <dbl>
# 1 Organic Search              10                         1 0.000542     0     1
# 2 Organic Search               9                         2 0.00108      1     1
# 3 Organic Search               8                         6 0.00325      3     3
# 4 Organic Search               7                         8 0.00434      9    -1
# 5 Organic Search               6                         5 0.00271     17   -12
# 6 Organic Search               5                        21 0.0114      22    -1
# 7 Organic Search               4                        41 0.0222      43    -2
# 8 Organic Search               3                       119 0.0645      84    35
# 9 Organic Search               2                       460 0.249      203   257
#10 Organic Search               1                      1176 0.638      663   513
## … with 25 more rows
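The cumsum(lag()) step is easier to follow on a bare vector. A minimal sketch using the first five counts from the output above (the variable names are only for illustration): lagging with default = 0 shifts the counts down by one row, so the running total at each row covers only the rows before it.

```r
library(dplyr)

# First five counts from the output above
count <- c(1, 2, 6, 8, 5)

shifted <- lag(count, default = 0)   # 0 1 2 6 8
cuml    <- cumsum(shifted)           # running total of all earlier rows
diffs   <- count - cuml

print(data.frame(count, shifted, cuml, diffs))
```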

Is there any way to use lag and lead functions together from dplyr

Instead of using lead and lag you can use rolling operations which can be adapted easily if your window size increases/decreases.

library(dplyr)
library(zoo)

df %>%
  mutate(result1 = lag(rollapplyr(BP, 4, function(x) 
                       paste0(rev(x), collapse = ''), fill = NA)), 
         result2 = rollapply(BP, 2, align = 'left', function(x) 
                       paste0(rev(x), collapse = ''), fill = NA))

#   ID BP result1 result2
#1 Id1  A    <NA>      AA
#2 Id2  A    <NA>      TA
#3 Id3  T    <NA>      CT
#4 Id4  C    <NA>      AC
#5 Id5  A    CTAA      TA
#6 Id6  T    ACTA      AT
#7 Id7  A    TACT      TA
#8 Id8  T    ATAC    <NA>

A suggestion by @G. Grothendieck avoids the hacky use of rev and lag above:

df %>% 
  mutate(result1 = rollapply(BP, list(-(1:4)), paste, collapse = '', fill = NA), 
         result2 = rollapply(BP, list(1:2), paste, collapse = '', fill = NA))
