dplyr: lead() and lag() wrong when used with group_by()

It seems you have to pass an additional argument to the lag() and lead() functions. When I run your code without arrange() but with order_by added, everything seems to be OK.

df %>%
  group_by(name) %>%
  mutate(next.score = lead(score, order_by = name),
         before.score = lag(score, order_by = name))

Output:

  name score next.score before.score
1   Al   100         60           NA
2  Jen    80        100           NA
3   Al    60         80          100
4  Jen   100         60           80
5   Al    80         NA           60
6  Jen    60         NA          100

My sessionInfo():

R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250 LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C LC_TIME=Polish_Poland.1250

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] dplyr_0.4.1

loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1 lazyeval_0.1.10 magrittr_1.5 parallel_3.1.1 Rcpp_0.11.5
[7] tools_3.1.1
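
For reference, here is a minimal, self-contained sketch of the grouped lead()/lag() call. The data frame is reconstructed from the output above, and the sketch assumes a recent dplyr release, where, as far as I can tell, the grouped call gives the same per-group result without order_by:

```r
library(dplyr)

# Data reconstructed from the output shown above (an assumption, not the asker's original object)
df <- data.frame(
  name  = c("Al", "Jen", "Al", "Jen", "Al", "Jen"),
  score = c(100, 80, 60, 100, 80, 60)
)

# lead() and lag() are evaluated separately within each name group
res <- df %>%
  group_by(name) %>%
  mutate(next.score   = lead(score),
         before.score = lag(score)) %>%
  ungroup()

print(res)
```

If the result on your version differs from the table above, adding order_by as shown earlier is the workaround to try.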

Dplyr Lags on Summarised Grouped Data

You want to first summarise to get the sum and the mean, then you can use a mutate statement to get the lag of each column, then rearrange the columns.

library(tidyverse)

test.df2 <- test.df1 %>%
  group_by(dateidx) %>%
  summarise(sumA = sum(A),
            meanB = mean(B)) %>%
  mutate(sumAlag = lag(sumA),
         meanBlag = lag(meanB)) %>%
  select(dateidx, starts_with("sum"), starts_with("mean"))

Output

  dateidx     sumA sumAlag meanB meanBlag
  <date>     <dbl>   <dbl> <dbl>    <dbl>
1 2019-01-02   300      NA 1425       NA
2 2019-01-03   100     300 2000     1425
3 2019-01-07   700     100  342.    2000
4 2019-01-10   535     700  200      342.

Is this a bug in group_by and lead/lag?

The first error seems to be version-specific. The second one can be avoided by passing the first (or last) observation of 'count' as the default for lag(), so that the first row of each group gets a difference of 0 instead of NA.

df %>%
  group_by(hour) %>%
  mutate(diff = count - lag(count, default = first(count)))
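
To see what the default argument does here, a small sketch with made-up data may help. The values below are hypothetical; only the column names hour and count come from the question:

```r
library(dplyr)

# Hypothetical data: two hours with a few counts each
df <- data.frame(
  hour  = c(1, 1, 1, 2, 2),
  count = c(5, 8, 6, 10, 13)
)

res <- df %>%
  group_by(hour) %>%
  # The first row of each group is compared with itself, so diff is 0 rather than NA
  mutate(diff = count - lag(count, default = first(count))) %>%
  ungroup()

print(res)
```

With these values, diff comes out as 0, 3, -2 for hour 1 and 0, 3 for hour 2.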

Computing lags but grouping by two categories with dplyr

Do you mean group_by(ID) and effectively "order by YEAR"?

MyData %>%
  group_by(ID) %>%
  mutate(var3 = var1 - dplyr::lag(var2)) %>%
  print(n = 99)
# # A tibble: 25 x 5
# # Groups: ID [5]
# YEAR ID var1 var2 var3
# <int> <int> <dbl> <dbl> <dbl>
# 1 2010 1 11.1 1.16 NA
# 2 2011 1 13.5 -0.550 12.4
# 3 2012 1 10.2 2.11 10.7
# 4 2013 1 8.57 1.43 6.46
# 5 2014 1 12.6 1.89 11.2
# 6 2010 2 8.87 1.87 NA
# 7 2011 2 5.30 1.70 3.43
# 8 2012 2 6.81 0.956 5.11
# 9 2013 2 13.3 -0.0296 12.4
# 10 2014 2 9.98 -1.27 10.0
# 11 2010 3 8.62 0.258 NA
# 12 2011 3 12.4 2.00 12.2
# 13 2012 3 16.1 2.12 14.1
# 14 2013 3 8.48 2.83 6.37
# 15 2014 3 10.6 0.190 7.80
# 16 2010 4 12.3 0.887 NA
# 17 2011 4 10.9 1.07 10.0
# 18 2012 4 7.99 1.09 6.92
# 19 2013 4 10.1 1.95 9.03
# 20 2014 4 11.1 1.82 9.17
# 21 2010 5 15.1 1.67 NA
# 22 2011 5 10.4 0.492 8.76
# 23 2012 5 10.0 1.66 9.51
# 24 2013 5 10.6 0.567 8.91
# 25 2014 5 5.32 -0.881 4.76

(I have replaced your summarize with a mutate for now.)
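
The code above relies on the rows already being sorted by YEAR within each ID. If that is not guaranteed, one option is the order_by argument of dplyr::lag(). This is a minimal sketch with hypothetical, deliberately shuffled values; only the column names come from the question:

```r
library(dplyr)

# Hypothetical data with the years of ID 1 out of order
MyData <- data.frame(
  YEAR = c(2011, 2010, 2012, 2010, 2011),
  ID   = c(1, 1, 1, 2, 2),
  var1 = c(13, 11, 10, 9, 5),
  var2 = c(2, 1, 3, 4, 6)
)

res <- MyData %>%
  group_by(ID) %>%
  # order_by makes lag() use the previous YEAR within each ID,
  # regardless of the physical row order
  mutate(var3 = var1 - dplyr::lag(var2, order_by = YEAR)) %>%
  ungroup()

print(res)
```

Sorting first with arrange(ID, YEAR) and then using a plain lag() should give the same values, just in sorted row order.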

Lag / lead by group in R and dplyr

By default, lag() offsets by n = 1. However, 'Team' and 'Date' contain duplicate values. In order to get the expected output, we need to take the distinct rows of 'Team' and 'Date', create 'Date_Lagged' as the lag of 'Date', and right_join (or left_join) the result with the original dataset.

distinct(df, Team, Date) %>%
  mutate(Date_Lagged = lag(Date)) %>%
  right_join(., df) %>%
  select(Team, Date, Points, Date_Lagged)
# Team Date Points Date_Lagged
#1 A 2016-05-10 1 <NA>
#2 A 2016-05-10 4 <NA>
#3 A 2016-05-10 3 <NA>
#4 A 2016-05-10 2 <NA>
#5 B 2016-05-12 1 2016-05-10
#6 B 2016-05-12 5 2016-05-10
#7 B 2016-05-12 6 2016-05-10
#8 C 2016-05-15 1 2016-05-12
#9 C 2016-05-15 2 2016-05-12
#10 D 2016-05-30 3 2016-05-15
#11 D 2016-05-30 9 2016-05-15

Or we can also do

df %>%
  mutate(Date_Lagged = rep(lag(unique(Date)), table(Date)))
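
Here is a self-contained sketch of the distinct-then-join idea. The data frame is reconstructed from the output above, and I use left_join from the original data with an explicit by, which is my own variation, so that the original row order is kept:

```r
library(dplyr)

# Data reconstructed from the printed output (an assumption)
df <- data.frame(
  Team   = rep(c("A", "B", "C", "D"), times = c(4, 3, 2, 2)),
  Date   = as.Date(rep(c("2016-05-10", "2016-05-12", "2016-05-15", "2016-05-30"),
                       times = c(4, 3, 2, 2))),
  Points = c(1, 4, 3, 2, 1, 5, 6, 1, 2, 3, 9)
)

# One row per Team/Date pair, lagged across pairs rather than across raw rows
lagged <- distinct(df, Team, Date) %>%
  mutate(Date_Lagged = lag(Date))

# Attach the lagged date to every original row
res <- left_join(df, lagged, by = c("Team", "Date"))

print(res)
```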

Lagging variable by group does not work in dplyr

Here is an idea. We group by monthvec in order to get the number of rows (cnt) of each group. We ungroup and use the first value of cnt as the size of the lag. We regroup on monthvec and replace the values in each group with the first value of each group.

library(dplyr)

df %>%
  group_by(monthvec) %>%
  mutate(cnt = n()) %>%
  ungroup() %>%
  mutate(lag.growth = lag(growth, first(cnt))) %>%
  group_by(monthvec) %>%
  mutate(lag.growth = first(lag.growth)) %>%
  select(-cnt)

which gives,

# A tibble: 13 x 3
# Groups:   monthvec [5]
   monthvec growth lag.growth
      <int>  <dbl>      <dbl>
 1        1    0.3       NA
 2        1    0.3       NA
 3        2    0.5        0.3
 4        2    0.5        0.3
 5        3    0.7        0.5
 6        3    0.7        0.5
 7        3    0.7        0.5
 8        4    0.1        0.7
 9        4    0.1        0.7
10        4    0.1        0.7
11        5    0.6        0.1
12        5    0.6        0.1
13        5    0.6        0.1
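
Since the lag size above is taken from the first group, the result seems to depend on the later groups having at least that many rows. A variant that does not depend on group sizes is to lag a one-row-per-month table and join it back. This is only a sketch: the data is reconstructed from the output above, and it assumes growth is constant within each monthvec:

```r
library(dplyr)

# Data reconstructed from the printed output (an assumption)
df <- data.frame(
  monthvec = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
  growth   = c(0.3, 0.3, 0.5, 0.5, 0.7, 0.7, 0.7, 0.1, 0.1, 0.1, 0.6, 0.6, 0.6)
)

# One row per month, lagged by one month
monthly <- df %>%
  distinct(monthvec, growth) %>%
  arrange(monthvec) %>%
  mutate(lag.growth = lag(growth)) %>%
  select(-growth)

# Attach the previous month's growth to every row of that month
res <- left_join(df, monthly, by = "monthvec")

print(res)
```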

dplyr how to lag by group

This looks like a bug. The lag function may be unintentionally masked between the dplyr and stats packages. Try this workaround:

df %>%
  group_by(team) %>%
  # explicitly specify the source of the lag function here
  mutate(receive = dplyr::lag(order, n = unique(lead_time), default = 0))

#Source: local data frame [10 x 4]
#Groups: team [2]

# team order lead_time receive
# <fctr> <dbl> <dbl> <dbl>
#1 a 2 3 0
#2 a 4 3 0
#3 a 3 3 0
#4 a 5 3 2
#5 a 6 3 4
#6 b 7 2 0
#7 b 8 2 0
#8 b 5 2 7
#9 b 4 2 8
#10 b 5 2 5
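
To illustrate why the namespace matters, here is a small sketch on a plain vector (the values are team a's order column from the output above). As far as I know, stats::lag() only shifts the time base of a series and leaves the values where they are, while dplyr::lag() shifts the values themselves:

```r
x <- c(2, 4, 3, 5, 6)

# stats::lag() adds a shifted time attribute; the values themselves do not move
ts_version <- stats::lag(x, 1)
print(as.numeric(ts_version))

# dplyr::lag() moves the values down by n positions and pads with default
shifted <- dplyr::lag(x, n = 3, default = 0)
print(shifted)
```

So if lag resolves to the stats version inside mutate(), the column can look unchanged, which is easy to mistake for a grouping problem.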

Error in dplyr when using nested group_by

If you want one observation for every unique value of dem_sect, then:

test <- df %>%
  group_by(sector) %>%
  mutate(
    md_area = mean(area),
    md_peso = mean(peso_kg),
    se_loc = sqrt(var(peso_kg)) / sqrt(length(peso_kg)),
    cv_loc = sd(peso_kg) / mean(peso_kg) * 100
  ) %>%
  ungroup() %>%
  group_by(dem_sect) %>%
  summarize(sum_sect = sum(se_loc * cv_loc))

I don't think SEloc and CVloc exist, so the code above uses se_loc and cv_loc instead. That is what you meant, right?

R lead lag function summarize within group and calculate percent

Is this the output you're looking for?

df %>%
  arrange(tier_1, -sequence_number) %>%
  group_by(tier_1) %>% # already grouped this way, only including for clarity
  mutate(cuml = cumsum(lag(count_of_sequence_numbers, default = 0)),
         diff = count_of_sequence_numbers - cuml) %>%
  ungroup()


## A tibble: 35 x 6
# tier_1 sequence_number count_of_sequence_numbers percent cuml diff
# <chr> <int> <int> <dbl> <dbl> <dbl>
# 1 Organic Search 10 1 0.000542 0 1
# 2 Organic Search 9 2 0.00108 1 1
# 3 Organic Search 8 6 0.00325 3 3
# 4 Organic Search 7 8 0.00434 9 -1
# 5 Organic Search 6 5 0.00271 17 -12
# 6 Organic Search 5 21 0.0114 22 -1
# 7 Organic Search 4 41 0.0222 43 -2
# 8 Organic Search 3 119 0.0645 84 35
# 9 Organic Search 2 460 0.249 203 257
#10 Organic Search 1 1176 0.638 663 513
## … with 25 more rows
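
For a version you can run without the original data, here is a sketch that uses only the first four rows of the output above as input (an assumption on my part; the percent column is left out):

```r
library(dplyr)

# First four "Organic Search" rows from the output above, deliberately unsorted
df <- data.frame(
  tier_1 = "Organic Search",
  sequence_number = 7:10,
  count_of_sequence_numbers = c(8, 6, 2, 1)
)

res <- df %>%
  arrange(tier_1, desc(sequence_number)) %>%
  group_by(tier_1) %>%
  # cuml is the running total of all earlier rows, not including the current one
  mutate(cuml = cumsum(lag(count_of_sequence_numbers, default = 0)),
         diff = count_of_sequence_numbers - cuml) %>%
  ungroup()

print(res)
```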

Is there any way to use lag and lead functions together from dplyr

Instead of using lead and lag you can use rolling operations which can be adapted easily if your window size increases/decreases.

library(dplyr)
library(zoo)

df %>%
  mutate(result1 = lag(rollapplyr(BP, 4, function(x)
           paste0(rev(x), collapse = ''), fill = NA)),
         result2 = rollapply(BP, 2, align = 'left', function(x)
           paste0(rev(x), collapse = ''), fill = NA))

# ID BP result1 result2
#1 Id1 A <NA> AA
#2 Id2 A <NA> TA
#3 Id3 T <NA> CT
#4 Id4 C <NA> AC
#5 Id5 A CTAA TA
#6 Id6 T ACTA AT
#7 Id7 A TACT TA
#8 Id8 T ATAC <NA>

A suggestion by @G. Grothendieck avoids the hacky use of rev and lag above:

df %>%
  mutate(result11 = rollapply(BP, list(-(1:4)), paste, collapse = '', fill = NA),
         result2 = rollapply(BP, list(1:2), paste, collapse = '', fill = NA))

