dplyr: lead() and lag() wrong when used with group_by()
It seems you have to pass an additional argument to the lag and lead functions. When I run your code without arrange, but with order_by added, everything seems to be OK.
df %>%
  group_by(name) %>%
  mutate(next.score = lead(score, order_by = name),
         before.score = lag(score, order_by = name))
Output:
name score next.score before.score
1 Al 100 60 NA
2 Jen 80 100 NA
3 Al 60 80 100
4 Jen 100 60 80
5 Al 80 NA 60
6 Jen 60 NA 100
My sessionInfo():
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250 LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C LC_TIME=Polish_Poland.1250
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.4.1
loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1 lazyeval_0.1.10 magrittr_1.5 parallel_3.1.1 Rcpp_0.11.5
[7] tools_3.1.1
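For comparison, here is a minimal self-contained sketch of the grouped calculation. The data frame is reconstructed from the output above, and the sketch assumes a dplyr release newer than the 0.4.1 listed in the session info, where (as far as I can tell) lead() and lag() respect the grouping without order_by.

```r
library(dplyr)

# Data reconstructed from the output table above
df <- data.frame(
  name  = c("Al", "Jen", "Al", "Jen", "Al", "Jen"),
  score = c(100, 80, 60, 100, 80, 60)
)

# lead()/lag() are evaluated separately within each name group
res_lead_lag <- df %>%
  group_by(name) %>%
  mutate(next.score   = lead(score),
         before.score = lag(score)) %>%
  ungroup()

print(res_lead_lag)
```

If the result differs from the table above, checking packageVersion("dplyr") is a reasonable first step.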
Dplyr Lags on Summarised Grouped Data
You want to first summarise to get the sum and the mean, then you can use a mutate statement to get the lag of each column, then rearrange the columns.
library(tidyverse)
test.df2 <- test.df1 %>%
  group_by(dateidx) %>%
  summarise(sumA = sum(A),
            meanB = mean(B)) %>%
  mutate(sumAlag = lag(sumA),
         meanBlag = lag(meanB)) %>%
  select(dateidx, starts_with("sum"), starts_with("mean"))
Output
dateidx sumA sumAlag meanB meanBlag
<date> <dbl> <dbl> <dbl> <dbl>
1 2019-01-02 300 NA 1425 NA
2 2019-01-03 100 300 2000 1425
3 2019-01-07 700 100 342. 2000
4 2019-01-10 535 700 200 342.
Is this a bug in group_by and lead/lag?
The first error seems to be version-specific, but the second one can be removed by using the first (or last) observation of 'count' as the default value for lag.
df %>%
  group_by(hour) %>%
  mutate(diff = count - lag(count, default = first(count)))
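A small sketch of the effect of default = first(count), using made-up values (the hour and count numbers below are hypothetical, not the asker's data). Without a default, the first row of each group would get NA; with it, the first difference in each group becomes 0.

```r
library(dplyr)

# Hypothetical data: two hours with a few counts each
df <- data.frame(
  hour  = c(1, 1, 1, 2, 2),
  count = c(5, 8, 6, 3, 7)
)

res_diff <- df %>%
  group_by(hour) %>%
  # first(count) fills the gap that lag() leaves at the start of each group
  mutate(diff = count - lag(count, default = first(count))) %>%
  ungroup()

print(res_diff)
```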
Computing lags but grouping by two categories with dplyr
Do you mean group_by(ID) and effectively "order by YEAR"?
MyData %>%
  group_by(ID) %>%
  mutate(var3 = var1 - dplyr::lag(var2)) %>%
  print(n = 99)
# # A tibble: 25 x 5
# # Groups: ID [5]
# YEAR ID var1 var2 var3
# <int> <int> <dbl> <dbl> <dbl>
# 1 2010 1 11.1 1.16 NA
# 2 2011 1 13.5 -0.550 12.4
# 3 2012 1 10.2 2.11 10.7
# 4 2013 1 8.57 1.43 6.46
# 5 2014 1 12.6 1.89 11.2
# 6 2010 2 8.87 1.87 NA
# 7 2011 2 5.30 1.70 3.43
# 8 2012 2 6.81 0.956 5.11
# 9 2013 2 13.3 -0.0296 12.4
# 10 2014 2 9.98 -1.27 10.0
# 11 2010 3 8.62 0.258 NA
# 12 2011 3 12.4 2.00 12.2
# 13 2012 3 16.1 2.12 14.1
# 14 2013 3 8.48 2.83 6.37
# 15 2014 3 10.6 0.190 7.80
# 16 2010 4 12.3 0.887 NA
# 17 2011 4 10.9 1.07 10.0
# 18 2012 4 7.99 1.09 6.92
# 19 2013 4 10.1 1.95 9.03
# 20 2014 4 11.1 1.82 9.17
# 21 2010 5 15.1 1.67 NA
# 22 2011 5 10.4 0.492 8.76
# 23 2012 5 10.0 1.66 9.51
# 24 2013 5 10.6 0.567 8.91
# 25 2014 5 5.32 -0.881 4.76
(Disregarding your summarize into a mutate for now.)
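The code above relies on the rows already being in YEAR order within each ID. If that is not guaranteed, one option is to sort first. A minimal sketch with made-up values (the numbers below are hypothetical and only reuse the asker's column names):

```r
library(dplyr)

# Hypothetical, deliberately unsorted data
MyData <- data.frame(
  YEAR = c(2012, 2010, 2011, 2011, 2010, 2012),
  ID   = c(1, 1, 1, 2, 2, 2),
  var1 = c(30, 10, 20, 20, 10, 30),
  var2 = c(3, 1, 2, 2, 1, 3)
)

res_var3 <- MyData %>%
  arrange(ID, YEAR) %>%   # make the "order by YEAR" explicit
  group_by(ID) %>%        # lag restarts at the first year of each ID
  mutate(var3 = var1 - dplyr::lag(var2)) %>%
  ungroup()

print(res_var3)
```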
Lag / lead by group in R and dplyr
By default, lag offsets by n = 1. However, we have duplicate elements for 'Team' and 'Date'. In order to get the expected output, we need to get the distinct rows of 'Team' and 'Date', create a 'Date_Lagged' column with the lag of 'Date', and right_join (or left_join) with the original dataset.
distinct(df, Team, Date) %>%
  mutate(Date_Lagged = lag(Date)) %>%
  right_join(., df) %>%
  select(Team, Date, Points, Date_Lagged)
# Team Date Points Date_Lagged
#1 A 2016-05-10 1 <NA>
#2 A 2016-05-10 4 <NA>
#3 A 2016-05-10 3 <NA>
#4 A 2016-05-10 2 <NA>
#5 B 2016-05-12 1 2016-05-10
#6 B 2016-05-12 5 2016-05-10
#7 B 2016-05-12 6 2016-05-10
#8 C 2016-05-15 1 2016-05-12
#9 C 2016-05-15 2 2016-05-12
#10 D 2016-05-30 3 2016-05-15
#11 D 2016-05-30 9 2016-05-15
Or we can also do
df %>%
  mutate(Date_Lagged = rep(lag(unique(Date)), table(Date)))
Lagging variable by group does not work in dplyr
Here is an idea. We group by monthvec in order to get the number of rows (cnt) of each group. We ungroup and use the first value of cnt as the size of the lag. We regroup on monthvec and replace the values in each group with the first value of each group.
library(dplyr)
df %>%
  group_by(monthvec) %>%
  mutate(cnt = n()) %>%
  ungroup() %>%
  mutate(lag.growth = lag(growth, first(cnt))) %>%
  group_by(monthvec) %>%
  mutate(lag.growth = first(lag.growth)) %>%
  select(-cnt)
which gives,
# A tibble: 13 x 3
# Groups: monthvec [5]
monthvec growth lag.growth
<int> <dbl> <dbl>
1 1 0.3 NA
2 1 0.3 NA
3 2 0.5 0.3
4 2 0.5 0.3
5 3 0.7 0.5
6 3 0.7 0.5
7 3 0.7 0.5
8 4 0.1 0.7
9 4 0.1 0.7
10 4 0.1 0.7
11 5 0.6 0.1
12 5 0.6 0.1
13 5 0.6 0.1
dplyr how to lag by group
This looks like a bug. There might be some unintended masking of the lag function between the dplyr and stats packages. Try this workaround:
df %>%
  group_by(team) %>%
  # explicitly specify the source of the lag function here
  mutate(receive = dplyr::lag(order, n = unique(lead_time), default = 0))
#Source: local data frame [10 x 4]
#Groups: team [2]
# team order lead_time receive
# <fctr> <dbl> <dbl> <dbl>
#1 a 2 3 0
#2 a 4 3 0
#3 a 3 3 0
#4 a 5 3 2
#5 a 6 3 4
#6 b 7 2 0
#7 b 8 2 0
#8 b 5 2 7
#9 b 4 2 8
#10 b 5 2 5
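To illustrate why the explicit dplyr:: prefix can matter: stats::lag() is meant for time series and only shifts the time base of its input, leaving the values of a plain vector in place, whereas dplyr::lag() shifts the values themselves. A small sketch using the 'order' values of team a from the output above:

```r
# 'order' values of team a, taken from the output above
x <- c(2, 4, 3, 5, 6)

# dplyr::lag() moves the values down by n positions and pads with the default
shifted <- dplyr::lag(x, n = 3, default = 0)
print(shifted)   # 0 0 0 2 4

# stats::lag() only adds a shifted time-series attribute; the values stay put
unshifted <- as.numeric(stats::lag(x, 1))
print(unshifted) # 2 4 3 5 6
```

If an unqualified lag() call ends up resolving to the stats version (for example because of the order in which packages were attached), a grouped mutate would quietly return unshifted values, which may be what the question ran into.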
Error in dplyr when using nested group_by
If you want one observation for every unique value of dem_sect, then:
test <- df %>%
  group_by(sector) %>%
  mutate(
    md_area = mean(area),
    md_peso = mean(peso_kg),
    se_loc = sqrt(var(peso_kg)) / sqrt(length(peso_kg)),
    cv_loc = sd(peso_kg) / mean(peso_kg) * 100
  ) %>%
  ungroup() %>%
  group_by(dem_sect) %>%
  # use the column names created in mutate() above
  summarize(sum_sect = sum(se_loc * cv_loc))
I don't think SEloc and CVloc exist; they should be se_loc and cv_loc, right?
R lead lag function summarize within group and calculate percent
Is this the output you're looking for?
df %>%
  arrange(tier_1, -sequence_number) %>%
  group_by(tier_1) %>% # already grouped this way, only including for clarity
  mutate(cuml = cumsum(lag(count_of_sequence_numbers, default = 0)),
         diff = count_of_sequence_numbers - cuml) %>%
  ungroup()
## A tibble: 35 x 6
# tier_1 sequence_number count_of_sequence_numbers percent cuml diff
# <chr> <int> <int> <dbl> <dbl> <dbl>
# 1 Organic Search 10 1 0.000542 0 1
# 2 Organic Search 9 2 0.00108 1 1
# 3 Organic Search 8 6 0.00325 3 3
# 4 Organic Search 7 8 0.00434 9 -1
# 5 Organic Search 6 5 0.00271 17 -12
# 6 Organic Search 5 21 0.0114 22 -1
# 7 Organic Search 4 41 0.0222 43 -2
# 8 Organic Search 3 119 0.0645 84 35
# 9 Organic Search 2 460 0.249 203 257
#10 Organic Search 1 1176 0.638 663 513
## … with 25 more rows
Is there anyway to use lag and lead functions together from dplyr
Instead of using lead and lag, you can use rolling operations, which can be adapted easily if your window size increases or decreases.
library(dplyr)
library(zoo)
df %>%
  mutate(result1 = lag(rollapplyr(BP, 4, function(x)
           paste0(rev(x), collapse = ''), fill = NA)),
         result2 = rollapply(BP, 2, align = 'left', function(x)
           paste0(rev(x), collapse = ''), fill = NA))
# ID BP result1 result2
#1 Id1 A <NA> AA
#2 Id2 A <NA> TA
#3 Id3 T <NA> CT
#4 Id4 C <NA> AC
#5 Id5 A CTAA TA
#6 Id6 T ACTA AT
#7 Id7 A TACT TA
#8 Id8 T ATAC <NA>
The suggestion by @G. Grothendieck avoids the above hacky way with rev and lag.
df %>%
  mutate(result11 = rollapply(BP, list(-(1:4)), paste, collapse = '', fill = NA),
         result2 = rollapply(BP, list(1:2), paste, collapse = '', fill = NA))