Beginner Tips on Using Plyr to Calculate Year-Over-Year Change Across Groups

I know you asked for a "plyr"-specific solution, but for the sake of sharing, here is an alternative approach in base R. I find the base R approach just as "readable" and, at least in this particular case, it's a lot faster (roughly five times faster in the benchmark further down).

output <- within(df1, {
  yoy <- ave(ab, team, lg, FUN = function(x) c(NA, diff(x)))
})
head(output)
# year lg team ab yoy
# 1 1884 UA ALT 108 NA
# 2 1997 AL ANA 1703 NA
# 3 1998 AL ANA 1502 -201
# 4 1999 AL ANA 660 -842
# 5 2000 AL ANA 85 -575
# 6 2001 AL ANA 219 134
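
One caveat: c(NA, diff(x)) compares each row with the row before it, so it only gives a year-over-year change when the rows are sorted by year within each group. Here is a minimal sketch on made-up data (the toy data frame and its values are assumptions, not part of the baseball data) that sorts first:

```r
# Hypothetical toy data, deliberately out of year order
toy <- data.frame(
  year = c(2001, 1999, 2000, 2000, 1999),
  team = c("A", "A", "A", "B", "B"),
  ab   = c(30, 10, 15, 7, 5)
)

# Sort by group and year so that diff() compares consecutive years
toy <- toy[order(toy$team, toy$year), ]

# Same ave() idiom as above, with a single grouping variable
toy$yoy <- ave(toy$ab, toy$team, FUN = function(x) c(NA, diff(x)))
toy
```

If a team skips a year, the difference still spans the gap, so you may want to check for missing years separately.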

library(plyr)        # provides ddply() and mutate()
library(rbenchmark)

benchmark(DDPLY = {
  ddply(df1, .(team, lg), mutate,
        yoy = c(NA, diff(ab)))
}, WITHIN = {
  within(df1, {
    yoy <- ave(ab, team, lg, FUN = function(x) c(NA, diff(x)))
  })
}, columns = c("test", "replications", "elapsed",
               "relative", "user.self"))
# test replications elapsed relative user.self
# 1 DDPLY 100 10.675 4.974 10.609
# 2 WITHIN 100 2.146 1.000 2.128

Update: data.table

If your data are very large, check out data.table. Even with this example, you'll find a good speedup in relative terms. Plus the syntax is super compact and, in my opinion, easily readable.

library(plyr)
df1 <- aggregate(ab~year+lg+team, FUN=sum, data=baseball)
library(data.table)
DT <- data.table(df1)
DT
# year lg team ab
# 1: 1884 UA ALT 108
# 2: 1997 AL ANA 1703
# 3: 1998 AL ANA 1502
# 4: 1999 AL ANA 660
# 5: 2000 AL ANA 85
# ---
# 2523: 1895 NL WSN 839
# 2524: 1896 NL WSN 982
# 2525: 1897 NL WSN 1426
# 2526: 1898 NL WSN 1736
# 2527: 1899 NL WSN 787

Now, look at this concise solution:

DT[, yoy := c(NA, diff(ab)), by = "team,lg"]
DT
# year lg team ab yoy
# 1: 1884 UA ALT 108 NA
# 2: 1997 AL ANA 1703 NA
# 3: 1998 AL ANA 1502 -201
# 4: 1999 AL ANA 660 -842
# 5: 2000 AL ANA 85 -575
# ---
# 2523: 1895 NL WSN 839 290
# 2524: 1896 NL WSN 982 143
# 2525: 1897 NL WSN 1426 444
# 2526: 1898 NL WSN 1736 310
# 2527: 1899 NL WSN 787 -949
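
If you want a percent change rather than the absolute difference, shift() from data.table returns the previous value within each group. A small sketch on invented numbers (toy_dt and its values are hypothetical, not the baseball data):

```r
library(data.table)

# Hypothetical toy data
toy_dt <- data.table(
  year = c(1999, 2000, 2001, 1999, 2000),
  lg   = "NL",
  team = c("A", "A", "A", "B", "B"),
  ab   = c(100, 150, 120, 50, 75)
)

# Make sure rows are in year order within each group
setorder(toy_dt, team, lg, year)

# shift(ab) is the previous row's value within the group (NA for the first row)
toy_dt[, yoy_pct := 100 * (ab - shift(ab)) / shift(ab), by = .(team, lg)]
toy_dt
```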

Quarterly Year over Year Growth Rate

Here's a very simple solution:

YearOverYear <- function(x, periodsPerYear) {
  if (NROW(x) <= periodsPerYear) {
    stop("too few rows")
  } else {
    indexes <- 1:(NROW(x) - periodsPerYear)
    return(c(rep(NA, periodsPerYear),
             (x[indexes + periodsPerYear] - x[indexes]) / x[indexes]))
  }
}

> cbind(df,YoY=YearOverYear(df$value,4))
date value YoY
1 2000-01-01 1592 NA
2 2000-04-01 1825 NA
3 2000-07-01 1769 NA
4 2000-10-01 1909 NA
5 2001-01-01 2022 0.27010050
6 2001-04-01 2287 0.25315068
7 2001-07-01 2169 0.22611645
8 2001-10-01 2366 0.23939235
9 2002-01-01 2001 -0.01038576
10 2002-04-01 2087 -0.08745081
11 2002-07-01 2099 -0.03227294
12 2002-10-01 2258 -0.04564666
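
For what it's worth, the same calculation can also be written with head() and tail(). This sketch rebuilds the value column from the output above; yoy_simple is a name made up for this example, and the dates are assumed to be regular quarters:

```r
# Values copied from the output above
df <- data.frame(
  date  = seq(as.Date("2000-01-01"), by = "quarter", length.out = 12),
  value = c(1592, 1825, 1769, 1909, 2022, 2287, 2169, 2366,
            2001, 2087, 2099, 2258)
)

# Hypothetical helper: growth relative to the value k periods earlier
yoy_simple <- function(x, k) {
  stopifnot(length(x) > k)
  c(rep(NA, k), tail(x, -k) / head(x, -k) - 1)
}

df$YoY <- yoy_simple(df$value, 4)
df
```

Both versions assume there are no gaps in the series; a missing quarter would silently shift the comparison.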

Calculate percent change from a baseline year (t0) to a subsequent BUT LIMITED series of years (t1, ..., tk)

To answer your expanded question, use transform combined with ddply from the plyr package:

ddply(df, .(case), transform, change = ((100 / value[1]) * value) - 100)

Regarding your comment on the NA and Inf values: this is expected behavior, because the baseline value (value[1]) is zero for those cases, so you are dividing by zero and the percent change is undefined. You could delete those entries.
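
As an alternative to deleting rows, here is a minimal base R sketch that marks the undefined changes as NA. The data are made up; only the column names case and value come from the code above:

```r
# Hypothetical data: case "b" has a baseline of zero
df <- data.frame(
  case  = rep(c("a", "b"), each = 3),
  value = c(50, 75, 100, 0, 5, 10)
)

# Percent change relative to the first value within each case
df$change <- ave(df$value, df$case,
                 FUN = function(v) (100 / v[1]) * v - 100)

# Division by a zero baseline yields Inf or NaN; flag those as missing
df$change[!is.finite(df$change)] <- NA
df
```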

Multiple density graphs different groups (based on factor level) using plyr

I see that @Andrie just beat me to most of this. I'm still going to post my answer, since filling only certain quantiles of the distribution requires a slightly different approach.

library(plyr)
library(ggplot2)

set.seed(1234)
Aa = c(rnorm(40000, 50, 10))
Bb = c(rnorm(4000, 70, 10))
Cc = c(rnorm(400, 75, 10))
Dd = c(rnorm(40, 80, 10))
yvar = c(Aa, Bb, Cc, Dd)
gen <- c(rep("Aa", length(Aa)),rep("Bb", length(Bb)), rep("Cc", length(Cc)),
rep("Dd", length(Dd)))
mydf <- data.frame(grp = gen,x = c(Aa,Bb,Cc,Dd))

#Separate data frame for the means; computed from the raw data,
# before mydf is overwritten with the density grid
mydfMean <- ddply(mydf, .(grp), summarise, mn = mean(x))

#Calculate the densities and an indicator for the desired quantile
# for later use in subsetting
mydf <- ddply(mydf, .(grp), .fun = function(x) {
  tmp <- density(x$x)
  x1 <- tmp$x
  y1 <- tmp$y
  q80 <- x1 >= quantile(x$x, 0.8)
  data.frame(x = x1, y = y1, q80 = q80)
})

ggplot(mydf, aes(x = x)) +
  facet_wrap(~grp) +
  geom_line(aes(y = y)) +
  geom_ribbon(data = subset(mydf, q80), aes(ymax = y), ymin = 0, fill = "black") +
  geom_vline(data = mydfMean, aes(xintercept = mn), colour = "black")
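
To see what the q80 indicator selects, here is a rough base R check on simulated data. The vector v and the Riemann-sum approximation are assumptions for illustration only; the flagged part of the density should hold roughly 20% of the area:

```r
set.seed(1)
v <- rnorm(1000, 50, 10)   # simulated data, for illustration only

d <- density(v)            # evaluated on an evenly spaced grid (512 points by default)
shaded <- d$x >= quantile(v, 0.8)

# Approximate the shaded area with a Riemann sum over the grid
step <- d$x[2] - d$x[1]
area <- sum(d$y[shaded]) * step
area                       # should be close to 0.2
```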

Using PLYR to count with Which Condition

This isn't exactly what you're looking for but here are two pieces of advice:

  1. plyr is the predecessor of dplyr, so I would use the newer package, especially because it comes with the tidyverse. dplyr's count can deal with factors.
  2. You often don't need factors for this kind of task. If they get in the way, I would suggest just coercing with as.character.

With dplyr you could write something like:

data %>% filter(numeric > 10) %>% count(factor)
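
Here is a small runnable sketch of that pipeline. The data frame and the column names value and group are hypothetical stand-ins for your numeric and factor columns:

```r
library(dplyr)

# Hypothetical data: `group` is a factor, `value` is numeric
data <- data.frame(
  group = factor(c("a", "b", "a", "c", "b", "a")),
  value = c(5, 12, 20, 11, 3, 15)
)

# Keep rows above the threshold, then count rows per factor level
res <- data %>%
  filter(value > 10) %>%
  count(group)
res
```

count() puts the counts in a column named n, one row per level that survives the filter.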

Adding a base year index to R dataframe with multiple groups

We can create the 'VAL.IND' after doing the calculation within the grouping variable ('GRP'). This can be done in many ways.

One option is data.table. We convert the 'data.frame' to a 'data.table' by reference (setDT(df)) and then, grouped by 'GRP', divide 'VAL' by the 'VAL' that corresponds to the 'YEAR' value of 2000.

library(data.table)
setDT(df)[, VAL.IND := VAL/VAL[YEAR==2000], by = GRP]

NOTE: The base YEAR is a bit ambiguous with respect to the expected result. In the example, both the 'A' and 'B' GRP have 'YEAR' 2000. If the OP meant to use the minimum YEAR value in each group (assuming it is a numeric column), VAL/VAL[YEAR==2000] in the above code can be replaced with VAL/VAL[which.min(YEAR)].


Or you can use similar code with dplyr. We group by 'GRP' and use mutate to create 'VAL.IND'.

library(dplyr)
df %>%
  group_by(GRP) %>%
  mutate(VAL.IND = VAL/VAL[YEAR==2000])

Here also, if needed, replace VAL/VAL[YEAR==2000] with VAL/VAL[which.min(YEAR)].


A base R option with split/unsplit. We split the dataset by the 'GRP' column to convert the data.frame to a list of dataframes, loop through the list output with lapply, create a new column using transform (or within) and convert the list with the added column back to a single data.frame by unsplit.

unsplit(lapply(split(df, df$GRP), function(x)
  transform(x, VAL.IND = VAL/VAL[YEAR==2000])), df$GRP)

Note that we can also use do.call(rbind, ...) instead of unsplit, but I prefer unsplit because it returns the rows in the same order as the original dataset.
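
One thing to watch with VAL[YEAR==2000]: if a group has no row for the base year, the subset is empty and the division typically ends in an error. A minimal base R sketch, assuming made-up data in which group "C" lacks the base year, that returns NA for such groups by using a named lookup vector:

```r
# Hypothetical data; group "C" has no observation for the base year
df <- data.frame(
  GRP  = c("A", "A", "A", "B", "B", "B", "C", "C"),
  YEAR = c(2000, 2001, 2002, 2000, 2001, 2002, 2001, 2002),
  VAL  = c(10, 12, 15, 200, 150, 300, 7, 8)
)

# Named lookup vector: one base-year value per group
base_rows <- df[df$YEAR == 2000, ]
base_val  <- setNames(base_rows$VAL, base_rows$GRP)

# Groups missing from the lookup get NA instead of raising an error
df$VAL.IND <- unname(df$VAL / base_val[as.character(df$GRP)])
df
```

This assumes at most one base-year row per group; duplicates would need to be resolved first.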

How to calculate percentage change from different rows over different spans

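You can declare your data as ts() and use cbind() and diff(). This gives the absolute change over one and two rows; divide by the earlier value to turn it into a percentage.

data <- read.table(header=TRUE,text='gvkey  PRCCQ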
1004 23.750
1004 13.875
1004 11.250
1004 10.375
1004 13.600
1004 14.000
1004 17.060
1005 8.150
1005 7.400
1005 11.440
1005 6.200
1005 5.500
1005 4.450
1005 4.500
1005 8.010')

data <- split(data,list(data$gvkey))
(newdata <- do.call(rbind, lapply(data, function(data) {
  data <- ts(data)
  cbind(data, Quarter = diff(data[,2]), Two.Quarter = diff(data[,2], 2))
})))

data.gvkey data.PRCCQ Quarter Two.Quarter
[1,] 1004 23.750 NA NA
[2,] 1004 13.875 -9.875 NA
[3,] 1004 11.250 -2.625 -12.500
[4,] 1004 10.375 -0.875 -3.500
[5,] 1004 13.600 3.225 2.350
[6,] 1004 14.000 0.400 3.625
[7,] 1004 17.060 3.060 3.460
[8,] 1005 8.150 NA NA
[9,] 1005 7.400 -0.750 NA
[10,] 1005 11.440 4.040 3.290
[11,] 1005 6.200 -5.240 -1.200
[12,] 1005 5.500 -0.700 -5.940
[13,] 1005 4.450 -1.050 -1.750
[14,] 1005 4.500 0.050 -1.000
[15,] 1005 8.010 3.510 3.560

EDIT:

Another way, without split() and lapply() (probably faster):

data <- read.table(header=TRUE,text='gvkey  PRCCQ
1004 23.750
1004 13.875
1004 11.250
1004 10.375
1004 13.600
1004 14.000
1004 17.060
1005 8.150
1005 7.400
1005 11.440
1005 6.200
1005 5.500
1005 4.450
1005 4.500
1005 8.010')
newdata <- do.call(rbind, by(data, data$gvkey, function(data) {
  data <- ts(data)
  cbind(data, Quarter = diff(data[,2]), Two.Quarter = diff(data[,2], 2))
}))
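
Since the question is about percentage change, here is a base R sketch that divides each difference by the earlier value. It uses only the first three rows per gvkey from the data above; pct_change is a helper name invented for this example:

```r
# First three rows per gvkey, copied from the data above
prices <- data.frame(
  gvkey = rep(c(1004, 1005), each = 3),
  PRCCQ = c(23.750, 13.875, 11.250, 8.150, 7.400, 11.440)
)

# Hypothetical helper: percent change relative to the value k rows earlier
pct_change <- function(x, k = 1) {
  n <- length(x)
  if (n <= k) return(rep(NA_real_, n))
  c(rep(NA_real_, k), 100 * (tail(x, -k) - head(x, -k)) / head(x, -k))
}

# Apply the helper within each gvkey for spans of one and two rows
prices$Quarter.pct     <- ave(prices$PRCCQ, prices$gvkey, FUN = pct_change)
prices$Two.Quarter.pct <- ave(prices$PRCCQ, prices$gvkey,
                              FUN = function(x) pct_change(x, 2))
prices
```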

ddply transformation (percentage change) in R

It would be easier to use lag() from dplyr:

df.summary %>% group_by(Brand) %>%
  mutate(pChange = (EUR - lag(EUR))/lag(EUR) * 100)

# Source: local data frame [10 x 5]
#Groups: Brand [5]
#
# Brand Year EUR pos pChange
# <fctr> <fctr> <dbl> <dbl> <dbl>
#1 Brand1 2015 637896.7 318948.3 NA
#2 Brand1 2016 721944.2 998868.8 13.17573
#3 Brand2 2015 708697.6 354348.8 NA
#4 Brand2 2016 300541.1 858968.2 -57.59248
#5 Brand3 2015 454890.1 227445.1 NA
#6 Brand3 2016 576095.6 742937.9 26.64500
#7 Brand4 2015 305712.0 152856.0 NA
#8 Brand4 2016 174073.3 392748.6 -43.05970
#9 Brand5 2015 589970.7 294985.3 NA
#10 Brand5 2016 518510.2 849225.8 -12.11254

As suggested by @r2evans, if Year is not already sorted, arrange it first:

df.summary %>% group_by(Brand) %>% arrange(Year) %>%
  mutate(pChange = (EUR - lag(EUR))/lag(EUR) * 100)
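
Another option, if you would rather keep the original row order, is the order_by argument of dplyr's lag(). A small sketch on invented numbers (toy and its values are hypothetical; Year is numeric here rather than a factor):

```r
library(dplyr)

# Hypothetical data, with Brand "B1" deliberately out of year order
toy <- data.frame(
  Brand = c("B1", "B1", "B2", "B2"),
  Year  = c(2016, 2015, 2015, 2016),
  EUR   = c(120, 100, 200, 150)
)

# lag(..., order_by = Year) looks up the previous year's value
# without reordering the rows of the data frame
res <- toy %>%
  group_by(Brand) %>%
  mutate(prev    = lag(EUR, order_by = Year),
         pChange = (EUR - prev) / prev * 100) %>%
  ungroup()
res
```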

How to find difference between values in two rows in an R dataframe using dplyr

In dplyr:

library(dplyr)
df %>%
  group_by(farm) %>%
  mutate(volume = cumVol - lag(cumVol, default = cumVol[1]))

Source: local data frame [8 x 5]
Groups: farm

period farm cumVol other volume
1 1 A 1 1 0
2 2 A 5 2 4
3 3 A 15 3 10
4 4 A 31 4 16
5 1 B 10 5 0
6 2 B 12 6 2
7 3 B 16 7 4
8 4 B 24 8 8

Perhaps the desired output should actually be as follows?

df %>%
  group_by(farm) %>%
  mutate(volume = cumVol - lag(cumVol, default = 0))

period farm cumVol other volume
1 1 A 1 1 1
2 2 A 5 2 4
3 3 A 15 3 10
4 4 A 31 4 16
5 1 B 10 5 10
6 2 B 12 6 2
7 3 B 16 7 4
8 4 B 24 8 8
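
For comparison, the same per-period volumes can be obtained in base R with ave() and diff(). This sketch rebuilds farm and cumVol from the output above and assumes the rows are already in period order within each farm:

```r
# farm and cumVol copied from the output above
df <- data.frame(
  period = rep(1:4, 2),
  farm   = rep(c("A", "B"), each = 4),
  cumVol = c(1, 5, 15, 31, 10, 12, 16, 24)
)

# Prepending 0 makes the first difference equal to the first cumulative value
df$volume <- ave(df$cumVol, df$farm, FUN = function(x) diff(c(0, x)))
df
```

A quick sanity check is that the cumulative sum of volume within each farm gives back cumVol.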

Edit: Following up on your comments, I think you are looking for arrange(). If that is not the case, it might be best to start a new question.

df1 <- data.frame(period=rep(1:4,4), farm=rep(c(rep('A',4),rep('B',4)),2), crop=(c(rep('apple',8), rep('pear',8))), cumCropVol=c(1,5,15,31,10,12,16,24,11,15,25,31,20,22,26,34), other = rep(1:8,2) ); 
df1 %>%
  arrange(desc(period), desc(farm)) %>%
  group_by(period, farm) %>%
  summarise(cumVol=sum(cumCropVol))

Edit: Follow up #2

df1 <- data.frame(period=rep(1:4,4), farm=rep(c(rep('A',4),rep('B',4)),2), crop=(c(rep('apple',8), rep('pear',8))), cumCropVol=c(1,5,15,31,10,12,16,24,11,15,25,31,20,22,26,34), other = rep(1:8,2) ); 
df <- df1 %>%
  arrange(desc(period), desc(farm)) %>%
  group_by(period, farm) %>%
  summarise(cumVol=sum(cumCropVol))

ungroup(df) %>%
  arrange(farm) %>%
  group_by(farm) %>%
  mutate(volume = cumVol - lag(cumVol, default = 0))

Source: local data frame [8 x 4]
Groups: farm

period farm cumVol volume
1 1 A 12 12
2 2 A 20 8
3 3 A 40 20
4 4 A 62 22
5 1 B 30 30
6 2 B 34 4
7 3 B 42 8
8 4 B 58 16

