R Ddply with Multiple Variables

R ddply with multiple variables

Okay, took me a little bit to figure out what you want, but here is a solution:

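library(plyr)

cols.to.sub <- paste0("mm", 1:3)  # the measurement columns to adjust
df1 <- ddply(
  df, .(ID, variable),
  function(x) {
    # subtract the phase-3 row from every row of this ID/variable group
    x[cols.to.sub] <- t(t(as.matrix(x[cols.to.sub])) - unlist(x[x$phase == 3, cols.to.sub]))
    x
  }
)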

This produces (first 6 rows):

   ID phase variable mm1 mm2 mm3
1 101     1        A  -2  -2  -2
2 101     2        A  -1  -1  -1
3 101     3        A   0   0   0
4 101     1        B  -2  -2  -2
5 101     2        B  -1  -1  -1
6 101     3        B   0   0   0

Generally speaking the best way to debug this type of issue is to put a browser() statement inside the function you are passing to ddply, so you can examine the objects at your leisure. Doing so would have revealed that:

  1. The data frame passed to your function includes the ID, phase and variable columns, so your mm columns are not the first three (hence the need to define cols.to.sub).
  2. Even if you address that, you can't do arithmetic on data frames of unequal dimensions, so what I do here is convert to a matrix and then take advantage of vector recycling to subtract the one row from the rest of the matrix. I need t() (transpose) because vector recycling runs column-wise.
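
The recycling point in item 2 is easier to see on a toy matrix. This is a minimal sketch in base R: `m` and `ref` are made-up stand-ins for one group's mm columns and its phase-3 row, and `sweep()` is shown only as a base-R alternative, not as part of the answer above.

```r
# Toy stand-in for one ID/variable group: 3 phases (rows) x 2 mm columns
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("mm1", "mm2")))
ref <- m[3, ]  # the "phase 3" row to subtract from every row

# Naive subtraction recycles ref down the columns, pairing values incorrectly
wrong <- m - ref

# Transposing puts each mm column in a row, so recycling lines up with ref
right <- t(t(m) - ref)

# sweep() expresses the same per-column subtraction directly
alt <- sweep(m, 2, ref)

print(right)
```

The last row of `right` is all zeros, which is the same pattern as the phase-3 rows in the output above.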

ddply summarise on multiple variables

I don't know what plyr does internally, but data.table is only going to use the columns that are in the expression itself, effectively scanning the data only once (column by column):

library(data.table)
dt = data.table(df)

lapply(c('hw', 'app', 'srvc'), function(name) dt[, .N, by = name])
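
To see what that lapply() call returns, here is a small self-contained sketch. The data frame is made up (only the column names hw, app and srvc come from the answer), and it assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical data; only the column names come from the answer above
df <- data.frame(
  hw   = c("x", "x", "y", "y"),
  app  = c("a", "b", "a", "a"),
  srvc = c("s1", "s1", "s1", "s2")
)
dt <- as.data.table(df)

# One small count table per column; each call groups by a single column
cols <- c("hw", "app", "srvc")
counts <- lapply(cols, function(name) dt[, .N, by = name])
names(counts) <- cols

print(counts$app)
```

Each element of `counts` is a two-column data.table holding the grouping column and `N`, the number of rows in each group.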

How to use a for loop to use ddply on multiple columns?

The OP asked for a simple for-loop for this transformation. There are many more optimized ways to solve this, but to respect that request I have used a for-loop based solution. I have used dplyr, since plyr is old now.

library(dplyr)
Subject <- c(rep(1, times = 6), rep(2, times = 6))
GroupOfInterest <- c(letters[rep(1:3, times = 4)])
Feature1 <- sample(1:20, 12, replace = TRUE)
Feature2 <- sample(400:500, 12, replace = TRUE)
Feature3 <- sample(1:5, 12, replace = TRUE)
#small change in the way data.frame is created
df.main <- data.frame(Subject,GroupOfInterest, Feature1, Feature2,
Feature3, stringsAsFactors = FALSE)

Feat <- c(colnames(df.main[3:5]))

# Ready with Key columns on which grouping is done
resultdf <- unique(select(df.main, Subject, GroupOfInterest))
#> resultdf
# Subject GroupOfInterest
#1 1 a
#2 1 b
#3 1 c
#7 2 a
#8 2 b
#9 2 c

# For loop over each feature column
for (q in Feat) {
  # Mean of column q per Subject/GroupOfInterest, stored under the same name
  # (summarise_() is deprecated; .data[[q]] and := need a recent dplyr)
  df_mean <- df.main %>%
    group_by(Subject, GroupOfInterest) %>%
    summarise(!!q := mean(.data[[q]]), .groups = "drop")
  # Merge the new mean column into resultdf
  resultdf <- merge(resultdf, df_mean, by = c("Subject", "GroupOfInterest"))
}

# Final result
#> resultdf
# Subject GroupOfInterest Feature1 Feature2 Feature3
#1 1 a 6.5 473.0 3.5
#2 1 b 4.5 437.0 2.0
#3 1 c 12.0 415.5 3.5
#4 2 a 10.0 437.5 3.0
#5 2 b 3.0 447.0 4.5
#6 2 c 6.0 462.0 2.5
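
For comparison, the same per-group means can be computed without a loop. This is a sketch using base R's aggregate(), which is not what the answer above uses; the data frame here is a fixed, made-up stand-in for df.main so the result is reproducible.

```r
# Fixed stand-in for df.main (the original uses random values)
df.main <- data.frame(
  Subject         = rep(1:2, each = 6),
  GroupOfInterest = rep(letters[1:3], times = 4),
  Feature1        = c(2, 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11),
  Feature2        = seq(400, 455, by = 5),
  stringsAsFactors = FALSE
)

# Mean of every feature column within each Subject/GroupOfInterest pair
resultdf <- aggregate(
  cbind(Feature1, Feature2) ~ Subject + GroupOfInterest,
  data = df.main,
  FUN = mean
)

print(resultdf)
```

The formula interface handles all listed feature columns in one call, so no merge step is needed.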

Using ddply to summarize multiple variables in R

Is this what you want?

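library(reshape2)
# Helper column: every row counts as 1
df$Count <- 1
# Spread the campaigns into columns, one row per user
df1 <- as.data.frame(acast(df, User.ID ~ Campaign.ID, value.var = "Count"))
names(df1) <- paste0("Click_", names(df1))
# Change the NA to 0
df1[is.na(df1)] <- 0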

> df1
                             Click_10852036 Click_11181834 Click_11272183
AMsySZa8l0q0G9zNCsqGQ9-y5MYi              0              1              0
AMsySZaGnaf3z8Q7BzFkzxhLD76R              1              0              0
AMsySZb_uZeGo8NmzdWUBbEL7HEl              0              0              1
AMsySZY9u3XoNZ4qOfmK2JnaXbBg              1              0              0
AMsySZZE17Pzu6wwv_HkNhVDYSFJ              1              0              0
AMsySZZOF_CrRXtClA8dna1W-YVg              0              1              0

Pass multiple arguments to ddply

In order to pass arguments into the function for use by dplyr, I recommend reading up on non-standard evaluation (NSE). Here is an edited function producing the same output as my original.

library(dplyr)

random_df <- data.frame(
region = c('A','B','C','C'),
number_of_reports = c(1, 3, 2, 1),
report_MV = c(12, 33, 22, 12)
)

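output_graph <- function(df, group, args) {
  grp_quo <- enquo(group)  # capture the bare column name as a quosure

  df %>%
    group_by(!!grp_quo) %>%  # unquote the grouping column
    summarise(!!!args)       # splice in the named list of quosures
}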

args <- list(
Reports = quo(sum(number_of_reports)),
MV_Reports = quo(sum(report_MV))
)

output_graph(random_df, region, args)

# # A tibble: 3 x 3
# region Reports MV_Reports
# <fctr> <dbl> <dbl>
# 1 A 1.00 12.0
# 2 B 3.00 33.0
# 3 C 3.00 34.0
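
In more recent dplyr/rlang versions, the enquo()/!! pair for a single argument can be written with the embrace operator `{{ }}`. A minimal sketch, assuming dplyr 1.0 or later: `output_graph2` is a hypothetical name, and the summaries are passed through `...` instead of a list of quosures.

```r
library(dplyr)

random_df <- data.frame(
  region = c("A", "B", "C", "C"),
  number_of_reports = c(1, 3, 2, 1),
  report_MV = c(12, 33, 22, 12)
)

# Hypothetical variant of output_graph(): {{ group }} forwards the bare column
# name, and ... forwards the named summary expressions to summarise()
output_graph2 <- function(df, group, ...) {
  df %>%
    group_by({{ group }}) %>%
    summarise(..., .groups = "drop")
}

res <- output_graph2(
  random_df, region,
  Reports = sum(number_of_reports),
  MV_Reports = sum(report_MV)
)
print(res)
```

This keeps the call site free of quo(), at the cost of no longer being able to build the list of summaries ahead of time.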

Using cast() or ddply() to summarise the mean for two continuous variables in one dataframe

It is not a ddply() or a cast() solution, but using tidyverse and reshape2 you can do:

library(dplyr)
library(reshape2)

df %>%
  group_by(Date, Independent_Variable) %>%
  summarise(Independent_Value = mean(Independent_Value)) %>%
  mutate(Independent_Variable = paste(Independent_Variable, "IV", sep = "_")) %>%
  dcast(Date ~ Independent_Variable, value.var = "Independent_Value") %>%
  arrange(factor(Date, levels = month.name)) %>%
  left_join(
    df %>%
      group_by(Date, Independent_Variable) %>%
      summarise(Sapflow = mean(Sapflow)) %>%
      mutate(Independent_Variable = paste(Independent_Variable, "Sapflow", sep = "_")) %>%
      dcast(Date ~ Independent_Variable, value.var = "Sapflow") %>%
      arrange(factor(Date, levels = month.name)),
    by = "Date"
  )

Date Humidity_IV Radiation_IV Temperature_IV Humidity_Sapflow
1 June 17.60733 263.6733 70.56133 16.067000
2 July 21.80065 270.9065 61.33065 23.356774
3 August 18.38968 178.9806 71.73355 22.941613
4 September 14.82200 152.2333 72.21367 19.309333
5 October 11.34867 93.6000 81.74300 6.700667
Radiation_Sapflow Temperature_Sapflow
1 16.067000 16.067000
2 23.356774 23.356774
3 22.941613 22.941613
4 19.309333 19.309333
5 6.700667 6.700667

First, it groups by "Date" and "Independent_Variable" and summarises "Independent_Value". Second, it appends "_IV" to the values in Independent_Variable. Third, it reshapes the data to wide format and arranges the rows in calendar order of months. Fourth, it repeats the first three steps for "Sapflow", using the suffix "_Sapflow". Finally, it joins the two results on "Date".

Or by using just tidyverse:

df %>%
  group_by(Date, Independent_Variable) %>% # Grouping
  # Mean of every remaining variable, with "_mean" appended to its name
  # (funs() is deprecated; across() needs dplyr >= 1.0)
  summarise(across(everything(), mean, .names = "{.col}_mean")) %>%
  arrange(factor(Date, levels = month.name)) # Arranging in calendar order of months

Date Independent_Variable Independent_Value_mean Sapflow_mean
<fct> <fct> <dbl> <dbl>
1 June Humidity 17.6 16.1
2 June Radiation 264. 16.1
3 June Temperature 70.6 16.1
4 July Humidity 21.8 23.4
5 July Radiation 271. 23.4
6 July Temperature 61.3 23.4

How to summarize over multiple columns programmatically using ddply?

Consider the dplyr package: it is generally much faster than plyr and has a cleaner syntax.

library(dplyr)

x <- c(2,4,3,1,5,7)
y <- c(3,2,6,3,4,6)
group1 <- c("A","A","A","A","B","B")
group2 <- c("X","X","Y","Y","Z","X")

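aggFunction <- function(dataframe, toAverage, toGroup) {
  # group_by_()/summarise_() are deprecated; across() + all_of() accept
  # character vectors of column names (dplyr >= 1.0)
  dataframe %>%
    group_by(across(all_of(toGroup))) %>%
    summarise(across(all_of(toAverage), mean), .groups = "drop")
}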

data <- data.frame(group1, group2, x, y)
aggFunction(data, c("x", "y"), c("group1", "group2"))

It gives:

  group1 group2 x   y
1 A X 3 2.5
2 A Y 2 4.5
3 B X 7 6.0
4 B Z 5 4.0
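
If a package-free version is useful, the same programmatic summary can be sketched with base R's aggregate(). This is a swapped-in technique rather than the dplyr approach above, and `aggBase` is a hypothetical name.

```r
x <- c(2, 4, 3, 1, 5, 7)
y <- c(3, 2, 6, 3, 4, 6)
group1 <- c("A", "A", "A", "A", "B", "B")
group2 <- c("X", "X", "Y", "Y", "Z", "X")
data <- data.frame(group1, group2, x, y)

# Hypothetical base-R counterpart of aggFunction(): column names arrive as
# character vectors and are used to subset the data frame directly
aggBase <- function(dataframe, toAverage, toGroup) {
  aggregate(dataframe[toAverage], by = dataframe[toGroup], FUN = mean)
}

res <- aggBase(data, c("x", "y"), c("group1", "group2"))
print(res)
```

The group means are the same as in the table above, although aggregate() may return the rows in a different order.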

ddply multiple function arguments + naming

There's probably an easier way to do this, but you could combine your use of plyr with reshape2:

require(plyr)
require(reshape2)

d2 <- ddply(d, c("code", "station"), function(df) {
df[which.min(df$date.time),]
})

d3 <- dcast(d2, code ~ station, value.var = "date.time")

d3

code L5 L7
1 10888 1368005216 1368011698
2 10891 1367943040 1367959536

dcast converts POSIXct classes to integer, so you'll have to convert them back:

d3[,grepl("^L", colnames(d3))] <- lapply(d3[,grepl("^L", colnames(d3))], as.POSIXct,  
origin="1970-10-01")

d3
code L5 L7
1 10888 2014-02-05 04:26:56 2014-02-05 06:14:58
2 10891 2014-02-04 11:10:40 2014-02-04 15:45:36
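
The integer-to-POSIXct step can be checked in isolation. This is a minimal base-R sketch with a made-up timestamp. It assumes the integers are ordinary POSIXct seconds, which R counts from 1970-01-01 UTC; if your values were built against a different origin, as the call above suggests, pass that origin instead.

```r
# Made-up timestamp, pinned to UTC so the example does not depend on locale
x <- as.POSIXct("2014-02-05 04:26:56", tz = "UTC")

# What is left once the class is dropped: seconds since 1970-01-01 UTC
secs <- as.numeric(x)

# Restoring the class needs the origin the seconds were counted from
back <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

print(back)
```

A mismatched origin shifts every timestamp by the same fixed offset, so it is worth comparing one converted value against a known original.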

EDIT

I just thought of an easier way that doesn't require reshape2:

ddply(d, "code", function(df) {
  as.POSIXct(tapply(df$date.time, df$station, min), origin="1970-10-01")
})

code L5 L7
1 10888 2014-02-05 04:26:56 2014-02-05 06:14:58
2 10891 2014-02-04 11:10:40 2014-02-04 15:45:36

All of this assumes that you really want your output to list each station's values in different columns. If you're ok with station identifiers being a separate column by themselves, djhurio's response is simplest.


