R Dynamically build list in data.table (or ddply)
Another way is to use .SDcols to group together the columns on which you'd like to perform the same operation. Say you want columns a, d, e to be summed by type, whereas b, g should have the mean taken and c, f the median. Then:
# constructing an example data.table:
library(data.table)
set.seed(45)
dt <- data.table(type=rep(c("hello","bye","ok"), each=3), a=sample(9),
b = rnorm(9), c=runif(9), d=sample(9), e=sample(9),
f = runif(9), g=rnorm(9))
# type a b c d e f g
# 1: hello 6 -2.5566166 0.7485015 9 6 0.5661358 -2.2066521
# 2: hello 3 1.1773119 0.6559926 3 3 0.4586280 -0.8376586
# 3: hello 2 -0.1015588 0.2164430 1 7 0.9299597 1.7216593
# 4: bye 8 -0.2260640 0.3924327 8 2 0.1271187 0.4360063
# 5: bye 7 -1.0720503 0.3256450 7 8 0.5774691 0.7571990
# 6: bye 5 -0.7131021 0.4855804 6 9 0.2687791 1.5398858
# 7: ok 1 -0.4680549 0.8476840 2 4 0.5633317 1.5393945
# 8: ok 4 0.4183264 0.4402595 4 1 0.7592801 2.1829996
# 9: ok 9 -1.4817436 0.5080116 5 5 0.2357030 -0.9953758
# 1) set key
setkey(dt, "type")
# 2) group col-ids by similar operations
id1 <- which(names(dt) %in% c("a", "d", "e"))
id2 <- which(names(dt) %in% c("b","g"))
id3 <- which(names(dt) %in% c("c","f"))
# 3) now use these ids with the .SDcols parameter
dt1 <- dt[, lapply(.SD, sum), by="type", .SDcols=id1]
dt2 <- dt[, lapply(.SD, mean), by="type", .SDcols=id2]
dt3 <- dt[, lapply(.SD, median), by="type", .SDcols=id3]
# 4) merge them.
dt1[dt2[dt3]]
# type a d e b g c f
# 1: bye 20 21 19 -0.6704055 0.9110304 0.3924327 0.2687791
# 2: hello 11 13 16 -0.4936211 -0.4408838 0.6559926 0.5661358
# 3: ok 14 11 10 -0.5104907 0.9090061 0.5080116 0.5633317
If/when you have many, many columns, building a list like the one above by hand might be cumbersome.
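One way to automate it, as a sketch: keep a named list mapping each summary function to its columns (the contents below just mirror the example above), compute each piece with .SDcols, and merge the results.
# map function names to column groups, then reduce the pieces by merging
ops <- list(sum = c("a", "d", "e"), mean = c("b", "g"), median = c("c", "f"))
res <- lapply(names(ops), function(f)
  dt[, lapply(.SD, match.fun(f)), by = "type", .SDcols = ops[[f]]])
Reduce(function(x, y) merge(x, y, by = "type"), res)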
Converting ddply syntax into data.table
Firstly, your function is terribly inefficient and exposes a lack of understanding of what a function to be passed to plyr should look like. For ddply(), it should take a generic data frame as input and output a data frame. By 'generic' in this context, I mean a data frame that would be produced as any one of the 'splits' defined by combinations of the levels of the grouping variables. Your function should look more like this:
count.and.sum <- function(d) data.frame(n = length(d$val), valsum = sum(d$val))
The grouping variable combinations are taken care of in the ddply() call.
Secondly, your ddply() call creates one-line data frames because each observation is associated with a unique combination of area, day and type. A more realistic application of ddply() for this toy example would be to summarize by day:
Direct method, using summarise as the 'apply' function:
ddply(dat, .(day), summarise, nrow = length(val), valsum = sum(val))
Using count.and.sum:
ddply(dat, .(day), count.and.sum)
This is very likely to be much faster than your version of count.and.sum.
As for an equivalent data.table version (not necessarily the most efficient), try this:
library(data.table)
DT <- data.table(dat, key = c('area', 'day', 'type'))
DT[, list(n = length(val), valsum = sum(val)), by = 'day']
Here's a slightly more elaborate toy example with 100K observations:
library(plyr)       # for ddply()
library(lubridate)  # for days()
set.seed(5490)
dat2 <- data.frame(area = sample(letters[1:2], 1e5, replace = TRUE),
day = sample(as.Date("2012-10-01") + c(0:10) * days(1),
1e5, replace = TRUE),
type = sample(paste0("t", 1:2), 1e5, replace = TRUE),
val = runif(1e5))
system.time(u <- ddply(dat2, .(area, day, type), summarise,
n = length(val), valsum = sum(val)))
DT2 <- data.table(dat2, key = c('area', 'day', 'type'))
system.time(v <- DT2[, list(n = length(val), valsum = sum(val)), by = key(DT2)])
identical(u, as.data.frame(v))
On my system, the data.table version is about 4.5 times faster than the plyr version (0.09s elapsed for plyr, 0.02 for data.table).
data.table `:=` assignment expressions with dynamic inputs (existing columns) and outputs (new column names)
Maybe I don't understand the problem well, but does this suffice:
DT[, (col_out) := .SD[[col_in_one]]+.SD[[col_in_two]],
.SDcols = c(col_in_one,col_in_two)]
DT
# a b y bah
#1: 0 0 2 0
#2: 0 1 2 1
#3: 1 0 2 1
#4: 1 1 2 2
To answer the edited question: to get the eval to work, use .SD as the environment:
DT[, (col_out) := eval(expr, .SD)]
Also, see this question and the update there - eval and quote in data.table
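For a self-contained illustration, here is a small setup that reproduces the table printed above (the data values are inferred from that output, so treat this as a sketch):
library(data.table)
DT <- data.table(a = c(0, 0, 1, 1), b = c(0, 1, 0, 1), y = 2)
col_in_one <- "a"; col_in_two <- "b"; col_out <- "bah"
# dynamic input and output column names via .SDcols:
DT[, (col_out) := .SD[[col_in_one]] + .SD[[col_in_two]],
   .SDcols = c(col_in_one, col_in_two)]
# dynamic expression, evaluated with .SD as the environment:
expr <- quote(a + b)
DT[, (col_out) := eval(expr, .SD)]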
ddply 'drop=FALSE' equivalent in data.table
I think the ddply approach you mentioned may actually be your best option here. I came up with another solution, but I'd prefer ddply, to be honest. Here we go:
f <- function(x, y) {list( mean(x * y), sum(x), sum(y))}
dt[, c("v1", "v2", "v3") := f(some_var1, some_var2), by = list(grp1, grp2)]
dt[, c("some_var1", "some_var2") := NULL]
Now we imitate the .drop=FALSE functionality by doing a left join using merge(). The trick here is that we construct a left table using base R's expand.grid(), which contains all combinations of grp1 and grp2:
left_tab <- expand.grid(grp1 = unique(dt$grp1), grp2 = unique(dt$grp2))
merge(left_tab, unique(dt), all.x = TRUE)
# grp1 grp2 val_1 val_2 val_3
# 1 a x 3 3 1
# 2 a y 6 6 4
# 3 a z NA NA NA
# 4 b x 6 3 2
# 5 b y 6 8 3
# 6 b z NA NA NA
# 7 c x 5 5 1
# 8 c y 10 5 2
# 9 c z 10 5 2
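If you'd rather stay entirely within data.table, a cross join built with CJ() can replace expand.grid() plus merge() (a sketch using the same column names):
# all grp1/grp2 combinations via CJ(), then a left join onto them
unique(dt)[CJ(grp1 = unique(grp1), grp2 = unique(grp2)), on = c("grp1", "grp2")]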
Do I have this equivalency for data.table and ddply right?
You did not say whether you were getting an error. The dummy.table.file$ prefix looked wrong, and I thought ddply would probably make val available in the evaluation environment. This runs without error:
library(signal)  # filtfilt() comes from the 'signal' package
filt.test <- ddply(dummy.table, .(group1, group2),
                   mutate,
                   val = filtfilt(filt = low.pass, x = val))
#--------------
str(filt.test)
'data.frame': 30 obs. of 3 variables:
$ val : num 0.000239 0.000425 0.000416 0.000402 0.000225 ...
$ group1: int 1 1 1 1 1 1 1 1 1 1 ...
$ group2: int 1 2 3 4 5 6 7 8 9 10 ...
I was wondering if it might be better to use summarise instead of mutate, but it looks like I get the same result. I also suspect you are not getting data.table efficiency here; that might be improved with the dplyr package, though.
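For what it's worth, a data.table version of the same step might look like this (a sketch; dummy.table and low.pass come from the question, and filtfilt() is from the signal package):
library(data.table)
# := modifies dummy.table by reference, filtering val within each group
setDT(dummy.table)[, val := filtfilt(filt = low.pass, x = val),
                   by = .(group1, group2)]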
create a formula in a data.table environment in R
lm can accept a character string as the formula, so combine that with .SD like this:
> x[, as.list(coef(lm("a ~ b", .SD))), by = id]
id (Intercept) b
1: 1 21 -1
2: 2 21 -1
3: 3 21 -1
4: 4 21 -1
5: 5 21 -1
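If the formula itself must be built at run time, base R's reformulate() can assemble it from column-name strings (a sketch using the same columns):
lhs <- "a"; rhs <- "b"
# reformulate("b", response = "a") builds the formula a ~ b
x[, as.list(coef(lm(reformulate(rhs, response = lhs), .SD))), by = id]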
Better & faster way to sum & ifelse for a large set of columns in a big data frame using ddply R
We could possibly reduce the time by switching to dplyr. Also, instead of doing the sum and then using ifelse to check and reconvert, this can be done directly by checking for any value greater than 0:
library(dplyr)
dummies %>%
dplyr::select(id, where(is.numeric)) %>%
dplyr::group_by(id) %>%
dplyr::summarise(across(everything(), ~ +(any(. > 0, na.rm = TRUE))))
or using data.table
library(data.table)
setDT(dummies)[, lapply(.SD, function(x)
+(any(x > 0, na.rm = TRUE))), id, .SDcols = patterns('group')]
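For a quick sanity check, here is a minimal dummies with the assumed structure (an id column plus numeric 0/1 columns whose names contain 'group'; purely illustrative):
dummies <- data.frame(id = c(1, 1, 2, 2),
                      group_a = c(0, 1, 0, 0),
                      group_b = c(1, 1, 0, 1))
# Either version then returns one row per id, with 1 where any value
# in that group exceeds 0:
#    id group_a group_b
# 1:  1       1       1
# 2:  2       0       1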
Speeding up ddply
You can do that with the shift function from the data.table package. An example for xend:
library(data.table)
setDT(df)[, xend := shift(x, 1L, fill = x[.N], type = "lag"), by = group]
For all columns:
setDT(df)[, c("xend","yend","zend") := .(shift(x, 1L, fill = x[.N], type = "lag"),
shift(y, 1L, fill = y[.N], type = "lag"),
shift(z, 1L, fill = z[.N], type = "lag")),
by = group]
this gives you:
> head(df)
group x y z xend yend zend
1: 1 0.56725304 0.7539735 0.20542455 0.71538606 0.3864990 0.01586889
2: 1 0.64251519 0.1255183 0.93371528 0.56725304 0.7539735 0.20542455
3: 1 0.14182485 0.7351444 0.89199415 0.64251519 0.1255183 0.93371528
4: 1 0.06613097 0.7625182 0.92669617 0.14182485 0.7351444 0.89199415
5: 1 0.71538606 0.3864990 0.01586889 0.06613097 0.7625182 0.92669617
6: 4 0.27188921 0.5496977 0.09282217 0.27188921 0.5496977 0.09282217
Another approach as suggested by @akrun in the comments:
setDT(df)[, c("xend","yend","zend") := lapply(.SD, function(x) shift(x, fill = x[.N]))
, by = group]
Although this approach requires less typing and gives more flexibility with regard to including variables, it is also considerably slower.
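If you want to check the speed difference yourself, here is a rough comparison in the style of the system.time() calls above (a sketch, using the 100K-row df from "Used data" below; .SDcols is added so only x, y and z are shifted):
dt1 <- as.data.table(df); dt2 <- as.data.table(df)
system.time(dt1[, c("xend","yend","zend") := .(shift(x, fill = x[.N]),
                                               shift(y, fill = y[.N]),
                                               shift(z, fill = z[.N])), by = group])
system.time(dt2[, c("xend","yend","zend") := lapply(.SD, function(x)
                 shift(x, fill = x[.N])), by = group, .SDcols = c("x","y","z")])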
In the question, you stated:
For the last value in the group, the ends are taken as the first point
in the group.
However, according to the desired behavior you described, for the last value in the group the previous value in the group is used. I suppose you meant:
For the first value in the group, the ends are taken as the last point
in the group.
Used data:
set.seed(1)
n = 1e5
df = data.frame(group = sample(1:as.integer(n/2), n, replace = TRUE),
                x = runif(n),
                y = runif(n),
                z = runif(n))
df = df[with(df,order(group)),]
Sending in Column Name to ddply from Function
There has got to be a better way. And I couldn't figure out how to make it work with summarise.
my.fun <- function(df, count.column) {
ddply(df, .(x), function(d) sum(d[[count.column]]))
}
dat <- data.frame(x=letters[1:2], y=1:10)
> my.fun(dat, 'y')
x V1
1 a 25
2 b 30
>
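A data.table alternative in the same spirit might look like this (a sketch; it passes the column name straight through as .SDcols):
library(data.table)
my.fun.dt <- function(df, count.column) {
  # sum the named column within each level of x
  as.data.table(df)[, lapply(.SD, sum), by = x, .SDcols = count.column]
}
my.fun.dt(dat, 'y')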