R Dynamically Build "List" in Data.Table (Or Ddply)

R Dynamically build list in data.table (or ddply)

Another way is to use .SDcols to group together the columns on which you'd like to perform the same operation. Say you want columns a, d, e summed by type, the mean taken of b and g, and the median of c and f. Then:

# constructing an example data.table:
library(data.table)
set.seed(45)
dt <- data.table(type = rep(c("hello", "bye", "ok"), each = 3),
                 a = sample(9), b = rnorm(9), c = runif(9),
                 d = sample(9), e = sample(9), f = runif(9), g = rnorm(9))

# type a b c d e f g
# 1: hello 6 -2.5566166 0.7485015 9 6 0.5661358 -2.2066521
# 2: hello 3 1.1773119 0.6559926 3 3 0.4586280 -0.8376586
# 3: hello 2 -0.1015588 0.2164430 1 7 0.9299597 1.7216593
# 4: bye 8 -0.2260640 0.3924327 8 2 0.1271187 0.4360063
# 5: bye 7 -1.0720503 0.3256450 7 8 0.5774691 0.7571990
# 6: bye 5 -0.7131021 0.4855804 6 9 0.2687791 1.5398858
# 7: ok 1 -0.4680549 0.8476840 2 4 0.5633317 1.5393945
# 8: ok 4 0.4183264 0.4402595 4 1 0.7592801 2.1829996
# 9: ok 9 -1.4817436 0.5080116 5 5 0.2357030 -0.9953758

# 1) set key
setkey(dt, "type")

# 2) group col-ids by similar operations
id1 <- which(names(dt) %in% c("a", "d", "e"))
id2 <- which(names(dt) %in% c("b","g"))
id3 <- which(names(dt) %in% c("c","f"))

# 3) now use these ids with the .SDcols argument
dt1 <- dt[, lapply(.SD, sum), by="type", .SDcols=id1]
dt2 <- dt[, lapply(.SD, mean), by="type", .SDcols=id2]
dt3 <- dt[, lapply(.SD, median), by="type", .SDcols=id3]

# 4) merge them.
dt1[dt2[dt3]]

# type a d e b g c f
# 1: bye 20 21 19 -0.6704055 0.9110304 0.3924327 0.2687791
# 2: hello 11 13 16 -0.4936211 -0.4408838 0.6559926 0.5661358
# 3: ok 14 11 10 -0.5104907 0.9090061 0.5080116 0.5633317

If/when you have very many columns, building a list like the one you've shown can get cumbersome.
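For example, here is a minimal sketch (not part of the original answer, reusing the dt built above): keep the operation-to-columns mapping in a named list and loop over it, instead of writing one statement per operation.

# hedged sketch: map each summary function name to its columns, then loop
ops <- list(sum = c("a", "d", "e"), mean = c("b", "g"), median = c("c", "f"))
res <- lapply(names(ops), function(op) {
  f <- match.fun(op)  # look the summary function up by its name
  dt[, lapply(.SD, f), by = "type", .SDcols = ops[[op]]]
})
# join the per-operation results back together on "type"
Reduce(function(a, b) merge(a, b, by = "type"), res)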

Converting ddply syntax into data.table

Firstly, your function is terribly inefficient and is not what a function passed to plyr should look like. For ddply(), it should take a generic data frame as input and return a data frame. By 'generic' in this context, I mean a data frame as it would be produced by any one of the 'splits' defined by combinations of the levels of the grouping variables. Your function should look more like this:

count.and.sum <- function(d) data.frame(n = length(d$val), valsum = sum(d$val))

The grouping variable combinations are taken care of in the ddply() call.
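(The dat used below comes from the question and is not reproduced in the answer; here is a hypothetical stand-in with the assumed area/day/type/val columns so the calls can be run:)

library(plyr)
# hypothetical stand-in for the OP's `dat` (column names assumed from the answer)
dat <- data.frame(area = rep(c("a", "b"), each = 6),
                  day  = rep(as.Date("2012-10-01") + 0:2, times = 4),
                  type = rep(c("t1", "t2"), times = 6),
                  val  = runif(12))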

Secondly, your ddply() call creates one-row data frames, because each observation is associated with a unique combination of area, day and type. A more realistic application of ddply() for this toy example would be to summarize by day:

Direct method using summarise as the 'apply' function:

ddply(dat, .(day), summarise, nrow = length(val), valsum = sum(val))

Using count.and.sum:

ddply(dat, .(day), count.and.sum)

This is very likely to be much faster than your version of count.and.sum.

As for an equivalent data.table version (not necessarily the most efficient), try this:

library(data.table)
DT <- data.table(dat, key = c('area', 'day', 'type'))

DT[, list(n = length(val), valsum = sum(val)), by = 'day']

Here's a slightly more elaborate toy example with 100K observations:

library(plyr)
library(lubridate)  # for days()

set.seed(5490)
dat2 <- data.frame(area = sample(letters[1:2], 1e5, replace = TRUE),
                   day = sample(as.Date("2012-10-01") + c(0:10) * days(1),
                                1e5, replace = TRUE),
                   type = sample(paste0("t", 1:2), 1e5, replace = TRUE),
                   val = runif(1e5))

system.time(u <- ddply(dat2, .(area, day, type), summarise,
                       n = length(val), valsum = sum(val)))

DT2 <- data.table(dat2, key = c('area', 'day', 'type'))
system.time(v <- DT2[, list(n = length(val), valsum = sum(val)), by = key(DT2)])

identical(u, as.data.frame(v))

On my system, the data.table version is about 4.5 times faster than the plyr version (0.09s elapsed for plyr, 0.02 for data.table).

data.table `:=` assignment expressions with dynamic inputs (existing columns) and outputs (new column names)

Maybe I don't understand the problem well, but does this suffice:

DT[, (col_out) := .SD[[col_in_one]] + .SD[[col_in_two]],
   .SDcols = c(col_in_one, col_in_two)]
DT
#    a b y bah
# 1: 0 0 2   0
# 2: 0 1 2   1
# 3: 1 0 2   1
# 4: 1 1 2   2
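For context, here is a minimal setup reconstructed from the printed output (the table and the col_* variables are not shown in the original answer, so treat these as assumptions):

library(data.table)
# hypothetical reconstruction: a, b are 0/1 columns, y is the constant 2
DT <- data.table(a = c(0, 0, 1, 1), b = c(0, 1, 0, 1), y = 2)
col_in_one <- "a"
col_in_two <- "b"
col_out    <- "bah"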

To answer the edited question: to get the eval() to work, use .SD as the environment:

DT[, (col_out) := eval(expr, .SD)]
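For instance (a sketch; the original expr is not shown, so this quoted expression is an assumption):

# hypothetical quoted expression over DT's columns
expr <- quote(a + b)
DT[, (col_out) := eval(expr, .SD)]  # .SD makes the columns visible to eval()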

Also, see this question and the update there - eval and quote in data.table

ddply 'drop=FALSE' equivalent in data.table

I think the ddply approach you mentioned may actually be your best option here. I came up with another solution, but to be honest I'd prefer ddply.
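(The dt used below comes from the question; here is a hypothetical stand-in with the assumed columns, chosen so that some grp1/grp2 combinations are absent:)

library(data.table)
# hypothetical input: no ("a","z") or ("b","z") rows, mimicking the gaps below
dt <- data.table(grp1 = c("a", "a", "b", "b", "c", "c", "c"),
                 grp2 = c("x", "y", "x", "y", "x", "y", "z"),
                 some_var1 = 1:7, some_var2 = 7:1)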

Here we go:

f <- function(x, y) list(mean(x * y), sum(x), sum(y))

dt[, c("val_1", "val_2", "val_3") := f(some_var1, some_var2), by = list(grp1, grp2)]
dt[, c("some_var1", "some_var2") := NULL]

Now, we imitate the .drop = FALSE functionality by doing a left join using merge(). The trick here is to construct a left table with expand.grid() (which lives in base R, not the grid package) containing all combinations of grp1 and grp2:

left_tab <- expand.grid(grp1 = unique(dt$grp1), grp2 = unique(dt$grp2))
merge(left_tab, unique(dt), all.x = TRUE)
# grp1 grp2 val_1 val_2 val_3
# 1 a x 3 3 1
# 2 a y 6 6 4
# 3 a z NA NA NA
# 4 b x 6 3 2
# 5 b y 6 8 3
# 6 b z NA NA NA
# 7 c x 5 5 1
# 8 c y 10 5 2
# 9 c z 10 5 2

Do I have this equivalency for data.table and ddply right?

You did not say whether you were getting an error. The dummy.table.file$ prefix looked wrong, and in any case I thought ddply would make val available in the evaluation environment. This runs without error:

filt.test <- ddply(dummy.table, .(group1, group2),
                   mutate,
                   val = filtfilt(filt = low.pass, x = val))
#--------------
str(filt.test)
'data.frame': 30 obs. of 3 variables:
 $ val   : num 0.000239 0.000425 0.000416 0.000402 0.000225 ...
 $ group1: int 1 1 1 1 1 1 1 1 1 1 ...
 $ group2: int 1 2 3 4 5 6 7 8 9 10 ...

I was wondering if it might be better to use summarise instead of mutate, but it looks like I get the same result. I also suspect you are not getting data.table efficiency. That might be improved with the dplyr package, though.
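For reference, filtfilt() and the low.pass filter come from the OP's setup (the signal package) and are not shown here; this is one hypothetical reconstruction that makes the ddply call above runnable:

library(signal)  # provides butter() and filtfilt()
library(plyr)
# hypothetical reconstruction of the OP's objects:
low.pass <- butter(2, 0.1, type = "low")  # 2nd-order low-pass Butterworth filter
dummy.table <- data.frame(val    = runif(30),
                          group1 = rep(1:3, each = 10),
                          group2 = rep(1:10, times = 3))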

create a formula in a data.table environment in R

lm can accept a character string as the formula, so combine that with .SD like this:

> x[, as.list(coef(lm("a ~ b", .SD))), by = id]
   id (Intercept)  b
1:  1          21 -1
2:  2          21 -1
3:  3          21 -1
4:  4          21 -1
5:  5          21 -1
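(The x here comes from the question; a hypothetical reconstruction consistent with the printed coefficients, where a = 21 - b exactly within each id group:)

library(data.table)
# hypothetical data: a is an exact linear function of b in every id group
x <- data.table(id = rep(1:5, each = 4), b = rep(1:4, times = 5))
x[, a := 21 - b]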

Better & faster way to sum & ifelse for a large set of columns in a big data frame using ddply R

We could possibly reduce the time by switching to dplyr. Also, instead of taking the sum and then using ifelse to check and reconvert, this can be done directly by checking for any value greater than 0:

library(dplyr)
dummies %>%
  dplyr::select(id, where(is.numeric)) %>%
  dplyr::group_by(id) %>%
  dplyr::summarise(across(everything(), ~ +(any(. > 0, na.rm = TRUE))))

or using data.table

library(data.table)
setDT(dummies)[, lapply(.SD, function(x) +(any(x > 0, na.rm = TRUE))),
               id, .SDcols = patterns('group')]
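(dummies comes from the question; here is a hypothetical stand-in with an id column plus 0/1 group_* indicator columns, so both snippets can be run:)

# hypothetical stand-in for the OP's `dummies`
set.seed(42)
dummies <- data.frame(id = rep(1:3, each = 4),
                      group_a = rbinom(12, 1, 0.3),
                      group_b = rbinom(12, 1, 0.3))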

Speeding up ddply

You can do that with the shift function from the data.table package. An example for xend:

library(data.table) 
setDT(df)[, xend := shift(x, 1L, fill = x[.N], type = "lag"), by = group]

For all columns:

setDT(df)[, c("xend","yend","zend") := .(shift(x, 1L, fill = x[.N], type = "lag"),
shift(y, 1L, fill = y[.N], type = "lag"),
shift(z, 1L, fill = z[.N], type = "lag")),
by = group]

this gives you:

> head(df)
group x y z xend yend zend
1: 1 0.56725304 0.7539735 0.20542455 0.71538606 0.3864990 0.01586889
2: 1 0.64251519 0.1255183 0.93371528 0.56725304 0.7539735 0.20542455
3: 1 0.14182485 0.7351444 0.89199415 0.64251519 0.1255183 0.93371528
4: 1 0.06613097 0.7625182 0.92669617 0.14182485 0.7351444 0.89199415
5: 1 0.71538606 0.3864990 0.01586889 0.06613097 0.7625182 0.92669617
6: 4 0.27188921 0.5496977 0.09282217 0.27188921 0.5496977 0.09282217

Another approach as suggested by @akrun in the comments:

setDT(df)[, c("xend","yend","zend") := lapply(.SD, function(x) shift(x, fill = x[.N]))
, by = group]

Although this approach requires less typing and gives more flexibility with regard to including variables, it is also considerably slower.
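If you want to verify that on your own machine, here is a benchmarking sketch using the microbenchmark package and the df from the "Used data" section below (timings will vary; .SDcols is added so repeated runs don't pick up the newly created columns):

library(data.table)
library(microbenchmark)
d1 <- as.data.table(df)  # fresh copies for a fair comparison
d2 <- as.data.table(df)
microbenchmark(
  explicit = d1[, c("xend", "yend", "zend") := .(shift(x, fill = x[.N]),
                                                 shift(y, fill = y[.N]),
                                                 shift(z, fill = z[.N])), by = group],
  lapply   = d2[, c("xend", "yend", "zend") := lapply(.SD, function(v)
                   shift(v, fill = v[.N])), by = group, .SDcols = c("x", "y", "z")],
  times = 10L)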


In the question, you stated:

For the last value in the group, the ends are taken as the first point
in the group.

However, according to the desired behavior you described, the last value in a group simply uses the previous value in that group. I suppose you meant:

For the first value in the group, the ends are taken as the last point
in the group.


Used data:

set.seed(1)
n <- 1e5
df <- data.frame(group = sample(1:as.integer(n / 2), n, replace = TRUE),
                 x = runif(n),
                 y = runif(n),
                 z = runif(n))
df <- df[with(df, order(group)), ]

Sending in Column Name to ddply from Function

There has got to be a better way. And I couldn't figure out how to make it work with summarise.

library(plyr)

my.fun <- function(df, count.column) {
  ddply(df, .(x), function(d) sum(d[[count.column]]))
}

dat <- data.frame(x=letters[1:2], y=1:10)

> my.fun(dat, 'y')
x V1
1 a 25
2 b 30
>
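For what it's worth, a small sketch along the same lines (my.fun2 is just an illustrative name) that also gives the output column a nicer name than the default V1:

my.fun2 <- function(df, count.column) {
  out <- ddply(df, .(x), function(d) sum(d[[count.column]]))
  names(out)[2] <- paste0("sum_", count.column)  # replace the default "V1"
  out
}
my.fun2(dat, "y")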

