Using Mean with .Sd and .Sdcols in Data.Table

How to create means and s.d. columns with data.table (based on multiple conditions)

As far as I can tell you want the sum of con per id, for data where ds is between -365 and 0, then the mean, median and standard deviation of those sums. This can be done by chaining two expressions.

DT[ds >= -365 & ds < 0, sum(con), by = "id"][, .(mean(V1), median(V1), sd(V1))]

This first subsets your data, then sums con per id, then takes the mean etc. of the sums with a second expression

Use other columns with .SD in data.table

The issue is that list(cyl, lapply(.)) is returning something that is not really frame-like. If you look at that outside of the data.table environment, that looks like:

str(list(cyl = sum(t$cyl), lapply(t[,c("mpg","disp")], mean)))
# List of 2
# $ cyl: num 198
# $ :List of 2
# ..$ mpg : num 20.1
# ..$ disp: num 231

when a better return would look something like:

str(c(list(cyl = sum(t$cyl)), lapply(t[,c("mpg","disp")], mean)))
# List of 3
# $ cyl : num 198
# $ mpg : num 20.1
# $ disp: num 231

Instead, concatenate two lists:

t[, c(list(cyl = sum(cyl)), lapply(.SD, mean)), .SDcols = c('mpg', 'disp')]
# cyl mpg disp
# <num> <num> <num>
# 1: 198 20.09062 230.7219

or just concatenate a numeric sum(cyl) to the lapply list (thanks BrianMontgomery):

t[, c(cyl = sum(cyl), lapply(.SD, mean)), .SDcols = c('mpg', 'disp')]

Using .SDcols and .SD with two groups of columns in data.table

Edited 2021-06-14 to cover the env parameter available with development version 1.14.1 of data.table.

The release notes for data.table v1.11.0 (01 May 2018) state:

Thus, with= should no longer be needed in any cases. Please change to
using the .. prefix and over the next few years we will start to
formally deprecate and remove the with= parameter.

So, here are three other options:

Using the .. prefix

library(data.table) # development version 1.12.9 used
dt[, (colC) := .SD[, ..colA] * .SD[, ..colB]][]
   a1 a2 a3 b1 b2 b3 c1 c2  c3
1: 1 4 7 11 14 17 11 56 119
2: 2 5 8 12 15 18 24 75 144
3: 3 6 9 13 16 19 39 96 171

Although the result is as expected, there are warning messages

1: In [.data.table(.SD, , ..colA) : Both 'colA'
and '..colA' exist in calling scope. Please remove the '..colA'
variable in calling scope for clarity.

2: In [.data.table(.SD, ,
..colB) : Both 'colB' and '..colB' exist in calling scope. Please
remove the '..colB' variable in calling scope for clarity.

To get rid of the warnings, an intermediate result can be used

tmp <- dt[, ..colA] * dt[, ..colB]
dt[, (colC) := tmp][]

Using mget()

mget() works for me as well:

dt[, (colC) := .SD[, mget(colA)] * .SD[, mget(colB)]][]

EDIT: Using the new env argument

With development version 1.14.1 as of 2021-05-10, a new interface for programming on data.table has been added (see item 10 in NEWS, the new vignette programming on data.table, and jangorecki's comment on https://github.com/Rdatatable/data.table/pull/4304).

The recommend way is to use the new env argument:

dt[, (colC) := .SD[, x] * .SD[, y], env = I(list(x = colA, y = colB))][]
   a1 a2 a3 b1 b2 b3 c1 c2  c3
1: 1 4 7 11 14 17 11 56 119
2: 2 5 8 12 15 18 24 75 144
3: 3 6 9 13 16 19 39 96 171

Data

library(data.table)
colA <- paste0("a", 1:3)
colB <- paste0("b", 1:3)
colC <- paste0("c", 1:3)
dt <- data.table(1:3, 4:6, 7:9, 11:13, 14:16, 17:19)
setnames(dt, c(colA, colB))[]

.SD in data.table join to refer to arbitrary list of columns in i

Another approach would be to melt the data from wide to long and then join to each other.

molten_dt1 = melt(DT1, measure.vars = vals)[, variable := as.integer(substring(variable, 2))]
molten_dt2 = melt(DT2, measure.vars = weights)[, variable := as.integer(substring(variable, 2))]

molten_dt1[molten_dt2,
on = .(k2, variable)
][,
weighted.mean(value, i.value),
by = .(k1, k2)]

The reason that it is not straight forward is that anytime we need to do parallel column lookups (i.e., v1 * w1 and v2 * w2), complications always increase because we need to account for that relationship between columns. Melting the data allows us to simplify our approach because the data structure allows us to join and also we are using vectors in the weighted.mean as opposed to data.frames.

One other note, is that you may be able to simplify the original approach if you create a new weighted.mean() method for lists which allows us to skip the setDT requirement.

## slight changes made to stats:::weighted.mean.default
weighted.mean.list = function (x, w, ..., na.rm = FALSE)
{
x = unlist(x)
if (missing(w)) {
if (na.rm)
x <- x[!is.na(x)]
return(sum(x)/length(x))
}
w = unlist(w)
if (length(w) != length(x))
stop("'x' and 'w' must have the same length")
if (na.rm) {
i <- !is.na(x)
w <- w[i]
x <- x[i]
}
sum((x * w)[w != 0])/sum(w)
}

DT1[DT2, on=.(k2)][, .(wmean = weighted.mean(mget(vals), mget(weights))), by=.(k1,k2)]

rowmean and standard deviation using data.table

The issues lies in sd() which doesn't work row-wise.

x[,
c("meanY",'sdY',"nY") :=
.(rowMeans(.SD, na.rm = TRUE),
apply(.SD, 1, sd, na.rm = TRUE),
rowSums(!is.na(.SD))),
.SDcols = 2:10]

Apply Function Across Columns in data.table with do.call and .SD in R

Here is an option using apply(MARGIN=1, ...)

func <- function(x, threshold) {
if (any(x <= threshold)) return(max(x[x <= threshold]))
NA
}
test_dt_v1[, max_with_filter := apply(.SD, 1, func, threshold=1),
.SDcols=c("a","b","c")]

Another option using do.call and pmax by converting values above 1 to NA first (idea came from rowwise maximum for R)

test_dt_v1[, max_with_filter := do.call(pmax, c(`is.na<-`(.SD, .SD>1), na.rm=T))]

Doing j on multiple .SDcols groups in data.table

If there are multiple groups, one option would be to melt the dataset into 'long' format as melt can take multiple measure patterns

Convert the dataset columns to numeric class

nm1 <- names(DT)[-1]
DT[, (nm1) := lapply(.SD, as.numeric), .SDcols = nm1]

Then melt and dcast those columns

dM <- dcast(melt(DT, measure = patterns("V[2-3]", "V[4-5]"))[, 
lapply(.SD, function(x) x/sum(x)) , V1, .SDcols = value1:value2],
V1~rowid(V1), value.var = c('value1', 'value2'))[, -1, with = FALSE]

and assign the output back to the columns

DT[, (nm1) := dM][]    
# V1 V2 V3 V4 V5
#1: A 0.2857143 0.7142857 0.2500000 0.7500000
#2: B 0.5000000 0.5000000 0.5000000 0.5000000
#3: C 0.5555556 0.4444444 0.7500000 0.2500000
#4: D 0.6666667 0.3333333 0.4444444 0.5555556
#5: E 0.5000000 0.5000000 0.5555556 0.4444444


Related Topics



Leave a reply



Submit