How to create means and s.d. columns with data.table (based on multiple conditions)
As far as I can tell you want the sum of con
per id
, for data where ds
is between -365 and 0, then the mean, median and standard deviation of those sums. This can be done by chaining two expressions.
DT[ds >= -365 & ds < 0, sum(con), by = "id"][, .(mean(V1), median(V1), sd(V1))]
This first subsets your data, then sums con
per id
, then takes the mean etc. of the sums with a second expression
Use other columns with .SD in data.table
The issue is that list(cyl, lapply(.))
is returning something that is not really frame-like. If you look at that outside of the data.table
environment, that looks like:
str(list(cyl = sum(t$cyl), lapply(t[,c("mpg","disp")], mean)))
# List of 2
# $ cyl: num 198
# $ :List of 2
# ..$ mpg : num 20.1
# ..$ disp: num 231
when a better return would look something like:
str(c(list(cyl = sum(t$cyl)), lapply(t[,c("mpg","disp")], mean)))
# List of 3
# $ cyl : num 198
# $ mpg : num 20.1
# $ disp: num 231
Instead, c
oncatenate two lists:
t[, c(list(cyl = sum(cyl)), lapply(.SD, mean)), .SDcols = c('mpg', 'disp')]
# cyl mpg disp
# <num> <num> <num>
# 1: 198 20.09062 230.7219
or just concatenate a numeric sum(cyl)
to the lapply
list (thanks BrianMontgomery):
t[, c(cyl = sum(cyl), lapply(.SD, mean)), .SDcols = c('mpg', 'disp')]
Using .SDcols and .SD with two groups of columns in data.table
Edited 2021-06-14 to cover the env
parameter available with development version 1.14.1 of data.table
.
The release notes for data.table v1.11.0 (01 May 2018) state:
Thus,
with=
should no longer be needed in any cases. Please change to
using the..
prefix and over the next few years we will start to
formally deprecate and remove thewith=
parameter.
So, here are three other options:
Using the ..
prefix
library(data.table) # development version 1.12.9 used
dt[, (colC) := .SD[, ..colA] * .SD[, ..colB]][]
a1 a2 a3 b1 b2 b3 c1 c2 c3
1: 1 4 7 11 14 17 11 56 119
2: 2 5 8 12 15 18 24 75 144
3: 3 6 9 13 16 19 39 96 171
Although the result is as expected, there are warning messages
1: In [.data.table(.SD, , ..colA) : Both 'colA'
and '..colA' exist in calling scope. Please remove the '..colA'
variable in calling scope for clarity.
2: In [.data.table(.SD, ,
..colB) : Both 'colB' and '..colB' exist in calling scope. Please
remove the '..colB' variable in calling scope for clarity.
To get rid of the warnings, an intermediate result can be used
tmp <- dt[, ..colA] * dt[, ..colB]
dt[, (colC) := tmp][]
Using mget()
mget()
works for me as well:
dt[, (colC) := .SD[, mget(colA)] * .SD[, mget(colB)]][]
EDIT: Using the new env
argument
With development version 1.14.1 as of 2021-05-10, a new interface for programming on data.table has been added (see item 10 in NEWS, the new vignette programming on data.table, and jangorecki's comment on https://github.com/Rdatatable/data.table/pull/4304).
The recommend way is to use the new env
argument:
dt[, (colC) := .SD[, x] * .SD[, y], env = I(list(x = colA, y = colB))][]
a1 a2 a3 b1 b2 b3 c1 c2 c3
1: 1 4 7 11 14 17 11 56 119
2: 2 5 8 12 15 18 24 75 144
3: 3 6 9 13 16 19 39 96 171
Data
library(data.table)
colA <- paste0("a", 1:3)
colB <- paste0("b", 1:3)
colC <- paste0("c", 1:3)
dt <- data.table(1:3, 4:6, 7:9, 11:13, 14:16, 17:19)
setnames(dt, c(colA, colB))[]
.SD in data.table join to refer to arbitrary list of columns in i
Another approach would be to melt the data from wide to long and then join to each other.
molten_dt1 = melt(DT1, measure.vars = vals)[, variable := as.integer(substring(variable, 2))]
molten_dt2 = melt(DT2, measure.vars = weights)[, variable := as.integer(substring(variable, 2))]
molten_dt1[molten_dt2,
on = .(k2, variable)
][,
weighted.mean(value, i.value),
by = .(k1, k2)]
The reason that it is not straight forward is that anytime we need to do parallel column lookups (i.e., v1 * w1
and v2 * w2
), complications always increase because we need to account for that relationship between columns. Melting the data allows us to simplify our approach because the data structure allows us to join and also we are using vectors in the weighted.mean
as opposed to data.frames.
One other note, is that you may be able to simplify the original approach if you create a new weighted.mean()
method for lists which allows us to skip the setDT
requirement.
## slight changes made to stats:::weighted.mean.default
weighted.mean.list = function (x, w, ..., na.rm = FALSE)
{
x = unlist(x)
if (missing(w)) {
if (na.rm)
x <- x[!is.na(x)]
return(sum(x)/length(x))
}
w = unlist(w)
if (length(w) != length(x))
stop("'x' and 'w' must have the same length")
if (na.rm) {
i <- !is.na(x)
w <- w[i]
x <- x[i]
}
sum((x * w)[w != 0])/sum(w)
}
DT1[DT2, on=.(k2)][, .(wmean = weighted.mean(mget(vals), mget(weights))), by=.(k1,k2)]
rowmean and standard deviation using data.table
The issues lies in sd()
which doesn't work row-wise.
x[,
c("meanY",'sdY',"nY") :=
.(rowMeans(.SD, na.rm = TRUE),
apply(.SD, 1, sd, na.rm = TRUE),
rowSums(!is.na(.SD))),
.SDcols = 2:10]
Apply Function Across Columns in data.table with do.call and .SD in R
Here is an option using apply(MARGIN=1, ...)
func <- function(x, threshold) {
if (any(x <= threshold)) return(max(x[x <= threshold]))
NA
}
test_dt_v1[, max_with_filter := apply(.SD, 1, func, threshold=1),
.SDcols=c("a","b","c")]
Another option using do.call
and pmax
by converting values above 1 to NA first (idea came from rowwise maximum for R)
test_dt_v1[, max_with_filter := do.call(pmax, c(`is.na<-`(.SD, .SD>1), na.rm=T))]
Doing j on multiple .SDcols groups in data.table
If there are multiple groups, one option would be to melt
the dataset into 'long' format as melt
can take multiple measure
patterns
Convert the dataset columns to numeric
class
nm1 <- names(DT)[-1]
DT[, (nm1) := lapply(.SD, as.numeric), .SDcols = nm1]
Then melt
and dcast
those columns
dM <- dcast(melt(DT, measure = patterns("V[2-3]", "V[4-5]"))[,
lapply(.SD, function(x) x/sum(x)) , V1, .SDcols = value1:value2],
V1~rowid(V1), value.var = c('value1', 'value2'))[, -1, with = FALSE]
and assign the output back to the columns
DT[, (nm1) := dM][]
# V1 V2 V3 V4 V5
#1: A 0.2857143 0.7142857 0.2500000 0.7500000
#2: B 0.5000000 0.5000000 0.5000000 0.5000000
#3: C 0.5555556 0.4444444 0.7500000 0.2500000
#4: D 0.6666667 0.3333333 0.4444444 0.5555556
#5: E 0.5000000 0.5000000 0.5555556 0.4444444
Related Topics
Percentage Histogram with Facet_Wrap
How to Apply a Hierarchical or K-Means Cluster Analysis Using R
What Are Helpful Optimizations in R for Big Data Sets
Model Matrix with All Pairwise Interactions Between Columns
What Are Productive Ways to Debug Rcpp Compiled Code Loaded in R (On Os X Mavericks)
R: How to Make a Barplot with Labels Parallel (Horizontal) to Bars
Treat Na as Zero Only When Adding a Number
S4 Classes: Multiple Types Per Slot
Colons Equals Operator in R? New Syntax
Warning When Defining Factor: Duplicated Levels in Factors Are Deprecated
Multiply Columns in a Data Frame by a Vector
Subset Data Based on Partial Match of Column Names
How to Increase Stack Space Overflow for Pandoc in R
Rcharts with Highcharts as Shiny Application