Update multiple data.table columns elegantly
How about
dt[, (names(dt)) := lapply(.SD, function(x) x/mean(x))]
If you need to specify certain columns, you could use
dt[, 1:40 := lapply(.SD, function(x) x/mean(x)), .SDcols = 1:40]
or
cols <- names(dt)[c(1,5,10)]
dt[, (cols) := lapply(.SD, function(x) x/mean(x)), .SDcols = cols]
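As a quick illustration on a small made-up table (the data here is hypothetical, not from the question), normalizing every column by its mean looks like this:

```r
library(data.table)

# hypothetical example data
dt <- data.table(a = 1:4, b = c(2, 4, 6, 8))

# divide every column by its own mean, by reference
dt[, (names(dt)) := lapply(.SD, function(x) x/mean(x))]

# a is now c(0.4, 0.8, 1.2, 1.6); b is now c(0.4, 0.8, 1.2, 1.6)
```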
Elegantly assigning multiple columns in data.table with lapply()
Yes, you're right in the question here:
I understand it is more efficient to loop over a vector of column names using := to assign:
for (col in paste0("V", 20:100))
dt[, col := sqrt(dt[[col]]), with = FALSE]
Aside: note that the new way of doing that is:
for (col in paste0("V", 20:100))
dt[ , (col) := sqrt(dt[[col]])]
because with with = FALSE it wasn't easy to read whether it referred to the LHS or the RHS of :=. End aside.
As you know, that's efficient because that does each column one by one, so working memory is only needed for one column at a time. That can make a difference between it working and it failing with the dreaded out-of-memory error.
The problem with lapply on the RHS of := is that the RHS (the lapply) is evaluated first; i.e., the result for the 80 columns is created. That's 80 columns' worth of new memory which has to be allocated and populated. So you need 80 columns' worth of free RAM for that operation to succeed. That RAM usage dominates vs the subsequently instant operation of assigning (plonking) those 80 new columns into the data.table's column pointer slots.
As @Frank pointed out, if you have a lot of columns (say 10,000 or more) then the small overhead of dispatching to the [.data.table method starts to add up. To eliminate that overhead, there is data.table::set, which under ?set is described as a "loopable" :=. I use a for loop for this type of operation. It's the fastest way and is fairly easy to write and read.
for (col in paste0("V", 20:100))
set(dt, j = col, value = sqrt(dt[[col]]))
Although with just 80 columns, it's unlikely to matter. (Note it may be more common to loop set over a large number of rows than a large number of columns.) However, looped set doesn't solve the problem of the repeated reference to the dt symbol name that you mentioned in the question:
I don't like this because I don't like reference the data.table in a j expression.
Agreed. So the best I can do is revert to your looping of := but use get instead.
for (col in paste0("V", 20:100))
dt[, (col) := sqrt(get(col))]
However, I fear that using get in j carries an overhead. Benchmarking is being done in #1380. Also, perhaps it is confusing to use get() on the RHS but not on the LHS. To address that we could sugar the LHS and allow get() as well, #1381:
for (col in paste0("V", 20:100))
dt[, get(col) := sqrt(get(col))]
Also, maybe value of set could be run within the scope of DT, #1382.
for (col in paste0("V", 20:100))
set(dt, j = col, value = sqrt(get(col)))
updating multiple columns of selected rows of a data table with duplicate key values
Here is one possibility:
DT1[is.na(v1), c("v1", "v2") := DT2[.SD[["id2"]], list(v1, v2)]]
# id1 id2 v1 v2
#1: 1 a -0.6264538 1.5952808
#2: 2 c -0.8356286 0.5757814
#3: 1 e 3.0000000 33.0000000
#4: 2 e 0.1836433 0.4874291
#5: 2 e 3.0000000 33.0000000
#6: 1 n 1.0000000 11.0000000
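The inputs aren't reproduced above, but a minimal sketch with hypothetical data shows the mechanics: DT2 is keyed on id2, so DT2[.SD[["id2"]], ...] joins the id2 values of the NA rows against DT2 and pulls back the replacement v1/v2 pairs.

```r
library(data.table)

# hypothetical inputs: DT1 has gaps in v1/v2, DT2 holds the fill-in values
DT1 <- data.table(id1 = c(1, 2, 1), id2 = c("a", "b", "a"),
                  v1 = c(NA, 2, NA), v2 = c(NA, 20, NA))
DT2 <- data.table(id2 = c("a", "b"), v1 = c(3, 33), v2 = c(11, 111),
                  key = "id2")

# for rows where v1 is NA, look up v1 and v2 in DT2 by id2
DT1[is.na(v1), c("v1", "v2") := DT2[.SD[["id2"]], list(v1, v2)]]

# rows 1 and 3 (id2 == "a") now have v1 = 3, v2 = 11; row 2 is untouched
```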
Update multiple columns of a table conditionally in a single command
It would be wise to add a WHERE
clause.
UPDATE names
SET name = CASE WHEN name = 'Mic' THEN 'Michael' ELSE name END
,client_name = CASE WHEN client_name = 'Mic' THEN 'Michael'
ELSE client_name END
,requester_name = CASE WHEN requester_name = 'Mic' THEN 'Michael'
ELSE requester_name END
WHERE 'Mic' IN (name, client_name, requester_name);
Else, the whole table will be updated unconditionally. Updates that change values to the same value are still updates: they create dead rows, fire triggers, and so on. While the resulting rows would not be wrong, it would still bloat the table to twice its size, make VACUUM necessary, and be generally very slow.
BTW, either form of the CASE expression is good here.
Changing multiple Columns in data.table r
One option would be to use set, as this involves multiple columns. The advantage of using set is that it avoids the overhead of [.data.table, making it faster.
library(data.table)
for (j in cols) {
  set(dt, i = NULL, j = j, value = dt[[j]]/dt[[j]][rownum])
}
Or a slightly slower option would be
dt[, (cols) := lapply(.SD, function(x) x/x[rownum]), .SDcols = cols]
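With cols and rownum filled in on a small hypothetical table, the set loop divides each column by its value in row rownum:

```r
library(data.table)

# hypothetical data and parameters
dt <- data.table(a = c(2, 4, 8), b = c(10, 20, 40))
cols <- c("a", "b")
rownum <- 1  # scale each column by its first-row value

for (j in cols) {
  set(dt, i = NULL, j = j, value = dt[[j]]/dt[[j]][rownum])
}

# a is now c(1, 2, 4); b is now c(1, 2, 4)
```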
Apply a function to every specified column in a data.table and update by reference
This seems to work:
dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]
The result is
a b d
1: -1 -1 1
2: -2 -2 2
3: -3 -3 3
There are a few tricks here:
- Because there are parentheses in (cols) :=, the result is assigned to the columns specified in cols, instead of to some new variable named "cols".
- .SDcols tells the call that we're only looking at those columns, and allows us to use .SD, the Subset of the Data associated with those columns.
- lapply(.SD, ...) operates on .SD, which is a list of columns (like all data.frames and data.tables). lapply returns a list, so in the end j looks like cols := list(...).
EDIT: Here's another way that is probably faster, as @Arun mentioned:
for (j in cols) set(dt, j = j, value = -dt[[j]])
Can I programmatically update the type of a set of columns (to factors) in data.table?
Yes, this is fairly straightforward:
dt1[, (vars.factors) := lapply(.SD, as.factor), .SDcols=vars.factors]
In the LHS (of := in j), we specify the names of the columns. If a column already exists, it'll be updated; else, a new column will be created. In the RHS, we loop over all the columns in .SD (which stands for Subset of Data), and we specify the columns that should be in .SD with the .SDcols argument.
Following up on a comment:
Note that we need to wrap the LHS with () for it to be evaluated, fetching the column names within the vars.factors variable. This is because we allow the syntax
DT[, col := value]
when there's only one column to assign, by specifying the column name as a symbol (without quotes), purely for convenience. This creates a column named col and assigns value to it.
To tell these two cases apart, we need the (). Wrapping the LHS with () is sufficient to signal that we really want the values within the variable.
What is the most elegant way to apply a function to multiple pairs of columns in a data.table or data.frame?
1) gv Using gv in the collapse package we could do this:
library(collapse)
DT[, (result.cols) := gv(.SD, one.cols) - gv(.SD, two.cols)]
2) gvr We can alternatively use the regex variant of gv to eliminate one.cols and two.cols:
library(collapse)
result.cols <- sub(1, 3, gvr(DT, "1$", "names"))
DT[, (result.cols) := gvr(.SD, "1$") - gvr(.SD, "2$")]
3) across Using dplyr we can use across, eliminating result.cols as well.
library(dplyr)
DT %>%
  mutate(across(ends_with("1"), .names = "{sub(1,3,.col)}") - across(ends_with("2")))
4) data.table If we write it like this it is straight forward in data.table:
DT[, result.cols] <- DT[, ..one.cols] - DT[, ..two.cols]
or
DT[, (result.cols) := .SD[, one.cols, with=FALSE] - .SD[, two.cols, with=FALSE]]
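For a concrete run of option 4, assume a hypothetical layout where paired columns end in 1 and 2 and results take the suffix 3:

```r
library(data.table)

# hypothetical paired columns
DT <- data.table(x1 = 1:3, x2 = 4:6, y1 = 7:9, y2 = 1:3)
one.cols <- c("x1", "y1")
two.cols <- c("x2", "y2")
result.cols <- sub(1, 3, one.cols)  # "x3" "y3"

# columnwise difference of the two groups, assigned by reference
DT[, (result.cols) := .SD[, one.cols, with = FALSE] -
                      .SD[, two.cols, with = FALSE]]

# DT$x3 is x1 - x2, i.e. c(-3, -3, -3); DT$y3 is c(6, 6, 6)
```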