Select / assign to data.table when variable names are stored in a character vector
Two ways to programmatically select variable(s):
with = FALSE:
DT = data.table(col1 = 1:3)
colname = "col1"
DT[, colname, with = FALSE]
# col1
# 1: 1
# 2: 2
# 3: 3
'dot dot' (..) prefix:
DT[, ..colname]
# col1
# 1: 1
# 2: 2
# 3: 3
For further description of the 'dot dot' (..) notation, see New Features in 1.10.2 (it is currently not described in help text).
To assign to variable(s), wrap the LHS of := in parentheses:
DT[, (colname) := 4:6]
# col1
# 1: 4
# 2: 5
# 3: 6
The latter is known as a column plonk, because you replace the whole column vector by reference. If a subset i were present, it would subassign by reference. The parens around (colname) are a shorthand introduced in version v1.9.4 on CRAN Oct 2014. Here is the news item:
Using with = FALSE with := is now deprecated in all cases, given that wrapping the LHS of := with parentheses has been preferred for some time.
colVar = "col1"
DT[, (colVar) := 1] # please change to this
DT[, c("col1", "col2") := 1] # no change
DT[, 2:4 := 1] # no change
DT[, c("col1","col2") := list(sum(a), mean(b))] # no change
DT[, `:=`(...), by = ...] # no change
See also the Details section in ?`:=`:
DT[i, (colnamevector) := value]
# [...] The parens are enough to stop the LHS being a symbol
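The same parenthesized form extends to several columns at once. The following is a minimal sketch, not taken from the original answer; the table and the names new_cols and src_cols are made up for illustration.

```r
library(data.table)

DT <- data.table(a = 1:3, b = 4:6)

# Hypothetical column names held in character vectors.
new_cols <- c("a_sq", "b_sq")
src_cols <- c("a", "b")

# The parentheses make data.table evaluate new_cols instead of
# creating a single column literally named "new_cols".
DT[, (new_cols) := lapply(.SD, function(x) x^2), .SDcols = src_cols]

# With a subset in i, only the matching rows are sub-assigned by reference.
DT[a > 1, (new_cols) := 0]
DT
```

After these two calls, only the first row still holds the squared values; the rows matched by i have been overwritten in place.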
And to answer the further question in the comment, here's one way (as usual there are many ways):
DT[, (colname) := cumsum(get(colname))]
# col1
# 1: 4
# 2: 9
# 3: 15
or, you might find it easier to read, write and debug just to eval a paste, similar to constructing a dynamic SQL statement to send to a server:
expr = paste0("DT[,",colname,":=cumsum(",colname,")]")
expr
# [1] "DT[,col1:=cumsum(col1)]"
eval(parse(text=expr))
# col1
# 1: 4
# 2: 13
# 3: 28
If you do that a lot, you can define a helper function EVAL:
EVAL = function(...)eval(parse(text=paste0(...)),envir=parent.frame(2))
EVAL("DT[,",colname,":=cumsum(",colname,")]")
# col1
# 1: 4
# 2: 17
# 3: 45
Now that data.table 1.8.2 automatically optimizes j for efficiency, it may be preferable to use the eval method. The get() in j prevents some optimizations, for example.
Or, there is set(), a low-overhead, functional form of :=, which would be fine here. See ?set.
set(DT, j = colname, value = cumsum(DT[[colname]]))
DT
# col1
# 1: 4
# 2: 21
# 3: 66
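Because set() takes the column name as an ordinary argument, it also fits naturally in a loop over several names. This is a small sketch on a made-up table; the vector cols is hypothetical and not part of the original answer.

```r
library(data.table)

DT <- data.table(x = 1:3, y = c(10, 20, 30))

# Hypothetical vector of column names to transform.
cols <- c("x", "y")

for (col in cols) {
  # set() updates DT by reference, one column per call.
  set(DT, j = col, value = cumsum(DT[[col]]))
}
DT
```

Each iteration replaces one column with its cumulative sum, without going through [.data.table at all.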
Selecting columns of a data.table using a vector of column names or column positions without using with = F
An option is to use double dots
DT[, ..mycols]
# A C
#1: 0.1188208 -0.17328827
#2: -0.5622505 0.84231231
#3: 0.8111072 -1.59802306
#4: 0.7968823 2.08468489
# ...
Or specify it in .SDcols
DT[, .SD, .SDcols = mycols]
or else with = FALSE, as the OP mentioned in the post.
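To compare the options side by side, here is a minimal sketch on a made-up table. The objects DT, mycols and mypos are assumptions for illustration; the question's own data is not shown in this page.

```r
library(data.table)

DT <- data.table(A = 1:3, B = 4:6, C = 7:9)
mycols <- c("A", "C")   # column names
mypos  <- c(1L, 3L)     # hypothetical positions of the same columns

by_dots   <- DT[, ..mycols]                  # 'dot dot' prefix
by_sdcols <- DT[, .SD, .SDcols = mycols]     # .SDcols
by_with   <- DT[, mycols, with = FALSE]      # with = FALSE
by_pos    <- DT[, ..mypos]                   # positions work with '..' as well

# Each result should hold the same two columns.
names(by_pos)
```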
How to assign a variable values stored in a vector to a series of variable names stored in a character vector in R?
assign is not vectorized, so you can use Map here, specifying the environment.
Map(function(x, y) assign(x, y, envir = .GlobalEnv), my_variables, my_values)
A
#[1] 1
B
#[1] 2
C
#[1] 3
However, it is not a good practice to have such variables in the global environment.
Use a named vector :
name_vec <- setNames(my_values, my_variables)
name_vec
#A B C
#1 2 3
Or a named list: as.list(name_vec).
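As a rough sketch of how the named list can stand in for separate global variables: the values below are assumed from the output shown above, and key is a hypothetical lookup variable.

```r
# Assumed inputs, matching the output shown in the answer.
my_variables <- c("A", "B", "C")
my_values <- c(1, 2, 3)

name_list <- as.list(setNames(my_values, my_variables))

# Look a value up by a name stored in a character variable.
key <- "B"
name_list[[key]]

# Use the values in an expression without touching the global environment.
with(name_list, A + B + C)
```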
How to assign vector of strings as variable names, in for loop, in data.table, in dplyr
This is a common situation which can be handled with ease by using a list.
This is what I would do if the data files are different in structure, i.e., columns differ in names, data types, or order:
library(data.table)
file_names <- list.files(pattern = "\\.csv$") # pattern is a regular expression, not a glob
list_of_df <- lapply(file_names, fread)
list_of_df <- setNames(list_of_df, file_names)
list_of_df
$area.csv
id name
1: 1 normal name
2: 2 with,comma
3: 3 with%percent
$farmland.csv
id name
1: 1 normal name
2: 2 with,comma
3: 3 with%percent
$GDPpercapita.csv
id name
1: 1 normal name
2: 2 with,comma
3: 3 with%percent
Note that I have made up three sample files for demonstration. See Data section for details.
The elements of the resulting list object list_of_df are named like the files the data were loaded from.
Now, we can operate on the elements of the list using lapply() or a for loop, e.g.,
lapply(
  list_of_df,
  function(df) df[, lapply(.SD, function(col) if (is.character(col)) stringr::str_remove_all(col, "[,%]") else col)]
)
$area.csv
id name
1: 1 normal name
2: 2 withcomma
3: 3 withpercent
$farmland.csv
id name
1: 1 normal name
2: 2 withcomma
3: 3 withpercent
$GDPpercapita.csv
id name
1: 1 normal name
2: 2 withcomma
3: 3 withpercent
Note that the code to remove , and % has been simplified.
lapply() has the advantage over a for loop that it returns a list again, which is convenient for subsequent processing steps.
As a side note: there is a speciality with data.table as it is able to update by reference, i.e., without copying the data.table. So, we can update list_of_df in place, which might be a benefit in terms of speed and memory consumption for large datasets:
address(list_of_df) # just for demonstration
for (df in list_of_df) {
  cols <- which(sapply(df, is.character))
  df[, (cols) := lapply(.SD, stringr::str_remove_all, "[,%]"), .SDcols = cols]
}
address(list_of_df)
The calls to address(list_of_df) before and after the for loop have been added to demonstrate that list_of_df still occupies the same storage location but has been changed in place.
list_of_df
$area.csv
id name
1: 1 normal name
2: 2 withcomma
3: 3 withpercent
$farmland.csv
id name
1: 1 normal name
2: 2 withcomma
3: 3 withpercent
$GDPpercapita.csv
id name
1: 1 normal name
2: 2 withcomma
3: 3 withpercent
In case the datasets read from file have a similar structure, i.e., the same names, order, and data types of columns, we can combine the single pieces into one large dataset using rbindlist().
My preferred workflow for this use case is along these lines:
library(data.table)
library(magrittr)
file_names <- list.files(pattern = "\\.csv$") # pattern is a regular expression, not a glob
big_df <- lapply(file_names, fread) %>%
  set_names(file_names) %>%
  rbindlist(idcol = "file_name")
big_df
file_name id name
1: area.csv 1 normal name
2: area.csv 2 with,comma
3: area.csv 3 with%percent
4: farmland.csv 1 normal name
5: farmland.csv 2 with,comma
6: farmland.csv 3 with%percent
7: GDPpercapita.csv 1 normal name
8: GDPpercapita.csv 2 with,comma
9: GDPpercapita.csv 3 with%percent
Note that rbindlist() has created an id column from the names of the list elements. This allows for distinguishing the origin of each row.
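The idcol behaviour can be tried without touching the file system. The following is a small sketch with an in-memory list; the element names first and second and the column name source are made up for illustration.

```r
library(data.table)

# Hypothetical named list standing in for the files read from disk.
pieces <- list(
  first  = data.table(id = 1:2, name = c("a", "b")),
  second = data.table(id = 3L, name = "c")
)

# The list names end up in the column given by idcol.
combined <- rbindlist(pieces, idcol = "source")
combined
```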
Working with one uniform data structure simplifies subsequent processing:
cols <- which(sapply(big_df, is.character))
big_df[, (cols) := lapply(.SD, stringr::str_remove_all, "[,%]"), .SDcols = cols]
big_df
file_name id name
1: area.csv 1 normal name
2: area.csv 2 withcomma
3: area.csv 3 withpercent
4: farmland.csv 1 normal name
5: farmland.csv 2 withcomma
6: farmland.csv 3 withpercent
7: GDPpercapita.csv 1 normal name
8: GDPpercapita.csv 2 withcomma
9: GDPpercapita.csv 3 withpercent
As the OP is using mutate(), here is an all "tidyverse" approach. It does essentially the same as the data.table versions above:
library(purrr)
library(dplyr)
file_names <- list.files(pattern = "\\.csv$") # pattern is a regular expression, not a glob
list_of_df <- map(file_names, readr::read_csv) %>%
  set_names(file_names)
list_of_df %>%
  map( ~ mutate(.x, across(where(is.character), ~ stringr::str_remove_all(.x, "[,%]"))))
$area.csv
# A tibble: 3 x 2
id name
<dbl> <chr>
1 1 normal name
2 2 withcomma
3 3 withpercent
$farmland.csv
# A tibble: 3 x 2
id name
<dbl> <chr>
1 1 normal name
2 2 withcomma
3 3 withpercent
$GDPpercapita.csv
# A tibble: 3 x 2
id name
<dbl> <chr>
1 1 normal name
2 2 withcomma
3 3 withpercent
map() is the equivalent of base R's lapply(). Also, readr::read_csv() is used instead of data.table's fread().
Data
Caveat: The code below will create 3 files in the current working directory!
library(data.table)
dummy <- data.table(id = 1:3, name = c("normal name", "with,comma", "with%percent"))
extern <- c("area.csv", "farmland.csv", "GDPpercapita.csv")
for (fn in extern) fwrite(dummy, fn)
The code saves a dummy data.table to disk three times as a CSV file, using three different file names.
creating, directly, data.tables with column names from variables, and using variables for column names with :=
For the first question, I'm not absolutely sure, but you may want to try and see if fread is of any help creating an empty data.table with named columns.
As for the second question, try
DT[, c(nameOfCols) := 10]
where nameOfCols is the vector with the names of the columns you want to modify. See ?data.table.
Use variable name to calculate or modify columns in a data.table
You can use :
library(data.table)
name = c("Bob","Mary","Jane","Kim")
weight = c(60,65,45,55)
height = c(170,165,140,135)
dft = data.table(name,weight,height)
col1 <- 'weight'
col2 <- 'height'
dft[, (col1) := get(col2) + 13]
dft
# name weight height
#1: Bob 183 170
#2: Mary 178 165
#3: Jane 153 140
#4: Kim 148 135
r data.table row subset with column name as a variable
I guess you are looking for get:
library(data.table)
DT <- data.table(x1=1:11, x2=11:21)
var <- "x1"
DT[get(var)==1,]
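If the same kind of filter is needed in several places, one possible variation is to index the column with [[ inside a small helper. This is only a sketch; filter_eq is a hypothetical function name and not part of data.table.

```r
library(data.table)

DT <- data.table(x1 = 1:11, x2 = 11:21)

# Hypothetical helper: keep rows where column `col` equals `value`.
filter_eq <- function(dt, col, value) {
  dt[dt[[col]] == value]
}

res <- filter_eq(DT, "x2", 15L)
res
```

Here dt[[col]] extracts the column as a plain vector, so the comparison produces an ordinary logical vector for i.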
data.table grouped operations with variable names of columns without slow DT[, mean(get(colName)), by = grp]
It would be better to pass the dataset d itself to the FOO function instead of passing the character string "d". Also, you can use lapply combined with .SD so that you can benefit from internal optimization instead of using mean(get(colName)).
FOO2 = function(dataName = d, colName = "x") { # d instead of "d" passed to the first argument!
  dataName[, lapply(.SD, mean), by = grp, .SDcols = colName]
}
Benchmark: FOO vs FOO2
set.seed(147852)
n <- 1e7
d <- data.table(x = 1:n, grp = sample(1:1e5, n, replace = TRUE))
microbenchmark::microbenchmark(
  FOO(),
  FOO2(),
  times = 5L
)
Unit: milliseconds
expr min lq mean median uq max neval
FOO() 4632.4014 4672.7781 4787.4958 4707.9023 4846.7081 5077.6893 5
FOO2() 255.0828 267.1322 297.0389 275.4467 281.9873 405.5456 5
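Before relying on the faster form, it can be worth confirming on a tiny table that both formulations agree. The sketch below assumes the slow variant has the mean(get(colName)) shape named in the question title, since the original FOO is not shown here; the data is made up.

```r
library(data.table)

d <- data.table(x = c(1, 2, 3, 4), grp = c("a", "a", "b", "b"))
colName <- "x"

# Assumed shape of the slow approach from the question title.
slow <- d[, .(x = mean(get(colName))), by = grp]

# The .SDcols approach used in FOO2.
fast <- d[, lapply(.SD, mean), by = grp, .SDcols = colName]

# Compare the two results.
all.equal(slow, fast)
```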