Select/Assign to Data.Table When Variable Names Are Stored in a Character Vector


Two ways to programmatically select variable(s):

with = FALSE:

 DT = data.table(col1 = 1:3)
 colname = "col1"
 DT[, colname, with = FALSE] 
 #    col1
 # 1:    1
 # 2:    2
 # 3:    3

'dot dot' (..) prefix:

 DT[, ..colname]    
 #    col1
 # 1:    1
 # 2:    2
 # 3:    3

For further description of the 'dot dot' (..) notation, see New Features in 1.10.2 (it is currently not described in help text).

To assign to variable(s), wrap the LHS of := in parentheses:

DT[, (colname) := 4:6]    
#    col1
# 1:    4
# 2:    5
# 3:    6

This is known as a column plonk, because the whole column vector is replaced by reference. If a subset i were present, it would sub-assign by reference instead. The parentheses around (colname) are a shorthand introduced in v1.9.4 (on CRAN, Oct 2014). Here is the news item:

Using with = FALSE with := is now deprecated in all cases, given that wrapping
the LHS of := with parentheses has been preferred for some time.

colVar = "col1"

DT[, (colVar) := 1]                             # please change to this
DT[, c("col1", "col2") := 1]                    # no change
DT[, 2:4 := 1]                                  # no change
DT[, c("col1","col2") := list(sum(a), mean(b))]  # no change
DT[, `:=`(...), by = ...]                       # no change
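To illustrate the difference between a column plonk and sub-assignment mentioned above, here is a minimal sketch. The table and the extra column col2 are made up for illustration and are not part of the original answer.

```r
library(data.table)

# Hypothetical example table; col2 exists only to have something to filter on
DT <- data.table(col1 = 1:3, col2 = c("a", "b", "c"))
colname <- "col1"

# Column plonk: no i, so the whole column vector is replaced by reference
DT[, (colname) := 4:6]

# Sub-assignment: i is present, so only the matching rows are overwritten
DT[col2 == "b", (colname) := 0L]

DT$col1
```

In both calls the parentheses force data.table to evaluate colname instead of creating a column literally named "colname".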

Selecting columns of a data.table using a vector of column names or column positions without using with = F

One option is to use the double-dot prefix:

DT[, ..mycols]
#          A           C
#1:  0.1188208 -0.17328827
#2: -0.5622505  0.84231231
#3:  0.8111072 -1.59802306
#4:  0.7968823  2.08468489
# ...

Or specify it in .SDcols

DT[, .SD, .SDcols = mycols]

Or else use with = FALSE, as the OP mentioned in the post.
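The snippets above assume that DT and mycols already exist. As a self-contained sketch (the table contents and the names mycols and mypos are assumptions made for illustration), the three approaches can be compared side by side, including selection by column position:

```r
library(data.table)

# Hypothetical example data
DT <- data.table(A = 1:3, B = letters[1:3], C = c(0.5, 1.5, 2.5))
mycols <- c("A", "C")   # column names
mypos  <- c(1L, 3L)     # the same columns, given by position

by_dots   <- DT[, ..mycols]                    # double-dot prefix
by_sdcols <- DT[, .SD, .SDcols = mycols]       # .SDcols
by_with   <- DT[, mycols, with = FALSE]        # with = FALSE
by_pos    <- DT[, ..mypos]                     # double-dot with positions

names(by_dots)
```

All four results should contain the same two columns, so which form to use is mostly a matter of taste.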

How to assign a variable values stored in a vector to a series of variable names stored in a character vector in R?

assign is not vectorized, so you can use Map here specifying the environment.

Map(function(x, y) assign(x, y, envir = .GlobalEnv), my_variables, my_values)

A
#[1] 1
B
#[1] 2
C
#[1] 3

However, it is not a good practice to have such variables in the global environment.

Use a named vector instead:

name_vec <- setNames(my_values, my_variables)
name_vec
#A B C 
#1 2 3

Or use a named list: as.list(name_vec).
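A short sketch of how the named vector or list can then stand in for separate global variables. The contents of my_variables and my_values are assumed here, chosen to match the output shown above:

```r
# Assumed inputs, consistent with the printed results A = 1, B = 2, C = 3
my_variables <- c("A", "B", "C")
my_values    <- c(1, 2, 3)

name_vec  <- setNames(my_values, my_variables)
name_list <- as.list(name_vec)

# Look values up by name instead of creating variables A, B, C
name_vec[["B"]]
name_list$C
```

Keeping the values in one object means they can be passed around, looped over, or removed together, which is harder with loose globals.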

How to assign vector of strings as variable names, in for loop, in data.table, in dplyr

This is a common situation which can be handled with ease by using a list.

This is what I would do if the data files are different in structure, i.e., columns differ in names, data types, or order:

library(data.table)
file_names <- list.files(pattern = "\\.csv$")  # pattern is a regular expression, not a glob
list_of_df <- lapply(file_names, fread)
list_of_df <- setNames(list_of_df, file_names)
list_of_df

$area.csv
   id         name
1:  1  normal name
2:  2   with,comma
3:  3 with%percent

$farmland.csv
   id         name
1:  1  normal name
2:  2   with,comma
3:  3 with%percent

$GDPpercapita.csv
   id         name
1:  1  normal name
2:  2   with,comma
3:  3 with%percent

Note that I have made up three sample files for demonstration. See the Data section for details.

The elements of the resulting list object list_of_df are named like the files the data were loaded from.

Now, we can operate on the elements of the list using lapply() or a for loop, e.g.,

lapply(
  list_of_df, 
  function(df) df[, lapply(.SD, function(col) if (is.character(col)) stringr::str_remove_all(col, "[,%]") else col)]
  )

$area.csv
   id        name
1:  1 normal name
2:  2   withcomma
3:  3 withpercent

$farmland.csv
   id        name
1:  1 normal name
2:  2   withcomma
3:  3 withpercent

$GDPpercapita.csv
   id        name
1:  1 normal name
2:  2   withcomma
3:  3 withpercent

Note that the code to remove , and % has been simplified.

lapply() has the advantage over a for loop that it returns a list again, which is convenient for subsequent processing steps.

As a side note: data.table has a special feature in that it can update by reference, i.e., without copying the data.table. So, we can update list_of_df in place, which may be a benefit in terms of speed and memory consumption for large datasets:

address(list_of_df) # just for demonstration
for (df in list_of_df) {
  cols <- which(sapply(df, is.character))
  df[, (cols) := lapply(.SD, stringr::str_remove_all, "[,%]"), .SDcols = cols]
}
address(list_of_df)

The calls to address(list_of_df) before and after the for loop have been added to demonstrate that list_of_df still occupies the same storage location but has been changed in place.

list_of_df

$area.csv
   id        name
1:  1 normal name
2:  2   withcomma
3:  3 withpercent

$farmland.csv
   id        name
1:  1 normal name
2:  2   withcomma
3:  3 withpercent

$GDPpercapita.csv
   id        name
1:  1 normal name
2:  2   withcomma
3:  3 withpercent

In case the datasets read from file have a similar structure, i.e., the same names, order, and data types of columns, we can combine the single pieces into one large dataset using rbindlist().

My preferred workflow for this use case is along these lines:

library(data.table)
library(magrittr)
file_names <- list.files(pattern = "\\.csv$")  # pattern is a regular expression, not a glob
big_df <- lapply(file_names, fread) %>% 
  set_names(file_names) %>% 
  rbindlist(idcol = "file_name")
big_df

          file_name id         name
1:         area.csv  1  normal name
2:         area.csv  2   with,comma
3:         area.csv  3 with%percent
4:     farmland.csv  1  normal name
5:     farmland.csv  2   with,comma
6:     farmland.csv  3 with%percent
7: GDPpercapita.csv  1  normal name
8: GDPpercapita.csv  2   with,comma
9: GDPpercapita.csv  3 with%percent

Note that rbindlist() has created an id column from the names of the list elements. This allows for distinguishing the origin of each row.
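The idcol behaviour can be tried without any files on disk. This is a minimal sketch with made-up in-memory tables; the names pieces, first, second, and source are illustrative only:

```r
library(data.table)

# Hypothetical named list of data.tables with the same structure
pieces <- list(
  first  = data.table(id = 1:2),
  second = data.table(id = 3:4)
)

# The list names end up in the column given by idcol
combined <- rbindlist(pieces, idcol = "source")
combined
```

If the list were unnamed, the id column would hold the position of each element in the list instead of a name.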

Working with one uniform data structure simplifies subsequent processing:

cols <- which(sapply(big_df, is.character))
big_df[, (cols) := lapply(.SD, stringr::str_remove_all, "[,%]"), .SDcols = cols]
big_df

          file_name id        name
1:         area.csv  1 normal name
2:         area.csv  2   withcomma
3:         area.csv  3 withpercent
4:     farmland.csv  1 normal name
5:     farmland.csv  2   withcomma
6:     farmland.csv  3 withpercent
7: GDPpercapita.csv  1 normal name
8: GDPpercapita.csv  2   withcomma
9: GDPpercapita.csv  3 withpercent

As the OP is using mutate(), here is an all-"tidyverse" approach. It does essentially the same as the data.table versions above:

library(purrr)
library(dplyr)
file_names <- list.files(pattern = "\\.csv$")  # pattern is a regular expression, not a glob
list_of_df <- map(file_names, readr::read_csv) %>% 
  set_names(file_names)

list_of_df %>% 
  map( ~ mutate(.x, across(where(is.character), ~ stringr::str_remove_all(.x, "[,%]"))))

$area.csv
# A tibble: 3 x 2
     id name       
  <dbl> <chr>      
1     1 normal name
2     2 withcomma  
3     3 withpercent

$farmland.csv
# A tibble: 3 x 2
     id name       
  <dbl> <chr>      
1     1 normal name
2     2 withcomma  
3     3 withpercent

$GDPpercapita.csv
# A tibble: 3 x 2
     id name       
  <dbl> <chr>      
1     1 normal name
2     2 withcomma  
3     3 withpercent

map() is the equivalent of base R's lapply(). Also, readr::read_csv() is used instead of data.table's fread().

Data

Caveat: The code below will create 3 files in the current working directory!

library(data.table)
dummy <- data.table(id = 1:3, name = c("normal name", "with,comma", "with%percent"))
extern <- c("area.csv", "farmland.csv", "GDPpercapita.csv")
for (fn in extern) fwrite(dummy, fn)

The code saves a dummy data.table to disk three times as a CSV file, using three different file names.

creating, directly, data.tables with column names from variables, and using variables for column names with :=

For the first question, I'm not absolutely sure, but you may want to try and see if fread is of any help creating an empty data.table with named columns.

As for the second question, try

DT[, c(nameOfCols) := 10]

where nameOfCols is the character vector with the names of the columns you want to modify. See ?data.table.
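For what it's worth, here is one possible sketch covering both questions. It is not necessarily what the asker had in mind: the column names and values are made up, and building the table from a named list is just one way to get column names from a variable.

```r
library(data.table)

nameOfCols <- c("a", "b")   # hypothetical column names held in a variable

# First question: create a data.table whose column names come from the vector
DT <- as.data.table(setNames(list(1:3, 4:6), nameOfCols))

# Second question: assign to the columns named in the vector
DT[, c(nameOfCols) := 10]
DT
```

Wrapping the vector in c() has the same effect as wrapping it in parentheses: it makes data.table evaluate nameOfCols rather than treat it as a literal column name.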

Use variable name to calculate or modify columns in a data.table

You can wrap the target name in parentheses on the LHS of := and use get() for the column on the RHS:

library(data.table)
name = c("Bob","Mary","Jane","Kim")
weight = c(60,65,45,55)
height = c(170,165,140,135)
dft = data.table(name,weight,height)

col1 <- 'weight'
col2 <- 'height'

dft[, (col1) := get(col2) + 13]
dft

#   name weight height
#1:  Bob    183    170
#2: Mary    178    165
#3: Jane    153    140
#4:  Kim    148    135

r data.table row subset with column name as a variable

I guess you are looking for get:

library(data.table)

DT <- data.table(x1=1:11, x2=11:21)
var <- "x1"
DT[get(var)==1,]
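As a quick check of what this returns, and as a possible alternative that avoids get(), the column can also be extracted with [[ ]]. This is only a sketch; res_get and res_idx are names introduced here for illustration.

```r
library(data.table)

DT  <- data.table(x1 = 1:11, x2 = 11:21)
var <- "x1"

res_get <- DT[get(var) == 1]      # as in the answer above
res_idx <- DT[DT[[var]] == 1]     # same filter, column looked up with [[ ]]

res_get
```

Both forms should return the single row where x1 equals 1.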

data.table grouped operations with variable names of columns without slow DT[, mean(get(colName)), by = grp]

It would be better to pass the dataset name d to the FOO function instead of passing the character string "d". Also, you can use lapply combined with .SD so that you can benefit from internal optimization instead of using mean(get(colName)).

FOO2 = function(dataName=d, colName = "x") { # d instead of "d" passed to the first argument!
  dataName[, lapply(.SD, mean), by=grp, .SDcols=colName]
}
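To see that the two styles compute the same thing, here is a small sketch on toy data. Note the assumptions: slow_mean is only a guess at the kind of get()-based code the question used (the original FOO is not shown here), and both helpers take the data.table as an argument rather than relying on a global d.

```r
library(data.table)

# Small hypothetical dataset: 4 groups of 5 consecutive integers
d_small <- data.table(x = 1:20, grp = rep(1:4, each = 5))

# Assumed shape of the slow variant: get() inside j
slow_mean <- function(data, colName) {
  data[, mean(get(colName)), by = grp]
}

# Variant from the answer: lapply over .SD restricted by .SDcols
fast_mean <- function(data, colName) {
  data[, lapply(.SD, mean), by = grp, .SDcols = colName]
}

res_slow <- slow_mean(d_small, "x")   # result column is named V1
res_fast <- fast_mean(d_small, "x")   # result column keeps the name x
res_fast
```

The only visible difference is the name of the result column; the .SD version also keeps the original column name, which is usually what you want.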

Benchmark: FOO vs FOO2

set.seed(147852)
n <- 1e7
d <- data.table(x = 1:n, grp = sample(1:1e5, n, replace = TRUE))

microbenchmark::microbenchmark(
  FOO(),
  FOO2(),
  times=5L
)

Unit: milliseconds
   expr       min        lq      mean    median        uq       max neval
  FOO() 4632.4014 4672.7781 4787.4958 4707.9023 4846.7081 5077.6893     5
 FOO2()  255.0828  267.1322  297.0389  275.4467  281.9873  405.5456     5
