Weird Case with Data Tables in R, Column Names Are Mixed

See Hadley Wickham's Advanced R:

$ is a shorthand operator, where x$y is equivalent to x[["y", exact = FALSE]]. It’s often used to access variables in a data frame, as in mtcars$cyl or diamonds$carat.

So exact = FALSE is the reason why $mc_split works even though no column has that exact name.
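A minimal sketch of that partial-matching behaviour, using a hypothetical data frame with a column named mc_split_resp:

```r
# $ on a data.frame does partial matching (exact = FALSE under the hood),
# while [[ matches exactly by default
df <- data.frame(mc_split_resp = 1:3)

df$mc_split        # partial match succeeds: returns 1 2 3
df[["mc_split"]]   # exact match fails: returns NULL
```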

As an aside, I don't believe mc_with_devices[, .(mc_split)] will work without double quotes. The following will work:

mc_with_devices[,"mc_split_resp"]

Creating data.tables directly with column names from variables, and using variables for column names with :=

For the first question I'm not absolutely sure, but you may want to check whether fread can help with creating an empty data.table with named columns.

As for the second question, try

DT[, c(nameOfCols) := 10]

where nameOfCols is a character vector with the names of the columns you want to modify. See ?data.table.
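A small self-contained sketch of that idiom; the table and column names here are made up for illustration:

```r
library(data.table)

DT <- data.table(a = 1:3, b = 4:6, c = 7:9)
nameOfCols <- c("a", "b")

# Assign 10 to every column named in nameOfCols, by reference;
# c() around the variable tells data.table to use its value, not the name itself
DT[, c(nameOfCols) := 10]
DT   # columns a and b are now 10, c is untouched
```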

Why am I getting X. in my column names when reading a data frame?

read.csv() is a wrapper around the more general read.table() function. The latter has an argument check.names, which is documented as:

check.names: logical.  If ‘TRUE’ then the names of the variables in the
data frame are checked to ensure that they are syntactically
valid variable names. If necessary they are adjusted (by
‘make.names’) so that they are, and also to ensure that there
are no duplicates.

If your header contains labels that are not syntactically valid then make.names() will replace them with a valid name, based upon the invalid name, removing invalid characters and possibly prepending X:

R> make.names("$Foo")
[1] "X.Foo"

This is documented in ?make.names:

Details:

A syntactically valid name consists of letters, numbers and the
dot or underline characters and starts with a letter or the dot
not followed by a number. Names such as ‘".2way"’ are not valid,
and neither are the reserved words.

The definition of a _letter_ depends on the current locale, but
only ASCII digits are considered to be digits.

The character ‘"X"’ is prepended if necessary. All invalid
characters are translated to ‘"."’. A missing value is translated
to ‘"NA"’. Names which match R keywords have a dot appended to
them. Duplicated values are altered by ‘make.unique’.

The behaviour you are seeing is entirely consistent with the documented way read.table() loads your data, which suggests that you have syntactically invalid labels in the header row of your CSV file. Note the point from ?make.names that what counts as a letter depends on your system's locale: the CSV file might contain a character that displays fine in your text editor, but if R is running in a different locale that character may not be valid there.

I would look at the CSV file and identify any non-ASCII characters in the header line; there may also be non-printing characters (or escape sequences such as \t) in the header row. A lot happens between reading in the file and displaying it in the console, which can mask the invalid characters, so don't take the fact that nothing looks wrong without check.names as an indication that the file is OK.
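One way to check the raw header bytes, sketched here with a temporary file standing in for your CSV:

```r
# Write a header containing a non-ASCII character ('é'), then inspect its bytes
f <- tempfile(fileext = ".csv")
writeLines("col\u00e9,x\n1,2", f)

hdr <- readLines(f, n = 1)
charToRaw(hdr)                       # any byte above 0x7f is non-ASCII
any(charToRaw(hdr) > as.raw(0x7f))   # TRUE for this header
```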

Posting the output of sessionInfo() would also be useful.

Dangers of mixing [tidyverse] and [data.table] syntax in R?

I have come across the same problem on a few occasions, which led me to avoid mixing dplyr and data.table syntax, as I had never taken the time to find out the reason. So thanks for providing a MRE.

Looks like dplyr::arrange is interfering with data.table auto-indexing:

  • an index will be used when subsetting a dataset with ==
    or %in% on a single variable
  • by default, if an index for the variable is not present when
    filtering, it is automatically created and used
  • indexes are lost if you change the order of the data
  • you can check whether you are using an index with
    options(datatable.verbose = TRUE)

If we explicitly enable auto-indexing:

library(dplyr)
library(data.table)

DT <- fread(
"iso3c country income
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC")
codes <- c("ALB", "ZMB")

options(datatable.auto.index = TRUE)

DT <- distinct(DT) %>% as.data.table()

# Index creation because %in% is used for the first time
DT[iso3c %in% codes,verbose=T]
#> Creating new index 'iso3c'
#> Creating index iso3c done in ... forder.c received 3 rows and 3 columns
#> forder took 0 sec
#> 0.060s elapsed (0.060s cpu)
#> Optimized subsetting with index 'iso3c'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.iso3c has same type (character) as x.iso3c. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu)
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
#> Reordering 2 rows after bmerge done in ... forder.c received a vector type 'integer' length 2
#> 0 secs
#> iso3c country income
#> 1: ZMB Zambia LMIC
#> 2: ALB Albania UMIC

# Index mixed up by arrange
DT <- DT %>% arrange(iso3c) %>% as.data.table()

# this is wrong: data.table apparently still uses the old index although the rows were rearranged:
DT[iso3c %in% codes,verbose=T]
#> Optimized subsetting with index 'iso3c'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.iso3c has same type (character) as x.iso3c. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu)
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
#> iso3c country income
#> 1: ALB Albania UMIC

# this works because wrapping the expression in (...) prevents the parser from using the auto-index
DT[(iso3c %in% codes)]
#> iso3c country income
#> 1: ALB Albania UMIC
#> 2: ZMB Zambia LMIC

To avoid this problem, you can disable auto-indexing:

library(dplyr)
library(data.table)

DT <- fread(
"iso3c country income
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC")
codes <- c("ALB", "ZMB")

options(datatable.auto.index = FALSE) # Disabled

DT <- distinct(DT) %>% as.data.table()

# No automatic index creation
DT[iso3c %in% codes,verbose=T]
#> iso3c country income
#> 1: ZMB Zambia LMIC
#> 2: ALB Albania UMIC

DT <- DT %>% arrange(iso3c) %>% as.data.table()

# This now works because auto-indexing is off:
DT[iso3c %in% codes,verbose=T]
#> iso3c country income
#> 1: ALB Albania UMIC
#> 2: ZMB Zambia LMIC

I reported this issue on data.table/issues/5042 and on dtplyr/issues/259: a fix is integrated in the 1.4.11 milestone.

Different functions over a list of columns and generate new column names automatically with data.table

If I understand correctly, the question is not about shiny in the first place but about how to apply different aggregation functions to specific columns of a data.table.

The names of the columns and the functions to apply to them are given in the list mylist, which is created by the shiny app.

Among the various approaches my preferred option is to compute on the language, i.e., to create a complete expression from the contents of mylist and to evaluate it:

library(magrittr)
library(data.table)
mylist %>%
  names() %>%
  lapply(
    function(.col) lapply(
      mylist[[.col]],
      function(.fct) sprintf("%s.%s = %s(%s)", .col, .fct, .fct, .col))) %>%
  unlist() %>%
  paste(collapse = ", ") %>%
  sprintf("as.data.table(mtcars)[, .(%s), by = cyl]", .) %>%
  parse(text = .) %>%
  eval()

which yields the expected result

   cyl disp.sum disp.mean    hp.sd drat.sum drat.mean wt.max
1:   6   1283.2  183.3143 24.26049    25.10  3.585714  3.460
2:   4   1156.5  105.1364 20.93453    44.78  4.070909  3.190
3:   8   4943.4  353.1000 50.97689    45.21  3.229286  5.424

The character string which is parsed is created by

mylist %>%
  names() %>%
  lapply(
    function(.col) lapply(
      mylist[[.col]],
      function(.fct) sprintf("%s.%s = %s(%s)", .col, .fct, .fct, .col))) %>%
  unlist() %>%
  paste(collapse = ", ") %>%
  sprintf("as.data.table(mtcars)[, .(%s), by = cyl]", .)

and looks as if coded manually:

[1] "as.data.table(mtcars)[, .(disp.sum = sum(disp), disp.mean = mean(disp), hp.sd = sd(hp), drat.sum = sum(drat), drat.mean = mean(drat), wt.max = max(wt)), by = cyl]"

Data

For demonstration, mylist is provided "hard-coded":

mylist <- list(
  disp = c("sum", "mean"),
  hp = "sd",
  drat = c("sum", "mean"),
  wt = "max")

How to chain together a mix of data.table and base r functions?

If I understand correctly, the OP wants to

  • rename column Values_1 to Values (or, in the OP's words: create a new column "Values" which equals "Values_1")
  • drop column Values_2
  • replace all occurrences of XX by HI in column State

Here is what I would do in data.table syntax:

setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID][
  , Values_2 := NULL][
  State == "XX", State := "HI"][]
setnames(data, "Values_1", "Values")
data
    ID Period Values State
 1:  1      1      5    X0
 2:  1      2      0    X1
 3:  1      3      0    X2
 4:  1      4      0    X1
 5:  2      1      1    X0
 6:  2      2      0    HI
 7:  2      3      0    HI
 8:  2      4      0    HI
 9:  3      1      0    X2
10:  3      2      0    X1
11:  3      3      0    X9
12:  3      4      0    X3
13:  4      1      1    X2
14:  4      2      2    X1
15:  4      3      3    X9
16:  4      4      0    HI

setnames() updates by reference, i.e., without copying. There is no need to create a copy of Values_1 and delete Values_1 later on.

Also, [State == "XX", State := "HI"] replaces XX by HI only in affected rows by reference while

[, State := gsub('XX','HI', State)] replaces the whole column.
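A toy illustration of the subset-assign idiom (the data here is made up):

```r
library(data.table)

dt <- data.table(State = c("XX", "X1", "XX"))

# := inside [i, j] updates only the rows matched by i, by reference
dt[State == "XX", State := "HI"]
dt$State   # "HI" "X1" "HI"
```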

data.table chaining is used where appropriate.

BTW: I wonder why the replacement of XX by HI cannot be done right away in the first statement:

setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "HI"), ID][
  , Values_2 := NULL][]
setnames(data, "Values_1", "Values")

Combining multiple lists of variable names in data.table?

Another option is to construct the full call ahead of time:

varnames[4:5] <- expr[2:3]  # this results in `list(a, b, a * b, c + d)`
dt[, eval(varnames)]

produces:

   a b V3 V4
1: 1 4  4 17
2: 2 5 10 19
3: 3 6 18 21

More generically, suppose you have a list of quoted lists of expressions:

exprlist <- list(quote(list(a, b)), quote(list(c, c %% a)), quote(list(a + b)))
expr <- as.call(Reduce(function(x, y) c(as.list(x), as.list(y)[-1]), exprlist)) # @eddi
dt[, eval(expr)]
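Since dt and varnames come from the question and are not shown here, a self-contained sketch of the generic approach with made-up data:

```r
library(data.table)

dt <- data.table(a = 1:3, b = 4:6, c = 7:9)
exprlist <- list(quote(list(a, b)), quote(list(c, c %% a)), quote(list(a + b)))

# Splice the quoted list() calls into one call, dropping the inner `list` heads
expr <- as.call(Reduce(function(x, y) c(as.list(x), as.list(y)[-1]), exprlist))
expr   # list(a, b, c, c %% a, a + b)

dt[, eval(expr)]   # five result columns: a, b, c, c %% a, a + b
```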

