Weird case with data tables in R, column names are mixed
See Hadley Wickham's Advanced R:
$ is a shorthand operator, where x$y is equivalent to x[["y", exact = FALSE]]. It’s often used to access variables in a data frame, as in mtcars$cyl or diamonds$carat.
So exact = FALSE is the reason why $mc_split works despite there not being a column with that exact name.
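A minimal sketch of that partial matching; the data frame and its mc_split_resp column are made up here to mirror the question:

```r
df <- data.frame(mc_split_resp = 1:3)

# $ does partial matching, like [[ with exact = FALSE:
df$mc_split                       # matches mc_split_resp, returns 1 2 3
df[["mc_split", exact = FALSE]]   # same result

# [[ with the default exact = TRUE requires the full name:
df[["mc_split"]]                  # NULL
```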
As an aside, I don't believe mc_with_devices[, .(mc_split)] will work without double quotes. The following will work:
mc_with_devices[, "mc_split_resp"]
creating, directly, data.tables with column names from variables, and using variables for column names with :=
For the first question, I'm not absolutely sure, but you may want to try and see if fread is of any help creating an empty data.table with named columns.
As for the second question, try
DT[, c(nameOfCols) := 10]
where nameOfCols is the vector with the names of the columns you want to modify. See ?data.table.
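A hedged sketch of both parts; the column names (nameOfCols, x, y, z) are made up for the illustration:

```r
library(data.table)

# Create an empty data.table whose column names come from a variable:
nameOfCols <- c("x", "y", "z")
DT <- data.table(matrix(numeric(0), nrow = 0, ncol = length(nameOfCols)))
setnames(DT, nameOfCols)   # renames by reference

# Use a character vector on the left-hand side of := to modify those columns:
DT2 <- data.table(x = 1:3, y = 4:6, z = 7:9)
DT2[, c(nameOfCols[1:2]) := 10]   # sets columns x and y to 10 in place
```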
Why am I getting X. in my column names when reading a data frame?
read.csv() is a wrapper around the more general read.table() function. The latter function has the argument check.names, which is documented as:
check.names: logical. If ‘TRUE’ then the names of the variables in the
data frame are checked to ensure that they are syntactically
valid variable names. If necessary they are adjusted (by
‘make.names’) so that they are, and also to ensure that there
are no duplicates.
If your header contains labels that are not syntactically valid, then make.names() will replace them with a valid name based upon the invalid name, removing invalid characters and possibly prepending X:
R> make.names("$Foo")
[1] "X.Foo"
This is documented in ?make.names:
Details:
A syntactically valid name consists of letters, numbers and the
dot or underline characters and starts with a letter or the dot
not followed by a number. Names such as ‘".2way"’ are not valid,
and neither are the reserved words.
The definition of a _letter_ depends on the current locale, but
only ASCII digits are considered to be digits.
The character ‘"X"’ is prepended if necessary. All invalid
characters are translated to ‘"."’. A missing value is translated
to ‘"NA"’. Names which match R keywords have a dot appended to
them. Duplicated values are altered by ‘make.unique’.
The behaviour you are seeing is entirely consistent with the documented way read.table() loads in your data. That suggests you have syntactically invalid labels in the header row of your CSV file. Note the point above from ?make.names that what counts as a letter depends on the current locale: the CSV file might contain a character that your text editor displays fine, but if R is not running in the same locale, that character may not be valid there.
I would look at the CSV file and identify any non-ASCII characters in the header line; there may also be non-visible characters (or escape sequences, such as \t) in the header row. A lot can happen between reading in the file with the invalid names and displaying it in the console, which may mask the invalid characters, so do not take the absence of visible problems without check.names as an indication that the file is OK.
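If the goal is to keep the original labels rather than sanitize them, check.names = FALSE passes the header through untouched; a small self-contained sketch (the header labels are made up):

```r
# Write a small CSV with a syntactically invalid header to a temp file:
tmp <- tempfile(fileext = ".csv")
writeLines(c("$Foo,2nd col", "1,2"), tmp)

names(read.csv(tmp))                   # mangled by make.names()
df <- read.csv(tmp, check.names = FALSE)
names(df)                              # original labels preserved
df[["$Foo"]]                           # access now needs [[ or back-ticks
```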
Posting the output of sessionInfo() would also be useful.
Dangers of mixing [tidyverse] and [data.table] syntax in R?
I came across the same problem on a few occasions, which led me to avoid mixing dplyr with data.table syntax, as I never took the time to find out the reason. So thanks for providing an MRE.
Looks like dplyr::arrange is interfering with data.table auto-indexing:
- an index will be used when subsetting a dataset with == or %in% on a single variable
- by default, if an index for a variable is not present on filtering, it is automatically created and used
- indexes are lost if you change the order of the data
- you can check whether an index is used with options(datatable.verbose=TRUE)
If we explicitly enable auto-indexing:
library(dplyr)
library(data.table)
DT <- fread(
"iso3c country income
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC")
codes <- c("ALB", "ZMB")
options(datatable.auto.index = TRUE)
DT <- distinct(DT) %>% as.data.table()
# Index creation because %in% is used for the first time
DT[iso3c %in% codes,verbose=T]
#> Creating new index 'iso3c'
#> Creating index iso3c done in ... forder.c received 3 rows and 3 columns
#> forder took 0 sec
#> 0.060s elapsed (0.060s cpu)
#> Optimized subsetting with index 'iso3c'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.iso3c has same type (character) as x.iso3c. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu)
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
#> Reordering 2 rows after bmerge done in ... forder.c received a vector type 'integer' length 2
#> 0 secs
#> iso3c country income
#> 1: ZMB Zambia LMIC
#> 2: ALB Albania UMIC
# Index mixed up by arrange
DT <- DT %>% arrange(iso3c) %>% as.data.table()
# this is wrong because data.table still uses the old index even though the rows were rearranged:
DT[iso3c %in% codes,verbose=T]
#> Optimized subsetting with index 'iso3c'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.iso3c has same type (character) as x.iso3c. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu)
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
#> iso3c country income
#> 1: ALB Albania UMIC
# this works because (...) prevents the parser from using the auto-index
DT[(iso3c %in% codes)]
#> iso3c country income
#> 1: ALB Albania UMIC
#> 2: ZMB Zambia LMIC
To avoid this problem, you can disable auto-indexing:
library(dplyr)
library(data.table)
DT <- fread(
"iso3c country income
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC")
codes <- c("ALB", "ZMB")
options(datatable.auto.index = FALSE) # Disabled
DT <- distinct(DT) %>% as.data.table()
# No automatic index creation
DT[iso3c %in% codes,verbose=T]
#> iso3c country income
#> 1: ZMB Zambia LMIC
#> 2: ALB Albania UMIC
DT <- DT %>% arrange(iso3c) %>% as.data.table()
# This now works because auto-indexing is off:
DT[iso3c %in% codes,verbose=T]
#> iso3c country income
#> 1: ALB Albania UMIC
#> 2: ZMB Zambia LMIC
I reported this issue on data.table/issues/5042 and on dtplyr/issues/259; it was integrated into the 1.4.11 milestone.
Different functions over a list of columns and generate new column names automatically with data.table
If I understand correctly, the question is not about shiny in the first place but about how to apply different aggregation functions to specific columns of a data.table. The names of the columns and the functions to be applied to them are given as the list mylist, which is created by the shiny app.
Among the various approaches, my preferred option is to compute on the language, i.e., to create a complete expression from the contents of mylist and to evaluate it:
library(magrittr)
library(data.table)
mylist %>%
names() %>%
lapply(
function(.col) lapply(
mylist[[.col]],
function(.fct) sprintf("%s.%s = %s(%s)", .col, .fct, .fct, .col))) %>%
unlist() %>%
paste(collapse = ", ") %>%
sprintf("as.data.table(mtcars)[, .(%s), by = cyl]", .) %>%
parse(text = .) %>%
eval()
which yields the expected result
cyl disp.sum disp.mean hp.sd drat.sum drat.mean wt.max
1: 6 1283.2 183.3143 24.26049 25.10 3.585714 3.460
2: 4 1156.5 105.1364 20.93453 44.78 4.070909 3.190
3: 8 4943.4 353.1000 50.97689 45.21 3.229286 5.424
The character string which is parsed is created by
mylist %>%
names() %>%
lapply(
function(.col) lapply(
mylist[[.col]],
function(.fct) sprintf("%s.%s = %s(%s)", .col, .fct, .fct, .col))) %>%
unlist() %>%
paste(collapse = ", ") %>%
sprintf("as.data.table(mtcars)[, .(%s), by = cyl]", .)
and looks as if coded manually:
[1] "as.data.table(mtcars)[, .(disp.sum = sum(disp), disp.mean = mean(disp), hp.sd = sd(hp), drat.sum = sum(drat), drat.mean = mean(drat), wt.max = max(wt)), by = cyl]"
Data
For demonstration, mylist is provided "hard-coded":
mylist <- list(
disp = c("sum", "mean"),
hp = "sd",
drat = c("sum", "mean"),
wt = "max")
How to chain together a mix of data.table and base r functions?
If I understand correctly, the OP wants to
- rename column Values_1 to Values (or in OP's words: create new column "Values", which equals "Values_1")
- drop column Values_2
- replace all occurrences of XX by HI in column State
Here is what I would do in data.table syntax:
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID][
, Values_2 := NULL][
State == "XX", State := "HI"][]
setnames(data, "Values_1", "Values")
data
ID Period Values State
1: 1 1 5 X0
2: 1 2 0 X1
3: 1 3 0 X2
4: 1 4 0 X1
5: 2 1 1 X0
6: 2 2 0 HI
7: 2 3 0 HI
8: 2 4 0 HI
9: 3 1 0 X2
10: 3 2 0 X1
11: 3 3 0 X9
12: 3 4 0 X3
13: 4 1 1 X2
14: 4 2 2 X1
15: 4 3 3 X9
16: 4 4 0 HI
setnames() updates by reference, i.e., without copying. There is no need to create a copy of Values_1 and delete Values_1 later on.
Also, [State == "XX", State := "HI"] replaces XX by HI by reference, only in the affected rows, while [, State := gsub('XX','HI', State)] replaces the whole column.
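The difference can be demonstrated on a tiny made-up table:

```r
library(data.table)
dt <- data.table(State = c("X0", "XX", "X1", "XX"))

# Sub-assign by reference: only the two matching rows are touched
dt[State == "XX", State := "HI"]

# The gsub() variant would instead recompute and reassign the whole column:
# dt[, State := gsub("XX", "HI", State)]
```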
data.table chaining is used where appropriate.
BTW: I wonder why the replacement of XX by HI cannot be done right away in the first statement:
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "HI"), ID][
, Values_2 := NULL][]
setnames(data, "Values_1", "Values")
Combining multiple lists of variable names in data.table?
Another option is to construct the full call ahead of time:
varnames[4:5] <- expr[2:3] # this results in `list(a, b, a * b, c + d)`
dt[, eval(varnames)]
produces:
a b V3 V4
1: 1 4 4 17
2: 2 5 10 19
3: 3 6 18 21
More generically, suppose you have a list of quoted lists of expressions:
exprlist <- list(quote(list(a, b)), quote(list(c, c %% a)), quote(list(a + b)))
expr <- as.call(Reduce(function(x, y) c(as.list(x), as.list(y)[-1]), exprlist)) # @eddi
dt[, eval(expr)]
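For completeness, here is the generic approach as a self-contained sketch; the columns of dt are made up, since the question's data is not reproduced above:

```r
library(data.table)
dt <- data.table(a = 1:3, b = 4:6, c = c(4, 10, 18), d = c(13, 9, 3))

# Quoted list() calls, spliced into the single call list(a, b, c, c %% a, a + b):
exprlist <- list(quote(list(a, b)), quote(list(c, c %% a)), quote(list(a + b)))
expr <- as.call(Reduce(function(x, y) c(as.list(x), as.list(y)[-1]), exprlist))
dt[, eval(expr)]
```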