Select columns by class (e.g. numeric) from a data.table
data.table
needs the with=FALSE
to grab column numbers.
tokeep <- which(sapply(x,is.numeric))
x[ , tokeep, with=FALSE]
best way to select columns from data.table by type
You can pass a logical/character vector to .SDcols
.
For character columns, we can do
library(data.table)
cols <- names(Filter(is.character, dt))
dt[, (cols) := lapply(.SD, tolower), .SDcols = cols]
Select multiple columns in data.table by their numeric indices
For versions of data.table >= 1.9.8
, the following all just work:
library(data.table)
dt <- data.table(a = 1, b = 2, c = 3)
# select single column by index
dt[, 2]
# b
# 1: 2
# select multiple columns by index
dt[, 2:3]
# b c
# 1: 2 3
# select single column by name
dt[, "a"]
# a
# 1: 1
# select multiple columns by name
dt[, c("a", "b")]
# a b
# 1: 1 2
For versions of data.table < 1.9.8
(for which numerical column selection required the use of with = FALSE
), see this previous version of this answer. See also NEWS on v1.9.8, POTENTIALLY BREAKING CHANGES, point 3.
Selecting only numeric columns from a data frame
EDIT: updated to avoid use of ill-advised sapply
.
Since a data frame is a list we can use the list-apply functions:
nums <- unlist(lapply(x, is.numeric), use.names = FALSE)
Then standard subsetting
x[ , nums]
## don't use sapply, even though it's less code
## nums <- sapply(x, is.numeric)
For a more idiomatic modern R I'd now recommend
x[ , purrr::map_lgl(x, is.numeric)]
Less codey, less reflecting R's particular quirks, and more straightforward, and robust to use on database-back-ended tibbles:
dplyr::select_if(x, is.numeric)
Newer versions of dplyr, also support the following syntax:
x %>% dplyr::select(where(is.numeric))
Set multiple column classes from a vector in data.table
Same idea as @RonakShah's answer but assuming the OP has explicitly named the columns rather than passing by position:
# different input format
cc <- setNames(col_classes, names(dtnew))
# usage
res = lapply(setNames(, names(cc)), function(n)
match.fun(sprintf("as.%s", cc[[n]]))(dtnew[[n]])
)
setDT(res)[]
Some other ways the problem might be solved:
If reading the data in, use the
colClasses=
argument tofread()
or a similar function.Maybe also consider
type.convert
which will automatically guess and apply a class to each column. It cannot return a mix of character and factor columns, however.
Convert column classes in data.table
For a single column:
dtnew <- dt[, Quarter:=as.character(Quarter)]
str(dtnew)
Classes ‘data.table’ and 'data.frame': 10 obs. of 3 variables:
$ ID : Factor w/ 2 levels "A","B": 1 1 1 1 1 2 2 2 2 2
$ Quarter: chr "1" "2" "3" "4" ...
$ value : num -0.838 0.146 -1.059 -1.197 0.282 ...
Using lapply
and as.character
:
dtnew <- dt[, lapply(.SD, as.character), by=ID]
str(dtnew)
Classes ‘data.table’ and 'data.frame': 10 obs. of 3 variables:
$ ID : Factor w/ 2 levels "A","B": 1 1 1 1 1 2 2 2 2 2
$ Quarter: chr "1" "2" "3" "4" ...
$ value : chr "1.487145280568" "-0.827845218358881" "0.028977182770002" "1.35392750102305" ...
Convert *some* column classes in data.table
Besides using the option as suggested by Matt Dowle, another way of changing the column classes is as follows:
dat[, (cols) := lapply(.SD, factor), .SDcols = cols]
By using the :=
operator you update the datatable by reference. A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
As suggeted by @MattDowle in the comments, you can also use a combination of for(...) set(...)
as follows:
for (col in cols) set(dat, j = col, value = factor(dat[[col]]))
which will give the same result. A third alternative is:
for (col in cols) dat[, (col) := factor(dat[[col]])]
On a smaller datasets, the for(...) set(...)
option is about three times faster than the lapply
option (but that doesn't really matter, because it is a small dataset). On larger datasets (e.g. 2 million rows), each of these approaches takes about the same amount of time. For testing on a larger dataset, I used:
dat <- data.table(ID=c(rep("A", 1e6), rep("B",1e6)),
Quarter=c(1:1e6, 1:1e6),
value=rnorm(10))
Sometimes, you will have to do it a bit differently (for example when numeric values are stored as a factor). Then you have to use something like this:
dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]
WARNING: The following explanation is not the data.table
-way of doing things. The datatable is not updated by reference because a copy is made and stored in memory (as pointed out by @Frank), which increases memory usage. It is more an addition in order to explain the working of with = FALSE
.
When you want to change the column classes the same way as you would do with a dataframe, you have to add with = FALSE
as follows:
dat[, cols] <- lapply(dat[, cols, with = FALSE], factor)
A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
If you don't add with = FALSE
, datatable will evaluate dat[, cols]
as a vector. Check the difference in output between dat[, cols]
and dat[, cols, with = FALSE]
:
> dat[, cols]
[1] "ID" "Quarter"
> dat[, cols, with = FALSE]
ID Quarter
1: A 1
2: A 2
3: A 3
4: A 4
5: A 5
6: B 1
7: B 2
8: B 3
9: B 4
10: B 5
How to select columns in data.table using a character vector of certain column names?
We can use ..
notation to find myVector
as a vector of column positions, like it would work in data.frame
mtcarsDT[, ..myVector]
According to ?data.table
In case of overlapping variables names inside dataset and in parent scope you can use double dot prefix
..cols
to explicitly refer to 'cols variable parent scope and not from your dataset.
Issue converting multiple column classes in R data.table
factcols
in .SDcols=factcols
should be a length-4 logical vector or the vector of column name/position, e.g. .SDcols = c("Born_before_2016"),.SDcols = 1
, but factcols <- sapply(norw5[,..varls], is.numeric)
returns length-3 logical vector.
It can be fixed as
fact <- c('Born_before_2016','gender','payor')
factcols <- sapply(norw5[,..fact], is.numeric)
cols <- names(norw5)[1:3][factcols]
norw5new <- norw5[,(cols) := lapply(.SD,as.character),.SDcols=cols]
norw5new
# Born_before_2016 gender payor Age_in_day
# <char> <char> <char> <int>
#1: 1 2.Female 1:Private 0
#2: 1 1.Male 1:Private 0
#3: 1 2.Female 4:Other 0
#4: 1 1.Male 4:Other 4
#5: 1 1.Male 1:Private 5
Rounding selected columns of data.table
If you don't mind overwriting your original mydf
:
cols <- names(mydf)[1:2]
mydf[,(cols) := round(.SD,1), .SDcols=cols]
mydf
# vnum1 vnum2 vch1
#1: 0.6 0.7 B
#2: -1.4 0.5 E
#3: 0.7 0.9 A
#4: -0.3 0.8 C
#5: -0.8 0.6 C
Related Topics
Using Functions and Environments
How to Get Environment of a Variable in R
Find Elements Not in Smaller Character Vector List But in Big List
2 Knitr/R Markdown/Rstudio Issues: Highcharts and Morris.Js
Arranging Arrows Between Points Nicely in Ggplot2
How to Set Different Scale Limits for Different Facets
Random Sampling to Give an Exact Sum
Filled.Contour in R 3.0.X Throws Error
R:Loops to Process Large Dataset(Gbs) in Chunks
Unicode Characters Conversion in R
Extracting Data Used to Make a Smooth Plot in Mgcv
How to Install R-Packages Not in the Conda Repositories
Pivot_Wider, Count Number of Occurrences
Twitter Emoji Encoding Problems with Twitter and R
Problems with Dplyr and Posixlt Data
Separate a Column into 2 Columns at the Last Underscore in R