Select Columns by Class (E.G. Numeric) from a Data.Table

Select columns by class (e.g. numeric) from a data.table

data.table needs the with=FALSE to grab column numbers.

tokeep <- which(sapply(x,is.numeric))
x[ , tokeep, with=FALSE]

best way to select columns from data.table by type

You can pass a logical/character vector to .SDcols.

For character columns, we can do

library(data.table)
cols <- names(Filter(is.character, dt))
dt[, (cols) := lapply(.SD, tolower), .SDcols = cols]

Select multiple columns in data.table by their numeric indices

For versions of data.table >= 1.9.8, the following all just work:

library(data.table)
dt <- data.table(a = 1, b = 2, c = 3)

# select single column by index
dt[, 2]
# b
# 1: 2

# select multiple columns by index
dt[, 2:3]
# b c
# 1: 2 3

# select single column by name
dt[, "a"]
# a
# 1: 1

# select multiple columns by name
dt[, c("a", "b")]
# a b
# 1: 1 2

For versions of data.table < 1.9.8 (for which numerical column selection required the use of with = FALSE), see this previous version of this answer. See also NEWS on v1.9.8, POTENTIALLY BREAKING CHANGES, point 3.

Selecting only numeric columns from a data frame

EDIT: updated to avoid use of ill-advised sapply.

Since a data frame is a list we can use the list-apply functions:

nums <- unlist(lapply(x, is.numeric), use.names = FALSE)  

Then standard subsetting

x[ , nums]

## don't use sapply, even though it's less code
## nums <- sapply(x, is.numeric)

For a more idiomatic modern R I'd now recommend

x[ , purrr::map_lgl(x, is.numeric)]

Less codey, less reflecting R's particular quirks, and more straightforward, and robust to use on database-back-ended tibbles:

dplyr::select_if(x, is.numeric)

Newer versions of dplyr, also support the following syntax:

x %>% dplyr::select(where(is.numeric))

Set multiple column classes from a vector in data.table

Same idea as @RonakShah's answer but assuming the OP has explicitly named the columns rather than passing by position:

# different input format
cc <- setNames(col_classes, names(dtnew))

# usage
res = lapply(setNames(, names(cc)), function(n)
match.fun(sprintf("as.%s", cc[[n]]))(dtnew[[n]])
)
setDT(res)[]

Some other ways the problem might be solved:

  • If reading the data in, use the colClasses= argument to fread() or a similar function.

  • Maybe also consider type.convert which will automatically guess and apply a class to each column. It cannot return a mix of character and factor columns, however.

Convert column classes in data.table

For a single column:

dtnew <- dt[, Quarter:=as.character(Quarter)]
str(dtnew)

Classes ‘data.table’ and 'data.frame': 10 obs. of 3 variables:
$ ID : Factor w/ 2 levels "A","B": 1 1 1 1 1 2 2 2 2 2
$ Quarter: chr "1" "2" "3" "4" ...
$ value : num -0.838 0.146 -1.059 -1.197 0.282 ...

Using lapply and as.character:

dtnew <- dt[, lapply(.SD, as.character), by=ID]
str(dtnew)

Classes ‘data.table’ and 'data.frame': 10 obs. of 3 variables:
$ ID : Factor w/ 2 levels "A","B": 1 1 1 1 1 2 2 2 2 2
$ Quarter: chr "1" "2" "3" "4" ...
$ value : chr "1.487145280568" "-0.827845218358881" "0.028977182770002" "1.35392750102305" ...

Convert *some* column classes in data.table

Besides using the option as suggested by Matt Dowle, another way of changing the column classes is as follows:

dat[, (cols) := lapply(.SD, factor), .SDcols = cols]

By using the := operator you update the datatable by reference. A check whether this worked:

> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"

As suggeted by @MattDowle in the comments, you can also use a combination of for(...) set(...) as follows:

for (col in cols) set(dat, j = col, value = factor(dat[[col]]))

which will give the same result. A third alternative is:

for (col in cols) dat[, (col) := factor(dat[[col]])]

On a smaller datasets, the for(...) set(...) option is about three times faster than the lapply option (but that doesn't really matter, because it is a small dataset). On larger datasets (e.g. 2 million rows), each of these approaches takes about the same amount of time. For testing on a larger dataset, I used:

dat <- data.table(ID=c(rep("A", 1e6), rep("B",1e6)),
Quarter=c(1:1e6, 1:1e6),
value=rnorm(10))

Sometimes, you will have to do it a bit differently (for example when numeric values are stored as a factor). Then you have to use something like this:

dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]


WARNING: The following explanation is not the data.table-way of doing things. The datatable is not updated by reference because a copy is made and stored in memory (as pointed out by @Frank), which increases memory usage. It is more an addition in order to explain the working of with = FALSE.

When you want to change the column classes the same way as you would do with a dataframe, you have to add with = FALSE as follows:

dat[, cols] <- lapply(dat[, cols, with = FALSE], factor)

A check whether this worked:

> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"

If you don't add with = FALSE, datatable will evaluate dat[, cols] as a vector. Check the difference in output between dat[, cols] and dat[, cols, with = FALSE]:

> dat[, cols]
[1] "ID" "Quarter"

> dat[, cols, with = FALSE]
ID Quarter
1: A 1
2: A 2
3: A 3
4: A 4
5: A 5
6: B 1
7: B 2
8: B 3
9: B 4
10: B 5

How to select columns in data.table using a character vector of certain column names?

We can use .. notation to find myVector as a vector of column positions, like it would work in data.frame

mtcarsDT[, ..myVector]

According to ?data.table

In case of overlapping variables names inside dataset and in parent scope you can use double dot prefix ..cols to explicitly refer to 'cols variable parent scope and not from your dataset.

Issue converting multiple column classes in R data.table

factcols in .SDcols=factcols should be a length-4 logical vector or the vector of column name/position, e.g. .SDcols = c("Born_before_2016"),.SDcols = 1, but factcols <- sapply(norw5[,..varls], is.numeric) returns length-3 logical vector.
It can be fixed as

fact <- c('Born_before_2016','gender','payor')
factcols <- sapply(norw5[,..fact], is.numeric)
cols <- names(norw5)[1:3][factcols]
norw5new <- norw5[,(cols) := lapply(.SD,as.character),.SDcols=cols]
norw5new

# Born_before_2016 gender payor Age_in_day
# <char> <char> <char> <int>
#1: 1 2.Female 1:Private 0
#2: 1 1.Male 1:Private 0
#3: 1 2.Female 4:Other 0
#4: 1 1.Male 4:Other 4
#5: 1 1.Male 1:Private 5

Rounding selected columns of data.table

If you don't mind overwriting your original mydf:

cols <- names(mydf)[1:2]
mydf[,(cols) := round(.SD,1), .SDcols=cols]
mydf

# vnum1 vnum2 vch1
#1: 0.6 0.7 B
#2: -1.4 0.5 E
#3: 0.7 0.9 A
#4: -0.3 0.8 C
#5: -0.8 0.6 C


Related Topics



Leave a reply



Submit