Convert column classes in data.table
For a single column:
dtnew <- dt[, Quarter:=as.character(Quarter)]
str(dtnew)
Classes ‘data.table’ and 'data.frame': 10 obs. of 3 variables:
$ ID : Factor w/ 2 levels "A","B": 1 1 1 1 1 2 2 2 2 2
$ Quarter: chr "1" "2" "3" "4" ...
$ value : num -0.838 0.146 -1.059 -1.197 0.282 ...
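Because := assigns by reference, dt itself is changed and dtnew is only a second name for the same table. A minimal sketch, assuming a toy dt shaped like the one in the output above (the values are made up for illustration):

```r
library(data.table)

# Toy data resembling the table above (illustrative values only)
dt <- data.table(ID = factor(rep(c("A", "B"), each = 5)),
                 Quarter = rep(1:5, 2),
                 value = rnorm(10))

# := modifies dt in place; the modified table is returned invisibly
dtnew <- dt[, Quarter := as.character(Quarter)]
class(dt$Quarter)    # the original table has changed as well

# Take a copy() first when the original must stay untouched
dt2 <- copy(dt)[, value := as.character(value)]
class(dt$value)      # unchanged in dt
class(dt2$value)     # converted only in the copy
```

If the original table should be kept as it was, copy() before the assignment is the usual way to get an independent object.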
Using lapply and as.character:
dtnew <- dt[, lapply(.SD, as.character), by=ID]
str(dtnew)
Classes ‘data.table’ and 'data.frame': 10 obs. of 3 variables:
$ ID : Factor w/ 2 levels "A","B": 1 1 1 1 1 2 2 2 2 2
$ Quarter: chr "1" "2" "3" "4" ...
$ value : chr "1.487145280568" "-0.827845218358881" "0.028977182770002" "1.35392750102305" ...
Convert *some* column classes in data.table
Besides using the option as suggested by Matt Dowle, another way of changing the column classes is as follows:
dat[, (cols) := lapply(.SD, factor), .SDcols = cols]
By using the := operator, you update the data.table by reference. A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
As suggested by @MattDowle in the comments, you can also use a combination of for(...) set(...) as follows:
for (col in cols) set(dat, j = col, value = factor(dat[[col]]))
which will give the same result. A third alternative is:
for (col in cols) dat[, (col) := factor(dat[[col]])]
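The three variants can be compared side by side. This is a sketch on made-up data; make_dat() is a hypothetical helper that only exists so each variant starts from a fresh table:

```r
library(data.table)

# Hypothetical helper: builds a fresh toy table for each variant
make_dat <- function() {
  data.table(ID = rep(c("A", "B"), each = 5),
             Quarter = rep(1:5, 2),
             value = rnorm(10))
}
cols <- c("ID", "Quarter")

# 1. lapply() over the columns named in .SDcols
d1 <- make_dat()
d1[, (cols) := lapply(.SD, factor), .SDcols = cols]

# 2. for loop with set()
d2 <- make_dat()
for (col in cols) set(d2, j = col, value = factor(d2[[col]]))

# 3. for loop with :=
d3 <- make_dat()
for (col in cols) d3[, (col) := factor(d3[[col]])]

# All three end up with the same column classes
sapply(d1, class)
sapply(d2, class)
sapply(d3, class)
```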
On smaller datasets, the for(...) set(...) option is about three times faster than the lapply option (but that doesn't really matter, because the dataset is small). On larger datasets (e.g. 2 million rows), each of these approaches takes about the same amount of time. For testing on a larger dataset, I used:
dat <- data.table(ID=c(rep("A", 1e6), rep("B",1e6)),
Quarter=c(1:1e6, 1:1e6),
value=rnorm(10))
Sometimes, you will have to do it a bit differently (for example when numeric values are stored as a factor). Then you have to use something like this:
dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]
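The reason for going through as.character() is that converting a factor directly yields its internal level codes, not the numbers shown as labels. A small sketch on hypothetical data:

```r
library(data.table)

# Hypothetical data: numbers that ended up stored as a factor
dat <- data.table(ID = c("A", "B", "C"),
                  Quarter = factor(c("10", "20", "30")))
cols <- "Quarter"

as.integer(dat$Quarter)                 # 1 2 3: the internal level codes
as.integer(as.character(dat$Quarter))   # 10 20 30: the values themselves

# Convert via character so the labels, not the codes, are kept
dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]
str(dat)
```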
WARNING: The following explanation is not the data.table way of doing things. The data.table is not updated by reference because a copy is made and stored in memory (as pointed out by @Frank), which increases memory usage. It is more an addition in order to explain the working of with = FALSE.
When you want to change the column classes the same way as you would do with a data.frame, you have to add with = FALSE as follows:
dat[, cols] <- lapply(dat[, cols, with = FALSE], factor)
A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
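If you don't add with = FALSE, data.table evaluates cols as an expression instead of treating it as a set of column names, so dat[, cols] gives back the character vector itself (recent data.table versions raise an error here instead and point you to dat[, ..cols]). Check the difference in output between dat[, cols] and dat[, cols, with = FALSE]: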
> dat[, cols]
[1] "ID" "Quarter"
> dat[, cols, with = FALSE]
ID Quarter
1: A 1
2: A 2
3: A 3
4: A 4
5: A 5
6: B 1
7: B 2
8: B 3
9: B 4
10: B 5
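Besides with = FALSE, the same selection can be written with the .. prefix or with .SDcols. A sketch on made-up data showing that the three forms agree (the object names s1 to s3 are only for illustration):

```r
library(data.table)

dat <- data.table(ID = rep(c("A", "B"), each = 5),
                  Quarter = rep(1:5, 2),
                  value = rnorm(10))
cols <- c("ID", "Quarter")

s1 <- dat[, cols, with = FALSE]    # cols is treated as a vector of column names
s2 <- dat[, ..cols]                # .. means: look up cols in the calling scope
s3 <- dat[, .SD, .SDcols = cols]   # .SD restricted to the columns in cols

all.equal(s1, s2)
all.equal(s1, s3)
```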
Issue converting multiple column classes in R data.table
factcols in .SDcols=factcols should be a length-4 logical vector (one element per column of norw5) or a vector of column names or positions, e.g. .SDcols = c("Born_before_2016") or .SDcols = 1, but factcols <- sapply(norw5[,..varls], is.numeric) returns a length-3 logical vector.
It can be fixed as follows:
fact <- c('Born_before_2016','gender','payor')
factcols <- sapply(norw5[,..fact], is.numeric)
cols <- names(norw5)[1:3][factcols]
norw5new <- norw5[,(cols) := lapply(.SD,as.character),.SDcols=cols]
norw5new
# Born_before_2016 gender payor Age_in_day
# <char> <char> <char> <int>
#1: 1 2.Female 1:Private 0
#2: 1 1.Male 1:Private 0
#3: 1 2.Female 4:Other 0
#4: 1 1.Male 4:Other 4
#5: 1 1.Male 1:Private 5
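A variation that avoids the length mismatch altogether is to index the vector of candidate names directly, so that cols always holds column names. This is a sketch on a hypothetical stand-in for norw5, since the real data are not shown:

```r
library(data.table)

# Hypothetical stand-in for norw5 (values are made up)
norw5 <- data.table(Born_before_2016 = c(1, 1, 1),
                    gender = c("2.Female", "1.Male", "2.Female"),
                    payor = c("1:Private", "1:Private", "4:Other"),
                    Age_in_day = c(0L, 0L, 4L))

fact <- c("Born_before_2016", "gender", "payor")

# Keep only the candidate columns that are numeric, as names
cols <- fact[sapply(norw5[, ..fact], is.numeric)]
cols

norw5[, (cols) := lapply(.SD, as.character), .SDcols = cols]
sapply(norw5, class)
```

Because cols contains names rather than a logical mask, it stays valid however many other columns the table has.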
Set multiple column classes from a vector in data.table
Same idea as @RonakShah's answer but assuming the OP has explicitly named the columns rather than passing by position:
# different input format
cc <- setNames(col_classes, names(dtnew))
# usage
res = lapply(setNames(, names(cc)), function(n)
match.fun(sprintf("as.%s", cc[[n]]))(dtnew[[n]])
)
setDT(res)[]
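For a runnable variant, the same lookup of as.<class> functions can be combined with set() to convert the columns in place. The inputs dtnew and col_classes below are hypothetical, since the original ones are not shown:

```r
library(data.table)

# Hypothetical inputs (not from the original question)
dtnew <- data.table(a = 1:3,
                    b = c("1.5", "2.5", "3.5"),
                    c = c("x", "y", "x"))
col_classes <- c("character", "numeric", "factor")

# Name the target classes by column
cc <- setNames(col_classes, names(dtnew))

# Convert one column at a time, by reference
for (n in names(cc)) {
  conv <- match.fun(paste0("as.", cc[[n]]))   # e.g. as.numeric
  set(dtnew, j = n, value = conv(dtnew[[n]]))
}
sapply(dtnew, class)
```

This only works for classes that have a matching as.<class> function.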
Some other ways the problem might be solved:
If reading the data in, use the colClasses= argument to fread() or a similar function.
Maybe also consider type.convert, which will automatically guess and apply a class to each column. It cannot return a mix of character and factor columns, however.
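Both suggestions can be sketched on a tiny inline data set. The column names and values are invented for illustration:

```r
library(data.table)

# Inline CSV, one element per line (made-up data)
csv <- c("id,n,grp", "001,1,a", "002,2,b")

# Declare some classes up front; columns not listed are guessed by fread()
x <- fread(text = csv, colClasses = list(character = "id", numeric = "n"))
sapply(x, class)
x$id   # leading zeros survive because id was read as character

# type.convert() guesses a class per column;
# as.is = TRUE keeps strings as character instead of factor
y <- data.table(n = c("1", "2"), s = c("a", "b"))
y <- y[, lapply(.SD, type.convert, as.is = TRUE)]
sapply(y, class)
```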
Convert columns of arbitrary class to the class of matching columns in another data.table
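Not very elegant, but you may 'build' the as.* call like this. Use [[x]] rather than [, x], because in a data.table B[, x] does not return the column named by x:
for (x in colnames(A)) { A[[x]] <- eval(call(paste0("as.", class(B[[x]])[1L]), A[[x]])) }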
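A self-contained sketch of the same idea, using match.fun() and set() so that A is updated by reference. The tables A and B are hypothetical:

```r
library(data.table)

# Hypothetical tables: A holds everything as character,
# B carries the classes we want A to have
A <- data.table(id = c("1", "2"), score = c("0.5", "1.5"), grp = c("x", "y"))
B <- data.table(id = 1:2, score = c(0.1, 0.2), grp = factor(c("x", "y")))

for (x in intersect(names(A), names(B))) {
  # class() can return several values (e.g. for POSIXct); use the first
  conv <- match.fun(paste0("as.", class(B[[x]])[1L]))
  set(A, j = x, value = conv(A[[x]]))
}
sapply(A, class)
```

As with the one-liner, this assumes an as.<class> function exists for every class found in B.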
Change the class from factor to numeric of many columns in a data frame
Further to Ramnath's answer, the behaviour you are experiencing is due to as.numeric(x) returning the internal, numeric representation of the factor x at the R level. If you want to preserve the numbers that are the levels of the factor (rather than their internal representation), you need to convert to character via as.character() first, as per Ramnath's example.
Your for loop is just as reasonable as an apply call and might be slightly more readable as to what the intention of the code is. Just change this line:
stats[,i] <- as.numeric(stats[,i])
to read
stats[,i] <- as.numeric(as.character(stats[,i]))
This is FAQ 7.10 in the R FAQ.
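The difference is easy to see on a small example. The data frame below is hypothetical; the as.numeric(levels(f))[f] form is the alternative mentioned in ?factor and gives the same result as going through as.character():

```r
# Hypothetical data frame with numbers stored as factors
stats <- data.frame(a = factor(c("10", "20", "10")),
                    b = factor(c("3.5", "1.5", "2.5")))

as.numeric(stats$a)                 # 1 2 1: the internal codes
as.numeric(as.character(stats$a))   # 10 20 10: the labels as numbers

for (i in seq_along(stats)) {
  f <- stats[, i]
  # Same result as as.numeric(as.character(f))
  stats[, i] <- as.numeric(levels(f))[f]
}
str(stats)
```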
Converting all and only suitable character columns to numeric in data.table
My first thought was to use type.convert, but that either converts character and Date to factor, or with as.is=TRUE it converts factor to character.
str(DT[, lapply(.SD, type.convert)])
# Classes 'data.table' and 'data.frame': 100 obs. of 13 variables:
# $ panelID : int 4 39 1 34 23 43 14 18 33 21 ...
# $ Country : Factor w/ 3 levels "Albania","Belarus",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ some_NA : int 0 2 4 1 5 3 0 2 4 1 ...
# $ some_NA_factor: int 3 2 0 5 1 4 3 2 0 5 ...
# $ Group : int 1 1 1 1 1 1 1 1 1 1 ...
# $ Time : Factor w/ 20 levels "2010-01-02","2010-02-02",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ wt : num 0.15 0.3 0.15 0.9 1.35 1.2 1.2 0.75 0.6 1.2 ...
# $ Income : num -4.4 -6.41 2.28 -3.85 -0.02 ...
# $ Happiness : int 3 10 6 9 5 7 4 1 2 8 ...
# $ Sex : num 0.61 1.18 0.55 0.69 0.63 0.65 0.67 0.9 0.7 0.6 ...
# $ Age : int 15 2 65 67 73 17 84 5 41 91 ...
# $ Educ : num 0.54 1.04 1.29 0.43 0.76 0.63 0.6 0.44 0.48 1.13 ...
# $ uniqueID : int 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, ".internal.selfref")=<externalptr>
str(DT[, lapply(.SD, type.convert, as.is = TRUE)])
# Classes 'data.table' and 'data.frame': 100 obs. of 13 variables:
# $ panelID : int 4 39 1 34 23 43 14 18 33 21 ...
# $ Country : chr "Albania" "Albania" "Albania" "Albania" ...
# $ some_NA : int 0 2 4 1 5 3 0 2 4 1 ...
# $ some_NA_factor: int 3 2 0 5 1 4 3 2 0 5 ...
# $ Group : int 1 1 1 1 1 1 1 1 1 1 ...
# $ Time : chr "2010-01-02" "2010-02-02" "2010-03-02" "2010-04-02" ...
# $ wt : num 0.15 0.3 0.15 0.9 1.35 1.2 1.2 0.75 0.6 1.2 ...
# $ Income : num -4.4 -6.41 2.28 -3.85 -0.02 ...
# $ Happiness : int 3 10 6 9 5 7 4 1 2 8 ...
# $ Sex : num 0.61 1.18 0.55 0.69 0.63 0.65 0.67 0.9 0.7 0.6 ...
# $ Age : int 15 2 65 67 73 17 84 5 41 91 ...
# $ Educ : num 0.54 1.04 1.29 0.43 0.76 0.63 0.6 0.44 0.48 1.13 ...
# $ uniqueID : int 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, ".internal.selfref")=<externalptr>
So I think we need our own function with similar intentions.
mytype <- function(z) if (is.character(z) && all(grepl("^-?[\\d.]+(?:e-?\\d+)?$", z, perl = TRUE))) as.numeric(z) else z
str(DT[, lapply(.SD, mytype)])
# Classes 'data.table' and 'data.frame': 100 obs. of 13 variables:
# $ panelID : int 4 39 1 34 23 43 14 18 33 21 ...
# $ Country : chr "Albania" "Albania" "Albania" "Albania" ...
# $ some_NA : int 0 2 4 1 5 3 0 2 4 1 ...
# $ some_NA_factor: Factor w/ 6 levels "0","1","2","3",..: 4 3 1 6 2 5 4 3 1 6 ...
# $ Group : num 1 1 1 1 1 1 1 1 1 1 ...
# $ Time : Date, format: "2010-01-02" "2010-02-02" "2010-03-02" ...
# $ wt : num 0.15 0.3 0.15 0.9 1.35 1.2 1.2 0.75 0.6 1.2 ...
# $ Income : num -4.4 -6.41 2.28 -3.85 -0.02 ...
# $ Happiness : int 3 10 6 9 5 7 4 1 2 8 ...
# $ Sex : num 0.61 1.18 0.55 0.69 0.63 0.65 0.67 0.9 0.7 0.6 ...
# $ Age : int 15 2 65 67 73 17 84 5 41 91 ...
# $ Educ : num 0.54 1.04 1.29 0.43 0.76 0.63 0.6 0.44 0.48 1.13 ...
# $ uniqueID : int 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, ".internal.selfref")=<externalptr>
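It can help to see what the pattern accepts before relying on it. The sketch below tests a few sample strings: the pattern lets "1.2.3" through (which as.numeric() would turn into NA), and a column containing NA fails the all() check because grepl() returns FALSE for NA. The function mytype2 is a hypothetical alternative, not part of the answer above, that lets as.numeric() decide instead of a regex. It is more permissive, for example it also accepts hexadecimal strings.

```r
pat <- "^-?[\\d.]+(?:e-?\\d+)?$"
x <- c("1.5", "-2", "1e-3", "abc", "2010-01-02", "1.2.3", NA)
data.frame(x = x, matches = grepl(pat, x, perl = TRUE))

# Hypothetical alternative: convert only when every non-missing
# value parses as a number, leaving existing NAs as NA
mytype2 <- function(z) {
  if (!is.character(z)) return(z)
  num <- suppressWarnings(as.numeric(z))
  if (all(is.na(num) == is.na(z))) num else z
}

mytype2(c("1", "2.5", NA))   # converted to numeric
mytype2(c("1", "b"))         # left as character
```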
With larger data, you may prefer to break the grepl condition out so that you define which columns to work on:
mytypetest <- function(z) is.character(z) && all(grepl("^-?[\\d.]+(?:e-?\\d+)?$", z, perl = TRUE))
cols <- which(sapply(DT, mytypetest))
cols
# Group
# 5
DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
str(DT)
# Classes 'data.table' and 'data.frame': 100 obs. of 13 variables:
# $ panelID : int 4 39 1 34 23 43 14 18 33 21 ...
# $ Country : chr "Albania" "Albania" "Albania" "Albania" ...
# $ some_NA : int 0 2 4 1 5 3 0 2 4 1 ...
# $ some_NA_factor: Factor w/ 6 levels "0","1","2","3",..: 4 3 1 6 2 5 4 3 1 6 ...
# $ Group : num 1 1 1 1 1 1 1 1 1 1 ...
# $ Time : Date, format: "2010-01-02" "2010-02-02" "2010-03-02" ...
# $ wt : num 0.15 0.3 0.15 0.9 1.35 1.2 1.2 0.75 0.6 1.2 ...
# $ Income : num -4.4 -6.41 2.28 -3.85 -0.02 ...
# $ Happiness : int 3 10 6 9 5 7 4 1 2 8 ...
# $ Sex : num 0.61 1.18 0.55 0.69 0.63 0.65 0.67 0.9 0.7 0.6 ...
# $ Age : int 15 2 65 67 73 17 84 5 41 91 ...
# $ Educ : num 0.54 1.04 1.29 0.43 0.76 0.63 0.6 0.44 0.48 1.13 ...
# $ uniqueID : int 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, ".internal.selfref")=<externalptr>
This last one will technically be faster with data of any size, but the difference is likely to be noticeable only for larger data (more columns and/or rows).