Convert column classes in data.table
For a single column:
dtnew <- dt[, Quarter:=as.character(Quarter)]
str(dtnew)
Classes ‘data.table’ and 'data.frame': 10 obs. of 3 variables:
$ ID : Factor w/ 2 levels "A","B": 1 1 1 1 1 2 2 2 2 2
$ Quarter: chr "1" "2" "3" "4" ...
$ value : num -0.838 0.146 -1.059 -1.197 0.282 ...
Using lapply
and as.character
:
dtnew <- dt[, lapply(.SD, as.character), by=ID]
str(dtnew)
Classes ‘data.table’ and 'data.frame': 10 obs. of 3 variables:
$ ID : Factor w/ 2 levels "A","B": 1 1 1 1 1 2 2 2 2 2
$ Quarter: chr "1" "2" "3" "4" ...
$ value : chr "1.487145280568" "-0.827845218358881" "0.028977182770002" "1.35392750102305" ...
Convert *some* column classes in data.table
Besides using the option as suggested by Matt Dowle, another way of changing the column classes is as follows:
dat[, (cols) := lapply(.SD, factor), .SDcols = cols]
By using the :=
operator you update the datatable by reference. A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
As suggeted by @MattDowle in the comments, you can also use a combination of for(...) set(...)
as follows:
for (col in cols) set(dat, j = col, value = factor(dat[[col]]))
which will give the same result. A third alternative is:
for (col in cols) dat[, (col) := factor(dat[[col]])]
On a smaller datasets, the for(...) set(...)
option is about three times faster than the lapply
option (but that doesn't really matter, because it is a small dataset). On larger datasets (e.g. 2 million rows), each of these approaches takes about the same amount of time. For testing on a larger dataset, I used:
dat <- data.table(ID=c(rep("A", 1e6), rep("B",1e6)),
Quarter=c(1:1e6, 1:1e6),
value=rnorm(10))
Sometimes, you will have to do it a bit differently (for example when numeric values are stored as a factor). Then you have to use something like this:
dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]
WARNING: The following explanation is not the data.table
-way of doing things. The datatable is not updated by reference because a copy is made and stored in memory (as pointed out by @Frank), which increases memory usage. It is more an addition in order to explain the working of with = FALSE
.
When you want to change the column classes the same way as you would do with a dataframe, you have to add with = FALSE
as follows:
dat[, cols] <- lapply(dat[, cols, with = FALSE], factor)
A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
If you don't add with = FALSE
, datatable will evaluate dat[, cols]
as a vector. Check the difference in output between dat[, cols]
and dat[, cols, with = FALSE]
:
> dat[, cols]
[1] "ID" "Quarter"
> dat[, cols, with = FALSE]
ID Quarter
1: A 1
2: A 2
3: A 3
4: A 4
5: A 5
6: B 1
7: B 2
8: B 3
9: B 4
10: B 5
Issue converting multiple column classes in R data.table
factcols
in .SDcols=factcols
should be a length-4 logical vector or the vector of column name/position, e.g. .SDcols = c("Born_before_2016"),.SDcols = 1
, but factcols <- sapply(norw5[,..varls], is.numeric)
returns length-3 logical vector.
It can be fixed as
fact <- c('Born_before_2016','gender','payor')
factcols <- sapply(norw5[,..fact], is.numeric)
cols <- names(norw5)[1:3][factcols]
norw5new <- norw5[,(cols) := lapply(.SD,as.character),.SDcols=cols]
norw5new
# Born_before_2016 gender payor Age_in_day
# <char> <char> <char> <int>
#1: 1 2.Female 1:Private 0
#2: 1 1.Male 1:Private 0
#3: 1 2.Female 4:Other 0
#4: 1 1.Male 4:Other 4
#5: 1 1.Male 1:Private 5
Set multiple column classes from a vector in data.table
Same idea as @RonakShah's answer but assuming the OP has explicitly named the columns rather than passing by position:
# different input format
cc <- setNames(col_classes, names(dtnew))
# usage
res = lapply(setNames(, names(cc)), function(n)
match.fun(sprintf("as.%s", cc[[n]]))(dtnew[[n]])
)
setDT(res)[]
Some other ways the problem might be solved:
If reading the data in, use the
colClasses=
argument tofread()
or a similar function.Maybe also consider
type.convert
which will automatically guess and apply a class to each column. It cannot return a mix of character and factor columns, however.
Convert columns of arbitrary class to the class of matching columns in another data.table
Not very elegant but you may 'build' the as.*
call like this:
for (x in colnames(A)) { A[,x] <- eval( call( paste0("as.", class(B[,x])), A[,x]) )}
Change the class from factor to numeric of many columns in a data frame
Further to Ramnath's answer, the behaviour you are experiencing is that due to as.numeric(x)
returning the internal, numeric representation of the factor x
at the R level. If you want to preserve the numbers that are the levels of the factor (rather than their internal representation), you need to convert to character via as.character()
first as per Ramnath's example.
Your for
loop is just as reasonable as an apply
call and might be slightly more readable as to what the intention of the code is. Just change this line:
stats[,i] <- as.numeric(stats[,i])
to read
stats[,i] <- as.numeric(as.character(stats[,i]))
This is FAQ 7.10 in the R FAQ.
HTH
Converting all and only suitable character columns to numeric in data.table
My first thought was to use type.convert
, but that either converts character
and Date
to factor
, or with as.is=TRUE
it converts factor
to character
.
str(DT[, lapply(.SD, type.convert)])
# Classes 'data.table' and 'data.frame': 100 obs. of 13 variables:
# $ panelID : int 4 39 1 34 23 43 14 18 33 21 ...
# $ Country : Factor w/ 3 levels "Albania","Belarus",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ some_NA : int 0 2 4 1 5 3 0 2 4 1 ...
# $ some_NA_factor: int 3 2 0 5 1 4 3 2 0 5 ...
# $ Group : int 1 1 1 1 1 1 1 1 1 1 ...
# $ Time : Factor w/ 20 levels "2010-01-02","2010-02-02",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ wt : num 0.15 0.3 0.15 0.9 1.35 1.2 1.2 0.75 0.6 1.2 ...
# $ Income : num -4.4 -6.41 2.28 -3.85 -0.02 ...
# $ Happiness : int 3 10 6 9 5 7 4 1 2 8 ...
# $ Sex : num 0.61 1.18 0.55 0.69 0.63 0.65 0.67 0.9 0.7 0.6 ...
# $ Age : int 15 2 65 67 73 17 84 5 41 91 ...
# $ Educ : num 0.54 1.04 1.29 0.43 0.76 0.63 0.6 0.44 0.48 1.13 ...
# $ uniqueID : int 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, ".internal.selfref")=<externalptr>
str(DT[, lapply(.SD, type.convert, as.is = TRUE)])
# Classes 'data.table' and 'data.frame': 100 obs. of 13 variables:
# $ panelID : int 4 39 1 34 23 43 14 18 33 21 ...
# $ Country : chr "Albania" "Albania" "Albania" "Albania" ...
# $ some_NA : int 0 2 4 1 5 3 0 2 4 1 ...
# $ some_NA_factor: int 3 2 0 5 1 4 3 2 0 5 ...
# $ Group : int 1 1 1 1 1 1 1 1 1 1 ...
# $ Time : chr "2010-01-02" "2010-02-02" "2010-03-02" "2010-04-02" ...
# $ wt : num 0.15 0.3 0.15 0.9 1.35 1.2 1.2 0.75 0.6 1.2 ...
# $ Income : num -4.4 -6.41 2.28 -3.85 -0.02 ...
# $ Happiness : int 3 10 6 9 5 7 4 1 2 8 ...
# $ Sex : num 0.61 1.18 0.55 0.69 0.63 0.65 0.67 0.9 0.7 0.6 ...
# $ Age : int 15 2 65 67 73 17 84 5 41 91 ...
# $ Educ : num 0.54 1.04 1.29 0.43 0.76 0.63 0.6 0.44 0.48 1.13 ...
# $ uniqueID : int 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, ".internal.selfref")=<externalptr>
So I think we need our own function with similar intentions.
mytype <- function(z) if (is.character(z) && all(grepl("^-?[\\d.]+(?:e-?\\d+)?$", z, perl = TRUE))) as.numeric(z) else z
str(DT[, lapply(.SD, mytype)])
# Classes 'data.table' and 'data.frame': 100 obs. of 13 variables:
# $ panelID : int 4 39 1 34 23 43 14 18 33 21 ...
# $ Country : chr "Albania" "Albania" "Albania" "Albania" ...
# $ some_NA : int 0 2 4 1 5 3 0 2 4 1 ...
# $ some_NA_factor: Factor w/ 6 levels "0","1","2","3",..: 4 3 1 6 2 5 4 3 1 6 ...
# $ Group : num 1 1 1 1 1 1 1 1 1 1 ...
# $ Time : Date, format: "2010-01-02" "2010-02-02" "2010-03-02" ...
# $ wt : num 0.15 0.3 0.15 0.9 1.35 1.2 1.2 0.75 0.6 1.2 ...
# $ Income : num -4.4 -6.41 2.28 -3.85 -0.02 ...
# $ Happiness : int 3 10 6 9 5 7 4 1 2 8 ...
# $ Sex : num 0.61 1.18 0.55 0.69 0.63 0.65 0.67 0.9 0.7 0.6 ...
# $ Age : int 15 2 65 67 73 17 84 5 41 91 ...
# $ Educ : num 0.54 1.04 1.29 0.43 0.76 0.63 0.6 0.44 0.48 1.13 ...
# $ uniqueID : int 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, ".internal.selfref")=<externalptr>
With larger data, you may prefer to break the grepl
condition out so that you define which columns to work on:
mytypetest <- function(z) is.character(z) && all(grepl("^-?[\\d.]+(?:e-?\\d+)?$", z, perl = TRUE))
cols <- which(sapply(DT, mytypetest))
cols
# Group
# 5
DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
str(DT)
# Classes 'data.table' and 'data.frame': 100 obs. of 13 variables:
# $ panelID : int 4 39 1 34 23 43 14 18 33 21 ...
# $ Country : chr "Albania" "Albania" "Albania" "Albania" ...
# $ some_NA : int 0 2 4 1 5 3 0 2 4 1 ...
# $ some_NA_factor: Factor w/ 6 levels "0","1","2","3",..: 4 3 1 6 2 5 4 3 1 6 ...
# $ Group : num 1 1 1 1 1 1 1 1 1 1 ...
# $ Time : Date, format: "2010-01-02" "2010-02-02" "2010-03-02" ...
# $ wt : num 0.15 0.3 0.15 0.9 1.35 1.2 1.2 0.75 0.6 1.2 ...
# $ Income : num -4.4 -6.41 2.28 -3.85 -0.02 ...
# $ Happiness : int 3 10 6 9 5 7 4 1 2 8 ...
# $ Sex : num 0.61 1.18 0.55 0.69 0.63 0.65 0.67 0.9 0.7 0.6 ...
# $ Age : int 15 2 65 67 73 17 84 5 41 91 ...
# $ Educ : num 0.54 1.04 1.29 0.43 0.76 0.63 0.6 0.44 0.48 1.13 ...
# $ uniqueID : int 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, ".internal.selfref")=<externalptr>
This last one will be technically faster with any sized data, but it might be noticeable for larger (columns and/or rows) data.
Related Topics
Missing Legend with Ggplot2 and Geom_Line
Converting Nested List to Dataframe
Using Multiple Criteria in Subset Function and Logical Operators
How to Generate All Possible Combinations of Vectors Without Caring for Order
How to Calculate Combination and Permutation in R
Plotting Multiple Time-Series in Ggplot
How to Use the Strsplit Function with a Period
How to Name the "Row Names" Column in R
Error in Grid.Call(L_Textbounds, As.Graphicsannot(X$Label), X$X, X$Y,:Polygon Edge Not Found
Multiplying All Elements of a Vector in R
Logical Operators (And, Or) with Na, True and False
Align Ggplot2 Plots Vertically
Collapsing Data Frame by Selecting One Row Per Group
How to Work with Large Numbers in R
Avoid Ggplot Sorting the X-Axis While Plotting Geom_Bar()
Why Is Apply() Method Slower Than a for Loop in R
Remove Everything After Space in String
How to Use Spread on Multiple Columns in Tidyr Similar to Dcast