Convert *Some* Column Classes in Data.Table

Convert column classes in data.table

For a single column:

dtnew <- dt[, Quarter:=as.character(Quarter)]
str(dtnew)

Classes ‘data.table’ and 'data.frame': 10 obs. of 3 variables:
$ ID : Factor w/ 2 levels "A","B": 1 1 1 1 1 2 2 2 2 2
$ Quarter: chr "1" "2" "3" "4" ...
$ value : num -0.838 0.146 -1.059 -1.197 0.282 ...
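Note that := modifies dt itself, so dtnew above is just another name for the same table. If the original has to stay untouched, one option is to convert an explicit copy. A minimal sketch, assuming a toy table shaped like the one in the output above (the data here is made up for illustration):

```r
library(data.table)

# Hypothetical toy table mirroring the structure shown above
dt <- data.table(ID = factor(rep(c("A", "B"), each = 5)),
                 Quarter = rep(1:5, 2),
                 value = rnorm(10))

# copy() makes a deep copy, so := only touches the copy
dtnew <- copy(dt)[, Quarter := as.character(Quarter)]

class(dt$Quarter)     # "integer": the original is untouched
class(dtnew$Quarter)  # "character": only the copy was converted
```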

Using lapply and as.character:

dtnew <- dt[, lapply(.SD, as.character), by=ID]
str(dtnew)

Classes ‘data.table’ and 'data.frame': 10 obs. of 3 variables:
$ ID : Factor w/ 2 levels "A","B": 1 1 1 1 1 2 2 2 2 2
$ Quarter: chr "1" "2" "3" "4" ...
$ value : chr "1.487145280568" "-0.827845218358881" "0.028977182770002" "1.35392750102305" ...

Convert *some* column classes in data.table

Besides using the option as suggested by Matt Dowle, another way of changing the column classes is as follows:

dat[, (cols) := lapply(.SD, factor), .SDcols = cols]

By using the := operator you update the data.table by reference. A check whether this worked:

> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"

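To see the by-reference update for yourself, you can compare the memory address of the table before and after the assignment. This is only a sketch: dat and cols are not defined in the snippet above, so the definitions below are assumptions chosen to match the printed output.

```r
library(data.table)

# Assumed example data; ID and Quarter are the columns to convert
dat <- data.table(ID = rep(c("A", "B"), each = 5),
                  Quarter = rep(1:5, 2),
                  value = rnorm(10))
cols <- c("ID", "Quarter")

addr_before <- address(dat)
dat[, (cols) := lapply(.SD, factor), .SDcols = cols]
addr_after <- address(dat)

# Same address means the table object itself was not copied
identical(addr_before, addr_after)
sapply(dat, class)
```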
As suggested by @MattDowle in the comments, you can also use a combination of for(...) set(...) as follows:

for (col in cols) set(dat, j = col, value = factor(dat[[col]]))

which will give the same result. A third alternative is:

for (col in cols) dat[, (col) := factor(dat[[col]])]

On a small dataset, the for(...) set(...) option is about three times faster than the lapply option (though that hardly matters at that size). On larger datasets (e.g. 2 million rows), each of these approaches takes about the same amount of time. For testing on a larger dataset, I used:

dat <- data.table(ID=c(rep("A", 1e6), rep("B",1e6)),
Quarter=c(1:1e6, 1:1e6),
value=rnorm(10))
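If you want to repeat the comparison yourself, something along these lines should do. It is a rough sketch rather than a proper benchmark: the row count n is an assumption kept small so it runs quickly, make_dat is a hypothetical helper, and the timings will differ per machine.

```r
library(data.table)

n <- 1e5  # assumed size per ID; increase to 1e6 to mirror the test above

# Hypothetical helper that builds a fresh test table
make_dat <- function() data.table(ID = rep(c("A", "B"), each = n),
                                  Quarter = rep(seq_len(n), 2),
                                  value = rnorm(2 * n))
cols <- c("ID", "Quarter")

d1 <- make_dat()
d2 <- copy(d1)
d3 <- copy(d1)

t_lapply <- system.time(d1[, (cols) := lapply(.SD, factor), .SDcols = cols])
t_set    <- system.time(for (col in cols) set(d2, j = col, value = factor(d2[[col]])))
t_loop   <- system.time(for (col in cols) d3[, (col) := factor(d3[[col]])])

# Elapsed seconds per approach
c(lapply = t_lapply[["elapsed"]], set = t_set[["elapsed"]], loop = t_loop[["elapsed"]])

# All three approaches should leave identical columns behind
identical(d1$ID, d2$ID) && identical(d2$Quarter, d3$Quarter)
```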

Sometimes, you will have to do it a bit differently (for example when numeric values are stored as a factor). Then you have to use something like this:

dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]
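The reason for the detour via as.character is that as.integer on a factor returns the internal level codes, not the labels. A small sketch with made-up data to show the difference:

```r
library(data.table)

# Made-up example: numbers stored as factor labels
dat <- data.table(ID = factor(c("10", "20", "20", "30")), value = 1:4)
cols <- "ID"

as.integer(dat$ID)  # 1 2 2 3: the internal codes, not the values

dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]
dat$ID              # 10 20 20 30: the values that were stored as labels
```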


WARNING: The following explanation is not the data.table way of doing things. The data.table is not updated by reference because a copy is made and stored in memory (as pointed out by @Frank), which increases memory usage. It is included mainly to explain how with = FALSE works.

When you want to change the column classes the same way as you would do with a data frame, you have to add with = FALSE as follows:

dat[, cols] <- lapply(dat[, cols, with = FALSE], factor)

A check whether this worked:

> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"

If you don't add with = FALSE, data.table will evaluate cols in dat[, cols] as an ordinary variable and return its value (a character vector) instead of selecting the columns. Check the difference in output between dat[, cols] and dat[, cols, with = FALSE]:

> dat[, cols]
[1] "ID" "Quarter"

> dat[, cols, with = FALSE]
    ID Quarter
 1:  A       1
 2:  A       2
 3:  A       3
 4:  A       4
 5:  A       5
 6:  B       1
 7:  B       2
 8:  B       3
 9:  B       4
10:  B       5
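There are also other ways to select columns by a character vector: the .. prefix, or .SD together with .SDcols. As far as I can tell they return the same columns as with = FALSE. A short sketch, again assuming the example dat and cols used above:

```r
library(data.table)

# Assumed example data
dat <- data.table(ID = rep(c("A", "B"), each = 5),
                  Quarter = rep(1:5, 2),
                  value = rnorm(10))
cols <- c("ID", "Quarter")

a  <- dat[, cols, with = FALSE]     # the form discussed above
b  <- dat[, ..cols]                 # '..' looks cols up in the calling scope
c2 <- dat[, .SD, .SDcols = cols]    # .SD restricted to the chosen columns

all.equal(a, b)
all.equal(a, c2)
```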

Issue converting multiple column classes in R data.table

The .SDcols argument must be either a logical vector with one element per column of the table (length 4 here) or a vector of column names or positions, e.g. .SDcols = c("Born_before_2016") or .SDcols = 1. However, factcols <- sapply(norw5[,..varls], is.numeric) returns a logical vector of length 3.
It can be fixed as follows:

fact <- c('Born_before_2016','gender','payor')
factcols <- sapply(norw5[,..fact], is.numeric)
cols <- fact[factcols]  # index the candidate names directly, so the column order of norw5 does not matter
norw5new <- norw5[,(cols) := lapply(.SD,as.character),.SDcols=cols]
norw5new

# Born_before_2016 gender payor Age_in_day
# <char> <char> <char> <int>
#1: 1 2.Female 1:Private 0
#2: 1 1.Male 1:Private 0
#3: 1 2.Female 4:Other 0
#4: 1 1.Male 4:Other 4
#5: 1 1.Male 1:Private 5
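For anyone who wants to run this end to end, here is a self-contained sketch. The table is rebuilt by hand from the printed output, on the assumption that Born_before_2016 was numeric before the conversion; is_num is just an illustrative name.

```r
library(data.table)

# Rebuilt from the output above; the original column types are assumed
norw5 <- data.table(Born_before_2016 = c(1, 1, 1, 1, 1),
                    gender = c("2.Female", "1.Male", "2.Female", "1.Male", "1.Male"),
                    payor = c("1:Private", "1:Private", "4:Other", "4:Other", "1:Private"),
                    Age_in_day = c(0L, 0L, 0L, 4L, 5L))

fact <- c("Born_before_2016", "gender", "payor")

# One logical per candidate column: length 3, while norw5 has 4 columns
is_num <- vapply(norw5[, ..fact], is.numeric, logical(1))
c(length(is_num), ncol(norw5))

# Turn the logical vector into column names before using it in .SDcols
cols <- fact[is_num]
norw5[, (cols) := lapply(.SD, as.character), .SDcols = cols]
sapply(norw5, class)
```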

Set multiple column classes from a vector in data.table

Same idea as @RonakShah's answer but assuming the OP has explicitly named the columns rather than passing by position:

# different input format
cc <- setNames(col_classes, names(dtnew))

# usage
res = lapply(setNames(, names(cc)), function(n)
match.fun(sprintf("as.%s", cc[[n]]))(dtnew[[n]])
)
setDT(res)[]
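If you would rather update the existing table in place than build a new one, a variant with set() might look like this. It is only a sketch: dtnew and col_classes below are made-up stand-ins for the OP's objects.

```r
library(data.table)

# Hypothetical inputs standing in for the OP's objects
dtnew <- data.table(a = c("1", "2"), b = c("x", "y"), c = c("1.5", "2.5"))
col_classes <- c("integer", "factor", "numeric")
cc <- setNames(col_classes, names(dtnew))

# By-reference variant: overwrite each column instead of building a new list
for (n in names(cc)) {
  conv <- match.fun(sprintf("as.%s", cc[[n]]))
  set(dtnew, j = n, value = conv(dtnew[[n]]))
}

sapply(dtnew, class)
```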

Some other ways the problem might be solved:

  • If reading the data in, use the colClasses= argument to fread() or a similar function.

  • Maybe also consider type.convert which will automatically guess and apply a class to each column. It cannot return a mix of character and factor columns, however.
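As a small illustration of the first option, here is a sketch that reads inline text with fread() and forces one column to character through a named colClasses vector. The CSV content is made up for the example.

```r
library(data.table)

# Made-up CSV content, passed through fread's text argument
csv <- "ID,Quarter,value\nA,1,0.5\nB,2,1.5\n"

# Named colClasses: only Quarter is overridden, the rest is auto-detected
dt <- fread(text = csv, colClasses = c(Quarter = "character"))
sapply(dt, class)
```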

Convert columns of arbitrary class to the class of matching columns in another data.table

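Not very elegant, but you may 'build' the as.* call like this (using [[ ]] so that each column is extracted as a vector in a data.table as well as in a data.frame, and taking the first class in case a column has several):

for (x in colnames(A)) { A[[x]] <- eval( call( paste0("as.", class(B[[x]])[1]), A[[x]]) )}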
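A sketch of the same idea written for data.table inputs, using match.fun() in place of eval(call(...)) and set() to assign by reference. A and B here are invented examples, and this only works for classes that have a matching as.<class> function.

```r
library(data.table)

# Hypothetical inputs: A holds everything as character, B has the target classes
A <- data.table(id = c("1", "2"), score = c("0.5", "1.5"), grp = c("a", "b"))
B <- data.table(id = 1:2, score = c(0.1, 0.2), grp = factor(c("x", "y")))

for (x in intersect(names(A), names(B))) {
  # class() can return several values; the first one picks the as.<class> function
  conv <- match.fun(paste0("as.", class(B[[x]])[1L]))
  set(A, j = x, value = conv(A[[x]]))
}

sapply(A, class)
```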

Change the class from factor to numeric of many columns in a data frame

Further to Ramnath's answer, the behaviour you are experiencing is due to as.numeric(x) returning the internal, numeric representation of the factor x at the R level. If you want to preserve the numbers that are the levels of the factor (rather than their internal representation), you need to convert to character via as.character() first as per Ramnath's example.

Your for loop is just as reasonable as an apply call and might be slightly more readable as to what the intention of the code is. Just change this line:

stats[,i] <- as.numeric(stats[,i])

to read

stats[,i] <- as.numeric(as.character(stats[,i]))

This is FAQ 7.10 in the R FAQ.
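The same FAQ entry also mentions converting the levels once and then indexing them by the factor codes, which avoids converting every element to character. A sketch of the loop in that style, with a made-up stats data frame:

```r
# Made-up data frame with numbers stored as factors
stats <- data.frame(a = factor(c("3.5", "10", "3.5")),
                    b = factor(c("7", "8", "9")))

for (i in seq_along(stats)) {
  if (is.factor(stats[[i]])) {
    # Convert the levels once, then pick them out by the integer codes
    stats[[i]] <- as.numeric(levels(stats[[i]]))[as.integer(stats[[i]])]
  }
}

stats$a  # 3.5 10.0 3.5
sapply(stats, class)
```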

HTH

Converting all and only suitable character columns to numeric in data.table

My first thought was to use type.convert, but that either converts character and Date to factor, or with as.is=TRUE it converts factor to character.

str(DT[, lapply(.SD, type.convert)])
# Classes 'data.table' and 'data.frame': 100 obs. of 13 variables:
# $ panelID : int 4 39 1 34 23 43 14 18 33 21 ...
# $ Country : Factor w/ 3 levels "Albania","Belarus",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ some_NA : int 0 2 4 1 5 3 0 2 4 1 ...
# $ some_NA_factor: int 3 2 0 5 1 4 3 2 0 5 ...
# $ Group : int 1 1 1 1 1 1 1 1 1 1 ...
# $ Time : Factor w/ 20 levels "2010-01-02","2010-02-02",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ wt : num 0.15 0.3 0.15 0.9 1.35 1.2 1.2 0.75 0.6 1.2 ...
# $ Income : num -4.4 -6.41 2.28 -3.85 -0.02 ...
# $ Happiness : int 3 10 6 9 5 7 4 1 2 8 ...
# $ Sex : num 0.61 1.18 0.55 0.69 0.63 0.65 0.67 0.9 0.7 0.6 ...
# $ Age : int 15 2 65 67 73 17 84 5 41 91 ...
# $ Educ : num 0.54 1.04 1.29 0.43 0.76 0.63 0.6 0.44 0.48 1.13 ...
# $ uniqueID : int 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, ".internal.selfref")=<externalptr>
str(DT[, lapply(.SD, type.convert, as.is = TRUE)])
# Classes 'data.table' and 'data.frame': 100 obs. of 13 variables:
# $ panelID : int 4 39 1 34 23 43 14 18 33 21 ...
# $ Country : chr "Albania" "Albania" "Albania" "Albania" ...
# $ some_NA : int 0 2 4 1 5 3 0 2 4 1 ...
# $ some_NA_factor: int 3 2 0 5 1 4 3 2 0 5 ...
# $ Group : int 1 1 1 1 1 1 1 1 1 1 ...
# $ Time : chr "2010-01-02" "2010-02-02" "2010-03-02" "2010-04-02" ...
# $ wt : num 0.15 0.3 0.15 0.9 1.35 1.2 1.2 0.75 0.6 1.2 ...
# $ Income : num -4.4 -6.41 2.28 -3.85 -0.02 ...
# $ Happiness : int 3 10 6 9 5 7 4 1 2 8 ...
# $ Sex : num 0.61 1.18 0.55 0.69 0.63 0.65 0.67 0.9 0.7 0.6 ...
# $ Age : int 15 2 65 67 73 17 84 5 41 91 ...
# $ Educ : num 0.54 1.04 1.29 0.43 0.76 0.63 0.6 0.44 0.48 1.13 ...
# $ uniqueID : int 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, ".internal.selfref")=<externalptr>

So I think we need our own function with similar intentions.

mytype <- function(z) if (is.character(z) && all(grepl("^-?[\\d.]+(?:e-?\\d+)?$", z, perl = TRUE))) as.numeric(z) else z
str(DT[, lapply(.SD, mytype)])
# Classes 'data.table' and 'data.frame': 100 obs. of 13 variables:
# $ panelID : int 4 39 1 34 23 43 14 18 33 21 ...
# $ Country : chr "Albania" "Albania" "Albania" "Albania" ...
# $ some_NA : int 0 2 4 1 5 3 0 2 4 1 ...
# $ some_NA_factor: Factor w/ 6 levels "0","1","2","3",..: 4 3 1 6 2 5 4 3 1 6 ...
# $ Group : num 1 1 1 1 1 1 1 1 1 1 ...
# $ Time : Date, format: "2010-01-02" "2010-02-02" "2010-03-02" ...
# $ wt : num 0.15 0.3 0.15 0.9 1.35 1.2 1.2 0.75 0.6 1.2 ...
# $ Income : num -4.4 -6.41 2.28 -3.85 -0.02 ...
# $ Happiness : int 3 10 6 9 5 7 4 1 2 8 ...
# $ Sex : num 0.61 1.18 0.55 0.69 0.63 0.65 0.67 0.9 0.7 0.6 ...
# $ Age : int 15 2 65 67 73 17 84 5 41 91 ...
# $ Educ : num 0.54 1.04 1.29 0.43 0.76 0.63 0.6 0.44 0.48 1.13 ...
# $ uniqueID : int 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, ".internal.selfref")=<externalptr>

With larger data, you may prefer to break the grepl condition out so that you define which columns to work on:

mytypetest <- function(z) is.character(z) && all(grepl("^-?[\\d.]+(?:e-?\\d+)?$", z, perl = TRUE))
cols <- which(sapply(DT, mytypetest))
cols
# Group
# 5
DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
str(DT)
# Classes 'data.table' and 'data.frame': 100 obs. of 13 variables:
# $ panelID : int 4 39 1 34 23 43 14 18 33 21 ...
# $ Country : chr "Albania" "Albania" "Albania" "Albania" ...
# $ some_NA : int 0 2 4 1 5 3 0 2 4 1 ...
# $ some_NA_factor: Factor w/ 6 levels "0","1","2","3",..: 4 3 1 6 2 5 4 3 1 6 ...
# $ Group : num 1 1 1 1 1 1 1 1 1 1 ...
# $ Time : Date, format: "2010-01-02" "2010-02-02" "2010-03-02" ...
# $ wt : num 0.15 0.3 0.15 0.9 1.35 1.2 1.2 0.75 0.6 1.2 ...
# $ Income : num -4.4 -6.41 2.28 -3.85 -0.02 ...
# $ Happiness : int 3 10 6 9 5 7 4 1 2 8 ...
# $ Sex : num 0.61 1.18 0.55 0.69 0.63 0.65 0.67 0.9 0.7 0.6 ...
# $ Age : int 15 2 65 67 73 17 84 5 41 91 ...
# $ Educ : num 0.54 1.04 1.29 0.43 0.76 0.63 0.6 0.44 0.48 1.13 ...
# $ uniqueID : int 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, ".internal.selfref")=<externalptr>

This last approach will technically be faster with data of any size, but the difference is likely to be noticeable only for larger data (more columns and/or rows).
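One caveat with the regex test: it also matches strings such as "1.2.3", which as.numeric turns into NA, and a single NA in a column makes all(grepl(...)) fail. A possible alternative is to attempt the conversion and check that it introduces no new NAs. This is a sketch with a hypothetical helper name and made-up data, and it accepts whatever as.numeric accepts, which is broader than the regex.

```r
library(data.table)

# Hypothetical helper: TRUE when every non-missing string parses as a number
is_numeric_like <- function(z) {
  if (!is.character(z)) return(FALSE)
  parsed <- suppressWarnings(as.numeric(z))
  # FALSE as soon as the conversion creates an NA that was not there before
  !any(is.na(parsed) & !is.na(z))
}

# Made-up table: only column a should be converted
DT <- data.table(a = c("1", "2.5", NA),
                 b = c("1.2.3", "4", "5"),
                 c = c("x", "y", "z"),
                 d = 1:3)

cols <- names(which(sapply(DT, is_numeric_like)))
if (length(cols)) DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
sapply(DT, class)
```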


