How Does One Change the Levels of a Factor Column in a data.table

You can still set them the traditional way:

levels(mydt$value) <- c(...)

This should be plenty fast unless mydt is very large, since that traditional syntax copies the entire object. You could also play the un-factoring and refactoring game... but no one likes that game anyway.

To change the levels by reference, with no copy of mydt:

setattr(mydt$value,"levels",c(...))

Be sure to assign a valid levels vector (type character, of sufficient length); otherwise you'll end up with an invalid factor. (levels<- does some checking as well as copying.)
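As a minimal sketch of the by-reference update, one might validate the replacement vector by hand before passing it to setattr(), since it is easy to end up with an invalid factor otherwise. The table contents and the new level names below are made up for illustration; only the column name value comes from the text above.

```r
library(data.table)

# Hypothetical example data (assumption, not from the original question).
mydt <- data.table(value = factor(c("a", "b", "a", "c")))
new_levels <- c("low", "mid", "high")

# Do by hand some of the sanity checks that `levels<-` would otherwise do.
stopifnot(
  is.character(new_levels),
  length(new_levels) >= nlevels(mydt$value),
  !anyDuplicated(new_levels)
)

# Relabel the existing levels in place; the integer codes are left untouched.
setattr(mydt$value, "levels", new_levels)

levels(mydt$value)        # "low" "mid" "high"
as.character(mydt$value)  # "low" "mid" "low" "high"
```

Because only the labels change, every row that used to read "a" now reads "low", and so on, position by position.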

Updating factor levels from another table by data.table

You could do the following:

subset_table[, 
(nonnumeric_column) :=
lapply(nonnumeric_column, \(x) factor(get(x), levels = unique(bigger_table[[x]])))
]

Resulting in

> lapply(subset_table, levels)
$region
[1] "region_1" "region_3" "region_2" "region_4"

$factor_column
[1] "C" "B" "A"

$numeric_column
NULL

The problem in your original solution is that x is not the name of the column but the column itself. You can see this with:

subset_table[, lapply(.SD, \(x) print(x)), .SDcols=nonnumeric_column]
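Since the function applied to .SD receives the column rather than its name, one possible variant passes both in parallel with Map(). This is only a sketch: the two tables below are invented for illustration, and just the object names (subset_table, bigger_table, nonnumeric_column) are taken from the answer above.

```r
library(data.table)

# Hypothetical data, invented for illustration.
bigger_table <- data.table(
  region         = c("region_1", "region_3", "region_2", "region_4"),
  factor_column  = c("C", "B", "A", "C"),
  numeric_column = 1:4
)
subset_table <- bigger_table[region %in% c("region_1", "region_2")]
nonnumeric_column <- c("region", "factor_column")

# Map() walks over the columns (.SD) and their names side by side,
# so the function gets the data and the name needed for the lookup.
subset_table[, (nonnumeric_column) := Map(
  function(column, name) factor(column, levels = unique(bigger_table[[name]])),
  .SD, nonnumeric_column
), .SDcols = nonnumeric_column]

lapply(subset_table, levels)
```

The subset keeps only two regions, but both factor columns carry the full set of levels found in bigger_table.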

Change factor levels in data.tables by name

This would work:

setattr(mydt[["value"]],"levels",c(...))

R change datatable factor levels using column index

You could still do it the data.frame way:

levels(DT[[2]]) <- c("d", "e", "f")

Note, however, that updating by column index is usually not recommended, because the code silently targets a different column if the column order changes.
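If only an index is available, one option is to resolve it to a column name once and work with the name from then on. The following is a sketch under that assumption; the table and the names id, grp and col_name are hypothetical.

```r
library(data.table)

# Hypothetical table; the second column is the factor to relabel.
DT <- data.table(id = 1:3, grp = factor(c("a", "b", "c")))

col_name <- names(DT)[2]              # resolve the index to a name once
stopifnot(is.factor(DT[[col_name]]))  # fail early if the layout changed

# Relabel by reference, addressing the column by name.
setattr(DT[[col_name]], "levels", c("d", "e", "f"))

as.character(DT$grp)  # "d" "e" "f"
```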

R: Reorder factor levels with data table (for use with Plotly)

To modify a column of a data.table object by reference, i.e., without copying the whole object, the := operator can be used as follows:

col <- "country"
DT[, (col) := factor(get(col), levels = rev(levels(get(col))))]
str(DT)
Classes ‘data.table’ and 'data.frame':    6 obs. of  4 variables:
 $ indicator: Factor w/ 1 level "EN.ATM.CO2E.PC": 1 1 1 1 1 1
 $ country  : Factor w/ 6 levels "United States",..: 6 5 4 3 2 1
 $ year     : Factor w/ 1 level "2011": 1 1 1 1 1 1
 $ value    : num  15.64 7.24 7.08 1.48 17.7 ...
DT
        indicator        country year     value
1: EN.ATM.CO2E.PC         Canada 2011 15.639760
2: EN.ATM.CO2E.PC          China 2011  7.241515
3: EN.ATM.CO2E.PC European Union 2011  7.079374
4: EN.ATM.CO2E.PC          India 2011  1.476686
5: EN.ATM.CO2E.PC   Saudi Arabia 2011 17.702307
6: EN.ATM.CO2E.PC  United States 2011 16.972417

Note that DT is used as the name of the data.table object to avoid name conflicts with the data() function.

As factor() sorts the levels alphabetically by default, rev() is used to reverse the order of the existing factor levels.

The column name is given in the variable col, so get() is used to access the column. Alternatively, this could be written as

DT[, (col) := lapply(.SD, factor, levels = rev(levels(DT[[col]]))), .SDcols = col]

using the special symbol .SD and the .SDcols parameter.

To verify that DT is updated by reference, address(DT) can be used.
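A small sketch of that check, using a made-up three-row table rather than the OP's data: the address of DT is recorded before and after the := call and then compared.

```r
library(data.table)

# Hypothetical stand-in for the OP's table.
DT <- data.table(
  country = factor(c("Canada", "China", "India")),
  value   = c(15.6, 7.2, 1.5)
)
col <- "country"

before <- address(DT)
DT[, (col) := factor(get(col), levels = rev(levels(get(col))))]
after <- address(DT)

identical(before, after)  # TRUE: DT itself was not copied
levels(DT$country)        # "India" "China" "Canada"
as.integer(DT$country)    # 3 2 1
as.character(DT$country)  # "Canada" "China" "India" (values unchanged)
```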



Why does setattr() not work as expected?

setattr() seems to change only the labels of the levels, but not the numbering of the levels, which is what the OP wants.

DT
        indicator        country year     value
1: EN.ATM.CO2E.PC         Canada 2011 15.639760
2: EN.ATM.CO2E.PC          China 2011  7.241515
3: EN.ATM.CO2E.PC European Union 2011  7.079374
4: EN.ATM.CO2E.PC          India 2011  1.476686
5: EN.ATM.CO2E.PC   Saudi Arabia 2011 17.702307
6: EN.ATM.CO2E.PC  United States 2011 16.972417
DT[, as.integer(country)]
[1] 1 2 3 4 5 6
setattr(DT[[col]], "levels", rev(levels(DT[[col]])))
DT
        indicator        country year     value
1: EN.ATM.CO2E.PC  United States 2011 15.639760
2: EN.ATM.CO2E.PC   Saudi Arabia 2011  7.241515
3: EN.ATM.CO2E.PC          India 2011  7.079374
4: EN.ATM.CO2E.PC European Union 2011  1.476686
5: EN.ATM.CO2E.PC          China 2011 17.702307
6: EN.ATM.CO2E.PC         Canada 2011 16.972417
DT[, as.integer(country)]
[1] 1 2 3 4 5 6

If the := code from above is used instead, the numbering of the levels changes accordingly:

DT[, (col) := factor(get(col), levels = rev(levels(get(col))))]
DT[, as.integer(country)]
[1] 6 5 4 3 2 1

(As DT is modified in place, always start from a fresh copy of DT.)
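The difference between the two approaches can be reproduced side by side on independent copies. The three-row table below is a hypothetical stand-in for the OP's data, and the names relabelled and reordered are chosen here for illustration.

```r
library(data.table)

# Hypothetical stand-in for the OP's table.
DT <- data.table(country = factor(c("Canada", "China", "India")))

# Approach 1: setattr() swaps the labels; the integer codes stay put,
# so the displayed values change.
relabelled <- copy(DT)
setattr(relabelled$country, "levels", rev(levels(relabelled$country)))
as.character(relabelled$country)  # "India" "China" "Canada"
as.integer(relabelled$country)    # 1 2 3

# Approach 2: factor() rebuilds the column; the values stay put,
# and the integer codes follow the new level order.
reordered <- copy(DT)
reordered[, country := factor(country, levels = rev(levels(country)))]
as.character(reordered$country)   # "Canada" "China" "India"
as.integer(reordered$country)     # 3 2 1
```

copy() gives each approach its own fresh table, so neither result is affected by the other.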

Changing factor levels on a column with setattr is sensitive for how the column was created

It might help to understand if you look at the address from both expressions:

address(d$x)
# [1] "0x10e4ac4d8"
address(d$x)
# [1] "0x10e4ac4d8"

address(d[,x])
# [1] "0x105e0b520"
address(d[,x])
# [1] "0x105e0a600"

Note that the address returned by the first expression stays the same across calls, while the address returned by the second expression changes every time. This indicates that d[, x] returns a copy of the column, so calling setattr on it has no effect on the original data.table.
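The consequence can be sketched as follows, assuming a recent data.table version in which d[, x] returns a copy. The table d and its contents are invented for illustration.

```r
library(data.table)

# Hypothetical table with a single factor column.
d <- data.table(x = factor(c("a", "b", "a")))

# d$x hands back the column itself, so its address is stable.
identical(address(d$x), address(d$x))  # TRUE

# setattr() on d[, x] modifies a temporary copy; d is unaffected.
setattr(d[, x], "levels", c("A", "B"))
levels(d$x)  # "a" "b"

# setattr() on d$x modifies the column stored in d.
setattr(d$x, "levels", c("A", "B"))
levels(d$x)  # "A" "B"
```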

How to make the levels of a factor in a data frame consistent across all columns?

You could bring the levels of the dataset "df" into the same order by looping over the columns (lapply), converting each one to a factor again with the specified levels, and assigning the result back to the corresponding columns.

lvls <- c('PASS', 'WARN', 'FAIL')
df[] <- lapply(df, factor, levels=lvls)
str(df)
# 'data.frame': 5 obs. of 5 variables:
# $ Test1: Factor w/ 3 levels "PASS","WARN",..: 1 1 1 1 1
# $ Test2: Factor w/ 3 levels "PASS","WARN",..: 1 1 3 3 2
# $ Test3: Factor w/ 3 levels "PASS","WARN",..: 3 3 3 3 3
# $ Test4: Factor w/ 3 levels "PASS","WARN",..: 2 1 1 1 2
# $ Test5: Factor w/ 3 levels "PASS","WARN",..: 2 2 2 2 2

If you opt to use data.table:

library(data.table)
setDT(df)[, names(df):= lapply(.SD, factor, levels=lvls)]

setDT converts the "data.frame" to a "data.table" by reference, and := assigns the re-created factor columns (lapply(..)) to the column names of the dataset. .SD stands for "Subset of Data".
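One thing worth checking after such a conversion is that every column really shares the same levels, and that no value fell outside lvls (factor() turns values that are not among the given levels into NA). The sketch below uses a hypothetical two-column data frame in which "SKIP" is deliberately not part of lvls.

```r
lvls <- c("PASS", "WARN", "FAIL")

# Hypothetical example data; "SKIP" is not one of the target levels.
df <- data.frame(
  Test1 = factor(c("PASS", "WARN", "PASS")),
  Test2 = factor(c("FAIL", "SKIP", "WARN"))
)
df[] <- lapply(df, factor, levels = lvls)

# Every column now carries exactly the same levels, in the same order.
all(vapply(df, function(col) identical(levels(col), lvls), logical(1)))  # TRUE

# The value that was not in lvls has become NA.
as.character(df$Test2)  # "FAIL" NA "WARN"
```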

data

df <- structure(list(Test1 = structure(c(1L, 1L, 1L, 1L, 1L), 
.Label = "PASS", class = "factor"),
Test2 = structure(c(2L, 2L, 1L, 1L, 3L), .Label = c("FAIL",
"PASS", "WARN"), class = "factor"), Test3 = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "FAIL", class = "factor"), Test4 =
structure(c(2L, 1L, 1L, 1L, 2L), .Label = c("PASS", "WARN", "FAIL"),
class = "factor"), Test5 = structure(c(1L, 1L, 1L, 1L, 1L), .Label =
"WARN", class = "factor")), .Names = c("Test1",
"Test2", "Test3", "Test4", "Test5"), row.names = c("Sample1",
"Sample2", "Sample3", "Sample4", "Sample5"), class = "data.frame")

Changing levels of dataframe column changes value in dataframe

x_value <- factor("yes", levels = c("no", "yes"))
df <- data.frame(
x = x_value
)

df

    x
1 yes

Why the example in the question is showing this "weird" behaviour:

The dataframe created in the question has a factor with only one level. The integer code of that level is 1, and it is this code that gets associated with the first entry of the new vector when you set levels().

Here is a quick example:

If we create a dataframe like this

x_value <- c("somethingElse", "more", "more")
df <- data.frame(
x = x_value,
stringsAsFactors = TRUE  # needed in R >= 4.0.0, where the default is FALSE
)

df$x

shows us that the levels are

[1] somethingElse more          more         
Levels: more somethingElse

Note that the first level is "more" even though "somethingElse" occurs first. This is because "more" comes first when the values are sorted.
So, if we assign now

levels(df$x) <- c("yes", "somethingElse", "more")

the first factor level gets "yes", the second gets "somethingElse", resulting in (maybe unintuitively)

              x
1 somethingElse
2           yes
3           yes
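To make the contrast explicit, the sketch below puts the positional relabelling done by levels<- next to factor(x, levels = ...), which keeps the values and only changes the set and order of the levels. The variable names by_position and reordered are chosen here for illustration.

```r
x <- factor(c("somethingElse", "more", "more"))
levels(x)  # "more" "somethingElse" (sorted alphabetically)

# levels<- renames by position: level 1 ("more") becomes "yes".
by_position <- x
levels(by_position) <- c("yes", "somethingElse", "more")
as.character(by_position)  # "somethingElse" "yes" "yes"

# factor(x, levels = ...) keeps the values and only resets the levels.
reordered <- factor(x, levels = c("yes", "somethingElse", "more"))
as.character(reordered)    # "somethingElse" "more" "more"
levels(reordered)          # "yes" "somethingElse" "more"
```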

use dataframe column to change levels of factor

Your approach is correct, but since you want to order the rows, use it in arrange:

library(dplyr)
df %>% arrange(factor(X, levels = ndf$X))

# X Y
#1 TUG 5000
#2 NJK 4000
#3 WQD 3000
#4 DFV 2000
#5 PRF 1000

You can also use match:

df %>% arrange(match(X, ndf$X))
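For reference, the same ordering can be sketched in base R with order() and match(), without dplyr. The contents of df and ndf below are assumptions reconstructed from the output shown above (the OP's actual input is not given), and sorted is a name introduced here.

```r
# Hypothetical input: df in arbitrary row order, ndf holding the wanted order.
df <- data.frame(
  X = c("DFV", "TUG", "PRF", "NJK", "WQD"),
  Y = c(2000, 5000, 1000, 4000, 3000)
)
ndf <- data.frame(X = c("TUG", "NJK", "WQD", "DFV", "PRF"))

# match() gives each row its position in ndf$X; order() sorts by that position.
sorted <- df[order(match(df$X, ndf$X)), ]

sorted$X  # "TUG" "NJK" "WQD" "DFV" "PRF"
sorted$Y  # 5000 4000 3000 2000 1000
```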

