How does one change the levels of a factor column in a data.table
You can still set them the traditional way:
levels(mydt$value) <- c(...)
This should be plenty fast unless mydt is very large, since the traditional syntax copies the entire object. You could also play the un-factoring and refactoring game, but no one likes that game anyway.
To change the levels by reference, with no copy of mydt:
setattr(mydt$value,"levels",c(...))
but be sure to assign a valid levels vector (type character, of sufficient length); otherwise you'll end up with an invalid factor (levels<- does some checking as well as copying).
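As a minimal sketch of the by-reference approach, assuming a small made-up table (the data and the replacement labels in new_levels are hypothetical; mydt and value mirror the names used above), the checks that levels<- would otherwise do can be made explicit before calling setattr:

```r
library(data.table)

# Hypothetical example table; only the names mydt and value come from the text above.
mydt <- data.table(value = factor(c("a", "b", "a", "c")))

# Illustrative replacement labels, one per existing level.
new_levels <- c("x", "y", "z")

# setattr() does no validation, so check the vector ourselves:
# character type, no duplicates, and at least as many entries as existing levels.
stopifnot(
  is.character(new_levels),
  !anyDuplicated(new_levels),
  length(new_levels) >= nlevels(mydt$value)
)

# Relabel the levels in place; mydt itself is not copied.
setattr(mydt$value, "levels", new_levels)

as.character(mydt$value)  # "x" "y" "x" "z"
```

Only the labels change here; the underlying integer codes of the factor stay as they were.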
Updating factor levels from another table with data.table
You could do the following:
subset_table[,
(nonnumeric_column) :=
lapply(nonnumeric_column, \(x) factor(get(x), levels = unique(bigger_table[[x]])))
]
Resulting in
> lapply(subset_table, levels)
$region
[1] "region_1" "region_3" "region_2" "region_4"
$factor_column
[1] "C" "B" "A"
$numeric_column
NULL
The problem in your original solution is that x does not hold the name of the column but the actual column. You can see this with:
subset_table[, lapply(.SD, \(x) print(x)), .SDcols=nonnumeric_column]
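Because .SD hands over the columns themselves rather than their names, one alternative is to walk over the columns and their names together. The following is only a sketch under assumed data: the contents of bigger_table and subset_table are made up to resemble the output above, and cols is a hypothetical stand-in for nonnumeric_column.

```r
library(data.table)

# Made-up tables that resemble the ones in the question.
bigger_table <- data.table(
  region = c("region_1", "region_3", "region_2", "region_4"),
  factor_column = c("C", "B", "A", "C"),
  numeric_column = 1:4
)
subset_table <- bigger_table[region != "region_4"]

# Hypothetical stand-in for nonnumeric_column.
cols <- c("region", "factor_column")

# Map() pairs each column in .SD with its name, so the name can be used
# to look up the full set of levels in bigger_table.
subset_table[, (cols) := Map(
  function(column, name) factor(column, levels = unique(bigger_table[[name]])),
  .SD, cols
), .SDcols = cols]

levels(subset_table$region)  # "region_1" "region_3" "region_2" "region_4"
```

This avoids get() entirely, at the cost of a slightly longer call.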
Change factor levels in data.tables by name
This would work:
setattr(mydt[["value"]],"levels",c(...))
R change datatable factor levels using column index
You could still do it the data.frame way:
levels(DT[[2]]) <- c("d", "e", "f")
Note, however, that updating by column index is usually not recommended.
R: Reorder factor levels with data table (for use with Plotly)
To modify a column of a data.table object by reference, i.e., without copying the whole object, the := operator can be used as follows:
col <- "country"
DT[, (col) := factor(get(col), levels = rev(levels(get(col))))]
str(DT)
Classes ‘data.table’ and 'data.frame': 6 obs. of 4 variables:
$ indicator: Factor w/ 1 level "EN.ATM.CO2E.PC": 1 1 1 1 1 1
$ country : Factor w/ 6 levels "United States",..: 6 5 4 3 2 1
$ year : Factor w/ 1 level "2011": 1 1 1 1 1 1
$ value : num 15.64 7.24 7.08 1.48 17.7 ...
DT
indicator country year value
1: EN.ATM.CO2E.PC Canada 2011 15.639760
2: EN.ATM.CO2E.PC China 2011 7.241515
3: EN.ATM.CO2E.PC European Union 2011 7.079374
4: EN.ATM.CO2E.PC India 2011 1.476686
5: EN.ATM.CO2E.PC Saudi Arabia 2011 17.702307
6: EN.ATM.CO2E.PC United States 2011 16.972417
Note that DT is used as the name of the data.table object to avoid name conflicts with the data() function.
As factor() sorts the levels alphabetically by default, rev() is used to reverse the order of the existing factor levels.
The column name is given in the variable col. Therefore, get() is used to access the column. Alternatively, this could be written as
DT[, (col) := lapply(.SD, factor, levels = rev(levels(DT[[col]]))), .SDcols = col]
using the special symbol .SD and the .SDcols parameter.
To verify that DT is updated by reference, address(DT) can be used.
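A small sketch of that check, assuming a cut-down, made-up version of DT with three countries (the values are illustrative only): the address is recorded before and after the := call and compared.

```r
library(data.table)

# Made-up, reduced version of DT for illustration.
DT <- data.table(
  country = factor(c("Canada", "China", "India")),
  value = c(15.6, 7.2, 1.5)
)
col <- "country"

before <- address(DT)
DT[, (col) := factor(get(col), levels = rev(levels(get(col))))]
after <- address(DT)

# TRUE means DT was updated in place rather than replaced by a copy.
identical(before, after)
```

The comparison concerns the data.table object itself; the replaced column is a new vector, which is expected.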
Why does setattr() not work as expected?
setattr() seems only to change the labels of the levels, not the numbering of the levels as the OP wants.
DT
indicator country year value
1: EN.ATM.CO2E.PC Canada 2011 15.639760
2: EN.ATM.CO2E.PC China 2011 7.241515
3: EN.ATM.CO2E.PC European Union 2011 7.079374
4: EN.ATM.CO2E.PC India 2011 1.476686
5: EN.ATM.CO2E.PC Saudi Arabia 2011 17.702307
6: EN.ATM.CO2E.PC United States 2011 16.972417
DT[, as.integer(country)]
[1] 1 2 3 4 5 6
setattr(DT[[col]], "levels", rev(levels(DT[[col]])))
DT
indicator country year value
1: EN.ATM.CO2E.PC United States 2011 15.639760
2: EN.ATM.CO2E.PC Saudi Arabia 2011 7.241515
3: EN.ATM.CO2E.PC India 2011 7.079374
4: EN.ATM.CO2E.PC European Union 2011 1.476686
5: EN.ATM.CO2E.PC China 2011 17.702307
6: EN.ATM.CO2E.PC Canada 2011 16.972417
DT[, as.integer(country)]
[1] 1 2 3 4 5 6
If the := code from above is used instead, the numbering of the levels is changed accordingly:
DT[, (col) := factor(get(col), levels = rev(levels(get(col))))]
DT[, as.integer(country)]
[1] 6 5 4 3 2 1
(As DT is modified in place, please always start from a fresh copy of DT.)
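The difference between the two approaches can be put side by side. This is a sketch on a made-up three-row table; copy() is used to get the fresh copies mentioned above, and the names relabelled and reordered are purely illustrative.

```r
library(data.table)

# Made-up table for illustration.
DT <- data.table(country = factor(c("Canada", "China", "India")))

# setattr(): swaps the labels, leaves the integer codes alone.
relabelled <- copy(DT)
setattr(relabelled$country, "levels", rev(levels(relabelled$country)))
as.character(relabelled$country)  # "India" "China" "Canada"
as.integer(relabelled$country)    # 1 2 3

# factor(): keeps the values, changes the integer codes.
reordered <- copy(DT)
reordered[, country := factor(country, levels = rev(levels(country)))]
as.character(reordered$country)   # "Canada" "China" "India"
as.integer(reordered$country)     # 3 2 1
```

In other words, setattr() on "levels" silently changes which country each row refers to, which is rarely what a reordering is meant to do.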
Changing factor levels on a column with setattr is sensitive to how the column was created
It might help to understand if you look at the address from both expressions:
address(d$x)
# [1] "0x10e4ac4d8"
address(d$x)
# [1] "0x10e4ac4d8"
address(d[,x])
# [1] "0x105e0b520"
address(d[,x])
# [1] "0x105e0a600"
Note that the address from the first expression does not change when you call it multiple times, while the address from the second expression does. The changing address indicates that d[,x] returns a copy of the column each time, so setattr on it will have no effect on the original data.table.
How to make the levels of a factor in a data frame consistent across all columns?
You could put the levels of the dataset "df" in the same order by looping over the columns (lapply), converting each one to factor again with the specified levels, and assigning the result back to the corresponding columns.
lvls <- c('PASS', 'WARN', 'FAIL')
df[] <- lapply(df, factor, levels=lvls)
str(df)
# 'data.frame': 5 obs. of 5 variables:
# $ Test1: Factor w/ 3 levels "PASS","WARN",..: 1 1 1 1 1
# $ Test2: Factor w/ 3 levels "PASS","WARN",..: 1 1 3 3 2
# $ Test3: Factor w/ 3 levels "PASS","WARN",..: 3 3 3 3 3
# $ Test4: Factor w/ 3 levels "PASS","WARN",..: 2 1 1 1 2
# $ Test5: Factor w/ 3 levels "PASS","WARN",..: 2 2 2 2 2
If you opt to use data.table:
library(data.table)
setDT(df)[, names(df):= lapply(.SD, factor, levels=lvls)]
setDT converts the "data.frame" to a "data.table" by reference; := then assigns the reconverted factor columns (lapply(..)) to the column names of the dataset. .SD denotes "Subset of Data.table".
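One thing to watch with either variant: factor(x, levels = lvls) turns any value that is not listed in lvls into NA without a warning. The sketch below, in base R on a made-up two-column data frame (df_demo and the value "SKIP" are hypothetical), collects such unexpected values before converting.

```r
lvls <- c("PASS", "WARN", "FAIL")

# Made-up data; "SKIP" is deliberately not part of lvls.
df_demo <- data.frame(
  Test1 = factor(c("PASS", "PASS")),
  Test2 = factor(c("FAIL", "SKIP"))
)

# Per column, the existing levels that lvls does not cover.
unknown <- lapply(df_demo, function(column) setdiff(levels(column), lvls))
unknown$Test2  # "SKIP"

# Same conversion as above; uncovered values become NA.
df_demo[] <- lapply(df_demo, factor, levels = lvls)
is.na(df_demo$Test2)  # FALSE TRUE
```

If unknown contains anything, it is probably worth extending lvls or cleaning the data first.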
data
df <- structure(list(Test1 = structure(c(1L, 1L, 1L, 1L, 1L),
.Label = "PASS", class = "factor"),
Test2 = structure(c(2L, 2L, 1L, 1L, 3L), .Label = c("FAIL",
"PASS", "WARN"), class = "factor"), Test3 = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "FAIL", class = "factor"), Test4 =
structure(c(2L, 1L, 1L, 1L, 2L), .Label = c("PASS", "WARN", "FAIL"),
class = "factor"), Test5 = structure(c(1L, 1L, 1L, 1L, 1L), .Label =
"WARN", class = "factor")), .Names = c("Test1",
"Test2", "Test3", "Test4", "Test5"), row.names = c("Sample1",
"Sample2", "Sample3", "Sample4", "Sample5"), class = "data.frame")
Changing levels of dataframe column changes value in dataframe
x_value <- factor("yes", levels = c("no", "yes"))
df <- data.frame(
x = x_value
)
df
x
1 yes
Why the example in the question shows this "weird" behaviour:
The dataframe created there has a factor with a single level. That level has the integer code 1, so it is the level that receives the first element of the vector you assign with levels().
Here is a quick example:
If we create a dataframe like this
x_value <- c("somethingElse", "more", "more")
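df <- data.frame(
x = x_value,
stringsAsFactors = TRUE  # needed since R 4.0.0, where strings are no longer converted to factors by default
)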
df$x shows us that the levels are
[1] somethingElse more more
Levels: more somethingElse
Note that the first level is "more" even though "somethingElse" occurs first. This is because "more" comes first when the values are sorted.
So, if we now assign
levels(df$x) <- c("yes", "somethingElse", "more")
the first factor level gets "yes" and the second gets "somethingElse", resulting in (maybe unintuitively)
x
1 somethingElse
2 yes
3 yes
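The same effect can be reproduced on a bare factor, which also makes the contrast with factor(x, levels = ...) visible. This is a base R sketch; the names renamed and reordered are illustrative only.

```r
x <- factor(c("somethingElse", "more", "more"))
levels(x)  # "more" "somethingElse" (sorted, not in order of appearance)

# levels<- renames by position: level 1 ("more") becomes "yes",
# level 2 ("somethingElse") keeps its name, and an unused third level is added.
renamed <- x
levels(renamed) <- c("yes", "somethingElse", "more")
as.character(renamed)  # "somethingElse" "yes" "yes"

# factor(x, levels = ...) reorders the levels and leaves the values untouched.
reordered <- factor(x, levels = c("somethingElse", "more"))
as.character(reordered)  # "somethingElse" "more" "more"
```

So levels<- is the tool for renaming levels, while factor() with an explicit levels argument is the tool for changing their order.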
use dataframe column to change levels of factor
Your approach is correct, but since you want to order the rows, use it in arrange:
library(dplyr)
df %>% arrange(factor(X, levels = ndf$X))
# X Y
#1 TUG 5000
#2 NJK 4000
#3 WQD 3000
#4 DFV 2000
#5 PRF 1000
You can also use match:
df %>% arrange(match(X, ndf$X))
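For completeness, the match idea also works without dplyr. The sketch below uses base R only; the contents of df and ndf are made up so that the result matches the output shown above.

```r
# Made-up input data; ndf$X holds the desired row order.
df <- data.frame(
  X = c("DFV", "NJK", "PRF", "TUG", "WQD"),
  Y = c(2000, 4000, 1000, 5000, 3000)
)
ndf <- data.frame(X = c("TUG", "NJK", "WQD", "DFV", "PRF"))

# match() gives each row its position in ndf$X; order() sorts by that position.
sorted <- df[order(match(df$X, ndf$X)), ]
sorted$X  # "TUG" "NJK" "WQD" "DFV" "PRF"
```

Rows whose X does not occur in ndf$X get NA from match() and are placed last by order().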