Grouping Factor Levels in a Data.Table

Update:

I recently learned of a much simpler way to re-associate factor levels from this question and a closer reading of ?levels. No merges or correspondence tables are necessary; just pass a named list to levels:

levels(DT$ind) = list(A = c(1, 3, 8), B = c(2, 4), C = 5:7)
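
One thing to watch with the named-list form: any existing level that does not appear in the list is dropped, and its values become NA. Here is a minimal sketch on a standalone factor (the `ind` vector is hypothetical toy data, not the DT from the question):

```r
# Hypothetical toy factor with levels "1" .. "12"
ind <- factor(c(1, 3, 8, 2, 4, 5, 9, 12), levels = 1:12)

grouped <- ind
# Only levels 1-8 are mapped; 9-12 are not mentioned in the list
levels(grouped) <- list(A = c(1, 3, 8), B = c(2, 4), C = 5:7)

levels(grouped)
# [1] "A" "B" "C"
as.character(grouped)
# [1] "A" "A" "A" "B" "B" "C" NA  NA
```

So if some levels should keep their own group, they need an entry in the list as well.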


Original Answer:

As suggested by @Arun we have the option of creating the correspondence as a separate data.table, then joining it to the original:

match_dt = data.table(ind = as.factor(1:12),
                      grp = as.factor(c("A", "B", "A", "B", "C", "C",
                                        "C", "A", "D", "E", "F", "D")))
setkey(DT, ind)
setkey(match_dt, ind)
DT = match_dt[DT]

We can also do this in (what I consider to be) the more readable fashion like so (with marginal speed costs):

levels <- letters[1:12]
levels[c(1, 3, 8)] <- "A"
levels[c(2, 4)] <- "B"
levels[5:7] <- "C"
levels[c(9, 12)] <- "D"
levels[10] <- "E"
levels[11] <- "F"
match_dt <- data.table(ind = as.factor(1:12),
                       grp = as.factor(levels))
setkey(DT, ind)
setkey(match_dt, ind)
DT = match_dt[DT]
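
If re-keying (and therefore reordering) DT is not wanted, data.table also supports an update join that adds the group column by reference. This is a sketch only, assuming a hypothetical toy DT with a factor column `ind` and an extra column `x`:

```r
library(data.table)

# Hypothetical toy data; the real DT is assumed to have a factor column `ind`
DT <- data.table(ind = factor(c(3, 1, 9, 5, 12, 2), levels = 1:12), x = 1:6)

match_dt <- data.table(
  ind = factor(1:12),
  grp = factor(c("A", "B", "A", "B", "C", "C", "C", "A", "D", "E", "F", "D"))
)

# Update join: look up each DT row in match_dt and add `grp` in place.
# No setkey() is needed and the row order of DT is left untouched.
DT[match_dt, on = "ind", grp := i.grp]

as.character(DT$grp)
# [1] "A" "A" "D" "C" "D" "B"
```

The `i.` prefix refers to the column coming from `match_dt`, the table in the `i` position of the join.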

Factor levels by group

A data.table solution:

dt[, height_cat := cut(Height, breaks = c(0, 165, 180, 300), right = FALSE)]
dt[, height_f :=
     factor(
       paste(Sex, height_cat, sep = ":"),
       levels = dt[, CJ(Sex, height_cat, unique = TRUE)][, paste(Sex, height_cat, sep = ":")]
     )]

table(dt$height_f)
#   F:[0,165) F:[165,180) F:[180,300)   M:[0,165) M:[165,180) M:[180,300)
#           2           2           0           0           2           2
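
For comparison, base R's interaction() also produces every Sex-by-category combination as a level, including the empty ones. This is an alternative to the CJ() construction, not the method used above, and the Sex and Height vectors are hypothetical toy data chosen to reproduce the counts shown:

```r
# Hypothetical toy data
Sex    <- c("F", "F", "F", "F", "M", "M", "M", "M")
Height <- c(150, 160, 170, 175, 170, 175, 185, 190)

height_cat <- cut(Height, breaks = c(0, 165, 180, 300), right = FALSE)

# lex.order = TRUE makes the first factor (Sex) vary slowest in the level order
height_f <- interaction(Sex, height_cat, sep = ":", lex.order = TRUE)

levels(height_f)
# [1] "F:[0,165)"   "F:[165,180)" "F:[180,300)" "M:[0,165)"   "M:[165,180)" "M:[180,300)"
table(height_f)
```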

How do you group factor levels in R?

Grouping factor levels can easily be done by assigning the grouping in a list. Here is an example with toy data:

levels(mydata$value)
# [1] "not likely" "slightly likely" "likely" "very likely"

levels(mydata$value) <- list("unlikely"=c("not likely", "slightly likely"),
                             "likely"=c("likely", "very likely"))
levels(mydata$value)
# [1] "unlikely" "likely"

After that you probably want to do this:

(Statistical_Testing.aov <- aov(as.integer(value) ~ question, data = mydata))
# Call:
# aov(formula = as.integer(value) ~ question, data = mydata)
#
# Terms:
#                 question Residuals
# Sum of Squares      0.18      5.82
# Deg. of Freedom        1        23
#
# Residual standard error: 0.5030343
# Estimated effects may be unbalanced

(Statistical_Testing.anova <- anova(Statistical_Testing.aov))
# Analysis of Variance Table
#
# Response: as.integer(value)
#           Df Sum Sq Mean Sq F value Pr(>F)
# question   1   0.18 0.18000  0.7113 0.4077
# Residuals 23   5.82 0.25304

Toy data:

set.seed(42)
mydata <- transform(expand.grid(question = 1:5, id = 1:5),
                    value = factor(sample(1:4, 25, replace = TRUE),
                                   labels = c("not likely", "slightly likely",
                                              "likely", "very likely")))

Updating factor levels from another table by data.table

You could do the following:

subset_table[,
  (nonnumeric_column) :=
    lapply(nonnumeric_column, \(x) factor(get(x), levels = unique(bigger_table[[x]])))
]

Resulting in:

> lapply(subset_table, levels)
$region
[1] "region_1" "region_3" "region_2" "region_4"

$factor_column
[1] "C" "B" "A"

$numeric_column
NULL

The problem in your original solution is that x is the column itself, not the name of the column, so it cannot be used to look up the matching column in bigger_table. You can see this with:

subset_table[, lapply(.SD, \(x) print(x)), .SDcols=nonnumeric_column]
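
An equivalent way to write the update is a loop over the column names with set(), which avoids get() altogether. This is a sketch under assumptions: the two tables below are hypothetical stand-ins for the ones in the question, built so the resulting levels match the output above.

```r
library(data.table)

# Hypothetical stand-ins for the tables in the question
bigger_table <- data.table(
  region         = c("region_1", "region_3", "region_2", "region_4"),
  factor_column  = c("C", "B", "A", "C"),
  numeric_column = 1:4
)
subset_table <- bigger_table[1:2]   # row subsetting returns a new table
nonnumeric_column <- c("region", "factor_column")

# Loop over column *names*, so each name can index both tables
for (col in nonnumeric_column) {
  set(subset_table, j = col,
      value = factor(subset_table[[col]], levels = unique(bigger_table[[col]])))
}

lapply(subset_table, levels)
```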

How to group factor levels?

I made up an example character vector with all of the abbreviations:

my_example <- c("C","OG","OT","TE","DT","DE","CB","WR","FS", 
"FB","ILB","OLB","P","QB","RB","SS","WR")
class(my_example)

[1] "character"

Then I substituted the desired levels for their abbreviations (you could also use gsub here or any of many, many different approaches):

my_example[my_example %in% c("C","OG","OT","TE","DT","DE")] <- "Linemen"
my_example[my_example %in% c("CB","WR","FS")] <- "Small Backs"
my_example[my_example %in% c("FB","ILB","OLB","P",
"QB","RB","SS","WR")] <- "Big Backs"

Then I made it into a factor:

my_example <- as.factor(my_example)
head(my_example)
[1] Linemen Linemen Linemen Linemen Linemen Linemen
Levels: Big Backs Linemen Small Backs
tail(my_example)
[1] Big Backs   Big Backs   Big Backs   Big Backs   Big Backs   Small Backs
Levels: Big Backs Linemen Small Backs
class(my_example)

[1] "factor"
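
One of the "many different approaches" is a named lookup vector, which keeps the whole mapping in one place and leaves the original vector untouched. Here is a rough sketch; the `groups` list and the short `positions` vector are illustrative names and toy data, and "WR" is assigned to "Small Backs" only, which is what the step-by-step replacement above ends up doing:

```r
# Each group name maps to the abbreviations it should absorb
groups <- list(
  "Linemen"     = c("C", "OG", "OT", "TE", "DT", "DE"),
  "Small Backs" = c("CB", "WR", "FS"),
  "Big Backs"   = c("FB", "ILB", "OLB", "P", "QB", "RB", "SS")
)

# Flatten to a named character vector: names are abbreviations, values are groups
lookup <- setNames(rep(names(groups), lengths(groups)),
                   unlist(groups, use.names = FALSE))

positions <- c("C", "OG", "CB", "WR", "QB", "RB")   # toy input
grouped <- factor(unname(lookup[positions]))

as.character(grouped)
# [1] "Linemen"     "Linemen"     "Small Backs" "Small Backs" "Big Backs"   "Big Backs"
```

Any abbreviation missing from `lookup` comes back as NA, which makes unmapped values easy to spot.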

Grouping 2 levels of a factor in R

Use levels(x) <- ... to specify new levels, and to combine some previous levels. For example:

f <- factor(LETTERS[c(1:3, 3:1)])
f
[1] A B C C B A
Levels: A B C

Now combine "A" and "B" into a single level:

levels(f) <- c("A", "A", "C")
f
[1] A A C C A A
Levels: A C
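
If you would rather not depend on the position of each level, the same merge can be written by name. A small sketch on the same factor:

```r
f <- factor(LETTERS[c(1:3, 3:1)])

# Rename level "B" to "A"; duplicated level names are merged automatically
levels(f)[levels(f) == "B"] <- "A"

levels(f)
# [1] "A" "C"
as.character(f)
# [1] "A" "A" "C" "C" "A" "A"
```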

R grouping data with factors and levels

You first need to transform the vector so that all values greater than 5 share a single entry, then you can add the missing levels in the factor() function:

X <- c(1,2,3,4,3,9,20)
X <- ifelse(X>5,">5",X)
X <- factor(X,levels=c(0:5,">5"))

This results in:

X
[1] 1 2 3 4 3 >5 >5
Levels: 0 1 2 3 4 5 >5
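
The point of listing the levels explicitly shows up when you tabulate: levels with no observations still appear, with a count of zero. Here is a short sketch comparing both versions (the names `with_levels` and `without_levels` are just for illustration):

```r
X <- c(1, 2, 3, 4, 3, 9, 20)
X <- ifelse(X > 5, ">5", X)   # everything above 5 collapses to ">5"

with_levels    <- factor(X, levels = c(0:5, ">5"))
without_levels <- factor(X)   # only the values that actually occur

table(with_levels)
# with_levels
#  0  1  2  3  4  5 >5
#  0  1  1  2  1  0  2

nlevels(without_levels)
# [1] 5
```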

Reorder factor levels within group

To reorder the factor levels you can use forcats (part of the tidyverse), and do something like this...

library(dplyr)    # provides %>% and mutate()
library(forcats)
df2 <- df %>% mutate(a_factor = fct_reorder(a_factor,
                                            value * (-1 + 2 * (group == "group1"))))

levels(df2$a_factor)
[1] "f" "e" "d" "a" "b" "c"

This does not rearrange the dataframe itself...

df2
  a_factor  group value
1        a group1     1
2        b group1     2
3        c group1     3
4        d group2     4
5        e group2     5
6        f group2     6
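
To see what the sort key is doing: the expression -1 + 2 * (group == "group1") evaluates to +1 for group1 and -1 for group2, so group2 values are negated and sort first, in descending order. The sketch below rebuilds the same level order in base R rather than with fct_reorder(); it assumes a toy `df` matching the output above, where each level occurs exactly once:

```r
# Toy data frame assumed to match the printed output
df <- data.frame(
  a_factor = factor(letters[1:6]),
  group    = rep(c("group1", "group2"), each = 3),
  value    = 1:6
)

# +value for group1, -value for group2
key <- df$value * (-1 + 2 * (df$group == "group1"))
key
# [1]  1  2  3 -4 -5 -6

# Levels ordered by increasing key; the rows themselves stay where they are
df$a_factor <- factor(df$a_factor,
                      levels = as.character(df$a_factor)[order(key)])
levels(df$a_factor)
# [1] "f" "e" "d" "a" "b" "c"
```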

