Drop Unused Factor Levels in a Subsetted Data Frame

All you should have to do is to apply factor() to your variable again after subsetting:

> subdf$letters
[1] a b c
Levels: a b c d e
> subdf$letters <- factor(subdf$letters)
> subdf$letters
[1] a b c
Levels: a b c


From the example on the ?factor help page:

factor(ff)      # drops the levels that do not occur

For dropping levels from all factor columns in a dataframe, you can use:

subdf <- subset(df, numbers <= 3)
subdf[] <- lapply(subdf, function(x) if(is.factor(x)) factor(x) else x)
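
For a run that works on its own, here is a minimal sketch of the two lines above. The data frame df and its columns letters and numbers are assumed for illustration; the data from the original question is not shown here.

```r
# Hypothetical example data: one factor column and one integer column
df <- data.frame(letters = factor(c("a", "b", "c", "d", "e")),
                 numbers = 1:5)

# Subsetting keeps all five levels, even though only three values remain
subdf <- subset(df, numbers <= 3)
levels(subdf$letters)

# Re-apply factor() to every factor column; other columns pass through unchanged
subdf[] <- lapply(subdf, function(x) if (is.factor(x)) factor(x) else x)
levels(subdf$letters)
```

Assigning into subdf[] rather than subdf keeps the result a data frame instead of a plain list.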

How can I drop unused levels from a data frame?

R has a built-in function for this, droplevels() (added in R 2.12.0):

y <- droplevels(y)
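
As a small sketch of how this behaves on a data frame, the snippet below uses a made-up data frame y with two factor columns (the names g and h are assumptions for illustration). It also shows the except argument of the data frame method, which takes column indices to leave untouched.

```r
# Hypothetical data: two factor columns with three levels each
y <- data.frame(g = factor(c("a", "b", "c")),
                h = factor(c("x", "y", "z")))
y <- y[1:2, ]

# Drop unused levels from every factor column
y_all <- droplevels(y)

# Drop unused levels everywhere except in column 2 (h)
y_some <- droplevels(y, except = 2)

sapply(y_all, nlevels)
sapply(y_some, nlevels)
```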

Override [.data.frame to drop unused factor levels by default

I'd be really wary of changing the default behavior; you never know when another function you use depends on the usual default behavior. I'd instead write a similar function to your subsetDrop but for [, like

sel <- function(x, ...) droplevels(x[...])


> d <- data.frame(a=factor(LETTERS[1:5]), b=factor(letters[1:5]))
> str(d[1:2,])
'data.frame': 2 obs. of 2 variables:
$ a: Factor w/ 5 levels "A","B","C","D",..: 1 2
$ b: Factor w/ 5 levels "a","b","c","d",..: 1 2
> str(sel(d,1:2,))
'data.frame': 2 obs. of 2 variables:
$ a: Factor w/ 2 levels "A","B": 1 2
$ b: Factor w/ 2 levels "a","b": 1 2

If you really want to change the default, you could do something like

foo <- `[.data.frame`
`[.data.frame` <- function(...) droplevels(foo(...))

but make sure you understand how namespaces work: this override applies to anything called from the global environment, while the version in the base namespace is unchanged. That might be a good thing, but it is something you should be aware of. After this change, the output is as you'd like.

> str(d[1:2,])
'data.frame': 2 obs. of 2 variables:
$ a: Factor w/ 2 levels "A","B": 1 2
$ b: Factor w/ 2 levels "a","b": 1 2
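
If you try the override and later want the default behaviour back, removing your own copy should be enough, since the base version was never modified. A minimal sketch, assuming the override lives in the global environment as above (d is recreated so the snippet runs on its own; with_override and without_override are names made up for this example):

```r
d <- data.frame(a = factor(LETTERS[1:5]), b = factor(letters[1:5]))

# Install the override in the current (global) environment
foo <- `[.data.frame`
`[.data.frame` <- function(...) droplevels(foo(...))
with_override <- nlevels(d[1:2, ]$a)

# Remove the override; dispatch falls back to the method in the base namespace
rm(list = "[.data.frame")
without_override <- nlevels(d[1:2, ]$a)

c(with_override, without_override)
```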

Dropping unused factor levels in data.table

We can use .SDcols to specify the columns of interest. It takes a vector of column names (of length 1 or more) or column indices. .SD (the Subset of Data) then contains only the columns specified in .SDcols. As there is only a single column here, extract it with [[, apply droplevels to the vector, and assign (:=) the result back to the column of interest. Note the parentheses around the object identifier v1: they make data.table evaluate the object and use its value as the column name, instead of creating a column named 'v1'.

x[, (v1) := droplevels(.SD[[1]]), .SDcols = v1]

Usually, the syntax would be

x[, (v1) := lapply(.SD, droplevels), .SDcols = v1]

It can take one column or multiple columns. The only reason to extract with [[ is that we know it is a single column.

Another option is get():

x[, (v1) := droplevels(get(v1))]

In all of the above, v1 holds the column name as a character string:

v1 <- "y"
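
To try the lapply form end to end, here is a minimal sketch. The table x and its columns y, w and z are invented for illustration, since the original data is not shown here; it assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical data: two factor columns and one integer column
x <- data.table(y = factor(c("a", "b", "c", "a")),
                w = factor(c("p", "q", "r", "p")),
                z = 1:4)

# Filtering rows leaves the unused levels "c" and "r" in place
x <- x[z != 3L]

# Single column, named via a character variable
v1 <- "y"
x[, (v1) := lapply(.SD, droplevels), .SDcols = v1]

# The same pattern for every factor column at once
cols <- names(x)[vapply(x, is.factor, logical(1))]
x[, (cols) := lapply(.SD, droplevels), .SDcols = cols]

levels(x$y)
levels(x$w)
```

Because := modifies x by reference, there is no need to assign the result back to x.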

Subsetting a data.frame based on factor levels in a second data.frame


           A            C
1  0.8924861 0.7149490854
2  0.5711894 0.7200819517
3  0.7049629 0.0004052017
4  0.9188677 0.5007302717
5  0.3440664 0.9138259818
6  0.8657903 0.2724015017
7  0.7631228 0.5686033906
8  0.8388003 0.7377064163
9  0.0796059 0.6196693045
10 0.5029824 0.8717568610

Change factor levels and rearrange dataframe

This mistake is easy to make. You have to supply the column vector to fct_relevel, like so:

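library(dplyr, warn.conflicts = FALSE)

# The integer codes and the levels missing from this listing are
# inferred from the printed output below
df <- structure(
  list(layer = structure(
    1:5,
    .Label = c(
      'CEOS and managers',
      'Clerks and services',
      'Production',
      'Professionals',
      'Technicians'
    ),
    class = 'factor'
  )),
  row.names = c(NA, -5L),
  class = c('tbl_df', 'tbl', 'data.frame')
)

df %>%
  # pass the column itself first, then the desired level order
  mutate(layer = forcats::fct_relevel(layer, c(
    'CEOS and managers',
    'Professionals',
    'Technicians',
    'Clerks and services',
    'Production'))) %>%
  arrange(layer)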
#> # A tibble: 5 x 1
#> layer
#> <fct>
#> 1 CEOS and managers
#> 2 Professionals
#> 3 Technicians
#> 4 Clerks and services
#> 5 Production

Created on 2021-01-11 by the reprex package (v0.3.0)
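
For comparison, the same reordering can be sketched in base R without forcats, by rebuilding the factor with an explicit levels argument and then sorting on it. This is an alternative technique, not the approach from the answer above; the data frame is recreated with data.frame() and the helper name wanted is made up for this example.

```r
# Hypothetical stand-in for the tibble above, with default alphabetical levels
df <- data.frame(layer = factor(c("CEOS and managers",
                                  "Clerks and services",
                                  "Production",
                                  "Professionals",
                                  "Technicians")))

# Desired level order
wanted <- c("CEOS and managers", "Professionals", "Technicians",
            "Clerks and services", "Production")

# Rebuild the factor with the new level order, then sort rows by it
df$layer <- factor(df$layer, levels = wanted)
df <- df[order(df$layer), , drop = FALSE]

levels(df$layer)
```

The drop = FALSE keeps the result a data frame even though it has only one column.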
