How to Drop Unused Levels from a Data Frame

How can I drop unused levels from a data frame?

Since R 2.12.0, base R has had a function for this:

y <- droplevels(y)
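
As a minimal sketch of what this does on a whole data frame (the df and subdf objects and their column names are made up for illustration), droplevels() removes the levels that no longer occur after subsetting:

```r
# Hypothetical example data: one factor column, one integer column
df <- data.frame(letters = factor(c("a", "b", "c", "d", "e")), numbers = 1:5)

# Subsetting keeps all five levels, even though only three are still used
subdf <- subset(df, numbers <= 3)
levels_before <- levels(subdf$letters)

# droplevels() on a data frame drops unused levels from every factor column
subdf <- droplevels(subdf)
levels_after <- levels(subdf$letters)

print(levels_before)
print(levels_after)
```

droplevels() also works directly on a single factor, so droplevels(subdf$letters) would be the column-wise equivalent.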

Drop unused factor levels in a subsetted data frame

All you need to do is apply factor() to the variable again after subsetting:

> subdf$letters
[1] a b c
Levels: a b c d e
> subdf$letters <- factor(subdf$letters)
> subdf$letters
[1] a b c
Levels: a b c

EDIT

From the examples on the ?factor help page:

factor(ff)      # drops the levels that do not occur

To drop unused levels from all factor columns in a data frame, you can use:

subdf <- subset(df, numbers <= 3)
subdf[] <- lapply(subdf, function(x) if(is.factor(x)) factor(x) else x)
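
A self-contained sketch of the same idea, assuming a made-up data frame with a plain factor, an ordered factor, and an integer column, suggests that re-applying factor() keeps the ordering of ordered factors and leaves non-factor columns alone:

```r
# Hypothetical example data (column names are illustrative only)
df <- data.frame(
  letters = factor(c("a", "b", "c", "d", "e")),
  size    = factor(c("s", "m", "l", "m", "s"),
                   levels = c("s", "m", "l"), ordered = TRUE),
  numbers = 1:5
)

subdf <- subset(df, numbers <= 2)

# Re-create each factor column so that only the observed levels remain;
# the [] on the left keeps subdf a data frame instead of a plain list
subdf[] <- lapply(subdf, function(x) if (is.factor(x)) factor(x) else x)

print(levels(subdf$letters))
print(levels(subdf$size))
print(is.ordered(subdf$size))
print(class(subdf$numbers))
```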

Dropping unused factor levels in data.table

We can use .SDcols to specify the columns of interest. It accepts a vector of column names (of length one or more) or column indices. .SD (the Subset of Data) then contains only the columns given in .SDcols. Because there is only a single column here, extract it with [[, apply droplevels() to the resulting vector, and assign it back to the column of interest with :=. Note the parentheses around the variable v1: they make data.table evaluate v1 and use its value as the column name, instead of creating a column literally named 'v1'.

x[, (v1) := droplevels(.SD[[1]]), .SDcols = v1]

The more general syntax is

x[, (v1) := lapply(.SD, droplevels), .SDcols = v1]

This form works for one column or several. The only reason to extract with [[ in the first version is that we know there is a single column.

Another option is get():

x[, (v1) :=  droplevels(get(v1))]

where,

v1 <- "y"
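
To illustrate the multi-column form, here is a small sketch that assumes the data.table package is installed; the table x, its columns y, z and n, and the helper variable cols are invented for the example:

```r
library(data.table)

# Hypothetical table with two factor columns and one integer column
x <- data.table(
  y = factor(c("a", "b", "c")),
  z = factor(c("u", "v", "w")),
  n = 1:3
)

# Filtering rows leaves the unused levels in place
x <- x[n <= 2]

# Collect the names of all factor columns, then drop their unused levels
# by reference; (cols) is evaluated to get the column names
cols <- names(x)[vapply(x, is.factor, logical(1))]
x[, (cols) := lapply(.SD, droplevels), .SDcols = cols]

print(levels(x$y))
print(levels(x$z))
```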

Override [.data.frame to drop unused factor levels by default

I'd be really wary of changing the default behavior; you never know when another function you use depends on it. I'd instead write a function similar to your subsetDrop, but for [, such as

sel <- function(x, ...) droplevels(x[...])

Then

> d <- data.frame(a=factor(LETTERS[1:5]), b=factor(letters[1:5]))
> str(d[1:2,])
'data.frame':   2 obs. of  2 variables:
 $ a: Factor w/ 5 levels "A","B","C","D",..: 1 2
 $ b: Factor w/ 5 levels "a","b","c","d",..: 1 2
> str(sel(d,1:2,))
'data.frame':   2 obs. of  2 variables:
 $ a: Factor w/ 2 levels "A","B": 1 2
 $ b: Factor w/ 2 levels "a","b": 1 2

If you really want to change the default, you could do something like

foo <- `[.data.frame`
`[.data.frame` <- function(...) droplevels(foo(...))

but make sure you know how namespaces work: the new definition is used for anything called from the global environment, while the version in the base namespace is unchanged, so code inside packages keeps the usual behavior. That might be a good thing, but it's something you want to make sure you understand. After this change the output is as you'd like.

> str(d[1:2,])
'data.frame':   2 obs. of  2 variables:
 $ a: Factor w/ 2 levels "A","B": 1 2
 $ b: Factor w/ 2 levels "a","b": 1 2
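
Because the override is just an ordinary object in the workspace, it can be undone by removing it. The following sketch, run at top level in a fresh session, repeats the masking from above and then restores the default behavior; masked_levels and restored_levels are names introduced only for this illustration:

```r
d <- data.frame(a = factor(LETTERS[1:5]), b = factor(letters[1:5]))

# Keep a reference to the original method, then mask it in the workspace
foo <- `[.data.frame`
`[.data.frame` <- function(...) droplevels(foo(...))
masked_levels <- nlevels(d[1:2, ]$a)    # unused levels dropped while masked

# Remove the masking definition; dispatch falls back to the base method
rm(list = "[.data.frame")
restored_levels <- nlevels(d[1:2, ]$a)  # all original levels kept again

print(c(masked = masked_levels, restored = restored_levels))
```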

Add back unused levels in a factor

via @DavidArenburg:

myData[] <- lapply(myData, factor, levels = L)
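
Here L is taken to be the full set of levels that every column should carry. A minimal sketch with made-up data (the contents of myData and L are assumptions, not from the original question):

```r
# Hypothetical full set of levels and a data frame whose factor columns
# each use only some of them
L <- c("a", "b", "c", "d", "e")
myData <- data.frame(x = factor(c("a", "b")), y = factor(c("c", "e")))

levels_before <- levels(myData$x)

# Rebuild every column as a factor with the complete level set
myData[] <- lapply(myData, factor, levels = L)

print(levels_before)
print(levels(myData$x))
print(table(myData$y))
```

Note that this converts every column to a factor, so it is only appropriate when all columns are meant to share the same levels.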

R: drop factors with certain values

You can use add_count() to get the count for each value of the factor, then filter() to keep the rows where that count is >= 8. You can then drop the unused levels with droplevels() inside mutate().

library(dplyr)

# Example factor
df <- data.frame(fac = as.factor(c(rep("a", 3), rep("b", 8), rep("c", 9))))
df$fac %>% table()
#> .
#> a b c 
#> 3 8 9

# Keep only rows where the value of `fac` for that row is observed in at least
# 8 rows and drop unused levels
result <- df %>%
  add_count(fac) %>%
  filter(n >= 8) %>%
  mutate(fac = droplevels(fac))

print(result)
#>    fac n
#> 1    b 8
#> 2    b 8
#> 3    b 8
#> 4    b 8
#> 5    b 8
#> 6    b 8
#> 7    b 8
#> 8    b 8
#> 9    c 9
#> 10   c 9
#> 11   c 9
#> 12   c 9
#> 13   c 9
#> 14   c 9
#> 15   c 9
#> 16   c 9
#> 17   c 9

levels(result$fac)
#> [1] "b" "c"
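
If dplyr is not available, roughly the same result can be sketched in base R with table() and droplevels(). The helper names counts and keep are introduced here for illustration, and unlike add_count() this version does not add an n column:

```r
# Same example factor as above
df <- data.frame(fac = factor(c(rep("a", 3), rep("b", 8), rep("c", 9))))

# Count observations per level and keep the levels seen at least 8 times
counts <- table(df$fac)
keep <- names(counts)[counts >= 8]

# Filter the rows, then drop the levels that are no longer used
result <- droplevels(df[df$fac %in% keep, , drop = FALSE])

print(levels(result$fac))
print(nrow(result))
```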