How can I drop unused levels from a data frame?
R (since version 2.12.0) has a built-in function for this:
y <- droplevels(y)
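As a minimal sketch of what this does on a subsetted data frame, assuming a toy data frame whose column names (letters, numbers) simply mirror the subdf example used further down:

```r
# Toy data; the column names are illustrative, not from a real dataset.
df <- data.frame(letters = factor(c("a", "b", "c", "d", "e")), numbers = 1:5)
subdf <- subset(df, numbers <= 3)
levels(subdf$letters)        # all five levels are still attached

# droplevels() on a data frame drops unused levels in every factor column.
dropped <- droplevels(subdf)
levels(dropped$letters)      # only the levels that still occur

# 'except' gives the column positions to leave untouched.
untouched <- droplevels(subdf, except = 1)
nlevels(untouched$letters)   # the first column keeps its original levels
```

Non-factor columns such as numbers pass through unchanged.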
Drop unused factor levels in a subsetted data frame
All you should have to do is to apply factor() to your variable again after subsetting:
> subdf$letters
[1] a b c
Levels: a b c d e
> subdf$letters <- factor(subdf$letters)
> subdf$letters
[1] a b c
Levels: a b c
EDIT
From the factor page example:
factor(ff) # drops the levels that do not occur
For dropping levels from all factor columns in a dataframe, you can use:
subdf <- subset(df, numbers <= 3)
subdf[] <- lapply(subdf, function(x) if(is.factor(x)) factor(x) else x)
Dropping unused factor levels in data.table
We can use .SDcols to specify the columns of interest. It can take a vector of column names (of length 1 or more) or column indices. .SD, i.e. the Subset of Data.table, then contains only the columns specified in .SDcols. As there is only a single column here, extract it with [[, apply droplevels to the vector, and assign (:=) it back to the column of interest. Note the parentheses around the object identifier v1: they make data.table evaluate the object and use its value as the column name, instead of creating a column named 'v1'.
x[, (v1) := droplevels(.SD[[1]]), .SDcols = v1]
Usually, the syntax would be
x[, (v1) := lapply(.SD, droplevels), .SDcols = v1]
It can take one column or multiple columns. The only reason to extract with [[ is that we know it is a single column.
Another option is get:
x[, (v1) := droplevels(get(v1))]
where
v1 <- "y"
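To make the snippets above reproducible, here is a small sketch using the lapply form. The table x and its columns y and z are made up for illustration, and the data.table package is assumed to be installed:

```r
library(data.table)

# Hypothetical table: one factor column and one integer column.
x <- data.table(y = factor(c("a", "b", "c", "d")), z = 1:4)
x <- x[z <= 2]               # subsetting keeps all four levels of y
nlevels(x$y)

v1 <- "y"                    # name of the column to clean, held in a variable
# (v1) is evaluated to "y"; .SD holds only the columns listed in .SDcols.
x[, (v1) := lapply(.SD, droplevels), .SDcols = v1]
levels(x$y)                  # only the levels that remain in the subset
```

Because the lapply form works column by column, v1 could equally hold several factor column names.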
Override [.data.frame to drop unused factor levels by default
I'd be really wary of changing the default behavior; you never know when another function you use depends on it. I'd instead write a function similar to your subsetDrop, but for [, like
sel <- function(x, ...) droplevels(x[...])
Then
> d <- data.frame(a=factor(LETTERS[1:5]), b=factor(letters[1:5]))
> str(d[1:2,])
'data.frame': 2 obs. of 2 variables:
$ a: Factor w/ 5 levels "A","B","C","D",..: 1 2
$ b: Factor w/ 5 levels "a","b","c","d",..: 1 2
> str(sel(d,1:2,))
'data.frame': 2 obs. of 2 variables:
$ a: Factor w/ 2 levels "A","B": 1 2
$ b: Factor w/ 2 levels "a","b": 1 2
If you really want to change the default, you could do something like
foo <- `[.data.frame`
`[.data.frame` <- function(...) droplevels(foo(...))
but make sure you know how namespaces work: the override applies to anything called from the global environment, while the version in the base namespace is unchanged, so code inside packages still sees the original. That might be a good thing, but it's something you want to make sure you understand. After this change, the output is as you'd like:
> str(d[1:2,])
'data.frame': 2 obs. of 2 variables:
$ a: Factor w/ 2 levels "A","B": 1 2
$ b: Factor w/ 2 levels "a","b": 1 2
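If you try the override and want the usual behavior back, one way is to remove the masking definition again. A rough sketch, assuming the override was defined in the current (global) environment as above; the variable names with_mask and without_mask are only for illustration:

```r
d <- data.frame(a = factor(LETTERS[1:5]), b = factor(letters[1:5]))

# Keep a handle on the original method, then mask it.
foo <- `[.data.frame`
`[.data.frame` <- function(...) droplevels(foo(...))
with_mask <- nlevels(d[1:2, ]$a)      # unused levels are dropped

# Remove the masking definition; dispatch falls back to the base method.
rm("[.data.frame")
without_mask <- nlevels(d[1:2, ]$a)   # all levels are retained again
```

Note that while the mask is in place, any subset that returns something without a droplevels method (for example a single numeric column) would fail, which is one more reason to prefer a separate helper like sel.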
Add back unused levels in a factor
Via @DavidArenburg, where L is a vector holding the full set of desired levels:
myData[] <- lapply(myData, factor, levels = L)
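A small sketch of this, with made-up data; myData and L here are illustrative stand-ins for the objects in the original question:

```r
# Two factor columns, each using only part of the full level set.
myData <- data.frame(x = factor(c("a", "b")), y = factor(c("b", "c")))
L <- c("a", "b", "c", "d")   # the complete set of levels we want everywhere

# Re-create every column as a factor with the full level set.
# Assigning into myData[] keeps the data frame structure.
myData[] <- lapply(myData, factor, levels = L)
levels(myData$x)             # now the same as L
levels(myData$y)             # now the same as L
```

Any value that does not appear in L becomes NA, so L needs to cover every value present in the data.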
R: drop factors with certain values
You can use add_count() to get the counts for each value of the factor, then filter() to keep rows where the count is >= 8. You can then drop the unused levels with droplevels inside mutate.
library(dplyr)
# Example factor
df <- data.frame(fac = as.factor(c(rep("a", 3), rep("b", 8), rep("c", 9))))
df$fac %>% table()
#> .
#> a b c
#> 3 8 9
# Keep only rows where the value of `fac` for that row is observed in at least
# 8 rows and drop unused levels
result <- df %>%
  add_count(fac) %>%
  filter(n >= 8) %>%
  mutate(fac = droplevels(fac))
print(result)
#> fac n
#> 1 b 8
#> 2 b 8
#> 3 b 8
#> 4 b 8
#> 5 b 8
#> 6 b 8
#> 7 b 8
#> 8 b 8
#> 9 c 9
#> 10 c 9
#> 11 c 9
#> 12 c 9
#> 13 c 9
#> 14 c 9
#> 15 c 9
#> 16 c 9
#> 17 c 9
levels(result$fac)
#> [1] "b" "c"
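If dplyr is not available, roughly the same result can be had in base R. This is a sketch of an alternative, not the approach from the answer above; it reuses the same example factor and the same threshold of 8, and does not add the count column n:

```r
# Same example factor as above.
df <- data.frame(fac = factor(c(rep("a", 3), rep("b", 8), rep("c", 9))))

# Count rows per level and keep the names of levels seen at least 8 times.
counts <- table(df$fac)
keep <- names(counts)[counts >= 8]

# Filter the rows, then drop the levels that no longer occur.
result <- droplevels(df[df$fac %in% keep, , drop = FALSE])
levels(result$fac)
```

The drop = FALSE matters here because df has a single column; without it the subset would collapse to a bare factor.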