Use a Factor Column in "By" and Do Not Drop Empty Factors

Use a factor column in by and do not drop empty factors

If you are willing to enumerate the factor levels in i (rather than setting by="group"), this gives the hoped-for result, including a zero count for the empty level.

setkey(x, "group")
x[levels(group), .N, by=.EACHI]
#    group N
# 1:     a 2
# 2:     b 1
# 3:     c 0
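For a version you can paste and run, here is a minimal sketch. The table x and its value column are made up for illustration (the original x is not shown); the only thing that matters is a factor column group with an unused level "c".

```r
library(data.table)

# Hypothetical data: level "c" is declared but never occurs.
x <- data.table(
  group = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  value = 1:3
)
setkey(x, group)

# Join on every declared level. by = .EACHI evaluates .N once per row of i,
# so a level with no matching rows reports a count of 0.
lv <- levels(x$group)
counts <- x[.(lv), .N, by = .EACHI]
print(counts)
```

The same idea works with levels(group) written directly in i, as above; pulling the levels into lv first just makes the lookup explicit.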

Empty factors in by data.table

library(data.table)
set.seed(42)
dtr <- data.table(v1 = sample(1:15),
                  v2 = factor(sample(letters[1:3], 15, replace = TRUE), levels = letters[1:5]),
                  v3 = sample(c("yes", "no"), 15, replace = TRUE))

# count and sum per (v2, v3) combination; unused levels "d" and "e" do not appear
res <- dtr[, list(freq = .N, mm = sum(v1, na.rm = TRUE)), by = list(v2, v3)]

You can use CJ (a cross join). Doing this after aggregation avoids setting the key for the big table and should be faster.

setkey(res, v2, v3)   # setkey() takes unquoted column names; use setkeyv() for a character vector
res[CJ(levels(dtr$v2), unique(dtr$v3))]

#     v2  v3 freq mm
#  1:  a  no    1  9
#  2:  a yes    2 11
#  3:  b  no    2 11
#  4:  b yes    3 23
#  5:  c  no    4 40
#  6:  c yes    3 26
#  7:  d  no   NA NA
#  8:  d yes   NA NA
#  9:  e  no   NA NA
# 10:  e yes   NA NA
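If you would rather see zeros than NA for the missing combinations, one option is to fill them in after the join. This is only a sketch on a small made-up table (dt here is hypothetical, not the dtr above), and treating a missing sum as 0 is an assumption that will not suit every statistic.

```r
library(data.table)

# Hypothetical data: level "c" never occurs, and ("b", "no") never occurs.
dt <- data.table(
  v1 = 1:4,
  v2 = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c")),
  v3 = c("yes", "no", "yes", "yes")
)

# Aggregate first, then key the (small) result rather than the big table.
res <- dt[, .(freq = .N, mm = sum(v1)), by = .(v2, v3)]
setkey(res, v2, v3)

# Cross join of all levels and all observed v3 values; absent pairs come back as NA.
full <- res[CJ(levels(dt$v2), unique(dt$v3))]

# Replace the NA placeholders by reference (assumption: 0 is the right filler).
full[is.na(freq), `:=`(freq = 0L, mm = 0L)]
print(full)
```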

Drop unused factor levels in a subsetted data frame

All you should have to do is to apply factor() to your variable again after subsetting:

> subdf$letters
[1] a b c
Levels: a b c d e
> subdf$letters <- factor(subdf$letters)
> subdf$letters
[1] a b c
Levels: a b c

EDIT

From the factor page example:

factor(ff)      # drops the levels that do not occur

For dropping levels from all factor columns in a dataframe, you can use:

subdf <- subset(df, numbers <= 3)
subdf[] <- lapply(subdf, function(x) if(is.factor(x)) factor(x) else x)
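Base R's droplevels() also has a data frame method that does the same thing in one call. A minimal sketch follows; the contents of df are an assumption reconstructed from the column names used above (letters and numbers), since the original data frame is not shown.

```r
# Hypothetical data frame with one factor column and one numeric column.
df <- data.frame(
  letters = factor(c("a", "b", "c", "d", "e")),
  numbers = 1:5
)

subdf <- subset(df, numbers <= 3)
levels(subdf$letters)      # still lists all five levels after subsetting

# droplevels() on a data frame drops unused levels from every factor column.
subdf <- droplevels(subdf)
levels(subdf$letters)
```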

Why are empty levels in my factor tabulated after I assign NAs to missing values?

Check whether your dataframe actually contains any NA values. It looks as though it does not, and the missing entries are stored as empty strings instead. Try this:

# works because factor-levels are integers, internally; "" seems to be level 1
which(as.integer(df$MF) == 1)

# works if your missing value is just ""
which(df$MF == "")

You should then clean up your dataframe so that it properly reflects missing values. A factor will handle NA:

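# stringsAsFactors = TRUE is needed in R >= 4.0, where strings are no longer converted to factors by default
df <- data.frame("rest" = c(1:5), "sex" = c("M", "F", "F", "M", ""), stringsAsFactors = TRUE)
df$sex[which(as.integer(df$sex) == 1)] <- NA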

Once you have cleaned your data, you will have to drop unused levels so that tabulations such as table() stop reporting the empty level.

Observe this sequence of steps and its outputs:

# Build a dataframe to reproduce your behaviour
> df <- data.frame("Restaurant" = c(1:5), "MF" = c("M", "F", "F", "M", ""), stringsAsFactors = TRUE)
# notice the empty level "" for the missing value
> levels(df$MF)
[1] "" "F" "M"

# notice how a tabulation counts the empty level;
# this is the first column with a 1 (it has no label because
# there is no label, it is "")
> table(df$MF)

  F M
1 2 2

# find the culprit and change it to NA
> df$MF[which(as.integer(df$MF) == 1)] <- as.factor(NA)

# Aha! Despite changing the value, the factor's levels were not
# updated. Check the levels, then tabulate the column again:
> levels(df$MF)
[1] "" "F" "M"

# Indeed, the empty level is still present in the factor, but it has
# no occurrences!
> table(df$MF)

  F M
0 2 2

# droplevels to the rescue:
# it is used to drop unused levels from a factor or, more commonly,
# from factors in a data frame.
> df$MF <- droplevels(df$MF)

# factors fixed
> levels(df$MF)
[1] "F" "M"

# tabulation fixed
> table(df$MF)

F M
2 2
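If you control how the column is created, you can avoid the stray level altogether by turning the empty strings into NA before building the factor. This is a sketch under that assumption; raw and mf are illustrative names, not part of your data.

```r
# Hypothetical raw character column with one blank entry.
raw <- c("M", "F", "F", "M", "")

# Mark blanks as missing while the data is still character.
raw[raw == ""] <- NA

# factor() excludes NA from the levels by default, so no "" level is created.
mf <- factor(raw)
levels(mf)

# Plain tabulation ignores NA; useNA = "ifany" shows it when you want to see it.
table(mf)
table(mf, useNA = "ifany")
```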

Move empty factor levels while maintaining order of non-empty levels in ggplot2

This was surprisingly tricky - given that all you need to do is order your levels correctly. I couldn't find anything in forcats that was directly appropriate, but we can write our own reordering function.

library(dplyr)
library(forcats)
library(ggplot2)

my_reorder <- function(fac, var) {
  fac <- fct_reorder(fac, var)      # order levels by var
  l <- levels(fac)
  nonempty <- levels(factor(fac))   # I got this idea from droplevels()
  empty <- setdiff(l, nonempty)     # levels with no observations
  fct_relevel(fac, empty, nonempty) # empty levels first, then the ordered non-empty ones
}

mtcars %>%
  mutate(cyl = as.factor(cyl),
         cyl = fct_expand(cyl, c("2", "4", "6", "8"))) %>%
  group_by(cyl) %>%
  summarize(meanMPG = mean(mpg)) %>%
  ungroup() %>%
  mutate(cyl = my_reorder(cyl, meanMPG)) %>%
  ggplot(aes(x = cyl, y = meanMPG)) +
  geom_col() +
  scale_x_discrete(drop = FALSE) +
  coord_flip() # shows empty level "2" on the top
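If you want to check the level order without drawing a plot, or without forcats, the same idea can be sketched in base R. This is not the function above, just an approximation of it: move_empty_first is a made-up name, stats::reorder() stands in for fct_reorder(), and the means are rounded illustrative values.

```r
# Hypothetical base-R counterpart: order non-empty levels by var,
# then move the empty levels to the front of the level vector.
move_empty_first <- function(fac, var) {
  fac <- reorder(fac, var)                 # levels sorted by mean of var; empty levels sort last
  nonempty <- levels(droplevels(fac))      # levels that actually occur, in their new order
  empty <- setdiff(levels(fac), nonempty)  # levels with no observations
  factor(fac, levels = c(empty, nonempty))
}

# Illustrative input: level "2" is declared but has no rows.
cyl <- factor(c("4", "6", "8"), levels = c("2", "4", "6", "8"))
mean_mpg <- c(26.7, 19.7, 15.1)

out <- move_empty_first(cyl, mean_mpg)
levels(out)
```

Whichever version you use, remember that ggplot2 only shows the empty level when the scale is told not to drop it (drop = FALSE).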

