Use a factor column in by and do not drop empty factors
If you are willing to run through the factor levels by enumerating them in i (rather than by setting by="group"), this will get you the hoped-for results.
setkey(x, "group")
x[levels(group), .N, by=.EACHI]
# group N
# 1: a 2
# 2: b 1
# 3: c 0
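For reference, here is a minimal self-contained sketch of the same approach. The original data is not shown, so the table x below is made up to match the output above.

```r
library(data.table)

# Hypothetical data: a factor column with an unused level "c"
x <- data.table(group = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
                value = 1:3)

setkey(x, group)

# Join on every level of the factor; by = .EACHI evaluates j once per
# row of i, so levels with no matching rows are kept with a count of 0
counts <- x[levels(x$group), .N, by = .EACHI]
print(counts)
```

Passing the levels in i is what keeps the empty level: a plain by = "group" only ever sees groups that actually occur in the data.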
Empty factors in by data.table
library(data.table)
set.seed(42)
dtr <- data.table(v1 = sample(1:15),
                  v2 = factor(sample(letters[1:3], 15, replace = TRUE), levels = letters[1:5]),
                  v3 = sample(c("yes", "no"), 15, replace = TRUE))
res <- dtr[, list(freq = .N, mm = sum(v1, na.rm = TRUE)), by = list(v2, v3)]
You can use CJ (a cross join). Doing this after aggregation avoids setting the key on the big table and should be faster.
setkeyv(res, c("v2", "v3"))
res[CJ(levels(dtr[,v2]),unique(dtr[,v3])),]
# v2 v3 freq mm
# 1: a no 1 9
# 2: a yes 2 11
# 3: b no 2 11
# 4: b yes 3 23
# 5: c no 4 40
# 6: c yes 3 26
# 7: d no NA NA
# 8: d yes NA NA
# 9: e no NA NA
# 10: e yes NA NA
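If you would rather see zeros than NA for the missing combinations, the join can be followed by an update. The sketch below is only an illustration: the table, its column names, and the choice to join with on= instead of a key are all assumptions, not part of the answer above.

```r
library(data.table)

# Hypothetical data: level "c" of g never occurs
dt <- data.table(g = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
                 f = c("yes", "no", "yes"),
                 v = 1:3)

# Aggregate first, on the (possibly large) original table
agg <- dt[, list(freq = .N, total = sum(v)), by = list(g, f)]

# Build every level/value combination and join the aggregate onto it
grid <- CJ(g = levels(dt$g), f = unique(dt$f))
full <- agg[grid, on = c("g", "f")]

# Combinations absent from the data come back as NA; turn them into zeros
full[is.na(freq), `:=`(freq = 0L, total = 0L)]
print(full)
```

The grid has one row per combination (3 levels times 2 values here), so the result always has the full set of rows regardless of what occurs in the data.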
Drop unused factor levels in a subsetted data frame
All you should have to do is to apply factor() to your variable again after subsetting:
> subdf$letters
[1] a b c
Levels: a b c d e
subdf$letters <- factor(subdf$letters)
> subdf$letters
[1] a b c
Levels: a b c
EDIT
From the factor page example:
factor(ff) # drops the levels that do not occur
For dropping levels from all factor columns in a dataframe, you can use:
subdf <- subset(df, numbers <= 3)
subdf[] <- lapply(subdf, function(x) if(is.factor(x)) factor(x) else x)
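A small runnable version of the above, with a made-up data frame standing in for df (its contents are an assumption). It also shows droplevels(), which gives the same result for a data frame in one call.

```r
# Hypothetical data frame with a five-level factor
df <- data.frame(letters = factor(letters[1:5]), numbers = 1:5)

subdf <- subset(df, numbers <= 3)
levels_before <- levels(subdf$letters)   # subsetting keeps all five levels

# Re-apply factor() to every factor column to drop the unused levels
subdf[] <- lapply(subdf, function(x) if (is.factor(x)) factor(x) else x)
levels_after <- levels(subdf$letters)

# Alternative: droplevels() works on a whole data frame as well
subdf2 <- droplevels(subset(df, numbers <= 3))
```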
Why are empty levels in my factor tabulated after I assign NAs to missing values?
Check whether your data frame really has no missing values, because it looks as though it does have some. Try this:
# works because factor-levels are integers, internally; "" seems to be level 1
which(as.integer(df$MF) == 1)
# works if your missing value is just ""
which(df$MF == "")
You should then clean up your data frame so that it properly reflects missing values. A factor will handle NA:
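# stringsAsFactors = TRUE is needed in R >= 4.0.0, where strings are no longer converted to factors by default
df <- data.frame("rest" = c(1:5), "sex" = c("M", "F", "F", "M", ""), stringsAsFactors = TRUE)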
df$sex[which(as.integer(df$sex) == 1)] <- NA
Once you have cleaned your data, you will have to drop unused levels to prevent tabulations such as table from counting occurrences of the empty level.
Observe this sequence of steps and its outputs:
# Build a dataframe to reproduce your behaviour
> df <- data.frame("Restaurant" = c(1:5), "MF" = c("M", "F", "F", "M", ""), stringsAsFactors = TRUE)
# notice the empty level "" for the missing value
> levels(df$MF)
[1] "" "F" "M"
# notice how a tabulation counts the empty level;
# this is the first column with a 1 (it has no label because
# there is no label, it is "")
> table(df$MF)
F M
1 2 2
# find the culprit and change it to NA
> df$MF[which(as.integer(df$MF) == 1)] <- as.factor(NA)
# AHA! So despite us changing the value, the original factor
# was not updated! I wonder what happens if we tabulate the column...
> levels(df$MF)
[1] "" "F" "M"
# Indeed, the empty level is still present in the factor, but there are
# no occurrences!
> table(df$MF)
F M
0 2 2
# droplevels to the rescue:
# it is used to drop unused levels from a factor or, more commonly,
# from factors in a data frame.
> df$MF <- droplevels(df$MF)
# factors fixed
> levels(df$MF)
[1] "F" "M"
# tabulation fixed
> table(df$MF)
F M
2 2
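The same sequence condensed into a non-interactive script, as a sketch. The data is the toy example from above, and stringsAsFactors = TRUE is set explicitly on the assumption that you are on R 4.0.0 or later, where it is no longer the default.

```r
# Toy data with an empty string standing in for a missing value
df <- data.frame(Restaurant = 1:5,
                 MF = c("M", "F", "F", "M", ""),
                 stringsAsFactors = TRUE)

# Step 1: mark the empty entries as missing
df$MF[df$MF == ""] <- NA

# Step 2: the "" level still exists, so drop levels that no longer occur
df$MF <- droplevels(df$MF)

counts <- table(df$MF)
print(levels(df$MF))
print(counts)
```

Both steps are needed: assigning NA changes the values but leaves the level set untouched, and droplevels() only removes a level once nothing refers to it.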
Move empty factor levels while maintaining order of non-empty levels in ggplot2
This was surprisingly tricky - given that all you need to do is order your levels correctly. I couldn't find anything in forcats that was directly appropriate, but we can write our own reordering function.
library(dplyr)
library(forcats)
library(ggplot2)

my_reorder <- function(fac, var) {
  # order all levels by var, then move the unused levels to the front
  fac <- fct_reorder(fac, {{var}})
  l <- levels(fac)
  nonempty <- levels(factor(fac)) # I got this idea from droplevels()
  empty <- setdiff(l, nonempty)
  fct_relevel(fac, empty, nonempty)
}
mtcars %>%
mutate(cyl = as.factor(cyl),
cyl = fct_expand(cyl, c("2", "4", "6", "8"))) %>%
group_by(cyl) %>%
summarize(meanMPG = mean(mpg)) %>%
ungroup() %>%
mutate(cyl = my_reorder(cyl, meanMPG)) %>%
ggplot(aes(x = cyl, y = meanMPG)) +
geom_col() +
scale_x_discrete(drop = FALSE) +
coord_flip() # shows empty level "2" on the top
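To check the resulting level order without drawing a plot, the same idea can be sketched in base R. This is not the forcats code above: reorder_empty_first is a hypothetical stand-in that uses the mean of the values per level, and the numbers are rounded illustrative values.

```r
# Hypothetical base-R variant: unused levels first, then the used
# levels ordered by the mean of `values` within each level
reorder_empty_first <- function(fac, values) {
  used  <- levels(droplevels(fac))
  empty <- setdiff(levels(fac), used)
  level_means <- vapply(used,
                        function(l) mean(values[fac == l]),
                        numeric(1))
  factor(fac, levels = c(empty, used[order(level_means)]))
}

# Illustrative data: level "2" has no observations
cyl     <- factor(c("4", "6", "8"), levels = c("2", "4", "6", "8"))
meanMPG <- c(26.7, 19.7, 15.1)

reordered <- reorder_empty_first(cyl, meanMPG)
print(levels(reordered))
```

Inspecting levels() like this is a quick way to confirm where the empty level ends up before handing the factor to ggplot2.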