Factors in R: more than an annoyance?
You should use factors. Yes they can be a pain, but my theory is that 90% of why they're a pain is because in read.table
and read.csv
, the argument stringsAsFactors = TRUE
by default (and most users miss this subtlety). I say they are useful because model fitting packages like lme4 use factors and ordered factors to differentially fit models and determine the type of contrasts to use. And graphing packages also use them to group by. ggplot
and most model fitting functions coerce character vectors to factors, so the result is the same. However, you end up with warnings in your code:
lm(Petal.Length ~ -1 + Species, data=iris)
# Call:
# lm(formula = Petal.Length ~ -1 + Species, data = iris)
# Coefficients:
# Speciessetosa Speciesversicolor Speciesvirginica
# 1.462 4.260 5.552
iris.alt <- iris
iris.alt$Species <- as.character(iris.alt$Species)
lm(Petal.Length ~ -1 + Species, data=iris.alt)
# Call:
# lm(formula = Petal.Length ~ -1 + Species, data = iris.alt)
# Coefficients:
# Speciessetosa Speciesversicolor Speciesvirginica
# 1.462 4.260 5.552
Warning message: In
model.matrix.default(mt, mf, contrasts)
:variable
Species
converted to afactor
One tricky thing is the whole drop=TRUE
bit. In vectors this works well to remove levels of factors that aren't in the data. For example:
s <- iris$Species
s[s == 'setosa', drop=TRUE]
# [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa
s[s == 'setosa', drop=FALSE]
# [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica
However, with data.frame
s, the behavior of [.data.frame()
is different: see this email or ?"[.data.frame"
. Using drop=TRUE
on data.frame
s does not work as you'd imagine:
x <- subset(iris, Species == 'setosa', drop=TRUE) # susbetting with [ behaves the same way
x$Species
# [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica
Luckily you can drop factors easily with droplevels()
to drop unused factor levels for an individual factor or for every factor in a data.frame
(since R 2.12):
x <- subset(iris, Species == 'setosa')
levels(x$Species)
# [1] "setosa" "versicolor" "virginica"
x <- droplevels(x)
levels(x$Species)
# [1] "setosa"
This is how to keep levels you've selected out from getting in ggplot
legends.
Internally, factor
s are integers with an attribute level character vector (see attributes(iris$Species)
and class(attributes(iris$Species)$levels)
), which is clean. If you had to change a level name (and you were using character strings), this would be a much less efficient operation. And I change level names a lot, especially for ggplot
legends. If you fake factors with character vectors, there's the risk that you'll change just one element, and accidentally create a separate new level.
Are factors stored more efficiently in data.table than characters?
You may be remembering data.table FAQ 2.17 which contains :
stringsAsFactors is by default TRUE in data.frame but FALSE in data.table, for efficiency. Since a global string cache was added to R, characters items are a pointer to the single cached string and there is no longer a performance benefit of converting to factor.
(That part was added to the FAQ in v1.8.2 in July 2012.)
Using character rather than factor helps a lot in tasks like stacking (rbindlist). Since a c()
of two character vectors is just the concatenation whereas a c()
of two factor columns needs to traverse and union the two factor levels which is harder to code and takes longer to execute.
What you've noticed is a difference in RAM consumption on 64bit machines. Factors are stored as an integer
vector lookup of the items in the levels. Type integer
is 32bit, even on 64bit platforms. But pointers (what a character
vector is) are 64bit on 64bit machines. So a character column will use twice as much RAM than a factor column on 64bit machine. No difference on 32bit. However, usually this cost will be outweighed by the simpler and faster instructions possible on a character vector. [Aside: since factors are integer
they can't contain more than 2 billion unique strings. character
columns don't have that limitation.]
It depends on what you're doing but operations have been optimized for character
in data.table and so that's what we advise. Basically it saves a hop (to levels) and we can compare two character columns in different tables just by comparing the pointer values without hopping at all, even to the global cache.
It depends on the cardinality of the column, too. Say the column is 1 million rows and contains 1 million unique strings. Storing it as a factor will need a 1 million character vector for the levels plus a 1 million integer vector pointing to the level's elements. That's (4+8)*1e6 bytes. A character vector on the other hand won't need the levels and it's just 8*1e6 bytes. In both cases the global cache stores the 1 million unique strings in the same way so that happens anyway. In this case, the character column will use less RAM than if it were a factor. Careful to check that the memory tool used to calculate the RAM usage is calculating this appropriately.
There are more factors than what I have in the X axis being labeled
This happened because Year
is a numeric variable, and ggplot made that separation based on the values, if you had more years probably that would not happen. Here are two solutions.
Example data
df <-
tibble(
x = rep(2014:2015, each = 100),
y = c(rnorm(100),rexp(100))
)
Original code
df %>%
ggplot(aes(x,y))+
geom_violin(aes(fill = factor(x)))
Solutions
Transform Year in a factor/character
That is a nice solution, because also solves the aesthetic fill
, it could complicate some other geometry but that is very specific.
df %>%
mutate(x = as.factor(x)) %>%
ggplot(aes(x,y))+
geom_violin(aes(fill = x))
Add a scale for x axis
Using scale you can set whatever labels
or breaks
, but you still need to transform the variable for the aesthetic fill
.
df %>%
ggplot(aes(x,y))+
geom_violin(aes(fill = factor(x)))+
scale_x_continuous(breaks = 2014:2015)
Result
Related Topics
Find K Nearest Neighbors, Starting from a Distance Matrix
Find Start and End Positions/Indices of Runs/Consecutive Values
Remove Duplicate Column Pairs, Sort Rows Based on 2 Columns
Split Dataframe by Levels of a Factor and Name Dataframes by Those Levels
Ggplot2: Changing the Order of Stacks on a Bar Graph
R Keep Rows with at Least One Column Greater Than Value
Long Numbers as a Character String
Command Lines Error in Rstudio Console
Splitting a Data.Frame by a Variable
Merge Dataframes of Different Sizes
Reshape from Long to Wide and Create Columns with Binary Value
Convert from Billion to Million and Vice Versa
Expand Spacing Between Tick Marks on X Axis
What's the Difference Between Lapply and Do.Call
Select Rows of a Matrix That Meet a Condition
Why Is Message() a Better Choice Than Print() in R for Writing a Package