Are factors stored more efficiently in data.table than characters?
You may be remembering data.table FAQ 2.17, which contains:
stringsAsFactors is by default TRUE in data.frame but FALSE in data.table, for efficiency. Since a global string cache was added to R, characters items are a pointer to the single cached string and there is no longer a performance benefit of converting to factor.
(That part was added to the FAQ in v1.8.2 in July 2012.)
Using character rather than factor helps a lot in tasks like stacking (rbindlist), since a c() of two character vectors is just the concatenation, whereas a c() of two factor columns needs to traverse and union the two sets of factor levels, which is harder to code and takes longer to execute.
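To make the difference concrete, here is a minimal base-R sketch. The vectors and the helper name combine_factors are made up for illustration, and this is not how rbindlist is implemented internally; it only shows the extra level-union and re-mapping work that stacking factors implies.

```r
# Two character vectors: stacking is plain concatenation.
x1 <- c("a", "b", "a")
x2 <- c("c", "b")
stacked_chr <- c(x1, x2)

# The same data as factors, with different level sets.
f1 <- factor(x1)   # levels: a, b
f2 <- factor(x2)   # levels: b, c

# Hypothetical helper: union the levels, then re-map each factor's
# integer codes onto the combined level set.
combine_factors <- function(a, b) {
  lev <- union(levels(a), levels(b))
  codes <- c(match(levels(a), lev)[as.integer(a)],
             match(levels(b), lev)[as.integer(b)])
  structure(codes, levels = lev, class = "factor")
}

stacked_fct <- combine_factors(f1, f2)
levels(stacked_fct)
as.character(stacked_fct)
```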
What you've noticed is a difference in RAM consumption on 64bit machines. Factors are stored as an integer vector lookup of the items in the levels. Type integer is 32bit, even on 64bit platforms. But pointers (what a character vector is) are 64bit on 64bit machines. So a character column will use twice as much RAM as a factor column on a 64bit machine. There is no difference on 32bit. However, usually this cost will be outweighed by the simpler and faster instructions possible on a character vector. [Aside: since factors are integer, they can't contain more than about 2 billion unique strings. character columns don't have that limitation.]
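The sizes mentioned above can be checked on your own machine. This is a small sketch using only base R; the variable names are illustrative, and object.size() gives an estimate that includes a small fixed header.

```r
# Largest value an R integer can hold (32-bit signed), which caps the
# number of distinct levels a factor could index.
.Machine$integer.max

# Bytes per pointer on this platform: 8 on 64-bit builds, 4 on 32-bit.
.Machine$sizeof.pointer

# Per-element storage of integer codes, like those inside a factor.
n <- 1e6
codes <- rep(1L, n)
bytes_per_code <- as.numeric(object.size(codes)) / n   # close to 4
bytes_per_code
```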
It depends on what you're doing, but operations have been optimized for character in data.table and so that's what we advise. Basically it saves a hop (to levels) and we can compare two character columns in different tables just by comparing the pointer values without hopping at all, even to the global cache.
It depends on the cardinality of the column, too. Say the column is 1 million rows and contains 1 million unique strings. Storing it as a factor will need a 1 million character vector for the levels plus a 1 million integer vector pointing to the levels' elements. That's (4+8)*1e6 bytes. A character vector, on the other hand, doesn't need the levels and is just 8*1e6 bytes. In both cases the global cache stores the 1 million unique strings in the same way, so that cost is incurred anyway. In this case, the character column will use less RAM than if it were a factor. Be careful to check that the memory tool used to calculate the RAM usage accounts for this appropriately.
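The high-cardinality case can be sketched as follows. The vector length is reduced to 1e5 to keep the example quick, the names are illustrative, and object.size() is only an estimate, so treat the measured figures as approximate.

```r
# All-unique strings: the worst case for a factor.
n <- 1e5
x <- sprintf("id%06d", seq_len(n))   # n distinct strings
f <- factor(x)

# Back-of-the-envelope vector payloads on a 64-bit build (the strings
# themselves are excluded, since the global cache holds them either way).
est_chr <- 8 * n          # pointers only
est_fct <- (4 + 8) * n    # integer codes + pointers in the levels

# Measured sizes: the factor carries the extra integer vector.
size_chr <- as.numeric(object.size(x))
size_fct <- as.numeric(object.size(f))
size_fct - size_chr       # at least 4 * n bytes
```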
Efficiency of factor vs. characters - object size
Roughly speaking, a factor is an integer vector with a levels attribute (a character vector) listing the category names and a class attribute (another character vector) telling R that it's a factor.
A short factor tends to require more memory than a character vector of the same length, because the cost of storing the factor's attributes more than offsets the saving due to storing integers instead of strings. Here is an extreme example illustrating this point:
x <- c("a", "b")
f <- factor(x)
class(f)
# [1] "factor"
unclass(f)
# [1] 1 2
# attr(,"levels")
# [1] "a" "b"
Storing f requires storing both the integer vector c(1L, 2L) and the character vector c("a", "b"). In this case, the integer vector is completely redundant, because c("a", "b") encodes all of the information we needed in the first place.
object.size(f)
# 568 bytes
object.size(x)
# 176 bytes
It becomes more efficient to store factors when the levels have a large number of repetitions.
g <- gl(2L, 1e06L, labels = c("a", "b"))
y <- as.character(g)
object.size(g)
# 8000560 bytes
object.size(y)
# 16000160 bytes
Some things to keep in mind:
- Many R functions that handle categorical variables (table, split, etc.) convert character vector arguments to factors before doing anything else with them. Thus, actually doing stuff with a categorical variable almost always involves committing a factor to memory anyway.
- Factors clearly communicate to users that the variable is categorical and not merely a sequence of strings.
So, there are many good reasons to prefer factors, even if they are short.
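The implicit conversion mentioned in the first point can be observed directly. This is a small sketch with made-up data; it relies on split() coercing its grouping argument with as.factor() and on table() tabulating by factor level.

```r
g <- c("b", "a", "b", "c", "a")   # categorical data held as character
v <- c(10, 20, 30, 40, 50)

# The groups come back in sorted level order, not order of appearance,
# because the character vector was turned into a factor first.
parts <- split(v, g)
names(parts)

# table() likewise reports its counts by factor level.
counts <- table(g)
names(counts)
```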
why character is often preferred to factor in data.table for key?
Isn't factor just integer, which should be easier to do counting sort on than character?
Yes, if you're given a factor already. But the time to create that factor can be significant and that's what setkey (and ad hoc by) aim to beat. Try timing factor() on a randomly ordered character vector, say 1e6 long with 1e4 levels. Then compare to setkey or ad hoc by on the original randomly ordered character vector.
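The suggested timing experiment might look like the sketch below. The object names are illustrative, the data.table part runs only if the package is installed, and the timings will vary by machine and version, so no expected numbers are given.

```r
set.seed(42)
lev <- sprintf("id%05d", 1:1e4)      # 1e4 distinct strings
x <- sample(rep(lev, each = 100))    # 1e6 values in random order

# Cost of building the factor from scratch.
t_factor <- system.time(f <- factor(x))
print(t_factor[["elapsed"]])

# Compare against data.table on the raw character column, if available.
if (requireNamespace("data.table", quietly = TRUE)) {
  DT1 <- data.table::data.table(x = x)
  t_by <- system.time(res <- DT1[, .N, by = x])      # ad hoc by
  DT2 <- data.table::data.table(x = x)
  t_key <- system.time(data.table::setkey(DT2, x))   # sort and key
  print(c(by = t_by[["elapsed"]], setkey = t_key[["elapsed"]]))
}
```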
agstudy's comment is correct too; i.e., character vectors (being pointers to R cached strings) are quite similar to factors anyway. On 32bit systems character vectors are the same size as the factor's integer vector but the factor has the levels attribute to store (and sometimes copy) too. On 64bit systems the pointers are twice as big. But on the other hand R's string cache can be looked up directly from character vector pointers, whereas the factor has an extra hop via levels. (The levels attribute is a character vector of R string cache pointers too.)
Factors in R: more than an annoyance?
You should use factors. Yes, they can be a pain, but my theory is that 90% of why they're a pain is because in read.table and read.csv, the argument stringsAsFactors = TRUE by default in R versions before 4.0.0 (and most users miss this subtlety). I say they are useful because model fitting packages like lme4 use factors and ordered factors to differentially fit models and determine the type of contrasts to use. And graphing packages also use them to group by. ggplot and most model fitting functions coerce character vectors to factors, so the result is the same. However, you end up with warnings in your code:
lm(Petal.Length ~ -1 + Species, data=iris)
# Call:
# lm(formula = Petal.Length ~ -1 + Species, data = iris)
# Coefficients:
# Speciessetosa Speciesversicolor Speciesvirginica
# 1.462 4.260 5.552
iris.alt <- iris
iris.alt$Species <- as.character(iris.alt$Species)
lm(Petal.Length ~ -1 + Species, data=iris.alt)
# Call:
# lm(formula = Petal.Length ~ -1 + Species, data = iris.alt)
# Coefficients:
# Speciessetosa Speciesversicolor Speciesvirginica
# 1.462 4.260 5.552
# Warning message:
# In model.matrix.default(mt, mf, contrasts) :
#   variable 'Species' converted to a factor
One tricky thing is the whole drop=TRUE bit. In vectors this works well to remove levels of factors that aren't in the data. For example:
s <- iris$Species
s[s == 'setosa', drop=TRUE]
# [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa
s[s == 'setosa', drop=FALSE]
# [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica
However, with data.frames, the behavior of [.data.frame() is different: see ?"[.data.frame". Using drop=TRUE on data.frames does not work as you'd imagine:
x <- subset(iris, Species == 'setosa', drop=TRUE) # subsetting with [ behaves the same way
x$Species
# [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica
Luckily you can drop unused factor levels easily with droplevels(), either for an individual factor or for every factor in a data.frame (since R 2.12):
x <- subset(iris, Species == 'setosa')
levels(x$Species)
# [1] "setosa" "versicolor" "virginica"
x <- droplevels(x)
levels(x$Species)
# [1] "setosa"
This is how to keep levels you've filtered out from showing up in ggplot legends.
Internally, factors are integers with a levels attribute that is a character vector (see attributes(iris$Species) and class(attributes(iris$Species)$levels)), which is clean. If you had to change a level name (and you were using character strings), this would be a much less efficient operation. And I change level names a lot, especially for ggplot legends. If you fake factors with character vectors, there's the risk that you'll change just one element and accidentally create a separate new level.
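As a small sketch of that last point (the replacement label "Setosa" is just an example), renaming a level on a factor is a single assignment, while a partial edit of a character vector quietly introduces an extra category:

```r
s <- iris$Species

# Renaming a level changes one string in the levels attribute;
# the 150 integer codes stay as they are.
levels(s)[levels(s) == "setosa"] <- "Setosa"
levels(s)
table(s)

# With a character vector, changing only one element by mistake
# leaves both spellings in the data.
ch <- as.character(iris$Species)
ch[1] <- "Setosa"
length(unique(ch))   # 4 distinct values instead of 3
```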
R: when to use stringsAsFactors
There was one major point you omitted from your discussion, which all looks completely accurate and correct to me. One reason why factors were created is that they enable a potentially massive reduction in the amount of storage space required for a variable. Consider, for example, a column of a data frame with very low cardinality (uniqueness of values). When storing this information as character data, it would require whatever storage is needed to store each string, across the entire column. However, with factors, the storage requirement is massively reduced. As a factor, R only needs to store the actual string values once, and can then represent all the values using numerical levels. And, in the case of a column with low cardinality, only a few strings would actually need to be stored, as compared to the large size of the column.
Given that R is mainly an in-memory tool, and memory is precious, factors represent a good opportunity to optimize any R script. Speaking purely from the point of view of storage/memory, leaving stringsAsFactors set to TRUE would make good sense. Obviously, if you have an API or need which requires otherwise, then you need to make a decision.
How to choose between column types for efficiency in R?
Under the hood, an R vector object is actually a symbol bound to a pointer (a SEXP). The SEXP points to the actual data-containing structure. The data we see in R as numeric vectors are stored as REALSXP objects. These contain header flags, some pointers (e.g. to attributes), a couple of integers giving information about the length of the vector, and finally the actual numbers: an array of double-precision floating point numbers.
For character vectors, the data have to be stored in a slightly more complicated way. The SEXP points to a STRSXP, which again has header flags, some pointers and a couple of numbers to describe the length of the vector, but what then follows is not an array of characters, but an array of pointers to character strings (more precisely, an array of SEXPs pointing to CHARSXPs). A CHARSXP itself contains flags, pointers and length information, then an array of characters representing a string. Even for short strings, a CHARSXP will take up a minimum of about 56 bytes on a 64-bit system.
The CHARSXP objects are re-used, so if you have a vector of 1 million strings each saying "code1", the array of pointers in the STRSXP should all point to the same CHARSXP. There is therefore only a very small memory overhead of approximately 56 bytes between a one-million length vector of 1s and a one-million length vector of "1"s.
a <- rep(1, 1e6)
object.size(a)
#> 8000048 bytes
b <- rep("1", 1e6)
object.size(b)
#> 8000104 bytes
This is not the case when you have many different strings, since each different string will require its own CHARSXP. For example, if we have 26 different strings within our 1-million long vector rather than just a single string, we will take up an extra 56 * (26 - 1) = 1400 bytes of memory:
c <- rep(letters, length.out = 1e6)
object.size(c)
#> 8001504 bytes
So the short answer to your question is that as long as the number of unique elements is small, there is little difference in the size of the underlying memory usage. However, a character vector will always require more memory than a numeric vector - even if the difference is very small.
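For a side-by-side view, the sketch below stores the same 26-value column as integer, double, character and factor and compares the sizes. The variable names are illustrative, and object.size() is an estimate whose exact figures depend on platform and R version.

```r
n <- 1e6
codes <- rep(1:26, length.out = n)   # integer codes 1..26

as_int <- codes                      # integer: 4 bytes per element
as_dbl <- as.numeric(codes)          # double: 8 bytes per element
as_chr <- letters[codes]             # character: one pointer per element
as_fct <- factor(as_chr)             # factor: integer codes + levels

# Collect the estimated sizes in bytes for comparison.
sizes <- sapply(
  list(integer = as_int, double = as_dbl, character = as_chr, factor = as_fct),
  function(v) as.numeric(object.size(v))
)
sizes
```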