Does converting character columns to factors save memory?
Converting to factor won't save space on the strings themselves, because R stores each distinct string only once in a global hash table, and a character vector just holds pointers into it. See section 1.10, "The CHARSXP cache", of R Internals.
Converting to factor may improve processing time if your code would have to convert to factor anyway (running a regression, classification, etc.), but not if you're doing string manipulation, because the factor would then have to be converted back to character. So it really depends on what you're doing.
Are factors stored more efficiently in data.table than characters?
You may be remembering data.table FAQ 2.17, which says:
stringsAsFactors is by default TRUE in data.frame but FALSE in data.table, for efficiency. Since a global string cache was added to R, characters items are a pointer to the single cached string and there is no longer a performance benefit of converting to factor.
(That part was added to the FAQ in v1.8.2 in July 2012.)
Using character rather than factor helps a lot in tasks like stacking (rbindlist). This is because c() of two character vectors is just the concatenation, whereas c() of two factor columns needs to traverse and union the two sets of factor levels, which is harder to code and takes longer to execute.
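To illustrate the extra work, here is a minimal base R sketch of stacking two factors by hand. It is not data.table's actual rbindlist implementation, and the names a, b, chr, all_levels, and stacked are made up for illustration.

```r
a <- factor(c("x", "y"))
b <- factor(c("y", "z"))

# Character vectors: stacking is plain concatenation.
chr <- c(as.character(a), as.character(b))

# Factors: the two level sets must be unioned first, and every
# element re-coded against the combined levels.
all_levels <- union(levels(a), levels(b))
stacked <- factor(chr, levels = all_levels)

levels(stacked)      # "x" "y" "z"
as.integer(stacked)  # 1 2 2 3
```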
What you've noticed is a difference in RAM consumption on 64-bit machines. Factors are stored as an integer vector of lookups into the levels. Type integer is 32-bit, even on 64-bit platforms. But pointers (which is what a character vector holds) are 64-bit on 64-bit machines. So a character column will use twice as much RAM as a factor column on a 64-bit machine; there is no difference on 32-bit. However, this cost will usually be outweighed by the simpler and faster instructions possible on a character vector. [Aside: since factors are integer, they can't contain more than 2 billion unique strings. character columns don't have that limitation.]
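The per-element sizes can be checked directly. This is a rough sketch using base R's object.size() and .Machine$sizeof.pointer; the variable names are illustrative, and exact byte counts include a small header that may vary by R version.

```r
n <- 1e6L
ptr_bytes <- .Machine$sizeof.pointer   # 8 on a 64-bit build, 4 on a 32-bit build

# An integer vector costs about 4 bytes per element; a character
# vector costs about one pointer per element (the string "" is shared).
int_size <- as.numeric(object.size(integer(n)))
chr_size <- as.numeric(object.size(character(n)))

ratio <- chr_size / int_size           # close to ptr_bytes / 4
ratio
```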
It depends on what you're doing, but operations have been optimized for character in data.table, so that's what we advise. Basically it saves a hop (to levels), and we can compare two character columns in different tables just by comparing the pointer values without hopping at all, even to the global cache.
It depends on the cardinality of the column, too. Say the column is 1 million rows and contains 1 million unique strings. Storing it as a factor will need a character vector of 1 million levels plus a 1 million integer vector indexing into those levels. That's (4+8)*1e6 bytes. A character vector, on the other hand, won't need the levels, so it's just 8*1e6 bytes. In both cases the global cache stores the 1 million unique strings in the same way, so that cost is incurred anyway. In this case, the character column will use less RAM than if it were a factor. Be careful to check that the memory tool you use to measure RAM usage accounts for this correctly.
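The high-cardinality case can be sketched in base R as follows. This assumes a 64-bit build, uses a smaller n than the 1 million in the text to keep it quick, and the names x and f are made up. Note that object.size() also counts the cached strings themselves, in both objects.

```r
n <- 1e5L                             # smaller than 1 million to keep it quick
x <- sprintf("id%06d", seq_len(n))    # n unique strings
f <- factor(x)                        # every element becomes its own level

# Back-of-envelope figures from the text, excluding the cached strings
factor_bytes <- (4 + 8) * n           # integer codes + pointers in the levels vector
char_bytes   <- 8 * n                 # pointers only

object.size(x)
object.size(f)                        # larger: codes plus a full-length levels vector
```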
Efficiency of factor vs. characters - object size
Roughly speaking, a factor is an integer vector with a levels attribute (a character vector) listing the category names and a class attribute (another character vector) telling R that it's a factor.
A short factor tends to require more memory than a character vector of the same length, because the cost of storing the factor's attributes more than offsets the saving due to storing integers instead of strings. Here is an extreme example illustrating this point:
x <- c("a", "b")
f <- factor(x)
class(f)
# [1] "factor"
unclass(f)
# [1] 1 2
# attr(,"levels")
# [1] "a" "b"
Storing f requires storing both the integer vector c(1L, 2L) and the character vector c("a", "b"). In this case, the integer vector is completely redundant, because c("a", "b") encodes all of the information we needed in the first place.
object.size(f)
# 568 bytes
object.size(x)
# 176 bytes
It becomes more efficient to store factors when the levels have a large number of repetitions.
g <- gl(2L, 1e06L, labels = c("a", "b"))
y <- as.character(g)
object.size(g)
# 8000560 bytes
object.size(y)
# 16000160 bytes
Some things to keep in mind:
- Many R functions that handle categorical variables (table, split, etc.) convert character vector arguments to factors before doing anything else with them. Thus, actually doing stuff with a categorical variable almost always involves committing a factor to memory anyway.
- Factors clearly communicate to users that the variable is categorical and not merely a sequence of strings.
So, there are many good reasons to prefer factors, even if they are short.
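The implicit conversion mentioned in the first point can be observed from the outside. The sketch below, using a made-up vector v, checks that the groups produced by table() and split() on a character vector match the levels that as.factor() would create.

```r
v <- c("b", "a", "b", "c", "a")

lv <- levels(as.factor(v))              # "a" "b" "c"

tab_names   <- names(table(v))          # group labels used by table()
split_names <- names(split(seq_along(v), v))  # group labels used by split()

identical(tab_names, lv)
identical(split_names, lv)
```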