Why is character often preferred to factor in data.table for key?
Isn't a factor just an integer vector, which should be easier to counting-sort than character?
Yes, if you're given a factor already. But the time to create that factor can be significant, and that's what setkey (and ad hoc by) aim to beat. Try timing factor() on a randomly ordered character vector, say 1e6 long with 1e4 levels. Then compare to setkey or ad hoc by on the original randomly ordered character vector.
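A rough way to run the comparison suggested above, assuming the data.table package is installed. The sizes follow the answer; the variable names are only illustrative. Timings depend on machine and package version, so none are shown here.

```r
library(data.table)

set.seed(1)
# 1e4 distinct strings, sampled into a randomly ordered vector of length 1e6
lv <- sprintf("id%05d", 1:1e4)
x  <- sample(lv, 1e6, replace = TRUE)

# Cost of building the factor from the character vector
t_factor <- system.time(f <- factor(x))

# Cost of keying directly on the character column
dt <- data.table(x = x)
t_setkey <- system.time(setkey(dt, x))

# Cost of an ad hoc by on the original, unkeyed character column
dt2 <- data.table(x = x)
t_by <- system.time(counts <- dt2[, .N, by = x])

# Elapsed seconds for each approach
print(c(factor = t_factor[["elapsed"]],
        setkey = t_setkey[["elapsed"]],
        by     = t_by[["elapsed"]]))
```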
agstudy's comment is correct too; i.e., character vectors (being pointers to R cached strings) are quite similar to factors anyway. On 32bit systems character vectors are the same size as the factor's integer vector but the factor has the levels attribute to store (and sometimes copy) too. On 64bit systems the pointers are twice as big. But on the other hand R's string cache can be looked up directly from character vector pointers, whereas the factor has an extra hop via levels. (The levels attribute is a character vector of R string cache pointers too.)
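The "extra hop" can be seen from base R alone. This is a small sketch with made-up data: it shows that a factor stores integer codes and that getting the strings back means indexing the levels attribute.

```r
x <- c("b", "a", "b", "c")
f <- factor(x)

# A factor is an integer vector of codes plus a character 'levels' attribute
typeof(f)
levels(f)
as.integer(f)

# Recovering the strings means indexing levels by the codes (the extra hop);
# a character vector holds its strings directly
via_levels <- levels(f)[as.integer(f)]
identical(via_levels, x)
```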
Factors in RSQLite
It looks pretty simple to me: factor is a concept only S and R know. Full stop.
So to get them into a DB and back, you need to write mappers. Either be simplistic and store everything with as.character (and assume most DB backends will hash strings just as R does), or be DB-centric and split the factor into just the (unsigned, and possibly short) integers and the labels.
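The two mapping strategies can be sketched in base R without touching a database. The data frames below stand in for tables you would write with your DB interface; the table and column names (codes_tbl, labels_tbl, row_id, code, label) are hypothetical.

```r
f <- factor(c("low", "high", "low", "mid"), levels = c("low", "mid", "high"))

# Option 1: simplistic, store the column as plain strings
stored_chr <- as.character(f)
# Reading back needs the level order supplied again, or it is lost
back_chr <- factor(stored_chr, levels = c("low", "mid", "high"))

# Option 2: DB-centric, store integer codes plus a separate lookup table
codes_tbl  <- data.frame(row_id = seq_along(f), code = as.integer(f))
labels_tbl <- data.frame(code = seq_along(levels(f)), label = levels(f),
                         stringsAsFactors = FALSE)

# Reading back: rebuild the factor from the codes and the labels
back_codes <- factor(codes_tbl$code,
                     levels = labels_tbl$code,
                     labels = labels_tbl$label)

identical(back_chr, f)
identical(back_codes, f)
```

The second option keeps the level order in the lookup table itself, which the plain-string option cannot do.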
Factors in R: more than an annoyance?
You should use factors. Yes, they can be a pain, but my theory is that 90% of why they're a pain is because in read.table and read.csv the argument stringsAsFactors was TRUE by default (and most users missed this subtlety); since R 4.0.0 the default is FALSE. I say they are useful because model fitting packages like lme4 use factors and ordered factors to differentially fit models and determine the type of contrasts to use. Graphing packages also use them to group by. ggplot and most model fitting functions coerce character vectors to factors, so the result is the same. However, with character vectors you end up with warnings in your code:
lm(Petal.Length ~ -1 + Species, data=iris)
# Call:
# lm(formula = Petal.Length ~ -1 + Species, data = iris)
# Coefficients:
# Speciessetosa Speciesversicolor Speciesvirginica
# 1.462 4.260 5.552
iris.alt <- iris
iris.alt$Species <- as.character(iris.alt$Species)
lm(Petal.Length ~ -1 + Species, data=iris.alt)
# Call:
# lm(formula = Petal.Length ~ -1 + Species, data = iris.alt)
# Coefficients:
# Speciessetosa Speciesversicolor Speciesvirginica
# 1.462 4.260 5.552
# Warning message:
# In model.matrix.default(mt, mf, contrasts) :
#   variable 'Species' converted to a factor
One tricky thing is the whole drop=TRUE bit. In vectors this works well to remove levels of factors that aren't in the data. For example:
s <- iris$Species
s[s == 'setosa', drop=TRUE]
# [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa
s[s == 'setosa', drop=FALSE]
# [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica
However, with data.frames, the behavior of [.data.frame() is different: see this email or ?"[.data.frame". Using drop=TRUE on data.frames does not work as you'd imagine:
x <- subset(iris, Species == 'setosa', drop=TRUE) # subsetting with [ behaves the same way
x$Species
# [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica
Luckily you can drop unused factor levels easily with droplevels(), either for an individual factor or for every factor in a data.frame (since R 2.12):
x <- subset(iris, Species == 'setosa')
levels(x$Species)
# [1] "setosa" "versicolor" "virginica"
x <- droplevels(x)
levels(x$Species)
# [1] "setosa"
This is how to keep levels you've filtered out from showing up in ggplot legends.
Internally, factors are integers with a levels attribute that holds a character vector (see attributes(iris$Species) and class(attributes(iris$Species)$levels)), which is clean. If you had to change a level name (and you were using character strings), this would be a much less efficient operation. And I change level names a lot, especially for ggplot legends. If you fake factors with character vectors, there's the risk that you'll change just one element and accidentally create a separate new level.
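A small base R sketch of both points, using made-up vectors: renaming a level only touches the levels attribute, while a character vector lets a one-element edit slip through as a new distinct value.

```r
f <- factor(c("setosa", "virginica", "setosa", "versicolor"))

# Renaming a level changes only the levels attribute, not every element
levels(f)[levels(f) == "setosa"] <- "Setosa"
table(f)

# With a character vector, changing just one element...
ch <- c("setosa", "virginica", "setosa", "versicolor")
ch[1] <- "Setosa"
# ...silently leaves an extra distinct value behind
unique(ch)

# A factor refuses values outside its levels: this assigns NA with a warning
g <- factor(c("setosa", "virginica"))
g[1] <- "Setosa"
is.na(g[1])
```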
data.table::fread's stringsAsFactors=TRUE argument doesn't convert character columns to factor type - what's the workaround?
Just implemented the stringsAsFactors argument for fread in v1.9.6+.
From NEWS:
- Implemented stringsAsFactors argument for fread(). When TRUE, character columns are converted to factors. Default is FALSE. Thanks to Artem Klevtsov for filing #501, and to @hmi2015 for this SO post.
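A minimal check of the behavior described in the NEWS entry, assuming data.table v1.9.6 or later is installed. The inline CSV string and column names are made up for illustration.

```r
library(data.table)

# A string containing newlines is read by fread as the data itself
csv <- "id,grp\n1,a\n2,b\n3,a\n"

# Default: character columns stay character
dt_chr <- fread(csv)
class(dt_chr$grp)

# stringsAsFactors = TRUE converts character columns to factor
dt_fac <- fread(csv, stringsAsFactors = TRUE)
class(dt_fac$grp)
levels(dt_fac$grp)
```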
tidy: create key without rowwise()?
Here is a vectorized option using pmin/pmax. Take the min/max for each row of columns 'grp1' and 'grp3' with pmin/pmax and concatenate them together (str_c):
library(dplyr)
library(stringr)
df %>%
mutate(key = str_c(pmin(grp1, grp3), pmax(grp1, grp3)))
# A tibble: 5 x 5
# grp1 grp2 grp3 value key
# <chr> <chr> <chr> <dbl> <chr>
#1 E k A 24.7 AE
#2 D l B 5.66 BD
#3 C m C 16.3 CC
#4 B n D 5.88 BD
#5 A o E -9.22 AE
data
df <- tibble(grp1=rev(LETTERS[1:5]),grp2=letters[11:15],grp3=LETTERS[1:5],
value=rnorm(5,10,10))
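The same idea works in base R with paste0 in place of str_c. This sketch uses the two grouping vectors from the data above and also checks that the key does not depend on column order.

```r
grp1 <- rev(LETTERS[1:5])
grp3 <- LETTERS[1:5]

# pmin/pmax compare element-wise, so each pair is put in a canonical order
key <- paste0(pmin(grp1, grp3), pmax(grp1, grp3))
key

# Swapping the two columns gives the same key
key_swapped <- paste0(pmin(grp3, grp1), pmax(grp3, grp1))
identical(key, key_swapped)
```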
NOTE: cbind converts to a matrix, and a matrix can hold only a single class. Converting that matrix to a tibble with as_tibble doesn't change the column classes back automatically. Instead, build the data with tibble/data.frame directly rather than going the cbind route.
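The coercion is easy to reproduce. The sketch below uses base data.frame rather than tibble to stay dependency-free, and the vectors are made up.

```r
grp   <- c("A", "B", "C")
value <- c(1.5, 2.5, 3.5)

# cbind on atomic vectors builds a matrix, so every cell becomes character
m <- cbind(grp, value)
typeof(m)

# Converting the matrix afterwards keeps 'value' as strings
from_matrix <- as.data.frame(m, stringsAsFactors = FALSE)
class(from_matrix$value)

# Building the data frame directly keeps each column's own type
direct <- data.frame(grp = grp, value = value, stringsAsFactors = FALSE)
class(direct$value)
```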