Why "Character Is Often Preferred to Factor" in Data.Table for Key

why character is often preferred to factor in data.table for key?

Isn't factor just integer which should be easier to do counting sort
than character?

Yes, if you're given a factor already. But the time to create that factor can be significant and that's what setkey (and ad hoc by) aim to beat. Try timing factor() on a randomly ordered character vector, say 1e6 long with 1e4 levels. Then compare to setkey or ad hoc by on the original randomly ordered character vector.

agstudy's comment is correct too; i.e., character vectors (being pointers to R cached strings) are quite similar to factors anyway. On 32bit systems character vectors are the same size as the factor's integer vector but the factor has the levels attribute to store (and sometimes copy) too. On 64bit systems the pointers are twice as big. But on the other hand R's string cache can be looked up directly from character vector pointers, whereas the factor has an extra hop via levels. (The levels attribute is a character vector of R string cache pointers too.)

Factors in RSQLite

It looks pretty simple to me: factor is a concept only S and R know. Full stop.

So to get them into a DB and back, you need to write mappers. Either be simplistic and do everything as.character (and assume most DB backend will hash strings just as R does). Or be DB-centric and split the factor into just the (unsigned) (and possibly short) integers, and the labels.

Factors in R: more than an annoyance?

You should use factors. Yes they can be a pain, but my theory is that 90% of why they're a pain is because in read.table and read.csv, the argument stringsAsFactors = TRUE by default (and most users miss this subtlety). I say they are useful because model fitting packages like lme4 use factors and ordered factors to differentially fit models and determine the type of contrasts to use. And graphing packages also use them to group by. ggplot and most model fitting functions coerce character vectors to factors, so the result is the same. However, you end up with warnings in your code:

lm(Petal.Length ~ -1 + Species, data=iris)

# Call:
# lm(formula = Petal.Length ~ -1 + Species, data = iris)

# Coefficients:
# Speciessetosa Speciesversicolor Speciesvirginica
# 1.462 4.260 5.552

iris.alt <- iris
iris.alt$Species <- as.character(iris.alt$Species)
lm(Petal.Length ~ -1 + Species, data=iris.alt)

# Call:
# lm(formula = Petal.Length ~ -1 + Species, data = iris.alt)

# Coefficients:
# Speciessetosa Speciesversicolor Speciesvirginica
# 1.462 4.260 5.552

Warning message: In model.matrix.default(mt, mf, contrasts) :

variable Species converted to a factor

One tricky thing is the whole drop=TRUE bit. In vectors this works well to remove levels of factors that aren't in the data. For example:

s <- iris$Species
s[s == 'setosa', drop=TRUE]
# [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa
s[s == 'setosa', drop=FALSE]
# [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica

However, with data.frames, the behavior of [.data.frame() is different: see this email or ?"[.data.frame". Using drop=TRUE on data.frames does not work as you'd imagine:

x <- subset(iris, Species == 'setosa', drop=TRUE)  # susbetting with [ behaves the same way
x$Species
# [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica

Luckily you can drop factors easily with droplevels() to drop unused factor levels for an individual factor or for every factor in a data.frame (since R 2.12):

x <- subset(iris, Species == 'setosa')
levels(x$Species)
# [1] "setosa" "versicolor" "virginica"
x <- droplevels(x)
levels(x$Species)
# [1] "setosa"

This is how to keep levels you've selected out from getting in ggplot legends.

Internally, factors are integers with an attribute level character vector (see attributes(iris$Species) and class(attributes(iris$Species)$levels)), which is clean. If you had to change a level name (and you were using character strings), this would be a much less efficient operation. And I change level names a lot, especially for ggplot legends. If you fake factors with character vectors, there's the risk that you'll change just one element, and accidentally create a separate new level.

data.table::fread's stringsAsFactors=TRUE argument doesn't convert character columns to factor type - what's the workaround?

Just implemented stringsAsFactors argument for fread in v 1.9.6+

From NEWS:


  1. Implemented stringsAsFactors argument for fread(). When TRUE, character columns are converted to factors. Default is FALSE. Thanks to Artem Klevtsov for filing #501, and to @hmi2015 for this SO post.

tidy: create key without rowwise()?

Here is a vectorized option using pmin/pmap. Take the min/max for each row of columns 'grp1', 'grp3' with pmin/pmax and concatenate together (str_c)

library(dplyr)
library(stringr)
df %>%
mutate(key = str_c(pmin(grp1, grp3), pmax(grp1, grp3)))
# A tibble: 5 x 5
# grp1 grp2 grp3 value key
# <chr> <chr> <chr> <dbl> <chr>
#1 E k A 24.7 AE
#2 D l B 5.66 BD
#3 C m C 16.3 CC
#4 B n D 5.88 BD
#5 A o E -9.22 AE

data

df <- tibble(grp1=rev(LETTERS[1:5]),grp2=letters[11:15],grp3=LETTERS[1:5],
value=rnorm(5,10,10))

NOTE: cbind converts to matrix and matrix can hold only a single class. By converting to tibble with as_tibble doesn't change the class automatically. Instead, use tibble/data.frame directly instead of cbind route



Related Topics



Leave a reply



Submit