Why does as.factor return a character when used inside apply?
apply converts your data.frame to a character matrix. Use lapply instead:
lapply(a, class)
# $x1
# [1] "numeric"
# $x2
# [1] "factor"
# $x3
# [1] "factor"
In your second command, apply coerces the result to a character matrix. Using lapply instead:
a2 <- lapply(a, as.factor)
lapply(a2, class)
# $x1
# [1] "factor"
# $x2
# [1] "factor"
# $x3
# [1] "factor"
But for a quick look at the column types you could use str:
str(a)
# 'data.frame': 100 obs. of 3 variables:
# $ x1: num -1.79 -1.091 1.307 1.142 -0.972 ...
# $ x2: Factor w/ 2 levels "a","b": 2 1 1 1 2 1 1 1 1 2 ...
# $ x3: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
Additional explanation according to comments:
Why does the lapply work while apply doesn't?
The first thing that apply does is convert its argument to a matrix. So apply(a, ...) is equivalent to apply(as.matrix(a), ...). As you can see, str(as.matrix(a)) gives you:
chr [1:100, 1:3] " 0.075124364" "-1.608618269" "-1.487629526" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:3] "x1" "x2" "x3"
There are no more factors, so class returns "character" for all columns. lapply works on the columns themselves, so it gives you what you want (it does something like class(a$column_name) for each column).
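The difference can be reproduced on a small made-up data frame. The sketch below assumes a hypothetical `a_demo` shaped like the `a` in the question (one numeric column, two factor columns); it is only an illustration, not the asker's data.

```r
# Hypothetical stand-in for the data frame `a` from the question
set.seed(1)
a_demo <- data.frame(
  x1 = rnorm(10),
  x2 = factor(sample(c("a", "b"), 10, replace = TRUE), levels = c("a", "b")),
  x3 = factor(rep("a", 10), levels = c("a", "b"))
)

# apply() calls as.matrix() first, so FUN only ever sees character vectors
by_apply <- apply(a_demo, 2, class)

# sapply()/lapply() iterate over the original columns
by_sapply <- sapply(a_demo, class)

by_apply
by_sapply
```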
The help page for apply explains why apply and as.factor don't work together:
In all cases the result is coerced by
as.vector to one of the basic vector
types before the dimensions are set,
so that (for example) factor results
will be coerced to a character array.
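A minimal sketch of that coercion, using a small hypothetical data frame `df_demo` (not from the question): even though as.factor returns a factor for every column, apply hands back a plain character matrix.

```r
# Hypothetical two-column data frame: one numeric column, one factor column
df_demo <- data.frame(
  x1 = c(1.5, 2.5, 3.5),
  x2 = c("a", "b", "a"),
  stringsAsFactors = TRUE
)

# Each call to as.factor() returns a factor, but apply() flattens the
# results into an array, which drops the factor class
m <- apply(df_demo, 2, as.factor)

is.matrix(m)   # the result is a matrix
typeof(m)      # its storage type is character, not a factor
```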
Why sapply and as.factor don't work together is explained in the help page for sapply:
Value (...) An atomic vector or matrix
or list of the same length as X (...)
If simplification occurs, the output
type is determined from the highest
type of the return values in the
hierarchy NULL < raw < logical <
integer < real < complex < character <
list < expression, after coercion of
pairlists to lists.
So you never get a matrix of factors or a data.frame.
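As a rough illustration (with a hypothetical `df_demo`), sapply simplifies the per-column factors into a matrix, while lapply keeps a list whose elements are still factors:

```r
# Hypothetical data frame with two character columns
df_demo <- data.frame(
  g = c("a", "b", "a"),
  h = c("x", "y", "y"),
  stringsAsFactors = FALSE
)

# sapply() simplifies equal-length results into a matrix, losing the factors
s <- sapply(df_demo, as.factor)

# lapply() does no simplification, so every element stays a factor
l <- lapply(df_demo, as.factor)

is.matrix(s)
vapply(l, is.factor, logical(1))
```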
How to convert the output to a data.frame?
Simple, use as.data.frame as you wrote in your comment:
a2 <- as.data.frame(lapply(a, as.factor))
str(a2)
'data.frame': 100 obs. of 3 variables:
$ x1: Factor w/ 100 levels "-2.49629293159922",..: 60 6 7 63 45 93 56 98 40 61 ...
$ x2: Factor w/ 2 levels "a","b": 1 1 2 2 2 2 2 1 2 2 ...
$ x3: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
But if you want to replace only selected character columns with factors, there is a trick:
a3 <- data.frame(x1=letters, x2=LETTERS, x3=LETTERS, stringsAsFactors=FALSE)
str(a3)
'data.frame': 26 obs. of 3 variables:
$ x1: chr "a" "b" "c" "d" ...
$ x2: chr "A" "B" "C" "D" ...
$ x3: chr "A" "B" "C" "D" ...
columns_to_change <- c("x1","x2")
a3[, columns_to_change] <- lapply(a3[, columns_to_change], as.factor)
str(a3)
'data.frame': 26 obs. of 3 variables:
$ x1: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x2: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x3: chr "A" "B" "C" "D" ...
You could use it to replace all columns using:
a3 <- data.frame(x1=letters, x2=LETTERS, x3=LETTERS, stringsAsFactors=FALSE)
a3[, ] <- lapply(a3, as.factor)
str(a3)
'data.frame': 26 obs. of 3 variables:
$ x1: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x2: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x3: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
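The same assignment trick can also pick the columns by type instead of by name. This is only a sketch on a made-up data frame `a4`; the selector `is_chr` is a name introduced here for illustration.

```r
# Hypothetical data frame with mixed column types
a4 <- data.frame(
  x1 = letters,
  x2 = 1:26,
  x3 = LETTERS,
  stringsAsFactors = FALSE
)

# Logical selector: TRUE for character columns only
is_chr <- vapply(a4, is.character, logical(1))

# Convert just those columns; the integer column is left untouched
a4[is_chr] <- lapply(a4[is_chr], as.factor)

vapply(a4, class, character(1))
```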
Why is.factor() used in apply() and sapply() returns different values?
From the reference of apply:
Returns a vector or array or list of values obtained by applying a
function to margins of an array or matrix.
Therefore, it first converts your input object to a matrix (array), all of whose elements must share a single atomic data type. This means that your data get coerced to character, because factor is not an atomic vector type.
> as.matrix(X)
X1 X2
[1,] "1" "f1"
[2,] "2" "f2"
[3,] "3" "f3"
[4,] "4" "f4"
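A short sketch of the two calls side by side, assuming a data frame `X` built to match the matrix printed above (the construction itself is a guess, since the question's data is not shown here):

```r
# Assumed reconstruction of X: an integer column and a factor column
X <- data.frame(X1 = 1:4, X2 = factor(paste0("f", 1:4)))

# apply() tests the columns of the character matrix: never a factor
via_apply <- apply(X, 2, is.factor)

# sapply() tests the actual data frame columns
via_sapply <- sapply(X, is.factor)

via_apply
via_sapply
```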
Why `as.factor` does not work when applied via `apply` function in R?
I think this is because of how apply simplifies the result to return a matrix. From ?apply:
If ‘X’ is not an array but an object of a class with a non-null
‘dim’ value (such as a data frame), ‘apply’ attempts to coerce it
to an array via ‘as.matrix’ if it is two-dimensional (e.g., a data
frame) or via ‘as.array’.
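In fact your original data frame is already as you wish. Try str(df) or sapply(df, is.factor) to verify it. In R versions before 4.0.0, data.frame() coerces character vectors to factors unless stringsAsFactors=FALSE; since R 4.0.0 the default is stringsAsFactors=FALSE, so character columns stay character unless you ask for factors.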
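Because the default of stringsAsFactors depends on the R version (it became FALSE in R 4.0.0), setting it explicitly avoids surprises. A minimal sketch with made-up data:

```r
# Same input, explicit choice of stringsAsFactors
df_chr <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)
df_fac <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)

sapply(df_chr, is.factor)  # column stays character
sapply(df_fac, is.factor)  # column is converted to a factor
```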
lapply(x, as.factor) returning just one level
You want
df[] <- lapply(df, factor)
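The empty brackets matter here: df[] <- ... replaces the columns in place and keeps df a data frame, whereas df <- lapply(...) would leave a plain list. A small sketch on a hypothetical data frame `df_demo`:

```r
# Hypothetical data frame of character columns
df_demo <- data.frame(
  g = c("a", "b", "a"),
  h = c("x", "y", "y"),
  stringsAsFactors = FALSE
)

# Without []: the result is a list, not a data frame
as_list <- lapply(df_demo, factor)

# With []: columns are replaced, the data frame structure is kept
df_demo[] <- lapply(df_demo, factor)

class(as_list)
str(df_demo)
```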
Why use as.factor() instead of just factor()
as.factor is a wrapper for factor, but it allows a quick return if the input vector is already a factor:
function (x)
{
    if (is.factor(x))
        x
    else if (!is.object(x) && is.integer(x)) {
        levels <- sort(unique.default(x))
        f <- match(x, levels)
        levels(f) <- as.character(levels)
        if (!is.null(nx <- names(x)))
            names(f) <- nx
        class(f) <- "factor"
        f
    }
    else factor(x)
}
Comment from Frank: it's not a mere wrapper, since this "quick return" will leave factor levels as they are while factor() will not:
f = factor("a", levels = c("a", "b"))
#[1] a
#Levels: a b
factor(f)
#[1] a
#Levels: a
as.factor(f)
#[1] a
#Levels: a b
Expanded answer two years later, including the following:
- What does the manual say?
- Performance: as.factor > factor when input is a factor
- Performance: as.factor > factor when input is integer
- Unused levels or NA levels
- Caution when using R's group-by functions: watch for unused or NA levels
What does the manual say?
The documentation for ?factor mentions the following:
‘factor(x, exclude = NULL)’ applied to a factor without ‘NA’s is a
no-operation unless there are unused levels: in that case, a
factor with the reduced level set is returned.
‘as.factor’ coerces its argument to a factor. It is an
abbreviated (sometimes faster) form of ‘factor’.
Performance: as.factor > factor when input is a factor
The word "no-operation" is a bit ambiguous. Don't take it as "doing nothing"; in fact, it means "doing a lot of things but essentially changing nothing". Here is an example:
set.seed(0)
## a randomized long factor with 1e+6 levels, each repeated 10 times
f <- sample(gl(1e+6, 10))
system.time(f1 <- factor(f)) ## default: exclude = NA
# user system elapsed
# 7.640 0.216 7.887
system.time(f2 <- factor(f, exclude = NULL))
# user system elapsed
# 7.764 0.028 7.791
system.time(f3 <- as.factor(f))
# user system elapsed
# 0 0 0
identical(f, f1)
#[1] TRUE
identical(f, f2)
#[1] TRUE
identical(f, f3)
#[1] TRUE
as.factor does give a quick return, but factor is not a real "no-op". Let's profile factor to see what it has done.
Rprof("factor.out")
f1 <- factor(f)
Rprof(NULL)
summaryRprof("factor.out")[c(1, 4)]
#$by.self
# self.time self.pct total.time total.pct
#"factor" 4.70 58.90 7.98 100.00
#"unique.default" 1.30 16.29 4.42 55.39
#"as.character" 1.18 14.79 1.84 23.06
#"as.character.factor" 0.66 8.27 0.66 8.27
#"order" 0.08 1.00 0.08 1.00
#"unique" 0.06 0.75 4.54 56.89
#
#$sampling.time
#[1] 7.98
It first sorts the unique values of the input vector f, then converts f to a character vector, and finally coerces the character vector back to a factor. Here is the source code of factor for confirmation.
function (x = character(), levels, labels = levels, exclude = NA,
    ordered = is.ordered(x), nmax = NA)
{
    if (is.null(x))
        x <- character()
    nx <- names(x)
    if (missing(levels)) {
        y <- unique(x, nmax = nmax)
        ind <- sort.list(y)
        levels <- unique(as.character(y)[ind])
    }
    force(ordered)
    if (!is.character(x))
        x <- as.character(x)
    levels <- levels[is.na(match(levels, exclude))]
    f <- match(x, levels)
    if (!is.null(nx))
        names(f) <- nx
    nl <- length(labels)
    nL <- length(levels)
    if (!any(nl == c(1L, nL)))
        stop(gettextf("invalid 'labels'; length %d should be 1 or %d",
            nl, nL), domain = NA)
    levels(f) <- if (nl == nL)
        as.character(labels)
    else paste0(labels, seq_along(levels))
    class(f) <- c(if (ordered) "ordered", "factor")
    f
}
So the function factor is really designed to work with a character vector, and it applies as.character to its input to ensure that. We can learn at least two performance-related lessons from the above:
- For a data frame DF, lapply(DF, as.factor) is much faster than lapply(DF, factor) for type conversion, if many columns are already factors.
- That the function factor is slow can explain why some important R functions are slow, say table: R: table function suprisingly slow
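The first point can be tried on a made-up data frame whose columns are already factors. This is only a sketch: `DF_demo` is hypothetical, and the timings depend on the machine, so no numbers are shown.

```r
# Hypothetical data frame whose columns are already factors
DF_demo <- data.frame(g1 = gl(1000, 100), g2 = gl(100, 1000))

# as.factor() returns each column immediately ...
t_as <- system.time(r_as <- lapply(DF_demo, as.factor))

# ... while factor() rebuilds each factor via as.character() and match()
t_f <- system.time(r_f <- lapply(DF_demo, factor))

# With no unused or NA levels, both give the same result
identical(r_as, r_f)

# Compare the elapsed times on your own machine
t_as["elapsed"]
t_f["elapsed"]
```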
Performance: as.factor > factor when input is integer
A factor variable is the next of kin of an integer variable.
unclass(gl(2, 2, labels = letters[1:2]))
#[1] 1 1 2 2
#attr(,"levels")
#[1] "a" "b"
storage.mode(gl(2, 2, labels = letters[1:2]))
#[1] "integer"
This means that converting an integer to a factor is easier than converting a numeric / character to a factor. as.factor takes care of exactly this case.
x <- sample.int(1e+6, 1e+7, TRUE)
system.time(as.factor(x))
# user system elapsed
# 4.592 0.252 4.845
system.time(factor(x))
# user system elapsed
# 22.236 0.264 22.659
Unused levels or NA levels
Now let's see a few examples of the influence of factor and as.factor on factor levels (if the input is a factor already). Frank has given one with an unused factor level; I will provide one with an NA level.
f <- factor(c(1, NA), exclude = NULL)
#[1] 1 <NA>
#Levels: 1 <NA>
as.factor(f)
#[1] 1 <NA>
#Levels: 1 <NA>
factor(f, exclude = NULL)
#[1] 1 <NA>
#Levels: 1 <NA>
factor(f)
#[1] 1 <NA>
#Levels: 1
There is a (generic) function droplevels that can be used to drop unused levels of a factor. But NA levels cannot be dropped by default.
## "factor" method of `droplevels`
droplevels.factor
#function (x, exclude = if (anyNA(levels(x))) NULL else NA, ...)
#factor(x, exclude = exclude)
droplevels(f)
#[1] 1 <NA>
#Levels: 1 <NA>
droplevels(f, exclude = NA)
#[1] 1 <NA>
#Levels: 1
Caution when using R's group-by functions: watch for unused or NA levels
R functions doing group-by operations, like split and tapply, expect us to provide factor variables as "by" variables. But often we just provide character or numeric variables. So internally, these functions need to convert them into factors, and probably most of them use as.factor in the first place (at least this is so for split.default and tapply). The table function looks like an exception: I spot factor instead of as.factor inside. There might be some special consideration which is unfortunately not obvious to me when I inspect its source code.
Since most group-by R functions use as.factor, if they are given a factor with unused or NA levels, such groups will appear in the result.
x <- c(1, 2)
f <- factor(letters[1:2], levels = letters[1:3])
split(x, f)
#$a
#[1] 1
#
#$b
#[1] 2
#
#$c
#numeric(0)
tapply(x, f, FUN = mean)
# a b c
# 1 2 NA
Interestingly, although table does not rely on as.factor, it preserves those unused levels, too:
table(f)
#a b c
#1 1 0
Sometimes this kind of behavior can be undesired. A classic example is barplot(table(f)), which leaves an empty slot for the unused level.
If this is really undesired, we need to manually remove unused or NA levels from our factor variable, using droplevels or factor.
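A minimal sketch of that cleanup, reusing the same x and f as in the example above; the name `f_used` is introduced here only for illustration.

```r
x <- c(1, 2)
f <- factor(letters[1:2], levels = letters[1:3])

# Drop the unused level "c" before grouping
f_used <- droplevels(f)
levels(f_used)

# The empty group no longer shows up
table(f_used)
split(x, f_used)
tapply(x, f_used, FUN = mean)

# factor() on an existing factor drops unused levels as well
levels(factor(f))
```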
Hint:
- split has an argument drop which defaults to FALSE, hence as.factor is used; with drop = TRUE the function factor is used instead.
- aggregate relies on split, so it also has a drop argument, and it defaults to TRUE.
- tapply does not have drop although it also relies on split. In particular the documentation ?tapply says that as.factor is (always) used.
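The effect of these defaults can be seen with the same x and f as before. This is a sketch; the result names `s_keep`, `s_drop`, `agg` and `tap` are made up for the example.

```r
x <- c(1, 2)
f <- factor(letters[1:2], levels = letters[1:3])

# split(): unused level kept by default, removed with drop = TRUE
s_keep <- split(x, f)
s_drop <- split(x, f, drop = TRUE)
names(s_keep)
names(s_drop)

# aggregate(): the unused level produces no row
agg <- aggregate(x, by = list(f = f), FUN = mean)
agg

# tapply(): the unused level stays in the result as NA
tap <- tapply(x, f, FUN = mean)
tap
```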
Why gsub automatically changes a Factor into Character
Yes, gsub performs as.character. If you type gsub in the console you can see the function:
function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
    fixed = FALSE, useBytes = FALSE)
{
    if (!is.character(x))
        x <- as.character(x)
    .Internal(gsub(as.character(pattern), as.character(replacement),
        x, ignore.case, perl, fixed, useBytes))
}
And no, it will not convert to integer directly, as it always returns a character vector. From ?gsub:
sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character).
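If the goal is to keep a factor, one common workaround (not part of the answer above, so treat it as a suggestion) is to apply gsub to the levels rather than to the factor itself. A small sketch with a made-up factor:

```r
# Hypothetical factor whose labels contain underscores
f <- factor(c("a_1", "b_2", "a_1"))

# gsub() on the factor itself returns a character vector
g <- gsub("_", "", f)
class(g)

# Rewriting the levels keeps the factor and its integer codes
levels(f) <- gsub("_", "", levels(f))
class(f)
levels(f)
```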