Why use as.factor() instead of just factor()
as.factor is a wrapper for factor, but it allows quick return if the input vector is already a factor:
function (x)
{
    if (is.factor(x))
        x
    else if (!is.object(x) && is.integer(x)) {
        levels <- sort(unique.default(x))
        f <- match(x, levels)
        levels(f) <- as.character(levels)
        if (!is.null(nx <- names(x)))
            names(f) <- nx
        class(f) <- "factor"
        f
    }
    else factor(x)
}
Comment from Frank: it's not a mere wrapper, since this "quick return" will leave factor levels as they are while factor() will not:
f = factor("a", levels = c("a", "b"))
#[1] a
#Levels: a b
factor(f)
#[1] a
#Levels: a
as.factor(f)
#[1] a
#Levels: a b
Expanded answer two years later, including the following:
- What does the manual say?
- Performance: as.factor > factor when input is a factor
- Performance: as.factor > factor when input is integer
- Unused levels or NA levels
- Caution when using R's group-by functions: watch for unused or NA levels
What does the manual say?
The documentation for ?factor mentions the following:
‘factor(x, exclude = NULL)’ applied to a factor without ‘NA’s is a
no-operation unless there are unused levels: in that case, a
factor with the reduced level set is returned.
‘as.factor’ coerces its argument to a factor. It is an
abbreviated (sometimes faster) form of ‘factor’.
Performance: as.factor > factor when input is a factor
The word "no-operation" is a bit ambiguous. Don't take it as "doing nothing"; in fact, it means "doing a lot of things but essentially changing nothing". Here is an example:
set.seed(0)
## a randomized long factor with 1e+6 levels, each repeated 10 times
f <- sample(gl(1e+6, 10))
system.time(f1 <- factor(f)) ## default: exclude = NA
# user system elapsed
# 7.640 0.216 7.887
system.time(f2 <- factor(f, exclude = NULL))
# user system elapsed
# 7.764 0.028 7.791
system.time(f3 <- as.factor(f))
# user system elapsed
# 0 0 0
identical(f, f1)
#[1] TRUE
identical(f, f2)
#[1] TRUE
identical(f, f3)
#[1] TRUE
as.factor does give a quick return, but factor is not a real "no-op". Let's profile factor to see what it has done.
Rprof("factor.out")
f1 <- factor(f)
Rprof(NULL)
summaryRprof("factor.out")[c(1, 4)]
#$by.self
# self.time self.pct total.time total.pct
#"factor" 4.70 58.90 7.98 100.00
#"unique.default" 1.30 16.29 4.42 55.39
#"as.character" 1.18 14.79 1.84 23.06
#"as.character.factor" 0.66 8.27 0.66 8.27
#"order" 0.08 1.00 0.08 1.00
#"unique" 0.06 0.75 4.54 56.89
#
#$sampling.time
#[1] 7.98
It first sorts the unique values of the input vector f, then converts f to a character vector, and finally uses match to map that character vector back onto the levels, rebuilding the factor. Here is the source code of factor for confirmation.
function (x = character(), levels, labels = levels, exclude = NA,
    ordered = is.ordered(x), nmax = NA)
{
    if (is.null(x))
        x <- character()
    nx <- names(x)
    if (missing(levels)) {
        y <- unique(x, nmax = nmax)
        ind <- sort.list(y)
        levels <- unique(as.character(y)[ind])
    }
    force(ordered)
    if (!is.character(x))
        x <- as.character(x)
    levels <- levels[is.na(match(levels, exclude))]
    f <- match(x, levels)
    if (!is.null(nx))
        names(f) <- nx
    nl <- length(labels)
    nL <- length(levels)
    if (!any(nl == c(1L, nL)))
        stop(gettextf("invalid 'labels'; length %d should be 1 or %d",
            nl, nL), domain = NA)
    levels(f) <- if (nl == nL)
        as.character(labels)
    else paste0(labels, seq_along(levels))
    class(f) <- c(if (ordered) "ordered", "factor")
    f
}
So factor is really designed to work with a character vector, and it applies as.character to its input to ensure that. We can learn at least two performance-related lessons from the above:
- For a data frame DF, lapply(DF, as.factor) is much faster than lapply(DF, factor) for type conversion if many columns are already factors.
- The slowness of factor can explain why some important R functions are slow, for example table: R: table function suprisingly slow
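As a minimal sketch of the first point (the data frame DF and its columns are invented for illustration), lapply(DF, as.factor) hands back columns that are already factors untouched, unused levels included, whereas lapply(DF, factor) rebuilds every column:

```r
# Hypothetical data frame: one factor column with an unused level "c",
# plus one plain character column
DF <- data.frame(
  g = factor(c("a", "b"), levels = c("a", "b", "c")),
  s = c("x", "y"),
  stringsAsFactors = FALSE
)

via_as_factor <- lapply(DF, as.factor)  # quick return for column g
via_factor    <- lapply(DF, factor)     # rebuilds g, dropping the unused level

levels(via_as_factor$g)           # unused level "c" is kept
levels(via_factor$g)              # unused level "c" is gone
identical(via_as_factor$g, DF$g)  # the factor column comes back unchanged
```

So besides speed, the two calls can differ in the levels they leave behind.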
Performance: as.factor > factor when input is integer
A factor variable is the next of kin of an integer variable.
unclass(gl(2, 2, labels = letters[1:2]))
#[1] 1 1 2 2
#attr(,"levels")
#[1] "a" "b"
storage.mode(gl(2, 2, labels = letters[1:2]))
#[1] "integer"
This means that converting an integer to a factor is easier than converting a numeric / character to a factor. as.factor just takes care of this.
x <- sample.int(1e+6, 1e+7, TRUE)
system.time(as.factor(x))
# user system elapsed
# 4.592 0.252 4.845
system.time(factor(x))
# user system elapsed
# 22.236 0.264 22.659
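A smaller check may help here (the vector x_small is made up for illustration): for plain integer input, both routes should produce the same levels and the same integer codes, so the difference above is one of speed rather than result.

```r
# Hypothetical small integer vector
x_small <- c(3L, 1L, 2L, 3L, 1L)

a <- as.factor(x_small)  # takes the integer shortcut inside as.factor
b <- factor(x_small)     # goes through as.character() and match()

identical(levels(a), levels(b))          # same levels
identical(as.integer(a), as.integer(b))  # same underlying codes
```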
Unused levels or NA levels
Now let's look at a few examples of how factor and as.factor affect factor levels (when the input is already a factor). Frank has given one with an unused factor level; I will provide one with an NA level.
f <- factor(c(1, NA), exclude = NULL)
#[1] 1 <NA>
#Levels: 1 <NA>
as.factor(f)
#[1] 1 <NA>
#Levels: 1 <NA>
factor(f, exclude = NULL)
#[1] 1 <NA>
#Levels: 1 <NA>
factor(f)
#[1] 1 <NA>
#Levels: 1
There is a (generic) function droplevels that can be used to drop unused levels of a factor. But NA levels cannot be dropped by default.
## "factor" method of `droplevels`
droplevels.factor
#function (x, exclude = if (anyNA(levels(x))) NULL else NA, ...)
#factor(x, exclude = exclude)
droplevels(f)
#[1] 1 <NA>
#Levels: 1 <NA>
droplevels(f, exclude = NA)
#[1] 1 <NA>
#Levels: 1
Caution when using R's group-by functions: watch for unused or NA levels
R functions that do group-by operations, like split and tapply, expect us to provide factor variables as "by" variables. But often we just provide character or numeric variables. So internally, these functions need to convert them into factors, and probably most of them use as.factor in the first place (at least this is so for split.default and tapply). The table function looks like an exception: I spot factor instead of as.factor inside. There might be some special consideration, which is unfortunately not obvious to me when I inspect its source code.
Since most group-by R functions use as.factor, if they are given a factor with unused or NA levels, such groups will appear in the result.
x <- c(1, 2)
f <- factor(letters[1:2], levels = letters[1:3])
split(x, f)
#$a
#[1] 1
#
#$b
#[1] 2
#
#$c
#numeric(0)
tapply(x, f, FUN = mean)
# a b c
# 1 2 NA
Interestingly, although table does not rely on as.factor, it preserves those unused levels, too:
table(f)
#a b c
#1 1 0
Sometimes this kind of behavior is undesirable. A classic example is barplot(table(f)), which still reserves a slot on the axis for the empty level c.
If this is really undesired, we need to manually remove unused or NA levels from our factor variable, using droplevels or factor.
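A minimal sketch of that clean-up, reusing the factor with an unused level c from the example above (f_clean is a name introduced here for illustration):

```r
f <- factor(letters[1:2], levels = letters[1:3])

# Drop the unused level before handing the factor to group-by functions
f_clean <- droplevels(f)

levels(f_clean)                 # only the levels that actually occur
table(f_clean)                  # no zero-count cell for "c"
names(split(c(1, 2), f_clean))  # no empty group for "c"
```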
Hint:
- split has an argument drop which defaults to FALSE, hence as.factor is used; with drop = TRUE, factor is used instead.
- aggregate relies on split, so it also has a drop argument, and it defaults to TRUE.
- tapply does not have drop although it also relies on split. In particular, the documentation ?tapply says that as.factor is (always) used.
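The effect of drop in split can be sketched with the same small factor as before (a hedged illustration, not an exhaustive test of every group-by function):

```r
x <- c(1, 2)
f <- factor(letters[1:2], levels = letters[1:3])

groups_default <- split(x, f)               # drop = FALSE keeps the empty group
groups_dropped <- split(x, f, drop = TRUE)  # empty group for "c" is removed

names(groups_default)
names(groups_dropped)

# tapply has no drop argument, so the unused level still shows up
means <- tapply(x, f, FUN = mean)
names(means)
```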
Different result in the case of `factor` or `as.factor`?
The problem is as stated in the error message: you are passing arguments that as.factor does not have. If you read ?as.factor you will see that its only parameter is x. levels, exclude, ordered and nmax are arguments of factor, not of as.factor. Hence the error about unused arguments.
If you remove those arguments and run the function then it works without any error message.
lapply(df['cola'], function(x) as.factor(x))
#$cola
# [1] a b c d e e 1 <NA> c d
#Levels: 1 a b c d e
Or just:
lapply(df['cola'], as.factor)
And if you have just one column, there is no need for lapply:
as.factor(df$cola)
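To make the difference in signatures concrete, here is a small sketch (the vector x and the level order are invented for illustration). It catches the error from as.factor and then does the same job with factor:

```r
x <- c("a", "b", NA, "c")

# as.factor() only accepts x, so any extra argument triggers an error
res <- tryCatch(
  as.factor(x, levels = c("c", "b", "a")),
  error = function(e) conditionMessage(e)
)
res  # the error message, returned as a string

# factor() is the function that understands levels, exclude, ordered, nmax
f <- factor(x, levels = c("c", "b", "a"))
levels(f)

names(formals(as.factor))  # the single formal argument of as.factor
```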
lapply(x, as.factor) returning just one level
You want
df[] <- lapply(df, factor)
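A short sketch of why the empty brackets matter, assuming a toy data frame (its columns are made up here): assigning into df[] replaces the columns one by one while keeping df a data frame, so each column gets its own set of levels.

```r
# Hypothetical data frame with two character columns
df <- data.frame(
  a = c("x", "y", "x"),
  b = c("u", "u", "v"),
  stringsAsFactors = FALSE
)

# Convert every column separately; df[] keeps the data frame structure
df[] <- lapply(df, factor)

is.data.frame(df)
sapply(df, nlevels)  # number of levels per column
```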
why do I get a levels error in as.factor() R?
You need factor(), not as.factor():
yearsPostEDI<-factor(yearsPostEDI, levels = c("HIV_neg", "<1 Year", "1 Year", "2 Years", "3 Years", "4 Years", ">4 Years"))
R: use of factor
Factors vs character vectors when doing stats:
In terms of doing statistics, there's no difference in how R treats factors and character vectors. In fact, it's often easier to leave factor variables as character vectors.
If you do a regression or ANOVA with lm() with a character vector as a categorical variable you'll get normal model output but with the message:
Warning message:
In model.matrix.default(mt, mf, contrasts) :
variable 'character_x' converted to a factor
Factors vs character vectors when manipulating dataframes:
When manipulating dataframes, however, character vectors and factors are treated very differently. Some information on the annoyances of R & factors can be found on the Quantum Forest blog, R pitfall #3: friggin’ factors.
It's useful to use stringsAsFactors = FALSE when reading data in from a .csv or .txt using read.table or read.csv. As noted in another reply, you have to make sure that everything in your character vector is consistent, or else every typo will be designated as a different factor level. You can use the function gsub() to fix typos.
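As a rough sketch of that clean-up step (the raw values and the specific typos are invented for illustration), inconsistent spellings can be normalized with tolower() and gsub() before converting to a factor:

```r
# Hypothetical raw labels with case, whitespace, and spelling inconsistencies
raw <- c("dog", "Dog", "dog ", "cat", "catt")

nlevels(factor(raw))  # every variant counts as its own level

clean <- tolower(raw)                    # unify case
clean <- gsub("^\\s+|\\s+$", "", clean)  # strip leading/trailing whitespace
clean <- gsub("^catt$", "cat", clean)    # fix one known typo

levels(factor(clean))
```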
Here is a worked example showing how lm() gives you the same results with a character vector and a factor.
A random independent variable:
continuous_x <- rnorm(10,10,3)
A random categorical variable as a character vector:
character_x <- (rep(c("dog","cat"),5))
Convert the character vector to a factor variable.
factor_x <- as.factor(character_x)
Give the two categories random values:
character_x_value <- ifelse(character_x == "dog", 5*rnorm(1,0,1), rnorm(1,0,2))
Create a random relationship between the independent variables and a dependent variable:
continuous_y <- continuous_x*10*rnorm(1,0) + character_x_value
Compare the output of a linear model with the factor variable and the character vector. Note the warning that is given with the character vector.
summary(lm(continuous_y ~ continuous_x + factor_x))
summary(lm(continuous_y ~ continuous_x + character_x))
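Rather than comparing the two summaries by eye, the fitted coefficients can be compared directly. This is a self-contained sketch with its own simulated data (the seed, variable names, and effect sizes are assumptions made here, not values from the example above):

```r
set.seed(1)
x     <- rnorm(10)
g_chr <- rep(c("dog", "cat"), 5)  # categorical variable as character
g_fac <- as.factor(g_chr)         # the same variable as a factor
y     <- 2 * x + ifelse(g_chr == "dog", 1, 0) + rnorm(10, sd = 0.1)

fit_chr <- lm(y ~ x + g_chr)
fit_fac <- lm(y ~ x + g_fac)

# Coefficient names differ (g_chrdog vs g_facdog), so compare values only
all.equal(unname(coef(fit_chr)), unname(coef(fit_fac)))
```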
R considers factor name as a level
Use stringsAsFactors = T and header = T when you read the data:
db_nouns <- read.table("Final_Database.txt", stringsAsFactors = T, header = T)
colnames(db_nouns) <- c ("category", "space")
new_order <- c( "Ground", "Building", "Tool_precise_grip", "Tool_power_grip", "Food", "Clothes", "Animal", "Object", "Transport", "Action", "Body_Part", "Sense_Phys", "Sound", "Sense_Emotion", "Intelligence", "Space")
db_nouns$category <- factor(db_nouns$category, levels = new_order)
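The role of header can be sketched without the original file by reading from an inline string (the text and its three rows are invented for illustration). Without header = TRUE, the column name is read as data and ends up as a factor level:

```r
# Hypothetical file contents: a header row followed by three records
txt <- "category space\nFood 1\nGround 2\nFood 3"

with_header <- read.table(text = txt, header = TRUE,  stringsAsFactors = TRUE)
no_header   <- read.table(text = txt, header = FALSE, stringsAsFactors = TRUE)

levels(with_header$category)           # only real categories
"category" %in% levels(no_header$V1)   # the header leaked in as a level

# Reordering levels explicitly, as in the answer above
reordered <- factor(with_header$category, levels = c("Ground", "Food"))
levels(reordered)
```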
as.factor not working with INT values on R
Checking the documentation for as.factor (by typing ?as.factor), you'll see it says that the first argument x is "a vector of data, usually taking a small number of distinct values". If you supply multiple columns of a data frame, they are treated as one vector. In your example, as.factor creates a unique factor level for each unique value in the entire vectorized concatenation of columns 4 through 7 of your data frame above.
You should use:
data[4:7] <- lapply(data[4:7], as.factor)
or (requiring tidyverse packages)
data <- data %>% mutate_at(4:7, as.factor)
Both of these solutions will correctly treat each column supplied, here columns 4, 5, 6, and 7, as their own vectors, individually. Each one is converted to a factor separately, and re-assigned appropriately.
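A minimal base R sketch of the per-column conversion on a toy data frame (the frame and its column positions 2:3 are made up here and stand in for columns 4 through 7 of the original data):

```r
# Hypothetical data frame with integer-coded columns a and b
data <- data.frame(
  id = 1:4,
  a  = c(1L, 2L, 1L, 2L),
  b  = c(0L, 0L, 1L, 1L)
)

# Convert only columns 2 and 3, each one on its own
data[2:3] <- lapply(data[2:3], as.factor)

sapply(data, is.factor)  # id stays integer; a and b become factors
levels(data$a)
levels(data$b)
```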