Why Is the Terminology of Labels and Levels in Factors So Weird

Why is the terminology of labels and levels in factors so weird?

I think the way to think about the difference between labels and levels (ignoring the labels() function that Tommy describes in his answer) is this: levels tells R which values to look for in the input (x) and what order to use for the levels of the resulting factor object, while labels changes the values of those levels after the input has been coded as a factor. As suggested by Tommy's answer, no part of the factor object returned by factor() is called labels; there are just the levels, which have been adjusted by the labels argument (clear as mud).

For example:

> f <- factor(x=c("a","b","c"),levels=c("c","d","e"))
> f
[1] <NA> <NA> c  
Levels: c d e
> str(f)
Factor w/ 3 levels "c","d","e": NA NA 1

Because the first two elements of x were not found in levels, the first two elements of f are NA. Because "d" and "e" were included in levels, they show up in the levels of f even though they did not occur in x.

Now with labels:

> f <- factor(c("a","b","c"),levels=c("c","d","e"),labels=c("C","D","E"))
> f
[1] <NA> <NA> C   
Levels: C D E

After R figures out what should be in the factor, it re-codes the levels. One can of course use this to do brain-frying things such as:

> f <- factor(c("a","b","c"),levels=c("c","d","e"),labels=c("a","b","c"))
> f
[1] <NA> <NA> a   
Levels: a b c

Another way to think about labels is that factor(x,levels=L1,labels=L2) is equivalent to

f <- factor(x,levels=L1)
levels(f) <- L2
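
A quick way to convince yourself of this equivalence is to build the factor both ways and compare the pieces. This is only a sketch reusing the values from the examples above; the object names f1 and f2 are made up for illustration.

```r
x  <- c("a", "b", "c")
L1 <- c("c", "d", "e")   # values to look for in x
L2 <- c("C", "D", "E")   # names the levels should end up with

# One step: match against L1, then rename to L2
f1 <- factor(x, levels = L1, labels = L2)

# Two steps: match against L1 first, rename afterwards
f2 <- factor(x, levels = L1)
levels(f2) <- L2

# Same integer codes and same levels either way
stopifnot(identical(as.integer(f1), as.integer(f2)))
stopifnot(identical(levels(f1), levels(f2)))
```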

I think an appropriately phrased version of this example might be nice for Pat Burns's R inferno -- there are plenty of factor puzzles in section 8.2, but not this particular one ...

How is levels() different from labels() with respect to factors in R?

You can assign a label to a variable to give it a readable and understandable name. Calculations are done with the variable name, not with the variable's label. For example, the label for the variable los can be "Length of stay".

Levels are important for factor variables. Each distinct value of a factor variable is a level. For example, the factor variable eye_color can have the levels "blue", "brown" and "green". So a factor variable may have a label, but it always has levels (at least one).

Example: the factor variable eye_color may have the label "Eye Color" and has the three levels mentioned above ("blue", "brown" and "green").
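
Base R has no built-in notion of a variable label, so the sketch below stores one in an attribute named "label". That attribute name is an assumption here, borrowed from the convention some add-on packages use. The levels, by contrast, are part of every factor.

```r
eye_color <- factor(c("blue", "brown", "green", "blue"))

# Assumption: keep the variable label in a "label" attribute,
# a convention used by some add-on packages, not by base R itself
attr(eye_color, "label") <- "Eye Color"

levels(eye_color)          # the distinct values the factor can take
attr(eye_color, "label")   # the descriptive name of the variable
```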

Confusion between factor levels and factor labels

Very short: levels are the input, labels are the output in the factor() function. A factor has only a levels attribute, which is set by the labels argument in the factor() function (or by levels itself when labels is not given). This is different from the concept of labels in statistical packages like SPSS, and can be confusing in the beginning.

What you do in this line of code

df$f <- factor(df$f, levels=c('a','b','c'),
  labels=c('Treatment A: XYZ','Treatment B: YZX','Treatment C: ZYX'))

tells R that there is a vector df$f which you want to transform into a factor, in which the different levels are coded as a, b, and c, and for which you want the levels to be labeled as Treatment A etc.

The factor function will look for the values a, b and c, convert them to integer codes, and store the label values in the levels attribute of the factor. This attribute is used to convert the internal integer codes back to the correct labels. But as you can see, there is no labels attribute.

> df <- data.frame(v=c(1,2,3),f=c('a','b','c'),stringsAsFactors=TRUE)  # needed since R 4.0.0
> attributes(df$f)
$levels
[1] "a" "b" "c"

$class
[1] "factor"

> df$f <- factor(df$f, levels=c('a','b','c'),
+   labels=c('Treatment A: XYZ','Treatment B: YZX','Treatment C: ZYX'))    
> attributes(df$f)
$levels
[1] "Treatment A: XYZ" "Treatment B: YZX" "Treatment C: ZYX"

$class
[1] "factor"
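
To see the integer codes that sit behind those labels, something like the following sketch can help. It builds a standalone factor f (a name chosen just for this example) rather than reusing df.

```r
f <- factor(c("a", "b", "c", "a"),
            levels = c("a", "b", "c"),
            labels = c("Treatment A: XYZ", "Treatment B: YZX", "Treatment C: ZYX"))

codes <- as.integer(f)   # internal integer codes: 1 2 3 1
levels(f)[codes]         # looking the codes up in the levels gives the printed values
names(attributes(f))     # only "levels" and "class", nothing called labels
```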

Inquiry: Why does base R behave this way with factor()?

The levels argument must list the values that actually occur in your vector, and those values are zeros and ones, not "male" and "female". The names you want to see go in labels.

factor(c(0, 1, 0, 1), levels = 0:1, labels = c("male", "female"))
#[1] male   female male   female
#Levels: male female
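
The question itself is not shown here, so the following is only a guess at what went wrong: passing the desired names as levels. The variable names sex, wrong and right are made up for the comparison.

```r
sex <- c(0, 1, 0, 1)

# Assumed mistake: the desired names are passed as levels, so no value in
# sex matches any of them and every element becomes NA
wrong <- factor(sex, levels = c("male", "female"))

# levels lists the values present in the data; labels renames them
right <- factor(sex, levels = 0:1, labels = c("male", "female"))

all(is.na(wrong))     # TRUE
as.character(right)   # "male" "female" "male" "female"
```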

Displaying of factor levels and labels in R

Factors start with the first level being represented internally by 1.

Your two options:

1) Adjust for the 1-based index of levels (this only recovers the original values if the levels are 0, 1, ... in order):

as.numeric(drug) - 1

2) Take the levels of the factor and convert them to numeric:

as.numeric(as.character(drug))

Some people will point you in the direction of the faster option that does the same thing:

as.numeric(levels(drug))[drug]
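
Here is a small sketch comparing the options. The question's data are not shown, so drug is assumed to be a factor built from 0/1 values, and dose is a hypothetical second factor added to show where option 1 breaks down.

```r
drug <- factor(c(0, 1, 1, 0))     # levels "0" and "1", internal codes 1 and 2

as.numeric(drug)                  # 1 2 2 1  (the codes, not the values)
as.numeric(drug) - 1              # 0 1 1 0
as.numeric(as.character(drug))    # 0 1 1 0
as.numeric(levels(drug))[drug]    # 0 1 1 0

# Hypothetical factor whose levels are not 0, 1, ...: subtracting 1 no longer works
dose <- factor(c(10, 50, 10))
as.numeric(dose) - 1              # 0 1 0
as.numeric(levels(dose))[dose]    # 10 50 10
```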

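I'd also consider using logical values instead of a factor in the first place (this assumes mydat$drug still holds the original numeric 0/1 values; as.logical() applied to a factor with levels "0" and "1" gives NA).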

mydat$drug = as.logical(mydat$drug)

Setting levels when creating a factor vs. `levels()<-`

F1 uses numeric sorting, as you figured out yourself.

F2 uses lexicographic sorting, first comparing the first character, breaking ties using the second, and so on, which is why "10 years" is between "1 years" and "2 years".

F4 is created from a character vector, but with an explicit list of possible levels. So that list is taken (without sorting) and identified with the numbers 1 through 6. Then every item of your input is compared against the set of possible levels, and the associated number is stored. After all, a factor is simply a bunch of numbers (as.numeric will show them to you) associated with a list of levels used for printing. So F4 gets printed just like F2, but its levels are sorted differently.

F3 was created from F2, so its levels initially followed F2's lexicographic order. The assignment only replaces the set of level names, not the numbers in the vector. So you can think of this as renaming existing levels. If you look at the numbers, they will match those from F2, whereas the names associated, and the order of names in particular, match those from F4.

Your question claims that this was not purely a relabel, but it is: you obtain F3 from F2 using the following changes (in both rows of the printout):

10 → 2
2 → 3
20 → 10
25 → 20
3 → 25

The str function is also a good tool to look at the internal representation of a factor.
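
The question's data are not reproduced here, so the following sketch rebuilds F1 to F4 from made-up values (yrs, txt and ord are hypothetical names) that match the levels discussed above. The exact order of F2's levels depends on the locale's string sorting, so the checks only compare codes and level names.

```r
yrs <- c(3, 1, 10, 25, 2, 20)     # hypothetical data
txt <- paste(yrs, "years")
ord <- paste(sort(yrs), "years")  # level names in numeric order

F1 <- factor(yrs)                 # numeric input: levels sorted numerically
F2 <- factor(txt)                 # character input: levels sorted as strings
F3 <- F2
levels(F3) <- ord                 # renames the levels, codes untouched
F4 <- factor(txt, levels = ord)   # codes assigned by matching against ord

identical(as.integer(F2), as.integer(F3))   # TRUE: same codes as F2
identical(levels(F3), levels(F4))           # TRUE: same level names as F4
identical(as.character(F4), txt)            # TRUE: F4 still shows the original values
identical(as.character(F3), txt)            # FALSE: F3's values were renamed
```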

rename levels in ggplot because they are too long

There are a variety of ways to do this.

class_survey_info <-  c(None = "None.", Very_Little = "Very little. I've just dipped my toe in.",
                        Bit = "A bit. I've used coding in a limited capacity, such as for a small",
                        Some = "Some. I've had to write code for one or two classes and am co",
                        Good_Deal = "A good deal. I have several years of experience writing code.")

(This information might as well be in a named character vector as in a list ...)

1) Use factor() with levels set equal to your original levels and labels as your short versions (see this answer).

class_survey <- (class_survey
   |> mutate(across(Coding_Exp_Words, factor, levels = class_survey_info,
                    labels = names(class_survey_info)))
)
ggplot(...)

2) Use forcats::fct_recode(): mutate(across(Coding_Exp_Words, ~forcats::fct_recode(., !!!class_survey_info)))

3) Specify breaks and labels in your x-axis scale.

ggplot(...) +
  scale_x_discrete(breaks = class_survey_info, labels = names(class_survey_info))

(haven't tested because no reproducible example).

Solutions #1 and #2 can be done either in tidyverse with regular pipes (|> or %>%), or with magrittr's assignment pipe (class_survey %<>% mutate(...)), or in base R (class_survey <- transform(class_survey, Coding_Exp_Words = factor(...)) or class_survey$Coding_Exp_Words <- factor(class_survey$Coding_Exp_Words, ...))
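
As a rough illustration of the base R route, here is a sketch on made-up data. The survey_info vector and the answer strings are hypothetical stand-ins, not the real survey wording.

```r
# Hypothetical stand-in data; the long answers are shortened for the example
survey_info <- c(None = "None.",
                 Bit = "A bit of coding.",
                 Good_Deal = "A good deal of coding.")
class_survey <- data.frame(
  Coding_Exp_Words = c("A bit of coding.", "None.",
                       "A good deal of coding.", "A bit of coding.")
)

# levels = the long strings found in the data, labels = the short names
class_survey$Coding_Exp_Words <- factor(class_survey$Coding_Exp_Words,
                                        levels = unname(survey_info),
                                        labels = names(survey_info))

levels(class_survey$Coding_Exp_Words)         # "None" "Bit" "Good_Deal"
as.character(class_survey$Coding_Exp_Words)   # "Bit" "None" "Good_Deal" "Bit"
```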
