## How can I compare two factors with different levels?

Convert to character then compare:

    # data
    A <- factor(1:5)
    B <- factor(c(1:3, 6, 6))

    str(A)
    # Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5

    str(B)
    # Factor w/ 4 levels "1","2","3","6": 1 2 3 4 4

    mean(A == B)
    # Error in Ops.factor(A, B) : level sets of factors are different

    mean(as.character(A) == as.character(B))
    # [1] 0.6

Another approach would be

    mean(levels(A)[A] == levels(B)[B])

which is 2 times slower than the `as.character()` version on a dataset of 1e8 elements.
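If you would rather keep the data as factors, a third option is to give both factors a common level set first, after which `==` no longer errors. This is only a minimal sketch; the names `lv`, `A2` and `B2` are made up for illustration.

```r
A <- factor(1:5)
B <- factor(c(1:3, 6, 6))

# Re-encode both factors on the union of their level sets.
lv <- union(levels(A), levels(B))
A2 <- factor(A, levels = lv)
B2 <- factor(B, levels = lv)

# With identical level sets, `==` compares the labels element by element.
prop_equal <- mean(A2 == B2)
prop_equal
```

The result should agree with the `as.character()` comparison, since both compare labels rather than integer codes.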

## Compare the levels of two factors

If `factor1` and `factor2` are your two factors, just look at `levels(factor1)` and `levels(factor2)`.

Same number of levels:

    length(levels(factor1)) == length(levels(factor2))

Values in one and not the other:

    setdiff(levels(factor1), levels(factor2))
    setdiff(levels(factor2), levels(factor1))
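Two factors can have the same number of levels and still use different labels, so it may help to test the level sets directly. The sketch below uses made-up example factors; `same_set`, `same_order` and `shared` are illustrative names only.

```r
# Hypothetical example data.
factor1 <- factor(c("a", "b", "c"))
factor2 <- factor(c("b", "c", "d"))

same_set   <- setequal(levels(factor1), levels(factor2))   # same labels, any order
same_order <- identical(levels(factor1), levels(factor2))  # same labels, same order
shared     <- intersect(levels(factor1), levels(factor2))  # labels present in both

same_set
same_order
shared
```

Here both factors have three levels, yet neither check passes, because `"a"` and `"d"` each appear in only one of them.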

## R - Using If statement to compare factors with different levels

You could convert them to characters for the comparison. However, if you want to compare all of the rows, you'll probably want to use `ifelse`:

    ifelse(as.character(z$x) == as.character(z$y), 1, 0)
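To see why the conversion matters, here is a small sketch on a made-up data frame (the original `z` is not shown, so the values below are assumptions). It also compares the underlying integer codes, which gives a misleading answer when the level sets differ.

```r
# Hypothetical data frame standing in for the question's `z`.
z <- data.frame(x = factor(c("a", "b", "c")),
                y = factor(c("a", "d", "c")))

# Row-wise comparison on the labels.
z$same <- ifelse(as.character(z$x) == as.character(z$y), 1, 0)

# Comparing integer codes instead: "c" is code 3 in x but code 2 in y,
# so the third row is wrongly reported as different.
codes_equal <- as.integer(z$x) == as.integer(z$y)

z
codes_equal
```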

## How to compare two R data frames to find missing factor-level?

Just take the set difference between the levels of the two factors.

    F1 = factor(c('A', 'B', 'C'))
    F2 = factor(c('B', 'C'))
    setdiff(levels(F1), levels(F2))
    [1] "A"
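One caveat when the factors live in data frames: subsetting a data frame keeps unused levels, so `levels()` may still list a level that no row uses. A sketch with made-up data frames `df1` and `df2` (both hypothetical) shows the difference that `droplevels()` makes.

```r
# Hypothetical data frames: df2 is a subset of df1 that no longer contains "A".
df1 <- data.frame(g = factor(c("A", "B", "C")), v = 1:3)
df2 <- df1[df1$g != "A", ]

# Subsetting keeps the unused level "A", so the raw level sets look identical.
missing_raw <- setdiff(levels(df1$g), levels(df2$g))

# Dropping unused levels first reveals the level that is actually missing.
missing_used <- setdiff(levels(df1$g), levels(droplevels(df2$g)))

missing_raw
missing_used
```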

## Compare factor levels in R

I think you are looking for the `table` function:

    > table(a1, a2)
           a2
    a1      [1,3] (3,4]
      [1,2]     4     0
      (2,3]     2     0
      (3,4]     0     3
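The original `a1` and `a2` are not shown. The sketch below assumes they came from `cut()` with two different sets of breaks, and uses made-up values of `x` chosen so that the counts match the table above.

```r
# Hypothetical values, chosen so the counts match the table above.
x <- c(1, 1.5, 2, 2, 2.5, 3, 3.5, 4, 4)

a1 <- cut(x, breaks = 1:4, include.lowest = TRUE)         # three bins
a2 <- cut(x, breaks = c(1, 3, 4), include.lowest = TRUE)  # two bins

# Cross-tabulate the two factors: rows are levels of a1, columns levels of a2.
tab <- table(a1, a2)
tab
```

Each non-zero cell shows how a level of one factor maps onto a level of the other.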

## Take difference between two levels of factor variable while retaining other factor variables in R

One option with `dplyr` would be

    library(dplyr)
    my.df %>%
      group_by(Gene, Population) %>%
      summarize(Coverage = Coverage[Color == "Blue"] - Coverage[Color == "Green"])
    # A tibble: 4 x 3
    # Groups:   Gene [?]
    #   Gene  Population Coverage
    #   <fct> <fct>         <dbl>
    # 1 A_1   PopA       -0.00600
    # 2 A_1   PopB       -0.420
    # 3 A_2   PopA       -0.01
    # 4 A_2   PopB        0.100

**Data**

    my.df <-
      structure(list(Gene = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("A_1", "A_2"), class = "factor"),
        Population = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("PopA", "PopB"), class = "factor"),
        Color = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("Blue", "Green"), class = "factor"),
        Coverage = c(0.016, 0.022, 0.1322, 0.552, 0.13, 0.14, 1, 0.9)), class = "data.frame", row.names = c(NA, -8L))
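If `dplyr` is not available, the same differences can probably be obtained in base R by splitting on `Color` and merging the two halves. This is only a sketch of that alternative, not the answer's method; `keep`, `blue`, `green` and `wide` are illustrative names.

```r
# Same data as above, rebuilt with data.frame() so the block runs on its own.
my.df <- data.frame(
  Gene       = factor(rep(c("A_1", "A_2"), each = 4)),
  Population = factor(rep(rep(c("PopA", "PopB"), each = 2), times = 2)),
  Color      = factor(rep(c("Blue", "Green"), times = 4)),
  Coverage   = c(0.016, 0.022, 0.1322, 0.552, 0.13, 0.14, 1, 0.9)
)

keep  <- c("Gene", "Population", "Coverage")
blue  <- my.df[my.df$Color == "Blue",  keep]
green <- my.df[my.df$Color == "Green", keep]

# One row per Gene/Population pair, with both coverages side by side.
wide <- merge(blue, green, by = c("Gene", "Population"),
              suffixes = c(".blue", ".green"))
wide <- wide[order(wide$Gene, wide$Population), ]

# Blue minus Green, as in the dplyr version.
wide$Coverage <- wide$Coverage.blue - wide$Coverage.green
wide[, c("Gene", "Population", "Coverage")]
```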

## Comparing all factor levels to the grand mean: can I tweak contrasts in linear model fitting to show all levels?

This answer shows you how to obtain the following coefficient table:

    #               Estimate Std. Error     t value  Pr(>|t|)
    #(Intercept) -0.02901982  0.4680937 -0.06199574 0.9544655
    #A           -0.19238543  0.6619845 -0.29061922 0.7902750
    #B            0.40884591  0.6619845  0.61760645 0.5805485
    #C           -0.21646049  0.6619845 -0.32698723 0.7651640

Amazing, isn't it? It mimics what you see from `summary(fit)`, i.e.,

    #               Estimate Std. Error     t value  Pr(>|t|)
    #(Intercept) -0.02901982  0.4680937 -0.06199574 0.9544655
    #x1          -0.19238543  0.6619845 -0.29061922 0.7902750
    #x2           0.40884591  0.6619845  0.61760645 0.5805485

But now we have all factor levels displayed.

### Why does `lm` summary not display all factor levels?

In 2016, I answered the Stack Overflow question "`lm` summary not display all factor levels", and since then it has become the target for marking duplicated questions on similar topics.

To recap, the basic idea is that in order to have a full-rank design matrix for least squares fitting, we must apply contrasts to a factor variable. Let's say that the factor has *N* levels, then no matter what type of contrasts we choose (see `?contrasts` for a list), it reduces the raw *N* dummy variables to a new set of *N - 1* variables. Therefore, only *N - 1* coefficients are associated with an *N*-level factor.
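The rank argument can be checked numerically. The sketch below builds a design matrix with an intercept plus all *N* = 3 dummy variables and compares it with the one obtained under sum-to-zero contrasts; `X_raw`, `X_sum`, `rank_raw` and `rank_sum` are illustrative names.

```r
x <- factor(rep(LETTERS[1:3], each = 2))

# Intercept plus all N = 3 dummy variables: 4 columns, but only rank 3,
# because the three dummies add up to the intercept column.
X_raw <- cbind("(Intercept)" = 1, model.matrix(~ x - 1))
rank_raw <- qr(X_raw)$rank

# Intercept plus N - 1 = 2 contrast columns: 3 columns, full rank 3.
X_sum <- model.matrix(~ x, contrasts.arg = list(x = "contr.sum"))
rank_sum <- qr(X_sum)$rank

c(ncol(X_raw), rank_raw, ncol(X_sum), rank_sum)
```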

However, we can transform the *N - 1* coefficients back to the original *N* coefficients using the **contrasts matrix**. The transformation enables us to obtain a coefficient table for all factor levels. I will now demonstrate how to do this, based on OP's reproducible example:

    set.seed(1)
    y <- rnorm(6, 0, 1)
    x <- factor(rep(LETTERS[1:3], each = 2))
    fit <- lm(y ~ x, contrasts = list(x = contr.sum))

In this example, the sum-to-zero contrast is applied to factor `x`. To know more on how to control contrasts for model fitting, see my answer at "How to set contrasts for my variable in regression analysis with R?".

### R code walk-through

For a factor variable of *N* levels subject to sum-to-zero contrasts, we can use the following function to get the *N x (N - 1)* transformation matrix that maps the *(N - 1)* coefficients estimated by `lm` back to the *N* coefficients for all levels.

    ContrSumMat <- function (fctr, sparse = FALSE) {
      if (!is.factor(fctr)) stop("'fctr' is not a factor variable!")
      N <- nlevels(fctr)
      Cmat <- contr.sum(N, sparse = sparse)
      dimnames(Cmat) <- list(levels(fctr), seq_len(N - 1))
      Cmat
    }

For the example 3-level factor `x`, this matrix is:

    Cmat <- ContrSumMat(x)
    #   1  2
    #A  1  0
    #B  0  1
    #C -1 -1

The fitted model `fit` reports 3 - 1 = 2 coefficients for this factor. We can extract them as:

    ## coefficients After Contrasts
    coef_ac <- coef(fit)[2:3]
    #        x1         x2
    #-0.1923854  0.4088459

Therefore, the level-specific coefficients are:

    ## coefficients Before Contrasts
    coef_bc <- (Cmat %*% coef_ac)[, 1]
    #         A          B          C
    #-0.1923854  0.4088459 -0.2164605

    ## note that they sum to zero as expected
    sum(coef_bc)
    #[1] 0

Similarly, we can get the covariance matrix after contrasts:

    var_ac <- vcov(fit)[2:3, 2:3]
    #           x1         x2
    #x1  0.4382235 -0.2191118
    #x2 -0.2191118  0.4382235

and transform it to the one before contrasts:

    var_bc <- Cmat %*% var_ac %*% t(Cmat)
    #           A          B          C
    #A  0.4382235 -0.2191118 -0.2191118
    #B -0.2191118  0.4382235 -0.2191118
    #C -0.2191118 -0.2191118  0.4382235

**Interpretation:**

- The model intercept `coef(fit)[1]` is the grand mean.
- The computed `coef_bc` is the deviation of each level's mean from the grand mean.
- The diagonal entries of `var_bc` give the estimated variance of these deviations.
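This interpretation can be checked against plain group means. In the sketch below, "grand mean" is taken to be the unweighted mean of the level means, which coincides with `mean(y)` here because the design is balanced; `level_means`, `grand_mean` and `deviations` are illustrative names.

```r
set.seed(1)
y <- rnorm(6, 0, 1)
x <- factor(rep(LETTERS[1:3], each = 2))
fit <- lm(y ~ x, contrasts = list(x = contr.sum))

level_means <- tapply(y, x, mean)        # mean response per level
grand_mean  <- mean(level_means)         # unweighted mean of the level means
deviations  <- level_means - grand_mean  # one deviation per level

# Level-specific coefficients recovered through the contrasts matrix.
coef_bc <- drop(contr.sum(3) %*% coef(fit)[2:3])

intercept_ok  <- all.equal(unname(coef(fit)[1]), grand_mean)
deviations_ok <- all.equal(as.numeric(coef_bc), as.numeric(deviations))
c(intercept_ok, deviations_ok)
```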

We can then proceed to compute t-statistics and p-values for these coefficients, as follows.

    ## standard error of point estimate `coef_bc`
    std.error_bc <- sqrt(diag(var_bc))
    #        A         B         C
    #0.6619845 0.6619845 0.6619845

    ## t-statistics (Null Hypothesis: coef_bc = 0)
    t.stats_bc <- coef_bc / std.error_bc
    #         A          B          C
    #-0.2906192  0.6176065 -0.3269872

    ## p-values of the t-statistics
    p.value_bc <- 2 * pt(abs(t.stats_bc), df = fit$df.residual, lower.tail = FALSE)
    #        A         B         C
    #0.7902750 0.5805485 0.7651640

    ## construct a coefficient table that mimics `coef(summary(fit))`
    stats.tab_bc <- cbind("Estimate" = coef_bc,
                          "Std. Error" = std.error_bc,
                          "t value" = t.stats_bc,
                          "Pr(>|t|)" = p.value_bc)
    #    Estimate Std. Error    t value  Pr(>|t|)
    #A -0.1923854  0.6619845 -0.2906192 0.7902750
    #B  0.4088459  0.6619845  0.6176065 0.5805485
    #C -0.2164605  0.6619845 -0.3269872 0.7651640

We can also augment it by including the result for the grand mean (i.e., the model intercept).

    ## extract statistics of the intercept
    intercept.stats <- coef(summary(fit))[1, , drop = FALSE]
    #               Estimate Std. Error     t value  Pr(>|t|)
    #(Intercept) -0.02901982  0.4680937 -0.06199574 0.9544655

    ## augment the coefficient table
    stats.tab <- rbind(intercept.stats, stats.tab_bc)
    #               Estimate Std. Error     t value  Pr(>|t|)
    #(Intercept) -0.02901982  0.4680937 -0.06199574 0.9544655
    #A           -0.19238543  0.6619845 -0.29061922 0.7902750
    #B            0.40884591  0.6619845  0.61760645 0.5805485
    #C           -0.21646049  0.6619845 -0.32698723 0.7651640

We can also print this table with significance stars.

    printCoefmat(stats.tab)
    #            Estimate Std. Error t value Pr(>|t|)
    #(Intercept) -0.02902    0.46809 -0.0620   0.9545
    #A           -0.19239    0.66199 -0.2906   0.7903
    #B            0.40885    0.66199  0.6176   0.5805
    #C           -0.21646    0.66199 -0.3270   0.7652

Emm? Why are there no stars? Well, in this example all p-values are very large. The stars will show up if p-values are small. Here is a convincing demo:

    fake.tab <- stats.tab
    fake.tab[, 4] <- fake.tab[, 4] / 100
    printCoefmat(fake.tab)
    #            Estimate Std. Error t value Pr(>|t|)
    #(Intercept) -0.02902    0.46809 -0.0620 0.009545 **
    #A           -0.19239    0.66199 -0.2906 0.007903 **
    #B            0.40885    0.66199  0.6176 0.005805 **
    #C           -0.21646    0.66199 -0.3270 0.007652 **
    #---
    #Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Oh, this is so beautiful. For the meaning of these stars, see my answer at "Interpreting R significance codes for ANOVA table?".

### Closing Remarks

It should be possible to write a function (or even an **R** package) to perform such a table transformation. However, it might take great effort to make such a function flexible enough to handle:

- all types of contrasts (this is easy to do);
- complicated model terms, like interactions between a factor and other numeric/factor variables (this seems really involved!!).

So, I will stop here for the moment.
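For the simplest case only, the walk-through above could be wrapped roughly as follows. This is a sketch under strong assumptions: the model is `y ~ fctr` with an intercept, a single factor, and sum-to-zero contrasts, so that the factor's coefficients sit in positions 2 to *N*. The function name `all_levels_table` is hypothetical.

```r
# Hypothetical helper: valid only for lm(y ~ fctr) fitted with contr.sum.
all_levels_table <- function(fit, fctr) {
  if (!is.factor(fctr)) stop("'fctr' is not a factor variable!")
  N <- nlevels(fctr)
  Cmat <- contr.sum(N)
  rownames(Cmat) <- levels(fctr)
  idx <- 1 + seq_len(N - 1)  # assumes the intercept is the first coefficient

  # Map the N - 1 estimated coefficients and their covariance to all N levels.
  est  <- drop(Cmat %*% coef(fit)[idx])
  vmat <- Cmat %*% vcov(fit)[idx, idx, drop = FALSE] %*% t(Cmat)
  se   <- sqrt(diag(vmat))
  tval <- est / se
  pval <- 2 * pt(abs(tval), df = fit$df.residual, lower.tail = FALSE)

  # Stack the intercept row on top of the level-specific rows.
  rbind(coef(summary(fit))[1, , drop = FALSE],
        cbind("Estimate" = est, "Std. Error" = se,
              "t value" = tval, "Pr(>|t|)" = pval))
}

# Usage on the example from this answer.
set.seed(1)
y <- rnorm(6, 0, 1)
x <- factor(rep(LETTERS[1:3], each = 2))
fit <- lm(y ~ x, contrasts = list(x = contr.sum))
tab <- all_levels_table(fit, x)
printCoefmat(tab)
```

Other contrast types and interaction terms are deliberately not handled here.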

### Miscellaneous Replies

Are the model scores that I get from the lm's summary still accurate, even though it isn't displaying all levels of the factor?

Yes, they are. `lm` conducts accurate least squares fitting.

In addition, the transformation of the coefficient table does not affect R-squared, degrees of freedom, residuals, fitted values, F-statistics, the ANOVA table, etc.
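A related check, sketched below, is that the choice of contrasts itself only reparameterises the model: fitting the same data with sum-to-zero and with treatment contrasts should give the same fitted values and R-squared. The names `fit_sum`, `fit_trt` and the `same_*` flags are illustrative.

```r
set.seed(1)
y <- rnorm(6, 0, 1)
x <- factor(rep(LETTERS[1:3], each = 2))

fit_sum <- lm(y ~ x, contrasts = list(x = contr.sum))        # sum-to-zero
fit_trt <- lm(y ~ x, contrasts = list(x = contr.treatment))  # R's default type

# Different coefficients, same fit.
same_fitted <- all.equal(fitted(fit_sum), fitted(fit_trt))
same_r2     <- all.equal(summary(fit_sum)$r.squared, summary(fit_trt)$r.squared)
same_df     <- fit_sum$df.residual == fit_trt$df.residual

c(same_fitted, same_r2, same_df)
```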
