Plot the Equivalent of a Correlation Matrix for Factors (Categorical Data) and Mixed Types

Plot the equivalent of correlation matrix for factors (categorical data)? And mixed types?

Here's a tidyverse solution:

# example dataframe
df <- data.frame(
  group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
  student = c('01', '01', '01', '02', '02', '01', '02'),
  exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
  subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
)

library(tidyverse)
library(lsr)

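# function to get the chi-square p-value and Cramér's V for one pair of columns
f <- function(x, y) {
  # all_of() selects columns whose names are given as character strings
  tbl <- df %>% select(all_of(c(x, y))) %>% table()
  # chisq.test() may warn that the approximation is unreliable for small counts
  chisq_pval <- round(chisq.test(tbl)$p.value, 4)
  cramV <- round(cramersV(tbl), 4)
  data.frame(x, y, chisq_pval, cramV)
}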

# create unique combinations of column names
# sorting the names helps produce a cleaner (upper-triangular) plot
df_comb <- data.frame(t(combn(sort(names(df)), 2)), stringsAsFactors = FALSE)

# apply function to each variable combination
df_res = map2_df(df_comb$X1, df_comb$X2, f)

# plot results
df_res %>%
  ggplot(aes(x, y, fill = chisq_pval)) +
  geom_tile() +
  geom_text(aes(x, y, label = cramV)) +
  scale_fill_gradient(low = "red", high = "yellow") +
  theme_classic()

[Figure: tile plot of the variable pairs, filled by chi-square p-value and labelled with Cramér's V]

Note that I'm using the lsr package to calculate Cramér's V with its cramersV function.
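
For readers who want to see what the statistic is, here is a minimal base-R sketch of Cramér's V computed from the chi-square statistic, V = sqrt(chi2 / (n * (min(rows, cols) - 1))). The helper name cramers_v_manual is hypothetical and not part of lsr. It is computed here without a continuity correction, so for 2x2 tables its value may differ from packaged implementations, depending on whether they apply one.

```r
# Hypothetical helper (not part of lsr): Cramér's V from a contingency table
cramers_v_manual <- function(tbl) {
  # Pearson chi-square statistic, without Yates' continuity correction;
  # warnings about small expected counts are suppressed for this toy data
  chi2 <- suppressWarnings(chisq.test(tbl, correct = FALSE)$statistic)
  n <- sum(tbl)              # total number of observations
  k <- min(dim(tbl)) - 1     # smaller table dimension minus one
  unname(sqrt(chi2 / (n * k)))
}

# Two columns from the example data frame above
exam_pass <- c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N')
subject <- c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')

v <- cramers_v_manual(table(exam_pass, subject))
v
```

The value lies between 0 (no association) and 1 (perfect association). With only seven rows, the number is an illustration rather than a reliable estimate.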

How to do a correlation matrix with categorical, ordinal and interval variables?

First, there are already many posts here on correlation coefficients suited to different variable types, so I will only link some: continuous/categorical, continuous/ordinal, binary/ordinal, categorical/categorical, and others (just search this site).

Then, if you want, you could put these various correlation coefficients into a matrix resembling a covariance matrix (you would also have to decide how to generalize the variances that go on the diagonal). This could be just fine as a compact way of presenting the information. But is it really a covariance matrix? That is, does it have the usual properties of a covariance matrix? The answer is no. It is not necessarily positive semidefinite, so using it in any procedure that requires a covariance matrix as input would be, at the least, problematic.
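
As a rough illustration of that last point, the sketch below checks the eigenvalues of a symmetric matrix with a unit diagonal. The three off-diagonal coefficients are made-up numbers chosen for illustration, not values computed from any data. Each is a plausible pairwise coefficient on its own, yet together they cannot come from a real covariance matrix.

```r
# Made-up pairwise coefficients for three variables (illustration only)
m <- matrix(c( 1.0,  0.9, -0.9,
               0.9,  1.0,  0.9,
              -0.9,  0.9,  1.0),
            nrow = 3, byrow = TRUE)

# A valid covariance or correlation matrix has no negative eigenvalues
ev <- eigen(m, symmetric = TRUE)$values
is_psd <- all(ev >= -1e-8)   # small tolerance for floating-point error

ev
is_psd
```

One eigenvalue is negative here, so is_psd is FALSE. A matrix assembled from different pairwise coefficients can fail in the same way, which is why it is worth running this kind of check before passing it to a procedure that expects a covariance matrix.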

So if you want more than a compact presentation of some coefficients, you are better off telling us what your real analytical goal is, and then searching for a way to answer that directly. You could ask that as a new question (linking back to this one).

Computing a correlation matrix with both numerical and logical variables

I am not sure whether your question is about the creation of a data.frame object with several types of variables (see comment) or how to compute correlations if your data is, as you mentioned, "numerical and binary" (*). This link might help. I assume in particular Dan Chaltiel's answer (last answer) will help you.

(*) In the latter case the thread is possibly a duplicate.


EDIT: Considering Dan Chaltiel's approach (see link), does this help?

df <- data.frame(a = c(34, 54, 55, 12, 13, 6),
                 b = c("FALSE", "TRUE", "TRUE", "TRUE", "TRUE", "FALSE"),
                 c = 1:6)

library(dplyr)

model.matrix(~0+., data = df) %>%
  cor(use = "pairwise.complete.obs")
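
With this data, model.matrix() expands b into two dummy columns (bFALSE and bTRUE), so the result is a 4x4 matrix in which those two columns are perfectly negatively correlated. A possible alternative, sketched below in base R, is to recode b as a single 0/1 column before calling cor(). The names df_mixed and df_num are introduced here for illustration only. The Pearson correlation between a numeric column and a 0/1 column is the point-biserial correlation.

```r
# Same data as above, under a new (hypothetical) name
df_mixed <- data.frame(a = c(34, 54, 55, 12, 13, 6),
                       b = c("FALSE", "TRUE", "TRUE", "TRUE", "TRUE", "FALSE"),
                       c = 1:6)

# Recode the "TRUE"/"FALSE" strings as 1/0 so that every column is numeric
df_num <- transform(df_mixed, b = as.numeric(as.logical(b)))

# 3x3 correlation matrix; the a-b and b-c entries are point-biserial correlations
cor_num <- cor(df_num, use = "pairwise.complete.obs")
cor_num
```

The a-b entry here matches the a-bTRUE entry of the model.matrix() result, so the two approaches agree. The recoded version just avoids the redundant dummy column.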

Output for correlation in R

Correlation is only meaningful for quantitative variables.
Your code computes the correlations between the numbers of yachts of each type,
i.e., the correlation between the columns of the frequency matrix.

There are analogues of correlation for qualitative variables:
Cramer's V, Phi, etc.

library(DescTools)
# dat1 is the data frame from the question
counts <- table(dat1[, 1:2])
CramerV(counts) # 0.15
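
For two binary variables, the phi coefficient mentioned above can be sketched in base R as sqrt(chi2 / n), using the chi-square statistic without a continuity correction. The vectors x and y below are made-up toy data, not taken from the question. For 0/1 codings, phi equals the absolute value of the ordinary Pearson correlation, which the last line shows.

```r
# Made-up binary data (illustration only)
x <- c(1, 1, 0, 0, 1, 0, 1, 0)
y <- c(1, 0, 0, 0, 1, 1, 1, 0)

tbl <- table(x, y)

# Chi-square statistic without Yates' correction; small-count warnings suppressed
chi2 <- suppressWarnings(chisq.test(tbl, correct = FALSE)$statistic)
phi <- unname(sqrt(chi2 / sum(tbl)))

phi
abs(cor(x, y))   # same value for 0/1-coded variables
```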

