Plot the equivalent of correlation matrix for factors (categorical data)? And mixed types?
Here's a tidyverse solution:
# example dataframe
df <- data.frame(
  group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
  student = c('01', '01', '01', '02', '02', '01', '02'),
  exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
  subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
)
library(tidyverse)
library(lsr)
# function to get the chi-square p-value and Cramer's V for two columns of df
# x and y are column names passed as strings, hence all_of()
# (chisq.test() will warn about small expected counts on this toy data)
f <- function(x, y) {
  tbl <- df %>% select(all_of(c(x, y))) %>% table()
  chisq_pval <- round(chisq.test(tbl)$p.value, 4)
  cramV <- round(cramersV(tbl), 4)
  data.frame(x, y, chisq_pval, cramV)
}
# create unique combinations of column names
# sorting helps to get a cleaner (upper triangular) plot
df_comb <- data.frame(t(combn(sort(names(df)), 2)), stringsAsFactors = FALSE)
# apply the function to each pair of variables
df_res <- map2_df(df_comb$X1, df_comb$X2, f)
# plot results: tile colour is the p-value, label is Cramer's V
df_res %>%
  ggplot(aes(x, y, fill = chisq_pval)) +
  geom_tile() +
  geom_text(aes(label = cramV)) +
  scale_fill_gradient(low = "red", high = "yellow") +
  theme_classic()
Note that I'm using the lsr package to calculate Cramer's V with its cramersV function.
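For readers who want to see what sits behind that number, here is a minimal base R sketch of the textbook formula V = sqrt(chi2 / (n * (min(rows, cols) - 1))). The helper name cramers_v_manual and the data are made up for illustration and are not part of lsr. It uses the uncorrected Pearson chi-square, so a package result may differ on 2x2 tables if a continuity correction is applied there.

```r
# Hypothetical helper (not from lsr): Cramer's V from a contingency table
cramers_v_manual <- function(tbl) {
  # Pearson chi-square statistic without Yates' continuity correction
  chi2 <- suppressWarnings(chisq.test(tbl, correct = FALSE)$statistic)
  n <- sum(tbl)              # total number of observations
  k <- min(dim(tbl)) - 1     # smaller table dimension minus one
  unname(sqrt(chi2 / (n * k)))
}

# Illustrative data (made up)
x <- c("a", "a", "b", "b", "a", "b", "a", "b")
y <- c("u", "u", "v", "v", "u", "v", "v", "u")
tbl <- table(x, y)
v <- cramers_v_manual(tbl)
print(v)
```

V ranges from 0 (no association) to 1 (perfect association), which is what makes it usable as a stand-in for a correlation coefficient in the tile plot.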
How to do a correlation matrix with categorical, ordinal and interval variables?
First, there are already many posts here on correlation coefficients suitable for different variable types, so I will only link some: continuous/categorical, continuous/ordinal, binary/ordinal, categorical/categorical and others (just search this site).
Then, if you want, you could put these various correlation coefficients into a matrix, as a kind of covariance matrix (you would also have to decide how to generalize the variances that go on the diagonal). This could be just fine as a compact way of presenting the information. But is it really a covariance matrix? That is, does it have the usual properties of a covariance matrix? The answer is no. It is not necessarily positive semidefinite, so using it in any procedure that requires a covariance matrix as input would be problematic, at the very least.
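One way to check this for a given matrix is to look at its eigenvalues: a valid covariance or correlation matrix has none below zero. The sketch below uses a made-up 3x3 matrix of pairwise coefficients (the values are hypothetical, chosen only to show the failure) and a small numerical tolerance that is itself an assumption.

```r
# Hypothetical matrix of pairwise association coefficients (values made up)
m <- matrix(c( 1.0,  0.9, -0.9,
               0.9,  1.0,  0.9,
              -0.9,  0.9,  1.0), nrow = 3, byrow = TRUE)

# Eigenvalues of the symmetric matrix
ev <- eigen(m, symmetric = TRUE, only.values = TRUE)$values

# Positive semidefinite means no eigenvalue is (meaningfully) negative
is_psd <- all(ev >= -1e-8)
print(ev)
print(is_psd)
```

Each entry on its own is a legitimate coefficient between -1 and 1, yet the matrix as a whole has a negative eigenvalue, so no set of three variables could have it as their joint correlation matrix.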
So if you want more than just a compact presentation of some coefficients, you are better off telling us what your real analytical goal is, and then searching for a way to answer that directly. You could ask that as a new question (linking back to this one).
Computing a correlation matrix with both numerical and logical variables
I am not sure whether your question is about creating a data.frame object with several types of variables (see comment) or about how to compute correlations if your data is, as you mentioned, "numerical and binary" (*). This link might help. I expect Dan Chaltiel's answer (the last one) in particular will help you.
(*) In the latter case the thread is possibly a duplicate.
EDIT: Considering Dan Chaltiel's approach (see link), does this help?
df <- data.frame(a = c(34, 54, 55, 12, 13, 6),
                 b = c("FALSE", "TRUE", "TRUE", "TRUE", "TRUE", "FALSE"),
                 c = c(1:6))
library(dplyr)
model.matrix(~0+., data = df) %>%
  cor(use = "pairwise.complete.obs")
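To make clear what the model.matrix step does, here is a small base R sketch on the same toy data. It assumes that b is expanded into two indicator columns named bFALSE and bTRUE (the usual naming when a formula has no intercept), and then checks that the matrix entry for a and bTRUE matches a direct Pearson correlation between a and a 0/1 recoding of b. The names mm, cm, b_num and r_direct are introduced only for illustration.

```r
df <- data.frame(a = c(34, 54, 55, 12, 13, 6),
                 b = c("FALSE", "TRUE", "TRUE", "TRUE", "TRUE", "FALSE"),
                 c = 1:6)

# Expand the non-numeric column into 0/1 indicator columns
mm <- model.matrix(~ 0 + ., data = df)
print(colnames(mm))

# Correlation matrix over the numeric and indicator columns
cm <- cor(mm, use = "pairwise.complete.obs")

# The same number obtained directly: Pearson correlation with a 0/1 recoding
b_num <- as.numeric(df$b == "TRUE")
r_direct <- cor(df$a, b_num)
print(c(from_matrix = cm["a", "bTRUE"], direct = r_direct))
```

Because bFALSE and bTRUE are exact complements, their mutual correlation is -1 and the rows for the two columns mirror each other, so in practice one of them can be dropped when reading the matrix.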
Output for correlation in R
Correlation is only meaningful for quantitative variables.
Your code computes the correlations between the numbers of yachts of each type,
i.e., the correlation between the columns of the frequency matrix.
There are analogues of correlation for qualitative variables:
Cramer's V, Phi, etc.
library(DescTools)
counts <- table(dat1[,1:2])
CramerV(counts) # 0.15
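For the special case of two binary variables, the Phi coefficient mentioned above can be sketched in base R as sqrt(chi2 / n), using the uncorrected chi-square statistic. The data below are made up, since dat1 from the question is not shown here, and the variable names are hypothetical. On 0/1 data this Phi equals the absolute value of the ordinary Pearson correlation.

```r
# Made-up binary variables coded 0/1
u <- c(1, 1, 1, 0, 0, 0, 1, 0)
w <- c(1, 1, 0, 0, 0, 1, 1, 0)

tab <- table(u, w)

# Pearson chi-square without continuity correction
chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)

# Phi coefficient for a 2x2 table
phi <- unname(sqrt(chi2 / sum(tab)))

# Pearson correlation of the 0/1 codes, for comparison
r <- cor(u, w)
print(c(phi = phi, abs_r = abs(r)))
```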