Calculate Correlation with cor(), Only for Numerical Columns

If you have a data frame where some columns are numeric and others are not (character or factor), and you only want correlations for the numeric columns, you could do the following:

set.seed(10)

x = as.data.frame(matrix(rnorm(100), ncol = 10))
x$L1 = letters[1:10]
x$L2 = letters[11:20]

cor(x)

Error in cor(x) : 'x' must be numeric

but selecting only the numeric columns first works:

cor(x[sapply(x, is.numeric)])

             V1         V2          V3          V4          V5          V6          V7
V1   1.00000000  0.3025766 -0.22473884 -0.72468776  0.18890578  0.14466161  0.05325308
V2   0.30257657  1.0000000 -0.27871430 -0.29075170  0.16095258  0.10538468 -0.15008158
V3  -0.22473884 -0.2787143  1.00000000 -0.22644156  0.07276013 -0.35725182 -0.05859479
V4  -0.72468776 -0.2907517 -0.22644156  1.00000000 -0.19305921  0.16948333 -0.01025698
V5   0.18890578  0.1609526  0.07276013 -0.19305921  1.00000000  0.07339531 -0.31837954
V6   0.14466161  0.1053847 -0.35725182  0.16948333  0.07339531  1.00000000  0.02514081
V7   0.05325308 -0.1500816 -0.05859479 -0.01025698 -0.31837954  0.02514081  1.00000000
V8   0.44705527  0.1698571  0.39970105 -0.42461411  0.63951574  0.23065830 -0.28967977
V9   0.21006372 -0.4418132 -0.18623823 -0.25272860  0.15921890  0.36182579 -0.18437981
V10  0.02326108  0.4618036 -0.25205899 -0.05117037  0.02408278  0.47630138 -0.38592733
              V8           V9         V10
V1   0.447055266  0.210063724  0.02326108
V2   0.169857120 -0.441813231  0.46180357
V3   0.399701054 -0.186238233 -0.25205899
V4  -0.424614107 -0.252728595 -0.05117037
V5   0.639515737  0.159218895  0.02408278
V6   0.230658298  0.361825786  0.47630138
V7  -0.289679766 -0.184379813 -0.38592733
V8   1.000000000  0.001023392  0.11436143
V9   0.001023392  1.000000000  0.15301699
V10  0.114361431  0.153016985  1.00000000
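
For what it's worth, here is a minimal sketch of two equivalent base R ways to pick out the numeric columns before calling cor(). The names num_cols, m1 and m2 are only illustrative; vapply() is used in place of sapply() because it always returns exactly one TRUE/FALSE per column.

```r
set.seed(10)
x <- as.data.frame(matrix(rnorm(100), ncol = 10))
x$L1 <- letters[1:10]  # a non-numeric column that cor() would reject

# Logical index of the numeric columns (one TRUE/FALSE per column)
num_cols <- vapply(x, is.numeric, logical(1))
m1 <- cor(x[num_cols])

# Filter() keeps the columns for which the predicate returns TRUE
m2 <- cor(Filter(is.numeric, x))
```

Both calls should give the same 10 x 10 correlation matrix, so which one to use is mostly a matter of taste.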

Does `cor()` only work for numeric variables?

Spearman's rho requires that the data be ordered, which characters are not, and neither are regular factors. (This is a little subtle: factors do have an ordering, which is used when listing factor levels, plotting, etc., but this ordering is not assumed to have any statistical meaning.) It would make sense if cor() allowed ordered factors (factor(..., ordered = TRUE) or ordered(...)), but it doesn't. As ?cor says:

The inputs must be numeric (as determined by ‘is.numeric’: logical
values are also allowed for historical compatibility): the
‘"kendall"’ and ‘"spearman"’ methods make sense for ordered inputs
but ‘xtfrm’ can be used to find a suitable prior transformation to
numbers.

However, assuming that you have a factor variable and the order of levels is what you want, then using as.integer() in cor() should work fine. (In fact, the xtfrm.factor() method is just a wrapper for as.integer().)

xf <- ordered(x, levels = c("None", "Little", "Often", "Always"))
cor(as.integer(xf), y, method = "spearman")
## or
cor(xtfrm(xf), y, method = "spearman")
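
A self-contained version of the same idea, as a sketch: the response data below (freq and y) are made up purely for illustration, and the level order is assumed to be the one you actually want.

```r
# Hypothetical survey responses and a numeric outcome (illustrative data only)
freq <- c("None", "Often", "Little", "Always", "Often", "None", "Little", "Always")
y    <- c(1.2, 3.9, 2.1, 5.3, 3.5, 0.8, 2.6, 4.9)

# Declare the intended order of the levels explicitly
xf <- factor(freq, levels = c("None", "Little", "Often", "Always"), ordered = TRUE)

# cor() does not accept the factor itself, so pass the integer level codes
rho_int   <- cor(as.integer(xf), y, method = "spearman")
rho_xtfrm <- cor(xtfrm(xf), y, method = "spearman")
```

Both calls should return the same value, since for a factor xtfrm() just yields the integer codes.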

Generate correlation matrix with specific columns and only with significant values in corrplot

I would use the well-established Hmisc::rcorr for the calculations. In corrplot::corrplot, subset both the corr= and the p.mat= arguments with [1:6, 7:14].

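# Note: rcorr() treats the rows of its input as observations; as written,
# it receives the correlation matrix cor(correlation_df), not the raw data.
c_df <- Hmisc::rcorr(cor(correlation_df), type='spearman')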

library(corrplot)
corrplot(corr=c_df$r[1:6, 7:14], p.mat=c_df$P[1:6, 7:14], sig.level=0.05,
         method='color', diag=FALSE, addCoef.col=1, type='upper', insig='blank',
         number.cex=.8)

The cells shown in the plot appear to correspond to the p-values below 0.05:

m <- c_df$P[1:6, 7:14] < .05
m[lower.tri(m, diag=TRUE)] <- ''
as.data.frame(replace(m, lower.tri(m, diag=TRUE), ''))
#    Al    Fe    Mn    Zn    Mo Baresoil Humdepth    pH
# N     FALSE FALSE  TRUE FALSE    FALSE    FALSE FALSE
# P            TRUE  TRUE FALSE    FALSE    FALSE FALSE
# K                  TRUE FALSE    FALSE    FALSE  TRUE
# Ca                      FALSE     TRUE     TRUE FALSE
# Mg                                TRUE     TRUE  TRUE
# S                                         FALSE FALSE
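
If you would rather not depend on Hmisc, a rough base R sketch of the same "keep only significant correlations" idea is below. The data d are random and purely illustrative, and the p-values come from cor.test(), which may differ slightly from those rcorr() reports because the two use different approximations.

```r
# Hypothetical data: 20 observations of 4 numeric variables
set.seed(1)
d <- as.data.frame(matrix(rnorm(80), ncol = 4))

# Spearman correlation matrix
r <- cor(d, method = "spearman")

# Matrix of pairwise p-values from cor.test(); the diagonal stays NA
p <- matrix(NA_real_, ncol(d), ncol(d), dimnames = dimnames(r))
for (i in seq_len(ncol(d))) {
  for (j in seq_len(ncol(d))) {
    if (i != j) {
      p[i, j] <- cor.test(d[[i]], d[[j]], method = "spearman")$p.value
    }
  }
}

# Blank out coefficients that are not significant at the 5% level
r_sig <- r
r_sig[is.na(p) | p >= 0.05] <- NA
```

The matrices r and p can then be subset with the same row and column indices and passed to corrplot() as corr= and p.mat=.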

How to calculate the correlation of 2 variables for every nth rows in a data frame in r?

We may use a group-by approach, here with blocks of 200 consecutive rows:

by(df[c('y1', 'y2')], as.integer(gl(nrow(df), 200, nrow(df))),
   FUN = function(x) cor(x$y1, x$y2))

Or, using the tidyverse (dplyr):

library(dplyr)
out <- df %>%
  group_by(grp = as.integer(gl(n(), 200, n()))) %>%
  summarise(Cor = cor(y1, y2))
> dim(out)
[1] 1000    2

data

set.seed(24)
df <- as.data.frame(matrix(rnorm(200 *1000 * 6), ncol = 6))
names(df)[1:2] <- c('y1', 'y2')
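
To see what the grouping index does, here is a small sketch on a toy data frame with a block size of 3 instead of 200. The names df_small, n, grp and block_cor are made up for the illustration.

```r
# Toy data: 6 rows, correlate y1 and y2 within each block of 3 rows
df_small <- data.frame(
  y1 = c(1, 2, 3, 1, 2, 3),
  y2 = c(2, 4, 6, 3, 2, 1)
)
n <- 3  # block size (assumed for illustration)

# Block index: rows 1-3 get 1, rows 4-6 get 2
grp <- ceiling(seq_len(nrow(df_small)) / n)

# Split the row indices by block and correlate within each block
block_cor <- sapply(split(seq_len(nrow(df_small)), grp),
                    function(i) cor(df_small$y1[i], df_small$y2[i]))
```

Here the first block is perfectly increasing and the second perfectly decreasing, so the two correlations should come out as 1 and -1. The ceiling() index gives the same grouping as the gl() call used above.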

Calculate correlations between data.frame columns and assign output to list

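Using base R, take the first row of the correlation matrix, drop the self-correlation, and convert the rest to a list: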

as.list(cor(df)[1,-1])

Output:

$e
[1] 1

$Age
[1] 1
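
A runnable sketch with made-up data, in case it helps. The data frame d and its column score are hypothetical; only the names e and Age are borrowed from the output above.

```r
# Hypothetical data frame (values chosen for illustration only)
d <- data.frame(score = c(1, 2, 3, 4, 5),
                e     = c(2, 4, 5, 4, 5),
                Age   = c(30, 25, 41, 38, 52))

# Correlation of the first column with every other column, as a named list
cors <- as.list(cor(d)[1, -1])
```

Each list element is named after the column it was correlated with, so cors$Age gives the correlation between score and Age.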

Computing a correlation matrix with both numerical and logical variables

I am not sure whether your question is about creating a data.frame object with several types of variables (see comment) or about how to compute correlations when your data are, as you mentioned, "numerical and binary" (*). This link might help; I assume Dan Chaltiel's answer in particular (the last answer) will help you.

(*) In the latter case, the thread is possibly a duplicate.


EDIT: Considering Dan Chaltiel's approach (see link), does this help?

df <- data.frame(a=c(34,54,55,12,13,6),
                 b=c("FALSE","TRUE","TRUE","TRUE","TRUE","FALSE"),
                 c=c(1:6))

library(dplyr)

model.matrix(~0+., data=df) %>%
  cor(use="pairwise.complete.obs")
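
If the binary column is a real logical vector rather than text, another option (a sketch, not necessarily what the linked answer does) is to convert every column to numeric, so that TRUE becomes 1 and FALSE becomes 0, and call cor() directly. The names d, d_num and m are illustrative.

```r
# Hypothetical data with a numeric, a logical and an integer column
d <- data.frame(a = c(34, 54, 55, 12, 13, 6),
                b = c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE),
                c = 1:6)

# Convert every column to numeric (TRUE -> 1, FALSE -> 0)
d_num <- as.data.frame(lapply(d, as.numeric))

# Pearson correlation matrix over all three columns
m <- cor(d_num)
```

This gives a single column for b instead of one dummy column per level, which may be easier to read when the variable has only two values.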

