cor Shows Only NA or 1 for Correlations - Why

cor shows only NA or 1 for correlations - Why?

The 1s are because everything is perfectly correlated with itself, and the NAs are because there are NAs in your variables.

You will have to specify how you want R to compute the correlation when there are missing values, because the default (use = "everything") returns NA for any pair in which either variable has a missing value.

You can change this behavior with the use argument to cor; see ?cor for details.
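As a minimal sketch of the behavior described above, the toy data frame below (hypothetical data, invented only for illustration) has a single NA in column b. The default call yields NA for every off-diagonal pair involving b, while the two use settings return numbers.

```r
# Hypothetical toy data: one missing value in column b.
df <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 1, NA, 5, 4),
  c = c(5, 3, 4, 1, 2)
)

# Default (use = "everything"): NA for each off-diagonal pair involving b.
r_default <- cor(df)

# Each pair of columns uses the rows where both values are present.
r_pairwise <- cor(df, use = "pairwise.complete.obs")

# Rows containing any NA (here row 3) are dropped for all pairs.
r_complete <- cor(df, use = "complete.obs")

print(r_default)
print(r_pairwise)
print(r_complete)
```

Which setting is appropriate depends on the data; the later sections discuss the trade-off.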

Getting different cor() output on same data?

The difference is introduced with

  Fstat <- r2 * dfr/(1 - r2)

due to different values of dfr (8 and 6).

> r2
[1] 0.234375000 0.011456074 0.070048129 0.002401998 0.062146106 0.062751663
[7] 0.096834764 0.110197604 0.165400138 0.216547595 0.057255529 0.068074453
[13] 0.030140179 0.136955494 0.005027238
> r2*8/(1-r2)
[1] 2.44897959 0.09271069 0.60259574 0.01926226 0.53011332 0.53562464
[7] 0.85773686 0.99076023 1.58543173 2.21121379 0.48586255 0.58437675
[13] 0.24861473 1.26951037 0.04042111
> r2*6/(1-r2)
[1] 1.83673469 0.06953302 0.45194680 0.01444669 0.39758499 0.40171848
[7] 0.64330264 0.74307017 1.18907380 1.65841034 0.36439691 0.43828256
[13] 0.18646105 0.95213278 0.03031583

Neither 8 nor 6 is correct, since dfr is based on nrow(X), whereas for use="complete.obs" it has to be based on the number of complete observations. This can be accomplished by changing the function definition to cor.prob <- function (X, dfr = sum(complete.cases(X)) - 2) { …. With this change, the same results are produced with and without the prior

d <- d[rowSums(is.na(d[,3:6]))!=4,]
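To illustrate why the degrees of freedom must follow the number of complete observations, here is a small sketch on simulated data (the matrix X and the helper p_from are hypothetical, not taken from the question). It compares the p-value obtained with nrow(X) - 2 against the one obtained with sum(complete.cases(X)) - 2, and checks the latter against cor.test() on the complete rows.

```r
set.seed(42)
X <- matrix(rnorm(40), nrow = 10, ncol = 4)  # simulated data, 10 rows
X[c(2, 7), 3] <- NA                          # two rows become incomplete

n_all      <- nrow(X)                 # 10
n_complete <- sum(complete.cases(X))  # 8

# Correlation of columns 1 and 2 on complete rows only.
r <- cor(X, use = "complete.obs")[1, 2]

# Hypothetical helper: p-value via the same F statistic as in cor.prob.
p_from <- function(r, dfr) 1 - pf(r^2 * dfr / (1 - r^2), 1, dfr)

p_wrong <- p_from(r, n_all - 2)       # df based on nrow(X)
p_right <- p_from(r, n_complete - 2)  # df based on complete cases

# Reference: cor.test() on the complete rows.
ok    <- complete.cases(X)
p_ref <- cor.test(X[ok, 1], X[ok, 2])$p.value
```

Under these assumptions, p_right should agree with p_ref, while p_wrong is computed with too many degrees of freedom.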

But if I choose to use pairwise.complete.obs instead of complete.obs, can I keep the code as it was? And further, since I have NA values in different places in my data, will I benefit from "pairwise" rather than "complete.obs"?

Indeed, if we use pairwise.complete.obs, we gain the observations where only some of the columns are NA. But since we then have different numbers of observations for the individual pairs of columns, a single dfr value is not appropriate; instead, we can use a dfr matrix:

library(psych)
cor.prob <- function (X, dfr = pairwiseCount(X) - 2)
{

Fstat <- r2 * dfr[above]/(1 - r2)
R[above] <- 1 - pf(Fstat, 1, dfr[above])
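The fragment above omits most of the function body. As a hedged sketch only, one complete version could look like the following. The lines that compute R, above, and r2 are assumptions about the elided part, and crossprod(!is.na(X)) is swapped in as a base-R way of counting pairwise complete observations so that the example runs without the psych package.

```r
# Sketch: correlations on and below the diagonal, p-values above it.
# The default dfr is a matrix of pairwise complete counts minus 2
# (base-R stand-in for psych::pairwiseCount(X) - 2).
cor.prob <- function(X, dfr = crossprod(!is.na(as.matrix(X))) - 2) {
  R <- cor(X, use = "pairwise.complete.obs")  # assumed, not shown in the source
  above <- row(R) < col(R)                    # assumed: upper-triangle mask
  r2 <- R[above]^2                            # assumed: squared correlations
  Fstat <- r2 * dfr[above] / (1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr[above])
  R
}

# Usage on simulated data with NAs in different places.
set.seed(1)
X <- matrix(rnorm(60), nrow = 20, ncol = 3)
X[1, 1] <- NA
X[5, 2] <- NA
P <- cor.prob(X)
print(P)
```

Each p-value in the upper triangle then uses the degrees of freedom of its own pair of columns, which is the point of passing a matrix rather than a single number.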

Removing NA in correlation matrix

If you simply want to get rid of any column that has one or more NAs, then just do

x<-x[,colSums(is.na(x))==0]

However, even with missing data, you can compute a correlation matrix with no NA values by specifying the use parameter in the function cor. Setting it to either pairwise.complete.obs or complete.obs will generally result in a correlation matrix with no NAs.

complete.obs will ignore all rows with missing data, whereas pairwise.complete.obs will just ignore the missing pairs of data. Note that although pairwise.complete.obs "sounds better" because it uses more of the available data, it isn't guaranteed to produce a positive-definite correlation matrix, which could be a problem.

> set.seed(123)
> x<-array(rnorm(500),c(100,5))
> x[sample(500,3)]<-NA
> cor(x)
     [,1] [,2] [,3]        [,4]        [,5]
[1,]    1   NA   NA          NA          NA
[2,]   NA    1   NA          NA          NA
[3,]   NA   NA    1          NA          NA
[4,]   NA   NA   NA  1.00000000 -0.01925986
[5,]   NA   NA   NA -0.01925986  1.00000000
> cor(x,use="pairwise.complete.obs")
            [,1]        [,2]        [,3]        [,4]        [,5]
[1,]  1.00000000 -0.04377085 -0.18049501 -0.04914247 -0.19374986
[2,] -0.04377085  1.00000000  0.01296008  0.02606083 -0.12333765
[3,] -0.18049501  0.01296008  1.00000000 -0.03218139 -0.02675554
[4,] -0.04914247  0.02606083 -0.03218139  1.00000000 -0.01925986
[5,] -0.19374986 -0.12333765 -0.02675554 -0.01925986  1.00000000
> cor(x,use="complete.obs")
            [,1]        [,2]        [,3]        [,4]        [,5]
[1,]  1.00000000 -0.06263112 -0.17914810 -0.02574970 -0.20504268
[2,] -0.06263112  1.00000000  0.01263764  0.02543900 -0.12571570
[3,] -0.17914810  0.01263764  1.00000000 -0.03866312 -0.02520500
[4,] -0.02574970  0.02543900 -0.03866312  1.00000000 -0.01688848
[5,] -0.20504268 -0.12571570 -0.02520500 -0.01688848  1.00000000
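The caveat about pairwise.complete.obs and positive-definiteness can be seen in a deliberately contrived toy example (the data below are invented for illustration and are not from the question). Each pair of columns is observed on a different block of rows, so the three pairwise correlations contradict each other.

```r
# Contrived data: every pair of columns overlaps on a different block of rows.
m <- cbind(
  x = c(1, 2, 3,  NA, NA, NA,  1, 2, 3),
  y = c(1, 2, 3,  1, 2, 3,     NA, NA, NA),
  z = c(NA, NA, NA,  1, 2, 3,  3, 2, 1)
)

# cor(x, y) = 1 and cor(y, z) = 1, yet cor(x, z) = -1.
R_pair <- cor(m, use = "pairwise.complete.obs")
print(R_pair)

# A valid correlation matrix has no negative eigenvalues; this one does.
ev <- eigen(R_pair, symmetric = TRUE)$values
print(ev)
```

Real data are rarely this extreme, but the same mechanism can produce a matrix that downstream methods (for example, those that need to invert or factorize it) will reject.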

Does `cor()` only work for numeric variables?

Spearman's rho does require that the data be ordered, which characters are not and even regular factors are not. (This is a little subtle: factors do have an ordering, which is used when listing factor levels, plotting, etc., but this ordering is not assumed to have any statistical meaning.) It would make sense if cor() allowed ordered factors (factor(..., ordered = TRUE) or ordered(...)), but it doesn't. As ?cor says:

The inputs must be numeric (as determined by ‘is.numeric’: logical
values are also allowed for historical compatibility): the
‘"kendall"’ and ‘"spearman"’ methods make sense for ordered inputs
but ‘xtfrm’ can be used to find a suitable prior transformation to
numbers.

However, assuming that you have a factor variable and the order of levels is what you want, then using as.integer() in cor() should work fine. (In fact, the xtfrm.factor() method is just a wrapper for as.integer().)

xf <- ordered(x, levels = c("None", "Little", "Often", "Always"))
cor(as.integer(xf), y, method = "spearman")
## or
cor(xtfrm(xf), y, method = "spearman")
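Since x and y are not defined in the snippet above, here is a self-contained sketch with made-up survey responses (resp and y are hypothetical). It shows that as.integer() and xtfrm() give the same Spearman correlation, and that passing the ordered factor directly raises an error.

```r
# Hypothetical data: ordinal responses and a numeric outcome.
resp <- c("None", "Often", "Little", "Always", "Often", "None")
y    <- c(1.2, 3.4, 2.1, 4.8, 2.9, 0.7)

# The level order given here is the order used for ranking.
xf <- ordered(resp, levels = c("None", "Little", "Often", "Always"))

rho_int   <- cor(as.integer(xf), y, method = "spearman")
rho_xtfrm <- cor(xtfrm(xf), y, method = "spearman")

# Passing the factor itself fails because cor() requires numeric input.
err <- tryCatch(
  cor(xf, y, method = "spearman"),
  error = function(e) conditionMessage(e)
)
print(c(rho_int, rho_xtfrm))
print(err)
```

The result is only meaningful if the level order passed to ordered() reflects the intended ranking.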

dplyr: correlations with NA

cor has no na.rm argument; the relevant argument is use. According to ?cor, the usage is

cor(x, y = NULL, use = "everything",
method = c("pearson", "kendall", "spearman"))

use - an optional character string giving a method for computing covariances in the presence of missing values. This must be (an abbreviation of) one of the strings "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs".

library(dplyr)
xx %>%
  group_by(group) %>%
  summarize(COR = cor(a, b, use = "complete.obs"))

-output

# A tibble: 4 × 2
group COR
<int> <dbl>
1 1 0.166
2 2 0.190
3 3 0.190
4 4 0.190

If there are groups with all NA, then use "na.or.complete" (updated data in the comments with groups having only NA)

xx %>%
  group_by(group) %>%
  summarize(COR = cor(a, b, use = "na.or.complete"))
# A tibble: 5 × 2
group COR
<int> <dbl>
1 1 0.0345
2 2 -0.397
3 3 0.150
4 4 0.376
5 5 NA

The same result can be obtained with an if/else condition and "complete.obs":

xx %>%
  group_by(group) %>%
  summarize(COR = if (any(complete.cases(a, b)))
    cor(a, b, use = "complete.obs") else NA_real_)
# A tibble: 5 × 2
group COR
<int> <dbl>
1 1 0.0345
2 2 -0.397
3 3 0.150
4 4 0.376
5 5 NA
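The data frame xx is not shown in the answer, so the following sketch uses invented data and base R (split() and sapply() instead of dplyr, purely to keep the example dependency-free). It illustrates the difference between the two settings when one group has no complete pairs: "complete.obs" stops with an error, while "na.or.complete" returns NA for that group.

```r
# Hypothetical data: group 3 has no complete (a, b) pairs.
xx <- data.frame(
  group = rep(1:3, each = 4),
  a = c(1, 2, NA, 4,   2, 1, 4, 3,    NA, NA, NA, NA),
  b = c(2, 1, 3, 5,    1, 2, NA, 4,   1, 2, 3, 4)
)

# Hypothetical helper: per-group correlation with a given `use` setting.
cor_by_group <- function(use) {
  sapply(split(xx, xx$group), function(g) cor(g$a, g$b, use = use))
}

# NA for the all-missing group, numbers for the others.
res <- cor_by_group("na.or.complete")
print(res)

# "complete.obs" raises an error as soon as a group has no complete pairs.
err <- tryCatch(
  cor_by_group("complete.obs"),
  error = function(e) conditionMessage(e)
)
print(err)
```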

Dealing with missing values for correlations calculation

I would vote for the second option. It sounds like you have a fair amount of missing data, so you would be looking for a sensible multiple imputation strategy to fill in the gaps. See Harrell's text "Regression Modeling Strategies" for a wealth of guidance on how to do this properly.
