cor shows only NA or 1 for correlations - Why?
The 1s are because everything is perfectly correlated with itself, and the NAs are because there are NAs in your variables.
You will have to specify how you want R to compute the correlation when there are missing values, because the default is to only compute a coefficient with complete information.
You can change this behavior with the use argument to cor; see ?cor for details.
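As a rough illustration of the point above, the following sketch builds a small made-up data frame (the column names and values are invented for this example) in which every column has one missing value, and compares the default behaviour with use = "pairwise.complete.obs".

```r
# Toy data (made up for illustration): each column has one NA in a different row
df <- data.frame(
  a = c(1, 2, 3, 4, NA, 6),
  b = c(2, NA, 1, 5, 3, 4),
  c = c(NA, 1, 4, 2, 6, 5)
)

m_default  <- cor(df)                                 # default: use = "everything"
m_pairwise <- cor(df, use = "pairwise.complete.obs")  # drop NAs pair by pair

print(m_default)   # off-diagonal entries are NA
print(m_pairwise)  # every entry is a number
```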
Getting different cor() output on same data?
The difference is introduced with Fstat <- r2 * dfr/(1 - r2) due to different values of dfr (8 and 6).
> r2
[1] 0.234375000 0.011456074 0.070048129 0.002401998 0.062146106 0.062751663
[7] 0.096834764 0.110197604 0.165400138 0.216547595 0.057255529 0.068074453
[13] 0.030140179 0.136955494 0.005027238
> r2*8/(1-r2)
[1] 2.44897959 0.09271069 0.60259574 0.01926226 0.53011332 0.53562464
[7] 0.85773686 0.99076023 1.58543173 2.21121379 0.48586255 0.58437675
[13] 0.24861473 1.26951037 0.04042111
> r2*6/(1-r2)
[1] 1.83673469 0.06953302 0.45194680 0.01444669 0.39758499 0.40171848
[7] 0.64330264 0.74307017 1.18907380 1.65841034 0.36439691 0.43828256
[13] 0.18646105 0.95213278 0.03031583
Neither 8 nor 6 is correct, since dfr is based on nrow(X), while for use="complete.obs" it has to be based on the number of complete observations. This can be accomplished by changing the function definition to
cor.prob <- function (X, dfr = sum(complete.cases(X)) - 2) { …
With that change, the same results are produced with and without the prior
d <- d[rowSums(is.na(d[,3:6]))!=4,]
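To make the argument above concrete, here is a minimal sketch. The helper cor_p_complete is hypothetical and is not the original cor.prob (whose body is elided here); it only assumes the same idea of putting p-values above the diagonal, with the degrees of freedom derived from the number of complete rows. The data are random and invented for the example.

```r
# Hypothetical helper, not the original cor.prob: correlations on and below the
# diagonal, p-values above it, with df taken from the complete rows only.
cor_p_complete <- function(X) {
  X <- as.matrix(X)
  dfr <- sum(complete.cases(X)) - 2
  R <- cor(X, use = "complete.obs")
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr / (1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)
  R
}

set.seed(1)
d <- matrix(rnorm(40), ncol = 4)   # made-up data, 10 rows
d[2, 1] <- NA                      # one partially missing row
d[5, ] <- NA                       # one row that is entirely NA

res_all      <- cor_p_complete(d)
res_filtered <- cor_p_complete(d[rowSums(is.na(d)) != ncol(d), ])
print(all.equal(res_all, res_filtered))  # compares results with and without pre-filtering
```

Because the degrees of freedom no longer depend on nrow(X), dropping the all-NA rows beforehand should not change anything.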
But if I choose to use pairwise.complete.obs instead of complete.obs, then I'll keep the code as it was? And further, since I have NA values in different places in my data, will I benefit from "pairwise" rather than "complete.obs"?
Indeed, if we use pairwise.complete.obs, we gain the observations where only part of the columns are NA. But, since we then have different numbers of observations for the individual pairs of columns, a single dfr value is not appropriate; instead, we can use a dfr matrix:
library(psych)
cor.prob <- function (X, dfr = pairwiseCount(X) - 2)
{
…
Fstat <- r2 * dfr[above]/(1 - r2)
R[above] <- 1 - pf(Fstat, 1, dfr[above])
…
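The pairwise idea can also be sketched without the psych package. The helper below is hypothetical and not the original cor.prob; it assumes that crossprod() on the 0/1 matrix of observed cells gives the number of rows in which both columns of a pair are observed, which is used here in place of pairwiseCount(). The data are made up.

```r
# Hypothetical helper, not the original cor.prob: same layout, but the degrees
# of freedom form a matrix with one entry per pair of columns.
cor_p_pairwise <- function(X) {
  X <- as.matrix(X)
  ok <- !is.na(X)
  n_pair <- crossprod(ok * 1)  # n_pair[i, j] = rows where columns i and j are both observed
  dfr <- n_pair - 2
  R <- cor(X, use = "pairwise.complete.obs")
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr[above] / (1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr[above])
  R
}

set.seed(1)
d <- matrix(rnorm(40), ncol = 4)   # made-up data, 10 rows
d[2, 1] <- NA
d[7, 3] <- NA

p_mat <- cor_p_pairwise(d)
# Cross-check one pair against cor.test(), which also drops incomplete pairs
print(c(p_mat[1, 3], cor.test(d[, 1], d[, 3])$p.value))
```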
Removing NA in correlation matrix
If you simply want to get rid of any column that has one or more NAs, then just do
x<-x[,colSums(is.na(x))==0]
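A small sketch of that column filter on a made-up matrix (names and values are invented). Adding drop = FALSE is an extra precaution not in the original line: it keeps the result a matrix even if only one column survives.

```r
# Made-up matrix: column b contains a missing value
x <- cbind(a = c(1, 2, 3), b = c(4, NA, 6), c = c(7, 8, 9))

# Keep only the columns with zero NAs
x_clean <- x[, colSums(is.na(x)) == 0, drop = FALSE]
print(colnames(x_clean))  # "a" "c"
```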
However, even with missing data, you can compute a correlation matrix with no NA values by specifying the use parameter in the function cor. Setting it to either pairwise.complete.obs or complete.obs will generally result in a correlation matrix with no NAs.
complete.obs will ignore all rows with missing data, whereas pairwise.complete.obs will just ignore the missing pairs of data. Note that, although pairwise.complete.obs "sounds better" because it uses more of the available data, it isn't guaranteed to produce a positive-definite correlation matrix, which could be a problem.
> set.seed(123)
> x<-array(rnorm(500),c(100,5))
> x[sample(500,3)]<-NA
> cor(x)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 NA NA NA NA
[2,] NA 1 NA NA NA
[3,] NA NA 1 NA NA
[4,] NA NA NA 1.00000000 -0.01925986
[5,] NA NA NA -0.01925986 1.00000000
> cor(x,use="pairwise.complete.obs")
[,1] [,2] [,3] [,4] [,5]
[1,] 1.00000000 -0.04377085 -0.18049501 -0.04914247 -0.19374986
[2,] -0.04377085 1.00000000 0.01296008 0.02606083 -0.12333765
[3,] -0.18049501 0.01296008 1.00000000 -0.03218139 -0.02675554
[4,] -0.04914247 0.02606083 -0.03218139 1.00000000 -0.01925986
[5,] -0.19374986 -0.12333765 -0.02675554 -0.01925986 1.00000000
> cor(x,use="complete.obs")
[,1] [,2] [,3] [,4] [,5]
[1,] 1.00000000 -0.06263112 -0.17914810 -0.02574970 -0.20504268
[2,] -0.06263112 1.00000000 0.01263764 0.02543900 -0.12571570
[3,] -0.17914810 0.01263764 1.00000000 -0.03866312 -0.02520500
[4,] -0.02574970 0.02543900 -0.03866312 1.00000000 -0.01688848
[5,] -0.20504268 -0.12571570 -0.02520500 -0.01688848 1.00000000
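The caveat above about positive-definiteness can be seen in a deliberately extreme, made-up example: each pair of variables is observed on a different block of rows, so the three pairwise correlations are mutually inconsistent. This is only a sketch to show that the problem can occur, not a claim about how often it does in real data.

```r
# Made-up data: each pair of variables is observed on a different block of rows
a  <- c(1, 2, 3, NA, NA, NA, 1, 2, 3)
b  <- c(1, 2, 3, 1, 2, 3, NA, NA, NA)
cc <- c(NA, NA, NA, 1, 2, 3, 3, 2, 1)
X  <- cbind(a = a, b = b, c = cc)

R_pair <- cor(X, use = "pairwise.complete.obs")
print(R_pair)  # cor(a, b) = 1, cor(b, c) = 1, but cor(a, c) = -1

ev <- eigen(R_pair, symmetric = TRUE)$values
print(ev)      # one eigenvalue is negative, so R_pair is not positive-definite
```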
Does `cor()` only work for numeric variables?
Spearman's rho does require that the data be ordered, which characters are not, and even regular factors are not (this is a little bit subtle: they do have an ordering which is used when listing factor levels, plotting, etc., but this ordering is not assumed to have any statistical meaning). It would make sense if cor() allowed ordered factors (factor(..., ordered = TRUE) or ordered(...)), but it doesn't. As ?cor says:
The inputs must be numeric (as determined by ‘is.numeric’: logical
values are also allowed for historical compatibility): the
‘"kendall"’ and ‘"spearman"’ methods make sense for ordered inputs
but ‘xtfrm’ can be used to find a suitable prior transformation to
numbers.
However, assuming that you have a factor variable and the order of levels is what you want, then using as.integer() in cor() should work fine. (In fact, the xtfrm.factor() method is just a wrapper for as.integer().)
xf <- ordered(x, levels = c("None", "Little", "Often", "Always"))
cor(as.integer(xf), y, method = "spearman")
## or
cor(xtfrm(xf), y, method = "spearman")
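The snippet above leaves x and y undefined, so here is a self-contained sketch with made-up survey answers and a made-up numeric outcome. It shows that passing the ordered factor directly fails, while the as.integer() and xtfrm() routes give the same result.

```r
# Made-up survey answers and a numeric outcome
x <- c("None", "Little", "Often", "Always", "Little", "Often")
y <- c(1.0, 2.5, 4.0, 9.0, 2.0, 5.5)
xf <- ordered(x, levels = c("None", "Little", "Often", "Always"))

# Passing the factor directly fails because cor() insists on numeric input
direct <- try(cor(xf, y, method = "spearman"), silent = TRUE)
print(inherits(direct, "try-error"))

# The integer codes follow the declared level order, so both calls agree
rho_int   <- cor(as.integer(xf), y, method = "spearman")
rho_xtfrm <- cor(xtfrm(xf), y, method = "spearman")
print(c(rho_int, rho_xtfrm))
```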
dplyr: correlations with NA
There is no na.rm argument in cor; the corresponding argument is use. According to ?cor, the usage is
cor(x, y = NULL, use = "everything",
method = c("pearson", "kendall", "spearman"))
use - an optional character string giving a method for computing covariances in the presence of missing values. This must be (an abbreviation of) one of the strings "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs".
library(dplyr)
xx %>%
group_by(group) %>%
summarize(COR=cor(a,b, use = "complete.obs"))
-output
# A tibble: 4 × 2
group COR
<int> <dbl>
1 1 0.166
2 2 0.190
3 3 0.190
4 4 0.190
If there are groups with all NA, then use "na.or.complete" (updated data in the comments with groups having only NA)
xx %>%
group_by(group) %>%
summarize(COR=cor(a,b, use = "na.or.complete"))
# A tibble: 5 × 2
group COR
<int> <dbl>
1 1 0.0345
2 2 -0.397
3 3 0.150
4 4 0.376
5 5 NA
which returns the same result as an if/else condition using "complete.obs"
xx %>%
group_by(group) %>%
summarize(COR= if(any(complete.cases(a, b)))
cor(a,b, use = "complete.obs") else NA_real_)
# A tibble: 5 × 2
group COR
<int> <dbl>
1 1 0.0345
2 2 -0.397
3 3 0.150
4 4 0.376
5 5 NA
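To see why the all-NA group needs special handling, the following base R sketch runs the use options quoted earlier on two made-up pairs of vectors: one with a single missing value and one with no complete cases at all. The variable names are invented for the example.

```r
# Made-up vectors: one pair with a single NA, one pair with no complete cases
a <- c(1, 2, NA, 4, 5)
b <- c(2, 1, 4, 3, 6)
all_na <- rep(NA_real_, 5)

r_everything <- cor(a, b, use = "everything")    # NA: missing values propagate
r_complete   <- cor(a, b, use = "complete.obs")  # uses the 4 complete pairs

# "all.obs" refuses any missing value, and "complete.obs" refuses an empty set
r_all_obs  <- try(cor(a, b, use = "all.obs"), silent = TRUE)
r_empty_co <- try(cor(all_na, b, use = "complete.obs"), silent = TRUE)

# "na.or.complete" returns NA instead of an error when nothing is complete
r_empty_noc <- cor(all_na, b, use = "na.or.complete")

print(c(r_everything, r_complete, r_empty_noc))
```

The error from "complete.obs" on an empty set of complete cases is what the if/else guard above works around.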
Dealing with missing values for correlations calculation
I would vote for the second option. It sounds like you have a fair amount of missing data, so you would be looking for a sensible multiple imputation strategy to fill in the gaps. See Harrell's text "Regression Modeling Strategies" for a wealth of guidance on how to do this properly.