Correlation Between NA Columns

dplyr: correlations with NA

cor has no na.rm argument; missing values are handled through the use argument instead. According to ?cor, the usage is

cor(x, y = NULL, use = "everything",
    method = c("pearson", "kendall", "spearman"))

use - an optional character string giving a method for computing covariances in the presence of missing values. This must be (an abbreviation of) one of the strings "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs".

library(dplyr)
xx %>%
  group_by(group) %>%
  summarize(COR = cor(a, b, use = "complete.obs"))

-output

# A tibble: 4 × 2
  group   COR
  <int> <dbl>
1     1 0.166
2     2 0.190
3     3 0.190
4     4 0.190

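If some groups contain only NA, use "na.or.complete". It behaves like "complete.obs" but returns NA instead of raising an error when a group has no complete cases. (The output below uses the updated data from the comments, which includes a group with only NA.)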

xx %>%
  group_by(group) %>%
  summarize(COR = cor(a, b, use = "na.or.complete"))
# A tibble: 5 × 2
  group     COR
  <int>   <dbl>
1     1  0.0345
2     2 -0.397
3     3  0.150
4     4  0.376
5     5 NA

The same result can be obtained with an if/else condition and "complete.obs":

xx %>%
  group_by(group) %>%
  summarize(COR = if (any(complete.cases(a, b)))
    cor(a, b, use = "complete.obs") else NA_real_)
# A tibble: 5 × 2
  group     COR
  <int>   <dbl>
1     1  0.0345
2     2 -0.397
3     3  0.150
4     4  0.376
5     5 NA

cor shows only NA or 1 for correlations - Why?

The 1s are because everything is perfectly correlated with itself, and the NAs are because there are NAs in your variables.

You will have to specify how you want R to compute the correlation when there are missing values, because the default, use = "everything", returns NA for any pair of variables in which either one contains a missing value.

You can change this behavior with the use argument to cor, see ?cor for details.

NAs in Correlation in R

If your data are in data frames, cor() calculates the correlation between the columns of the two data frames. In your case you get all NA because each data frame has only one row.

You have to transpose your data frames so that this one row becomes one column; then you can calculate the correlation coefficient. To transpose, use the function t().

cor(t(df.A),t(df.B))

Removing NA in correlation matrix

If you simply want to get rid of any column that has one or more NAs, then just do

x<-x[,colSums(is.na(x))==0]

However, even with missing data, you can compute a correlation matrix with no NA values by specifying the use parameter of the function cor. Setting it to either "pairwise.complete.obs" or "complete.obs" will generally result in a correlation matrix with no NAs.

"complete.obs" ignores every row that has any missing data, whereas "pairwise.complete.obs" only ignores a row for the pairs of columns in which it has a missing value. Note that although "pairwise.complete.obs" "sounds better" because it uses more of the available data, it isn't guaranteed to produce a positive semi-definite correlation matrix, which could be a problem.

> set.seed(123)
> x <- array(rnorm(500), c(100, 5))
> x[sample(500, 3)] <- NA
> cor(x)
     [,1] [,2] [,3]        [,4]        [,5]
[1,]    1   NA   NA          NA          NA
[2,]   NA    1   NA          NA          NA
[3,]   NA   NA    1          NA          NA
[4,]   NA   NA   NA  1.00000000 -0.01925986
[5,]   NA   NA   NA -0.01925986  1.00000000
> cor(x, use = "pairwise.complete.obs")
            [,1]        [,2]        [,3]        [,4]        [,5]
[1,]  1.00000000 -0.04377085 -0.18049501 -0.04914247 -0.19374986
[2,] -0.04377085  1.00000000  0.01296008  0.02606083 -0.12333765
[3,] -0.18049501  0.01296008  1.00000000 -0.03218139 -0.02675554
[4,] -0.04914247  0.02606083 -0.03218139  1.00000000 -0.01925986
[5,] -0.19374986 -0.12333765 -0.02675554 -0.01925986  1.00000000
> cor(x, use = "complete.obs")
            [,1]        [,2]        [,3]        [,4]        [,5]
[1,]  1.00000000 -0.06263112 -0.17914810 -0.02574970 -0.20504268
[2,] -0.06263112  1.00000000  0.01263764  0.02543900 -0.12571570
[3,] -0.17914810  0.01263764  1.00000000 -0.03866312 -0.02520500
[4,] -0.02574970  0.02543900 -0.03866312  1.00000000 -0.01688848
[5,] -0.20504268 -0.12571570 -0.02520500 -0.01688848  1.00000000
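
The difference between the two deletion strategies does not depend on R. As a minimal sketch in plain Python (the data, the use of None for a missing value, and the helper names pearson, corr_complete and corr_pairwise are all made up for illustration and are not part of any library), casewise deletion drops a row as soon as any variable is missing, while pairwise deletion drops it only for the pair being correlated:

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: five rows of three variables; None marks a missing value.
rows = [
    (1.0, 2.0, 1.0),
    (2.0, 1.0, None),
    (3.0, 4.0, 2.0),
    (4.0, 3.0, 5.0),
    (5.0, 5.0, 4.0),
]


def corr_complete(rows, i, j):
    """Casewise deletion: keep only rows with no missing value in any column."""
    kept = [r for r in rows if None not in r]
    return pearson([r[i] for r in kept], [r[j] for r in kept])


def corr_pairwise(rows, i, j):
    """Pairwise deletion: keep rows where columns i and j are both present."""
    kept = [r for r in rows if r[i] is not None and r[j] is not None]
    return pearson([r[i] for r in kept], [r[j] for r in kept])


# Columns 0 and 1 have no missing values themselves, yet the two strategies
# disagree because casewise deletion also drops the row where column 2 is missing.
print(corr_complete(rows, 0, 1))
print(corr_pairwise(rows, 0, 1))
```

Because each pair of columns may be computed from a different subset of rows under pairwise deletion, the entries of the resulting matrix are not mutually consistent in general, which is why positive semi-definiteness can fail.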

How to determine correlation from dataframe with NaN?

Try this; it worked in my case. It converts every column to a numeric dtype and turns any value that cannot be parsed into NaN:

 df = df.apply(pd.to_numeric, errors='coerce')
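
To make the effect of that line concrete, here is a small sketch. The DataFrame contents and the column names a and b are invented for illustration; the original question's data is not shown. It assumes the columns were read in as strings, which is a common reason for corr() skipping them or returning nothing useful.

```python
import pandas as pd

# Hypothetical data: numbers stored as strings, with one unparseable entry.
df = pd.DataFrame({"a": ["1", "2", "x", "4"], "b": ["2", "4", "6", "9"]})

# errors="coerce" replaces anything that cannot be parsed with NaN
# instead of raising an exception.
numeric = df.apply(pd.to_numeric, errors="coerce")

# DataFrame.corr() excludes NaN values pairwise, so the row containing "x"
# is ignored when correlating a with b.
result = numeric.corr()
print(numeric.dtypes)
print(result)
```

Note that coercion silently discards bad values, so it is worth checking numeric.isna().sum() to see how many entries were lost.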

DataFrame correlation produces NaN although its values are all integers

"Those columns do not change in value right now, yes"

As Joris points out, you would expect NaN if the values do not vary. To see why, take a look at the correlation formula:

cor(i,j) = cov(i,j)/[stdev(i)*stdev(j)]

If the values of the ith or jth variable do not vary, then the respective standard deviation is zero, and so is the denominator of the fraction. The covariance in the numerator is zero as well, so the result is 0/0, and the correlation is NaN.
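
The formula can be checked by hand. The following is a minimal sketch in plain Python, not how pandas computes it internally; the function name pearson and the sample lists are made up for illustration. Plain Python raises ZeroDivisionError on 0.0 / 0.0, so the sketch returns NaN explicitly to mirror what NumPy and pandas report.

```python
import math


def pearson(x, y):
    """Pearson correlation: cov(x, y) / (stdev(x) * stdev(y))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    if denom == 0:
        # A constant column has zero variance, so the ratio is 0/0.
        return math.nan
    return cov / denom


constant = [3, 3, 3]   # never varies, so its standard deviation is 0
varying = [1, 2, 3]

print(pearson(constant, varying))    # nan
print(pearson(varying, [2, 4, 6]))   # 1.0
```

The common 1/(n-1) factors of the covariance and the standard deviations cancel, which is why the sketch works with plain sums.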


