dplyr: correlations with NA
There is no na.rm argument in cor; the corresponding argument is use. According to ?cor, the usage is
cor(x, y = NULL, use = "everything",
method = c("pearson", "kendall", "spearman"))
use - an optional character string giving a method for computing covariances in the presence of missing values. This must be (an abbreviation of) one of the strings "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs".
library(dplyr)
xx %>%
group_by(group) %>%
summarize(COR=cor(a,b, use = "complete.obs"))
Output:
# A tibble: 4 × 2
group COR
<int> <dbl>
1 1 0.166
2 2 0.190
3 3 0.190
4 4 0.190
If some groups contain only NA values, use "na.or.complete" instead (the data was updated in the comments to include a group having only NA):
xx %>%
group_by(group) %>%
summarize(COR=cor(a,b, use = "na.or.complete"))
# A tibble: 5 × 2
group COR
<int> <dbl>
1 1 0.0345
2 2 -0.397
3 3 0.150
4 4 0.376
5 5 NA
This returns the same result as an if/else condition combined with "complete.obs":
xx %>%
group_by(group) %>%
summarize(COR= if(any(complete.cases(a, b)))
cor(a,b, use = "complete.obs") else NA_real_)
# A tibble: 5 × 2
group COR
<int> <dbl>
1 1 0.0345
2 2 -0.397
3 3 0.150
4 4 0.376
5 5 NA
cor shows only NA or 1 for correlations - Why?
The 1s are because everything is perfectly correlated with itself, and the NAs are because there are NAs in your variables.
You will have to specify how you want R to compute the correlation when there are missing values, because the default is to only compute a coefficient with complete information.
You can change this behavior with the use argument to cor; see ?cor for details.
NAs in Correlation in R
If your data are in data frames, cor() calculates the correlation between the columns of the two data frames. In your case you get all NA because each data frame has only one row.
You have to transpose the data frames so that this one row becomes one column; then you can calculate the correlation coefficient. To transpose, use the function t().
cor(t(df.A),t(df.B))
Removing NA in correlation matrix
If you simply want to get rid of any column that has one or more NAs, then just do
x<-x[,colSums(is.na(x))==0]
However, even with missing data, you can compute a correlation matrix with no NA values by specifying the use parameter of the function cor. Setting it to either pairwise.complete.obs or complete.obs will result in a correlation matrix with no NAs.
complete.obs will ignore all rows with missing data, whereas pairwise.complete.obs will just ignore the missing pairs of data. Note that although pairwise.complete.obs "sounds better" because it uses more of the available data, it isn't guaranteed to produce a positive-definite correlation matrix, which could be a problem.
> set.seed(123)
> x<-array(rnorm(500),c(100,5))
> x[sample(500,3)]<-NA
> cor(x)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 NA NA NA NA
[2,] NA 1 NA NA NA
[3,] NA NA 1 NA NA
[4,] NA NA NA 1.00000000 -0.01925986
[5,] NA NA NA -0.01925986 1.00000000
> cor(x,use="pairwise.complete.obs")
[,1] [,2] [,3] [,4] [,5]
[1,] 1.00000000 -0.04377085 -0.18049501 -0.04914247 -0.19374986
[2,] -0.04377085 1.00000000 0.01296008 0.02606083 -0.12333765
[3,] -0.18049501 0.01296008 1.00000000 -0.03218139 -0.02675554
[4,] -0.04914247 0.02606083 -0.03218139 1.00000000 -0.01925986
[5,] -0.19374986 -0.12333765 -0.02675554 -0.01925986 1.00000000
> cor(x,use="complete.obs")
[,1] [,2] [,3] [,4] [,5]
[1,] 1.00000000 -0.06263112 -0.17914810 -0.02574970 -0.20504268
[2,] -0.06263112 1.00000000 0.01263764 0.02543900 -0.12571570
[3,] -0.17914810 0.01263764 1.00000000 -0.03866312 -0.02520500
[4,] -0.02574970 0.02543900 -0.03866312 1.00000000 -0.01688848
[5,] -0.20504268 -0.12571570 -0.02520500 -0.01688848 1.00000000
How to determine correlation from dataframe with Nan?
Try converting every column to numeric first; values that cannot be parsed become NaN. This worked in my case:
df = df.apply(pd.to_numeric, errors='coerce')
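As a minimal sketch of why this conversion can help, the example below assumes the columns were read in as strings and that one cell holds a non-numeric placeholder. The column names a and b and the "n/a" value are made up for illustration; they are not from the original question.

```python
import pandas as pd

# Hypothetical data: numbers stored as strings, with one unparseable cell.
df = pd.DataFrame({
    "a": ["1", "2", "3", "4", "n/a"],
    "b": ["2", "4", "6", "8", "10"],
})

# Convert every column to a numeric dtype; anything that cannot be
# parsed (here "n/a") is replaced by NaN instead of raising an error.
numeric = df.apply(pd.to_numeric, errors="coerce")

# corr() skips NaN pairwise, so the four complete rows are used.
corr = numeric.corr()
print(corr)
```

After the conversion, the correlation between a and b is computed from the rows where both values are present.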
DataFrame correlation produces NaN although its values are all integers
Those columns do not change in value right now, yes
As Joris points out, you would expect NaN if the values do not vary. To see why, take a look at the correlation formula:
cor(i,j) = cov(i,j)/[stdev(i)*stdev(j)]
If the values of the ith or jth variable do not vary, then the respective standard deviation will be zero and so will the denominator of the fraction. Thus, the correlation will be NaN.
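The formula can be worked through by hand in plain Python. This is only an illustrative sketch: the helper name pearson and the sample lists are assumptions, and because plain Python raises an error on 0.0/0.0 rather than producing NaN, the helper returns NaN explicitly when the denominator is zero.

```python
import math

def pearson(x, y):
    """cor(i,j) = cov(i,j) / [stdev(i) * stdev(j)], using sample statistics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    denom = sd_x * sd_y
    # A constant column has zero standard deviation, so the denominator is
    # zero; return NaN explicitly, since Python would raise ZeroDivisionError.
    return cov / denom if denom != 0 else float("nan")

varying = [1, 2, 3, 4]
constant = [5, 5, 5, 5]  # hypothetical column whose values never change

r_ok = pearson(varying, [2, 4, 6, 8])
r_nan = pearson(varying, constant)
print(r_ok, r_nan)
```

The second call has a zero covariance and a zero denominator, which is the 0/0 case the answer describes.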