Weighted Pearson's Correlation?
You can go back to the definition of the correlation.
f <- function( x, y, w = rep(1,length(x))) {
stopifnot( length(x) == dim(y)[2] )
w <- w / sum(w)
# Center x and y, using the weighted means
x <- x - sum(x*w)
y <- y - apply( t(y) * w, 2, sum )
# Compute the variance
vx <- sum( w * x * x )
vy <- rowSums( w * y * y ) # Incorrect: see Heather's remark, in the other answer
# Compute the covariance
vxy <- colSums( t(y) * x * w )
# Compute the correlation
vxy / sqrt(vx * vy)
}
f(x,y)[1]
cor(x,y[1,]) # Identical
f(x, y, xy.wt)
Weighted correlation in R
Ok, this is very easy to do in R using cov.wt
(available in stats
)
weighted_corr <- cov.wt(DF, wt = w, cor = TRUE)
corr_matrix <- weighted_corr$cor
That's it!
How to optimise the Pearson's correlation coefficient by adjusting the weights?
I'm willing to bet there is some closed form solution, but if hacked code suffices, see below
(this solution is based on the scipy.optimize package
https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html)
(the minimize is turned into a maximize by returning -1 times r_squared)
import numpy as np
from scipy import stats
from scipy import optimize
import IPython
def get_linregress(*args):
#IPython.embed()
w1,w2,w3 = args[0]
x1_raw=np.array([277, 115, 196])
x2_raw=np.array([263, 118, 191])
x3_raw=np.array([270, 114, 191])
w=np.array([w1, w2, w3])
#w=np.array([1, 1, 1])
x1=np.prod([w,x1_raw], axis=0).sum()
x2=np.prod([w,x2_raw], axis=0).sum()
x3=np.prod([w,x3_raw], axis=0).sum()
x=np.array([x1, x2, x3])
y=np.array([71.86, 71.14, 70.76])
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y) r_squared = r_value**2
return -1*r_squared
res = optimize.minimize(get_linregress, [1,2,3], method='Nelder-Mead', tol=1e-6)
res.x
Weighted correlation
The main idea is that whenever you see E(...) you replace 1/n with w/sum(w).
Theory:
Corr(X,Y) = E( (X - E(X))*(Y - E(Y) ) / SD(X)SD(Y) ;
So first calculate E(X) and E(Y).
E(X) = (2.02382 * .43873 + ... + 3.80725*.82439) / (.43873+...+.82439) = 3.368
E(Y) = [same weighted average idea] = 4.705
sd(X) = sqrt( var(X) ) = sqrt( E( (X-E(X))^2 ) ) = sqrt( ( (.43873)(2.02382-3.368)^2 + ... + (.82439)(3.80725-3.368)^2 ) / (.43873+...+.82439) ) = sqrt(0.3054023) = 0.5526321
sd(Y) = [same weighted average idea] = sqrt(1.860124) = 1.363863
corr(x,y) = ( (.43873)(2.02382-3.368)(6.00298-4.705)+...+(.82439)(3.80725-3.368)(3.18414-4.705) ) / ( (.43873+...+.82439)(.5526)(1.3634) ) = 0.2085651
Weighted correlation in julia
This is the way to do it:
julia> using Statistics, StatsBase
julia> x = [1 2
3 4
1 -2
3 -4
5 -6]
5×2 Matrix{Int64}:
1 2
3 4
1 -2
3 -4
5 -6
julia> cor(x, Weights([1,1,0,0,0]))
2×2 Matrix{Float64}:
1.0 1.0
1.0 1.0
julia> cor(x, Weights([0,0,1,1,1]))
2×2 Matrix{Float64}:
1.0 -1.0
-1.0 1.0
Is there a way in R for doing a pairwise-weighted correlation matrix?
Good question
Here how I do it
It is not fast but faster than looping.
df_correlation is a dataframe with only the variables I want to compute the correlations
and newdf is my original dataframe with the weight and other variables
data_list <- combn(names(df_correlation),2,simplify = FALSE)
data_list <- map(data_list,~c(.,"BalancingWeights"))
dimension <- length(names(df_correlation))
allcorr <- matrix(data =NA,nrow = dimension,ncol = dimension)
row.names(allcorr)<-names(df_correlation)
colnames(allcorr) <- names(df_correlation)
myfunction<- function(data,x,y,weight){
indice <-!(is.na(data[[x]])|is.na(data[[y]]))
return(wCorr::weightedCorr(data[[x]][indice],
data[[y]][indice], method = c("Pearson"),
weights = data[[weight]][indice], ML = FALSE, fast = TRUE))
}
b <- map_dbl(data_list,~myfunction(newdf,.[1],.[2],.[3]))
allcorr[upper.tri(allcorr, diag = FALSE)]<- b
allcorr[lower.tri(allcorr,diag=FALSE)] <- b
view(allcorr)
Related Topics
How to Remove Row If It Has a Na Value in One Certain Column
Changing Font Size in R Datatables (Dt)
How to Remove Unique Entry and Keep Duplicates in R
Minimal Example of Rpy2 Regression Using Pandas Data Frame
Dynamic Position for Ggplot2 Objects (Especially Geom_Text)
Why Is Stat = "Identity" Necessary in Geom_Bar in Ggplot
Texture in Barplot for 7 Bars in R
Problems Using Foreach Parallelization
How to Change the Na Color from Gray to White in a Ggplot Choropleth Map
R: Robust Se's and Model Diagnostics in Stargazer Table
How to Call External R Script from R Markdown (.Rmd) in Rstudio
Show That Shiny Is Busy (Or Loading) When Changing Tab Panels