Weighted Pearson's Correlation

Weighted Pearson's Correlation?

You can go back to the definition of correlation and replace the ordinary means with weighted means:

f <- function(x, y, w = rep(1, length(x))) {
  stopifnot(length(x) == dim(y)[2])
  w <- w / sum(w)
  # Center x and y, using the weighted means
  x <- x - sum(x * w)
  y <- y - apply(t(y) * w, 2, sum)
  # Compute the weighted variances
  vx <- sum(w * x * x)
  vy <- rowSums(w * y * y) # Incorrect: see Heather's remark, in the other answer
  # Compute the weighted covariance
  vxy <- colSums(t(y) * x * w)
  # Compute the correlation
  vxy / sqrt(vx * vy)
}
f(x, y)[1]
cor(x, y[1, ]) # Identical
f(x, y, xy.wt)

Weighted correlation in R

OK, this is very easy to do in R using cov.wt (available in the base stats package):

weighted_corr <- cov.wt(DF, wt = w, cor = TRUE)
corr_matrix <- weighted_corr$cor

That's it!
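For those working in Python, numpy's cov function accepts observation weights through its aweights argument, so the same weighted correlation matrix can be built there too. A minimal sketch, with made-up toy data (the variables and weights below are not from the question):

```python
import numpy as np

# Toy data: five observations (rows) of three variables (columns)
X = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.5],
    [3.0, 4.0, 2.5],
    [4.0, 3.0, 3.5],
    [5.0, 6.0, 4.0],
])
w = np.array([1.0, 2.0, 1.0, 2.0, 1.0])  # observation weights

# Weighted covariance matrix (rows are observations, hence rowvar=False)
cov = np.cov(X, rowvar=False, aweights=w)

# Convert covariance to correlation: divide by the outer product of std devs
sd = np.sqrt(np.diag(cov))
corr_matrix = cov / np.outer(sd, sd)
print(corr_matrix)
```

As with cov.wt in R, the correlation matrix falls out of the weighted covariance in one normalization step.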

How to optimise the Pearson's correlation coefficient by adjusting the weights?

I'm willing to bet there is some closed-form solution, but if hacked code suffices, see below.

(This solution is based on the scipy.optimize package:
https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html)

(The minimization is turned into a maximization by returning -1 times r_squared.)

import numpy as np
from scipy import stats
from scipy import optimize

def get_linregress(*args):
    w1, w2, w3 = args[0]

    x1_raw = np.array([277, 115, 196])
    x2_raw = np.array([263, 118, 191])
    x3_raw = np.array([270, 114, 191])

    w = np.array([w1, w2, w3])

    # Weighted sums: element-wise product of the weights with each raw vector
    x1 = np.prod([w, x1_raw], axis=0).sum()
    x2 = np.prod([w, x2_raw], axis=0).sum()
    x3 = np.prod([w, x3_raw], axis=0).sum()

    x = np.array([x1, x2, x3])
    y = np.array([71.86, 71.14, 70.76])

    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
    r_squared = r_value**2

    return -1 * r_squared

res = optimize.minimize(get_linregress, [1, 2, 3], method='Nelder-Mead', tol=1e-6)

res.x

Weighted correlation

The main idea is that wherever you see E(...), you replace the uniform weight 1/n with w/sum(w).

Theory:

Corr(X,Y) = E( (X - E(X)) * (Y - E(Y)) ) / ( SD(X) * SD(Y) ) ;

So first calculate E(X) and E(Y).

E(X) = (2.02382 * .43873 + ... + 3.80725*.82439) / (.43873+...+.82439) = 3.368

E(Y) = [same weighted average idea] = 4.705

sd(X) = sqrt( var(X) ) = sqrt( E( (X-E(X))^2 ) )
      = sqrt( ( (.43873)(2.02382-3.368)^2 + ... + (.82439)(3.80725-3.368)^2 ) / (.43873+...+.82439) )
      = sqrt(0.3054023) = 0.5526321

sd(Y) = [same weighted average idea] = sqrt(1.860124) = 1.363863

corr(x,y) = ( (.43873)(2.02382-3.368)(6.00298-4.705) + ... + (.82439)(3.80725-3.368)(3.18414-4.705) )
            / ( (.43873+...+.82439)(.5526)(1.3634) )
          = 0.2085651
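The substitution above (replace 1/n with w/sum(w) in every expectation) translates directly into a few lines of numpy. This is a sketch with made-up data, not the numbers from the worked example, which are only partially shown:

```python
import numpy as np

def weighted_corr(x, y, w):
    """Weighted Pearson correlation: every E(...) uses weights w/sum(w)."""
    w = np.asarray(w, dtype=float) / np.sum(w)
    mx, my = np.sum(w * x), np.sum(w * y)    # weighted means E(X), E(Y)
    vx = np.sum(w * (x - mx) ** 2)           # weighted Var(X)
    vy = np.sum(w * (y - my) ** 2)           # weighted Var(Y)
    cxy = np.sum(w * (x - mx) * (y - my))    # weighted Cov(X, Y)
    return cxy / np.sqrt(vx * vy)

x = np.array([2.0, 3.1, 1.7, 3.8])
y = np.array([6.0, 4.2, 5.5, 3.2])
w = np.array([0.4, 0.9, 0.6, 0.8])
print(weighted_corr(x, y, w))
```

With equal weights this reduces to the ordinary Pearson correlation, which is a handy sanity check.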

Weighted correlation in Julia

This is the way to do it:

julia> using Statistics, StatsBase

julia> x = [1  2
            3  4
            1 -2
            3 -4
            5 -6]
5×2 Matrix{Int64}:
 1   2
 3   4
 1  -2
 3  -4
 5  -6

julia> cor(x, Weights([1,1,0,0,0]))
2×2 Matrix{Float64}:
 1.0  1.0
 1.0  1.0

julia> cor(x, Weights([0,0,1,1,1]))
2×2 Matrix{Float64}:
  1.0  -1.0
 -1.0   1.0

Is there a way in R for doing a pairwise-weighted correlation matrix?

Good question. Here is how I do it. It is not fast, but it is faster than looping.

df_correlation is a dataframe with only the variables for which I want to compute correlations,
and newdf is my original dataframe with the weights and other variables.

library(purrr) # for map() and map_dbl()

data_list <- combn(names(df_correlation), 2, simplify = FALSE)
data_list <- map(data_list, ~c(., "BalancingWeights"))

dimension <- length(names(df_correlation))
allcorr <- matrix(data = NA, nrow = dimension, ncol = dimension)
row.names(allcorr) <- names(df_correlation)
colnames(allcorr) <- names(df_correlation)

myfunction <- function(data, x, y, weight){
  indice <- !(is.na(data[[x]]) | is.na(data[[y]]))
  return(wCorr::weightedCorr(data[[x]][indice],
    data[[y]][indice], method = c("Pearson"),
    weights = data[[weight]][indice], ML = FALSE, fast = TRUE))
}

b <- map_dbl(data_list, ~myfunction(newdf, .[1], .[2], .[3]))

# combn() enumerates pairs in the column-major order of the lower triangle,
# so fill that triangle and mirror it into the upper one
allcorr[lower.tri(allcorr, diag = FALSE)] <- b
allcorr[upper.tri(allcorr)] <- t(allcorr)[upper.tri(allcorr)]
diag(allcorr) <- 1
View(allcorr)
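The same idea, pairwise-complete observations with a weighted correlation per pair, can be sketched in Python with numpy and itertools. Here weighted_corr is a hypothetical helper implementing the usual weighted Pearson formula, and the data is made up:

```python
import itertools
import numpy as np

def weighted_corr(x, y, w):
    """Weighted Pearson correlation of two vectors."""
    w = w / w.sum()
    mx, my = (w * x).sum(), (w * y).sum()
    cov = (w * (x - mx) * (y - my)).sum()
    return cov / np.sqrt((w * (x - mx) ** 2).sum() * (w * (y - my) ** 2).sum())

def pairwise_weighted_corr(columns, w):
    """Pairwise-weighted correlation matrix; NaNs dropped per pair, as in the R version."""
    names = list(columns)
    n = len(names)
    allcorr = np.eye(n)
    for (i, a), (j, b) in itertools.combinations(enumerate(names), 2):
        x, y = columns[a], columns[b]
        keep = ~(np.isnan(x) | np.isnan(y))  # pairwise-complete cases
        allcorr[i, j] = allcorr[j, i] = weighted_corr(x[keep], y[keep], w[keep])
    return allcorr

data = {
    "a": np.array([1.0, 2.0, np.nan, 4.0, 5.0]),
    "b": np.array([2.0, 1.0, 3.0, 5.0, 4.0]),
    "c": np.array([5.0, 4.0, 3.0, np.nan, 1.0]),
}
weights = np.array([1.0, 2.0, 1.0, 1.0, 2.0])
print(pairwise_weighted_corr(data, weights))
```

Filling both allcorr[i, j] and allcorr[j, i] inside the loop keeps the matrix symmetric without a separate mirroring step.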

