Computing the Correlation Coefficient Between Two Multi-Dimensional Arrays

Correlation (default 'valid' case) between two 2D arrays:

You can simply use matrix multiplication with np.dot, like so -

out = np.dot(arr_one,arr_two.T)

With the default "valid" mode and rows of equal length, the correlation of each pairwise row combination (row1, row2) of the two input arrays is a single number, and that number is the entry at position (row1, row2) of the multiplication result.
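As a quick sanity check, the sketch below compares the matrix product against np.correlate called on every row pair. The array shapes and the random data are illustrative assumptions, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)
arr_one = rng.random((4, 6))   # 4 rows of length 6 (illustrative shapes)
arr_two = rng.random((3, 6))   # 3 rows of the same length

# Matrix product: entry (i, j) is the dot product of row i and row j
out = np.dot(arr_one, arr_two.T)

# Reference: 'valid' correlation of each row pair, computed one pair at a time
ref = np.array([[np.correlate(r1, r2, mode='valid')[0] for r2 in arr_two]
                for r1 in arr_one])

print(out.shape)
print(np.allclose(out, ref))
```

The loop is only there for verification; for real data the single np.dot call does the same work.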


Row-wise Correlation Coefficient calculation for two 2D arrays:

import numpy as np

def corr2_coeff(A, B):
    # Subtract each row's mean from that row of the input arrays
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]

    # Sum of squares along each row
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)

    # Finally get corr coeff: result[i, j] is for row i of A and row j of B
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None],ssB[None]))

This is based on this solution to "How to apply corr2 functions in Multidimentional arrays in MATLAB".
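One way to convince yourself that the function is correct is to compare it with np.corrcoef, which stacks the rows of both inputs and returns the full correlation matrix. The sketch below repeats corr2_coeff so that it runs on its own; the shapes and random data are assumptions chosen for illustration.

```python
import numpy as np

def corr2_coeff(A, B):
    # Same function as above, repeated so this snippet is self-contained
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None], ssB[None]))

rng = np.random.default_rng(0)
A = rng.random((5, 8))
B = rng.random((7, 8))

out = corr2_coeff(A, B)

# np.corrcoef(A, B) is (5+7) x (5+7); the top-right block holds the
# correlations between rows of A and rows of B
m = A.shape[0]
ref = np.corrcoef(A, B)[:m, m:]

print(out.shape)
print(np.allclose(out, ref))
```

np.corrcoef also computes the A-with-A and B-with-B blocks, which are not needed here, so it does extra work on large inputs.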

Benchmarking

This section compares the runtime of the proposed approach against generate_correlation_map and the loopy pearsonr-based approach listed in the other answer (taken from the function test_generate_correlation_map(), without the value correctness verification code at the end of it). Please note that the timings for the proposed approach also include a check at the start for an equal number of columns in the two input arrays, as is also done in that other answer. The runtimes are listed next.
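The column-count check mentioned above is not shown in the answer. A minimal sketch of what such a guard might look like follows; the wrapper name corr2_coeff_checked and the error message are hypothetical.

```python
import numpy as np

def corr2_coeff_checked(A, B):
    # Hypothetical guard: both inputs must have the same number of columns
    if A.shape[1] != B.shape[1]:
        raise ValueError("A and B must have the same number of columns")

    # Same computation as corr2_coeff
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None], ssB[None]))

A = np.random.rand(4, 6)
B = np.random.rand(5, 6)
print(corr2_coeff_checked(A, B).shape)
```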

Case #1:

In [106]: A = np.random.rand(1000, 100)

In [107]: B = np.random.rand(1000, 100)

In [108]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15 ms per loop

In [109]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.6 ms per loop

Case #2:

In [110]: A = np.random.rand(5000, 100)

In [111]: B = np.random.rand(5000, 100)

In [112]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 368 ms per loop

In [113]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 493 ms per loop

Case #3:

In [114]: A = np.random.rand(10000, 10)

In [115]: B = np.random.rand(10000, 10)

In [116]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 1.29 s per loop

In [117]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 1.83 s per loop

The other loopy pearsonr-based approach was far too slow for the larger cases, so here are the runtimes for one small data size only -

In [118]: A = np.random.rand(1000, 100)

In [119]: B = np.random.rand(1000, 100)

In [120]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15.3 ms per loop

In [121]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.7 ms per loop

In [122]: %timeit pearsonr_based(A, B)
1 loops, best of 3: 33 s per loop

Correlation coefficient between a 2D and a 3D array - NumPy/Python

We could use corr2_coeff from this post after reshaping the inputs to 2D versions, such that the first input is reshaped to a one-row array and the second one has as many columns as the product of the lengths of its last two axes, like so -

corr2_coeff(A.reshape(1,-1),B.reshape(B.shape[0],-1)).ravel()

Sample run -

In [143]: from scipy.stats import pearsonr
     ...:
     ...: A = np.random.random([5,5])
     ...: B = np.random.random([3,5,5])
     ...: C = []
     ...: for i in B:
     ...:     C.append(pearsonr(A.flatten(), i.flatten())[0])
     ...:
     ...: C = np.array(C)
     ...:

In [144]: C
Out[144]: array([ 0.05637413, -0.26749579, -0.08957621])

In [145]: corr2_coeff(A.reshape(1,-1),B.reshape(B.shape[0],-1)).ravel()
Out[145]: array([ 0.05637413, -0.26749579, -0.08957621])

For really huge arrays, we might need to resort to a single loop, like so -

[corr2_coeff(A.reshape(1,-1), i.reshape(1,-1)) for i in B]
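Note that each call in that loop returns a 1x1 array rather than a scalar. The sketch below, with corr2_coeff repeated so it runs on its own and with illustrative shapes, pulls out the scalar and checks the looped result against the fully vectorized one.

```python
import numpy as np

def corr2_coeff(A, B):
    # Same function as above, repeated so this snippet is self-contained
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None], ssB[None]))

rng = np.random.default_rng(0)
A = rng.random((5, 5))
B = rng.random((3, 5, 5))

# Fully vectorized: flatten A to one row and each slice of B to one row
vectorized = corr2_coeff(A.reshape(1, -1), B.reshape(B.shape[0], -1)).ravel()

# One loop over the first axis of B; [0, 0] extracts the scalar from the 1x1 result
A_row = A.reshape(1, -1)
looped = np.array([corr2_coeff(A_row, i.reshape(1, -1))[0, 0] for i in B])

print(np.allclose(vectorized, looped))
```

The loop only ever holds one flattened slice of B at a time, which is the point of using it when the reshaped B would not fit comfortably in memory.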

Computing row-wise correlation coefficients between two 2d arrays in Python

I think I'd just use a list-comprehension and a module for calculating the coefficient:

from scipy.stats import pearsonr
import numpy as np

M = 10
T = 4
A = np.random.rand(M*T).reshape((M, T))
B = np.random.rand(M*T).reshape((M, T))
diag_pear_coef = [pearsonr(A[i, :], B[i, :])[0] for i in range(M)]

Does that work for you? Note that pearsonr returns more than just the correlation coefficient, hence the [0] indexing.
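If the loop becomes a bottleneck, the same row-wise coefficients can be computed without it. This is a sketch of an alternative, not the method from the answer above; the helper name rowwise_corr is hypothetical, and np.corrcoef is used only as a reference check.

```python
import numpy as np

def rowwise_corr(A, B):
    # Hypothetical helper: Pearson r between A[i] and B[i] for every row i
    A_mA = A - A.mean(axis=1, keepdims=True)
    B_mB = B - B.mean(axis=1, keepdims=True)
    num = (A_mA * B_mB).sum(axis=1)
    den = np.sqrt((A_mA**2).sum(axis=1) * (B_mB**2).sum(axis=1))
    return num / den

rng = np.random.default_rng(0)
M, T = 10, 4
A = rng.random((M, T))
B = rng.random((M, T))

diag_pear_coef = rowwise_corr(A, B)

# Reference: one row pair at a time
ref = np.array([np.corrcoef(A[i], B[i])[0, 1] for i in range(M)])
print(np.allclose(diag_pear_coef, ref))
```

Unlike pearsonr, this returns only the coefficients and no p-values.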

Good luck!

Compute array of correlations between two multidimensional arrays in R

Using abind we may combine these two arrays into a four-dimensional one and then employ apply across the first two dimensions:

library(abind)
apply(abind(X, Y, along = 4), 1:2, function(Z) cor(Z[, 1], Z[, 2]))
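For readers working in NumPy instead of R, a rough equivalent is sketched below. It assumes X and Y are 3D arrays of the same shape and that, as in the R call, the correlation is taken along the third dimension for each position in the first two; the function name corr_along_last_axis is hypothetical.

```python
import numpy as np

def corr_along_last_axis(X, Y):
    # Hypothetical helper: Pearson r along the last axis for each leading index
    Xc = X - X.mean(axis=-1, keepdims=True)
    Yc = Y - Y.mean(axis=-1, keepdims=True)
    num = (Xc * Yc).sum(axis=-1)
    den = np.sqrt((Xc**2).sum(axis=-1) * (Yc**2).sum(axis=-1))
    return num / den

rng = np.random.default_rng(0)
X = rng.random((2, 3, 10))
Y = rng.random((2, 3, 10))

res = corr_along_last_axis(X, Y)

# Reference: one (i, j) position at a time
ref = np.array([[np.corrcoef(X[i, j], Y[i, j])[0, 1] for j in range(X.shape[1])]
                for i in range(X.shape[0])])
print(res.shape)
print(np.allclose(res, ref))
```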

Correlation coefficient between columns of 2 dataframes

I think you need something like this,

import itertools

a = df1.columns.values
b = df2.columns.values
print([df1[u].corr(df2[v]) for u, v in itertools.product(a, b)])
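A flat list loses track of which pair each value belongs to. The sketch below keeps the column names, first as a dict keyed by (u, v) and then as a table; the two small dataframes are made-up data for illustration only.

```python
import itertools
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Made-up example data; both frames share the same default row index
df1 = pd.DataFrame(rng.random((20, 2)), columns=['a', 'b'])
df2 = pd.DataFrame(rng.random((20, 3)), columns=['x', 'y', 'z'])

# Pearson correlation for every (df1 column, df2 column) pair
corr = {(u, v): df1[u].corr(df2[v])
        for u, v in itertools.product(df1.columns, df2.columns)}

# The same values as a table: rows are df1 columns, columns are df2 columns
table = pd.DataFrame([[corr[(u, v)] for v in df2.columns] for u in df1.columns],
                     index=df1.columns, columns=df2.columns)
print(table)
```

Series.corr aligns the two series on their index before computing, so the frames should share an index for the pairing of rows to be meaningful.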

