Calculating Pearson Correlation and Significance in Python

Calculate Pearson correlation in Python

You should use a groupby combined with corr() as your aggregation function:

import pandas as pd

country = ['India','India','India','India','India','China','China','China','China','China']
Year = [2018,2017,2016,2015,2014,2018,2017,2016,2015,2014]
GDP = [100,98,94,64,66,200,189,165,134,130]
CO2 = [94,96,90,76,64,180,172,150,121,117]
df = pd.DataFrame({'country':country,'Year':Year,'GDP':GDP,'CO2':CO2})
print(df.groupby('country')[['GDP','CO2']].corr())

If we work this output a bit, we can get something fancier:

df_corr = (df.groupby('country')[['GDP','CO2']].corr()
             .drop(columns='GDP')
             .drop('CO2', level=1)
             .rename(columns={'CO2': 'Correlation'}))
df_corr = df_corr.reset_index().drop(columns='level_1').set_index('country', drop=True)
print(df_corr)
print(df_corr)

Output:

         Correlation
country
China       0.999581
India       0.932202
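Note that corr() only returns the coefficients. If you also want a p-value per group, one option (a sketch using scipy.stats.pearsonr inside apply, with the same sample data as above) is:

```python
import pandas as pd
from scipy.stats import pearsonr

country = ['India']*5 + ['China']*5
GDP = [100, 98, 94, 64, 66, 200, 189, 165, 134, 130]
CO2 = [94, 96, 90, 76, 64, 180, 172, 150, 121, 117]
df = pd.DataFrame({'country': country, 'GDP': GDP, 'CO2': CO2})

def corr_with_p(g):
    # pearsonr returns both the coefficient and the two-tailed p-value
    r, p = pearsonr(g['GDP'], g['CO2'])
    return pd.Series({'Correlation': r, 'p_value': p})

df_corr = df.groupby('country')[['GDP', 'CO2']].apply(corr_with_p)
print(df_corr)
```

The correlation column matches the groupby/corr() output above, and the extra p_value column lets you filter by significance.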

How can I compute the Pearson correlation matrix and retain only significant values?

Looking through the docs for pearsonr reveals the formula used to compute the correlations. It should not be too difficult to get the correlations between each column of a matrix using vectorization.

While you could compute the value of C using pandas, I will show a pure NumPy implementation for the entire process.

First, compute the r-values:

import numpy as np
import scipy.stats

X = np.array([[1.,  1., -2.],
              [0.,  0.,  0.],
              [0.,  .2,  1.],
              [5.,  3.,  4.]])
n = X.shape[0]

X -= X.mean(axis=0)
s = (X**2).sum(axis=0)
r = (X[..., None] * X[..., None, :]).sum(axis=0) / np.sqrt(s[:, None] * s[None, :])

Computing the p-values is simple given the existence of the beta distribution in scipy. Taken directly from the docs:

dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
p = 2 * dist.cdf(-abs(r))

You can trivially make a mask from p with your threshold, and apply it to r to make C:

mask = (p <= 0.01)
C = np.zeros_like(r)
C[mask] = r[mask]

A better option would probably be to modify your r in-place:

r[p > 0.1] = 0

In function form:

def non_trivial_correlation(X, threshold=0.1):
    n = X.shape[0]
    X = X - X.mean(axis=0)  # Don't modify the original
    s = (X**2).sum(axis=0)
    r = (X[..., None] * X[..., None, :]).sum(axis=0) / np.sqrt(s[:, None] * s[None, :])
    p = 2 * scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2).cdf(-np.abs(r))
    r[p > threshold] = 0
    return r
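As a sanity check, the vectorized r and p values can be compared entry by entry against scipy.stats.pearsonr. This is a self-contained sketch of the steps above using the same sample matrix:

```python
import numpy as np
import scipy.stats

X = np.array([[1.,  1., -2.],
              [0.,  0.,  0.],
              [0.,  .2,  1.],
              [5.,  3.,  4.]])
n = X.shape[0]

# Vectorized r-values: center the columns, then form all pairwise dot products
Xc = X - X.mean(axis=0)
s = (Xc**2).sum(axis=0)
r = (Xc[..., None] * Xc[..., None, :]).sum(axis=0) / np.sqrt(s[:, None] * s[None, :])

# p-values from the beta distribution, as in the scipy docs
dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
p = 2 * dist.cdf(-np.abs(r))

# Each off-diagonal entry should match pearsonr on the column pair
for i in range(3):
    for j in range(i + 1, 3):
        r_ij, p_ij = scipy.stats.pearsonr(X[:, i], X[:, j])
        assert np.isclose(r[i, j], r_ij)
        assert np.isclose(p[i, j], p_ij)
```

If the assertions pass, the vectorized computation reproduces pearsonr for every column pair at once.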

Calculating Pearson correlation

Use scipy:

scipy.stats.pearsonr(x, y)

Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.

The Pearson correlation coefficient measures the linear relationship between two datasets. Strictly speaking, Pearson’s correlation requires that each dataset be normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.

The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.

Parameters:

x : 1D array
y : 1D array, the same length as x

Returns:

(Pearson's correlation coefficient, 2-tailed p-value)
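For example, on data with an exact linear relationship the coefficient comes out at 1 and the p-value is essentially zero:

```python
from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # y is an exact linear function of x

r, p = pearsonr(x, y)
print(r, p)
```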

Estimate Pearson correlation coefficient from stream of data

Yes, this can be computed incrementally. The method is a small generalisation of Welford's algorithm for running variance.

You maintain a number of variables, updating them each time data comes in. At each stage they hold the running mean, variance, and covariance of the data seen so far.

Initialisation:

int n = 0; // number of points
double mx = 0.0; // mean of x's
double my = 0.0; // mean of y's
double vx = 0.0; // variance of x's
double vy = 0.0; // variance of y's
double cxy = 0.0; // covariance of x and y

Update (new values x, y come in):

n += 1;
double f = 1.0/n;
double dx = x - mx;
double dy = y - my;
mx += f*dx;
my += f*dy;
vx = (1.0-f)*(vx + f*dx*dx);
vy = (1.0-f)*(vy + f*dy*dy);
cxy= (1.0-f)*(cxy+ f*dx*dy);

In terms of these variables we have

rxy = cxy / sqrt(vx * vy)

Note though that vx and vy will be zero after just one pair has been seen.

Don't be surprised if the stream of estimates for rxy is noisy. Estimates of correlation tend to be so.
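The update rules above translate directly into Python; here is a sketch as a small class (names like StreamingPearson are my own, not from any library):

```python
import math

class StreamingPearson:
    """Incrementally track Pearson correlation (Welford-style updates)."""
    def __init__(self):
        self.n = 0                 # number of points seen
        self.mx = self.my = 0.0    # running means of x and y
        self.vx = self.vy = 0.0    # running (biased) variances
        self.cxy = 0.0             # running covariance

    def update(self, x, y):
        self.n += 1
        f = 1.0 / self.n
        dx = x - self.mx
        dy = y - self.my
        self.mx += f * dx
        self.my += f * dy
        self.vx = (1.0 - f) * (self.vx + f * dx * dx)
        self.vy = (1.0 - f) * (self.vy + f * dy * dy)
        self.cxy = (1.0 - f) * (self.cxy + f * dx * dy)

    def correlation(self):
        if self.vx == 0.0 or self.vy == 0.0:
            return float('nan')  # undefined until two distinct values are seen
        return self.cxy / math.sqrt(self.vx * self.vy)
```

Call update(x, y) for each incoming pair and read off correlation() whenever an estimate is needed; the biased variances cancel in the ratio, so the result equals the batch Pearson coefficient.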

Implementing Pearson's correlation coefficient for 3 variables in scipy.stats

You can use itertools.combinations to produce 3 pairs to calculate the correlation for:

from itertools import combinations
import scipy.stats

list_of_vars = [array_1, array_2, array_3]
results = [scipy.stats.pearsonr(*pair) for pair in combinations(list_of_vars, 2)]

The result will be a list of 2-tuples of (r, p_value), e.g.:

[(0.3488605505012684, 0.7731373652607254),   # for <array_1, array_2>
 (-0.7590110031075414, 0.45136569429566353), # for <array_1, array_3>
 (0.3453846068421791, 0.7754969404436115)]   # for <array_2, array_3>
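To keep track of which pair produced which result, you can run the same combinations call over the names as well. A sketch, with hypothetical random data standing in for the three arrays:

```python
from itertools import combinations
import numpy as np
import scipy.stats

# Hypothetical data in place of array_1 / array_2 / array_3
rng = np.random.default_rng(0)
arrays = {name: rng.random(10) for name in ('array_1', 'array_2', 'array_3')}

# Keyed by the pair of names, so each result is self-describing
results = {(a, b): scipy.stats.pearsonr(arrays[a], arrays[b])
           for a, b in combinations(arrays, 2)}

for (a, b), (r, p) in results.items():
    print(f'<{a}, {b}>: r={r:.4f}, p={p:.4f}')
```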

How do I determine a correlation coefficient in Python?

The easiest way is to use scipy.stats:

import numpy as np
from scipy.stats import pearsonr

x = np.random.random(20)
y = np.random.random(20)

print(pearsonr(x, y))

This will give you two values, the correlation and the p-value.

You can implement it yourself like this:

x = np.random.random(20)
y = np.random.random(20)
x_bar = np.mean(x)
y_bar = np.mean(y)

top = np.sum((x - x_bar) * (y - y_bar))
bot = np.sqrt(np.sum(np.power(x - x_bar, 2)) * np.sum(np.power(y - y_bar, 2)))

print(top/bot)

Both give the same result. Good luck!


