Calculate Pearson correlation in Python
You should use groupby together with corr() as your aggregation function:
import pandas as pd
country = ['India','India','India','India','India','China','China','China','China','China']
Year = [2018,2017,2016,2015,2014,2018,2017,2016,2015,2014]
GDP = [100,98,94,64,66,200,189,165,134,130]
CO2 = [94,96,90,76,64,180,172,150,121,117]
df = pd.DataFrame({'country':country,'Year':Year,'GDP':GDP,'CO2':CO2})
print(df.groupby('country')[['GDP','CO2']].corr())
If we rework this output a bit, we can get something tidier, with one correlation value per country:
df_corr = (df.groupby('country')[['GDP','CO2']].corr()).drop(columns='GDP').drop('CO2',level=1).rename(columns={'CO2':'Correlation'})
df_corr = df_corr.reset_index().drop(columns='level_1').set_index('country',drop=True)
print(df_corr)
Output:
         Correlation
country
China       0.999581
India       0.932202
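As a cross-check on the output above, the same per-country values can be obtained more directly by applying Series.corr inside each group. This is a minimal sketch that is not part of the original answer; the name corr_by_country is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['India'] * 5 + ['China'] * 5,
    'Year': [2018, 2017, 2016, 2015, 2014] * 2,
    'GDP': [100, 98, 94, 64, 66, 200, 189, 165, 134, 130],
    'CO2': [94, 96, 90, 76, 64, 180, 172, 150, 121, 117],
})

# Series.corr uses the Pearson method by default, so each group
# collapses to a single GDP/CO2 correlation coefficient.
corr_by_country = (
    df.groupby('country')[['GDP', 'CO2']]
      .apply(lambda g: g['GDP'].corr(g['CO2']))
      .rename('Correlation')
)
print(corr_by_country)
```

The result is a Series indexed by country, so no dropping or re-indexing is needed afterwards.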
How can I compute the Pearson correlation matrix and retain only significant values?
Looking through the docs for pearsonr reveals the formulae used to compute the correlations. It should not be too difficult to get the correlations between each pair of columns of a matrix using vectorization.
While you could compute the value of C using pandas, I will show a pure NumPy implementation for the entire process.
First, compute the r-values:
import numpy as np
X = np.array([[1, 1, -2],
              [0, 0, 0],
              [0, .2, 1],
              [5, 3, 4]])
n = X.shape[0]
X -= X.mean(axis=0)
s = (X**2).sum(axis=0)
r = (X[..., None] * X[..., None, :]).sum(axis=0) / np.sqrt(s[:, None] * s[None, :])
Computing the p-values is made simple given the existence of the beta distribution in scipy. Taken directly from the docs:
import scipy.stats
dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
p = 2 * dist.cdf(-abs(r))
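To gain some confidence in the vectorized formulas, one can compare them against scipy.stats.pearsonr evaluated pair by pair. The following is a minimal sketch, not part of the original answer; Xc, max_r_diff and max_p_diff are illustrative names, and the data is centered into a copy rather than in place.

```python
import numpy as np
import scipy.stats

X = np.array([[1, 1, -2],
              [0, 0, 0],
              [0, .2, 1],
              [5, 3, 4]])
n, m = X.shape

# Vectorized r and p, following the steps above (on a centered copy).
Xc = X - X.mean(axis=0)
s = (Xc**2).sum(axis=0)
r = (Xc[..., None] * Xc[..., None, :]).sum(axis=0) / np.sqrt(s[:, None] * s[None, :])
p = 2 * scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2).cdf(-abs(r))

# Compare every off-diagonal entry with the pairwise SciPy result.
max_r_diff = 0.0
max_p_diff = 0.0
for i in range(m):
    for j in range(i + 1, m):
        r_ref, p_ref = scipy.stats.pearsonr(X[:, i], X[:, j])
        max_r_diff = max(max_r_diff, abs(r[i, j] - r_ref))
        max_p_diff = max(max_p_diff, abs(p[i, j] - p_ref))

print(max_r_diff, max_p_diff)
```

Both differences should be at the level of floating-point rounding.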
You can trivially make a mask from p with your threshold, and apply it to r to make C:
mask = (p <= 0.01)
C = np.zeros_like(r)
C[mask] = r[mask]
A better option would probably be to modify your r in-place:
r[p > 0.1] = 0
In function form:
def non_trivial_correlation(X, threshold=0.1):
    n = X.shape[0]
    X = X - X.mean(axis=0)  # Don't modify the original
    s = (X**2).sum(axis=0)  # Per-column sum of squared deviations
    r = (X[..., None] * X[..., None, :]).sum(axis=0) / np.sqrt(s[:, None] * s[None, :])
    p = 2 * scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2).cdf(-abs(r))
    r[p > threshold] = 0
    return r
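The answer mentions that C could also be computed using pandas. A minimal sketch of that route follows, under stated assumptions: the column names a, b, c and the variable names are made up for illustration, and the p-values are obtained by passing a callable to DataFrame.corr, which pandas documents as always returning 1 on the diagonal regardless of the callable.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

df = pd.DataFrame([[1, 1, -2],
                   [0, 0, 0],
                   [0, .2, 1],
                   [5, 3, 4]], columns=['a', 'b', 'c'], dtype=float)
threshold = 0.1

r = df.corr()  # Pearson correlation matrix (the default method)

# With a callable, DataFrame.corr evaluates it for every column pair;
# here the callable returns the p-value instead of a correlation.
p = df.corr(method=lambda x, y: pearsonr(x, y)[1])

# The diagonal of p is 1 by construction, so keep the diagonal explicitly.
keep = (p <= threshold) | np.eye(len(r), dtype=bool)
C = r.where(keep, 0.0)
print(C)
```

This is slower than the NumPy version because pearsonr is called once per pair, but it preserves the column labels.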
Calculating Pearson correlation
Use scipy:
scipy.stats.pearsonr(x, y)
Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.
The Pearson correlation coefficient measures the linear relationship between two datasets. Strictly speaking, Pearson’s correlation requires that each dataset be normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.
Parameters:
x : 1D array
y : 1D array, the same length as x
Returns:
(Pearson’s correlation coefficient, 2-tailed p-value)
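As a small illustration of the documented behaviour, the sketch below calls pearsonr on data with an exact linear relationship, where the coefficient should be +1 or -1 up to rounding. The arrays and variable names are invented for this example.

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2.0 * x + 1.0     # increases exactly linearly with x
y_down = -0.5 * x + 3.0  # decreases exactly linearly with x

# pearsonr returns the coefficient and the two-tailed p-value.
r_up, p_up = pearsonr(x, y_up)
r_down, p_down = pearsonr(x, y_down)

print(r_up, p_up)
print(r_down, p_down)
```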
Estimate Pearson correlation coefficient from stream of data
Yes, this can be computed incrementally. The method is a small generalisation of Welford's algorithm (see here, for example).
You maintain a number of variables, updating them each time data comes in. At each stage these hold the means, variances and covariance of the data seen so far.
Initialisation:
int n = 0; // number of points
double mx = 0.0; // mean of x's
double my = 0.0; // mean of y's
double vx = 0.0; // variance of x's
double vy = 0.0; // variance of y's
double cxy = 0.0; // covariance of x and y
Update (new values x, y come in):
n += 1;
double f = 1.0/n;
double dx = x - mx;
double dy = y - my;
mx += f*dx;
my += f*dy;
vx = (1.0-f)*(vx + f*dx*dx);
vy = (1.0-f)*(vy + f*dy*dy);
cxy= (1.0-f)*(cxy+ f*dx*dy);
In terms of these variables we have
rxy = cxy/sqrt( vx*vy)
Note though that vx and vy will be zero after just one pair has been seen, so rxy is undefined (a division by zero) until at least two distinct pairs have arrived.
Don't be surprised if the stream of estimates for rxy is noisy. Estimates of correlation tend to be so.
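The update rules above translate almost line for line into Python. The following is a minimal sketch rather than a reference implementation; the class name StreamingPearson, its method names, and the synthetic test data are assumptions made for illustration. It returns NaN while the variances are still zero.

```python
import math
import numpy as np

class StreamingPearson:
    """Running Pearson correlation using the update rules above (hypothetical helper)."""

    def __init__(self):
        self.n = 0                  # number of points
        self.mx = self.my = 0.0     # running means
        self.vx = self.vy = 0.0     # running (population) variances
        self.cxy = 0.0              # running (population) covariance

    def update(self, x, y):
        self.n += 1
        f = 1.0 / self.n
        dx = x - self.mx            # deviations from the old means
        dy = y - self.my
        self.mx += f * dx
        self.my += f * dy
        self.vx = (1.0 - f) * (self.vx + f * dx * dx)
        self.vy = (1.0 - f) * (self.vy + f * dy * dy)
        self.cxy = (1.0 - f) * (self.cxy + f * dx * dy)

    def correlation(self):
        denom = math.sqrt(self.vx * self.vy)
        return self.cxy / denom if denom > 0 else float('nan')

# Synthetic stream: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
xs = rng.normal(size=1000)
ys = 0.5 * xs + rng.normal(size=1000)

est = StreamingPearson()
for x, y in zip(xs, ys):
    est.update(x, y)

batch_r = np.corrcoef(xs, ys)[0, 1]
print(est.correlation(), batch_r)
```

After the whole stream has been consumed, the running estimate should agree with the batch value from np.corrcoef up to rounding.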
Implementing Pearson's correlation coefficient for 3 variables in scipy.stats
You can use itertools.combinations to produce the 3 pairs to calculate the correlation for:
from itertools import combinations
import scipy.stats
list_of_vars = [array_1, array_2, array_3]
results = [scipy.stats.pearsonr(*pair) for pair in combinations(list_of_vars, 2)]
The results will be a list of 2-tuples of (r, p_value), e.g.:
[(0.3488605505012684, 0.7731373652607254), # for <array_1, array_2>
(-0.7590110031075414, 0.45136569429566353), # for <array_1, array_3>
(0.3453846068421791, 0.7754969404436115)] # for <array_2, array_3>
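Since the list alone does not say which pair each tuple belongs to, it can help to key the results by variable name. This is a minimal sketch with made-up random data; the dictionary layout and names are assumptions, not part of the original answer.

```python
from itertools import combinations

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
# Hypothetical stand-ins for array_1, array_2 and array_3.
variables = {
    'array_1': rng.normal(size=50),
    'array_2': rng.normal(size=50),
    'array_3': rng.normal(size=50),
}

results = {}
# combinations over a dict iterates its keys, in insertion order.
for name_a, name_b in combinations(variables, 2):
    r, p_value = pearsonr(variables[name_a], variables[name_b])
    results[(name_a, name_b)] = (float(r), float(p_value))

for pair, (r, p_value) in results.items():
    print(pair, r, p_value)
```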
How do I determine a correlation coefficient in Python?
Easiest would be to use scipy.stats (see here)
import numpy as np
from scipy.stats import pearsonr
x = np.random.random(20)
y = np.random.random(20)
print(pearsonr(x, y))
This will give you two values, the correlation and the p-value.
You can implement it yourself like this:
# Reuse the x and y defined above so that the two results can be compared
x_bar = np.mean(x)
y_bar = np.mean(y)
top = np.sum((x - x_bar) * (y - y_bar))
bot = np.sqrt(np.sum(np.power(x - x_bar, 2)) * np.sum(np.power(y - y_bar, 2)))
print(top/bot)
Both approaches give the same result for the same x and y. Good luck!
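To see the agreement concretely, the sketch below runs both approaches on the same seeded data and also compares them with np.corrcoef, which is an addition not mentioned in the answer above. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.random(20)
y = rng.random(20)

# SciPy: returns the coefficient and the p-value.
r_scipy, p_value = pearsonr(x, y)

# Manual formula: covariance term over the product of spreads.
dx = x - x.mean()
dy = y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# NumPy: off-diagonal entry of the 2x2 correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_scipy, r_manual, r_numpy)
```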