
# Use .corr to Get the Correlation Between Two Columns

## Use .corr to get the correlation between two columns

Without actual data it is hard to answer the question, but I guess you are looking for something like this:

``Top15['Citable docs per Capita'].corr(Top15['Energy Supply per Capita'])``

That calculates the correlation between your two columns `'Citable docs per Capita'` and `'Energy Supply per Capita'`.

To give an example:

```
import pandas as pd

df = pd.DataFrame({'A': range(4), 'B': [2*i for i in range(4)]})

   A  B
0  0  0
1  1  2
2  2  4
3  3  6
```

Then

``df['A'].corr(df['B'])``

gives `1` as expected.

Now, if you change a value, e.g.

```
df.loc[2, 'B'] = 4.5

   A    B
0  0  0.0
1  1  2.0
2  2  4.5
3  3  6.0
```

the command

``df['A'].corr(df['B'])``

returns

``0.99586``

which is still close to 1, as expected.
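By default, `Series.corr` computes the Pearson correlation coefficient. As a sanity check, the value above can be reproduced by hand with NumPy. This is only a minimal sketch that rebuilds the modified example frame; it is not meant to mirror how pandas computes the value internally.

```python
import numpy as np
import pandas as pd

# Same data as the modified example: B[2] has been changed to 4.5
df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [0.0, 2.0, 4.5, 6.0]})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the respective column means
dx = df['A'].to_numpy() - df['A'].mean()
dy = df['B'].to_numpy() - df['B'].mean()
manual_r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

pandas_r = df['A'].corr(df['B'])
print(round(manual_r, 6), round(pandas_r, 6))  # both round to 0.995862
```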

If you apply `.corr` directly to your dataframe, it will return all pairwise correlations between your columns; that's why you then observe `1`s on the diagonal of your matrix (each column is perfectly correlated with itself).

``df.corr()``

will therefore return

```
          A         B
A  1.000000  0.995862
B  0.995862  1.000000
```

In the graphic you show, only the upper left corner of the correlation matrix is represented (I assume).

There can be cases where you get `NaN`s in your result; check this post for an example.
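One common way to end up with `NaN` (an illustration only, not necessarily the case the linked post covers) is a column that never varies: its standard deviation is zero, so the correlation is undefined. The frame and the column name `const` below are made up for this sketch.

```python
import pandas as pd

# Hypothetical data: 'const' holds a single repeated value, so it has zero variance
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [2, 4, 5, 9],
    'const': [5, 5, 5, 5],
})

corr = df.corr()
print(corr)

# A vs. B is an ordinary number, A vs. const is NaN
a_vs_b = corr.loc['A', 'B']
a_vs_const = corr.loc['A', 'const']
```

Checking `df.nunique()` for columns with a single distinct value is a quick way to spot this situation before computing the matrix.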

If you want to filter entries above/below a certain threshold, you can check this question.
If you want to plot a heatmap of the correlation coefficients, you can check this answer, and if you then run into the issue of overlapping axis labels, check the following post.
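For the threshold case, one possible approach (a sketch, not necessarily what the linked question suggests) is to keep only the upper triangle of the matrix, so each pair appears once and the diagonal is dropped, and then filter by absolute value. The data and the `0.9` cut-off are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    'A': [0, 1, 2, 3, 4],
    'B': [0, 2, 4.5, 6, 8],
    'C': [3, 1, 4, 1, 5],
})

corr = df.corr()

# Boolean mask selecting the strict upper triangle (k=1 skips the diagonal)
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# One entry per column pair, indexed by (row label, column label)
pairs = corr.where(upper).stack()

# Keep only strongly correlated pairs; NaN entries compare as False and drop out
threshold = 0.9
strong = pairs[pairs.abs() > threshold]
print(strong)
```

Here only the pair `('A', 'B')` passes the cut-off.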

## Correlation coefficient of two columns in pandas dataframe with .corr()

Calling `.corr()` on the entire DataFrame gives you a full correlation matrix:

```
>>> table.corr()
        Group     Age
Group  1.0000 -0.1533
Age   -0.1533  1.0000
```

You can use the separate Series instead:

```
>>> table['Group'].corr(table['Age'])
-0.15330486289034567
```

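This should be faster than computing the full matrix and indexing into it (with `table.corr().at['Group', 'Age']`; note that label-based lookup needs `.at` or `.loc`, since `.iat` only accepts integer positions). Also, this should work whether `Group` is bool or int dtype.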
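A small sketch comparing the two routes, using made-up values for `Group` and `Age` (the original data is not shown here). Both should give the same coefficient, and the bool-typed variant should match as well.

```python
import pandas as pd

# Hypothetical data; only the column names come from the example above
table = pd.DataFrame({
    'Group': [0, 1, 0, 1, 1, 0],
    'Age': [23, 35, 41, 29, 52, 38],
})

# Correlation computed from the two Series directly
r_series = table['Group'].corr(table['Age'])

# The same value looked up by label in the full correlation matrix
r_matrix = table.corr().at['Group', 'Age']

# Series.corr with Group converted to bool dtype
r_bool = table['Group'].astype(bool).corr(table['Age'])

print(r_series, r_matrix, r_bool)
```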

## Calculate correlation between columns of strings

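You can convert the string columns to the categorical dtype and then compute the correlation on their integer codes. Keep in mind that the codes are just integer labels (assigned in sorted category order by default), so the coefficient is only meaningful if that ordering is.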

```
df['profession'] = df['profession'].astype('category').cat.codes
df['media'] = df['media'].astype('category').cat.codes
df.corr()
```
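To see what the conversion actually does, here is a minimal sketch with invented values for `profession` and `media`. It inspects the code-to-label mapping before computing the correlation and writes the codes to a new frame rather than overwriting the original strings.

```python
import pandas as pd

# Hypothetical data; only the column names come from the snippet above
df = pd.DataFrame({
    'profession': ['doctor', 'teacher', 'doctor', 'engineer', 'teacher'],
    'media': ['tv', 'web', 'tv', 'print', 'web'],
})

# Inspect which integer code each label receives (categories are sorted by default)
prof_cat = df['profession'].astype('category')
mapping = dict(enumerate(prof_cat.cat.categories))
print(mapping)  # {0: 'doctor', 1: 'engineer', 2: 'teacher'}

# Encode every column as integer codes in a separate frame
encoded = df.apply(lambda s: s.astype('category').cat.codes)
print(encoded.corr())
```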

## Python pandas correlation one column vs all

The most efficient method is to use `corrwith`.

Example:

``df.corrwith(df['A'])``

Setup of example data:

```
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(5, 5)), columns=list('ABCDE'))

#    A  B  C  D  E
# 0  7  2  0  0  0
# 1  4  4  1  7  2
# 2  6  2  0  6  6
# 3  9  8  0  2  1
# 4  6  0  9  7  7
```

output:

```
A    1.000000
B    0.526317
C   -0.209734
D   -0.720400
E   -0.326986
dtype: float64
```
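Since the setup uses random numbers, the output above is only reproducible with the exact values shown in the comment. The sketch below hard-codes those values, drops `A` itself from the comparison (its self-correlation of 1 is rarely interesting), and ranks the remaining columns by absolute correlation. The ranking step is an addition for illustration, not part of the original answer.

```python
import pandas as pd

# The example values printed above, hard-coded so the result is reproducible
df = pd.DataFrame({
    'A': [7, 4, 6, 9, 6],
    'B': [2, 4, 2, 8, 0],
    'C': [0, 1, 0, 0, 9],
    'D': [0, 7, 6, 2, 7],
    'E': [0, 2, 6, 1, 7],
})

# Correlate every other column with A, leaving out A itself
others = df.drop(columns='A').corrwith(df['A'])

# Rank by strength of the relationship, ignoring the sign
ranked = others.reindex(others.abs().sort_values(ascending=False).index)
print(ranked)
```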

## Calculate correlation between two columns based on column names

You can create a function like this (in R):

```
cor_f <- function(x) {
  cor(test[, names(test)[grepl(x, names(test))]])[2]
}

cor_f('Obs1')  # correlation between Obs1_grp1 and Obs1_grp2
# 0.3159908
```

In case you need a loop, one way would be:

```
vars <- c('Obs1', 'Obs2')
sapply(vars, function(i) cor_f(i))
```
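For readers working in pandas rather than R, a rough analogue might look like the sketch below. The frame contents are invented, the `Obs2_*` column names are assumed by analogy with `Obs1_grp1` and `Obs1_grp2`, and the Python `cor_f` is a hypothetical helper that matches columns by name prefix (the R version uses a regular-expression match via `grepl`).

```python
import pandas as pd

# Hypothetical data; column names follow the Obs<N>_grp<M> pattern used above
test = pd.DataFrame({
    'Obs1_grp1': [1.0, 2.0, 3.0, 4.0],
    'Obs1_grp2': [2.0, 1.0, 4.0, 3.0],
    'Obs2_grp1': [5.0, 3.0, 6.0, 2.0],
    'Obs2_grp2': [1.0, 4.0, 2.0, 8.0],
})


def cor_f(frame, prefix):
    """Correlation between the two columns whose names start with `prefix`."""
    cols = [c for c in frame.columns if c.startswith(prefix)]
    if len(cols) != 2:
        raise ValueError(f"expected exactly 2 columns for {prefix!r}, got {cols}")
    return frame[cols[0]].corr(frame[cols[1]])


# Equivalent of the sapply loop: one coefficient per variable prefix
result = {prefix: cor_f(test, prefix) for prefix in ['Obs1', 'Obs2']}
print(result)
```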