Pairwise Correlation Table

Bill Venables offered the following solution in an answer on the R mailing list; I've made some slight modifications to it:

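cor.prob <- function(X, dfr = nrow(X) - 2) {
  R <- cor(X)
  above <- row(R) < col(R)            # mask for the upper triangle
  r2 <- R[above]^2
  Fstat <- r2 * dfr / (1 - r2)        # F statistic for H0: correlation = 0
  R[above] <- 1 - pf(Fstat, 1, dfr)   # p-values replace the upper triangle

  cor.mat <- t(R)                     # transpose: p-values move to the lower triangle
  cor.mat[upper.tri(cor.mat)] <- NA   # blank out the correlations
  cor.mat
}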

So let's test it out:

set.seed(123)
data <- matrix(rnorm(100), 20, 5)
cor.prob(data)

          [,1]      [,2]      [,3]      [,4] [,5]
[1,] 1.0000000        NA        NA        NA   NA
[2,] 0.7005361 1.0000000        NA        NA   NA
[3,] 0.5990483 0.6816955 1.0000000        NA   NA
[4,] 0.6098357 0.3287116 0.5325167 1.0000000   NA
[5,] 0.3364028 0.1121927 0.1329906 0.5962835    1

Does that line up with cor.test?

cor.test(data[,2], data[,3])

Pearson's product-moment correlation
data: data[, 2] and data[, 3]
t = 0.4169, df = 18, p-value = 0.6817
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3603246 0.5178982
sample estimates:
cor
0.09778865

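The p-value of 0.6817 matches entry [3,2] of the table (0.6816955), so the function agrees with cor.test. Note that the table holds p-values, not correlation coefficients.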
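
For readers working in Python, here is a minimal sketch of the same idea, assuming NumPy and SciPy are available. The name cor_prob is simply borrowed from the R function above, and the random data are generated by NumPy, so the numbers will not match the R output. The sketch uses the same F statistic with 1 and n - 2 degrees of freedom and checks one entry against scipy.stats.pearsonr.

```python
import numpy as np
from scipy import stats


def cor_prob(X):
    """Return a matrix with p-values in the lower triangle,
    1 on the diagonal and NaN in the upper triangle."""
    X = np.asarray(X, dtype=float)
    dfr = X.shape[0] - 2                     # degrees of freedom, n - 2
    R = np.corrcoef(X, rowvar=False)         # columns are the variables
    lower = np.tril_indices_from(R, k=-1)    # positions below the diagonal
    r2 = R[lower] ** 2
    fstat = r2 * dfr / (1.0 - r2)            # F statistic for H0: correlation = 0
    out = np.full_like(R, np.nan)
    np.fill_diagonal(out, 1.0)
    out[lower] = stats.f.sf(fstat, 1, dfr)   # same quantity as 1 - pf(Fstat, 1, dfr)
    return out


rng = np.random.default_rng(123)
data = rng.normal(size=(20, 5))
pvals = cor_prob(data)

# Compare one entry with the p-value reported by pearsonr
r, p = stats.pearsonr(data[:, 1], data[:, 2])
print(pvals[2, 1], p)
```

The two printed values should agree up to floating-point error, which is the same check as the cor.test comparison above.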

R: Converting Large Dataframe to Pairwise Correlation Matrix

You can dcast your data to wide format first and then use the cor() function to get the correlation matrix:

library(data.table)
dt <- data.table(
group = c(rep(1,5),rep(2,5),rep(3,5),rep(4,5),rep(5,5)),
thing = c(rep(c('a','b','c','d','e'),5)),
score = c(1,1,0,0,1,1,1,0,1,0,1,1,1,0,0,0,1,1,0,1,0,1,0,1,0)
)
dt

m <- dcast(dt, group ~ thing, value.var = "score")

cor(m[, -1])

data.table usually performs well, but if it is not fast enough for your data, please post a reproducible example that generates a large amount of data so that others can benchmark the speed and memory use of different solutions.
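
The same reshape-then-correlate pattern can be sketched in Python, assuming pandas is available. The variable names long, wide and corr are illustrative only. One thing worth noticing with this particular example data: column b has the value 1 in every group, so it has zero variance and its correlations come out as NaN.

```python
import pandas as pd

# Same long-format data as the data.table example
long = pd.DataFrame({
    "group": [g for g in range(1, 6) for _ in range(5)],
    "thing": list("abcde") * 5,
    "score": [1, 1, 0, 0, 1,
              1, 1, 0, 1, 0,
              1, 1, 1, 0, 0,
              0, 1, 1, 0, 1,
              0, 1, 0, 1, 0],
})

# One row per group, one column per thing (the equivalent of dcast)
wide = long.pivot(index="group", columns="thing", values="score")

# Pairwise Pearson correlations between the "thing" columns
corr = wide.corr()
print(corr)
```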

How can I get all pairwise correlations of a list of 2d matrices in MATLAB?

You only need to reshape m so that each matrix is linearized into a column. Then corrcoef gives the desired result:

r = corrcoef(reshape(m, [], size(m,3)));
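
A rough NumPy counterpart, assuming the matrices are stacked along the last axis of a 3-D array m as in the MATLAB snippet (the shapes and random data here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.normal(size=(4, 3, 5))         # five 4x3 matrices stacked on the last axis

# Flatten each matrix into one column: column k holds matrix k
flat = m.reshape(-1, m.shape[2])

# Treat columns as variables to get the 5x5 matrix of pairwise correlations
r = np.corrcoef(flat, rowvar=False)
print(r.shape)
```

NumPy flattens in row-major order while MATLAB uses column-major order, but since every matrix is flattened the same way the correlations are unaffected.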

How to get a pairwise correlation dataframe from a matrix?

This uses only dplyr and reshape2. Does it do what you want?

library(reshape2)
library(dplyr)

set.seed(1)
n <- 10
df <- data.frame(var1 = rnorm(n), var2 = rnorm(n), var3 = rnorm(n))
melt(cor(df), id = c("Var1", "Var2")) %>%
filter(Var1 != Var2) %>%
filter(duplicated(value))

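Note that this relies on an assumption that may be a problem for you: the duplicated(value) filter only works if no two different pairs of variables have exactly the same correlation coefficient, so that the only duplicates are the symmetric entries. I hope it helps.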
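
One way to avoid the duplicate-value assumption is to select the pairs by position instead of by value. Here is a small Python sketch of that idea, assuming pandas and NumPy; the column names Var1, Var2 and value only mirror the melt output above, and the data are random.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(10, 3)), columns=["var1", "var2", "var3"])

corr = df.corr()

# Row/column positions strictly above the diagonal: each pair appears once
i, j = np.triu_indices(len(corr), k=1)

pairs = pd.DataFrame({
    "Var1": corr.index[i],
    "Var2": corr.columns[j],
    "value": corr.to_numpy()[i, j],
})
print(pairs)
```

Because the selection is positional, two pairs that happen to share the same coefficient are both kept.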

Edit 1 :

Searching online, I found the corrr package, whose correlate function returns a data.frame directly. If you want the specific output format, the following lines get you close:

install.packages('corrr')
library(corrr)
stretch(correlate(df))

After that, you still need the same trick as above to keep only the rows you want.

Edit 2 :

I searched again for something that could help. As far as I can tell, corrr is the only package that takes a matrix as input and returns a data.frame, but I tested it on a bigger matrix and it is really slow.

n <- 100
p <- 8000
m <- n * p
df <- data.frame(matrix(rnorm(m), nrow = n, ncol = p))

t <- Sys.time()
m_cor <- cor(df)
t <- Sys.time() - t
t1 <- t

library(corrr)

t <- Sys.time()
m_cor <- correlate(df)
t <- Sys.time() - t
t2 <- t

library(propagate)

t <- Sys.time()
m_cor <- bigcor(df)
t <- Sys.time() - t
t3 <- t

The propagate package seems well suited to very large matrices, but I'm not sure I fully understand the ff type of its output. In my example, the base cor function is faster than the other two solutions. Have you found a solution?

I'd be happy to see what you come up with, because it is an interesting question.
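
For very wide data, one general way to bound the size of the intermediate results is to compute the correlation matrix block by block. The Python sketch below illustrates that idea with NumPy only. It is not the implementation of any of the packages timed above, and the function name blockwise_corr and the block parameter are hypothetical.

```python
import numpy as np


def blockwise_corr(X, block=256):
    """Pearson correlation matrix of the columns of X, computed in blocks."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    # Standardize each column (sample standard deviation, ddof=1)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    out = np.empty((p, p))
    for i in range(0, p, block):
        for j in range(i, p, block):
            # Correlations between column block i and column block j
            blk = Z[:, i:i + block].T @ Z[:, j:j + block] / (n - 1)
            out[i:i + block, j:j + block] = blk
            out[j:j + block, i:i + block] = blk.T   # fill the mirrored block
    return out


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
C = blockwise_corr(X, block=16)
print(np.allclose(C, np.corrcoef(X, rowvar=False)))
```

In this sketch the full result still lives in memory; for matrices that do not fit, out could be replaced by a disk-backed array such as numpy.memmap.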

Calculating pairwise correlation among all columns

from pandas import DataFrame
import numpy as np
from scipy.stats import pearsonr
import itertools

Creating random sample data:

df = DataFrame(np.random.random((5, 5)), columns=['gene_' + chr(i + ord('a')) for i in range(5)]) 
print(df)

     gene_a    gene_b    gene_c    gene_d    gene_e
0  0.471257  0.854139  0.781204  0.678567  0.697993
1  0.292909  0.046159  0.250902  0.064004  0.307537
2  0.422265  0.646988  0.084983  0.822375  0.713397
3  0.113963  0.016122  0.227566  0.206324  0.792048
4  0.357331  0.980479  0.157124  0.560889  0.973161

correlations = {}
columns = df.columns.tolist()

for col_a, col_b in itertools.combinations(columns, 2):
    correlations[col_a + '__' + col_b] = pearsonr(df.loc[:, col_a], df.loc[:, col_b])

result = DataFrame.from_dict(correlations, orient='index')
result.columns = ['PCC', 'p-value']

print(result.sort_index())

                     PCC   p-value
gene_a__gene_b  0.461357  0.434142
gene_a__gene_c  0.177936  0.774646
gene_a__gene_d -0.854884  0.064896
gene_a__gene_e -0.155440  0.802887
gene_b__gene_c -0.575056  0.310455
gene_b__gene_d -0.097054  0.876621
gene_b__gene_e  0.061175  0.922159
gene_c__gene_d -0.633302  0.251381
gene_c__gene_e -0.771120  0.126836
gene_d__gene_e  0.531805  0.356315

The steps are:

  • Get unique combinations of DataFrame columns using
    itertools.combinations(iterable, r)
  • Iterate through these combinations and calculate pairwise correlations using scipy.stats.pearsonr
  • Add results (PCC and p-value tuple) to dictionary
  • Build DataFrame from dictionary

You could then also save the result with result.to_csv(). You might find it convenient to use a MultiIndex (two index levels holding the two column names of each pair) instead of the concatenated names used above for the pairwise correlations.
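
The MultiIndex suggestion could look roughly like this. It is a sketch under the assumption that the level names col_a and col_b are acceptable; they are made up here, and the data are random, so the values differ from the output shown above.

```python
import itertools

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 5)), columns=[f"gene_{c}" for c in "abcde"])

pairs = []    # (col_a, col_b) tuples, one per unique pair
values = []   # matching (PCC, p-value) tuples
for col_a, col_b in itertools.combinations(df.columns, 2):
    r, p = pearsonr(df[col_a], df[col_b])
    pairs.append((col_a, col_b))
    values.append((r, p))

# Two index levels instead of a concatenated 'gene_a__gene_b' label
index = pd.MultiIndex.from_tuples(pairs, names=["col_a", "col_b"])
result = pd.DataFrame(values, index=index, columns=["PCC", "p-value"])

print(result)
print(result.loc["gene_a"])   # all pairs whose first column is gene_a
```

With this layout, result.to_csv() writes the two names as separate columns, and selecting by the first level needs no string splitting.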

Calculating non-pairwise correlation between different sets of columns of a dataset

You can do this with the cor function; you just need to pass the second argument (y). Example:

library(corrplot)

vertical_axis <- 1:3
horizontal_axis <- 4:8
# Correlate columns 1-3 with columns 4-8: a 3 x 5 matrix
corr <- cor(mtcars[, vertical_axis], mtcars[, horizontal_axis])
corrplot(corr)

The result:

[Figure: correlation plot demo]

NOTE: Using the corrplot function with the order parameter will produce errors or a wrong plot. If you need to reorder the variables (e.g. with hclust), reorder the correlation matrix before passing it to corrplot.
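
A comparable rectangular correlation table (one set of columns against another) can be sketched in Python with pandas. The column names and random data below are placeholders, not the mtcars data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(32, 8)), columns=list("abcdefgh"))

row_vars = ["a", "b", "c"]             # variables shown on the vertical axis
col_vars = ["d", "e", "f", "g", "h"]   # variables shown on the horizontal axis

# Full correlation matrix, then keep only the rows-by-columns block of interest
cross = df.corr().loc[row_vars, col_vars]
print(cross)
```

This computes the full square matrix and slices it, which is simple but does more work than necessary when the two sets are small compared with the whole table.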

Plot correlation matrix using pandas

You can use pyplot.matshow() from matplotlib:

import matplotlib.pyplot as plt

plt.matshow(dataframe.corr())
plt.show()

Edit:

A comment asked how to change the axis tick labels. Here's a deluxe version that is drawn on a bigger figure, has axis labels that match the dataframe, and has a colorbar legend for interpreting the color scale.

I'm including how to adjust the size and rotation of the labels, and I'm using a figure ratio that makes the colorbar and the main figure come out the same height.


EDIT 2:
In older pandas versions the df.corr() method silently ignores non-numerical columns; from pandas 2.0 on it raises an error on them unless numeric_only=True is passed. Selecting the numeric columns once with .select_dtypes(['number']) and using that selection for both the correlation and the x and y labels avoids an unwanted shift of the labels (included in the code below).

numeric = df.select_dtypes(['number'])  # numeric columns only, so labels match the matrix
f = plt.figure(figsize=(19, 15))
plt.matshow(numeric.corr(), fignum=f.number)
plt.xticks(range(numeric.shape[1]), numeric.columns, fontsize=14, rotation=45)
plt.yticks(range(numeric.shape[1]), numeric.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16)

[Figure: correlation plot example]

Create correlation table for large number of variables

example <- matrix(rep(0.8,25),5,5)

Or as @Vincent pointed out, matrix(0.8,5,5) is much better.

diag(example) <- 1

> example
     [,1] [,2] [,3] [,4] [,5]
[1,]  1.0  0.8  0.8  0.8  0.8
[2,]  0.8  1.0  0.8  0.8  0.8
[3,]  0.8  0.8  1.0  0.8  0.8
[4,]  0.8  0.8  0.8  1.0  0.8
[5,]  0.8  0.8  0.8  0.8  1.0

Then you can just scale it up to as large as you need. In your case:

mat2 <- matrix(0.8,1000,1000)
diag(mat2) <- 1
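
The same construction in Python, as a minimal NumPy sketch. The helper name equicorrelation and its rho parameter are made up for this example.

```python
import numpy as np


def equicorrelation(n, rho=0.8):
    """n x n matrix with 1 on the diagonal and rho everywhere else."""
    mat = np.full((n, n), rho)
    np.fill_diagonal(mat, 1.0)
    return mat


example = equicorrelation(5)
print(example)

mat2 = equicorrelation(1000)
print(mat2.shape)
```

Such a matrix has eigenvalues 1 - rho and 1 + (n - 1) * rho, so it is a valid (positive definite) correlation matrix as long as rho lies strictly between -1/(n - 1) and 1.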

