Pairwise Correlation Table
Bill Venables offers this solution in this answer from the R mailing list to which I've made some slight modifications:
cor.prob <- function(X, dfr = nrow(X) - 2) {
R <- cor(X)
above <- row(R) < col(R)
r2 <- R[above]^2
Fstat <- r2 * dfr / (1 - r2)
R[above] <- 1 - pf(Fstat, 1, dfr)
cor.mat <- t(R)
cor.mat[upper.tri(cor.mat)] <- NA
cor.mat
}
So let's test it out:
set.seed(123)
data <- matrix(rnorm(100), 20, 5)
cor.prob(data)
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0000000 NA NA NA NA
[2,] 0.7005361 1.0000000 NA NA NA
[3,] 0.5990483 0.6816955 1.0000000 NA NA
[4,] 0.6098357 0.3287116 0.5325167 1.0000000 NA
[5,] 0.3364028 0.1121927 0.1329906 0.5962835 1
Does that line up with cor.test?
cor.test(data[,2], data[,3])
Pearson's product-moment correlation
data: data[, 2] and data[, 3]
t = 0.4169, df = 18, p-value = 0.6817
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3603246 0.5178982
sample estimates:
cor
0.09778865
Seems to work ok.
R: Converting Large Dataframe to Pairwise Correlation Matrix
You can dcast
your data first and use cor()
function to get the correlation matrix:
library(data.table)
dt <- data.table(
group = c(rep(1,5),rep(2,5),rep(3,5),rep(4,5),rep(5,5)),
thing = c(rep(c('a','b','c','d','e'),5)),
score = c(1,1,0,0,1,1,1,0,1,0,1,1,1,0,0,0,1,1,0,1,0,1,0,1,0)
)
dt
m <- dcast(dt, group ~ thing, value.var = "score")
cor(m[, -1])
data.table
is usually performant, but if it is not working for you please write a reproducible example that generates large amount of data, somebody might benchmark speed and memory on different solutions.
How can I get all pairwise correlations of a list of 2d matrices in MATLAB?
You only need to reshape m
so that each matrix is linearized into a column. Then corrcoef
gives the desired result:
r = corrcoef(reshape(m, [], size(m,3)));
How to get a pariwise correlation dataframe from a matrix?
I use only dplyr
and reshape2
, is it working for what you want ?
library(reshape2)
library(dplyr)
set.seed(1)
n <- 10
df <- data.frame(var1 = rnorm(n), var2 = rnorm(n), var3 = rnorm(n))
melt(cor(df), id = c("Var1", "Var2")) %>%
filter(Var1 != Var2) %>%
filter(duplicated(value))
I made an assumption which can be anoying for you, it is to suppose no duplicate between correlation coefficient neither for symetric relation. I hope it will help.
Edit 1 :
I look on internet, I find the following package : corrr
in which their is a function correlate
which return you directly a data.frame. If you want the specify output, the following lines gives you something near :
install.packages('corrr')
library(corrr)
stretch(correlate(df))
But after that you need to do the same trick as above to have only what you want.
Edit 2 :
I look again on internet to see what could help you, in fact the only package which take a matrix in input and gives a data.frame in output is corrr
but I tested it on a bigger matrix and it is realy slow.
n <- 100
p <- 8000
m <- n * p
df <- data.frame(matrix(rnorm(m), nrow = n, ncol = p))
t <- Sys.time()
m_cor <- cor(df)
t <- Sys.time() - t
t1 <- t
library(corrr)
t <- Sys.time()
m_cor <- correlate(df)
t <- Sys.time() - t
t2 <- t
library(propagate)
t <- Sys.time()
m_cor <- bigcor(df)
t <- Sys.time() - t
t3 <- t
Wherease propogate
seems to be a good thing to deal with very large matrix but I'm not sure to well understand the type ff
of the output. In my exemple, the basic cor
function is faster than the two others solution. Do you find any solution ?
I will be happy to see your proposal because it is an interesting question.
Calculating pairwise correlation among all columns
from pandas import *
import numpy as np
from libraries.settings import *
from scipy.stats.stats import pearsonr
import itertools
Creating random sample data:
df = DataFrame(np.random.random((5, 5)), columns=['gene_' + chr(i + ord('a')) for i in range(5)])
print(df)
gene_a gene_b gene_c gene_d gene_e
0 0.471257 0.854139 0.781204 0.678567 0.697993
1 0.292909 0.046159 0.250902 0.064004 0.307537
2 0.422265 0.646988 0.084983 0.822375 0.713397
3 0.113963 0.016122 0.227566 0.206324 0.792048
4 0.357331 0.980479 0.157124 0.560889 0.973161
correlations = {}
columns = df.columns.tolist()
for col_a, col_b in itertools.combinations(columns, 2):
correlations[col_a + '__' + col_b] = pearsonr(df.loc[:, col_a], df.loc[:, col_b])
result = DataFrame.from_dict(correlations, orient='index')
result.columns = ['PCC', 'p-value']
print(result.sort_index())
PCC p-value
gene_a__gene_b 0.461357 0.434142
gene_a__gene_c 0.177936 0.774646
gene_a__gene_d -0.854884 0.064896
gene_a__gene_e -0.155440 0.802887
gene_b__gene_c -0.575056 0.310455
gene_b__gene_d -0.097054 0.876621
gene_b__gene_e 0.061175 0.922159
gene_c__gene_d -0.633302 0.251381
gene_c__gene_e -0.771120 0.126836
gene_d__gene_e 0.531805 0.356315
- Get unique combinations of
DataFrame
columns usingitertools.combination(iterable, r)
- Iterate through these combinations and calculate pairwise correlations using
scipy.stats.stats.personr
- Add results (PCC and p-value tuple) to
dictionary
- Build
DataFrame
fromdictionary
You could then also save result.to_csv()
. You might find it convenient to use a MultiIndex
(two columns containing the names of each columns) instead of the created names for the pairwise correlations.
Calculating non-pairwise correlation between different sets of columns of a dataset
You can do this using the cor
function, just need to specify the second parameter. Example:
vertical_axis <- 1:3
horizontal_axis <- 4:8
corr <- cor(mtcars[, vertical_axis], mtcars[, horizontal_axis])
corrplot(corr)
The result:
NOTE: Using the corrplot
function with the order
parameter will produce errors or a wrong plot. If you need to reorder the variables (ex.: with hclust
), reorder the correlation matrix before passing it to corrplot
.
Plot correlation matrix using pandas
You can use pyplot.matshow()
from matplotlib
:
import matplotlib.pyplot as plt
plt.matshow(dataframe.corr())
plt.show()
Edit:
In the comments was a request for how to change the axis tick labels. Here's a deluxe version that is drawn on a bigger figure size, has axis labels to match the dataframe, and a colorbar legend to interpret the color scale.
I'm including how to adjust the size and rotation of the labels, and I'm using a figure ratio that makes the colorbar and the main figure come out the same height.
EDIT 2:
As the df.corr() method ignores non-numerical columns, .select_dtypes(['number'])
should be used when defining the x and y labels to avoid an unwanted shift of the labels (included in the code below).
f = plt.figure(figsize=(19, 15))
plt.matshow(df.corr(), fignum=f.number)
plt.xticks(range(df.select_dtypes(['number']).shape[1]), df.select_dtypes(['number']).columns, fontsize=14, rotation=45)
plt.yticks(range(df.select_dtypes(['number']).shape[1]), df.select_dtypes(['number']).columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16);
Create correlation table for large number of variables
example <- matrix(rep(0.8,25),5,5)
Or as @Vincent pointed out, matrix(0.8,5,5)
is much better.
diag(example) <- 1
> example
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0 0.8 0.8 0.8 0.8
[2,] 0.8 1.0 0.8 0.8 0.8
[3,] 0.8 0.8 1.0 0.8 0.8
[4,] 0.8 0.8 0.8 1.0 0.8
[5,] 0.8 0.8 0.8 0.8 1.0
Then you can just scale it up to as large as you need. In your case:
mat2 <- matrix(0.8,1000,1000)
Related Topics
Colorize Parts of the Title in a Plot
Setting the Color for an Individual Data Point
Shaded Area Under Two Curves Using R
Disregarding Simple Warnings/Errors in Trycatch()
Use Ls() or Objects() to Get Objects of Class Data.Frame
R List Files with Multiple Conditions
Align Violin Plots with Dodged Box Plots
Combine a List of Matrices to a Single Matrix by Rows
Predicted Values for Logistic Regression from Glm and Stat_Smooth in Ggplot2 Are Different
Find Indices of Non Zero Elements in Matrix
Passing List of Named Parameters to Function
Declaring a Const Variable in R
Aesthetics Must Either Be Length One, or the Same Length as the Dataproblems
Row/Column Counter in 'Apply' Functions