Display Correlation Tables as Descending List
Here's one of many ways I could think of to do this. I used the reshape package because the melt() syntax was easy for me to remember, but the melt() step could pretty easily be done with base R commands:
require(reshape)
## set up dummy data
a <- rnorm(100)
b <- a + (rnorm(100, 0, 2))
c <- a + b + (rnorm(100)/10)
df <- data.frame(a, b, c)
c <- cor(df)
## c is the correlations matrix
## keep only the lower triangle by
## filling upper with NA
c[upper.tri(c, diag=TRUE)] <- NA
m <- melt(c)
## sort by descending absolute correlation
m <- m[order(- abs(m$value)), ]
## omit the NA values
dfOut <- na.omit(m)
## if you really want a list and not a data.frame
listOut <- split(dfOut, 1:nrow(dfOut))
Show correlations as an ordered list, not as a large matrix
I always use the following, where z is the correlation matrix:
zdf <- as.data.frame(as.table(z))
zdf
# Var1 Var2 Freq
# 1 a a 1.00000
# 2 b a -0.99669
# 3 c a -0.14063
# 4 d a -0.28061
# 5 e a 0.80519
Then use subset(zdf, abs(Freq) > 0.5) to select the strong correlations (this filters on magnitude, not on statistical significance).
Create table showing the sorted absolute correlation of various variables with another series
Answer by @Jon Spring is perfect. Here is the same code in base R
res1 <- c(0, 5, 2, 7, 1)
data2 <- data.frame(x1 = 1:5,                # weakly correlated
                    x2 = 14:10,              # weakly correlated, opposite direction
                    x3 = c(0, 5, 1, 6, 0),   # very similar
                    x4 = c(0, 0, 2, 7, 1))   # somewhat similar
## cor(x, y) returns a one-column matrix with one row per column of data2
correlation <- cor(data2, res1, method = "pearson")
result <- data.frame(X_var = rownames(correlation),
                     abs_cor = abs(correlation),
                     cor = correlation)
## sort by descending absolute correlation
result[order(result$abs_cor, decreasing = TRUE), ]
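For comparison, here is a minimal pandas sketch of the same ranking. It assumes the two-argument cor(x, y) call can be replaced with DataFrame.corrwith; the variable names simply mirror the R code and the result column names are my own choice.

```python
import pandas as pd

# Same toy data as the R example above.
res1 = pd.Series([0, 5, 2, 7, 1])
data2 = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [14, 13, 12, 11, 10],
    "x3": [0, 5, 1, 6, 0],
    "x4": [0, 0, 2, 7, 1],
})

# Pearson correlation of every column of data2 with res1.
correlation = data2.corrwith(res1)

# One row per variable, sorted by descending absolute correlation.
result = (
    pd.DataFrame({"cor": correlation, "abs_cor": correlation.abs()})
    .sort_values("abs_cor", ascending=False)
)
print(result)
```

Keeping the signed value next to the absolute one lets you rank by strength and still see the direction of each relationship.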
List Highest Correlation Pairs from a Large Correlation Matrix in Pandas?
You can use DataFrame.values to get a NumPy array of the data and then use NumPy functions such as argsort() to get the most correlated pairs.
But if you want to do this in pandas, you can unstack and sort the DataFrame:
import pandas as pd
import numpy as np
shape = (50, 4460)
data = np.random.normal(size=shape)
data[:, 1000] += data[:, 2000]
df = pd.DataFrame(data)
c = df.corr().abs()
s = c.unstack()
so = s.sort_values(kind="quicksort")
# the last 4460 entries are the self-correlations (1.0), so skip them
print(so[-4470:-4460])
Here is the output:
2192 1522 0.636198
1522 2192 0.636198
3677 2027 0.641817
2027 3677 0.641817
242 130 0.646760
130 242 0.646760
1171 2733 0.670048
2733 1171 0.670048
1000 2000 0.742340
2000 1000 0.742340
dtype: float64
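Every pair appears twice in that output (once as (i, j) and once as (j, i)), and the self-correlations have to be sliced off by hand. One possible way around both, sketched here under the assumption that only the strict upper triangle of the matrix is needed, is to mask the matrix before stacking. The helper name top_correlated_pairs and the smaller 200 x 20 shape are illustrative, not part of the original answer.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=5):
    """Return the n column pairs with the largest absolute correlation."""
    corr = df.corr().abs()
    # True strictly above the diagonal: each unordered pair is kept once
    # and the self-correlations on the diagonal are excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # Masked cells become NaN; dropna() removes them on any pandas version.
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 20))
data[:, 3] += data[:, 7]  # make columns 3 and 7 correlated, as in the answer
top = top_correlated_pairs(pd.DataFrame(data), n=3)
print(top)
```

The result is a Series indexed by (row, column) pairs, so top.index[0] gives the most strongly correlated pair directly.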
Is there a cleaner way to subset correlation matrices?
A better option is to create a temporary object with the cor output
tmp <- cor(numericData)
use that object to get the row/column indices and look up the matching row and column names
rc <- which(tmp < 1 & tmp > 0.8, arr.ind = TRUE)
out <- data.frame(rn = row.names(tmp)[rc[,1]], cn = colnames(tmp)[rc[,2]])
and remove the 'tmp'
rm(tmp)
Another option, which avoids the temporary object, is to convert the cor output to a data.frame with as.data.frame.table and subset it on the values in the 'Freq' column
subset(as.data.frame.table(cor(numericData)), Freq < 1 & Freq > 0.8)
A reproducible example with mtcars
subset(as.data.frame.table(cor(mtcars)), Freq < 1 & Freq > 0.8)
# Var1 Var2 Freq
#14 disp cyl 0.9020329
#15 hp cyl 0.8324475
#24 cyl disp 0.9020329
#28 wt disp 0.8879799
#35 cyl hp 0.8324475
#58 disp wt 0.8879799
Or with between
library(dplyr)
as.data.frame.table(cor(mtcars)) %>%
filter(data.table::between(Freq, 0.8, 1, incbounds = FALSE))
# Var1 Var2 Freq
#1 disp cyl 0.9020329
#2 hp cyl 0.8324475
#3 cyl disp 0.9020329
#4 wt disp 0.8879799
#5 cyl hp 0.8324475
#6 disp wt 0.8879799
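A rough pandas counterpart of the as.data.frame.table approach is sketched below. The column names Var1, Var2 and Freq are copied from the R output purely for comparison, and the three-column toy data is made up for illustration. Instead of relying on Freq < 1 to drop the diagonal, this version excludes self-pairs explicitly, which also keeps distinct variables that happen to be perfectly correlated.

```python
import pandas as pd

# Hypothetical toy data: a and b are strongly correlated, c is not.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [1, 2, 3, 5, 4],
    "c": [5, 1, 4, 2, 3],
})

# Long format: one row per (Var1, Var2) pair with its correlation in Freq.
long = (
    df.corr()
    .rename_axis(index="Var1", columns="Var2")
    .stack()
    .rename("Freq")
    .reset_index()
)

# Keep strong correlations between two different variables.
strong = long[(long["Var1"] != long["Var2"]) & (long["Freq"] > 0.8)]
print(strong)
```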
Is it possible to filter a corrplot/cormatrix in R?
The cor(x) function, when given one argument (a matrix or a data.frame), computes correlations between all pairs of variables in the columns. However, the same function can accept two arguments, cor(x, y), in which case it only computes correlations between the columns of x and the columns of y.
So in your case you can provide all your group variables as x, and the response variable as y, and then plot the result (assuming "response" is in the last column):
cors <- cor(dat[,-ncol(dat)], dat[,ncol(dat)])
corrplot::corrplot(cors)
Sorting correlation matrix
pd.concat([cor[col_name].sort_values(ascending=False)
                        .rename_axis(col_name.replace('Ply', 'index'))
                        .reset_index()
           for col_name in cor],
          axis=1)
Explanation:
pd.concat([df_1, ..., df_6], axis=1) concatenates 6 dataframes (each one is already sorted and has 2 columns: 'index_i' and 'Ply_i').
[cor[col_name] for col_name in cor] would create a list of 6 Series, where each Series is the next column of cor.
ser.sort_values(ascending=False) sorts the values of a Series ser in descending order (the index labels move together with their values).
col_name.replace('Ply', 'index') creates a new string from the string col_name by replacing 'Ply' with 'index'.
ser.rename_axis(name).reset_index() renames the index axis and extracts the index (with its name) as a new column, converting the Series into a DataFrame. The new index of this DataFrame is the default range index (from 0 to 5).
Result:
(with my randomly generated numbers)
| | index_1 | Ply_1 | index_2 | Ply_2 | index_3 | Ply_3 | index_4 | Ply_4 | index_5 | Ply_5 | index_6 | Ply_6 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Ply_1 | 1 | Ply_2 | 1 | Ply_3 | 1 | Ply_4 | 1 | Ply_5 | 1 | Ply_6 | 1 |
1 | Ply_2 | 0.387854 | Ply_1 | 0.387854 | Ply_1 | 0.258825 | Ply_1 | 0.337613 | Ply_4 | 0.0618012 | Ply_1 | 0.058282 |
2 | Ply_4 | 0.337613 | Ply_4 | 0.293496 | Ply_4 | 0.0552454 | Ply_2 | 0.293496 | Ply_2 | 0.060881 | Ply_3 | -0.207621 |
3 | Ply_3 | 0.258825 | Ply_5 | 0.060881 | Ply_2 | -0.0900126 | Ply_5 | 0.0618012 | Ply_3 | -0.110885 | Ply_2 | -0.22012 |
4 | Ply_6 | 0.058282 | Ply_3 | -0.0900126 | Ply_5 | -0.110885 | Ply_3 | 0.0552454 | Ply_1 | -0.390893 | Ply_4 | -0.291842 |
5 | Ply_5 | -0.390893 | Ply_6 | -0.22012 | Ply_6 | -0.207621 | Ply_6 | -0.291842 | Ply_6 | -0.394074 | Ply_5 | -0.394074 |
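The snippet in this answer assumes that a correlation matrix named cor already exists. A self-contained sketch, using a hypothetical 3-column random dataset in place of the original 6-column one, could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 30 random observations of 3 variables.
rng = np.random.default_rng(42)
cols = [f"Ply_{i}" for i in range(1, 4)]
cor = pd.DataFrame(rng.normal(size=(30, 3)), columns=cols).corr()

# Sort each column independently and place the results side by side.
sorted_cols = pd.concat(
    [
        cor[col_name]
        .sort_values(ascending=False)
        .rename_axis(col_name.replace("Ply", "index"))
        .reset_index()
        for col_name in cor
    ],
    axis=1,
)
print(sorted_cols)
```

Each index_i column records which variable produced the value beside it, so the first row always pairs a variable with itself.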
Output for large correlation matrices in R
Is there anything wrong with writing the matrix to a file?
z <- matrix(rnorm(10000),100)
write.csv(cor(z),file="cortmp.csv")
View(cor(z)) also works for me, although I don't know if its output is copy-and-pasteable.
For psych::corr.test
dimnames(z) <- list(1:100,1:100)
z[1,2] <- NA ## unbalance to induce sample size matrix
ct <- psych::corr.test(z)
write.csv(ct$n,file="ntmp.csv") ## sample sizes
write.csv(ct$t,file="ttmp.csv") ## t statistics
write.csv(ct$p,file="ptmp.csv") ## p-values
et cetera. (See str(ct).)
R's paradigm is that if you want to transfer information to another program you're going to output it to a file rather than copying and pasting it from the console ...