Display Correlation Tables as Descending List

Here's one of many ways I could think to do this. I used the reshape package because the melt() syntax was easy for me to remember, but the melt() command could pretty easily be done with base R commands:

library(reshape)  # provides melt()
## set up dummy data
a <- rnorm(100)
b <- a + (rnorm(100, 0, 2))
c <- a + b + (rnorm(100)/10)
df <- data.frame(a, b, c)
c <- cor(df)
## c is now the correlation matrix (it overwrites the vector c above)

## keep only the lower triangle by
## filling upper with NA
c[upper.tri(c, diag=TRUE)] <- NA

m <- melt(c)

## sort by descending absolute correlation
m <- m[order(- abs(m$value)), ]

## omit the NA values
dfOut <- na.omit(m)

## if you really want a list and not a data.frame
listOut <- split(dfOut, seq_len(nrow(dfOut)))

Show correlations as an ordered list, not as a large matrix

I always use

## z is the correlation matrix, e.g. the result of cor()
zdf <- as.data.frame(as.table(z))
zdf
# Var1 Var2 Freq
# 1 a a 1.00000
# 2 b a -0.99669
# 3 c a -0.14063
# 4 d a -0.28061
# 5 e a 0.80519

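Then use subset(zdf, abs(Freq) > 0.5) to keep only the strong correlations (here, those with an absolute value above 0.5). Note that this is a cutoff on magnitude, not a test of statistical significance.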

Create table showing the sorted absolute correlation of various variables with another series

The answer by @Jon Spring is perfect. Here is the same code in base R:

res1 <- c(0, 5, 2, 7, 1)
data2 <- data.frame(x1 = 1:5,                # uncorrelated
                    x2 = 14:10,              # uncorrelated and wrong direction
                    x3 = c(0, 5, 1, 6, 0),   # very similar
                    x4 = c(0, 0, 2, 7, 1))   # somewhat similar

correlation = cor(data2, res1, method = "pearson")
names = rownames(correlation)
abs_cor = abs(correlation)
data = data.frame(X_var = names,abs_cor = abs_cor,cor = correlation)
data[order(data$abs_cor,decreasing = TRUE),]
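
For readers working in pandas rather than R, the same table can be sketched with DataFrame.corrwith. This is only an illustration under the assumption that the data is held in a DataFrame and a Series; the names table, cor and abs_cor simply mirror the R code above.

```python
import pandas as pd

# Same toy data as the R example above.
res1 = pd.Series([0, 5, 2, 7, 1], name="res1")
data2 = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [14, 13, 12, 11, 10],
    "x3": [0, 5, 1, 6, 0],
    "x4": [0, 0, 2, 7, 1],
})

# Pearson correlation of every column of data2 with res1.
cor = data2.corrwith(res1)

# Collect the signed and absolute values, then sort by absolute value.
table = (
    pd.DataFrame({"abs_cor": cor.abs(), "cor": cor})
    .rename_axis("X_var")
    .reset_index()
    .sort_values("abs_cor", ascending=False, ignore_index=True)
)
print(table)
```

Keeping the signed column next to the absolute one preserves the direction of each relationship after sorting.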

List Highest Correlation Pairs from a Large Correlation Matrix in Pandas?

You can use DataFrame.values to get a NumPy array of the data and then use NumPy functions such as argsort() to get the most correlated pairs.
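
A minimal sketch of that NumPy route, assuming a small random matrix with one deliberately planted pair (the sizes, the seed and the factor 5 are illustrative choices, not part of the original answer):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 20))
data[:, 3] += 5 * data[:, 7]       # plant one strongly correlated pair
df = pd.DataFrame(data)

corr = df.corr().abs().values      # plain NumPy array
n = corr.shape[0]

# Indices of the strict upper triangle: each pair once, no diagonal.
rows, cols = np.triu_indices(n, k=1)
vals = corr[rows, cols]

# argsort is ascending, so reverse it and keep the k largest.
k = 5
top = np.argsort(vals)[::-1][:k]
for i in top:
    print(df.columns[rows[i]], df.columns[cols[i]], vals[i])
```

Working on the upper triangle only means the self-correlations and the mirrored duplicates never enter the sort.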

But if you want to do this in pandas, you can unstack and sort the DataFrame:

import pandas as pd
import numpy as np

shape = (50, 4460)

data = np.random.normal(size=shape)

data[:, 1000] += data[:, 2000]

df = pd.DataFrame(data)

c = df.corr().abs()

s = c.unstack()
so = s.sort_values(kind="quicksort")

# The last 4460 entries are the self-correlations (all 1.0), so skip them
print(so[-4470:-4460])

Here is the output:

2192 1522 0.636198
1522 2192 0.636198
3677 2027 0.641817
2027 3677 0.641817
242 130 0.646760
130 242 0.646760
1171 2733 0.670048
2733 1171 0.670048
1000 2000 0.742340
2000 1000 0.742340
dtype: float64
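
Every pair shows up twice in that output because the correlation matrix is symmetric. One way to list each pair only once, sketched here on a smaller random matrix (the sizes, seed and planted pair are assumptions for illustration), is to blank out the diagonal and the lower triangle before unstacking:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
data = rng.normal(size=(50, 30))
data[:, 10] += 5 * data[:, 20]     # plant one strongly correlated pair
df = pd.DataFrame(data)

c = df.corr().abs()

# True strictly above the diagonal; everything else becomes NaN.
mask = np.triu(np.ones(c.shape, dtype=bool), k=1)
upper = c.where(mask)

# unstack() gives a Series indexed by (column, row); drop the NaN cells
# and sort so the strongest pairs come first.
pairs = upper.unstack().dropna().sort_values(ascending=False)
print(pairs.head(10))
```

With 30 columns this leaves 435 unique pairs instead of 900 matrix cells, and no slicing arithmetic is needed to skip the diagonal.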

Is there a cleaner way to subset correlation matrices?

A better option is to create a temporary object with the cor output

tmp <- cor(numericData)

Use that object to get the row/column indices of the matching cells, and look up the corresponding row and column names:

rc <- which(tmp < 1 & tmp > 0.8, arr.ind = TRUE)
out <- data.frame(rn = row.names(tmp)[rc[,1]], cn = colnames(tmp)[rc[,2]])

and remove the 'tmp'

rm(tmp)

Or another option without creating any temporary object is to convert to data.frame after creating the table class, and subset the data.frame based on the values in 'Freq' column

subset(as.data.frame.table(cor(numericData)), Freq < 1 & Freq > 0.8)

A reproducible example with mtcars

subset(as.data.frame.table(cor(mtcars)), Freq < 1 & Freq > 0.8)
# Var1 Var2 Freq
#14 disp cyl 0.9020329
#15 hp cyl 0.8324475
#24 cyl disp 0.9020329
#28 wt disp 0.8879799
#35 cyl hp 0.8324475
#58 disp wt 0.8879799

Or with between

library(dplyr)
as.data.frame.table(cor(mtcars)) %>%
  filter(data.table::between(Freq, 0.8, 1, incbounds = FALSE))
# Var1 Var2 Freq
#1 disp cyl 0.9020329
#2 hp cyl 0.8324475
#3 cyl disp 0.9020329
#4 wt disp 0.8879799
#5 cyl hp 0.8324475
#6 disp wt 0.8879799
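
The same long-format filtering can be sketched in pandas. The data below is made up for illustration, and the Var1/Var2/Freq names are borrowed from the R output above. This sketch drops the diagonal by comparing names rather than by testing Freq < 1, on the assumption that floating point rounding could leave a diagonal entry marginally below 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
base = rng.normal(size=200)
df = pd.DataFrame({
    "a": base,
    "b": base + 0.1 * rng.normal(size=200),   # strongly correlated with a
    "c": rng.normal(size=200),                # independent noise
})

# Long format: one row per (Var1, Var2) pair, like as.data.frame.table().
long = (
    df.corr()
    .rename_axis("Var1")
    .reset_index()
    .melt(id_vars="Var1", var_name="Var2", value_name="Freq")
)

# Keep off-diagonal pairs whose correlation exceeds 0.8.
strong = long[(long["Var1"] != long["Var2"]) & (long["Freq"] > 0.8)]
print(strong)
```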

Is it possible to filter a corrplot/cormatrix in R?

The cor(x) function, when given a single argument (a matrix or a data.frame), computes correlations between all pairs of columns. However, the same function also accepts two arguments, cor(x, y), in which case it computes correlations only between the columns of x and the columns of y.

So in your case you can provide all your group variables as x, and the response variable as y, and then plot the result (assuming "response" is in the last column):

cors <- cor(dat[,-ncol(dat)], dat[,ncol(dat)])
corrplot::corrplot(cors)

Sorting correlation matrix

pd.concat([cor[col_name].sort_values(ascending=False)
                        .rename_axis(col_name.replace('Ply', 'index'))
                        .reset_index()
           for col_name in cor],
          axis=1)

Explanation:

  • pd.concat([df_1, ..., df_6], axis=1) concatenates 6 dataframes (each one will be already sorted and will have 2 columns: ‘index_i’ and ‘Ply_i’).

  • [cor[col_name] for col_name in cor] would create a list of 6 Series, where each Series is the next column of cor.

  • ser.sort_values(ascending=False) sorts values of a Series ser in the descending order (indices also move - together with their values).

  • col_name.replace('Ply', 'index') creates a new string from a string col_name by replacing 'Ply' with 'index'.

  • ser.rename_axis(name).reset_index() renames the index axis, and extracts the index (with its name) as a new column, converting a Series into a DataFrame. The new index of this dataframe is the default range index (from 0 to 5).

Result:

(with my randomly generated numbers)

   index_1      Ply_1  index_2       Ply_2  index_3       Ply_3  index_4      Ply_4  index_5      Ply_5  index_6      Ply_6
0  Ply_1            1  Ply_2             1  Ply_3             1  Ply_4            1  Ply_5            1  Ply_6            1
1  Ply_2     0.387854  Ply_1      0.387854  Ply_1      0.258825  Ply_1     0.337613  Ply_4    0.0618012  Ply_1     0.058282
2  Ply_4     0.337613  Ply_4      0.293496  Ply_4     0.0552454  Ply_2     0.293496  Ply_2     0.060881  Ply_3    -0.207621
3  Ply_3     0.258825  Ply_5      0.060881  Ply_2    -0.0900126  Ply_5    0.0618012  Ply_3    -0.110885  Ply_2     -0.22012
4  Ply_6     0.058282  Ply_3    -0.0900126  Ply_5     -0.110885  Ply_3    0.0552454  Ply_1    -0.390893  Ply_4    -0.291842
5  Ply_5    -0.390893  Ply_6      -0.22012  Ply_6     -0.207621  Ply_6    -0.291842  Ply_6    -0.394074  Ply_5    -0.394074
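
To try this end to end, here is a self-contained sketch. It assumes cor is a 6x6 correlation DataFrame with columns Ply_1 to Ply_6; the random input data and the seed are placeholders, so the numbers will differ from the table above.

```python
import numpy as np
import pandas as pd

# Placeholder data: 100 random observations of six "Ply" variables.
rng = np.random.default_rng(3)
cols = [f"Ply_{i}" for i in range(1, 7)]
cor = pd.DataFrame(rng.normal(size=(100, 6)), columns=cols).corr()

# Sort every column on its own, turn its index into a named column,
# and place the resulting two-column frames side by side.
result = pd.concat(
    [
        cor[col_name]
        .sort_values(ascending=False)
        .rename_axis(col_name.replace("Ply", "index"))
        .reset_index()
        for col_name in cor
    ],
    axis=1,
)
print(result)
```

Because each column is sorted independently, a row of the result no longer corresponds to a single variable; the index_i column next to each Ply_i records which variable each value belongs to.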

Output for large correlation matrices in R

Is there anything wrong with

z <- matrix(rnorm(10000),100)
write.csv(cor(z),file="cortmp.csv")

? View(cor(z)) works for me, although I don't know if it's copy-and-pasteable.

For psych::corr.test

dimnames(z) <- list(1:100,1:100)
z[1,2] <- NA ## unbalance to induce sample size matrix
ct <- psych::corr.test(z)
write.csv(ct$n,file="ntmp.csv") ## sample sizes
write.csv(ct$t,file="ttmp.csv") ## t statistics
write.csv(ct$p,file="ptmp.csv") ## p-values

et cetera. (See str(ct).)

R's paradigm is that if you want to transfer information to another program you're going to output it to a file rather than copying and pasting it from the console ...


