Show Correlations as an Ordered List, Not as a Large Matrix

I always convert the correlation matrix (here z) to a long data frame:

zdf <- as.data.frame(as.table(z))
zdf
#   Var1 Var2     Freq
# 1    a    a  1.00000
# 2    b    a -0.99669
# 3    c    a -0.14063
# 4    d    a -0.28061
# 5    e    a  0.80519

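Then use subset(zdf, abs(Freq) > 0.5) to keep only the strong correlations. Note that this threshold selects by magnitude; it is not a test of statistical significance.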
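For readers working in pandas rather than R, the same idea (reshape the matrix to long form, then filter by magnitude) can be sketched as follows. The toy data, the 0.5 cutoff, and the reuse of the Var1/Var2/Freq column names are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for this sketch): b is almost exactly -a, c is independent noise.
rng = np.random.default_rng(2)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": -a + 0.05 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

z = df.corr()

# Long format: one row per (Var2, Var1) pair, mirroring as.data.frame(as.table(z)).
zdf = (
    z.rename_axis(index="Var1", columns="Var2")
     .unstack()
     .rename("Freq")
     .reset_index()
)

# Keep only pairs whose absolute correlation exceeds the cutoff.
strong = zdf[zdf["Freq"].abs() > 0.5]
print(strong)
```

The diagonal (each variable with itself) passes the filter as well, so drop rows where Var1 equals Var2 if they are not wanted.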

List Highest Correlation Pairs from a Large Correlation Matrix in Pandas?

You can use DataFrame.values to get a NumPy array of the data and then use NumPy functions such as argsort() to find the most correlated pairs.
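A minimal sketch of the NumPy route, assuming a small random data set; the helper name top_pairs_numpy and the parameter n are hypothetical and not part of any library. It reads only the strict upper triangle, so each pair appears once and the diagonal is skipped.

```python
import numpy as np
import pandas as pd

def top_pairs_numpy(df, n=5):
    """Return the n column pairs with the largest absolute correlation (hypothetical helper)."""
    corr = df.corr().abs().to_numpy()
    # Indices of the strict upper triangle: each pair once, no diagonal.
    rows, cols = np.triu_indices_from(corr, k=1)
    values = corr[rows, cols]
    # argsort is ascending, so reverse it and keep the first n positions.
    order = np.argsort(values)[::-1][:n]
    return [(df.columns[rows[i]], df.columns[cols[i]], values[i]) for i in order]

# Illustrative data: columns 3 and 7 are made strongly correlated on purpose.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 20))
data[:, 3] += 5 * data[:, 7]

pairs = top_pairs_numpy(pd.DataFrame(data), n=3)
for left, right, value in pairs:
    print(left, right, round(value, 3))
```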

But if you want to do this in pandas, you can unstack and sort the DataFrame:

import pandas as pd
import numpy as np

shape = (50, 4460)

data = np.random.normal(size=shape)

data[:, 1000] += data[:, 2000]

df = pd.DataFrame(data)

c = df.corr().abs()

s = c.unstack()
so = s.sort_values(kind="quicksort")

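# The final 4460 entries are the diagonal self-correlations,
# so print the 10 entries just before them
print(so[-4470:-4460])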

Here is the output:

2192  1522    0.636198
1522  2192    0.636198
3677  2027    0.641817
2027  3677    0.641817
242   130     0.646760
130   242     0.646760
1171  2733    0.670048
2733  1171    0.670048
1000  2000    0.742340
2000  1000    0.742340
dtype: float64
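Every pair shows up twice in that output because the correlation matrix is symmetric. One possible way to avoid the duplicates (and the diagonal) is to blank out everything except the strict upper triangle before unstacking. This is a sketch on a smaller, made-up data set, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Smaller illustrative data set; columns 10 and 20 are correlated on purpose.
rng = np.random.default_rng(1)
data = rng.normal(size=(50, 30))
data[:, 10] += 3 * data[:, 20]
df = pd.DataFrame(data)

c = df.corr().abs()

# Boolean mask that is True only strictly above the diagonal.
upper = np.triu(np.ones(c.shape, dtype=bool), k=1)

# where() turns every other cell into NaN; dropna() then removes those cells.
s = c.where(upper).unstack().dropna()

top = s.sort_values(ascending=False).head(5)
print(top)
```

With the mask applied, a plain head() on the sorted Series is enough and no slice arithmetic around the diagonal entries is needed.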

Display Correlation Tables as Descending List

Here's one of many ways I could think to do this. I used the reshape package because the melt() syntax was easy for me to remember, but the melt() command could pretty easily be done with base R commands:

require(reshape)
## set up dummy data
a <- rnorm(100)
b <- a + (rnorm(100, 0, 2))
c <- a + b + (rnorm(100)/10)
df <- data.frame(a, b, c)
c <- cor(df)
## c is the correlations matrix

## keep only the lower triangle by
## filling upper with NA
c[upper.tri(c, diag=TRUE)] <- NA

m <- melt(c)

## sort by descending absolute correlation
m <- m[order(- abs(m$value)), ]

## omit the NA values
dfOut <- na.omit(m)

## if you really want a list and not a data.frame
listOut <- split(dfOut, 1:nrow(dfOut))

Returning the highest and lowest correlations from a correlation matrix in pandas

Your conditions are hard to generalize into one command, but here is one approach you can take.

Remove the diagonal

import numpy as np
np.fill_diagonal(corr.values, np.nan)
print(corr)
# A B C D E
#A NaN 0.65 0.31 0.94 0.55
#B 0.87 NaN 0.96 0.67 0.41
#C 0.95 0.88 NaN 0.72 0.69
#D 0.64 0.84 0.99 NaN 0.78
#E 0.71 0.62 0.89 0.32 NaN

Find Top 2 and Bottom Column Names

You can use the answer on Find names of top-n highest-value columns in each pandas dataframe row to get the top 2 and bottom one value for each row (Stock).

order_top2 = np.argsort(-corr.values, axis=1)[:, :2]
order_bottom = np.argsort(corr.values, axis=1)[:, :1]

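# Convert the column Index to a NumPy array before indexing it with a 2-D array
result_top2 = pd.DataFrame(
    corr.columns.to_numpy()[order_top2],
    columns=['1st', '2nd'],
    index=corr.index
)

result_bottom = pd.DataFrame(
    corr.columns.to_numpy()[order_bottom],
    columns=['Last'],
    index=corr.index
)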

result = result_top2.join(result_bottom)
# 1st 2nd Last
#A D B C
#B C A E
#C A B E
#D C B A
#E C A D

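Now use pandas.DataFrame.lookup to grab the corresponding column value in corr for each column in result. Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so this step requires an older pandas.

for x in result.columns:
    result[x+"_Val"] = corr.lookup(corr.index, result[x])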
print(result)
# 1st 2nd Last 1st_Val 2nd_Val Last_Val
#A D B C 0.94 0.65 0.31
#B C A E 0.96 0.87 0.41
#C A B E 0.95 0.88 0.69
#D C B A 0.99 0.84 0.64
#E C A D 0.89 0.71 0.32
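On pandas versions without DataFrame.lookup, the same values can be fetched with NumPy integer indexing. The sketch below rebuilds the example matrix shown earlier so that it runs on its own; the variable names values, row_pos and col_pos are illustrative choices, and this is one possible replacement rather than the original author's method.

```python
import numpy as np
import pandas as pd

# Rebuild the example correlation matrix, with the diagonal already set to NaN.
labels = list("ABCDE")
corr = pd.DataFrame(
    [[np.nan, 0.65, 0.31, 0.94, 0.55],
     [0.87, np.nan, 0.96, 0.67, 0.41],
     [0.95, 0.88, np.nan, 0.72, 0.69],
     [0.64, 0.84, 0.99, np.nan, 0.78],
     [0.71, 0.62, 0.89, 0.32, np.nan]],
    index=labels, columns=labels,
)
values = corr.to_numpy()
columns = corr.columns.to_numpy()

# NaN sorts last in argsort, so the diagonal never lands in the first positions.
order_top2 = np.argsort(-values, axis=1)[:, :2]
order_bottom = np.argsort(values, axis=1)[:, 0]

result = pd.DataFrame(columns[order_top2], columns=["1st", "2nd"], index=corr.index)
result["Last"] = columns[order_bottom]

# For each row, pick the value in the column named by result[name].
row_pos = np.arange(len(corr))
for name in ["1st", "2nd", "Last"]:
    col_pos = corr.columns.get_indexer(result[name])
    result[name + "_Val"] = values[row_pos, col_pos]

print(result)
```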

Reorder columns (optional)

print(result[['1st', '1st_Val', '2nd', '2nd_Val', 'Last', 'Last_Val']])
# 1st 1st_Val 2nd 2nd_Val Last Last_Val
#A D 0.94 B 0.65 C 0.31
#B C 0.96 A 0.87 E 0.41
#C A 0.95 B 0.88 E 0.69
#D C 0.99 B 0.84 A 0.64
#E C 0.89 A 0.71 D 0.32

Why I can't make a correlation matrix with a larger number of values

Can you try this code?

# Import libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# In case you want correlation matrix for specific columns from df_full
df_regr2 = df_full[['var1', 'var2', 'var3']]


# Get correlations
df_corr2 = df_regr2.corr()

fig, ax = plt.subplots(figsize=(8, 6))

ax.set_title('Title Name')

# Mask the upper triangle (including the diagonal); use the builtin bool,
# since the np.bool alias is unavailable in some NumPy releases
mask = np.triu(np.ones_like(df_corr2, dtype=bool))

# Adjust mask and df
mask = mask[1:, :-1]
corr = df_corr2.iloc[1:,:-1].copy()

# Plot heatmap
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap=plt.get_cmap('RdYlGn'),
            vmin=-1, vmax=1, cbar_kws={"shrink": .8})
# Yticks
plt.yticks(rotation=0)
plt.show()

Sort for top matrix correlations and remove reverse duplicates without apply

One option is to set either the upper or the lower triangle (upper.tri or lower.tri) to NA and then melt. This has the advantage of doing the work up front, with no post-processing needed. For large datasets, it is better to pre-process this way than to convert to a long dataset and then remove the duplicates.

library(reshape2)
m1[lower.tri(m1, diag = TRUE)] <- NA
melt(m1, na.rm = TRUE)

NOTE: Also, no need for any additional packages except the one the OP is already using

Find the pair of most correlated variables

Solution using corrr:

corrr is a package for exploring correlations in R. It focuses on
creating and working with data frames of correlations

library(corrr)
library(dplyr)  # provides arrange()
matrix(rnorm(100), 5) %>%
  correlate() %>%
  stretch() %>%
  arrange(r)

Solution using reshape2 & data.table:

You can melt the cor() result with reshape2::melt, convert it to a data.table, and order (sort) it by correlation value.

library(data.table)
corMatrix <- cor(matrix(rnorm(100), 5))
setDT(reshape2::melt(corMatrix))[order(value)]

