Show correlations as an ordered list, not as a large matrix
I always use
zdf <- as.data.frame(as.table(z))
zdf
# Var1 Var2 Freq
# 1 a a 1.00000
# 2 b a -0.99669
# 3 c a -0.14063
# 4 d a -0.28061
# 5 e a 0.80519
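Then use subset(zdf, abs(Freq) > 0.5)
to keep only the pairs whose correlation exceeds 0.5 in absolute value. Note that this is a cutoff on magnitude, not a test of statistical significance.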
List Highest Correlation Pairs from a Large Correlation Matrix in Pandas?
You can use DataFrame.values
to get a NumPy array of the data and then use NumPy functions such as argsort()
to get the most correlated pairs.
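The NumPy route is only named here, not shown. A minimal sketch of it, assuming a small synthetic DataFrame (the data, the column names, and the top-3 cutoff are illustrative and not from the original answer), might look like this:

```python
import numpy as np
import pandas as pd

# Synthetic data; the column names are made up for illustration.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 6))
data[:, 1] += data[:, 4]  # make columns "b" and "e" correlated
df = pd.DataFrame(data, columns=list("abcdef"))

corr = df.corr().abs().values  # plain NumPy array of |r|

# Indices strictly above the diagonal: each pair once, no self-correlations.
rows, cols = np.triu_indices_from(corr, k=1)

# argsort is ascending, so reverse it to get the strongest pairs first.
order = np.argsort(corr[rows, cols])[::-1]

top_pairs = [
    (df.columns[rows[i]], df.columns[cols[i]], corr[rows[i], cols[i]])
    for i in order[:3]
]
for left, right, r in top_pairs:
    print(f"{left} {right} {r:.3f}")
```

Working on the upper triangle only avoids both the diagonal of 1.0 values and the mirrored duplicates of each pair.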
But if you want to do this in pandas, you can unstack
and sort the DataFrame:
import pandas as pd
import numpy as np
shape = (50, 4460)
data = np.random.normal(size=shape)
data[:, 1000] += data[:, 2000]
df = pd.DataFrame(data)
c = df.corr().abs()
s = c.unstack()
so = s.sort_values(kind="quicksort")
print(so[-4470:-4460])
Here is the output (the slice skips the last 4460 entries, which are the diagonal self-correlations of 1.0):
2192 1522 0.636198
1522 2192 0.636198
3677 2027 0.641817
2027 3677 0.641817
242 130 0.646760
130 242 0.646760
1171 2733 0.670048
2733 1171 0.670048
1000 2000 0.742340
2000 1000 0.742340
dtype: float64
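Each pair appears twice in this output because the correlation matrix is symmetric. One way to list every pair only once, sketched here on a smaller synthetic frame (the sizes and the seeded generator are assumptions for illustration), is to blank out the diagonal and lower triangle before unstacking:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 5))
data[:, 0] += data[:, 3]  # make columns 0 and 3 correlated
df = pd.DataFrame(data)

c = df.corr().abs()

# Boolean mask that is True strictly above the diagonal.
upper = np.triu(np.ones(c.shape, dtype=bool), k=1)

# where() turns everything outside the mask into NaN, unstack() flattens
# the matrix into a Series, and dropna() removes the blanked entries.
pairs = c.where(upper).unstack().dropna().sort_values(ascending=False)
print(pairs.head())
```

With the diagonal gone there is no need to guess how many trailing entries to slice off; `pairs.head(n)` gives the n strongest pairs directly.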
Display Correlation Tables as Descending List
Here's one of many ways I could think to do this. I used the reshape package because the melt()
syntax was easy for me to remember, but the melt()
command could pretty easily be done with base R commands:
require(reshape)
## set up dummy data
a <- rnorm(100)
b <- a + (rnorm(100, 0, 2))
c <- a + b + (rnorm(100)/10)
df <- data.frame(a, b, c)
c <- cor(df)
## c is the correlations matrix
## keep only the lower triangle by
## filling upper with NA
c[upper.tri(c, diag=TRUE)] <- NA
m <- melt(c)
## sort by descending absolute correlation
m <- m[order(- abs(m$value)), ]
## omit the NA values
dfOut <- na.omit(m)
## if you really want a list and not a data.frame
listOut <- split(dfOut, 1:nrow(dfOut))
Returning the highest and lowest correlations from a correlation matrix in pandas
Your conditions are hard to generalize into one command, but here is one approach you can take.
Remove the diagonal
import numpy as np
np.fill_diagonal(corr.values, np.nan)
print(corr)
# A B C D E
#A NaN 0.65 0.31 0.94 0.55
#B 0.87 NaN 0.96 0.67 0.41
#C 0.95 0.88 NaN 0.72 0.69
#D 0.64 0.84 0.99 NaN 0.78
#E 0.71 0.62 0.89 0.32 NaN
Find Top 2 and Bottom Column Names
You can use the answer on Find names of top-n highest-value columns in each pandas dataframe row to get the top 2 and bottom one value for each row (Stock).
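import pandas as pd
# Column positions sorted per row; NaN (the diagonal) sorts last in both cases
order_top2 = np.argsort(-corr.values, axis=1)[:, :2]
order_bottom = np.argsort(corr.values, axis=1)[:, :1]
# Convert the labels to a NumPy array so they can be indexed with a 2-D integer array
labels = corr.columns.to_numpy()
result_top2 = pd.DataFrame(
    labels[order_top2],
    columns=['1st', '2nd'],
    index=corr.index
)
result_bottom = pd.DataFrame(
    labels[order_bottom],
    columns=['Last'],
    index=corr.index
)
result = result_top2.join(result_bottom)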
# 1st 2nd Last
#A D B C
#B C A E
#C A B E
#D C B A
#E C A D
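Now grab the corresponding value in corr for each column in result. pandas.DataFrame.lookup did this in older pandas, but it was deprecated in 1.2 and removed in 2.0, so index the underlying array by position instead:
rows = np.arange(len(corr))
for x in list(result.columns):
    # Column position in corr of each label stored in result[x]
    cols = corr.columns.get_indexer(result[x])
    result[x + "_Val"] = corr.values[rows, cols]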
print(result)
# 1st 2nd Last 1st_Val 2nd_Val Last_Val
#A D B C 0.94 0.65 0.31
#B C A E 0.96 0.87 0.41
#C A B E 0.95 0.88 0.69
#D C B A 0.99 0.84 0.64
#E C A D 0.89 0.71 0.32
Reorder columns (optional)
print(result[['1st', '1st_Val', '2nd', '2nd_Val', 'Last', 'Last_Val']])
# 1st 1st_Val 2nd 2nd_Val Last Last_Val
#A D 0.94 B 0.65 C 0.31
#B C 0.96 A 0.87 E 0.41
#C A 0.95 B 0.88 E 0.69
#D C 0.99 B 0.84 A 0.64
#E C 0.89 A 0.71 D 0.32
Why can't I make a correlation matrix with a larger number of values
Can you try this code?
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# In case you want correlation matrix for specific columns from df_full
df_regr2 = df_full[['var1', 'var2', 'var3']]
# Get correlations
df_corr2 = df_regr2.corr()
fig, ax = plt.subplots(figsize=(8, 6))
ax.set_title('Title Name')
# Mask the upper triangle (np.bool was removed in NumPy 1.24; use the built-in bool)
mask = np.triu(np.ones_like(df_corr2, dtype=bool))
# Adjust mask and df
mask = mask[1:, :-1]
corr = df_corr2.iloc[1:,:-1].copy()
# Plot heatmap
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap=plt.get_cmap('RdYlGn'),
vmin=-1, vmax=1, cbar_kws={"shrink": .8})
# Yticks
plt.yticks(rotation=0)
plt.show()
Sort for top matrix correlations and remove reverse duplicates without apply
One option is to set either the upper.tri or the lower.tri of the matrix to NA and then melt. This has the advantage of pre-processing without having to post-process. For large datasets, it is better to pre-process than to convert to a long dataset and then remove the duplicates.
library(reshape2)
m1[lower.tri(m1, diag = TRUE)] <- NA
melt(m1, na.rm = TRUE)
NOTE: Also, no need for any additional packages except the one the OP is already using
Find the pair of most correlated variables
Solution using corrr:
corrr is a package for exploring correlations in R. It focuses on
creating and working with data frames of correlations
library(corrr)
library(dplyr)  # provides arrange() and %>%
matrix(rnorm(100), 5) %>%
correlate() %>%
stretch() %>%
arrange(r)
Solution using reshape2 & data.table:
You can use reshape2::melt (imported with data.table) on the cor result and then order (sort) by the correlation values.
library(data.table)
corMatrix <- cor(matrix(rnorm(100), 5))
setDT(melt(corMatrix))[order(value)]
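If only the single strongest pair is needed, sorting every pair is more work than necessary. A hedged Python sketch of that shortcut, assuming "most correlated" means largest absolute correlation (the data and the names v1 to v5 are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
data = rng.normal(size=(200, 5))
data[:, 2] += data[:, 4]  # make "v3" and "v5" correlated
df = pd.DataFrame(data, columns=["v1", "v2", "v3", "v4", "v5"])

# Work on a writable copy so the DataFrame itself is left untouched.
corr = df.corr().abs().to_numpy(copy=True)
np.fill_diagonal(corr, np.nan)  # ignore the self-correlations

# nanargmax returns a flat index; unravel_index converts it to (row, col).
i, j = np.unravel_index(np.nanargmax(corr), corr.shape)
best_pair = (df.columns[i], df.columns[j])
print(best_pair, round(corr[i, j], 3))
```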