Does Calculating Correlation Between Two Dataframes Require a Loop

Does calculating correlation between two dataframes require a loop?

cor accepts two data.frames:

A<-data.frame(A1=c(1,2,3,4,5),B1=c(6,7,8,9,10),C1=c(11,12,13,14,15 ))

B<-data.frame(A2=c(6,7,7,10,11),B2=c(2,1,3,8,11),C2=c(1,5,16,7,8))

cor(A,B)

# A2 B2 C2
# A1 0.9481224 0.9190183 0.459588
# B1 0.9481224 0.9190183 0.459588
# C1 0.9481224 0.9190183 0.459588

diag(cor(A,B))
#[1] 0.9481224 0.9190183 0.4595880

Edit:

Here are some benchmarks:

Unit: microseconds
expr min lq median uq max neval
diag(cor(A, B)) 230.292 238.4225 243.0115 255.0295 352.955 100
mapply(cor, A, B) 267.076 281.5120 286.8030 299.5260 375.087 100
unlist(Map(cor, A, B)) 250.053 259.1045 264.5635 275.9035 1146.140 100

Edit2:

And some better benchmarks using

set.seed(42)
A <- as.data.frame(matrix(rnorm(10*n),ncol=n))
B <- as.data.frame(matrix(rnorm(10*n),ncol=n))

Sample Image

However, I should probably mention that these benchmarks strongly depend on the number of rows.

Edit3: Since I was asked for the benchmarking code, here it is.

b <- sapply(2^(1:12), function(n) {
set.seed(42)
A <- as.data.frame(matrix(rnorm(10*n),ncol=n))
B <- as.data.frame(matrix(rnorm(10*n),ncol=n))

require(microbenchmark)
res <- print(microbenchmark(
diag(cor(A,B)),
mapply(cor, A, B),
unlist(Map(cor,A,B)),
times=10
),unit="us")
res$median
})

b <- t(b)

matplot(x=1:12,log10(b),type="l",
ylab="log10(median [µs])",
xlab="log2(n)",col=1:3,lty=1)
legend("topleft", legend=c("diag(cor(A, B))",
"mapply(cor, A, B)",
"unlist(Map(cor,A,B))"),lty=1, col=1:3)

R find correlations between dataframes of different sizes

Many R functions expect observations to be rows, and variables to be columns. You appear to have flipped this around. No worries, we can transpose and use the normal cor function.

Some example data:

set.seed(1234)
df1 <- as.data.frame(matrix(rnorm(12), ncol = 3))
df2 <- as.data.frame(matrix(rnorm(15), ncol = 3))

Now calculate the correlations:

cors <- t(cor(t(df1), t(df2)))

If you need p-values, we can do that manually, using vectorized functions:

df <- ncol(df1) - 2
t_vals <- cors * sqrt(df) / sqrt(1 - cors ^ 2)
p_vals <- 2 * min(pt(t_vals, df), pt(t_vals, df, lower.tail = FALSE))

Those p-values are two-sided.

And to check, if it all worked:

cor.test(unlist(df1[3, ]), unlist(df2[3, ]))
# Pearson's product-moment correlation
#
# data: unlist(df1[3, ]) and unlist(df2[3, ])
# t = -0.015874, df = 1, p-value = 0.9899
# alternative hypothesis: true correlation is not equal to 0
# sample estimates:
# cor
# -0.01587196
cors[3, 3]
# [1] -0.01587196
p_vals[3, 3]
# [1] 0.9898952

Looping over a list of data frames and calculate the correlation coefficient

I suggest to put your data.frames into a list and then run lapply.
Like this:

# setting up a reproducible example
data(mtcars)
data(iris)
# remove the last column cause it ain't numeric.
iris <- iris[,-5]
listOfDataFrames <- list()
listOfDataFrames[[1]] <- mtcars
listOfDataFrames[[2]] <- iris

# here's a one liner using base R.
lapply(listOfDataFrames,cor)

Welcome to SO, btw.
Ah, I guess you mean cor, right? However this works with basically any function.

Iterating over columns from two dataframes to estimate correlation and p-value

Your looping setup is incorrect on a couple measures... You are using the same variable name in both for-loops which is going to cause problems. Also, you are computing correl outside of your inner loop... etc.

What you want to do is loop over the columns with 1 loop, assuming that both data frames have the same column names. If they do not, you will need to take extra steps to find the common column names and then iterate over them.

Something like this should work:

import os
import pandas as pd
import scipy as sp
import scipy.stats as stats

df_1 = pd.DataFrame({ 'A': ['dog', 'pig', 'cat'],
'B': [0.25, 0.50, 0.75],
'C': [0.30, 0.40, 0.90]})
df_2 = pd.DataFrame({ 'A': ['bird', 'monkey', 'rat'],
'B': [0.20, 0.60, 0.90],
'C': [0.80, 0.50, 0.10]})

results = dict()
for column in df_1.columns[1:]:
correl = stats.pearsonr(df_1[column], df_2[column])
results[column] = correl

print(results)


Related Topics



Leave a reply



Submit