Generate Numbers with Specific Correlation

Generate numbers with specific correlation

Assuming you mean two normal/Gaussian vectors of values with correlation 0.56

We can use mvrnorm() from the MASS package:

library(MASS)
out <- mvrnorm(50, mu = c(0, 0), Sigma = matrix(c(1, 0.56, 0.56, 1), ncol = 2),
               empirical = TRUE)

which gives

> cor(out)
     [,1] [,2]
[1,] 1.00 0.56
[2,] 0.56 1.00

The empirical = TRUE bit is important. Without it, mu and Sigma describe the population rather than the sample, so the correlation actually achieved is subject to sampling randomness and will not be exactly the stated value, with larger discrepancies for smaller samples.
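To make the idea behind empirical = TRUE concrete, here is a rough two-variable sketch in plain Python. It is not how MASS implements it; it only illustrates the principle of standardizing and orthogonalizing the sample first so that the sample correlation comes out at the target value. The helper names (pearson, standardize, exact_corr_pair) are made up for this illustration.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def standardize(v):
    """Centre to sample mean 0 and scale to sample standard deviation 1."""
    n = len(v)
    m = sum(v) / n
    centred = [u - m for u in v]
    sd = math.sqrt(sum(u * u for u in centred) / (n - 1))
    return [u / sd for u in centred]

def exact_corr_pair(n, rho, rng):
    """Return two lists whose *sample* correlation is rho (hypothetical helper)."""
    x = standardize([rng.gauss(0, 1) for _ in range(n)])
    y = [rng.gauss(0, 1) for _ in range(n)]
    my = sum(y) / n
    y = [v - my for v in y]
    # Remove the part of y that lies along x, so the two are
    # exactly uncorrelated within this sample.
    beta = sum(u * v for u, v in zip(x, y)) / sum(u * u for u in x)
    y_perp = standardize([v - beta * u for u, v in zip(x, y)])
    # Mix the two orthogonal, unit-variance columns.
    w = math.sqrt(1 - rho ** 2)
    b = [rho * u + w * v for u, v in zip(x, y_perp)]
    return x, b

rng = random.Random(1)
a, b = exact_corr_pair(50, 0.56, rng)
print(pearson(a, b))
```

Dropping the standardize and orthogonalize steps gives the empirical = FALSE behaviour: the correlation is only 0.56 on average.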

Assuming you mean a lag 1 correlation of 0.56 & Gaussian random variables

For this one you can use the arima.sim() function:

> arima.sim(list(ar = 0.56), n = 50)
Time Series:
Start = 1
End = 50
Frequency = 1
[1] 0.62125233 -0.04742303 0.57468608 -0.07201988 -1.91416757 -1.11827563
[7] 0.15718249 0.63217365 -1.24635896 -0.22950855 -0.79918784 0.31892842
[13] 0.33335688 -1.24328177 -0.79056890 1.08443057 0.55553819 0.33460674
[19] -0.33037659 -0.65244221 0.70461755 0.61450122 0.53731454 0.19563672
[25] 1.73945110 1.27119241 0.82484460 1.58382861 1.81619212 -0.94462052
[31] -1.36024898 -0.30964390 -0.94963216 -3.75725819 -1.77342095 -1.20963799
[37] -1.76325350 -1.20556172 -0.94684678 -0.85407649 0.14922226 -0.31109945
[43] 0.39456259 0.89610859 -0.70913792 -2.27954408 -1.14722464 0.39140446
[49] 0.66376227 1.63275483
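For anyone who wants to see what an AR(1) simulation like this does, here is a minimal sketch in plain Python of the recursion x[t] = phi * x[t-1] + e[t] with standard normal noise. The function names and the burn-in length are my own assumptions, and arima.sim() handles its start-up period differently, so treat this as an illustration rather than a port.

```python
import random

def simulate_ar1(n, phi, rng, burn_in=200):
    """Simulate n values of x[t] = phi * x[t-1] + e[t], e[t] ~ N(0, 1).

    The first burn_in values are discarded so the arbitrary starting
    value of 0 has little influence on what is returned.
    """
    x = 0.0
    out = []
    for t in range(n + burn_in):
        x = phi * x + rng.gauss(0, 1)
        if t >= burn_in:
            out.append(x)
    return out

def lag1_autocorr(series):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(series)
    m = sum(series) / n
    num = sum((series[t] - m) * (series[t - 1] - m) for t in range(1, n))
    den = sum((v - m) ** 2 for v in series)
    return num / den

series = simulate_ar1(50, 0.56, random.Random(0))
print(lag1_autocorr(series))
```

As with mvrnorm() without empirical = TRUE, the lag-1 correlation of a short series like this only approximates 0.56; it gets closer as n grows.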

Python: Generating numbers according to a correlation matrix

Thank you for answering my question about what data you have access to. The error that you received was generated when you called cholesky. cholesky requires that your matrix be positive definite, and in floating point it can fail on a matrix that is only positive semidefinite. One way to check whether a matrix is positive definite is to see if all of its eigenvalues are greater than zero (positive semidefinite also allows eigenvalues equal to zero). One of the eigenvalues of your correlation/covariance matrix is nearly zero, so I think that cholesky is just being fussy. You can use scipy.linalg.sqrtm as an alternative decomposition.
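To illustrate why a zero (or nearly zero) eigenvalue trips up Cholesky, here is a small textbook Cholesky written in plain Python. It is only a sketch for intuition, not what SciPy runs (SciPy delegates to LAPACK), and the function name cholesky_lower is made up. The factorization needs a strictly positive pivot at every step, and a singular matrix produces a pivot of zero.

```python
import math

def cholesky_lower(a):
    """Return lower-triangular L with L * L^T == a, or raise ValueError
    if a pivot is not strictly positive (matrix not positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                pivot = a[i][i] - s
                if pivot <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(pivot)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

# A well-behaved correlation matrix factors without trouble.
print(cholesky_lower([[1.0, 0.56], [0.56, 1.0]]))

# Two perfectly correlated variables give a singular matrix:
# it is positive semidefinite, but the second pivot is 0.
try:
    cholesky_lower([[1.0, 1.0], [1.0, 1.0]])
except ValueError as err:
    print("failed:", err)
```

With real data the offending pivot is usually a tiny positive or negative number rather than exactly zero, which is why the failure can look arbitrary.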

For your question on the generation of multivariate normals, the random normal that you generate should be a standard random normal, i.e. a mean of 0 and a standard deviation of 1. NumPy provides a standard random normal generator with np.random.randn.
To generate a multivariate normal, you should also take the decomposition of the covariance, not the correlation matrix. The following will generate a multivariate normal using an affine transformation, as in your question.

import numpy as np
from scipy.linalg import cholesky, sqrtm
relavant_columns = ['Affecting homelife',
'Affecting mobility',
'Affecting social life/hobbies',
'Affecting work',
'Mood',
'Pain Score',
'Range of motion in Doc']

# df is a pandas DataFrame containing the data from figure 1
mu = df[relavant_columns].mean().values
cov = df[relavant_columns].cov().values
number_of_sample = 10

# generate using affine transformation
#c2 = cholesky(cov).T
c2 = sqrtm(cov).T
s = np.matmul(c2, np.random.randn(c2.shape[0], number_of_sample)) + mu.reshape(-1, 1)

# transpose so each row is a sample
s = s.T

NumPy also has a built-in function that can generate multivariate normals directly:

s = np.random.multivariate_normal(mu, cov, size=number_of_sample)

Generating correlated numbers

I ended up writing a short paper on this

It doesn't include your sorting method (although in practice I think it's similar to my first method, in a roundabout way), but does describe two ways that don't require iteration.

How to generate correlated numbers?

As an alternative, please consider the following. Let the random variables X ~ N(0,1) and Y ~ N(0,1) be independent. Then the random variables X and rho*X + sqrt(1 - rho^2)*Y are both distributed N(0,1), but are now correlated with correlation rho. So possible R code could be:

# Define the parameters
meanA <- -0.5
meanB <- 0.5
sdA <- 1
sdB <- 2
correlation <- 0.9

n <- 10000 # You want 30

# Generate from independent standard normals
x <- rnorm(n, 0, 1)
y <- rnorm(n, 0, 1)

# Transform
x2 <- x # could be avoided
y2 <- correlation*x + sqrt(1 - correlation^2)*y

# Fix up means and standard deviations
x3 <- meanA + sdA*x2
y3 <- meanB + sdB*y2

# Check summary statistics
mean(x3)
# [1] -0.4981958
mean(y3)
# [1] 0.4999068

sd(x3)
# [1] 1.014299
sd(y3)
# [1] 2.022377

cor(x3, y3)
# [1] 0.9002529
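For completeness, the claim that X and rho*X + sqrt(1 - rho^2)*Y have correlation rho can be checked in a few lines. The symbol Z below is just my shorthand for the second variable; it does not appear in the code above.

```latex
\begin{align*}
Z &= \rho X + \sqrt{1-\rho^2}\,Y, \qquad X, Y \sim N(0,1) \text{ independent} \\
\mathbb{E}[Z] &= \rho\,\mathbb{E}[X] + \sqrt{1-\rho^2}\,\mathbb{E}[Y] = 0 \\
\operatorname{Var}(Z) &= \rho^2 \operatorname{Var}(X) + (1-\rho^2)\operatorname{Var}(Y) = \rho^2 + (1-\rho^2) = 1 \\
\operatorname{Cov}(X, Z) &= \rho \operatorname{Var}(X) + \sqrt{1-\rho^2}\,\operatorname{Cov}(X, Y) = \rho \\
\operatorname{Corr}(X, Z) &= \frac{\operatorname{Cov}(X, Z)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Z)}} = \rho
\end{align*}
```

Z is normal because it is a linear combination of independent normals, and shifting and rescaling by the means and (positive) standard deviations afterwards leaves the correlation unchanged.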

R: Create dataset with specific correlation in r

Correlation isn't affected by linear transformations of the underlying variables, as long as the scale factors are positive. So the most direct way to get what you want could be:

library(MASS)  # provides mvrnorm()
out <- as.data.frame(mvrnorm(10, mu = c(0, 0),
                             Sigma = matrix(c(1, 0.56, 0.56, 1), ncol = 2),
                             empirical = TRUE))

out$V1.s <- (out$V1 - min(out$V1))*1000+10
out$V2.s <- (out$V2 - min(out$V2))*200+30

Now the data frame out has "shifted" columns V1.s and V2.s which are non-negative and "large". You can use whatever numbers you want instead of 1000, 10, 200, and 30 in my code above. The answer for the correlation will still be 0.56.

> cor(out$V1.s, out$V2.s)
[1] 0.56
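If you want to convince yourself of the invariance, here is a quick plain-Python check. The data are arbitrary made-up numbers and pearson is a hand-rolled helper; the shift and scale constants mirror the ones used above.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(10)]
# y shares part of x, so the two are correlated (exact value not important)
y = [0.56 * u + rng.gauss(0, 1) for u in x]

# Shift and rescale each variable with a positive factor
xs = [(u - min(x)) * 1000 + 10 for u in x]
ys = [(v - min(y)) * 200 + 30 for v in y]

print(pearson(x, y))
print(pearson(xs, ys))   # same value up to floating-point rounding

# A negative scale factor flips the sign of the correlation
x_neg = [-3 * u + 5 for u in x]
print(pearson(x_neg, y))
```

The first two printed values agree up to rounding, and the third has the same magnitude with the opposite sign, which is why the positive scale factors matter.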

