Generate numbers with specific correlation
Assuming you mean two normal/Gaussian vectors of values with correlation 0.56
We can use mvrnorm() from the MASS package:
require(MASS)
out <- mvrnorm(50, mu = c(0,0), Sigma = matrix(c(1,0.56,0.56,1), ncol = 2),
empirical = TRUE)
which gives
> cor(out)
[,1] [,2]
[1,] 1.00 0.56
[2,] 0.56 1.00
The empirical = TRUE argument is important. Without it, mu and Sigma describe the population rather than the sample, so the correlation actually achieved is itself subject to randomness and will not be exactly the stated value; the discrepancy tends to be larger for smaller samples.
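For readers working in Python, the same idea can be sketched with NumPy. This is not what mvrnorm() does internally; it is a minimal illustration, assuming the number of rows exceeds the number of columns, that whitens a random sample so its sample covariance is exactly the identity and then colours it with the Cholesky factor of the target matrix. The function name sample_with_exact_correlation and the seed are made up for this example.

```python
import numpy as np

def sample_with_exact_correlation(n, target_corr, seed=0):
    """Return an (n, k) array whose sample correlation matrix equals target_corr."""
    rng = np.random.default_rng(seed)
    target_corr = np.asarray(target_corr, dtype=float)
    k = target_corr.shape[0]

    # Start from independent standard normals and centre each column.
    z = rng.standard_normal((n, k))
    z -= z.mean(axis=0)

    # Whiten: after this step the sample covariance is exactly the identity.
    l_sample = np.linalg.cholesky(np.cov(z, rowvar=False))
    white = np.linalg.solve(l_sample, z.T).T

    # Colour: impose the target correlation structure.
    l_target = np.linalg.cholesky(target_corr)
    return white @ l_target.T

target = np.array([[1.0, 0.56], [0.56, 1.0]])
out = sample_with_exact_correlation(50, target)
print(np.corrcoef(out, rowvar=False))
```

Skipping the whitening step gives the empirical = FALSE behaviour: the sample correlation then only approximates 0.56.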
Assuming you mean a lag-1 correlation of 0.56 and Gaussian random variables
For this one you can use the arima.sim() function:
> arima.sim(list(ar = 0.56), n = 50)
Time Series:
Start = 1
End = 50
Frequency = 1
[1] 0.62125233 -0.04742303 0.57468608 -0.07201988 -1.91416757 -1.11827563
[7] 0.15718249 0.63217365 -1.24635896 -0.22950855 -0.79918784 0.31892842
[13] 0.33335688 -1.24328177 -0.79056890 1.08443057 0.55553819 0.33460674
[19] -0.33037659 -0.65244221 0.70461755 0.61450122 0.53731454 0.19563672
[25] 1.73945110 1.27119241 0.82484460 1.58382861 1.81619212 -0.94462052
[31] -1.36024898 -0.30964390 -0.94963216 -3.75725819 -1.77342095 -1.20963799
[37] -1.76325350 -1.20556172 -0.94684678 -0.85407649 0.14922226 -0.31109945
[43] 0.39456259 0.89610859 -0.70913792 -2.27954408 -1.14722464 0.39140446
[49] 0.66376227 1.63275483
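The series above is an AR(1) process, x[t] = 0.56 * x[t-1] + noise, whose theoretical lag-1 autocorrelation equals the AR coefficient. A rough Python sketch of the same simulation follows; simulate_ar1, lag1_autocorrelation, the burn-in length and the seed are names and values chosen for this example, not part of arima.sim().

```python
import numpy as np

def simulate_ar1(phi, n, burn_in=200, seed=0):
    """Simulate n values of x[t] = phi * x[t-1] + e[t] with standard normal e[t]."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n + burn_in)
    x = np.zeros(n + burn_in)
    for t in range(1, n + burn_in):
        x[t] = phi * x[t - 1] + eps[t]
    # Drop the burn-in so the arbitrary starting value has negligible influence.
    return x[burn_in:]

def lag1_autocorrelation(x):
    """Sample lag-1 autocorrelation of a series."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

series = simulate_ar1(0.56, n=20000)
print(lag1_autocorrelation(series))  # close to 0.56 for a long series
```

With only 50 observations, as in the R call, the sample lag-1 correlation can sit noticeably away from 0.56; the target value is a property of the process, not of any one short realisation.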
Python - Generating numbers according to a correlation matrix
Thank you for answering my question about what data you have access to. The error that you received was raised when you called cholesky. cholesky requires that your matrix be positive definite, which is stricter than positive semidefinite. One way to check is to look at the eigenvalues: a symmetric matrix is positive semidefinite if all of its eigenvalues are non-negative, and positive definite if all of them are strictly greater than zero. One of the eigenvalues of your correlation/covariance matrix is nearly zero, so I think that cholesky is just being fussy. You can use scipy.linalg.sqrtm as an alternative decomposition.
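The eigenvalue check described above can be sketched as follows. The helper psd_factor and its tolerance are assumptions made for this illustration; it uses np.linalg.eigh rather than sqrtm, clips tiny negative eigenvalues caused by rounding, and returns a factor A with A @ A.T equal to the input, which is all the affine transformation needs.

```python
import numpy as np

def psd_factor(cov, tol=1e-10):
    """Return A with A @ A.T == cov for a symmetric positive semidefinite cov."""
    cov = np.asarray(cov, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh assumes a symmetric matrix

    # A clearly negative eigenvalue means cov is not a valid covariance matrix.
    if eigvals.min() < -tol * max(eigvals.max(), 1.0):
        raise ValueError("matrix is not positive semidefinite")

    # Rounding can leave eigenvalues like -1e-17; treat those as zero.
    eigvals = np.clip(eigvals, 0.0, None)
    return eigvecs * np.sqrt(eigvals)

# A singular example: the third variable is the sum of the first two,
# so one eigenvalue is zero and the matrix is only semidefinite.
singular_cov = np.array([[1.0, 0.0, 1.0],
                         [0.0, 1.0, 1.0],
                         [1.0, 1.0, 2.0]])
print(np.linalg.eigvalsh(singular_cov))
factor = psd_factor(singular_cov)
print(np.allclose(factor @ factor.T, singular_cov))
```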
For your question on the generation of multivariate normals, the random normal that you generate should be a standard random normal, i.e. a mean of 0 and a standard deviation of 1. Numpy provides a standard random normal generator with np.random.randn.
To generate a multivariate normal, you should also take the decomposition of the covariance, not the correlation matrix. The following will generate a multivariate normal using an affine transformation, as in your question.
import numpy as np
from scipy.linalg import cholesky, sqrtm

relevant_columns = ['Affecting homelife',
                    'Affecting mobility',
                    'Affecting social life/hobbies',
                    'Affecting work',
                    'Mood',
                    'Pain Score',
                    'Range of motion in Doc']

# df is a pandas DataFrame containing the data from figure 1
mu = df[relevant_columns].mean().values
cov = df[relevant_columns].cov().values
number_of_sample = 10

# generate using an affine transformation: s = c2 z + mu, where c2 c2^T = cov
# c2 = cholesky(cov).T  # lower-triangular factor; needs a positive definite cov
c2 = sqrtm(cov).T  # matrix square root; does not need strict positive definiteness
s = np.matmul(c2, np.random.randn(c2.shape[0], number_of_sample)) + mu.reshape(-1, 1)

# transpose so each row is a sample
s = s.T
Numpy also has a built-in function which can generate multivariate normals directly:
s = np.random.multivariate_normal(mu, cov, size=number_of_sample)
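The answer stresses decomposing the covariance rather than the correlation matrix. If only a correlation matrix and per-variable standard deviations are available, the covariance can be rebuilt by scaling each entry by the two matching standard deviations. The sketch below assumes made-up values for the correlation, standard deviations and means; correlation_to_covariance is a helper written for this example.

```python
import numpy as np

def correlation_to_covariance(corr, std):
    """Scale a correlation matrix by standard deviations: cov[i, j] = corr[i, j] * std[i] * std[j]."""
    corr = np.asarray(corr, dtype=float)
    std = np.asarray(std, dtype=float)
    return corr * np.outer(std, std)

# Hypothetical inputs, chosen only for illustration.
corr = np.array([[1.0, 0.56], [0.56, 1.0]])
std = np.array([1.0, 2.0])
mean = np.array([-0.5, 0.5])

cov = correlation_to_covariance(corr, std)

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean, cov, size=5000)
print(cov)
print(np.corrcoef(samples, rowvar=False))  # off-diagonal entries near 0.56
```

Passing the correlation matrix itself as cov would still give the right correlations, but every variable would come out with unit variance.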
Generating correlated numbers
I ended up writing a short paper on this
It doesn't include your sorting method (although in practice I think it's similar to my first method, in a roundabout way), but does describe two ways that don't require iteration.
How to generate correlated numbers?
As an alternative, please consider the following. Let the random variables X ~ N(0,1) and Y ~ N(0,1) independently. Then the random variables X and rho X + sqrt(1 - rho^2) Y are both distributed N(0,1), but are now correlated with correlation rho. So possible R code could be
# Define the parameters
meanA <- -0.5
meanB <- 0.5
sdA <- 1
sdB <- 2
correlation <- 0.9
n <- 10000 # You want 30
# Generate from independent standard normals
x <- rnorm(n, 0, 1)
y <- rnorm(n, 0, 1)
# Transform
x2 <- x # could be avoided
y2 <- correlation*x + sqrt(1 - correlation^2)*y
# Fix up means and standard deviations
x3 <- meanA + sdA*x2
y3 <- meanB + sdB*y2
# Check summary statistics
mean(x3)
# [1] -0.4981958
mean(y3)
# [1] 0.4999068
sd(x3)
# [1] 1.014299
sd(y3)
# [1] 2.022377
cor(x3, y3)
# [1] 0.9002529
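For completeness, the claim that X and rho X + sqrt(1 - rho^2) Y have correlation rho can be checked in a few lines. The symbol Z is introduced here only as shorthand for the transformed variable.

```latex
\begin{align*}
Z &= \rho X + \sqrt{1-\rho^2}\, Y, \qquad X, Y \sim N(0,1) \text{ independent} \\
\operatorname{E}[Z] &= \rho \operatorname{E}[X] + \sqrt{1-\rho^2}\, \operatorname{E}[Y] = 0 \\
\operatorname{Var}(Z) &= \rho^2 \operatorname{Var}(X) + (1-\rho^2) \operatorname{Var}(Y) = \rho^2 + (1-\rho^2) = 1 \\
\operatorname{Cov}(X, Z) &= \rho \operatorname{Var}(X) + \sqrt{1-\rho^2}\, \operatorname{Cov}(X, Y) = \rho \\
\operatorname{Corr}(X, Z) &= \frac{\operatorname{Cov}(X, Z)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Z)}} = \rho
\end{align*}
```

Z is normal because it is a linear combination of independent normals, and rescaling by positive sdA and sdB and shifting by meanA and meanB leaves the correlation unchanged, which is why cor(x3, y3) lands close to 0.9 in the output above.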
R: Create dataset with specific correlation in r
Correlation isn't affected by linear transformations of the underlying variables, as long as the scale factors are positive. So the most direct way to get what you want could be:
library(MASS)  # provides mvrnorm()
out <- as.data.frame(mvrnorm(10, mu = c(0,0),
                             Sigma = matrix(c(1,0.56,0.56,1), ncol = 2),
                             empirical = TRUE))
out$V1.s <- (out$V1 - min(out$V1))*1000+10
out$V2.s <- (out$V2 - min(out$V2))*200+30
Now the data frame out has "shifted" columns V1.s and V2.s, which are non-negative and "large". You can use whatever numbers you want instead of 1000, 10, 200, and 30 in my code above. The answer for the correlation will still be 0.56.
> cor(out$V1.s, out$V2.s)
[1] 0.56
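The invariance is easy to confirm numerically. The Python sketch below uses arbitrary illustrative data (the seed, sample size and scale factors are assumptions) and also shows the one caveat: a negative scale factor flips the sign of the correlation.

```python
import numpy as np

def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = 0.56 * x + np.sqrt(1 - 0.56**2) * rng.standard_normal(1000)

base = corr(x, y)

# Same shift-and-scale pattern as V1.s and V2.s: positive scale factors.
shifted = corr((x - x.min()) * 1000 + 10, (y - y.min()) * 200 + 30)

# A negative scale factor reverses the sign but keeps the magnitude.
flipped = corr(-3 * x + 5, y)

print(base, shifted, flipped)
```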