In R, how do I find the optimal variable to maximize or minimize correlation between several datasets
Since most optimization routines work best with no constraints,
you can transform (reparametrize) the problem of finding
four numbers, x, y, z, j,
constrained to be between 0 and 1 and to sum up to 1,
into the problem of finding three real numbers q1, q2, q3
(with no constraints).
For instance, if we have a function s that maps the real line R
to the interval (0,1),
the following does the trick:
x = s(q1)
y = (1-x) * s(q2)
z = (1-x-y) * s(q3)
j = 1-x-y-z
It is probably easier to understand in two dimensions:
in this case, the set of points (x,y,z)
with coordinates between 0 and 1 and summing up to 1
is a triangle and s(q1),s(q2) form a coordinate system
for points in that triangle.
# Sample data
A <- rnorm(100)
B <- rnorm(100)
C <- rnorm(100)
D <- rnorm(100)
E <- rnorm(100)
f <- function(p) cor(p[1]*A + p[2]*B + p[3]*C + p[4]*D, E)
# Unconstrained optimization
optim(
c(1,1,1,1)/4, # Starting values
f, # Function to maximize
control=list(fnscale=-1) # Maximize (default is to minimize)
)
# Transform the parameters
sigmoid <- function(x) exp(x) / ( 1 + exp(x) )
convert <- function(p) {
q1 <- sigmoid(p[1])
q2 <- (1-q1) * sigmoid(p[2])
q3 <- (1-q1-q2) * sigmoid(p[3])
q4 <- 1-q1-q2-q3
c(q1,q2,q3,q4)
}
# Optimization
g <- function(p) f(convert(p))
p <- optim(c(0,0,0,0), g, control=list(fnscale=-1))
convert(p$par)
How to estimate the correlation between two functions with unknown variables inside?
If I am understanding correctly, you are trying to minimise the squared correlation between the output of f1 and f2 at different values of lambda. This means that for each value of lambda you are assessing, you need to feed in the complete vector of tau values. This will give a vector output for each value of lambda so that a correlation between the output from the two functions can be calculated at any single value of lambda.
To do this, we create a vectorized function that takes lambda values and calculates the squared correlation between f1 and f2 at those values of lambda across all values of tau
f3 <- function(lambda) {
sapply(lambda, function(l) {
cor(f1(l, seq(1/12, 10, 1/12)), f2(l, seq(1/12, 10, 1/12)))^2
})
}
To get the optimal value of lambda that minimizes the squared correlation, we just use optimize
:
optimize(f3, c(0, 100))$minimum
#> [1] 0.6678021
How to solve a simple optimization in R
You need to use function solnp
in package Rsolnp
because it allows constraints based on an equality.
The idea is to build a function to minimize and your equality function for the constraint.
Fun <- function(param){
e <- total_observations/(sum(param[1]*var1)+sum(param[2]*var2)+sum(param[3]*var3)+sum(param[4]*var4))
output <- (param[1]*var1 + param[2]*var2 + param[3]*var3 + param[4]*var4)/e
-cor(output, observations) #We want to maximize cor and therefore minimize -cor
}
eqn <- function(param){sum(param)}
With your example data:
observations <- c(1000,250,78,0,0,90)
total_observations=1418
var1=c(1,0.3,0.5,0.01,0.05,0.6)
var2=c(500,40,40,0,0,100)
var3=c(1,0.1,0.2,0,0.1,0)
var4=c(2,0.04,0.003,0.003,0,0.05)
Your optimization:
solnp(c(.1,.2,.3,.4),fun=Fun, eqfun=eqn, eqB=1)
Iter: 1 fn: -0.9793 Pars: 0.1395748 0.0008403 0.3881053 0.4714796
Iter: 2 fn: -0.9793 Pars: 0.1395531 0.0008406 0.3881409 0.4714653
solnp--> Completed in 2 iterations
$pars
[1] 0.1395530843 0.0008406453 0.3881409239 0.4714653466
$convergence
[1] 0
$values
[1] -0.9729894 -0.9793458 -0.9793458
$lagrange
[,1]
[1,] 2.521018e-06
$hessian
[,1] [,2] [,3] [,4]
[1,] 0.4843670 5.0498894 -0.08329380 0.39560040
[2,] 5.0498894 699.5317385 -2.38763807 -0.65610831
[3,] -0.0832938 -2.3876381 0.91837245 -0.09486495
[4,] 0.3956004 -0.6561083 -0.09486495 0.43979850
$ineqx0
NULL
$nfuneval
[1] 709
$outer.iter
[1] 2
$elapsed
Time difference of 0.2371149 secs
If you save that into a variable res
, what you're looking for is stored in res$pars
.
Constrain Optimisation Problems in R
If you want to find
the values cost[1]
,...,cost[5]
that maximize revenue[1]+...+revenue[5]
subject to the constraints cost[1]+...+cost[5]<=budget
(and 0 <= cost[i] <= budget
),
you can parametrize the set of feasible solutions
as follows
cost[1] = s(x[1]) * budget
cost[2] = s(x[2]) * ( budget - cost[1] )
cost[3] = s(x[3]) * ( budget - cost[1] - cost[2])
cost[4] = s(x[4]) * ( budget - cost[1] - cost[2] - cost[3] )
cost[5] = budget - cost[1] - cost[2] - cost[3] - cost[4]
where x[1]
,...,x[4]
are the parameters to find
(with no constraints on them)
and s
is any bijection between the real line R and the segment (0,1).
# Sample data
a <- rlnorm(5)
b <- rlnorm(5)
budget <- rlnorm(1)
# Reparametrization
s <- function(x) exp(x) / ( 1 + exp(x) )
cost <- function(x) {
cost <- rep(NA,5)
cost[1] = s(x[1]) * budget
cost[2] = s(x[2]) * ( budget - cost[1] )
cost[3] = s(x[3]) * ( budget - cost[1] - cost[2])
cost[4] = s(x[4]) * ( budget - cost[1] - cost[2] - cost[3] )
cost[5] = budget - cost[1] - cost[2] - cost[3] - cost[4]
cost
}
# Function to maximize
f <- function(x) {
result <- sum( a * cost(x) ^ b )
cat( result, "\n" )
result
}
# Optimization
r <- optim(c(0,0,0,0), f, control=list(fnscale=-1))
cost(r$par)
Optimization with Constraints
1.
Since the objective is quadratic and the constraints linear,
you can use solve.QP
.
It finds the b
that minimizes
(1/2) * t(b) %*% Dmat %*% b - t(dvec) %*% b
under the constraints
t(Amat) %*% b >= bvec.
Here, we want b
that minimizes
sum( (b-betas)^2 ) = sum(b^2) - 2 * sum(b*betas) + sum(beta^2)
= t(b) %*% t(b) - 2 * t(b) %*% betas + sum(beta^2).
Since the last term, sum(beta^2)
, is constant, we can drop it,
and we can set
Dmat = diag(n)
dvec = betas.
The constraints are
b[1] <= b[2]
b[2] <= b[3]
...
b[n-1] <= b[n]
i.e.,
-b[1] + b[2] >= 0
- b[2] + b[3] >= 0
...
- b[n-1] + b[n] >= 0
so that t(Amat)
is
[ -1 1 ]
[ -1 1 ]
[ -1 1 ]
[ ... ]
[ -1 1 ]
and bvec
is zero.
This leads to the following code.
# Sample data
betas <- c(1.2, 1.3, 1.6, 1.4, 2.2)
# Optimization
n <- length(betas)
Dmat <- diag(n)
dvec <- betas
Amat <- matrix(0,nr=n,nc=n-1)
Amat[cbind(1:(n-1), 1:(n-1))] <- -1
Amat[cbind(2:n, 1:(n-1))] <- 1
t(Amat) # Check that it looks as it should
bvec <- rep(0,n-1)
library(quadprog)
r <- solve.QP(Dmat, dvec, Amat, bvec)
# Check the result, graphically
plot(betas)
points(r$solution, pch=16)
2.
You can use constrOptim
in the same way (the objective function can be arbitrary, but the constraints have to be linear).
3.
More generally, you can use optim
if you reparametrize the problem
into a non-constrained optimization problem,
for instance
b[1] = exp(x[1])
b[2] = b[1] + exp(x[2])
...
b[n] = b[n-1] + exp(x[n-1]).
There are a few examples
here
or there.
Optimization in R: Maximizing and Minimizing Many Variables
This is called the diet problem and it is a popular introduction to linear programming (see, e.g., the first Google hit I found for the diet problem). Linear programming solvers through packages such as lpSolve
can be used to solve many variants of the diet problem.
For instance, consider the version of the problem in the link above, where you have the following foods to choose from:
(food <- data.frame(Food=c("Corn", "2% Milk", "Wheat Bread"), CostPerServing=c(.18, .23, .05), VitaminA=c(107, 500, 0), Calories=c(72, 121, 65)))
# Food CostPerServing VitaminA Calories
# 1 Corn 0.18 107 72
# 2 2% Milk 0.23 500 121
# 3 Wheat Bread 0.05 0 65
Assume you wanted to wanted to find the number of servings of each food that minimize total cost subject to the constraint that the total calories must be between 2000 and 2500 and the amount of Vitamin A must be between 5000 and 50000. If you defined variables X1, X2, and X2, then your objective is .18*X1 + .23*X2 + .05*X3, a linear function of the variables. Similarly each of your constraints in a linear function of the variables; for instance, the lower bound on the number of calories is a constraint of the form 72*X1 + 121*X2 + 65*X3 >= 2000.
The lp
function from the lpSolve
package takes as input a vector indicating the coefficients in the objective value, and information about constraints (a constraint matrix, the direction of each constraint, and the right-hand side of each constraint). For the stated problem, this would be:
library(lpSolve)
mod <- lp("min", # min/max
food$CostPerServing, # Objective
rbind(food$VitaminA, food$VitaminA, food$Calories, food$Calories), # Constraint matrix
c(">=", "<=", ">=", "<="), # Constraint directions
c(5000, 50000, 2000, 2500))
After the model has solved, you can look at the objective function and values:
mod$objval
# [1] 2.907692
mod$solution
# [1] 0.00000 10.00000 12.15385
sum(food$VitaminA * mod$solution)
# [1] 5000
sum(food$Calories * mod$solution)
# [1] 2000
The cheapest cost to meet the constraints would be $2.91, and you can achieve this by using 0 servings of corn, 10 servings of 2% milk, and 12.15 servings of wheat bread. This yields exactly 5000 units of Vitamin A and exactly 2000 calories.
Related Topics
Why Does Lm Run Out of Memory While Matrix Multiplication Works Fine for Coefficients
Specify Position of Geom_Text by Keywords Like "Top", "Bottom", "Left", "Right", "Center"
Loop for Reverse Geocoding in R
Filter Groups in Dplyr That Exclusively Contain Specific Combinations of Values
Margin Adjustments When Using Ggplot's Geom_Tile()
Text Color Based on Contrast Against Background
How to Add Abline with Lattice Xyplot Function
R Histogram with Multiple Populations
Multi Line Title in Ggplot 2 with Multiple Italicized Words
Filling Bars in Barplot with Textiles in Ggplot2
Rename Columns in Multiple Dataframes, R
Linear Model with 'Lm': How to Get Prediction Variance of Sum of Predicted Values
Alignment of Numbers on the Individual Bars with Ggplot2
R - Cumulative Sum by Condition
How to Measure Area Between 2 Distribution Curves in R/Ggplot2