In R, How to Find the Optimal Variable to Maximize or Minimize Correlation Between Several Datasets

In R, how do I find the optimal variable to maximize or minimize correlation between several datasets

Since most optimization routines work best with no constraints,
you can transform (reparametrize) the problem of finding
four numbers, x, y, z, j,
constrained to be between 0 and 1 and to sum up to 1,
into the problem of finding three real numbers q1, q2, q3
(with no constraints).
For instance, if we have a function s that maps the real line R
to the interval (0,1),
the following does the trick:

  x = s(q1)
  y = (1-x) * s(q2)
  z = (1-x-y) * s(q3)
  j = 1-x-y-z

It is probably easier to understand in two dimensions:
in this case, the set of points (x,y,z)
with coordinates between 0 and 1 and summing up to 1
is a triangle and s(q1),s(q2) form a coordinate system
for points in that triangle.

# Sample data
A <- rnorm(100)
B <- rnorm(100)
C <- rnorm(100)
D <- rnorm(100)
E <- rnorm(100)
f <- function(p) cor(p[1]*A + p[2]*B + p[3]*C + p[4]*D, E)

# Unconstrained optimization
optim(
  c(1,1,1,1)/4, # Starting values
  f,            # Function to maximize
  control=list(fnscale=-1) # Maximize (default is to minimize)
)

# Transform the parameters
sigmoid <- function(x) exp(x) / ( 1 + exp(x) )
convert <- function(p) {
  q1 <- sigmoid(p[1])
  q2 <- (1-q1) * sigmoid(p[2])
  q3 <- (1-q1-q2) * sigmoid(p[3])
  q4 <- 1-q1-q2-q3 
  c(q1,q2,q3,q4)
}

# Optimization
g <- function(p) f(convert(p))
p <- optim(c(0,0,0,0), g, control=list(fnscale=-1))
convert(p$par)

How to estimate the correlation between two functions with unknown variables inside?

If I am understanding correctly, you are trying to minimise the squared correlation between the output of f1 and f2 at different values of lambda. This means that for each value of lambda you are assessing, you need to feed in the complete vector of tau values. This will give a vector output for each value of lambda so that a correlation between the output from the two functions can be calculated at any single value of lambda.

To do this, we create a vectorized function that takes lambda values and calculates the squared correlation between f1 and f2 at those values of lambda across all values of tau

f3 <- function(lambda) {
  sapply(lambda, function(l) {
      cor(f1(l, seq(1/12, 10, 1/12)), f2(l, seq(1/12, 10, 1/12)))^2
  })
}

To get the optimal value of lambda that minimizes the squared correlation, we just use optimize:

optimize(f3, c(0, 100))$minimum
#> [1] 0.6678021

How to solve a simple optimization in R

You need to use function solnp in package Rsolnpbecause it allows constraints based on an equality.

The idea is to build a function to minimize and your equality function for the constraint.

Fun <- function(param){
    e <- total_observations/(sum(param[1]*var1)+sum(param[2]*var2)+sum(param[3]*var3)+sum(param[4]*var4))
    output <- (param[1]*var1 + param[2]*var2 + param[3]*var3 + param[4]*var4)/e
    -cor(output, observations) #We want to maximize cor and therefore minimize -cor
    }

eqn <- function(param){sum(param)}

With your example data:

observations <- c(1000,250,78,0,0,90)
total_observations=1418
var1=c(1,0.3,0.5,0.01,0.05,0.6)
var2=c(500,40,40,0,0,100)
var3=c(1,0.1,0.2,0,0.1,0)
var4=c(2,0.04,0.003,0.003,0,0.05)

Your optimization:

solnp(c(.1,.2,.3,.4),fun=Fun, eqfun=eqn, eqB=1)

Iter: 1 fn: -0.9793  Pars:  0.1395748 0.0008403 0.3881053 0.4714796
Iter: 2 fn: -0.9793  Pars:  0.1395531 0.0008406 0.3881409 0.4714653
solnp--> Completed in 2 iterations
$pars
[1] 0.1395530843 0.0008406453 0.3881409239 0.4714653466

$convergence
[1] 0

$values
[1] -0.9729894 -0.9793458 -0.9793458

$lagrange
             [,1]
[1,] 2.521018e-06

$hessian
           [,1]        [,2]        [,3]        [,4]
[1,]  0.4843670   5.0498894 -0.08329380  0.39560040
[2,]  5.0498894 699.5317385 -2.38763807 -0.65610831
[3,] -0.0832938  -2.3876381  0.91837245 -0.09486495
[4,]  0.3956004  -0.6561083 -0.09486495  0.43979850

$ineqx0
NULL

$nfuneval
[1] 709

$outer.iter
[1] 2

$elapsed
Time difference of 0.2371149 secs

If you save that into a variable res, what you're looking for is stored in res$pars.

Constrain Optimisation Problems in R

If you want to find
the values cost[1],...,cost[5]
that maximize revenue[1]+...+revenue[5]
subject to the constraints cost[1]+...+cost[5]<=budget
(and 0 <= cost[i] <= budget),
you can parametrize the set of feasible solutions
as follows

cost[1] = s(x[1]) * budget
cost[2] = s(x[2]) * ( budget - cost[1] )
cost[3] = s(x[3]) * ( budget - cost[1] - cost[2])
cost[4] = s(x[4]) * ( budget - cost[1] - cost[2] - cost[3] )
cost[5] = budget - cost[1] - cost[2] - cost[3] - cost[4]

where x[1],...,x[4] are the parameters to find
(with no constraints on them)
and s is any bijection between the real line R and the segment (0,1).

# Sample data
a <- rlnorm(5)
b <- rlnorm(5)
budget <- rlnorm(1)

# Reparametrization
s <- function(x) exp(x) / ( 1 + exp(x) )
cost <- function(x) {
  cost <- rep(NA,5)
  cost[1] = s(x[1]) * budget
  cost[2] = s(x[2]) * ( budget - cost[1] )
  cost[3] = s(x[3]) * ( budget - cost[1] - cost[2])
  cost[4] = s(x[4]) * ( budget - cost[1] - cost[2] - cost[3] )
  cost[5] = budget - cost[1] - cost[2] - cost[3] - cost[4]
  cost  
}

# Function to maximize
f <- function(x) {
  result <- sum( a * cost(x) ^ b )
  cat( result, "\n" )
  result
}

# Optimization
r <- optim(c(0,0,0,0), f, control=list(fnscale=-1))
cost(r$par)

Optimization with Constraints

1.
Since the objective is quadratic and the constraints linear,
you can use solve.QP.

It finds the b that minimizes

(1/2) * t(b) %*% Dmat %*% b - t(dvec) %*% b

under the constraints

t(Amat) %*% b >= bvec.

Here, we want b that minimizes

sum( (b-betas)^2 ) = sum(b^2) - 2 * sum(b*betas) + sum(beta^2)
                   = t(b) %*% t(b) - 2 * t(b) %*% betas + sum(beta^2).

Since the last term, sum(beta^2), is constant, we can drop it,
and we can set

Dmat = diag(n)
dvec = betas.

The constraints are

b[1] <= b[2]
b[2] <= b[3]
...
b[n-1] <= b[n]

i.e.,

-b[1] + b[2]                       >= 0
      - b[2] + b[3]                >= 0
               ...
                   - b[n-1] + b[n] >= 0

so that t(Amat) is

[ -1  1                ]
[    -1  1             ]
[       -1  1          ]
[             ...      ]
[                -1  1 ]

and bvec is zero.

This leads to the following code.

# Sample data
betas <- c(1.2, 1.3, 1.6, 1.4, 2.2)

# Optimization
n <- length(betas)
Dmat <- diag(n)
dvec <- betas
Amat <- matrix(0,nr=n,nc=n-1)
Amat[cbind(1:(n-1), 1:(n-1))] <- -1
Amat[cbind(2:n,     1:(n-1))] <-  1
t(Amat)  # Check that it looks as it should
bvec <- rep(0,n-1)
library(quadprog)
r <- solve.QP(Dmat, dvec, Amat, bvec)

# Check the result, graphically
plot(betas)
points(r$solution, pch=16)

2.
You can use constrOptim in the same way (the objective function can be arbitrary, but the constraints have to be linear).

3.
More generally, you can use optim if you reparametrize the problem
into a non-constrained optimization problem,
for instance

b[1] = exp(x[1])
b[2] = b[1] + exp(x[2])
...
b[n] = b[n-1] + exp(x[n-1]).

There are a few examples
here
or there.

Optimization in R: Maximizing and Minimizing Many Variables

This is called the diet problem and it is a popular introduction to linear programming (see, e.g., the first Google hit I found for the diet problem). Linear programming solvers through packages such as lpSolve can be used to solve many variants of the diet problem.

For instance, consider the version of the problem in the link above, where you have the following foods to choose from:

(food <- data.frame(Food=c("Corn", "2% Milk", "Wheat Bread"), CostPerServing=c(.18, .23, .05), VitaminA=c(107, 500, 0), Calories=c(72, 121, 65)))
#          Food CostPerServing VitaminA Calories
# 1        Corn           0.18      107       72
# 2     2% Milk           0.23      500      121
# 3 Wheat Bread           0.05        0       65

Assume you wanted to wanted to find the number of servings of each food that minimize total cost subject to the constraint that the total calories must be between 2000 and 2500 and the amount of Vitamin A must be between 5000 and 50000. If you defined variables X1, X2, and X2, then your objective is .18*X1 + .23*X2 + .05*X3, a linear function of the variables. Similarly each of your constraints in a linear function of the variables; for instance, the lower bound on the number of calories is a constraint of the form 72*X1 + 121*X2 + 65*X3 >= 2000.

The lp function from the lpSolve package takes as input a vector indicating the coefficients in the objective value, and information about constraints (a constraint matrix, the direction of each constraint, and the right-hand side of each constraint). For the stated problem, this would be:

library(lpSolve)
mod <- lp("min",  # min/max
          food$CostPerServing,  # Objective
          rbind(food$VitaminA, food$VitaminA, food$Calories, food$Calories),  # Constraint matrix
          c(">=", "<=", ">=", "<="),  # Constraint directions
          c(5000, 50000, 2000, 2500))

After the model has solved, you can look at the objective function and values:

mod$objval
# [1] 2.907692
mod$solution
# [1]  0.00000 10.00000 12.15385
sum(food$VitaminA * mod$solution)
# [1] 5000
sum(food$Calories * mod$solution)
# [1] 2000

The cheapest cost to meet the constraints would be $2.91, and you can achieve this by using 0 servings of corn, 10 servings of 2% milk, and 12.15 servings of wheat bread. This yields exactly 5000 units of Vitamin A and exactly 2000 calories.