Optimizing for Accuracy Instead of Loss in Keras Model

Optimizing for accuracy instead of loss in Keras model

To start with, the code snippet you have used as example:

model.compile(loss='mean_squared_error', optimizer='sgd', metrics='acc')

is actually invalid (although Keras will not produce any error or warning), for a very simple and elementary reason: MSE is a valid loss for regression problems, for which accuracy is meaningless (accuracy is meaningful only for classification problems, where in turn MSE is not a valid loss function). For details (including a code example), see my own answer in What function defines accuracy in Keras when the loss is mean squared error (MSE)?; for a similar situation in scikit-learn, see my own answer in this thread.
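To see what happens in practice, here is a minimal sketch (with made-up random data, not taken from the linked answers) of the invalid combination above: Keras will dutifully report an 'acc' value during training, but that number is meaningless for continuous targets:

import numpy as np
from tensorflow import keras

# toy regression data, purely for illustration
X = np.random.rand(100, 4)
y = np.random.rand(100)               # continuous targets

model = keras.Sequential([keras.layers.Dense(1, input_shape=(4,))])
model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['acc'])
model.fit(X, y, epochs=2, verbose=2)  # an 'acc' is printed, but it means nothing here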

Coming now to your general question: in regression settings we usually don't need a separate performance metric, and we normally just use the loss function itself for this purpose; i.e. the correct code for the example you have used would simply be

model.compile(loss='mean_squared_error', optimizer='sgd')

without any metrics specified. We could of course use metrics='mse', but this is redundant and not really needed. Sometimes people use something like

model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['mse','mae'])

i.e. optimise the model according to the MSE loss, but also report its performance in terms of the mean absolute error (MAE), in addition to the MSE.

Now, your question:

shouldn't the focus of the model during its training to maximize acc (or minimize 1/acc) instead of minimizing MSE?

is indeed valid, at least in principle (save for the reference to MSE), but only for classification problems, where, roughly speaking, the situation is as follows: we cannot use the vast arsenal of convex optimization methods in order to directly maximize the accuracy, because accuracy is not a differentiable function; so, we need a proxy differentiable function to use as loss. The most common example of such a loss function suitable for classification problems is the cross entropy.
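In Keras terms, a typical classification setup therefore looks like the following: the differentiable cross entropy is what the optimizer minimizes, while the (non-differentiable) accuracy is only monitored as a metric (a minimal sketch, assuming an already-built classification model):

# cross entropy: the differentiable proxy that the optimizer actually minimizes
# accuracy: the quantity we really care about, tracked but never optimized directly
model.compile(loss='categorical_crossentropy', optimizer='sgd',
              metrics=['accuracy'])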

Rather unsurprisingly, this question of yours pops up from time to time, albeit in slight variations in context; see for example my own answers in

  • Cost function training target versus accuracy desired goal
  • Targeting a specific metric to optimize in tensorflow

For the interplay between loss and accuracy in the special case of binary classification, you may find my answers in the following threads useful:

  • Loss & accuracy - Are these reasonable learning curves?
  • How does Keras evaluate the accuracy?

Is there an optimizer in keras based on precision or recall instead of loss?

You don't use precision or recall as quantities to optimize; you just track them as validation scores in order to pick the best weights. Don't mix up the loss, the optimizer, and the metrics; they are not meant for the same thing.

import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

THRESHOLD = 0.5

def precision(y_true, y_pred, threshold_shift=0.5 - THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)

    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)

    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fp = K.sum(K.round(K.clip(y_pred_bin - y_true, 0, 1)))

    precision = tp / (tp + fp)
    return precision


def recall(y_true, y_pred, threshold_shift=0.5 - THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)

    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)

    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fn = K.sum(K.round(K.clip(y_true - y_pred_bin, 0, 1)))

    recall = tp / (tp + fn)
    return recall


def fbeta(y_true, y_pred, beta=2, threshold_shift=0.5 - THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)

    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)

    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fp = K.sum(K.round(K.clip(y_pred_bin - y_true, 0, 1)))
    fn = K.sum(K.round(K.clip(y_true - y_pred_bin, 0, 1)))

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)

    beta_squared = beta ** 2
    return (beta_squared + 1) * (precision * recall) / (beta_squared * precision + recall)


def model_fit(X, y, X_test, y_test):
    # weight the positive (minority) class inversely to its frequency
    class_weight = {
        1: 1 / (np.sum(y) / len(y)),
        0: 1}
    np.random.seed(47)
    model = Sequential()
    model.add(Dense(1000, input_shape=(X.shape[1],)))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(500))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(250))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))

    # optimize the binary cross entropy loss; track fbeta, precision and recall as metrics
    model.compile(loss='binary_crossentropy', optimizer='adamax',
                  metrics=[fbeta, precision, recall])
    model.fit(X, y, validation_data=(X_test, y_test), epochs=200, batch_size=50,
              verbose=2, class_weight=class_weight)
    return model
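If the goal is to keep the best weights according to one of these tracked metrics, one option is a ModelCheckpoint callback monitoring the corresponding validation value; a rough sketch to be used inside model_fit (the file path and the callback setup are my own additions, not part of the original answer):

from keras.callbacks import ModelCheckpoint

# keep only the weights with the best validation fbeta seen so far
checkpoint = ModelCheckpoint('best_weights.h5',        # placeholder path
                             monitor='val_fbeta', mode='max',
                             save_best_only=True, save_weights_only=True)

model.fit(X, y, validation_data=(X_test, y_test), epochs=200, batch_size=50,
          verbose=2, class_weight=class_weight, callbacks=[checkpoint])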

How to improve model loss and accuracy?

I guess you are not happy with the training speed: ETA 3:30:04. Usually a model needs to train for a few epochs to get a significant reduction in loss. But waiting 4 hours per epoch isn't cool, is it? There are several things you can do:

  • Make sure that you train your model on a GPU, because the difference between training on CPU and GPU is enormous
  • You can try to make your model less complicated
  • Or, if you want a complicated model but don't have much time to train it, use Transfer Learning

In transfer learning you use a pretrained model, add your own layers on top, and retrain. Here's an example:

from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import Dense, BatchNormalization
from tensorflow.keras.models import Model

# pretrained convolutional base, used as a frozen feature extractor;
# set IMG_WIDTH, IMG_HEIGHT, IMG_CHANNELS to your input image size
base_model = MobileNetV2(
    include_top=False,
    pooling='avg',   # global average pooling, so the base outputs a flat feature vector
    input_shape=(IMG_WIDTH, IMG_HEIGHT, IMG_CHANNELS)
)
base_model.trainable = False

# new trainable classification head on top of the frozen base
layer = Dense(256, activation='relu')(base_model.output)
layer = BatchNormalization()(layer)
out = Dense(61, activation='softmax')(layer)

model = Model(inputs=base_model.input, outputs=out)
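The transfer-learning model would then be compiled and trained as usual; a minimal sketch, assuming 61 one-hot-encoded classes and placeholder train_images / train_labels arrays (these names are not part of the original example):

model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])

# only the new head is trained, since the MobileNetV2 base was frozen above
model.fit(train_images, train_labels, validation_split=0.1,
          epochs=10, batch_size=32)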

Targeting a specific metric to optimize in tensorflow

In classification settings, optimizers minimize the loss, e.g. the cross entropy; quantities like accuracy, F-score, precision, recall etc. are essentially business metrics, and they are not (and cannot be) directly optimized during the training process.

This is a question that pops up rather frequently here in SO in various disguises; here are some threads which will hopefully help you disentangle the concepts (although they refer to accuracy, precision, and recall, the argument is exactly the same for the F-score):

Loss & accuracy - Are these reasonable learning curves?

Cost function training target versus accuracy desired goal

Is there an optimizer in keras based on precision or recall instead of loss?

The bottom line, adapting one of my own (linked) answers:

Loss and metrics like accuracy or F-score are different things; roughly speaking, metrics like accuracy & F-score are what we are actually interested in from a business perspective, while the loss is the objective function that the learning algorithms (optimizers) are trying to minimize from a mathematical perspective. Even more roughly speaking, you can think of the loss as the "translation" of the business objective (accuracy, F-score etc) to the mathematical domain, a translation which is necessary in classification problems (in regression ones, usually the loss and the business objective are the same, or at least can be the same in principle, e.g. the RMSE)...

Why we use the loss to update our model but use the metrics to choose the model we need?

The question is arguably too broad for SO; nevertheless, here are a couple of things which you will hopefully find helpful...

Since you have chosen to use the loss to update the model, why not use the loss to select the model?

Because, while the loss is the quantity we have to optimize from the mathematical perspective, the quantity of interest from the business perspective is the metric; in other words, at the end of the day, as users of the model, we are interested in the metric, and not in the loss (at least for settings where these two quantities are by default different, such as in classification problems).

That said, selecting the model based on the loss is also a perfectly valid strategy; as always, there is some subjectivity, and it depends on the specific problem.
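In Keras, this choice typically shows up in what you monitor in callbacks such as ModelCheckpoint or EarlyStopping; a minimal sketch of both options (file paths and parameter values are placeholders of my own):

from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

# select the model by the business metric...
checkpoint_acc = ModelCheckpoint('best_by_acc.h5', monitor='val_accuracy',
                                 mode='max', save_best_only=True)

# ...or, just as validly, by the loss
checkpoint_loss = ModelCheckpoint('best_by_loss.h5', monitor='val_loss',
                                  mode='min', save_best_only=True)

early_stop = EarlyStopping(monitor='val_loss', patience=10,
                           restore_best_weights=True)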

Take the regression problem as an example, when someone use the 'mse' as their loss, why they define metrics=['mae']

This is not the norm, and is far from standard; normally, for regression problems, it is perfectly natural to use the loss as the metric, too. I agree with you that choices like the one you refer to seem unnatural, and in general do not seem to make much sense. Just keep in mind that the fact that someone used it in a blog post or similar does not necessarily make it "correct" (or a good idea), but it is difficult to argue in general without taking into account possible arguments for the specific case.

I don't know why these metrics [F1 or AUC] can improve the problem caused by imbalance data.

They don't "improve" anything; they are just more appropriate than accuracy, for which a naive approach on a heavily imbalanced dataset (think of a 99% majority class) would simply be to classify everything as the majority class, giving 99% accuracy without the model having learned anything.
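A quick numerical sketch of that 99% trap (made-up labels, using scikit-learn only for the metric calculations):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 990 negatives, 10 positives - heavily imbalanced
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000)                 # "classifier" that always predicts the majority class

print(accuracy_score(y_true, y_pred))   # 0.99 - looks great, yet nothing was learned
print(f1_score(y_true, y_pred))         # 0.0  - F1 exposes the problem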

I am confused about when someone send more than one metric to the parameter metrics in the function compile. I don't understand why multiple, why not one. What is the advantage of defining multiple metrics over one?

Again, generally speaking, there is no advantage, nor is this the norm; but everything depends on the possible specifics of the case.


UPDATE (after comment): Limiting the discussion to classification settings (since in regression, the loss and the metric can be the same thing), similar questions pop up rather frequently, I guess because the subtle differences between the loss and the various available metrics (accuracy, precision, recall, F1 score etc) are not well understood; consider for example the inverse of your question:

Optimizing for accuracy instead of loss in Keras model

and the links therein. Quoting from one of my own linked answers:

Loss and accuracy are different things; roughly speaking, the accuracy is what we are actually interested in from a business perspective, while the loss is the objective function that the learning algorithms (optimizers) are trying to minimize from a mathematical perspective. Even more roughly speaking, you can think of the loss as the "translation" of the business objective (accuracy) to the mathematical domain, a translation which is necessary in classification problems (in regression ones, usually the loss and the business objective are the same, or at least can be the same in principle, e.g. the RMSE)...

You may also find the discussion in Cost function training target versus accuracy desired goal helpful.

Why would I choose a loss-function differing from my metrics?

In many cases the metric you are interested in might not be differentiable, so you cannot use it as a loss; this is the case for accuracy, for example, where the cross-entropy loss is used instead because it is differentiable.

For metrics that are already differentiable, you just want to get additional information from the learning process, as each metric measures something different. For example, the MSE is on the square of the scale of the data/predictions, so to stay on the same scale as the data you have to use the RMSE or the MAE. The MAPE gives you relative (not absolute) error, so all of these metrics measure something different that might be of interest.
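For example, in Keras you might monitor several of these alongside the MSE loss (a minimal sketch, assuming an already-built regression model):

import tensorflow as tf

model.compile(loss='mse', optimizer='adam',
              metrics=[tf.keras.metrics.RootMeanSquaredError(),  # same scale as the data
                       'mae',                                    # absolute error, same scale
                       'mape'])                                  # relative (percentage) error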

In the case of accuracy, this metric is used because it is easily interpretable by a human, while cross-entropy loss values are much less intuitive to interpret.

Loss in Keras Model evaluation

When defining a machine learning model, we want a way to measure its performance so that we can compare it with other models to choose the best one, and also make sure that it is good enough. Therefore, we define metrics like accuracy (in the context of classification), i.e. the proportion of samples correctly classified by the model, to measure how our model performs and whether it is good enough for our task or not.

Although these metrics are easily comprehensible to us, the problem is that they cannot be directly used by the learning process to tune the parameters of the model. Instead, we define other measures, usually called loss functions or objective functions, which can be directly used by the training process (i.e. optimization). These functions are usually defined such that we expect that when their values are low, the accuracy will be high. That's why you commonly see machine learning algorithms trying to minimize a loss function with the expectation that the accuracy increases. In other words, the models learn indirectly, by optimizing the loss functions. The loss values are also important during training of the model: e.g. if they are not decreasing, or are fluctuating, this means there is a problem somewhere that needs to be fixed.

As a result, what we are ultimately concerned about (i.e. when testing a model) is the value of the metrics (like accuracy) we initially defined, and we don't care about the final value of the loss function. That's why you don't hear things like "the loss value of a [specific model] on the ImageNet dataset is 8.732"! That would not tell you whether the model is great, good, bad or terrible. Rather, you would hear that "this model achieves 87% accuracy on the ImageNet dataset".
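This is also reflected in how model.evaluate is typically used: the returned loss value is mostly a training diagnostic, while the metric is the number you would actually report (a minimal sketch, assuming a classification model compiled with metrics=['accuracy']):

loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f'test loss: {loss:.4f}')          # useful while debugging training, rarely reported
print(f'test accuracy: {accuracy:.2%}')  # the number you would actually quote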


