NaN Loss When Training Regression Network

NaN loss when training regression network

Regression with neural networks is hard to get working because the output is unbounded, so you are especially prone to the exploding-gradients problem (the likely cause of your NaNs).

Historically, one key solution to exploding gradients was to reduce the learning rate, but with the advent of per-parameter adaptive learning-rate algorithms like Adam, you no longer need to hand-tune a learning rate to get good performance. There is very little reason to use SGD with momentum anymore unless you're a neural-network fiend and know how to tune the learning-rate schedule.
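For instance, a minimal sketch of compiling a Keras regression model with Adam at its default settings instead of hand-tuned SGD (the model object here is assumed to already exist):

from tensorflow import keras

# Adam adapts the step size per parameter; its default learning rate of 1e-3
# is usually a reasonable starting point for regression.
model.compile(loss="mse", optimizer=keras.optimizers.Adam())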

Here are some things you could potentially try:

  1. Normalize your outputs by quantile normalizing or z-scoring them. To be rigorous, compute this transformation on the training data, not on the entire dataset. For example, with quantile normalization, if an example is in the 60th percentile of the training set, it gets a value of 0.6. (You can also shift the quantile-normalized values down by 0.5 so that the 0th percentile is -0.5 and the 100th percentile is +0.5.) A combined sketch of suggestions 1-4 follows this list.

  2. Add regularization, either by increasing the dropout rate or by adding L1 and L2 penalties to the weights. L1 regularization amounts to feature selection, and since you said that reducing the number of features to 5 gives good performance, L1 may help as well.

  3. If these still don't help, reduce the size of your network. This is not always the best idea, since it can harm performance, but in your case you have a large number of first-layer neurons (1024) relative to the number of input features (35), so it may help.

  4. Increase the batch size from 32 to 128. 128 is fairly standard and could potentially increase the stability of the optimization.
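As a rough combined sketch of suggestions 1-4 (the names X_train, y_train, X_val, y_val, the layer sizes, and the regularization strengths are illustrative assumptions, not taken from your setup):

from tensorflow import keras
from tensorflow.keras import layers, regularizers

# 1. z-score the targets using statistics from the training split only
y_mean, y_std = y_train.mean(), y_train.std()
y_train_z = (y_train - y_mean) / y_std
y_val_z = (y_val - y_mean) / y_std

model = keras.Sequential([
    # 3. a smaller first layer, with 2. an L2 penalty on the weights
    layers.Dense(128, activation="relu", input_shape=(35,),
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),   # 2. dropout
    layers.Dense(1),       # linear output for regression
])
model.compile(loss="mse", optimizer="adam")

# 4. a larger batch size for more stable gradient estimates
model.fit(X_train, y_train_z, validation_data=(X_val, y_val_z),
          epochs=50, batch_size=128)

Predictions then need to be mapped back to the original scale with y_pred * y_std + y_mean.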

Loss: NaN in Keras while performing regression

These are the steps that you can take to find the cause of your problem:

  1. Make sure that your dataset is what it should be:

    • Look for any NaN/inf values in your dataset and fix them (see the sketch after this list).
    • Fix incorrect encoding (convert it to UTF-8).
    • Fix invalid values in your columns or rows.
  2. Regularize your model using Dropout, BatchNormalization, or L1/L2 penalties; change your batch_size; or scale your data to another range (e.g. [-1, 1]).

  3. Reduce the size of your network.

  4. Change other hyper-parameters (e.g. optimizer or activation function).

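For step 1, a quick sanity check along these lines (df for the features and y for the targets are placeholder names) will surface NaN/inf values before training:

import numpy as np

print(df.isna().sum())                                # NaN count per column
num_cols = df.select_dtypes(include="number").columns
print(np.isinf(df[num_cols]).sum())                   # inf count per numeric column
assert not np.any(np.isnan(y)), "NaN found in the targets"

# Optional: rescale the numeric features to [-1, 1]
df[num_cols] = 2 * (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min()) - 1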

loss: nan when fitting training data in tensorflow keras regression network

I have found the answer to my own question:

As it turns out, TensorFlow doesn't currently work with Python 3.10.
After downgrading my Python version to 3.8, everything started working.
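If you suspect a version mismatch like this, a quick way to check which interpreter and TensorFlow build you are running (nothing here is specific to the poster's setup) is:

import sys
import tensorflow as tf

print(sys.version)       # e.g. 3.8.x vs 3.10.x
print(tf.__version__)    # compare against the Python versions this release supports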

Keras Regression model loss: nan. How to fix it?

As explained in the comment, a NaN is very often caused by a learning rate that is too high or by similar instabilities during optimization that cause exploding gradients. This can also be prevented by setting clipnorm. Setting up an optimizer with a proper learning rate:

opt = keras.optimizers.Adam(0.001, clipnorm=1.)
model.compile(loss=["mse", "mse"], loss_weights=[0.9, 0.1], optimizer=opt)

leads to better training in your notebook:

Epoch 1/20
363/363 [==============================] - 1s 2ms/step - loss: 1547.7197 - main_output_loss: 967.1940 - aux_output_loss: 6772.4609 - val_loss: 19.9807 - val_main_output_loss: 20.0967 - val_aux_output_loss: 18.9365
Epoch 2/20
363/363 [==============================] - 1s 2ms/step - loss: 13.2916 - main_output_loss: 14.0150 - aux_output_loss: 6.7812 - val_loss: 14.6868 - val_main_output_loss: 14.5820 - val_aux_output_loss: 15.6298
Epoch 3/20
363/363 [==============================] - 1s 2ms/step - loss: 11.0539 - main_output_loss: 11.6683 - aux_output_loss: 5.5244 - val_loss: 10.5564 - val_main_output_loss: 10.2116 - val_aux_output_loss: 13.6594
Epoch 4/20
363/363 [==============================] - 1s 1ms/step - loss: 7.4646 - main_output_loss: 7.7688 - aux_output_loss: 4.7269 - val_loss: 13.2672 - val_main_output_loss: 11.5239 - val_aux_output_loss: 28.9570
Epoch 5/20
363/363 [==============================] - 1s 2ms/step - loss: 5.6873 - main_output_loss: 5.8091 - aux_output_loss: 4.5909 - val_loss: 5.0464 - val_main_output_loss: 4.5089 - val_aux_output_loss: 9.8839

It's not performing amazingly, but from here you'll have to tune the remaining hyperparameters to get it to your satisfaction.

You can also observe the effect of clipnorm by using SGD as you initially intended:

opt = keras.optimizers.SGD(0.001, clipnorm=1.)
model.compile(loss=["mse", "mse"], loss_weights=[0.9, 0.1], optimizer=opt)

This trains properly. As soon as you remove clipnorm, however, you will get NaNs.

Deep-Learning Nan loss reasons

There are lots of things I have seen make a model diverge.

  1. Too high of a learning rate. You can often tell if this is the case if the loss begins to increase and then diverges to infinity.

  2. I am not too familiar with the DNNClassifier, but I am guessing it uses the categorical cross-entropy cost function. This involves taking the log of the prediction, which diverges as the prediction approaches zero. That is why people usually add a small epsilon value to the prediction to prevent this divergence (see the sketch after this list). I am guessing the DNNClassifier probably does this or uses the TensorFlow op for it. Probably not the issue.

  3. Other numerical-stability issues can exist, such as division by zero, where adding the epsilon can help. Another, less obvious one is the square root, whose derivative can diverge if it is not properly simplified when dealing with finite-precision numbers. Yet again, I doubt this is the issue in the case of the DNNClassifier.

  4. You may have an issue with the input data. Try calling assert not np.any(np.isnan(x)) on the input data to make sure you are not introducing NaNs. Also make sure all of the target values are valid. Finally, make sure the data is properly normalized. You probably want to have the pixels in the range [-1, 1] and not [0, 255].

  5. The labels must be in the domain of the loss function, so if you are using a logarithm-based loss function, all labels must be non-negative (as noted by evan pu and the comments below).
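To make point 2 concrete, here is a hand-rolled sketch of the epsilon trick for a log-based loss (an illustration only, not what the DNNClassifier actually does internally):

import tensorflow as tf

def stable_binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip predictions away from 0 and 1 so the logarithm stays finite.
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    return -tf.reduce_mean(y_true * tf.math.log(y_pred)
                           + (1.0 - y_true) * tf.math.log(1.0 - y_pred))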

Nan Loss when training Deep neural Recommender model using tensorflow

I got a similar error when using tfrs on a custom dataset. It turned out that I had some non-printable characters and symbols in the data. I simply searched for and removed the symbols (manually, with some regex), and I also limited the text columns in the dataframe to printable characters only.

from string import printable as pt  # the set of ASCII printable characters

allowed_set = set(pt)
# Replace every character that is not printable with a space in the text column.
df[col] = df[col].apply(lambda x: ''.join(' ' if s not in allowed_set else s for s in x))

I hope it helps.


