Deep-Learning Nan Loss Reasons

There are lots of things I have seen make a model diverge.

  1. Too high of a learning rate. You can often tell if this is the case if the loss begins to increase and then diverges to infinity.

  2. I am not too familiar with the DNNClassifier, but I am guessing it uses the categorical cross-entropy cost function. This involves taking the log of the prediction, which diverges as the prediction approaches zero. That is why people usually add a small epsilon value to the prediction to prevent this divergence. I am guessing the DNNClassifier probably does this or uses the TensorFlow op for it. Probably not the issue.

  3. Other numerical stability issues can exist, such as division by zero, where adding an epsilon can help. Another less obvious one is the square root, whose derivative can diverge if not properly simplified when dealing with finite-precision numbers. Yet again, I doubt this is the issue in the case of the DNNClassifier.

  4. You may have an issue with the input data. Try calling assert not np.any(np.isnan(x)) on the input data to make sure you are not introducing the nan. Also make sure all of the target values are valid. Finally, make sure the data is properly normalized: you probably want to have the pixels in the range [-1, 1] and not [0, 255] (see the sketch after this list).

  5. The labels must be in the domain of the loss function, so if using a logarithmic-based loss function all labels must be non-negative (as noted by evan pu and the comments below).
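
As a minimal numpy sketch of points 2 and 4 (the epsilon value and array names are illustrative, not what DNNClassifier actually uses):

import numpy as np

x = np.random.randint(0, 256, size=(100, 784)).astype(np.float32)   # stand-in for a batch of raw pixels
y_pred = np.random.rand(100, 10).astype(np.float32)                 # stand-in for model predictions

# point 4: fail fast if the inputs already contain nan
assert not np.any(np.isnan(x)), "input contains nan"

# point 4: normalize pixels from [0, 255] into [-1, 1]
x_scaled = x / 127.5 - 1.0

# point 2: clip predictions away from zero before taking the log,
# so log(0) = -inf never enters the loss
eps = 1e-7
safe_log = np.log(np.clip(y_pred, eps, 1.0))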

Loss is always nan when training a deep learning model from tabular data

One of the reasons:
Check whether your dataset has NaN values. NaN values can cause problems for the model while learning.
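
For example, a quick pandas check (the DataFrame here is a toy stand-in for your dataset):

import numpy as np
import pandas as pd

# toy DataFrame standing in for the tabular dataset
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# count NaNs per column; any non-zero count can poison the loss
print(df.isna().sum())

# either drop the offending rows ...
df_clean = df.dropna()
# ... or impute them, e.g. with the column mean
# df_clean = df.fillna(df.mean(numeric_only=True))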

Some of the major bugs in your code:

  • You are using a sigmoid activation function instead of softmax for an output layer with 3 neurons
  • You are fitting the encoders on both the train and test sets, which is wrong. You should fit_transform on your train data and only transform on the test set
  • You are also passing an input to every layer, which is wrong; only the first layer should accept the input tensor
  • You forgot to use the prepare_inputs function for X_train and X_test
  • Your model should be fit with X_train_enc, not X_train

Use this instead


import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler
from tensorflow.keras import optimizers
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential

def load_dataset(data_folder_csv):
    # load the dataset as a pandas DataFrame
    data = pd.read_csv(data_folder_csv, header=0)
    # retrieve numpy array
    dataset = data.values

    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:, -1]
    print(y)

    # format all fields as floats
    X = X.astype(np.float32)
    # reshape the output variable to be one column (i.e. a 2D shape)
    y = y.reshape((len(y), 1))
    return X, y

# prepare input data using min/max scaler
def prepare_inputs(X_train, X_test):
    scaler = MinMaxScaler()
    X_train_enc = scaler.fit_transform(X_train)
    X_test_enc = scaler.transform(X_test)
    return X_train_enc, X_test_enc

# prepare target: label-encode, then one-hot encode
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    ohe = OneHotEncoder()
    # LabelEncoder wants 1D input, OneHotEncoder wants a 2D column
    y_train = le.fit_transform(y_train.ravel()).reshape(-1, 1)
    y_test = le.transform(y_test.ravel()).reshape(-1, 1)
    y_train_enc = ohe.fit_transform(y_train).toarray()
    y_test_enc = ohe.transform(y_test).toarray()
    return y_train_enc, y_test_enc

X, y = load_dataset("csv_ready.csv")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

print('Train', X_train.shape, y_train.shape)
print('Test', X_test.shape, y_test.shape)

# call prepare_inputs here (this was missing in the original code)
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
print('Finished preparing inputs.')

# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

model = Sequential()
model.add(Dense(128, input_dim=X_train.shape[1], activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
model.add(Dense(3, activation='softmax'))

#opt = optimizers.Adam(lr=0.01, decay=1e-6)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

model.fit(X_train_enc, y_train_enc, epochs=20, batch_size=32, verbose=1, use_multiprocessing=True)

_, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)
print('Accuracy: %.2f' % (accuracy * 100))

NaN loss when training regression network

Regression with neural networks is hard to get working because the output is unbounded, so you are especially prone to the exploding gradients problem (the likely cause of the nans).

Historically, one key solution to exploding gradients was to reduce the learning rate, but with the advent of per-parameter adaptive learning rate algorithms like Adam, you no longer need to set a learning rate to get good performance. There is very little reason to use SGD with momentum anymore unless you're a neural network fiend and know how to tune the learning schedule.
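
For instance, in Keras this is just a matter of which optimizer you pass to compile (the model below is only a placeholder to show the call, assuming the 35 input features mentioned below):

import tensorflow as tf

# placeholder regression model, just to illustrate the compile step
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(35,)),
    tf.keras.layers.Dense(1),
])

# Adam: per-parameter adaptive learning rates, no manual schedule tuning needed
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")

# versus SGD with momentum, where the learning rate and its schedule must be tuned by hand
# model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9), loss="mse")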

Here are some things you could potentially try:

  1. Normalize your outputs by quantile normalizing or z-scoring. To be rigorous, compute this transformation on the training data, not on the entire dataset. For example, with quantile normalization, if an example is in the 60th percentile of the training set, it gets a value of 0.6. (You can also shift the quantile-normalized values down by 0.5 so that the 0th percentile is -0.5 and the 100th percentile is +0.5.) A sketch of both transformations is shown after this list.

  2. Add regularization, either by increasing the dropout rate or adding L1 and L2 penalties to the weights. L1 regularization is analogous to feature selection, and since you said that reducing the number of features to 5 gives good performance, L1 may also.

  3. If these still don't help, reduce the size of your network. This is not always the best idea since it can harm performance, but in your case you have a large number of first-layer neurons (1024) relative to input features (35) so it may help.

  4. Increase the batch size from 32 to 128. 128 is fairly standard and could potentially increase the stability of the optimization.
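
A minimal sketch of point 1, fitting the target transformation on the training split only (the arrays are made-up stand-ins for your regression targets):

import numpy as np

rng = np.random.default_rng(0)
y_train = rng.normal(loc=50.0, scale=10.0, size=800)   # stand-in training targets
y_test = rng.normal(loc=50.0, scale=10.0, size=200)    # stand-in test targets

# z-scoring: the statistics come from the training targets only
mu, sigma = y_train.mean(), y_train.std()
y_train_z = (y_train - mu) / sigma
y_test_z = (y_test - mu) / sigma

# quantile normalization: an example's value is its percentile within the
# training set, shifted down by 0.5 so the range is roughly [-0.5, +0.5]
def quantile_normalize(values, reference):
    ranks = np.searchsorted(np.sort(reference), values) / len(reference)
    return ranks - 0.5

y_train_q = quantile_normalize(y_train, y_train)
y_test_q = quantile_normalize(y_test, y_train)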

training loss is nan in keras LSTM

I'm more familiar with working with PyTorch than Keras. However there are still a couple of things I would recommend doing:

  1. Check your data. Ensure that there are no missing or null values in the data that you pass into your model. This is the most likely culprit. A single null value will cause the loss to be NaN.

  2. You could try lowering the learning rate (0.001 or something even smaller) and/or removing gradient clipping. I've actually had gradient clipping be the cause of NaN loss before (a sketch of the optimizer settings is shown after this list).

  3. Try scaling your data (though unscaled data will usually cause infinite losses rather than NaN losses). Use StandardScaler or one of the other scalers in sklearn.
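
A sketch of point 2 in Keras (the learning-rate value is just an example, and clipnorm is shown only so you can see what to remove if clipping itself is the culprit):

import tensorflow as tf

# lower learning rate, with optional gradient clipping;
# drop the clipnorm argument entirely to remove clipping
opt = tf.keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0)
# opt = tf.keras.optimizers.Adam(learning_rate=0.0001)   # even smaller, no clipping

# placeholder LSTM model, just to show where the optimizer goes
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(10, 8)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=opt, loss="mse")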

If all that fails then I'd try to just pass some very simple dummy data into the model and see if the problem persists. Then you will know if it is a code problem or a data problem. Hope this helps and feel free to ask questions if you have them.

Common causes of nans during training of neural networks

I came across this phenomenon several times. Here are my observations:



Gradient blow up

Reason: large gradients throw the learning process off-track.

What you should expect: Looking at the runtime log, you should look at the loss values per iteration. You'll notice that the loss starts to grow significantly from iteration to iteration; eventually the loss will be too large to be represented by a floating point variable and it will become nan.

What can you do: Decrease the base_lr (in the solver.prototxt) by an order of magnitude (at least). If you have several loss layers, you should inspect the log to see which layer is responsible for the gradient blow up and decrease the loss_weight (in train_val.prototxt) for that specific layer, instead of the general base_lr.



Bad learning rate policy and params

Reason: caffe fails to compute a valid learning rate and gets 'inf' or 'nan' instead; this invalid rate multiplies all updates, thus invalidating all parameters.

What you should expect: Looking at the runtime log, you should see that the learning rate itself becomes 'nan', for example:

... sgd_solver.cpp:106] Iteration 0, lr = -nan

What can you do: fix all parameters affecting the learning rate in your 'solver.prototxt' file.

For instance, if you use lr_policy: "poly" and you forget to define the max_iter parameter, you'll end up with lr = nan... (the "poly" policy computes the rate from the ratio iter/max_iter, so a missing max_iter makes that ratio invalid).

For more information about learning rate in caffe, see this thread.



Faulty Loss function

Reason: Sometimes the computation of the loss in the loss layers causes nans to appear. For example, feeding an "InfogainLoss" layer with non-normalized values, using a custom loss layer with bugs, etc.

What you should expect: Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears.

What can you do: See if you can reproduce the error, add a printout to the loss layer, and debug the error.

For example: Once I used a loss that normalized the penalty by the frequency of label occurrence in a batch. It just so happened that if one of the training labels did not appear in the batch at all - the loss computed produced nans. In that case, working with large enough batches (with respect to the number of labels in the set) was enough to avoid this error.
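
A framework-agnostic numpy sketch of that failure mode (an illustration of the idea, not the actual caffe layer): dividing a per-label penalty by the per-batch label frequency produces 0/0 = nan as soon as a label is absent from the batch.

import numpy as np

num_labels = 5
batch_labels = np.array([0, 0, 1, 2, 2, 2, 4, 4])   # label 3 never appears in this batch

# per-label penalty accumulated over the batch (stays zero for absent labels)
per_label_penalty = np.zeros(num_labels)
for lbl in batch_labels:
    per_label_penalty[lbl] += 1.0   # stand-in for the real per-example penalty

# frequency of each label in the batch
counts = np.bincount(batch_labels, minlength=num_labels).astype(float)

# normalizing by frequency: 0/0 for the missing label gives nan,
# and that nan propagates into the total loss
with np.errstate(invalid="ignore"):
    normalized = per_label_penalty / counts
print(normalized)         # the entry for label 3 is nan
print(normalized.sum())   # nan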



Faulty input

Reason: you have an input with nan in it!

What you should expect: once the learning process "hits" this faulty input - output becomes nan. Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears.

What can you do: re-build your input datasets (lmdb/leveldb/hdf5...) and make sure you do not have bad image files in your training/validation set. For debugging, you can build a simple net that reads the input layer, has a dummy loss on top of it, and runs through all the inputs: if one of them is faulty, this dummy net should also produce nan.
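
Outside of caffe, the same scan can be done with a few lines of numpy over whatever arrays your input pipeline produces (the file name and dataset keys here are hypothetical):

import numpy as np
import h5py

# walk every sample in an HDF5 dataset and report the ones that are not finite
with h5py.File("train_data.h5", "r") as f:      # hypothetical file
    images = f["images"]                        # hypothetical dataset keys
    labels = f["labels"]
    for i in range(images.shape[0]):
        if not np.all(np.isfinite(images[i])) or not np.all(np.isfinite(labels[i])):
            print(f"faulty sample at index {i}")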



stride larger than kernel size in "Pooling" layer

For some reason, choosing stride > kernel_size for pooling may result in nans. For example:

layer {
  name: "faulty_pooling"
  type: "Pooling"
  bottom: "x"
  top: "y"
  pooling_param {
    pool: AVE
    stride: 5
    kernel_size: 3
  }
}

results in nans in y.



Instabilities in "BatchNorm"

It was reported that under some settings "BatchNorm" layer may output nans due to numerical instabilities.

This issue was raised in bvlc/caffe and PR #5136 is attempting to fix it.


Recently, I became aware of the debug_info flag: setting debug_info: true in 'solver.prototxt' will make caffe print more debug information (including gradient magnitudes and activation values) to the log during training. This information can help in spotting gradient blowups and other problems in the training process.

Nan Loss when training Deep neural Recommender model using tensorflow

I got a similar error when using tfrs on a custom dataset. It turns out that I had some non-printable characters and symbols in the data. I simply searched for and removed the symbols (manually, with some regex), and I also limited the text columns in the dataframe to printable characters only.

from string import printable as pt

# replace any character outside the printable ASCII set with a space
allowed_set = set(pt)
df[col] = df[col].apply(lambda x: ''.join([' ' if s not in allowed_set else s for s in x]))

I hope it helps.

loss nan when trying to work with tensorflow feature columns

The reason for getting nan in the loss is that your target values are at the extremes: they range in magnitude from roughly 1e-32 to 1e+31. You can see this easily:

df['zg500']
'''
0 -3.996248e-29
1 2.476790e+11
2 -1.010202e+08
3 -1.407987e-02
4 2.240596e-32
...
1742 -1.682389e+11
1743 -4.802401e+00
1744 -3.480795e+31
1745 1.026754e+21
1746 1.790822e+23
Name: zg500, Length: 1739, dtype: float64
'''

The workaround for this is to scale the target. Although this is not generally recommended, we have no choice here. Below is a slight modification using StandardScaler to scale the targets.

ss = StandardScaler()

# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = ss.fit_transform(dataframe['zg500'].values.reshape(-1, 1))
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds

After doing this, below are the results of training the model.

history = model.fit(train_ds, epochs=2)
'''
Consider rewriting this model with the Functional API.
109/109 [==============================] - 1s 804us/step - loss: 27.0520
Epoch 2/10
109/109 [==============================] - 0s 769us/step - loss: 1.0166
Epoch 3/10
109/109 [==============================] - 0s 753us/step - loss: 1.0148
Epoch 4/10
109/109 [==============================] - 0s 779us/step - loss: 1.0115
Epoch 5/10
109/109 [==============================] - 0s 775us/step - loss: 1.0107
Epoch 6/10
109/109 [==============================] - 0s 915us/step - loss: 1.0107
Epoch 7/10
109/109 [==============================] - 0s 1ms/step - loss: 1.0034
Epoch 8/10
109/109 [==============================] - 0s 784us/step - loss: 1.0092
Epoch 9/10
109/109 [==============================] - 0s 735us/step - loss: 1.0151
Epoch 10/10
109/109 [==============================] - 0s 803us/step - loss: 1.0105
'''

