How to choose cross-entropy loss in TensorFlow?
Preliminary facts
In a functional sense, the sigmoid is a special case of the softmax function, when the number of classes equals 2. Both of them do the same operation: transform the logits (see below) to probabilities.
In simple binary classification, there's no big difference between the two. In multinomial classification, however, sigmoid allows you to deal with non-exclusive labels (a.k.a. multi-labels), while softmax deals with exclusive classes (see below).
A logit (also called a score) is a raw unscaled value associated with a class, before computing the probability. In terms of neural network architecture, this means that a logit is an output of a dense (fully-connected) layer.
TensorFlow naming is a bit strange: all of the functions below accept logits, not probabilities, and apply the transformation themselves (which is simply more efficient).
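To make the "special case" claim concrete, here is a minimal plain-Python sketch (not TensorFlow code; the helper names `sigmoid` and `softmax` are just illustrative). It checks that a two-class softmax over the logits `[x, 0]` gives the same probability for the first class as the sigmoid of `x`.

```python
import math

def sigmoid(x):
    # Logistic function: maps a single logit to a probability.
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    # Subtract the maximum first so the exponentials cannot overflow.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

x = 1.7
p_sigmoid = sigmoid(x)
# Two classes whose logits are x and 0: the first probability is e^x / (e^x + 1).
p_softmax = softmax([x, 0.0])[0]
print(p_sigmoid, p_softmax)
```

The two printed values should agree up to floating-point rounding, because e^x / (e^x + e^0) simplifies to 1 / (1 + e^(-x)).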
Sigmoid functions family
tf.nn.sigmoid_cross_entropy_with_logits
tf.nn.weighted_cross_entropy_with_logits
tf.losses.sigmoid_cross_entropy
tf.contrib.losses.sigmoid_cross_entropy (DEPRECATED)
As stated earlier, the sigmoid loss function is for binary classification. But the TensorFlow functions are more general and also allow multi-label classification, when the classes are independent. In other words, tf.nn.sigmoid_cross_entropy_with_logits solves N binary classifications at once.
The labels must be 0/1 encoded per class (one-hot in the single-label case) or can contain soft class probabilities.
tf.losses.sigmoid_cross_entropy in addition allows you to set the in-batch weights, i.e. make some examples more important than others.
tf.nn.weighted_cross_entropy_with_logits allows you to set class weights (remember, the classification is binary), i.e. make positive errors larger than negative errors. This is useful when the training data is unbalanced.
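The "N binary classifications at once" idea can be sketched in plain Python. This is an illustration of the math only, not the TensorFlow implementation, and the function names are made up for the example. The first helper uses the numerically stable form max(x, 0) - x*z + log(1 + exp(-|x|)); the second scales only the positive-label term, which is the role of `pos_weight`.

```python
import math

def sigmoid_ce_with_logits(labels, logits):
    # One independent binary cross-entropy per class (stable formulation).
    return [max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))
            for z, x in zip(labels, logits)]

def weighted_ce_with_logits(labels, logits, pos_weight):
    # Same loss, but errors on positive labels are multiplied by pos_weight.
    # Written in the direct form, so it is only meant for moderate logits.
    losses = []
    for z, x in zip(labels, logits):
        log_p = -math.log1p(math.exp(-x))      # log(sigmoid(x))
        log_not_p = -math.log1p(math.exp(x))   # log(1 - sigmoid(x))
        losses.append(-pos_weight * z * log_p - (1.0 - z) * log_not_p)
    return losses

labels = [1.0, 0.0, 1.0]      # multi-label target: classes 0 and 2 are both on
logits = [2.0, -1.0, 0.5]
plain = sigmoid_ce_with_logits(labels, logits)
weighted = weighted_ce_with_logits(labels, logits, pos_weight=3.0)
print(plain)
print(weighted)
```

With `pos_weight=3.0`, the losses for the two positive classes are three times larger than in the unweighted case, while the loss for the negative class is unchanged.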
Softmax functions family
tf.nn.softmax_cross_entropy_with_logits (DEPRECATED IN 1.5)
tf.nn.softmax_cross_entropy_with_logits_v2
tf.losses.softmax_cross_entropy
tf.contrib.losses.softmax_cross_entropy (DEPRECATED)
These loss functions should be used for multinomial mutually exclusive classification, i.e. pick one out of N classes. They are also applicable when N = 2.
The labels must be one-hot encoded or can contain soft class probabilities: a particular example can belong to class A with 50% probability and class B with 50% probability. Note that strictly speaking it doesn't mean that it belongs to both classes, but one can interpret the probabilities this way.
Just like in the sigmoid family, tf.losses.softmax_cross_entropy allows you to set the in-batch weights, i.e. make some examples more important than others. As far as I know, as of TensorFlow 1.3, there's no built-in way to set class weights.
[UPD] In TensorFlow 1.5, the v2 version was introduced and the original softmax_cross_entropy_with_logits loss got deprecated. The only difference between them is that in the newer version, backpropagation happens into both logits and labels (here's a discussion why this may be useful).
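As a rough illustration of one-hot versus soft labels, the following plain-Python sketch computes softmax cross-entropy from logits via a log-softmax. It is not the TensorFlow code; `log_softmax` and `softmax_ce_with_logits` are names chosen for this example.

```python
import math

def log_softmax(logits):
    # log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

def softmax_ce_with_logits(labels, logits):
    # Cross-entropy between a label distribution and softmax(logits).
    return -sum(p * lp for p, lp in zip(labels, log_softmax(logits)))

logits = [2.0, 1.0, 0.1]
hard = softmax_ce_with_logits([1.0, 0.0, 0.0], logits)   # one-hot: class A
soft = softmax_ce_with_logits([0.5, 0.5, 0.0], logits)   # 50% A, 50% B
print(hard, soft)
```

The soft-label loss is simply the label-weighted average of the per-class losses, which is why a 50/50 target can be read as "half of the penalty for A plus half of the penalty for B".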
Sparse functions family
tf.nn.sparse_softmax_cross_entropy_with_logits
tf.losses.sparse_softmax_cross_entropy
tf.contrib.losses.sparse_softmax_cross_entropy (DEPRECATED)
Like the ordinary softmax above, these loss functions should be used for multinomial mutually exclusive classification, i.e. pick one out of N classes. The difference is in the label encoding: the classes are specified as integers (class index), not one-hot vectors. Obviously, this doesn't allow soft classes, but it can save some memory when there are thousands or millions of classes. However, note that the logits argument must still contain logits for each class, thus it consumes at least [batch_size, classes] memory.
Like above, the tf.losses version has a weights argument which allows you to set the in-batch weights.
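A small plain-Python sketch (illustrative names, not TensorFlow code) shows that the sparse form is the same loss as the one-hot form: with an integer class index, the cross-entropy reduces to picking one entry of the log-softmax.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

def sparse_softmax_ce(class_index, logits):
    # Integer label: just read off the log-probability of that class.
    return -log_softmax(logits)[class_index]

def dense_softmax_ce(one_hot_labels, logits):
    # One-hot label: same value, but needs a full vector per example.
    return -sum(p * lp for p, lp in zip(one_hot_labels, log_softmax(logits)))

logits = [0.2, 3.1, -1.0, 0.7]
sparse_loss = sparse_softmax_ce(1, logits)
dense_loss = dense_softmax_ce([0.0, 1.0, 0.0, 0.0], logits)
print(sparse_loss, dense_loss)
```

The memory saving is on the label side only: the logits list still has one entry per class in both variants.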
Sampled softmax functions family
tf.nn.sampled_softmax_loss
tf.contrib.nn.rank_sampled_softmax_loss
tf.nn.nce_loss
These functions provide another alternative for dealing with a huge number of classes. Instead of computing and comparing an exact probability distribution, they compute a loss estimate from a random sample.
The arguments weights and biases specify a separate fully-connected layer that is used to compute the logits for a chosen sample.
Like above, labels are not one-hot encoded, but have the shape [batch_size, num_true].
Sampled functions are only suitable for training. At test time, it's recommended to use a standard softmax loss (either sparse or one-hot) to get an actual distribution.
Another alternative loss is tf.nn.nce_loss, which performs noise-contrastive estimation (if you're interested, see this very detailed discussion). I've included this function in the softmax family, because NCE guarantees approximation to softmax in the limit.
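The core idea of sampling can be sketched in plain Python under heavy simplifications. This is not how tf.nn.sampled_softmax_loss is implemented: the sketch assumes uniform sampling of negatives without replacement and skips the corrections the real function applies for its sampling distribution. All names (`softmax_loss_over`, `sampled_softmax_sketch`) are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax_loss_over(candidates, inputs, weights, biases):
    # Cross-entropy of candidates[0] under a softmax over the candidates only.
    # weights/biases play the role of the final fully-connected layer.
    logits = [dot(weights[c], inputs) + biases[c] for c in candidates]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[0]

def sampled_softmax_sketch(label, inputs, weights, biases, num_sampled, rng):
    # Score the true class against a few uniformly sampled negative classes.
    negatives = [c for c in range(len(weights)) if c != label]
    sampled = rng.sample(negatives, num_sampled)
    return softmax_loss_over([label] + sampled, inputs, weights, biases)

rng = random.Random(0)
num_classes, dim = 1000, 8
weights = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(num_classes)]
biases = [0.0] * num_classes
inputs = [rng.gauss(0.0, 1.0) for _ in range(dim)]

estimate = sampled_softmax_sketch(3, inputs, weights, biases, num_sampled=20, rng=rng)
everything = [3] + [c for c in range(num_classes) if c != 3]
full = softmax_loss_over(everything, inputs, weights, biases)
print(estimate, full)
```

Only 21 logits are computed for the estimate instead of 1000. Because the uncorrected estimate normalizes over fewer classes, it is never larger than the full loss, which is one reason the full softmax is the right tool at test time.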
Normalized Cross Entropy Loss Implementation Tensorflow/Keras
I think it is just a matter of translating the method names:
import tensorflow as tf

# y_true: one-hot labels; y_pred: per-class predicted probabilities
def NCE(y_true, y_pred):
    num = -tf.math.reduce_sum(tf.multiply(y_true, y_pred), axis=1)
    denom = -tf.math.reduce_sum(y_pred, axis=1)
    return tf.reduce_mean(num / denom)

t = tf.constant([[1, 0, 0], [0, 0, 1]], dtype=tf.float64)
y = tf.constant([[0.3, 0.6, 0.1], [0.1, 0.1, 0.8]], dtype=tf.float64)
NCE(t, y)
# <tf.Tensor: shape=(), dtype=float64, numpy=0.55>
Just check that the resulting loss is the same, since I've not tested it.
Keras Tensorflow Binary Cross entropy loss greater than 1
Keras binary_crossentropy first converts your predicted probability to logits. Then it uses tf.nn.sigmoid_cross_entropy_with_logits to calculate the cross entropy and returns the mean of that to you. Mathematically speaking, if your label is 1 and your predicted probability is low (like 0.1), the cross entropy can be greater than 1, as in losses.binary_crossentropy(tf.constant([1.]), tf.constant([0.1])).
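The arithmetic behind that example can be checked without TensorFlow. The sketch below is a plain-Python version of the textbook binary cross-entropy formula (the helper name is illustrative, and it omits the small epsilon clipping Keras applies, so the Keras value may differ in the last digits).

```python
import math

def binary_cross_entropy(label, prob):
    # -[y*log(p) + (1-y)*log(1-p)] for a single example.
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))

loss = binary_cross_entropy(1.0, 0.1)
print(loss)  # about 2.3026, i.e. -log(0.1)
```

For a label of 1 the loss is -log(p), which exceeds 1 whenever p is below 1/e (roughly 0.368), so a loss above 1 just means a confident wrong prediction, not a bug.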
Implementing Binary Cross Entropy loss gives different answer than Tensorflow's
There's some issue with your implementation. Here is the correct one with numpy.
import numpy as np

def BinaryCrossEntropy(y_true, y_pred):
    # Clip predictions away from 0 and 1 so the logs stay finite.
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
    term_0 = (1 - y_true) * np.log(1 - y_pred + 1e-7)
    term_1 = y_true * np.log(y_pred + 1e-7)
    return -np.mean(term_0 + term_1, axis=0)

print(BinaryCrossEntropy(np.array([1, 1, 1]).reshape(-1, 1),
                         np.array([1, 1, 0]).reshape(-1, 1)))
[5.14164949]
Note, during tf.keras model training, it's better to use the keras backend functionality. You can implement it in the same way using the keras backend utilities.
import numpy as np
from tensorflow.keras import backend as K

def BinaryCrossEntropy(y_true, y_pred):
    # Same computation as above, expressed with backend ops.
    y_pred = K.clip(y_pred, K.epsilon(), 1 - K.epsilon())
    term_0 = (1 - y_true) * K.log(1 - y_pred + K.epsilon())
    term_1 = y_true * K.log(y_pred + K.epsilon())
    return -K.mean(term_0 + term_1, axis=0)

print(BinaryCrossEntropy(
    np.array([1., 1., 1.]).reshape(-1, 1),
    np.array([1., 1., 0.]).reshape(-1, 1)
).numpy())
[5.14164949]
Does tensorflow compute the cross entropy only with single precision?
The use of (32-bit) floats would appear to be hard-coded in the compute_weighted_loss() function used by sigmoid_cross_entropy in TensorFlow.
As a minor point, your numpy code for calculating ce isn't very numerically stable, but it won't be affecting anything here. I'd implement it as:
ce = p * -np.log(sig) + (1-p) * -np.log1p(-sig)
The use of log1p is the main change. Your use of 1 - sig will lose all precision as sig approaches zero.
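The precision loss is easy to reproduce with the standard library alone. In this small sketch the value of `sig` is an arbitrary tiny number chosen for illustration.

```python
import math

sig = 1e-20
# 1.0 - 1e-20 is not representable in float64 and rounds to exactly 1.0,
# so the naive form returns log(1.0) = 0.0 and the signal is gone.
naive = math.log(1.0 - sig)
# log1p(-sig) evaluates log(1 - sig) without forming 1 - sig first.
stable = math.log1p(-sig)
print(naive, stable)
```

The naive result is exactly zero, while `log1p` keeps a value close to `-sig`, which is what the cross-entropy term needs when the sigmoid output is very small.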
how to add tensorflow loss functions?
You can just add them as multiple losses in model.compile, one loss per model output:
model.compile(loss = [loss1,loss2], loss_weights = [l1,l2], ...)
This translates to final_loss = l1*loss1 + l2*loss2. Just set l1 and l2 to 1.
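As far as I know, the list form assigns one loss per model output. If both losses should apply to the same output, one option is to build a single combined loss callable yourself. The sketch below shows the idea in plain Python; `combine_losses` is a hypothetical helper and the two toy scalar losses only stand in for loss1 and loss2.

```python
def combine_losses(losses, weights):
    # Returns a function computing sum_i weights[i] * losses[i](y_true, y_pred).
    def combined(y_true, y_pred):
        return sum(w * fn(y_true, y_pred) for fn, w in zip(losses, weights))
    return combined

# Toy scalar losses used only for illustration.
def squared_error(y_true, y_pred):
    return (y_true - y_pred) ** 2

def absolute_error(y_true, y_pred):
    return abs(y_true - y_pred)

final_loss = combine_losses([squared_error, absolute_error], [1.0, 1.0])
print(final_loss(3.0, 1.0))  # 4.0 + 2.0 = 6.0
```

A callable with the same (y_true, y_pred) signature, built from tensor-valued losses instead of these scalar toys, could then be passed as the single loss argument.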