What Are Logits? Difference between Softmax and Softmax_Cross_Entropy_With_Logits

What are logits? What is the difference between softmax and softmax_cross_entropy_with_logits?

Having "logits" in the name simply means that the function operates on the unscaled output of earlier layers and that the relative scale used to understand the units is linear. In particular, the sum of the inputs may not equal 1 and the values are not probabilities (you might have an input of 5). Internally, softmax_cross_entropy_with_logits first applies softmax to that unscaled output and then computes the cross entropy of those values vs. what they "should" be as defined by the labels.

tf.nn.softmax produces the result of applying the softmax function to an input tensor. The softmax "squishes" the inputs so that sum(input) = 1, and it does the mapping by interpreting the inputs as log-probabilities (logits) and then converting them back into raw probabilities between 0 and 1. The shape of the output of a softmax is the same as that of the input:

import numpy as np
import tensorflow as tf  # TF 1.x API

a = tf.constant(np.array([[.1, .3, .5, .9]]))
with tf.Session() as s:
    print(s.run(tf.nn.softmax(a)))
# [[ 0.16838508  0.205666    0.25120102  0.37474789]]

See this answer for more about why softmax is used extensively in DNNs.

tf.nn.softmax_cross_entropy_with_logits combines the softmax step with the calculation of the cross-entropy loss in a single, more mathematically careful operation. It's similar to the result of:

sm = tf.nn.softmax(x)
ce = cross_entropy(sm, labels)  # pseudocode: cross entropy of sm against the true labels

The cross entropy is a summary metric: it sums across the class dimension. The output of tf.nn.softmax_cross_entropy_with_logits on a shape [2,5] tensor is of shape [2], one loss value per example (the first dimension is treated as the batch).

If you want to do optimization to minimize the cross entropy AND you're softmaxing after your last layer, you should use tf.nn.softmax_cross_entropy_with_logits instead of doing it yourself, because it covers numerically unstable corner cases in the mathematically right way. Otherwise, you'll end up hacking it by adding little epsilons here and there.
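As a rough illustration, here is a minimal TF 1.x sketch (the toy logits and one-hot labels are made up for the example) comparing the manual two-step version with the fused op; they agree for well-behaved inputs, but the fused op stays numerically stable when the logits get large:

import tensorflow as tf  # TF 1.x API

logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.5, 2.5, 0.3]])   # unscaled scores, shape [2, 3]
labels = tf.constant([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])   # one-hot targets, shape [2, 3]

# Manual version: softmax, then cross entropy by hand.
sm = tf.nn.softmax(logits)
manual = -tf.reduce_sum(labels * tf.log(sm), axis=1)   # shape [2], one loss per example

# Fused version: same result here, handled in a numerically careful way.
fused = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)

with tf.Session() as sess:
    print(sess.run(manual))
    print(sess.run(fused))   # should closely match the manual values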

Edited 2016-02-07:
If you have single-class labels, where an object can only belong to one class, you might now consider using tf.nn.sparse_softmax_cross_entropy_with_logits so that you don't have to convert your labels to a dense one-hot array. This function was added after release 0.6.0.

What's the difference between sparse_softmax_cross_entropy_with_logits and softmax_cross_entropy_with_logits?

Having two different functions is a convenience, as they produce the same result.

The difference is simple:

  • For sparse_softmax_cross_entropy_with_logits, labels must have the shape [batch_size] and the dtype int32 or int64. Each label is an int in range [0, num_classes-1].
  • For softmax_cross_entropy_with_logits, labels must have the shape [batch_size, num_classes] and dtype float32 or float64.

Labels used in softmax_cross_entropy_with_logits are the one-hot version of the labels used in sparse_softmax_cross_entropy_with_logits.

Another tiny difference is that with sparse_softmax_cross_entropy_with_logits, you can give -1 as a label to get a loss of 0 for that entry.
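A quick sketch of that correspondence in TF 1.x (the toy logits and class indices below are invented for the example): converting the integer labels to one-hot and feeding them to the dense op should give the same per-example losses as feeding the integers to the sparse op.

import tensorflow as tf  # TF 1.x API

logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.5, 2.5, 0.3]])   # shape [batch_size, num_classes]
sparse_labels = tf.constant([0, 1])       # shape [batch_size], int class indices

sparse_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=sparse_labels, logits=logits)

dense_labels = tf.one_hot(sparse_labels, depth=3)   # shape [batch_size, num_classes]
dense_loss = tf.nn.softmax_cross_entropy_with_logits(
    labels=dense_labels, logits=logits)

with tf.Session() as sess:
    print(sess.run(sparse_loss))   # same per-example values...
    print(sess.run(dense_loss))    # ...as here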

What's the difference between softmax_cross_entropy_with_logits and losses.log_loss?

These functions are not so different in theory; however, they have a number of differences in implementation:

1) tf.nn.softmax_cross_entropy_with_logits is designed for single-class (mutually exclusive) labels, while tf.losses.log_loss can also be used for multi-label classification, where an example may belong to several classes at once. tf.nn.softmax_cross_entropy_with_logits won't throw an error if you feed it such multi-label targets, but your gradients won't be calculated correctly and training will most probably fail.

From the official documentation:

NOTE: While the classes are mutually exclusive, their probabilities need not be. All that is required is that each row of labels is a valid probability distribution. If they are not, the computation of the gradient will be incorrect.

2) tf.nn.softmax_cross_entropy_with_logits (as the name suggests) first applies the softmax function to your predictions, while log_loss does not.

3) tf.losses.log_loss has somewhat wider functionality, in the sense that you can weight each element of the loss and you can specify epsilon, which is used in the calculation to avoid log(0).

4) Finally, tf.nn.softmax_cross_entropy_with_logits returns a loss for every entry in the batch, while tf.losses.log_loss returns a single reduced value (reduced over the batch by default) that can be passed directly to an optimizer.

UPD: Another difference is the way they calculate the loss. Logarithmic loss also takes the negative classes (those where you have 0s in the label vector) into account. In short, cross-entropy loss forces the network to produce the maximum value for the correct class and does not care about the negative classes. Logarithmic loss does both at the same time: it forces the correct classes to have larger values and the negative classes to have smaller ones. Mathematically it looks as follows:

Cross-entropy loss:

  L = - sum_i ( y_i * log(p_i) )

Logarithmic loss:

  L = - sum_i ( y_i * log(p_i) + (1 - y_i) * log(1 - p_i) )

where i runs over the classes, y_i is the label for class i and p_i is the predicted probability for class i.

So for example, if you have labels=[1,0] and predictions_with_softmax = [0.7,0.3], then:

1) Cross-entropy loss: -(1 * log(0.7) + 0 * log(0.3)) = 0.3567

2) Logarithmic loss: -(1 * log(0.7) + (1 - 1) * log(1 - 0.7) + 0 * log(0.3) + (1 - 0) * log(1 - 0.3)) = -(log(0.7) + log(0.7)) = 0.7133

And with its default settings, tf.losses.log_loss reduces this by dividing the summed loss by the number of non-zero-weight elements (here it's 2). So finally: tf.losses.log_loss = 0.7133 / 2 = 0.3566

In this case we got equal outputs, but that is not always the case.
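To sanity-check the arithmetic above, here is a small TF 1.x sketch (the logits are derived from the example's post-softmax predictions, since the fused op expects unscaled scores; the printed values are approximate):

import tensorflow as tf  # TF 1.x API

labels = tf.constant([[1.0, 0.0]])
probs  = tf.constant([[0.7, 0.3]])   # predictions after softmax
logits = tf.log(probs)               # softmax(log(p)) == p, so these logits reproduce probs

ce = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)  # shape [1], per example
ll = tf.losses.log_loss(labels=labels, predictions=probs)                   # reduced to a scalar

with tf.Session() as sess:
    print(sess.run(ce))   # ~[0.3567]
    print(sess.run(ll))   # ~0.3567, assuming the default reduction divides by the non-zero-weight count (2 here)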

Tensorflow, difference between tf.nn.softmax_cross_entropy_with_logits and tf.nn.sparse_softmax_cross_entropy_with_logits

The difference is in the labels each op accepts: tf.nn.softmax_cross_entropy_with_logits takes a full probability distribution per example (soft or one-hot labels), so it doesn't assume each example is labeled with exactly one hard class index, whereas the sparse variant does. Note that both softmax variants still treat the classes as mutually exclusive; if your classes are genuinely not mutually exclusive (true multilabel classification), the op to use is tf.nn.sigmoid_cross_entropy_with_logits, whose documentation reads:

Measures the probability error in discrete classification tasks in
which each class is independent and not mutually exclusive. For
instance, one could perform multilabel classification where a picture
can contain both an elephant and a dog at the same time.

Compare with sparse_*:

Measures the probability error in discrete classification tasks in
which the classes are mutually exclusive (each entry is in exactly one
class). For example, each CIFAR-10 image is labeled with one and only
one label: an image can be a dog or a truck, but not both.

As such, with the sparse function the shapes of logits and labels are not the same: labels contain one class index per example, whereas logits contain num_classes values per example (unnormalized scores, one per class).

Why does doing softmax and cross entropy separately produce a different result than doing them together using softmax_cross_entropy_with_logits?

If you use tf.reduce_sum() in the upper example, as you did in the lower one, you should be able to achieve similar results with both methods: cost = tf.reduce_mean(tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y))).

I increased the number of training epochs to 50 and achieved accuracies of 93.06% (tf.nn.softmax_cross_entropy_with_logits()) and 93.24% (softmax and cross entropy separately), so the results are quite similar.

In Tensorflow, what is the difference between sampled_softmax_loss and softmax_cross_entropy_with_logits

If your target vocabulary (in other words, the number of classes you want to predict) is really big, it is very hard to use the regular softmax, because you have to calculate the probability for every word in the dictionary. By using sampled_softmax_loss you only take a subset V of your vocabulary into account when calculating your loss.

Sampled softmax only makes sense if the sample (our V) is smaller than the vocabulary size. If your vocabulary (the number of labels) is small, there is no point in using sampled_softmax_loss.
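A rough sketch of how this looks in TF 1.x (the vocabulary size, dimensions and variable names below are made up; sampled_softmax_loss is typically used only at training time, with the full softmax kept for evaluation):

import tensorflow as tf  # TF 1.x API

vocab_size, embed_dim, num_sampled = 10000, 128, 64   # made-up sizes

inputs = tf.placeholder(tf.float32, [None, embed_dim])  # hidden-layer activations per example
labels = tf.placeholder(tf.int64, [None, 1])            # true class index per example

# Output projection, stored as [num_classes, dim] as the op expects.
out_w = tf.get_variable("out_w", [vocab_size, embed_dim])
out_b = tf.get_variable("out_b", [vocab_size])

# Training loss: only num_sampled negative classes are evaluated per step.
train_loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=out_w, biases=out_b, labels=labels, inputs=inputs,
    num_sampled=num_sampled, num_classes=vocab_size))

# Evaluation: score against the full vocabulary.
full_logits = tf.matmul(inputs, out_w, transpose_b=True) + out_b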

You can see implementation details in this paper:
http://arxiv.org/pdf/1412.2007v2.pdf

You can also see an example where it is used: the sequence-to-sequence translation example.


