Tensorflow: How to Replace or Modify Gradient

Tensorflow: How to replace or modify gradient?

For TensorFlow 1.7 and TensorFlow 2.0, see the edit below.

First define your custom gradient:

@tf.RegisterGradient("CustomGrad")
def _const_mul_grad(unused_op, grad):
  return 5.0 * grad

Since you want nothing to happen in the forward pass, override the gradient of an identity operation with your new gradient:

g = tf.get_default_graph()
with g.gradient_override_map({"Identity": "CustomGrad"}):
  output = tf.identity(input, name="Identity")

Here is a working example with a layer that clips gradients in the backwards pass and does nothing in the forwards pass, using the same method:

import tensorflow as tf

@tf.RegisterGradient("CustomClipGrad")
def _clip_grad(unused_op, grad):
  return tf.clip_by_value(grad, -0.1, 0.1)

input = tf.Variable([3.0], dtype=tf.float32)

g = tf.get_default_graph()
with g.gradient_override_map({"Identity": "CustomClipGrad"}):
  output_clip = tf.identity(input, name="Identity")
grad_clip = tf.gradients(output_clip, input)

# output without gradient clipping in the backwards pass for comparison:
output = tf.identity(input)
grad = tf.gradients(output, input)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  print("with clipping:", sess.run(grad_clip)[0])
  print("without clipping:", sess.run(grad)[0])
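
On a TensorFlow 2.x install, tf.get_default_graph and tf.Session are no longer available at the top level, so the graph-mode example above would need the tf.compat.v1 entry points. The following is a sketch of such a port, not something taken from the original answer. The gradient name "CustomClipGradCompat", the tf1 alias, and the use of a constant input instead of a variable are my own choices for illustration.

```python
import tensorflow as tf

tf1 = tf.compat.v1  # alias chosen here for brevity

# The registered name is arbitrary; it only has to be unique in the process.
@tf.RegisterGradient("CustomClipGradCompat")
def _clip_grad_compat(unused_op, grad):
    return tf.clip_by_value(grad, -0.1, 0.1)

with tf.Graph().as_default() as g:
    x = tf.constant([3.0])

    # Identity ops created inside this scope use the clipped gradient.
    with g.gradient_override_map({"Identity": "CustomClipGradCompat"}):
        y_clip = tf.identity(x, name="Identity")

    # Identity op created outside the scope keeps the default gradient.
    y_plain = tf.identity(x)

    grad_clip = tf1.gradients(y_clip, x)
    grad_plain = tf1.gradients(y_plain, x)

    with tf1.Session() as sess:
        clipped_value, plain_value = sess.run([grad_clip[0], grad_plain[0]])

print("with clipping:", clipped_value)
print("without clipping:", plain_value)
```

The clipped gradient should come out as 0.1 and the unclipped one as 1.0, since the upstream gradient of an identity op is 1.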

Edit for TensorFlow 1.7 and TensorFlow 2.0

Since 1.7 there is a new way to redefine the gradient with a shorter syntax, which also works with TensorFlow 2.0. It also allows you to redefine the gradient of multiple operations at the same time. Here are the examples from above, rewritten for TensorFlow 1.7 and TensorFlow 2.0:

Layer that scales gradients in the backward pass:

@tf.custom_gradient
def scale_grad_layer(x):
  def grad(dy):
    return 5.0 * dy
  return tf.identity(x), grad

Example with a layer that clips gradients in the backward pass:

@tf.custom_gradient
def clip_grad_layer(x):
  def grad(dy):
    return tf.clip_by_value(dy, -0.1, 0.1)
  return tf.identity(x), grad
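
To check that these layers behave as intended, here is a small usage sketch for TensorFlow 2.x eager mode. It repeats the two definitions so that it runs on its own. The input value 3.0 and the use of a watched constant are assumptions made for the demonstration.

```python
import tensorflow as tf

@tf.custom_gradient
def scale_grad_layer(x):
    def grad(dy):
        return 5.0 * dy
    return tf.identity(x), grad

@tf.custom_gradient
def clip_grad_layer(x):
    def grad(dy):
        return tf.clip_by_value(dy, -0.1, 0.1)
    return tf.identity(x), grad

x = tf.constant([3.0])

# persistent=True lets us ask the same tape for several gradients.
with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)  # constants are not watched automatically
    y_plain = tf.identity(x)
    y_scale = scale_grad_layer(x)
    y_clip = clip_grad_layer(x)

grad_plain = tape.gradient(y_plain, x)
grad_scale = tape.gradient(y_scale, x)
grad_clip = tape.gradient(y_clip, x)

print("forward values:", y_plain.numpy(), y_scale.numpy(), y_clip.numpy())
print("plain gradient:", grad_plain.numpy())
print("scaled gradient:", grad_scale.numpy())
print("clipped gradient:", grad_clip.numpy())
```

All three forward values are identical to the input. Only the gradients differ: 1.0 for the plain identity, 5.0 for the scaling layer, and 0.1 for the clipping layer.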

How to use gradient_override_map in Tensorflow 2.0?

There is no built-in mechanism in TensorFlow 2.0 to override all gradients for a built-in operator within a scope. However, if you are able to modify the call-site for each call to the built-in operator, you can use the tf.custom_gradient decorator as follows:

@tf.custom_gradient
def custom_square(x):
  def grad(dy):
    return tf.constant(0.0)
  return tf.square(x), grad

with tf.Graph().as_default() as g:
  x = tf.Variable(5.0)
  with tf.GradientTape() as tape:
    s_2 = custom_square(x)

  with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    print(sess.run(tape.gradient(s_2, x)))
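
If you do not need a graph and a session, the same idea should also work in plain eager mode. This is a sketch under that assumption rather than part of the original answer. It watches a constant instead of creating a variable and returns tf.zeros_like(dy) so that the gradient has the same shape as the input.

```python
import tensorflow as tf

@tf.custom_gradient
def custom_square(x):
    def grad(dy):
        # Replace the true gradient (2 * x * dy) with zeros.
        return tf.zeros_like(dy)
    return tf.square(x), grad

x = tf.constant(5.0)

with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    s_custom = custom_square(x)   # forward value is still x ** 2
    s_builtin = tf.square(x)      # built-in op keeps its usual gradient

grad_custom = tape.gradient(s_custom, x)
grad_builtin = tape.gradient(s_builtin, x)

print("forward:", s_custom.numpy())
print("custom gradient:", grad_custom.numpy())
print("built-in gradient:", grad_builtin.numpy())
```

Note that only calls routed through custom_square are affected. Any remaining direct call to tf.square keeps the default gradient, which is why every call-site has to be changed.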

How to override gradient vector calculation method for optimization algos in Keras, Tensorflow?

You should be able to do the following:

import tensorflow as tf
from tensorflow import keras

class CustomModel(keras.Model):
    def train_step(self, data):
        # Unpack the data. Its structure depends on your model and
        # on what you pass to `fit()`.
        x, y = data

        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)  # Forward pass
            # Compute the loss value
            # (the loss function is configured in `compile()`)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)

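        # Compute per-example gradients: one entry per variable, each of
        # shape (batch_size,) + variable shape
        trainable_vars = self.trainable_variables
        gradients = tape.jacobian(loss, trainable_vars)

        # `do_something_to` is a placeholder for your own function. It must
        # reduce the batch axis so the result has the variable's shape.
        new_gradients = []
        for grad in gradients:
            new_grad = do_something_to(grad)
            new_gradients.append(new_grad)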

        # Update weights
        self.optimizer.apply_gradients(zip(new_gradients, trainable_vars))
        # Update metrics (includes the metric that tracks the loss)
        self.compiled_metrics.update_state(y, y_pred)
        # Return a dict mapping metric names to current value
        return {m.name: m.result() for m in self.metrics}

One important note: the loss returned by compiled_loss must not be averaged over the batch axis, i.e. I'm assuming it is a tensor of shape (batch_size,), not a scalar.

This causes jacobian to return gradients of shape (batch_size,) + variable_shape, that is, you now have one gradient per batch element. You can manipulate these gradients however you want, but at some point you have to get rid of the additional batch axis (e.g. by averaging), so that new_grad has the same shape as the corresponding variable.
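
As an illustration of what do_something_to could look like, here is one hypothetical choice: clip each example's gradient to a maximum L2 norm and then average over the batch. Neither the clipping rule nor the clip_norm parameter comes from the answer above. They only demonstrate the required shape handling.

```python
import tensorflow as tf

def do_something_to(per_example_grad, clip_norm=1.0):
    """Hypothetical reduction: per-example L2 clipping, then batch mean.

    per_example_grad has shape (batch_size,) + variable_shape;
    the result has shape variable_shape.
    """
    batch_size = tf.shape(per_example_grad)[0]
    variable_shape = tf.shape(per_example_grad)[1:]

    # Flatten everything except the batch axis: one row per example.
    flat = tf.reshape(per_example_grad, [batch_size, -1])

    # Clip each row independently to an L2 norm of at most clip_norm.
    clipped = tf.clip_by_norm(flat, clip_norm, axes=[1])

    # Remove the batch axis by averaging, then restore the variable shape.
    mean_grad = tf.reduce_mean(clipped, axis=0)
    return tf.reshape(mean_grad, variable_shape)

# Two examples for a variable of shape (2,).
example_grads = tf.constant([[3.0, 4.0],    # norm 5.0, scaled to [0.6, 0.8]
                             [0.0, 0.5]])   # norm 0.5, left unchanged
reduced = do_something_to(example_grads)
print(reduced.numpy())  # approximately [0.3, 0.65]
```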

Regarding your last comment: As I mentioned, the loss function indeed needs to return one loss per data point, i.e. must not average over the batch. However, this is not enough because if you were to give this vector to tape.gradient, the gradient function will simply sum up the loss values (since it only works with scalars). This is why jacobian is necessary.
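
The difference between tape.gradient and tape.jacobian on a vector-valued loss can be seen on a toy problem. The sketch below uses a made-up linear model with a squared-error loss. The values of w, x and y are chosen purely for illustration.

```python
import tensorflow as tf

w = tf.Variable([1.0, 2.0])
x = tf.constant([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])          # batch of 3 examples
y = tf.constant([0.0, 0.0, 0.0])

with tf.GradientTape(persistent=True) as tape:
    pred = tf.linalg.matvec(x, w)           # shape (3,)
    per_example_loss = tf.square(pred - y)  # shape (3,), not averaged

# gradient() implicitly sums the loss vector before differentiating.
summed_grad = tape.gradient(per_example_loss, w)        # shape (2,)

# jacobian() keeps one gradient per loss entry.
per_example_grad = tape.jacobian(per_example_loss, w)   # shape (3, 2)

print("gradient:", summed_grad.numpy())
print("jacobian:\n", per_example_grad.numpy())
print("jacobian summed over batch:",
      tf.reduce_sum(per_example_grad, axis=0).numpy())
```

Summing the jacobian over the batch axis reproduces the result of tape.gradient, which confirms that gradient alone cannot give you the per-example information.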

Finally, jacobian can be very slow. In the worst case, run time may be multiplied by batch size because it needs to compute that many separate gradients. However, this is done in parallel to some degree so the slowdown might not be as bad.
