How to add an attention mechanism in Keras?

If you want attention along the time dimension, then this part of your code seems correct to me:

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)

sent_representation = merge([activations, attention], mode='mul')

You've worked out the attention vector of shape (batch_size, max_length):

attention = Activation('softmax')(attention)

I've never seen this code before, so I can't say if this one is actually correct or not:

K.sum(xin, axis=-2)
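To check the shapes step by step, here is a minimal NumPy sketch of the same pipeline. It is not Keras code: the sizes are made up, and the random `w` and `b` are only stand-ins for the trained Dense(1) kernel and bias. It suggests that summing over axis=-2 collapses the time dimension and leaves one vector of size units per sample. Note also that `merge(..., mode='mul')` is the Keras 1 API; in tf.keras the element-wise product is the `Multiply` layer.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, max_length, units = 2, 4, 3  # illustrative sizes (assumed)

# Stand-in for the LSTM output with return_sequences=True
activations = rng.normal(size=(batch_size, max_length, units))

# Dense(1, activation='tanh'): one score per time step
w = rng.normal(size=(units, 1))  # hypothetical Dense kernel
b = np.zeros(1)                  # hypothetical Dense bias
scores = np.tanh(activations @ w + b)  # (batch, max_length, 1)

# Flatten, then softmax over the time steps
scores = scores.reshape(batch_size, max_length)
exp = np.exp(scores - scores.max(axis=1, keepdims=True))
attention = exp / exp.sum(axis=1, keepdims=True)  # (batch, max_length)

# RepeatVector(units) + Permute([2, 1]): copy each weight across the units
attention_3d = np.repeat(attention[:, :, None], units, axis=2)

# Element-wise product, then sum over the time axis (axis=-2)
weighted = activations * attention_3d
sent_representation = weighted.sum(axis=-2)  # (batch, units)
```

Each row of `attention` sums to 1, so `sent_representation` is a weighted average of the LSTM outputs over time.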

How to add attention layer to a Bi-LSTM

One possible solution is a custom layer that computes attention over the positional/temporal dimension:

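from tensorflow.keras.layers import Layer
from tensorflow.keras import backend as K

class Attention(Layer):

    def __init__(self, return_sequences=True):
        super().__init__()
        self.return_sequences = return_sequences

    def build(self, input_shape):
        # one score weight per feature, one bias per time step
        # (the initializers are an assumed, conventional choice)
        self.W = self.add_weight(name="att_weight", shape=(input_shape[-1], 1),
                                 initializer="random_normal")
        self.b = self.add_weight(name="att_bias", shape=(input_shape[1], 1),
                                 initializer="zeros")
        super().build(input_shape)

    def call(self, x):
        # score each time step, then normalize the scores over the time axis
        e = K.tanh(K.dot(x, self.W) + self.b)
        a = K.softmax(e, axis=1)
        output = x * a

        if self.return_sequences:
            return output

        return K.sum(output, axis=1)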

It's built to receive 3D tensors and output 3D tensors (return_sequences=True) or 2D tensors (return_sequences=False). Below is a dummy example.
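To make the two output modes concrete, here is a rough NumPy mirror of what the layer's call() computes. It is only a sketch: `attention_forward` is a hypothetical helper, and the random `W` and zero `b` stand in for the trained att_weight and att_bias.

```python
import numpy as np

def attention_forward(x, W, b, return_sequences=True):
    """NumPy mirror of the layer's call(); x is (batch, time, features)."""
    e = np.tanh(x @ W + b)                    # (batch, time, 1) scores
    e = e - e.max(axis=1, keepdims=True)      # stabilize the softmax
    a = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)  # softmax over time
    output = x * a                            # broadcast weights over features
    return output if return_sequences else output.sum(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 100, 64))  # e.g. Bidirectional(LSTM(32)) gives 64 features
W = rng.normal(size=(64, 1))       # hypothetical att_weight
b = np.zeros((100, 1))             # hypothetical att_bias, one entry per time step

seq_out = attention_forward(x, W, b, return_sequences=True)
vec_out = attention_forward(x, W, b, return_sequences=False)
print(seq_out.shape, vec_out.shape)  # (5, 100, 64) (5, 64)
```

Because the bias has shape (input_shape[1], 1), the layer is tied to a fixed sequence length, which is why the dummy example uses a fixed max_len.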

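import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

# dummy data creation

max_len = 100
max_words = 333
emb_dim = 126

n_sample = 5
X = np.random.randint(0, max_words, (n_sample, max_len))  # token ids
Y = np.random.randint(0, 2, n_sample)                      # binary labels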

With return_sequences=True:

model = Sequential()
model.add(Embedding(max_words, emb_dim, input_length=max_len))
model.add(Bidirectional(LSTM(32, return_sequences=True)))
model.add(Attention(return_sequences=True)) # receive 3D and output 3D
model.add(Dense(1, activation='sigmoid'))

model.compile('adam', 'binary_crossentropy')
model.fit(X, Y, epochs=3)

With return_sequences=False:

model = Sequential()
model.add(Embedding(max_words, emb_dim, input_length=max_len))
model.add(Bidirectional(LSTM(32, return_sequences=True)))
model.add(Attention(return_sequences=False)) # receive 3D and output 2D
model.add(Dense(1, activation='sigmoid'))

model.compile('adam', 'binary_crossentropy')
model.fit(X, Y, epochs=3)
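One thing worth keeping in mind about the two variants: Dense acts on the last axis only, so the final Dense(1) gives one prediction per time step when it receives a 3D tensor and one prediction per sample when it receives a 2D tensor. The NumPy sketch below only illustrates the shapes; the `kernel` is a hypothetical stand-in for the Dense weights, and the bias is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sample, max_len, features = 5, 100, 64
kernel = rng.normal(size=(features, 1))  # hypothetical Dense(1) kernel

def dense_sigmoid(x):
    # Dense acts on the last axis only; leading axes are kept
    return 1.0 / (1.0 + np.exp(-(x @ kernel)))

out_3d = dense_sigmoid(rng.normal(size=(n_sample, max_len, features)))
out_2d = dense_sigmoid(rng.normal(size=(n_sample, features)))
print(out_3d.shape, out_2d.shape)  # (5, 100, 1) (5, 1)
```

So per-sample labels like Y, of shape (n_sample,), correspond to the output of the return_sequences=False model; the return_sequences=True variant is probably more useful when another sequence layer follows the attention.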

You can integrate it into your networks easily.


