Using a pre-trained word embedding (word2vec or GloVe) in TensorFlow
There are a few ways that you can use a pre-trained embedding in TensorFlow. Let's say that you have the embedding in a NumPy array called embedding, with vocab_size rows and embedding_dim columns, and you want to create a tensor W that can be used in a call to tf.nn.embedding_lookup().

1. Simply create W as a tf.constant() that takes embedding as its value:

W = tf.constant(embedding, name="W")

This is the easiest approach, but it is not memory-efficient, because the value of a tf.constant() is stored multiple times in memory. Since embedding can be very large, you should only use this approach for toy examples.

2. Create W as a tf.Variable and initialize it from the NumPy array via a tf.placeholder():

W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                trainable=False, name="W")
embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
embedding_init = W.assign(embedding_placeholder)
# ...
sess = tf.Session()
sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})

This avoids storing a copy of embedding in the graph, but it does require enough memory to keep two copies of the matrix at once (one for the NumPy array, and one for the tf.Variable). Note that I've assumed that you want to hold the embedding matrix constant during training, so W is created with trainable=False.

3. If the embedding was trained as part of another TensorFlow model, you can use a tf.train.Saver to load the value from the other model's checkpoint file. This means that the embedding matrix can bypass Python altogether. Create W as in option 2, then do the following:

W = tf.Variable(...)
embedding_saver = tf.train.Saver({"name_of_variable_in_other_model": W})
# ...
sess = tf.Session()
embedding_saver.restore(sess, "checkpoint_filename.ckpt")
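In TensorFlow 2.x, where eager execution replaces sessions and placeholders, option 2 collapses to seeding a tf.Variable directly from the NumPy array. A minimal sketch (the shapes and values below are made up for illustration):

```python
import numpy as np
import tensorflow as tf

vocab_size, embedding_dim = 1000, 50
embedding = np.random.rand(vocab_size, embedding_dim).astype(np.float32)

# TF 2.x: seed the variable directly from the NumPy array; no placeholder needed
W = tf.Variable(embedding, trainable=False, name="W")

ids = tf.constant([1, 7, 42])
vectors = tf.nn.embedding_lookup(W, ids)  # rows 1, 7 and 42 of the matrix
```

The trainable=False flag plays the same role as in option 2: it keeps the embedding matrix fixed during training.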
How to replace a Keras Embedding layer with a pre-trained word embedding in a CNN
This reads the text file containing the weights, stores the words and their weights in a dictionary, then maps them into a new matrix using the vocabulary of your fitted tokenizer.
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Convolution1D, Flatten, Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.callbacks import TensorBoard
from tensorflow import keras
import numpy as np
# Using keras to load the dataset with the top_words
top_words = 10000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
word_index = keras.datasets.imdb.get_word_index()
# imdb.load_data() offsets every word index by 3 (indices 0-2 are reserved),
# so shift the raw indices to match the loaded sequences
word_index = {word: index + 3 for word, index in word_index.items()}
embedding_vecor_length = 300 # same as the embeds to be loaded below
embeddings_dictionary = dict()
glove_file = open('./embeds/glove.6B.300d.txt', 'rb')
for line in glove_file:
    records = line.split()  # separates each line on whitespace
    word = records[0]  # the first element is the word
    # the rest are the weights
    vector_dimensions = np.asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions  # store in the dictionary
glove_file.close()
# len_of_vocab = len(word_index)
embeddings_matrix = np.zeros((top_words, embedding_vecor_length))
# mapping to a new matrix, using only the words in your tokenizer's vocabulary
for word, index in word_index.items():
    if index >= top_words:
        continue
    # the weight vector for this word, if it is in the pre-trained vocabulary
    embedding_vector = embeddings_dictionary.get(bytes(word, 'utf-8'))
    if embedding_vector is not None:
        embeddings_matrix[index] = embedding_vector
# Pad the sequence to the same length
max_review_length = 1600
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
# Using embedding from Keras
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length,
input_length=max_review_length, name="embeddinglayer", weights=[embeddings_matrix], trainable=True))
# Convolutional model (3x conv, flatten, 2x dense)
model.add(Convolution1D(64, 3, padding='same'))
model.add(Convolution1D(32, 3, padding='same'))
model.add(Convolution1D(16, 3, padding='same'))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(180, activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
# Log to tensorboard
tensorBoardCallback = TensorBoard(log_dir='./logs', write_graph=True)
model.compile(loss='binary_crossentropy',
optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=3, callbacks=[
tensorBoardCallback], batch_size=64)
# Evaluation on the test set
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
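The dictionary-to-matrix mapping at the heart of the script above can be seen in isolation with a toy embeddings dictionary standing in for the GloVe file (the words, vectors, and indices below are made up):

```python
import numpy as np

# stand-in for the GloVe dictionary; keys are bytes, as in the script above
embeddings_dictionary = {
    b'good': np.array([0.1, 0.2, 0.3], dtype='float32'),
    b'bad': np.array([0.9, 0.8, 0.7], dtype='float32'),
}
word_index = {'good': 1, 'bad': 2, 'unseen': 3}
top_words, dim = 4, 3

embeddings_matrix = np.zeros((top_words, dim))
for word, index in word_index.items():
    if index >= top_words:
        continue
    vector = embeddings_dictionary.get(bytes(word, 'utf-8'))
    if vector is not None:
        embeddings_matrix[index] = vector
# rows for words absent from the dictionary ('unseen') stay all-zero
```

Row 0 also stays all-zero, which matches the Keras convention of reserving index 0.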
Is there a way to use pre-trained Embedding with Tf-Idf in tensorflow?

The most common approach is to multiply each word vector by its corresponding tf_idf score. One often sees this approach in academic papers. You could do something like this:

Create tfidf scores:
import tensorflow as tf
import numpy as np
import gensim.downloader as api
from sklearn.feature_extraction.text import TfidfVectorizer
import collections
def td_idf_word2weight(text):
    print("Creating TfidfVectorizer...")
    tfidf = TfidfVectorizer(preprocessor=' '.join)
    tfidf.fit(text)
    # if a word was never seen, treat it as at least as infrequent as any known word
    max_idf = max(tfidf.idf_)
    return collections.defaultdict(
        lambda: max_idf,
        [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])
text = [['she let the balloon float up into the air with her hopes and dreams'],
['the old rusted farm equipment surrounded the house predicting its demise'],
['he was so preoccupied with whether or not he could that he failed to stop to consider if he should']]
td_idf = td_idf_word2weight(text)
text = np.concatenate(text)
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(text)
text_sequences = tokenizer.texts_to_sequences(text)
text_sequences = tf.keras.preprocessing.sequence.pad_sequences(text_sequences, padding='post')
vocab_size = len(tokenizer.word_index) + 1
print(td_idf.items())
print(vocab_size)
Creating TfidfVectorizer...
dict_items([('she', 1.6931471805599454), ('let', 1.6931471805599454), ('the', 1.2876820724517808), ('balloon', 1.6931471805599454), ('float', 1.6931471805599454), ('up', 1.6931471805599454), ('into', 1.6931471805599454), ('air', 1.6931471805599454), ('with', 1.2876820724517808), ('her', 1.6931471805599454), ('hopes', 1.6931471805599454), ('and', 1.6931471805599454), ('dreams', 1.6931471805599454), ('old', 1.6931471805599454), ('rusted', 1.6931471805599454), ('farm', 1.6931471805599454), ('equipment', 1.6931471805599454), ('surrounded', 1.6931471805599454), ('house', 1.6931471805599454), ('predicting', 1.6931471805599454), ('its', 1.6931471805599454), ('demise', 1.6931471805599454), ('he', 1.6931471805599454), ('was', 1.6931471805599454), ('so', 1.6931471805599454), ('preoccupied', 1.6931471805599454), ('whether', 1.6931471805599454), ('or', 1.6931471805599454), ('not', 1.6931471805599454), ('could', 1.6931471805599454), ('that', 1.6931471805599454), ('failed', 1.6931471805599454), ('to', 1.6931471805599454), ('stop', 1.6931471805599454), ('consider', 1.6931471805599454), ('if', 1.6931471805599454), ('should', 1.6931471805599454)])
38
Create the tf_idf-weighted embeddings matrix:
model = api.load("glove-twitter-25")
embedding_dim = 25
weight_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    try:
        embedding_vector = model[word] * td_idf[word]
        weight_matrix[i] = embedding_vector
    except KeyError:
        weight_matrix[i] = np.random.uniform(-5, 5, embedding_dim)
print(weight_matrix.shape)
(38, 25)
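To see what the weighting buys you end to end, a document vector is often formed as the tf-idf-weighted average of its word vectors. A sketch with made-up 4-dimensional vectors and idf scores:

```python
import numpy as np

# made-up word vectors and idf scores, for illustration only
vectors = {'cat': np.array([1.0, 0.0, 0.0, 0.0]),
           'sat': np.array([0.0, 1.0, 0.0, 0.0]),
           'mat': np.array([0.0, 0.0, 1.0, 0.0])}
idf = {'cat': 1.0, 'sat': 2.0, 'mat': 3.0}

doc = ['cat', 'sat', 'mat']
weighted = np.array([vectors[w] * idf[w] for w in doc])  # scale each word vector
doc_vector = weighted.sum(axis=0) / sum(idf[w] for w in doc)  # weighted average
```

Rarer (higher-idf) words pull the document vector further in their direction than common ones.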
Load Pretrained Word2Vec Embedding in Tensorflow
Yes, the fit step tells the vocab_processor the index of each word (starting from 1) in the vocab array. transform then uses this mapping to look up the index for each word it is given, and pads the output with 0 up to max_document_length.
You can see that in a short example here:
vocab_processor = learn.preprocessing.VocabularyProcessor(5)
vocab = ['a', 'b', 'c', 'd', 'e']
pretrain = vocab_processor.fit(vocab)
pretrain == vocab_processor
# True
np.array(list(pretrain.transform(['a b c', 'b c d', 'a e', 'a b c d e'])))
# array([[1, 2, 3, 0, 0],
# [2, 3, 4, 0, 0],
# [1, 5, 0, 0, 0],
# [1, 2, 3, 4, 5]])
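VocabularyProcessor was removed along with tf.contrib in TensorFlow 2.x, but the fit/transform behaviour above is easy to sketch in plain Python (the transform helper below is illustrative, not part of any API):

```python
vocab = ['a', 'b', 'c', 'd', 'e']
word_to_index = {w: i + 1 for i, w in enumerate(vocab)}  # indices start from 1
max_document_length = 5

def transform(doc):
    # look up each word's index, then pad with 0 up to max_document_length
    ids = [word_to_index[w] for w in doc.split()]
    return ids + [0] * (max_document_length - len(ids))

transform('a b c')  # [1, 2, 3, 0, 0]
transform('a e')    # [1, 5, 0, 0, 0]
```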
How do I create a Keras Embedding layer from a pre-trained word embedding dataset?
You will need to pass an embeddingMatrix to the Embedding layer as follows:

Embedding(vocabLen, embDim, weights=[embeddingMatrix], trainable=isTrainable)

vocabLen: number of tokens in your vocabulary
embDim: embedding vector dimension (50 in your example)
embeddingMatrix: embedding matrix built from glove.6B.50d.txt
isTrainable: whether you want the embeddings to be trainable, or to freeze the layer

The glove.6B.50d.txt file is a list of whitespace-separated values: the word token followed by its (50) embedding values, e.g. the 0.418 0.24968 -0.41242 ...

To create a pretrainedEmbeddingLayer from a GloVe file:
# Prepare Glove File
def readGloveFile(gloveFile):
    with open(gloveFile, 'r') as f:
        wordToGlove = {}  # map from a token (word) to its Glove embedding vector
        wordToIndex = {}  # map from a token to an index
        indexToWord = {}  # map from an index to a token
        for line in f:
            record = line.strip().split()
            token = record[0]  # take the token (word) from the text line
            wordToGlove[token] = np.array(record[1:], dtype=np.float64)  # associate the Glove embedding vector with that token (word)
        tokens = sorted(wordToGlove.keys())
        for idx, tok in enumerate(tokens):
            kerasIdx = idx + 1  # 0 is reserved for masking in Keras (see above)
            wordToIndex[tok] = kerasIdx  # associate an index with a token (word)
            indexToWord[kerasIdx] = tok  # associate a token (word) with an index. Note: inverse of the dictionary above
    return wordToIndex, indexToWord, wordToGlove
# Create Pretrained Keras Embedding Layer
def createPretrainedEmbeddingLayer(wordToGlove, wordToIndex, isTrainable):
    vocabLen = len(wordToIndex) + 1  # adding 1 to account for masking
    embDim = next(iter(wordToGlove.values())).shape[0]  # works with any glove dimension (e.g. 50)
    embeddingMatrix = np.zeros((vocabLen, embDim))  # initialize with zeros
    for word, index in wordToIndex.items():
        embeddingMatrix[index, :] = wordToGlove[word]  # map word index to Glove word embedding
    embeddingLayer = Embedding(vocabLen, embDim, weights=[embeddingMatrix], trainable=isTrainable)
    return embeddingLayer
# usage
wordToIndex, indexToWord, wordToGlove = readGloveFile("/path/to/glove.6B.50d.txt")
pretrainedEmbeddingLayer = createPretrainedEmbeddingLayer(wordToGlove, wordToIndex, False)
model = Sequential()
model.add(pretrainedEmbeddingLayer)
...