Is there a simpler way to handle batch inputs from tfrecords?
The whole process is simplified using the Dataset API. Here are both parts: (1) convert a numpy array to TFRecords, and (2) read the TFRecords back to generate batches.
1. Creating TFRecords from a numpy array:
Example arrays:
import numpy as np
import tensorflow as tf

inputs = np.random.normal(size=(5, 32, 32, 3))
labels = np.random.randint(0, 2, size=(5,))

def npy_to_tfrecords(inputs, labels, filename):
    with tf.io.TFRecordWriter(filename) as writer:
        for X, y in zip(inputs, labels):
            # Feature contains a map of string to feature proto objects
            feature = {}
            feature['X'] = tf.train.Feature(float_list=tf.train.FloatList(value=X.flatten()))
            feature['y'] = tf.train.Feature(int64_list=tf.train.Int64List(value=[y]))
            # Construct the Example proto object
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            # Serialize the example to a string
            serialized = example.SerializeToString()
            # Write the serialized object to disk
            writer.write(serialized)

npy_to_tfrecords(inputs, labels, 'numpy.tfrecord')
2. Read the TFRecords using the Dataset API:
filenames = ['numpy.tfrecord']
dataset = tf.data.TFRecordDataset(filenames)  # available in TF 1.5 and above

# Example proto decode
def _parse_function(example_proto):
    keys_to_features = {'X': tf.io.FixedLenFeature(shape=(32, 32, 3), dtype=tf.float32),
                        'y': tf.io.FixedLenFeature((), tf.int64, default_value=0)}
    parsed_features = tf.io.parse_single_example(example_proto, keys_to_features)
    return parsed_features['X'], parsed_features['y']

# Parse the records into tensors.
dataset = dataset.map(_parse_function)
# Generate batches
dataset = dataset.batch(5)
Check that the generated batches are correct:
for data in dataset:
    break
np.testing.assert_allclose(inputs[0], data[0][0])
np.testing.assert_allclose(labels[0], data[1][0])
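Not part of the original answer, but if you want to take the batched dataset one step further and train on it, a minimal sketch with a throwaway Keras model could look like this (the model architecture and epoch count are placeholders):
# A minimal sketch of consuming the batched dataset for training.
# The model below is a placeholder; substitute your own architecture.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
    tf.keras.layers.Dense(2, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# In a real pipeline you would also shuffle before .batch(); prefetch
# overlaps input preprocessing with training.
train_ds = dataset.prefetch(1)
model.fit(train_ds, epochs=2)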
Numpy array to TFRecord
This tutorial will walk you through the process of creating TFRecords from your data:
https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564
However, there are easier ways of dealing with preprocessing now, using the Dataset input pipeline. I prefer to keep my data in its most original format and build a preprocessing pipeline to deal with it. Here's the primary guide you want to read to learn about the Dataset preprocessing pipeline:
https://www.tensorflow.org/programmers_guide/datasets
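To illustrate that idea, here is a hedged sketch of such a pipeline over raw in-memory data; the array shapes and the preprocess function are made up for the example:
import numpy as np
import tensorflow as tf

# Hypothetical raw data kept in its original form: uint8 images, int labels.
images = np.random.randint(0, 256, size=(100, 32, 32, 3), dtype=np.uint8)
labels = np.random.randint(0, 2, size=(100,))

def preprocess(image, label):
    # Cast and scale on the fly instead of storing preprocessed copies.
    image = tf.cast(image, tf.float32) / 255.0
    return image, label

dataset = tf.data.Dataset.from_tensor_slices((images, labels))
dataset = dataset.map(preprocess)
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.batch(32)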
Converting a Numpy file to TFRecord where each row contains a number and a variable-length list
I assume you want to add a number feature and a list feature, respectively:
import tensorflow as tf

# my_data: each row holds a single number followed by a variable-length list.
writer = tf.python_io.TFRecordWriter('test.tfrecords')
for index in range(my_data.shape[0]):
    example = tf.train.Example(features=tf.train.Features(feature={
        'num_value': tf.train.Feature(int64_list=tf.train.Int64List(value=[my_data[index][0]])),
        'list_value': tf.train.Feature(int64_list=tf.train.Int64List(value=my_data[index][1]))
    }))
    writer.write(example.SerializeToString())
writer.close()
# Read the data back from the TFRecords file.
record_iterator = tf.python_io.tf_record_iterator('test.tfrecords')
for _ in range(2):
    serialized_img_example = next(record_iterator)
    example = tf.train.Example()
    example.ParseFromString(serialized_img_example)
    num_value = example.features.feature['num_value'].int64_list.value[0]
    list_value = example.features.feature['list_value'].int64_list.value
    print(num_value, list_value)
# Output:
1446549 [491827, 30085, 1417541, 799563, 879302, 1997973, 1373049, 1460602, 2240973, 1172992, 1186011, 147536, 1958456, 3095889, 319954, 2191582, 1113354, 302626, 1985611, 1186704, 2231212, 2642148, 386962, 3072993, 1131255, 15085, 2714264, 1363205]
406529 [900479, 660976, 1270383, 1287181]
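The record_iterator check above is TF 1.x-style eager reading; if you later want to stream these records through the Dataset API instead, a sketch mixing a fixed-length and a variable-length feature might look like this (the parser name is my own):
def _parse_mixed(example_proto):
    # 'num_value' is a single int64; 'list_value' has variable length.
    features = {'num_value': tf.FixedLenFeature([], tf.int64),
                'list_value': tf.VarLenFeature(tf.int64)}
    parsed = tf.parse_single_example(example_proto, features)
    return parsed['num_value'], tf.sparse_tensor_to_dense(parsed['list_value'])

dataset = tf.data.TFRecordDataset(['test.tfrecords'])
dataset = dataset.map(_parse_mixed)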
Is there a simple way to set epochs when using TFRecords with Tensorflow Estimators
You can set the number of epochs with dataset.repeat(num_epochs). The Dataset pipeline yields (features, labels) tuples of the given batch size, which are fed to model.train():
dataset = tf.data.TFRecordDataset('file.tfrecords')
dataset = dataset.shuffle(buffer_size).repeat(num_epochs)
...
dataset = dataset.batch(batch_size)
To make this work, set model.train(steps=None, max_steps=None). In that case you let the Dataset API handle the epoch count: it raises a tf.errors.OutOfRangeError (or a StopIteration exception) once num_epochs is reached.
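For concreteness, here is a sketch of how this fits into an Estimator input_fn; the filenames, parser, and sizes are placeholders:
def make_input_fn(filenames, num_epochs, batch_size=32):
    def input_fn():
        dataset = tf.data.TFRecordDataset(filenames)
        dataset = dataset.map(_parse_function)  # a parser like the ones above
        dataset = dataset.shuffle(buffer_size=1000)
        dataset = dataset.repeat(num_epochs)    # the dataset, not the Estimator, ends training
        dataset = dataset.batch(batch_size)
        return dataset
    return input_fn

# With steps=None and max_steps=None, training stops when the dataset
# raises tf.errors.OutOfRangeError after num_epochs passes over the data.
# estimator.train(input_fn=make_input_fn(['file.tfrecords'], num_epochs=10))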
How to use the Dataset API to read a TFRecords file of lists of variable length?
After hours of searching and trying, I believe I have found the answer. Below is my code.
import numpy as np
import tensorflow as tf

def _int64_feature(value):
    # value must be a numpy array.
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value.flatten()))

# Write an array to a TFRecord file.
# a is an array which contains lists of variable length.
a = np.array([[0, 54, 91, 153, 177],
              [0, 50, 89, 147, 196],
              [0, 38, 79, 157],
              [0, 49, 89, 147, 177],
              [0, 32, 73, 145]], dtype=object)

writer = tf.python_io.TFRecordWriter('file')
for i in range(a.shape[0]):  # i = 0 ~ 4
    x_train = np.array(a[i])
    feature = {'i': _int64_feature(np.array([i])),
               'data': _int64_feature(x_train)}
    # Create an example protocol buffer
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    # Serialize to string and write to the file
    writer.write(example.SerializeToString())
writer.close()
# Check the TFRecord file.
record_iterator = tf.python_io.tf_record_iterator(path='file')
for string_record in record_iterator:
    example = tf.train.Example()
    example.ParseFromString(string_record)
    i = example.features.feature['i'].int64_list.value
    data = example.features.feature['data'].int64_list.value
    print(i, data)
# Use the Dataset API to read the TFRecord file.
filenames = ["file"]
dataset = tf.data.TFRecordDataset(filenames)

def _parse_function(example_proto):
    keys_to_features = {'i': tf.VarLenFeature(tf.int64),
                        'data': tf.VarLenFeature(tf.int64)}
    parsed_features = tf.parse_single_example(example_proto, keys_to_features)
    return tf.sparse_tensor_to_dense(parsed_features['i']), \
           tf.sparse_tensor_to_dense(parsed_features['data'])

# Parse the records into tensors.
dataset = dataset.map(_parse_function)
# Shuffle the dataset
dataset = dataset.shuffle(buffer_size=1)
# Repeat the input indefinitely
dataset = dataset.repeat()
# Generate batches
dataset = dataset.batch(1)
# Create a one-shot iterator
iterator = dataset.make_one_shot_iterator()
i, data = iterator.get_next()

with tf.Session() as sess:
    print(sess.run([i, data]))
    print(sess.run([i, data]))
    print(sess.run([i, data]))
There are a few things to note.
1. This SO question helps a lot.
2. tf.VarLenFeature returns a SparseTensor, so tf.sparse_tensor_to_dense is needed to convert it to a dense tensor.
3. In my code, parse_single_example() can't be replaced with parse_example(), and that bugged me for a day. I don't know why parse_example() doesn't work out (most likely because parse_example() expects a batch of serialized protos, so it would have to be applied after batching rather than to single records). If anyone knows the reason, please enlighten me.
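One practical follow-up, not from the original answer: with a batch size larger than 1, the dense tensors have different lengths and plain batch() fails, so padded_batch() is the usual fix. A sketch reusing the _parse_function above, where shorter rows are zero-padded to the longest row in each batch:
dataset = tf.data.TFRecordDataset(["file"])
dataset = dataset.map(_parse_function)
# Pad 'i' and 'data' to the longest element in each batch (zeros by default).
dataset = dataset.padded_batch(2, padded_shapes=([None], [None]))
iterator = dataset.make_one_shot_iterator()
i, data = iterator.get_next()
with tf.Session() as sess:
    print(sess.run([i, data]))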