NumPy to TFRecords: Is There a Simpler Way to Handle Batch Inputs from TFRecords?

Is there a simpler way to handle batch inputs from TFRecords?

The whole process is simplified with the Dataset API. Here are both parts: (1) convert a NumPy array to TFRecords, and (2) read the TFRecords back to generate batches.

1. Create TFRecords from a NumPy array:

Example arrays:

import numpy as np
import tensorflow as tf

inputs = np.random.normal(size=(5, 32, 32, 3))
labels = np.random.randint(0, 2, size=(5,))

def npy_to_tfrecords(inputs, labels, filename):
    with tf.io.TFRecordWriter(filename) as writer:
        for X, y in zip(inputs, labels):
            # Feature contains a map of string to feature proto objects
            feature = {}
            feature['X'] = tf.train.Feature(float_list=tf.train.FloatList(value=X.flatten()))
            feature['y'] = tf.train.Feature(int64_list=tf.train.Int64List(value=[y]))

            # Construct the Example proto object
            example = tf.train.Example(features=tf.train.Features(feature=feature))

            # Serialize the example to a string
            serialized = example.SerializeToString()

            # Write the serialized object to disk
            writer.write(serialized)

npy_to_tfrecords(inputs, labels, 'numpy.tfrecord')

2. Read the TFRecords using the Dataset API:

filenames = ['numpy.tfrecord']
dataset = tf.data.TFRecordDataset(filenames)
# for TF 1.5 and above, use tf.data.TFRecordDataset

# Decode each example proto
def _parse_function(example_proto):
    keys_to_features = {'X': tf.io.FixedLenFeature(shape=(32, 32, 3), dtype=tf.float32),
                        'y': tf.io.FixedLenFeature((), tf.int64, default_value=0)}
    parsed_features = tf.io.parse_single_example(example_proto, keys_to_features)
    return parsed_features['X'], parsed_features['y']

# Parse the records into tensors.
dataset = dataset.map(_parse_function)

# Generate batches
dataset = dataset.batch(5)

Check that the generated batches are correct:

for data in dataset:
    break
np.testing.assert_allclose(inputs[0], data[0][0])
np.testing.assert_allclose(labels[0], data[1][0])

NumPy array to TFRecord

This tutorial will walk you through the process of creating TFRecords from your data:

https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564

However, there are easier ways of dealing with preprocessing now using the Dataset input pipeline. I prefer to keep my data in its most original format and build a preprocessing pipeline to deal with it. Here's the primary guide you want to read to learn about the Dataset preprocessing pipeline:

https://www.tensorflow.org/programmers_guide/datasets
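As a minimal sketch of that idea (the array shapes and the normalization step are illustrative assumptions, not from the original answer), you can feed raw NumPy data straight into a Dataset and do the preprocessing inside the pipeline instead of converting anything up front:

import numpy as np
import tensorflow as tf

# Raw data kept in its original format (shapes are made up for illustration).
images = np.random.randint(0, 256, size=(100, 32, 32, 3), dtype=np.uint8)
labels = np.random.randint(0, 2, size=(100,))

def preprocess(image, label):
    # Example preprocessing step: cast to float and scale to [0, 1].
    return tf.cast(image, tf.float32) / 255.0, label

dataset = tf.data.Dataset.from_tensor_slices((images, labels))
dataset = dataset.map(preprocess).shuffle(100).batch(10)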

Converting a NumPy file to TFRecord where each row contains a number and a variable-length list

I assume you want to store the number as one feature and the list as another feature here.

import tensorflow as tf

# my_data is assumed to be an array where each row holds
# [a number, a list of numbers]
writer = tf.python_io.TFRecordWriter('test.tfrecords')
for index in range(my_data.shape[0]):
    example = tf.train.Example(features=tf.train.Features(feature={
        'num_value': tf.train.Feature(int64_list=tf.train.Int64List(value=[my_data[index][0]])),
        'list_value': tf.train.Feature(int64_list=tf.train.Int64List(value=my_data[index][1]))
    }))
    writer.write(example.SerializeToString())
writer.close()

# Read the data back from the TFRecords file
record_iterator = tf.python_io.tf_record_iterator('test.tfrecords')
for _ in range(2):
    serialized_example = next(record_iterator)
    example = tf.train.Example()
    example.ParseFromString(serialized_example)
    num_value = example.features.feature['num_value'].int64_list.value[0]
    list_value = example.features.feature['list_value'].int64_list.value
    print(num_value, list_value)

# Output:
1446549 [491827, 30085, 1417541, 799563, 879302, 1997973, 1373049, 1460602, 2240973, 1172992, 1186011, 147536, 1958456, 3095889, 319954, 2191582, 1113354, 302626, 1985611, 1186704, 2231212, 2642148, 386962, 3072993, 1131255, 15085, 2714264, 1363205]
406529 [900479, 660976, 1270383, 1287181]

Is there a simple way to set epochs when using TFRecords with TensorFlow Estimators?

You can set the number of epochs with dataset.repeat(num_epochs). The Dataset pipeline outputs batches of (features, labels) tuples, which are fed to the estimator's train() method:

dataset = tf.data.TFRecordDataset('file.tfrecords')
dataset = dataset.shuffle(buffer_size=1000).repeat(num_epochs)
...
dataset = dataset.batch(batch_size)

To make it work, call model.train(steps=None, max_steps=None). In this case you let the Dataset API handle the epoch count: once num_epochs is reached, it raises tf.errors.OutOfRangeError (or StopIteration), which ends training.
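A minimal sketch of how this fits together with an Estimator (the estimator object, the _parse_function parser, and the numeric constants are illustrative assumptions, not from the original answer):

def input_fn():
    dataset = tf.data.TFRecordDataset('file.tfrecords')
    dataset = dataset.map(_parse_function)   # assumed parser returning (features, labels)
    dataset = dataset.shuffle(buffer_size=1000)
    dataset = dataset.repeat(10)             # 10 epochs, then OutOfRangeError stops training
    dataset = dataset.batch(32)
    return dataset

# steps=None / max_steps=None: the dataset's epoch count decides when training stops
estimator.train(input_fn=input_fn, steps=None, max_steps=None)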

How to use the Dataset API to read a TFRecords file of lists of variable length?

After hours of searching and trying, I believe I have the answer. Below is my code.

import numpy as np
import tensorflow as tf

def _int64_feature(value):
    # value must be a numpy array.
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value.flatten()))

# Write an array to a TFRecord file.
# a is an array which contains lists of variable length.
a = np.array([[0, 54, 91, 153, 177],
              [0, 50, 89, 147, 196],
              [0, 38, 79, 157],
              [0, 49, 89, 147, 177],
              [0, 32, 73, 145]])

writer = tf.python_io.TFRecordWriter('file')

for i in range(a.shape[0]):  # i = 0 ~ 4
    x_train = np.array(a[i])
    feature = {'i'   : _int64_feature(np.array([i])),
               'data': _int64_feature(x_train)}

    # Create an example protocol buffer
    example = tf.train.Example(features=tf.train.Features(feature=feature))

    # Serialize to string and write to the file
    writer.write(example.SerializeToString())

writer.close()

# Check the TFRecord file.
record_iterator = tf.python_io.tf_record_iterator(path='file')
for string_record in record_iterator:
    example = tf.train.Example()
    example.ParseFromString(string_record)

    i = (example.features.feature['i'].int64_list.value)
    data = (example.features.feature['data'].int64_list.value)
    print(i, data)

# Use the Dataset API to read the TFRecord file.
filenames = ["file"]
dataset = tf.data.TFRecordDataset(filenames)

def _parse_function(example_proto):
    keys_to_features = {'i'   : tf.VarLenFeature(tf.int64),
                        'data': tf.VarLenFeature(tf.int64)}
    parsed_features = tf.parse_single_example(example_proto, keys_to_features)
    return tf.sparse_tensor_to_dense(parsed_features['i']), \
           tf.sparse_tensor_to_dense(parsed_features['data'])

# Parse the records into tensors.
dataset = dataset.map(_parse_function)
# Shuffle the dataset
dataset = dataset.shuffle(buffer_size=1)
# Repeat the input indefinitely
dataset = dataset.repeat()
# Generate batches
dataset = dataset.batch(1)
# Create a one-shot iterator
iterator = dataset.make_one_shot_iterator()
i, data = iterator.get_next()

with tf.Session() as sess:
    print(sess.run([i, data]))
    print(sess.run([i, data]))
    print(sess.run([i, data]))

There are a few things to note.

1. This SO question helps a lot.

2. tf.VarLenFeature returns a SparseTensor, so converting it to a dense tensor with tf.sparse_tensor_to_dense is necessary. (If you want batches larger than 1, see the padded_batch sketch after this list.)

3. In my code, parse_single_example() can't be replaced with parse_example(), and it bugged me for a day. I don't know why parse_example() doesn't work out; if anyone knows the reason, please enlighten me. (One likely reason: tf.parse_example expects a batch of serialized Example protos, while dataset.map() here passes a single scalar proto at a time; parse_example would apply if you batched the serialized strings first and mapped afterwards.)
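As a side note on point 2: with batch(1) each batch holds a single variable-length row, so no padding is needed. To batch several variable-length rows together, a hedged sketch (the batch size is an illustrative assumption) would swap the dataset.batch(1) line above for a padded batch:

# Replace dataset.batch(1) with a padded batch: shorter 'data' sequences
# are padded (with zeros, the default) to the longest one in the batch.
dataset = dataset.padded_batch(
    2,                            # batch size (illustrative)
    padded_shapes=([1], [None]))  # 'i' has fixed length 1; pad 'data'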


