tf.data.Dataset: How to Get the Dataset Size (Number of Elements in an Epoch)

tf.data.Dataset: how to get the dataset size (number of elements in an epoch)?

tf.data.Dataset.list_files creates a tensor called MatchingFiles:0 (with the appropriate prefix if applicable).

You could evaluate

tf.shape(tf.get_default_graph().get_tensor_by_name('MatchingFiles:0'))[0]

to get the number of files.
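
For illustration, a hypothetical TF1-style (graph mode) sketch with a placeholder glob pattern; it looks the tensor up by op type rather than by name, so any name-scope prefix doesn't matter:

import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

dataset = tf.data.Dataset.list_files('data/*.tfrecord')  # placeholder pattern

# Grab the MatchingFiles op created by list_files, whatever its scoped name is.
matching_files = [op.outputs[0] for op in tf.get_default_graph().get_operations()
                  if op.type == 'MatchingFiles'][0]
num_files = tf.shape(matching_files)[0]

with tf.Session() as sess:
    print(sess.run(num_files))  # number of files matching the pattern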

Of course, this would work in simple cases only, and in particular if you have only one sample (or a known number of samples) per image.

In more complex situations, e.g. when you do not know the number of samples in each file, you can only observe the number of samples as an epoch ends.

To do this, you can watch the number of epochs that is counted by your Dataset: repeat() creates a member called _count that counts the number of epochs. By observing it during your iterations, you can spot when it changes and compute your dataset size from there.

This counter may be buried in the hierarchy of Datasets that is created when calling member functions successively, so we have to dig it out like this:

import warnings
import tensorflow as tf

d = my_dataset
# RepeatDataset seems not to be exposed -- this is a possible workaround
RepeatDataset = type(tf.data.Dataset().repeat())
try:
    # walk down the chain of wrapped datasets until the RepeatDataset is reached
    while not isinstance(d, RepeatDataset):
        d = d._input_dataset
except AttributeError:
    warnings.warn('no epoch counter found')
    epoch_counter = None
else:
    epoch_counter = d._count

Note that with this technique, the computed dataset size is not exact, because the batch during which epoch_counter is incremented typically mixes samples from two successive epochs. So the computation is only precise up to your batch size.

In TensorFlow 2.0, how can I see the number of elements in a dataset?

If it's possible to know the length then you could use:

tf.data.experimental.cardinality(dataset)

but the problem is that a TF dataset is inherently lazily loaded. So we might not know the size of the dataset up front. Indeed, it's perfectly possible to have a dataset represent an infinite set of data!
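
For instance (a small sketch), cardinality answers immediately when the size is statically known, and returns the sentinel tf.data.experimental.INFINITE_CARDINALITY (-1) for an endlessly repeated dataset:

import tensorflow as tf

ds = tf.data.Dataset.range(10)
print(tf.data.experimental.cardinality(ds).numpy())           # 10

# repeat() with no argument makes the dataset infinite
print(tf.data.experimental.cardinality(ds.repeat()).numpy())  # -1 (INFINITE_CARDINALITY)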

If it is a small enough dataset, you could also just iterate over it to get the length. I've used the following ugly little construct before, but it depends on the dataset being small enough that we're happy to load it into memory, and it's really not an improvement over your for loop above!

dataset_length = [i for i,_ in enumerate(dataset)][-1] + 1
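
A marginally leaner variant (still a full pass over the data, so the same size caveat applies, though it avoids building the intermediate list) is:

dataset_length = sum(1 for _ in dataset)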

How to get number of samples in a tf.dataset for steps_per_epoch?

The previous answer is good, yet I would like to point out two matters:

  1. The code below works, no need to use the experimental package anymore.
import tensorflow as tf
dataset = tf.data.Dataset.range(42)
#Still prints 42
print(dataset.cardinality().numpy())

  2. If you use a filter predicate, the cardinality may return -2, i.e. unknown; if you do use filter predicates on your dataset, ensure that you have calculated the length of your dataset in another manner (for example, the length of the pandas DataFrame before applying .from_tensor_slices() to it).
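
To illustrate the filter case (a sketch with made-up numbers): cardinality reports -2 once a filter is applied, so the length has to come from the source data or from a full pass over the dataset:

import tensorflow as tf

values = list(range(42))            # stand-in for e.g. the rows of a pandas DataFrame
ds = tf.data.Dataset.from_tensor_slices(values).filter(lambda x: x % 2 == 0)

print(ds.cardinality().numpy())     # -2 (UNKNOWN_CARDINALITY)
print(len(values))                  # 42: pre-filter length, known from the source
print(sum(1 for _ in ds))           # 21: the exact post-filter length needs one pass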

Another important point is how to set the parameters steps_per_epoch and validation_steps: steps_per_epoch == length_of_training_dataset // batch_size and validation_steps == length_of_validation_dataset // batch_size.
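
As a self-contained sketch of those formulas (random data and a toy model, purely for illustration):

import tensorflow as tf

batch_size = 32
train_len, val_len = 1000, 200   # lengths known up front (e.g. from the source DataFrames)

train_ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((train_len, 4)), tf.random.normal((train_len, 1)))
).batch(batch_size).repeat()
val_ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((val_len, 4)), tf.random.normal((val_len, 1)))
).batch(batch_size).repeat()

inputs = tf.keras.Input(shape=(4,))
model = tf.keras.Model(inputs, tf.keras.layers.Dense(1)(inputs))
model.compile(optimizer='adam', loss='mse')

model.fit(train_ds,
          validation_data=val_ds,
          epochs=2,
          steps_per_epoch=train_len // batch_size,    # 1000 // 32 = 31
          validation_steps=val_len // batch_size)     # 200 // 32 = 6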

A full example is available here: How to use repeat() function when building data in Keras?

Get length of a dataset in Tensorflow

I was working with tf.data.FixedLengthRecordDataset() and ran into a similar problem.
In my case, I was trying to only take a certain percentage of the raw data.
Since I knew all the records have a fixed length, a workaround for me was:

import os
import tensorflow as tf

# filepath, percentage, filenames and bytesPerRecord come from the surrounding code
totalBytes = sum(os.path.getsize(os.path.join(filepath, filename)) for filename in os.listdir(filepath))
numRecordsToTake = tf.cast(0.01 * percentage * totalBytes / bytesPerRecord, tf.int64)
dataset = tf.data.FixedLengthRecordDataset(filenames, bytesPerRecord).take(numRecordsToTake)

In your case, my suggestion would be to count the number of records in 'primary.csv' and 'secondary.csv' directly in Python. Alternatively, I think that for your purpose, setting the buffer_size argument doesn't really require counting the files. According to the accepted answer about the meaning of buffer_size, a number that's greater than the number of elements in the dataset will ensure a uniform shuffle across the whole dataset. So just putting in a really big number (one that you think will surpass the dataset size) should work.
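
Counting the records directly in Python could look something like the sketch below ('primary.csv' and 'secondary.csv' are the questioner's files; subtract the header row if there is one):

def count_csv_records(path, has_header=True):
    # one pass over the file; cheap compared to training
    with open(path) as f:
        n = sum(1 for _ in f)
    return n - 1 if has_header else n

num_primary = count_csv_records('primary.csv')
num_secondary = count_csv_records('secondary.csv')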

For an infinite dataset, is the data used in each epoch the same?

No, the data used is different. steps_per_epoch is used by Keras to determine the length of each epoch (as generators have no length), so it knows when to end training (or call checkpointers, etc.).

initial_epoch is the number displayed for the epoch and is useful when you want to restart training from a checkpoint (see the fit method); it has nothing to do with data iteration.

If you pass the same dataset to the model.fit method, it will reset after each function call (thanks to the OP for the info).
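
The "different data each epoch" point can be emulated with a plain iterator (a small sketch): taking two "epochs" of three steps each off the same shuffled, infinitely repeated dataset yields different batches, because the iteration simply continues rather than restarting.

import tensorflow as tf

ds = tf.data.Dataset.range(12).shuffle(12).repeat().batch(2)
it = iter(ds)

epoch_1 = [next(it).numpy().tolist() for _ in range(3)]
epoch_2 = [next(it).numpy().tolist() for _ in range(3)]
print(epoch_1)  # e.g. [[7, 3], [11, 0], [5, 9]]
print(epoch_2)  # generally different elements than epoch_1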

tf.Dataset will not repeat without - WARNING:tensorflow:Your input ran out of data; interrupting training

Hmm, maybe you should not be explicitly defining the batch_size and steps_per_epoch in model.fit(...). Regarding the batch_size parameter in model.fit(...), the docs state:

[...] Do not specify the batch_size if your data is in the form of datasets,
generators, or keras.utils.Sequence instances (since they generate
batches).

This seems to work:

import tensorflow as tf

x = tf.random.normal((1000, 1))
y = tf.random.normal((1000, 1))

ds = tf.data.Dataset.from_tensor_slices((x, y)).repeat(2)
ds = ds.shuffle(2000).cache('train_cache').batch(15, drop_remainder=True).prefetch(tf.data.AUTOTUNE)

val_ds = tf.data.Dataset.from_tensor_slices((tf.random.normal((300, 1)), tf.random.normal((300, 1))))
val_ds = val_ds.shuffle(300).cache('val_cache').batch(15, drop_remainder=True).prefetch(tf.data.AUTOTUNE)

inputs = tf.keras.layers.Input(shape = (1,))
x = tf.keras.layers.Dense(10, activation='relu')(inputs)
outputs = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer='adam', loss='mse')
model.fit(ds, validation_data=val_ds, epochs = 5)
Epoch 1/5
133/133 [==============================] - 1s 4ms/step - loss: 1.0355 - val_loss: 1.1205
Epoch 2/5
133/133 [==============================] - 0s 3ms/step - loss: 0.9847 - val_loss: 1.1050
Epoch 3/5
133/133 [==============================] - 0s 3ms/step - loss: 0.9810 - val_loss: 1.0982
Epoch 4/5
133/133 [==============================] - 0s 3ms/step - loss: 0.9792 - val_loss: 1.0937
Epoch 5/5
133/133 [==============================] - 0s 3ms/step - loss: 0.9779 - val_loss: 1.0903
<keras.callbacks.History at 0x7f3acb3e5ed0>

133 * batch_size = 133 * 15 = 1995 --> the remainder (5 of the 2 * 1000 = 2000 samples) was dropped because of drop_remainder=True.


