Does Caffe Need Data to Be Shuffled

Does Caffe need data to be shuffled?

Should you shuffle the samples? Think about the learning process if you don't shuffle: if the training data is sorted by label, caffe sees only 0-labeled samples at first. What do you expect the algorithm to deduce? Simply predict 0 all the time and everything is cool. If there are plenty of 0s before it hits the first 1, caffe will become very confident in always predicting 0, and it will be very difficult to move the model away from that point.

On the other hand, if it constantly sees a mix of 0s and 1s, it learns meaningful features for separating the classes from the very beginning.

Bottom line: it is very advantageous to shuffle the training samples, especially when using SGD-based approaches.

AFAIK, caffe does not randomly sample batch_size samples; rather, it goes over the input DB sequentially, batch_size samples at a time.

TL;DR
shuffle.
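
Because caffe reads the DB sequentially, a common approach is to shuffle the listing file yourself before building the database. A minimal Python sketch, assuming a plain-text list of "path label" lines (the file names here are hypothetical):

import random

# Shuffle the "path label" lines of a training list before building the DB,
# so that a sequential reader sees a mix of classes in every batch.
# "train_list.txt" and "train_list_shuffled.txt" are hypothetical file names.
with open("train_list.txt") as f:
    lines = f.readlines()

random.seed(42)        # fix the seed if you want the same order on every run
random.shuffle(lines)

with open("train_list_shuffled.txt", "w") as f:
    f.writelines(lines)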

Caffe's way of doing data shuffling

Using the convert_imageset tool creates a copy of your training/validation data in a binary database (either lmdb or leveldb format). The data encoded in the database consists of pairs of examples and their corresponding labels.

Therefore, when shuffling the dataset, the labels are shuffled together with the data, maintaining the correspondence between each example and its ground-truth label.

There is no need to shuffle the data again during training.
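
To see that each record carries both the data and its label, you can read one entry back. A minimal sketch, assuming the lmdb Python package and Caffe's compiled protobuf bindings (caffe.proto.caffe_pb2) are installed; "train_lmdb" is a hypothetical database path:

import lmdb
from caffe.proto import caffe_pb2

# Read the first record from an LMDB created by convert_imageset and check
# that the encoded Datum carries both the pixel data and its label.
env = lmdb.open("train_lmdb", readonly=True)
with env.begin() as txn:
    key, value = next(iter(txn.cursor()))
    datum = caffe_pb2.Datum()
    datum.ParseFromString(value)
    print(key, datum.channels, datum.height, datum.width, datum.label)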

Shuffle in caffe with multiple lmdbs

  1. If you use layer type "Data", you can't shuffle, as there is no shuffle parameter in data_param.

  2. As for layer type "ImageData", you can't use lmdb as the data source, since the source must be a text file listing image paths and labels; it does, however, have a shuffle parameter. If you look inside image_data_layer.cpp you'll find that when shuffle is true, the image list is re-shuffled at each epoch using the Fisher–Yates algorithm (a small sketch follows this list). If you use two different "ImageData" layers, ShuffleImages() is called independently for each of them, and it is very unlikely that the two shuffles will produce the same sequence. So you can't enable shuffle in either of these two "ImageData" layers.
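
For reference, a rough Python sketch of a Fisher–Yates shuffle, approximating what ShuffleImages() effectively does; the point is that two layers shuffling with independent random state will almost never produce the same order:

import random

def fisher_yates_shuffle(items, rng):
    # In-place Fisher-Yates shuffle, roughly what caffe's shuffle utility
    # does to the image/label list when shuffle is enabled.
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)
        items[i], items[j] = items[j], items[i]

# Two ImageData layers shuffle with independent RNG state, so their lists
# almost certainly end up in different orders and fall out of sync.
images = list(range(10))
labels = list(range(10))
fisher_yates_shuffle(images, random.Random(1))
fisher_yates_shuffle(labels, random.Random(2))
print(images)
print(labels)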

Does machine learning framework caffe support different data type precisions?

The mean file and the trained parameters you are using in the tutorial are stored as single-precision values. Changing float to double in the program does not change the stored values, so trying to read stored single-precision values as double precision results in reading "garbage". You'll have to manually convert the files to double-precision values.
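
A small NumPy illustration of the underlying issue (not Caffe code): reinterpreting single-precision bytes as double precision yields garbage, whereas an explicit cast converts each value properly.

import numpy as np

# Values saved in single precision...
stored = np.arange(4, dtype=np.float32)

# ...reinterpreted as double precision give garbage, because the same bytes
# are decoded under a different layout:
print(stored.view(np.float64))    # nonsense values

# A proper conversion allocates a new buffer and casts value by value:
print(stored.astype(np.float64))  # [0. 1. 2. 3.]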

BatchNorm and Reshuffle train images after each epoch

If you use the ImageData Layer as your input, set "shuffle" to true.

For example, if you have:

layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  transform_param {
    mirror: false
    crop_size: 227
    mean_file: "data/ilsvrc12/imagenet_mean.binaryproto"
  }
  image_data_param {
    source: "examples/_temp/file_list.txt"
    batch_size: 50
    new_height: 256
    new_width: 256
  }
}

Just add shuffle: true to image_data_param:

layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  transform_param {
    mirror: false
    crop_size: 227
    mean_file: "data/ilsvrc12/imagenet_mean.binaryproto"
  }
  image_data_param {
    source: "examples/_temp/file_list.txt"
    batch_size: 50
    new_height: 256
    new_width: 256
    shuffle: true
  }
}

For documentation, see:

  • http://caffe.berkeleyvision.org/tutorial/layers.html#images
  • https://github.com/BVLC/caffe/blob/master/src/caffe/proto/caffe.proto#L770

You can also find the source code here:

  • https://github.com/BVLC/caffe/blob/master/src/caffe/layers/image_data_layer.cpp

Of particular interest is the code within the load_batch function, which re-shuffles the data at the end of each epoch:

lines_id_++;
if (lines_id_ >= lines_size) {
  // We have reached the end. Restart from the first.
  DLOG(INFO) << "Restarting data prefetching from start.";
  lines_id_ = 0;
  if (this->layer_param_.image_data_param().shuffle()) {
    ShuffleImages();
  }
}

How to input multiple N-D arrays to a net in caffe?

You want caffe to use several N-D signals for each training sample, but you are concerned that the default "Data" layer can only handle one image per training sample.

There are several solutions for this concern:

  1. Using several "Data" layers (as was done in the model you linked to). In order to sync between the three "Data" layers, you need to know that caffe reads the samples from the underlying LMDB sequentially. So, if you prepare your three LMDBs in the same order, caffe will read one sample at a time from each of the LMDBs, in the order in which the samples were put there, and the three inputs will stay in sync during training/validation.

    Note that convert_imageset has a 'shuffle' flag; do NOT use it, as it will shuffle your samples differently in each of the three LMDBs and you will lose the sync. You are strongly advised to shuffle the samples yourself before preparing the LMDBs, but in a way that applies the same "shuffle" to all three inputs, keeping them in sync with each other (a sketch applying one permutation to all lists follows this list).

  2. Using a 5-channel input. caffe can store N-D data in LMDB, not only color/gray images. You can use Python to create an LMDB in which each "image" is a 5-channel array whose first three channels are the image's RGB and whose last two are the ground-truth label and the weight for the per-pixel loss.

    In your model you only need to add a "Slice" layer on top of your "Data":

    layer {
      name: "slice_input"
      type: "Slice"
      bottom: "raw_input" # 5-channel "image" stored in LMDB
      top: "rgb"
      top: "gt"
      top: "weight"
      slice_param {
        axis: 1
        slice_point: 3
        slice_point: 4
      }
    }
  3. Using a "HDF5Data" layer (my personal favorite). You can store your inputs in binary hdf5 format and have caffe read from these files. Using "HDF5Data" is much more flexible in caffe and allows you to shape the inputs as you like. In your case you need to prepare a binary hdf5 file with three "datasets": 'rgb', 'gt' and 'weight'. You need to make sure the samples are synced when you create the hdf5 file(s). Once they are ready, you can have a "HDF5Data" layer with three "top"s ready to be used (an h5py sketch follows this list).

  4. Write your own "Python" input layer. I will not go into the details here, but you can implement your own input layer in Python; see this thread for more details.
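
A minimal sketch for option 1, assuming each input has its own plain-text list file (the file names are hypothetical): generate one permutation and apply it to all lists before running convert_imageset on each of them, so the resulting LMDBs stay in sync.

import random

# Hypothetical list files, one per input; line i in each file must describe
# the same training sample.
list_files = ["rgb_list.txt", "gt_list.txt", "weight_list.txt"]

lists = []
for name in list_files:
    with open(name) as f:
        lists.append(f.readlines())

assert len(set(len(l) for l in lists)) == 1  # all lists must be the same length

# One permutation, applied to all three lists, keeps them in sync.
perm = list(range(len(lists[0])))
random.seed(0)
random.shuffle(perm)

for name, lines in zip(list_files, lists):
    with open(name.replace(".txt", "_shuffled.txt"), "w") as f:
        f.writelines(lines[i] for i in perm)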
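
And a minimal h5py sketch for option 3; the shapes and file names are made up for illustration, and the small text file written at the end is what the "HDF5Data" layer's source parameter points to.

import h5py
import numpy as np

# Made-up shapes: 100 samples of 3-channel RGB, 1-channel ground truth and
# 1-channel per-pixel weights, all 32x32. Index i refers to the same sample
# in all three datasets, so the inputs are synced by construction.
rgb = np.random.rand(100, 3, 32, 32).astype(np.float32)
gt = np.random.randint(0, 2, size=(100, 1, 32, 32)).astype(np.float32)
weight = np.ones((100, 1, 32, 32), dtype=np.float32)

with h5py.File("train.h5", "w") as f:
    f.create_dataset("rgb", data=rgb)
    f.create_dataset("gt", data=gt)
    f.create_dataset("weight", data=weight)

# "HDF5Data" reads a text file listing one .h5 file per line:
with open("train_h5_list.txt", "w") as f:
    f.write("train.h5\n")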

Impact of data shuffling on results reproducibility in Pytorch

The main principle deep learning is based on is weight optimization using stochastic gradient descent (and its variants). Being a stochastic algorithm, you cannot expect to get exactly the same results if you run your algorithm multiple times.
In fact, you should see some variations, but the results should be "roughly the same".

If you need to have exactly the same results when running your algorithm multiple times, you should look into reproducibility of results - which is a very delicate subject.

In summary:
1. If you do not shuffle at all, you will have perfect reproducibility, but the resulting accuracy is expected to be very low.

2. If you randomly shuffle (what most of the world does), you should expect slightly different accuracy values for each run, but they should all be significantly higher than the values of (1) "no shuffle".

3. If you follow the guidelines for reproducible results, you should get exactly the same accuracy values for each run, and they should be close to the values of (2) "shuffle".
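
As a starting point for those guidelines, a commonly used seeding sketch for recent PyTorch versions; full determinism can also depend on the DataLoader workers and the specific ops used, so treat this as a baseline rather than a guarantee.

import random

import numpy as np
import torch

# Fix every RNG that typically influences training.
seed = 0
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# Trade speed for determinism in cuDNN.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# The DataLoader's shuffle can be made repeatable with its own generator:
g = torch.Generator()
g.manual_seed(seed)
# loader = torch.utils.data.DataLoader(dataset, shuffle=True, generator=g)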


