How to Get a Thumbnail for a Video in My /sdcard/Android/data/mypackage/files Folder

Taking the first frame of a video as a thumbnail in Android?

The first frame can be taken using the NDK and ffmpeg, but it's more trouble than it's worth.

The simple way is to use ThumbnailUtils as per this answer, provided you are on android-8 (Froyo) or later.
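A minimal sketch of that approach, assuming the video sits in the app's external files directory (the /sdcard/Android/data/mypackage/files path from the title) and that the file name passed in is a placeholder you supply yourself:

import android.content.Context;
import android.graphics.Bitmap;
import android.media.ThumbnailUtils;
import android.provider.MediaStore;

import java.io.File;

public class VideoThumbnailHelper {

    // Returns a thumbnail for a video stored in the app's external files
    // directory, or null if a frame could not be extracted.
    public static Bitmap getThumbnail(Context context, String fileName) {
        // getExternalFilesDir(null) resolves to /sdcard/Android/data/<package>/files
        File video = new File(context.getExternalFilesDir(null), fileName);

        // MINI_KIND produces a 512x384 bitmap; use MICRO_KIND for a 96x96 one.
        return ThumbnailUtils.createVideoThumbnail(
                video.getAbsolutePath(),
                MediaStore.Video.Thumbnails.MINI_KIND);
    }
}

Note that the (String, int) overload used above is deprecated from API 29 onwards in favour of createVideoThumbnail(File, Size, CancellationSignal), but it still works for the Froyo-and-later case described in the answer.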

Auto selecting a thumbnail for a video

I want to be clear with the OP that this answer is not a formal description of the approach; the goal is to describe the prospective approach in an intuitive way.

Suppose that a video is composed of n frames and that each of them can be represented as a 3D tensor (height, width, channel). It is possible to use a Convolutional Neural Network (CNN) to generate a latent representation for each frame.
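In symbols (the dimension names H, W, C, the feature size d, and the symbol z_i are my own shorthand, used only to keep a frame and its feature vector apart):

z_i = \mathrm{CNN}(f_i), \qquad f_i \in \mathbb{R}^{H \times W \times C}, \quad z_i \in \mathbb{R}^{d}, \quad i = 1, \dots, n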

A video can be represented as a sequence of frames (f_1, f_2, ..., f_n). The most suitable neural network architecture for sequence modeling is the Recurrent Neural Network (RNN). We can use an RNN to encode the sequence of per-frame latent representations generated by the CNN. After that, you will have a hidden state (h_1, h_2, ..., h_n) for each frame, where each h_i generated by the RNN directly depends on the previous ones (this is a well-known property of RNNs).
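Under the same shorthand (the zero initial state h_0 is an assumption for a generic RNN cell):

h_i = \mathrm{RNN}(z_i, h_{i-1}), \qquad i = 1, \dots, n, \qquad h_0 = \mathbf{0}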

As you can see in the recently released YouTube-8M dataset, there are high-quality thumbnails associated with each video, so you can use them as targets. In particular, given the hidden states generated by the RNN applied to the sequence of frames, you can build a context vector c as follows:

alpha = softmax(FNN(h_1), FNN(h_2), ..., FNN(h_n))
c = h_1 * alpha_1 + h_2 * alpha_2 + ... + h_n * alpha_n

where FNN is a feedforward neural network that receives the hidden state h_i for frame f_i and generates a score representing how important that frame is in the current sequence. We can exploit the context vector c in order to predict the most suitable frame of the video.
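Spelling out the softmax explicitly (e_i is just a name for the unnormalized score of frame i), the weights and the context vector are:

e_i = \mathrm{FNN}(h_i), \qquad \alpha_i = \frac{\exp(e_i)}{\sum_{j=1}^{n} \exp(e_j)}, \qquad c = \sum_{i=1}^{n} \alpha_i h_i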

In my opinion there are two possible strategies to define a loss function for the minimization problem that the network should solve. The first one is easier than the second one. I briefly describe them as follows (a sketch of both losses follows the list):

  • predict the thumbnail index: by exploiting the context vector c, we can train the network to predict the position of the selected frame, by minimizing a cross-entropy loss between the predicted index distribution and the target index;
  • reconstruction error: by exploiting the context vector c, we can train the network to generate a new image, by minimizing the reconstruction error evaluated between the image generated by the model and the target thumbnail.
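A tentative formalization of the two options, assuming a one-hot target t over frame positions, a predicted index distribution p, a target thumbnail I_t taken from YouTube-8M, and an image \hat{I} produced from c by some decoder network (all of these symbols are mine, not the original answer's):

p = \mathrm{softmax}(\mathrm{FNN}_{\mathrm{out}}(c)), \qquad \mathcal{L}_{\mathrm{index}} = -\sum_{i=1}^{n} t_i \log p_i
\hat{I} = \mathrm{Decoder}(c), \qquad \mathcal{L}_{\mathrm{rec}} = \lVert \hat{I} - I_t \rVert_2^2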

I have not tried either of them in practice, so I cannot be sure that my approach will work, but I am confident that it is a reasonable way to approach this task. Anyway, I hope this answer helps the OP better understand how the task can be solved.


