Video (language) Modeling: a Baseline for Generative Models of Natural Videos

arXiv preprint

By: Marc'Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, Sumit Chopra


We propose a strong baseline model for unsupervised feature learning using video data. By learning to predict missing frames or extrapolate future frames from an input video sequence, the model discovers both spatial and temporal correlations which are useful to represent complex deformations and motion patterns. The models we propose are largely borrowed from the language modeling literature, and adapted to the vision domain by quantizing the space of image patches into a large dictionary. We demonstrate the approach on both a filling and a generation task. For the first time, we show that, after training on natural videos, such a model can predict non-trivial motions over short video sequences.
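As a rough illustration of what quantizing image patches into a dictionary can look like, the sketch below splits a grayscale frame into non-overlapping patches and maps each patch to the index of its nearest dictionary entry. This is a minimal sketch under stated assumptions, not the implementation used in the paper: the function names, the non-overlapping patch layout, the squared Euclidean distance, and the way the dictionary is obtained are all hypothetical choices made for this example.

```python
import numpy as np

def extract_patches(frame, patch_size):
    """Split a 2-D grayscale frame into non-overlapping square patches.

    Returns an array of shape (num_patches, patch_size**2) and the patch grid shape.
    """
    h, w = frame.shape
    gh, gw = h // patch_size, w // patch_size
    # Crop so the frame divides evenly into patches (an assumption of this sketch).
    frame = frame[:gh * patch_size, :gw * patch_size]
    patches = frame.reshape(gh, patch_size, gw, patch_size).swapaxes(1, 2)
    return patches.reshape(gh * gw, patch_size * patch_size), (gh, gw)

def quantize(patches, dictionary):
    """Map each patch to the index of its nearest dictionary entry."""
    # Squared Euclidean distance via ||p||^2 - 2 p.d + ||d||^2.
    d2 = ((patches ** 2).sum(axis=1)[:, None]
          - 2.0 * patches @ dictionary.T
          + (dictionary ** 2).sum(axis=1)[None, :])
    return d2.argmin(axis=1)

def reconstruct(indices, dictionary, grid, patch_size):
    """Rebuild a frame from dictionary indices (inverse of extract_patches)."""
    gh, gw = grid
    patches = dictionary[indices].reshape(gh, gw, patch_size, patch_size)
    return patches.swapaxes(1, 2).reshape(gh * patch_size, gw * patch_size)
```

Once frames are expressed as grids of discrete indices, a sequence model can treat them much like word tokens, which is what makes language modeling techniques applicable.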


Below, we show some examples of what the model generates after training on the UCF-101 dataset. The first two frames of each video are the ground truth initialization (marked with a white dot in the top right corner); the subsequent 10 frames are generated by our model (a recurrent convolutional neural network operating in the space of quantized image patches).

These generations can be compared to a method based on optical flow, which makes predictions by assuming constant optical flow in time (courtesy of Piotr Dollar):

We also used the model to fill in missing frames in a video. In the following examples, we predict 3 missing frames given the first and the fifth frame (the first frame is marked with “1” and the last frame is marked with “5” in the top right corner; the remaining intermediate frames are generated). In each of the four examples, we show the prediction made by our model, by linear interpolation in the space of optical flow, and by linear interpolation in pixel space.

Filling example 1.

Filling example 2.

Filling example 3.

Filling example 4.
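
For reference, the pixel-space interpolation baseline mentioned above can be sketched in a few lines. This is a minimal sketch assuming frames are NumPy arrays of equal shape and that the missing frames are evenly spaced in time; the function name `interpolate_frames` is hypothetical and not taken from the paper.

```python
import numpy as np

def interpolate_frames(first, last, num_missing):
    """Linearly blend two frames in pixel space to fill the frames between them."""
    first = np.asarray(first, dtype=np.float64)
    last = np.asarray(last, dtype=np.float64)
    # Interior blend weights, e.g. 0.25, 0.5, 0.75 for three missing frames.
    weights = np.arange(1, num_missing + 1) / (num_missing + 1)
    return [(1.0 - w) * first + w * last for w in weights]
```

With `num_missing=3`, the function returns stand-ins for frames 2, 3 and 4 given frames 1 and 5. Because each output is a weighted average of the two endpoints, moving objects tend to appear as cross-fades rather than as displaced content, which is why this baseline is a useful point of comparison.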