Video (language) Modeling: a Baseline for Generative Models of Natural Videos

ArXiv PrePrint


We propose a strong baseline model for unsupervised feature learning using video data. By learning to predict missing frames or extrapolate future frames from an input video sequence, the model discovers both spatial and temporal correlations which are useful to represent complex deformations and motion patterns. The models we propose are largely borrowed from the language modeling literature, and adapted to the vision domain by quantizing the space of image patches into a large dictionary. We demonstrate the approach on both a filling and a generation task. For the first time, we show that, after training on natural videos, such a model can predict non-trivial motions over short video sequences.
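To make the quantization step concrete, here is a minimal sketch of mapping each image patch to an entry in a dictionary. It assumes non-overlapping square patches, a grayscale frame, and nearest-centroid assignment under squared Euclidean distance (as in k-means); the function names `extract_patches`, `quantize` and `reconstruct` are hypothetical, and the dictionary is taken as given rather than learned here. The model's actual patch size, dictionary size and learning procedure are not specified on this page.

```python
import numpy as np

def extract_patches(frame, patch):
    """Split an (H, W) frame into non-overlapping patch x patch tiles, each flattened."""
    h, w = frame.shape
    gh, gw = h // patch, w // patch
    # Crop to a multiple of the patch size, then expose the tile grid as separate axes.
    tiles = frame[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    flat = tiles.transpose(0, 2, 1, 3).reshape(gh * gw, patch * patch)
    return flat, (gh, gw)

def quantize(frame, dictionary, patch):
    """Replace each patch by the index of its nearest dictionary entry (squared L2)."""
    patches, grid = extract_patches(frame, patch)
    # Pairwise squared distances between patches (N, D) and entries (K, D) -> (N, K).
    d2 = ((patches[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1).reshape(grid)

def reconstruct(indices, dictionary, patch):
    """Turn a grid of dictionary indices back into an image."""
    gh, gw = indices.shape
    tiles = dictionary[indices].reshape(gh, gw, patch, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(gh * patch, gw * patch)
```

Once frames are expressed as grids of discrete indices, predicting the next frame becomes a classification problem over dictionary entries, which is what lets language-modeling techniques carry over.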


Below, we show examples of what the model generates after training on the UCF-101 dataset. The first two frames of each video are the ground-truth initialization (marked with a white dot in the top right corner); the subsequent 10 frames are generated by our model (a recurrent convolutional neural network operating in the space of quantized image patches).

These generations can be compared to a method based on optical flow, which makes predictions by assuming constant optical flow in time (courtesy of Piotr Dollar):
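The constant-flow assumption can be sketched roughly as follows. This is not the baseline implementation used for the comparison; it is a simplified illustration that assumes a dense flow field has already been estimated from the initial frames, uses nearest-neighbour backward warping, and keeps the flow fixed in image coordinates instead of advecting it along with the pixels. The names `warp` and `extrapolate` are hypothetical.

```python
import numpy as np

def warp(frame, flow):
    """Backward-warp an (H, W) frame by a dense (H, W, 2) flow field.

    flow[..., 0] is the horizontal displacement and flow[..., 1] the vertical
    displacement, in pixels per frame. Sampling is nearest-neighbour, and
    source coordinates are clamped to the image border.
    """
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs - flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys - flow[..., 1]).astype(int), 0, h - 1)
    return frame[src_y, src_x]

def extrapolate(last_frame, flow, steps):
    """Predict future frames by reapplying the same flow at every time step."""
    frames, cur = [], last_frame
    for _ in range(steps):
        cur = warp(cur, flow)
        frames.append(cur)
    return frames
```

Because the motion is frozen at its initial estimate, such a predictor cannot represent accelerations, occlusions or changes in direction.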

We also used the model to fill in missing frames in a video. In the following examples, we predict 3 missing frames given the first and the fifth frame (the first frame is marked with “1” and the last frame with “5” in the top right corner; the remaining intermediate frames are generated). In each of the four examples, we show the predictions made by our model, by linear interpolation in the space of optical flow, and by linear interpolation in pixel space.

Filling example 1.

Filling example 2.

Filling example 3.

Filling example 4.
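For reference, the simplest of the three filling methods, linear interpolation in pixel space, can be sketched as below. The function name `interpolate_pixels` and the assumption that frames are equally spaced in time are ours; with frames 1 and 5 given, the three missing frames get blending weights 1/4, 1/2 and 3/4.

```python
import numpy as np

def interpolate_pixels(first, last, n_missing=3):
    """Fill n_missing frames between two frames by per-pixel linear blending."""
    first = np.asarray(first, dtype=float)
    last = np.asarray(last, dtype=float)
    frames = []
    for k in range(1, n_missing + 1):
        # Fraction of the way from the first frame to the last frame.
        t = k / (n_missing + 1)
        frames.append((1.0 - t) * first + t * last)
    return frames
```

This baseline cross-fades between the two given frames, so moving objects appear as fading double images and are not displaced along their trajectory.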

