December 8, 2014

C3D: Generic Features for Video Analysis

By: Du Tran, Manohar Paluri

Multimedia on the Internet is growing rapidly, resulting in an explosion in the number of videos being shared. The challenge we all face is how to sort through all these videos, figure out what they’re about, and enable people to find the ones they’re interested in.

The computer vision community has worked on video analysis for decades, tackling many different problems. What we still lack is a generic feature descriptor for videos that can be used across a variety of video processing tasks. Such a descriptor would let us solve video analysis problems in a homogeneous way and would also enable large-scale video applications.

In a recent paper, we introduce C3D (Convolution3D), a new generic feature for videos. C3D is discriminative, compact, and efficient to compute. It is obtained by training a deep 3D convolutional network on a large annotated video dataset containing a wide range of concepts: objects, actions, scenes, and other categories that occur frequently in videos. Figure 1 summarizes various use cases of C3D.

[Figure 1: Use cases of C3D features across video analysis tasks.]
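
To make the feature-extractor idea concrete, here is a minimal sketch of using a pretrained 3D ConvNet as a fixed feature extractor for a short clip. It is written in modern PyTorch rather than the original Caffe release, and torchvision's r3d_18 model, the 16-frame clip length, and the 112x112 crop size are stand-in assumptions rather than details taken from this post.

    # Sketch: a pretrained 3D ConvNet as a fixed, generic video feature extractor.
    import torch
    import torchvision

    # r3d_18 is used here only as a stand-in for a C3D-style network.
    model = torchvision.models.video.r3d_18(weights="DEFAULT")
    model.fc = torch.nn.Identity()          # drop the classifier, keep the embedding
    model.eval()

    clip = torch.rand(1, 3, 16, 112, 112)   # (batch, channels, frames, height, width)
    with torch.no_grad():
        feature = model(clip)               # one compact descriptor per clip
    print(feature.shape)                    # torch.Size([1, 512])

The resulting vector can then be fed to any off-the-shelf classifier, which is what makes a generic descriptor convenient.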

Our deep network is built on 3D convolution and pooling operations. 3D convolution models spatiotemporal signals better than traditional 2D convolutional nets because it does not collapse the temporal information; Figure 2 illustrates the difference between 2D and 3D convolutions applied to multiple video frames. In this work, we take the best-performing architecture [1] and adapt it to 3D convolutions. To our knowledge, this is the deepest 3D convolutional network trained on the largest annotated video dataset.

[Figure 2: 2D vs. 3D convolution applied to multiple video frames.]
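
The difference is easy to see in code. The following sketch (PyTorch assumed, shapes illustrative) applies both kinds of convolution to the same 16-frame clip: the 2D version has to fold the frames into the channel axis and loses the time dimension after one layer, while the 3D version keeps it.

    import torch
    import torch.nn as nn

    frames = torch.rand(1, 3, 16, 112, 112)  # (batch, channels, frames, H, W)

    # 2D convolution: frames are folded into channels, so time is collapsed.
    conv2d = nn.Conv2d(in_channels=3 * 16, out_channels=64, kernel_size=3, padding=1)
    out2d = conv2d(frames.reshape(1, 3 * 16, 112, 112))
    print(out2d.shape)   # torch.Size([1, 64, 112, 112]) -- no temporal axis left

    # 3D convolution: a 3x3x3 kernel slides over time as well as space.
    conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
    out3d = conv3d(frames)
    print(out3d.shape)   # torch.Size([1, 64, 16, 112, 112]) -- motion information preserved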

Using C3D features and a simple linear SVM, we achieve state-of-the-art performance on scene classification (96.7% on YUPENN [2] and 77.7% on Maryland [3]), object classification (15.3% on egocentric objects [4]), and action similarity labeling (72.9% on ASLAN [5]) in the video domain. We also approach the current best performance on action classification (76.4% on UCF101 [6]) without using optical flow. C3D is 91 times faster than improved dense trajectories [7] and two orders of magnitude faster than two-stream networks [8].
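
As a concrete illustration of the "C3D features plus a linear SVM" recipe, here is a minimal scikit-learn sketch. The random feature matrix, the 4096-dimensional feature size, and the L2 normalization are placeholder assumptions; in practice the features would come from the pretrained network.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Normalizer
    from sklearn.svm import LinearSVC

    n_videos, feat_dim, n_classes = 1000, 4096, 101      # UCF101-style setup (assumed)
    X = np.random.rand(n_videos, feat_dim)                # one C3D feature vector per video
    y = np.random.randint(0, n_classes, size=n_videos)    # class labels

    clf = make_pipeline(Normalizer(norm="l2"), LinearSVC(C=1.0))
    clf.fit(X[:800], y[:800])                             # train on the first 800 videos
    print("held-out accuracy:", clf.score(X[800:], y[800:]))
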
Figure 3 visualizes the semantic embedding of C3D features and ImageNet [9] features on the UCF101 dataset using t-SNE [10]. As can be seen, C3D features are more generic, transferring to other video datasets and tasks without further fine-tuning.

[Figure 3: t-SNE embedding of C3D features and ImageNet features on UCF101.]
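
For readers who want to produce this kind of plot for their own features, the sketch below assumes scikit-learn and matplotlib; the feature matrix and labels are placeholders for per-video descriptors and their classes.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    features = np.random.rand(2000, 4096)           # per-video feature vectors (placeholder)
    labels = np.random.randint(0, 101, size=2000)   # class labels (placeholder)

    # Project the high-dimensional features to 2D and color points by class.
    embedding = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=3, cmap="tab20")
    plt.title("t-SNE of video features, colored by class")
    plt.show()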

The pre-trained model and the code are available here.

References

[1] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[2] K. Derpanis, M. Lecce, K. Daniilidis, and R. Wildes. Dynamic scene understanding: The role of orientation features in space and time in scene classification. In CVPR, 2012.
[3] N. Shroff, P. K. Turaga, and R. Chellappa. Moving vistas: Exploiting motion for describing scenes. In CVPR, 2010.
[4] X. Ren and M. Philipose. Egocentric recognition of handled objects: Benchmark and analysis. In the First Workshop on Egocentric Vision, 2009.
[5] O. Kliper-Gross, T. Hassner, and L. Wolf. The action similarity labeling challenge. TPAMI, 2012.
[6] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human action classes from videos in the wild. CRCV-TR-12-01, 2012.
[7] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
[8] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[9] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2013.
[10] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 9:2579-2605, 2008.