Date of Award
Department of Computer Science
In this thesis, we investigate representations and models for large-scale video understanding. These include a mid-level representation for action recognition, a deep-learned representation for video analysis, a generic convolutional network architecture for video voxel prediction, and a new high-level task and benchmark for video comprehension. First, we present EXMOVES, a mid-level representation for scalable action recognition. The entries of the EXMOVES representation are the calibrated outputs of a set of movement classifiers evaluated over spatiotemporal volumes of the input video. Each movement classifier is a simple exemplar-SVM trained on low-level features. EXMOVES requires minimal supervision while achieving good action recognition accuracy, and it is approximately 70 times faster than other mid-level video representations. Second, we propose an effective method for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large-scale video dataset. We show that 3D ConvNets are better suited to spatiotemporal feature learning than 2D ConvNets. Our learned features, C3D, combined with a simple linear classifier, outperform state-of-the-art methods on four benchmarks and are comparable with the current best methods on the other two benchmarks. The features are also compact, efficient to compute, and easy to use. Third, we develop a generic 3D ConvNet architecture for video voxel prediction. Our preliminary results show that this architecture can be applied to different voxel prediction problems with good results. Finally, we propose a new task, Video Comprehension, construct a large-scale benchmark for it, develop a set of fundamental baselines, and conduct a human study on the newly proposed benchmark.
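As a rough illustration of the 3D convolution operation that 3D ConvNets are built on, the following minimal sketch applies a single 3D kernel to a toy single-channel clip. It is not the thesis implementation: the function names `conv3d_valid` and `shape`, the 3x3x3 kernel size, the toy clip, and the temporal-difference kernel are all assumptions chosen for illustration.

```python
def conv3d_valid(video, kernel):
    """Naive 'valid' 3D convolution (cross-correlation form) on one channel.

    video:  nested lists indexed [t][y][x]
    kernel: nested lists indexed [dt][dy][dx]
    Hypothetical helper for illustration only.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # Sum over a small spatiotemporal neighbourhood of the clip.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += video[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def shape(volume):
    """Return (time, height, width) of a nested-list volume."""
    return (len(volume), len(volume[0]), len(volume[0][0]))


# Toy clip (assumption): 8 frames of 6x6 pixels, brightness equal to the frame index.
clip = [[[float(t) for _ in range(6)] for _ in range(6)] for t in range(8)]

# Illustrative 3x3x3 kernel: subtracts the mean of the first frame slice
# from the mean of the last one, so it responds to change over time.
temporal_diff = [[[w / 9.0] * 3 for _ in range(3)] for w in (-1.0, 0.0, 1.0)]

response = conv3d_valid(clip, temporal_diff)
print(shape(clip), "->", shape(response))
```

The output is still a volume with a time axis, so later layers can continue to model motion. A 2D convolution that treats the frames of a clip as input channels would instead sum over all of them and produce a single 2D map per filter.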
Tran, Du Le Hong, "Representations and Models for Large-Scale Video Understanding" (2016). Dartmouth College Ph.D Dissertations. 53.