Document Type

Technical Report

Publication Date

1-1-2013

Technical Report Number

TR2013-726

Abstract

This paper introduces MEXSVMs, a mid-level representation enabling efficient recognition of actions in videos. The entries in our descriptor are the outputs of several movement classifiers evaluated over spatio-temporal volumes of the image sequence, using space-time interest points as low-level features. Each movement classifier is a simple exemplar-SVM, i.e., an SVM trained using a single positive video and a large number of negative sequences. Our representation offers two main advantages. First, since our mid-level features are learned from individual video exemplars, they require a minimal amount of supervision. Second, we show that even simple linear classification models trained on our global video descriptor yield action recognition accuracy comparable to the state of the art. Because of the simplicity of linear models, our descriptor can efficiently learn classifiers for a large number of different actions and recognize actions even in large video databases. Experiments on two of the most challenging action recognition benchmarks demonstrate that our approach achieves accuracy similar to the best known methods while running 70 times faster than its closest competitor.
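The core construction described in the abstract, training one classifier per positive exemplar against a shared pool of negatives and then using the classifier scores as descriptor entries, can be sketched as below. This is an illustrative sketch only, not the paper's implementation: the function names (`train_exemplar_svm`, `mexsvm_descriptor`), the feature dimensionality, and the use of a plain hinge-loss SGD trainer in place of a standard SVM solver are all assumptions.

```python
import numpy as np

def train_exemplar_svm(pos, negs, lr=0.01, reg=1e-3, epochs=200):
    """Train a linear exemplar-SVM: one positive feature vector vs. many negatives.

    Uses simple hinge-loss SGD as a stand-in for a proper SVM solver (assumption).
    Returns the weight vector and bias (w, b).
    """
    X = np.vstack([pos[None, :], negs])           # first row is the lone positive
    y = np.concatenate([[1.0], -np.ones(len(negs))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) < 1.0:           # margin violated: hinge gradient step
                w = (1.0 - lr * reg) * w + lr * yi * xi
                b += lr * yi
            else:                                  # only the L2 regularizer acts
                w = (1.0 - lr * reg) * w
    return w, b

def mexsvm_descriptor(video_feat, exemplars):
    """Mid-level descriptor: one entry per exemplar-SVM, each entry a classifier score."""
    return np.array([w @ video_feat + b for w, b in exemplars])
```

A downstream action classifier would then be an ordinary linear model trained on these descriptors, which is what makes recognition over many actions and large video collections cheap.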
