Document Type

Technical Report

Publication Date


Technical Report Number



This paper introduces MEXSVMs, a mid-level representation enabling efficient recognition of actions in videos. The entries in our descriptor are the outputs of several movement classifiers evaluated over spatio-temporal volumes of the image sequence, using space-time interest points as low-level features. Each movement classifier is a simple exemplar-SVM, i.e., an SVM trained using a single positive video and a large number of negative sequences. Our representation offers two main advantages. First, since our mid-level features are learned from individual video exemplars, they require a minimal amount of supervision. Second, we show that even simple linear classification models trained on our global video descriptor yield action recognition accuracy comparable to the state of the art. Because of the simplicity of linear models, our descriptor makes it possible to efficiently learn classifiers for a large number of different actions and to recognize actions even in large video databases. Experiments on two of the most challenging action recognition benchmarks demonstrate that our approach achieves accuracy similar to the best known methods while running 70 times faster than the closest competitor.
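The construction described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each spatio-temporal volume is already summarized as a fixed-length feature vector (for example, a histogram of quantized space-time interest points), that each exemplar-SVM is linear, and that the per-volume scores of a classifier are pooled with a maximum. The function names `train_exemplar_svm` and `describe_video`, the cost weights, and the plain subgradient-descent trainer are all hypothetical choices made for illustration.

```python
import numpy as np


def train_exemplar_svm(positive, negatives, c_pos=10.0, c_neg=0.1,
                       lr=0.01, epochs=200):
    """Train a linear SVM from one positive vector and many negatives.

    Assumption: cost-weighted hinge loss minimized by batch subgradient
    descent, standing in for whatever solver the report actually uses.
    """
    X = np.vstack([positive[None, :], negatives])
    y = np.concatenate([[1.0], -np.ones(len(negatives))])
    # The single positive gets a larger misclassification cost than negatives.
    cost = np.concatenate([[c_pos], np.full(len(negatives), c_neg)])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1.0  # examples currently violating the margin
        grad_w = w - (cost[active] * y[active]) @ X[active]
        grad_b = -np.sum(cost[active] * y[active])
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def describe_video(volume_features, exemplar_models):
    """Build a descriptor with one entry per exemplar classifier.

    volume_features: array of shape (n_volumes, d), one row per
    spatio-temporal volume of the video.
    Assumption: each entry is the maximum classifier score over volumes.
    """
    W = np.stack([w for w, _ in exemplar_models])
    b = np.array([bias for _, bias in exemplar_models])
    scores = volume_features @ W.T + b  # shape (n_volumes, n_exemplars)
    return scores.max(axis=0)
```

With synthetic data, training three exemplar models on 16-dimensional vectors and calling `describe_video` on a video with 20 volumes returns a 3-dimensional descriptor, one entry per exemplar. A linear action classifier would then be trained on such descriptors.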