Date of Award
Department of Computer Science
The visual and auditory elements of a video are naturally correlated. In this work, we exploit this correlation to learn strong, general features via cross-modal self-supervision, using carefully chosen neural network architectures and calibrated curriculum learning. We suggest that this type of training is an effective way to pretrain models for video understanding: the pretrained models achieve, on average, a 14.8% improvement over models trained from scratch. Furthermore, we demonstrate that these general features can be used for audio classification, where they perform on par with state-of-the-art results. Lastly, our work shows that cross-modal self-supervised pretraining is a good starting point for developing multi-sensory models.
Korbar, Bruno, "Co-Training of Audio and Video Representations from Self-Supervised Temporal Synchronization" (2018). Dartmouth College Undergraduate Theses. 130.