Date of Award


Document Type

Thesis (Undergraduate)

Department or Program

Department of Computer Science

First Advisor

Lorenzo Torresani


There is a natural correlation between the visual and auditive elements of a video. In this work, we use this correlation in order to learn strong and general features via cross-modal self-supervision with carefully chosen neural network architectures and calibrated curriculum learning. We suggest that this type of training is an effective way of pretraining models for further pursuits in video understanding, as they achieve on average 14.8% improvement over models trained from scratch. Furthermore, we demonstrate that these general features can be used for audio classification and perform on par with state-of-the-art results. Lastly, our work shows that using cross-modal self-supervision for pretraining is a good starting point for the development of multi-sensory models.


Originally posted in the Dartmouth College Computer Science Technical Report Series, number TR2018-849.