Date of Award
5-29-2018
Document Type
Thesis (Undergraduate)
Department or Program
Department of Computer Science
First Advisor
Lorenzo Torresani
Abstract
There is a natural correlation between the visual and auditory elements of a video. In this work, we exploit this correlation to learn strong, general features via cross-modal self-supervision, using carefully chosen neural network architectures and calibrated curriculum learning. We suggest that this type of training is an effective way to pretrain models for further work in video understanding: pretrained models achieve an average 14.8% improvement over models trained from scratch. Furthermore, we demonstrate that these general features can be used for audio classification, performing on par with state-of-the-art results. Lastly, our work shows that cross-modal self-supervised pretraining is a good starting point for the development of multi-sensory models.
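To make the cross-modal self-supervision idea concrete, the sketch below illustrates the temporal-synchronization pretext task in toy form: two encoders embed "video" and "audio" features, and a contrastive objective pulls in-sync pairs together while pushing out-of-sync pairs apart by a margin. This is a minimal illustration, not the thesis's actual architecture; the linear `encode` function, the feature vectors, and all dimensions are hypothetical stand-ins for the deep networks used in the work.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    # Hypothetical linear encoder standing in for a deep video/audio network.
    return np.tanh(x @ W)

def contrastive_loss(v, a, synced, margin=1.0):
    # Embeddings of synchronized pairs should be close; embeddings of
    # out-of-sync pairs should be at least `margin` apart.
    d = np.linalg.norm(v - a)
    return d ** 2 if synced else max(0.0, margin - d) ** 2

# Toy "features": a synchronized pair shares an underlying signal; the
# negative pair uses temporally shifted audio (here: an independent sample).
signal = rng.normal(size=8)
video_feat = signal + 0.05 * rng.normal(size=8)
audio_feat = signal + 0.05 * rng.normal(size=8)
audio_shifted = rng.normal(size=8)

Wv = 0.1 * rng.normal(size=(8, 4))
Wa = Wv.copy()  # tied weights, purely for this illustration

loss_pos = contrastive_loss(encode(video_feat, Wv), encode(audio_feat, Wa), synced=True)
loss_neg = contrastive_loss(encode(video_feat, Wv), encode(audio_shifted, Wa), synced=False)
```

Minimizing both terms over many clips trains the encoders without labels; the learned features can then be fine-tuned for downstream recognition tasks, which is the pretraining benefit the abstract reports.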
Recommended Citation
Korbar, Bruno, "Co-Training of Audio and Video Representations from Self-Supervised Temporal Synchronization" (2018). Dartmouth College Undergraduate Theses. 130.
https://digitalcommons.dartmouth.edu/senior_theses/130
Comments
Originally posted in the Dartmouth College Computer Science Technical Report Series, number TR2018-849.