Date of Award


Document Type

Thesis (Ph.D.)

Department or Program

Computer Science

First Advisor

Soroush Vosoughi


The investigation of large multi-modal models (LMMs) has emerged as a focal point within the Deep Learning community, reflecting their prominence in contemporary research. LMMs can ingest data from diverse modalities, enabling them to perform a wide range of tasks by leveraging complementary information for improved predictive capabilities. The learning process of LMMs comprises two crucial stages: a computationally intensive pre-training stage, which acquires general representations from web-scale noisy data, and a subsequent fine-tuning stage, which adapts the pre-trained model to specific tasks.

Traditionally, the pre-training of foundational LMMs has been considered a privilege limited to research labs with abundant computational resources. In this thesis, we propose a new method for the effective pre-training of foundational vision-language models (VLMs). It mitigates data demands by employing off-the-shelf frozen large language models (LLMs) through a specialized pre-training process. Additionally, we introduce an efficient VLM pre-training method that reduces redundancy in modality projection. With our approach, the data required to train VLMs is substantially reduced, from 129 million to 4 million instances, and the associated training budget can be cut to 1/10 without a perceptible decrease in performance.

Furthermore, we present a straightforward yet potent temporal fusion mechanism for adapting pre-trained image-language models to downstream video tasks. Our video captioning models achieve performance competitive with state-of-the-art benchmarks without extensive pre-training on video-text datasets. Beyond the established multi-modal domains of computer vision and natural language processing, our research extends into bioinformatics by investigating protein-RNA models for multi-modal learning. Our findings demonstrate that pre-trained protein models encapsulate information about biological structures that can be shared with RNAs. Given the limited number of experimentally solved RNA structures, our discovery opens avenues for novel research directions in transfer learning between proteins and RNAs.

Finally, we employ physics-augmented simulations to train a T-cell-peptide model, demonstrating that integrating such simulations into machine learning significantly enhances model training, especially when labeled data are limited. This underscores the potential of merging simulations with machine learning, providing a valuable strategy for advancing LMM training in the biological domain.