Date of Award
Winter 1-24-2025
Document Type
Thesis (Ph.D.)
Department or Program
Computer Science
First Advisor
Dr. Soroush Vosoughi
Second Advisor
Dr. Lorenzo Torresani
Third Advisor
Dr. Temiloluwa O. Prioleau
Abstract
Natural language describes entities in the world, some real and some abstract. It is also common practice to complement human learning of natural language with visual cues. This is evident in the heavily graphical nature of children’s literature, which underscores the importance of visual cues in language acquisition. Similarly, the notion of “visual learners” is well recognized, reflecting the understanding that visual signals such as illustrations, gestures, and depictions effectively supplement language. In machine learning, two primary paradigms have emerged for training systems that involve natural language. In the first, pre-training and downstream tasks are exclusively in natural language. In the second, models jointly reason over language and visual inputs during both pre-training and downstream tasks. Given the widely acknowledged role of visual input in human language comprehension, it is natural to ask whether visual information can similarly augment language-only tasks in machine learning. Despite the remarkable recent advances in machine learning models across domains, the concept of supplementing Natural Language Processing with visual signals remains insufficiently explored. This is due in part to the absence of clear and effective strategies for integrating visual information into language models, and to the limited availability of large, high-quality image-language paired datasets. In this thesis, we address this challenge and propose two frameworks for incorporating visual information into natural language pre-training, leveraging multimodal models as intermediaries between visual information and language models. Empirical evaluations on language pre-training datasets of varying sizes demonstrate the efficacy of the proposed frameworks across diverse downstream language tasks. In addition, we introduce methods for training effective multimodal models through architectural innovations and novel multimodal data augmentation techniques. The representations produced by our multimodal models improve downstream performance on zero-shot image categorization, visual question answering, visual entailment, and cross-modal retrieval. Finally, this thesis presents a novel method for constructing effective neural networks by selecting from randomly initialized parameters, in contrast to the conventional practice of updating parameters via gradient descent.
Recommended Citation
Aladago, Maxwell Mbabilla, "Incorporating Visual Information into Natural Language Processing" (2025). Dartmouth College Ph.D Dissertations. 335.
https://digitalcommons.dartmouth.edu/dissertations/335