Date of Award
Winter 1-24-2025
Document Type
Thesis (Ph.D.)
Department or Program
Computer Science
First Advisor
Dr. Soroush Vosoughi
Second Advisor
Dr. Lorenzo Torresani
Third Advisor
Dr. Temiloluwa O. Prioleau
Abstract
Natural language describes entities in the world, some real and some abstract. It is also common practice to complement human learning of natural language with visual cues. This is evident in the heavily graphical nature of children’s literature, which underscores the importance of visual cues in language acquisition. Similarly, the notion of “visual learners” is well recognized, reflecting the understanding that visual signals such as illustrations, gestures, and depictions effectively supplement language. In machine learning, two primary paradigms have emerged for training systems that involve natural language. In the first, pre-training and downstream tasks are exclusively in natural language. In the second, models jointly reason over language and visual inputs during both pre-training and downstream tasks. Given the widely acknowledged role of visual input in human language comprehension, it is natural to ask whether visual information can similarly augment language-only tasks in machine learning. Despite the remarkable recent advances in machine learning models across domains, the concept of supplementing Natural Language Processing with visual signals remains insufficiently explored. This is due in part to the absence of clear and effective strategies for integrating visual information into language models, and to the limited availability of large, high-quality image-language paired datasets. In this thesis, we address this challenge and propose two frameworks for incorporating visual information into natural language pre-training, leveraging multimodal models as intermediaries between visual information and language models. Empirical evaluations on language pre-training datasets of varying sizes demonstrate the efficacy of the proposed frameworks across diverse downstream language tasks. In addition, we introduce methods for training effective multimodal models through architectural innovations and novel multimodal data augmentation techniques. The representations produced by our multimodal models improve downstream performance on zero-shot image categorization, visual question answering, visual entailment, and cross-modal retrieval. Finally, this thesis presents a novel method for constructing effective neural networks by selecting from randomly initialized parameters, in contrast to the conventional practice of updating parameters via gradient descent.
Recommended Citation
Aladago, Maxwell Mbabilla, "Incorporating Visual Information into Natural Language Processing" (2025). Dartmouth College Ph.D Dissertations. 335.
https://digitalcommons.dartmouth.edu/dissertations/335