Author ORCID Identifier

https://orcid.org/0009-0005-1013-5130

Date of Award

Spring 2026

Document Type

Thesis (Ph.D.)

Department or Program

Computer Science

First Advisor

SouYoung Jin, Ph.D.

Abstract

Multimodal large language models have achieved impressive performance on vision-language benchmarks by integrating visual encoders with large language models. Yet a critical gap persists between benchmark accuracy and genuine multimodal understanding: current evaluation frameworks assess performance by final answers alone, rewarding confident predictions while leaving systematic reasoning failures undetected.

This thesis addresses this gap through a unified framework that progresses from understanding to reasoning, using video as the most comprehensive multimodal testbed. Video inherently combines vision, audio, and language with temporal dynamics and massive token redundancy; techniques developed for video's comprehensive challenges transfer naturally to simpler multimodal tasks.

On understanding multimodality, we introduce describable windows that identify which temporal segments contain queryable moments in long-form videos, achieving +4.52% on Ego4D and +4.1% on MAD. This establishes that precision beats exhaustive search.

On better fusion, STAMP (Soft Token Attention Masking Process) dynamically filters redundant tokens through soft attention masking. Benefits scale with data complexity: large multimodal datasets show substantial gains while small datasets show minimal improvement, confirming that adaptive mechanisms become essential as sequences grow longer.
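
To make the idea of soft attention masking concrete, the following is a minimal sketch and not the thesis implementation. It assumes each token receives a hypothetical relevance logit, that a sigmoid turns this logit into a keep score in (0, 1), and that the log of the keep score is added to the attention logits before the softmax. Under these assumptions, redundant tokens are down-weighted smoothly instead of being dropped outright. The function and parameter names are illustrative only.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softmax(xs):
    """Softmax over a list of floats, shifted by the max for stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def soft_masked_attention(attn_logits, relevance_logits, eps=1e-9):
    """Attention weights of one query over N tokens with a soft mask.

    attn_logits: raw query-key scores, one per token.
    relevance_logits: hypothetical per-token relevance scores (an
        assumption of this sketch, not a detail from the abstract).
    Returns (weights, keep_scores).
    """
    if len(attn_logits) != len(relevance_logits):
        raise ValueError("one relevance logit is required per token")
    keep = [sigmoid(r) for r in relevance_logits]
    # Adding log(keep) scales each token's unnormalized weight by keep.
    biased = [a + math.log(k + eps) for a, k in zip(attn_logits, keep)]
    return softmax(biased), keep


if __name__ == "__main__":
    weights, keep = soft_masked_attention([0.0, 0.0, 0.0], [5.0, 5.0, -5.0])
    print(weights, keep)
```

Because the mask enters as an additive bias on the logits, a keep score near 1 leaves a token almost untouched, while a keep score near 0 pushes its attention weight toward zero without removing the token from the sequence.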

On efficient fusion, MoDA (Modulation Adapter) performs instruction-guided channel-wise modulation to address semantic entanglement in visual representations, achieving substantial improvements on fine-grained tasks (+12.0 points on MMVP hallucination detection) with minimal computational overhead (less than 1% FLOPs). This demonstrates that targeted modulation outperforms brute-force scaling.
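
As a rough illustration of instruction-guided channel-wise modulation, the sketch below assumes a gating design: an instruction embedding is projected to one gate per visual channel, and every visual token is scaled channel by channel. The projection weights, the sigmoid gate, and all names here are assumptions made for illustration; the abstract does not specify how MoDA computes its modulation.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def channel_gates(instruction_emb, weight, bias):
    """Map an instruction embedding (length D) to C gates in (0, 1).

    weight is a C x D matrix given as a list of rows and bias has
    length C. Both are hypothetical learned parameters.
    """
    if len(weight) != len(bias):
        raise ValueError("weight rows and bias must both have length C")
    gates = []
    for row, b in zip(weight, bias):
        if len(row) != len(instruction_emb):
            raise ValueError("each weight row must have length D")
        z = sum(w * e for w, e in zip(row, instruction_emb)) + b
        gates.append(sigmoid(z))
    return gates


def modulate(visual_tokens, gates):
    """Scale each channel of each visual token by its gate."""
    out = []
    for token in visual_tokens:
        if len(token) != len(gates):
            raise ValueError("token width must match the number of gates")
        out.append([v * g for v, g in zip(token, gates)])
    return out


if __name__ == "__main__":
    gates = channel_gates([1.0, -1.0], [[0.0, 0.0], [4.0, 0.0]], [0.0, 0.0])
    print(modulate([[2.0, 4.0], [1.0, 1.0]], gates))
```

In this sketch the same gates are shared across all tokens, so the modulation step costs one multiplication per channel per token. That is one way such an adapter could stay lightweight relative to the backbone.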

On transparent reasoning, CRYSTAL evaluates 15 state-of-the-art MLLMs step by step through verifiable intermediate checkpoints, revealing that answer correctness and reasoning quality are only weakly correlated. Reinforcement learning with step-level rewards transforms guessing into structured reasoning, proving that process supervision enables genuine understanding.

Together, these contributions demonstrate that robust multimodal understanding requires rethinking how we design, train, and evaluate AI systems. The path forward demands optimizing for understanding, not accuracy alone.

Recommended Citation

Barrios, Wayner, "From Attention to Reasoning: Beyond Accuracy in Multimodal AI" (2026). Dartmouth College Ph.D Dissertations. 524.
https://digitalcommons.dartmouth.edu/dissertations/524

Included in

Artificial Intelligence and Robotics Commons, Data Science Commons
