Author ORCID Identifier

https://orcid.org/0009-0005-1013-5130

Date of Award

Spring 2026

Document Type

Thesis (Ph.D.)

Department or Program

Computer Science

First Advisor

SouYoung Jin, Ph.D.

Abstract

Multimodal large language models have achieved impressive performance on vision-language benchmarks by integrating visual encoders with large language models. Yet a critical gap persists between benchmark accuracy and genuine multimodal understanding: current evaluation frameworks assess performance by final answers alone, rewarding confident predictions while leaving systematic reasoning failures undetected.

This thesis addresses this gap through a unified framework that progresses from understanding to reasoning, using video as the most comprehensive multimodal testbed. Video inherently combines vision, audio, and language with temporal dynamics and massive token redundancy; techniques developed for video's comprehensive challenges transfer naturally to simpler multimodal tasks.

On understanding multimodality, we introduce describable windows that identify which temporal segments contain queryable moments in long-form videos, achieving +4.52% on Ego4D and +4.1% on MAD. This establishes that precision beats exhaustive search.

On better fusion, STAMP (Soft Token Attention Masking Process) dynamically filters redundant tokens through soft attention masking. Benefits scale with data complexity: large multimodal datasets show substantial gains while small datasets show minimal improvement, confirming that adaptive mechanisms become essential as sequences grow longer.
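
As an illustration of soft attention masking in general, and not of the thesis's actual STAMP implementation, the sketch below downweights tokens by adding the log of a per-token keep score to the attention logits before the softmax. The function names, the `keep_scores` input, and the log-bias formulation are all assumptions made for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def soft_masked_attention(logits, keep_scores, eps=1e-9):
    """Attention weights for one query under a soft token mask.

    logits      -- raw attention scores, one per token
    keep_scores -- hypothetical per-token scores in (0, 1]; values near 0
                   mark a token as redundant (assumed input, not from the thesis)
    """
    if len(logits) != len(keep_scores):
        raise ValueError("logits and keep_scores must have the same length")
    # Adding log(keep) scales each token's unnormalized weight by its keep
    # score, so redundant tokens fade out instead of being hard-dropped.
    biased = [l + math.log(k + eps) for l, k in zip(logits, keep_scores)]
    return softmax(biased)


weights = soft_masked_attention([1.0, 1.0, 1.0], [1.0, 1.0, 0.01])
```

Here the third token keeps a small but nonzero weight, which is the difference between a soft mask and hard token pruning.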

On efficient fusion, MoDA (Modulation Adapter) performs instruction-guided channel-wise modulation to address semantic entanglement in visual representations, achieving substantial improvements on fine-grained tasks (+12.0 points on MMVP hallucination detection) with minimal computational overhead (less than 1% FLOPs). This demonstrates that targeted modulation outperforms brute-force scaling.
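
A minimal sketch of what instruction-guided channel-wise modulation can look like, under stated assumptions: an instruction embedding is projected to one gate per visual channel, and every visual token is scaled by those gates. The linear projection, the sigmoid gate, and all names below are hypothetical choices for illustration; the abstract does not specify MoDA's actual design.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_gates(instruction_vec, weight, bias):
    """Map an instruction embedding to one gate in (0, 1) per visual channel.

    weight has one row per visual channel; each row has one entry per
    instruction dimension (assumed shapes, chosen for this sketch).
    """
    return [
        sigmoid(sum(w * x for w, x in zip(row, instruction_vec)) + b)
        for row, b in zip(weight, bias)
    ]


def modulate(visual_tokens, gates):
    """Scale every token's channels by the same instruction-dependent gates."""
    return [[g * v for g, v in zip(gates, token)] for token in visual_tokens]


# Toy setup: the instruction strongly selects channel 0 and suppresses channel 1.
gates = channel_gates([1.0, 0.0], weight=[[10.0, 0.0], [-10.0, 0.0]], bias=[0.0, 0.0])
modulated = modulate([[2.0, 2.0], [4.0, 4.0]], gates)
```

The cost is one small projection per instruction plus an elementwise multiply per token, which is consistent in spirit with the low overhead the abstract reports, though the exact figure depends on the real architecture.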

On transparent reasoning, CRYSTAL evaluates 15 state-of-the-art MLLMs step by step through verifiable intermediate checkpoints, revealing that answer correctness and reasoning quality are only weakly correlated. Reinforcement learning with step-level rewards transforms guessing into structured reasoning, proving that process supervision enables genuine understanding.

Together, these contributions demonstrate that robust multimodal understanding requires rethinking how we design, train, and evaluate AI systems. The path forward demands optimizing for understanding, not accuracy alone.
