Date of Award

Spring 5-19-2026

Document Type

Thesis (Master's)

Department or Program

Computer Science

First Advisor

SouYoung Jin

Abstract

Understanding long-form video requires tracking events, motivations, and relationships across time rather than describing isolated frames. However, existing video-language models (VLMs) and audio description (AD) systems often generate short-horizon descriptions that omit narrative context, causal intent, and story continuity, limiting accessibility for blind and low-vision (BLV) audiences. This thesis investigates how long-form AD can be grounded in narrative memory without relying on expensive supervised training pipelines or heavily curated annotations.

We propose StoryTeller, a training-free retrieval-augmented framework for long-form audio description. Instead of depending solely on frame-level perception, StoryTeller summarizes observations into structured narrative facts that capture who did what and under what context, and stores them in a lightweight evolving memory across scenes. During generation, this memory guides retrieval of story-relevant evidence, enabling more coherent and factually grounded narration with reduced hallucination. Unlike prior approaches, the framework does not require ground-truth AD, subtitles, prior captions, precomputed metadata, or precomputed character banks, making it scalable to new movies and domains without copyright-sensitive retraining or manually constructed datasets.

Beyond generation, this thesis studies how AD systems should be evaluated for narrative usefulness. Existing evaluation protocols largely measure whether descriptions reflect visible or already observed content, but they do not test whether prior narration preserves information needed to understand what happens next. Motivated by the way sighted viewers use visual cues to anticipate upcoming events, we introduce ForseeBench, a prospective AD evaluation benchmark that withholds a future human-written AD sentence and asks whether preceding AD context supports reasoning about that upcoming visual event. ForseeBench filters shortcut solutions caused by generic movie priors, repeated wording, local continuation, or answer-choice artifacts. Its accompanying metric, PrediCC@k, measures how effectively prior AD improves future-event reasoning beyond answer-choice priors.

Experiments on standard AD benchmarks and ForseeBench show that narrative memory improves long-form AD generation, while prospective evaluation exposes the difficulty of preserving forward-relevant context for accessible movie understanding.
