Dartmouth College Master’s Theses

StoryTeller: Training-free Narrative Grounding and ForseeBench: Evaluation for Long Form Audio Description

Seung Hyun Hahm, Dartmouth College

Date of Award

Spring 5-19-2026

Document Type

Thesis (Master's)

Department or Program

Computer Science

First Advisor

SouYoung Jin

Abstract

Understanding long-form video requires tracking events, motivations, and relationships across time rather than describing isolated frames. However, existing video-language models (VLMs) and audio description (AD) systems often generate short-horizon descriptions that omit narrative context, causal intent, and story continuity, limiting accessibility for blind and low-vision (BLV) audiences. This thesis investigates how long-form AD can be grounded in narrative memory without relying on expensive supervised training pipelines or heavily curated annotations.

We propose StoryTeller, a training-free retrieval-augmented framework for long-form audio description. Instead of depending solely on frame-level perception, StoryTeller summarizes observations into structured narrative facts that capture who did what and under what context, and stores them in a lightweight evolving memory across scenes. During generation, this memory guides retrieval of story-relevant evidence, enabling more coherent and factually grounded narration with reduced hallucination. Unlike prior approaches, the framework does not require ground-truth AD, subtitles, prior captions, precomputed metadata, or precomputed character banks, making it scalable to new movies and domains without copyright-sensitive retraining or manually constructed datasets.
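As a rough illustration of the memory idea described above, the following Python sketch stores "who did what, in what context" facts scene by scene and retrieves the ones most relevant to a new observation. It is not the thesis implementation: the names `NarrativeFact` and `NarrativeMemory`, the example facts, and the word-overlap scoring are all assumptions made for this sketch, and the abstract does not specify how StoryTeller actually represents or retrieves facts.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class NarrativeFact:
    """Hypothetical record for one 'who did what, in what context' fact."""
    scene: int
    who: str
    what: str
    context: str

    def tokens(self) -> set[str]:
        # Lowercased bag of words over all three text fields.
        text = f"{self.who} {self.what} {self.context}"
        return set(text.lower().split())


@dataclass
class NarrativeMemory:
    """Toy evolving memory: facts are appended as scenes are processed."""
    facts: list[NarrativeFact] = field(default_factory=list)

    def add(self, fact: NarrativeFact) -> None:
        self.facts.append(fact)

    def retrieve(self, query: str, k: int = 2) -> list[NarrativeFact]:
        # Rank facts by word overlap with the query. This is only a stand-in
        # for whatever retrieval the real system uses. Ties go to later scenes.
        q = set(query.lower().split())
        scored = [(len(q & f.tokens()), f.scene, f) for f in self.facts]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: (s[0], s[1]), reverse=True)
        return [f for _, _, f in scored[:k]]


# Invented example facts, purely for illustration.
memory = NarrativeMemory()
memory.add(NarrativeFact(1, "Anna", "hides a letter", "in the kitchen drawer"))
memory.add(NarrativeFact(2, "Ben", "searches the kitchen", "while Anna is away"))
memory.add(NarrativeFact(3, "Anna", "boards a train", "at night"))

hits = memory.retrieve("Ben opens the kitchen drawer and finds a letter")
for fact in hits:
    print(fact.scene, fact.who, fact.what, fact.context)
```

In this toy setup, the retrieved facts would be passed to the description generator as story context, so that a new scene can be narrated with reference to earlier events rather than from the current frames alone.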

Beyond generation, this thesis studies how AD systems should be evaluated for narrative usefulness. Existing evaluation protocols largely measure whether descriptions reflect visible or already observed content, but they do not test whether prior narration preserves the information needed to understand what happens next. Motivated by the way sighted viewers use visual cues to anticipate upcoming events, we introduce ForseeBench, a prospective AD evaluation benchmark that withholds a future human-written AD sentence and asks whether the preceding AD context supports reasoning about that upcoming visual event. ForseeBench filters out shortcut solutions caused by generic movie priors, repeated wording, local continuation, or answer-choice artifacts. Its metric, PrediCC@k, measures how effectively prior AD improves future-event reasoning beyond answer-choice priors.

Experiments on standard AD benchmarks and ForseeBench show that narrative memory improves long-form AD generation, while prospective evaluation exposes the difficulty of preserving forward-relevant context for accessible movie understanding.

Recommended Citation

Hahm, Seung Hyun, "StoryTeller: Training-free Narrative Grounding and ForseeBench: Evaluation for Long Form Audio Description" (2026). Dartmouth College Master’s Theses. 301.
https://digitalcommons.dartmouth.edu/masters_theses/301

Included in

Artificial Intelligence and Robotics Commons, Social Work Commons
