Author ORCID Identifier

https://orcid.org/0009-0004-0761-6585

Date of Award

Spring 6-14-2026

Document Type

Thesis (Undergraduate)

Department

Linguistics

First Advisor

Rolando Coto Solano

Abstract

Interlinear glossing is a major task in Indigenous language documentation. In this paper, I explore how effectively two Large Language Models, ByT5 and Gemini 2.5 Flash, can produce interlinear glossed text. I also examine how prompting an LLM with different types of information (dictionary entries, other training samples, and translations) can improve model performance. I apply these models to two under-resourced Indigenous languages: Bribri, a morphologically complex language from Costa Rica, and Cook Islands Māori, a language with simpler morphology from the Cook Islands in the Pacific Ocean. ByT5 performs much better when glossing Cook Islands Māori than when glossing Bribri: for the former it attains over 60% accuracy, while for the latter it hallucinates extensively. Gemini performs strongly for both languages, because a Retrieval-Augmented Generation architecture allows it to be prompted with similar training samples from the corpora. In future work, I will use these models to develop publicly available tools to aid linguistic documentation and revitalization.

Available for download on Sunday, June 14, 2026
