Author ORCID Identifier
https://orcid.org/0009-0004-0761-6585
Date of Award
Spring 6-14-2026
Document Type
Thesis (Undergraduate)
Department
Linguistics
First Advisor
Rolando Coto Solano
Abstract
Interlinear glossing is a major task in Indigenous language documentation. In this paper, I explore how effectively two large language models (LLMs), ByT5 and Gemini 2.5 Flash, can produce interlinear glossed text. I also examine how prompting an LLM with different types of information (dictionary entries, other training samples, and translations) can improve model performance. I apply these models to two under-resourced Indigenous languages: Bribri, a morphologically complex language of Costa Rica, and Cook Islands Māori, a language of the Cook Islands in the Pacific Ocean with a simpler morphology. ByT5 performs much better when glossing Cook Islands Māori than when glossing Bribri: for the former it attains an accuracy of 60% or higher, while for the latter it hallucinates heavily. Gemini performs strongly for both languages, because a Retrieval-Augmented Generation architecture allows it to be prompted with similar training samples from the corpora. In future work, I will use these models to develop publicly available tools to aid linguistic documentation and revitalization.
Recommended Citation
Anderson, Carter D., "Automatic Glossing in Under-Resourced Languages: Case Studies in Bribri and Cook Islands Māori" (2026). Linguistics Undergraduate Senior Theses. 2.
https://digitalcommons.dartmouth.edu/linguistics_senior_theses/2
Included in
Computational Linguistics Commons, Computer Sciences Commons, Language Description and Documentation Commons, Latin American Languages and Societies Commons, Morphology Commons, Polynesian Studies Commons
