Author ORCID Identifier
https://orcid.org/0009-0006-6519-8857
Date of Award
Spring 6-15-2025
Document Type
Thesis (Master's)
Department or Program
Computer Science
First Advisor
Soroush Vosoughi
Abstract
The preservation and revitalization of endangered languages, particularly those with minimal digital presence, present significant challenges for computational linguistics. This thesis addresses these challenges by proposing novel methods for language identification and data generation, focusing on underrepresented Indigenous languages: Nüshu, Native American languages, and Native Alaskan languages.
In the first study, a COLING 2025 paper, we present NüshuRescue, an AI-driven framework designed to facilitate the preservation of Nüshu, an endangered script used exclusively by Yao women in China. Using minimal seed data, we demonstrate how GPT-4-Turbo can generate new translations to expand a publicly available Nüshu-Chinese corpus, achieving 48.69% accuracy when translating unseen examples.
The second study, a NAACL 2025 paper, introduces a Random Forest classifier tailored for identifying Native American languages, including Navajo, which has been consistently misidentified by existing language identification tools such as Google’s LangID. By building a custom dataset and training on the languages these tools misidentify, we achieve near-perfect classification accuracy, illustrating the potential of lightweight, decentralized language identification models.
Finally, in the third study, an ACL 2025 Findings paper, we extend this work to 20 Native Alaskan languages, using few-shot prompting of LLMs and fine-tuning of XLM-RoBERTa to achieve near-perfect identification accuracy for these endangered languages. These results underscore the feasibility of building robust language technologies for low-resource languages with minimal data, contributing both to the technical field and to broader efforts to preserve and revitalize Indigenous languages. This thesis provides a scalable approach to endangered language identification, with the potential to make a significant impact on the preservation of linguistic diversity in the digital age.
Original Citation
@inproceedings{yang2025nushurescue,
  title = {N{\"u}shuRescue: Reviving the Endangered N{\"u}shu Language with AI},
  author = {Yang, Ivory and Ma, Weicheng and Vosoughi, Soroush},
  booktitle = {Proceedings of the 31st International Conference on Computational Linguistics},
  pages = {7020--7034},
  year = {2025}
}
@inproceedings{yang-etal-2025-navajo,
  title = {Is It {N}avajo? Accurate Language Detection for Endangered Athabaskan Languages},
  author = {Yang, Ivory and Ma, Weicheng and Zhang, Chunhui and Vosoughi, Soroush},
  booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)},
  pages = {277--284},
  year = {2025}
}
Recommended Citation
Yang, Ivory, "Revitalization Of Endangered Languages With AI" (2025). Dartmouth College Master’s Theses. 215.
https://digitalcommons.dartmouth.edu/masters_theses/215
