Author ORCID Identifier

https://orcid.org/0009-0006-6519-8857

Date of Award

Spring 6-15-2025

Document Type

Thesis (Master's)

Department or Program

Computer Science

First Advisor

Soroush Vosoughi

Abstract

The preservation and revitalization of endangered languages, particularly those with minimal digital presence, present significant challenges for computational linguistics. This thesis addresses these challenges by proposing novel methods for language identification and data generation, focusing on underrepresented Indigenous languages, specifically Nüshu as well as Native American and Native Alaskan languages.

In the first study, a COLING 2025 paper, we present NüshuRescue, an AI-driven framework designed to facilitate the preservation of Nüshu, an endangered script used exclusively by Yao women in China. Using minimal seed data, we demonstrate that GPT-4-Turbo can generate new translations to expand a publicly available Nüshu-Chinese corpus, achieving 48.69% accuracy in translating unseen examples.

The second study, a NAACL 2025 paper, introduces a Random Forest classifier tailored for identifying Native American languages, including Navajo, which has been consistently misidentified by existing language identification tools such as Google’s LangID. By building a custom dataset and training on the misidentified languages, we achieve near-perfect classification accuracy, illustrating the potential of lightweight, decentralized language identification models.

Finally, in the third study, an ACL 2025 Findings paper, we extend this work to 20 Native Alaskan languages, using few-shot prompting with LLMs and fine-tuning of XLM-RoBERTa to achieve near-perfect identification accuracy for these endangered languages. These results underscore the feasibility of building robust language technologies for low-resource languages from minimal data, contributing both to the technical field and to broader efforts to preserve and revitalize Indigenous languages. This thesis provides a scalable approach to endangered language identification that could meaningfully support the preservation of linguistic diversity in the digital age.

Original Citation

@inproceedings{yang2025nushurescue,
  title = {N{\"u}shuRescue: Reviving the Endangered N{\"u}shu Language with AI},
  author = {Yang, Ivory and Ma, Weicheng and Vosoughi, Soroush},
  booktitle = {Proceedings of the 31st International Conference on Computational Linguistics},
  pages = {7020--7034},
  year = {2025}
}

@inproceedings{yang-etal-2025-navajo,
  title = {Is It {N}avajo? Accurate Language Detection for Endangered Athabaskan Languages},
  author = {Yang, Ivory and Ma, Weicheng and Zhang, Chunhui and Vosoughi, Soroush},
  booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)},
  pages = {277--284},
  year = {2025}
}
