This thesis describes several approaches to better understand how large language models interpret different dialects of the English language. Our goal is to consider multiple contexts of textual data and to analyze how English-language dialects are realized in them, as well as how a variety of machine learning techniques handle these differences. We focus on two genres of text data: news and social media. In the news context, we establish a dataset covering news articles from five countries and four US states and consider language modeling analysis, topic and sentiment distributions, and manual analysis. We then perform nine experiments and find that, on the whole, augmenting models with dialectal information improves performance for certain tasks. In the social media context, we construct a dataset of 1.4 million Tweets from six countries and consider manual linguistic analysis, validation from individuals from these different regions, probing with ChatGPT, and an analysis of three pretrained large language models (BERT, BART, and T5) on specific topics within the data, in order to understand how individuals and models interact with dialects in this context. Overall, we find that, although we see dialectal differences in these Tweets from a linguistic perspective, these distinctions are decidedly less clear-cut to individuals attempting to identify Tweets from their own regions; ChatGPT cannot accurately identify these differences; and pretrained large language models can be fine-tuned to distinguish them to a moderate degree.
Datta, Samiha, "Investigating English-Language Dialect-Adjusted Models" (2023). Computer Science Senior Theses. 11.
Available for download on Monday, June 03, 2024