Author ORCID Identifier

https://orcid.org/0009-0000-4722-4595

Date of Award

Spring 5-31-2023

Document Type

Thesis (Undergraduate)

Department

Computer Science

First Advisor

Soroush Vosoughi

Abstract

This thesis describes several approaches to better understand how large language models interpret different dialects of the English language. Our goal is to consider multiple contexts of textual data, to analyze how English-language dialects are realized in them, and to examine how a variety of machine learning techniques handle these differences. We focus on two genres of text data: news and social media. In the news context, we establish a dataset covering news articles from five countries and four US states and consider language modeling analysis, topic and sentiment distributions, and manual analysis before performing nine experiments and evaluating the results, finding that, on the whole, augmenting models with dialectal information improves performance for certain tasks. In the social media context, we construct a dataset of 1.4 million Tweets from six countries and consider manual linguistic analysis, validation by individuals from these different regions, probing with ChatGPT, and an analysis of three pretrained large language models (BERT, BART, and T5) on specific topics within the data to understand how individuals and models interact with dialects in this context. Overall, we find that, although these Tweets show dialectal differences from a linguistic perspective, the distinctions are decidedly less clear-cut to individuals attempting to identify Tweets from their own regions; ChatGPT cannot accurately identify these differences; and pretrained large language models can be fine-tuned to distinguish them to a moderate degree.

Available for download on Monday, June 03, 2024
