Author ORCID Identifier

https://orcid.org/0000-0003-1714-5102

Date of Award

Spring 6-4-2023

Document Type

Thesis (Undergraduate)

Department

Computer Science

First Advisor

Soroush Vosoughi

Abstract

This thesis describes a variety of approaches to examining how language models encode stereotypes (understanding stereotypes from a model point of view), debiasing language models, and using language models to understand how stereotypes affect conversations (understanding stereotypes from a conversational point of view). We present a novel approach to textual-clue analysis that makes language models more interpretable, combining an analysis of which stereotypes the internal structures of language models encode during their initial training (via attention-based analysis) with an analysis of which textual clues are most relevant to models trained to detect stereotypes (via SHAP-based analysis). We find that different pre-trained language models may encode different stereotypes for a given minority group, and that different attention heads within the same pre-trained model may also encode different stereotypes. We propose debiasing techniques that make the outputs of generative language models less stereotypical, with promising results (p-value = 0.008). We then examine the nuanced effects of stereotypes on conversations between speakers of different dialects (Standard American English and African American Vernacular English), finding significant differences in content when speakers of different dialects interact compared with when speakers of the same dialect interact. Further study of the robustness of these techniques, along with new techniques, could improve our understanding of how language models encode stereotypes and how stereotypes affect conversations, allowing us to progress toward less biased language models and to bridge gaps in critical information between different groups of people.
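
As a concrete illustration of the attention-based analysis mentioned in the abstract, the sketch below shows one way to inspect how strongly individual attention heads in a pre-trained model attend from a group-referencing token to a candidate stereotype-linked token. This is a minimal, hypothetical example rather than the thesis code: the model name, example sentence, and token pair are assumptions chosen purely for illustration.

```python
# Minimal sketch (assumptions, not the thesis implementation): rank attention
# heads of a pre-trained model by how strongly one token attends to another.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # assumed; the thesis compares several pre-trained models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

sentence = "The nurse said that she was tired."  # illustrative sentence only
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, heads, seq_len, seq_len).
attentions = torch.stack(outputs.attentions).squeeze(1)  # (layers, heads, seq, seq)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

src = tokens.index("she")    # token whose attention we inspect
tgt = tokens.index("nurse")  # candidate stereotype-linked token

# Rank heads by how strongly "she" attends to "nurse"; different heads (and
# different pre-trained models) can rank such associations differently.
scores = attentions[:, :, src, tgt]  # (layers, heads)
top = torch.topk(scores.flatten(), k=5)
for score, idx in zip(top.values, top.indices):
    layer, head = divmod(idx.item(), scores.shape[1])
    print(f"layer {layer}, head {head}: attention {score:.3f}")
```

In this kind of per-head view, attention scores are only a proxy for encoded associations; the thesis pairs it with SHAP-based analysis of trained stereotype detectors to identify which textual clues drive their predictions.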
