Document Type

Thesis (Ph.D.)

Department or Program

Computer Science

First Advisor

Soroush Vosoughi


Pre-trained language models (PLMs) such as GPT-4, which powers ChatGPT, face various safety issues, including biased responses and a lack of alignment with users' backgrounds and expectations. These problems threaten their social acceptability and public deployment. Current strategies for addressing these safety concerns are primarily data-driven, requiring extensive human effort in data annotation and substantial training resources. Research indicates that the nature of these safety issues evolves over time, necessitating continual updates to data and model re-training, an approach that is both resource-intensive and time-consuming. This thesis introduces a novel, model-centric strategy for understanding and mitigating the safety issues of PLMs by leveraging model interpretations. It aims to comprehensively understand how PLMs encode "harmful phenomena" such as stereotypes and cultural misalignments, and to use this understanding to mitigate safety issues efficiently, minimizing resource and time costs. This is particularly relevant for large, over-parameterized language models. Furthermore, this research explores enhancing the consistency and robustness of interpretation methods through the use of small, heuristically constructed control datasets. These improvements in interpretation methods are expected to make the mitigation of safety issues in PLMs more effective, contributing to their positive societal impact in an era of widespread global use.