Author ORCID Identifier
https://orcid.org/0000-0001-7494-9874
Date of Award
2024
Document Type
Thesis (Ph.D.)
Department or Program
Computer Science
First Advisor
Soroush Vosoughi
Abstract
Pre-trained language models (PLMs), such as GPT-4, which powers ChatGPT, face various safety issues, including biased responses and a lack of alignment with users' backgrounds and expectations. These problems threaten their sociability and public application. Current strategies for addressing these safety concerns are primarily data-driven, requiring extensive human effort in data annotation and substantial training resources. Research indicates that the nature of these safety issues evolves over time, necessitating continual data updates and model re-training, an approach that is both resource-intensive and time-consuming. This thesis introduces a novel, model-centric strategy for understanding and mitigating the safety issues of PLMs by leveraging model interpretations. It aims to comprehensively understand how PLMs encode "harmful phenomena" such as stereotypes and cultural misalignments, and to use this understanding to mitigate safety issues efficiently, minimizing resource and time costs. This is particularly relevant for large, over-parameterized language models. Furthermore, this research explores enhancing the consistency and robustness of interpretation methods through the use of small, heuristically constructed control datasets. These improvements in interpretation methods are expected to increase the effectiveness of reducing safety issues in PLMs, contributing to their positive societal impact in an era of widespread global use.
Recommended Citation
Ma, Weicheng, "Mitigating Safety Issues in Pre-trained Language Models: A Model-Centric Approach Leveraging Interpretation Methods" (2024). Dartmouth College Ph.D Dissertations. 280.
https://digitalcommons.dartmouth.edu/dissertations/280