Author ORCID Identifier

https://orcid.org/0000-0001-7494-9874

Date of Award

2024

Document Type

Thesis (Ph.D.)

Department or Program

Computer Science

First Advisor

Soroush Vosoughi

Abstract

Pre-trained language models (PLMs), such as GPT-4, which powers ChatGPT, face various safety issues, including biased responses and a lack of alignment with users' backgrounds and expectations. These problems threaten their social acceptance and public deployment. Current strategies for addressing these safety concerns are primarily data-driven, requiring extensive human effort for data annotation and substantial training resources. Research indicates that the nature of these safety issues evolves over time, necessitating continual data updates and model re-training, an approach that is both resource-intensive and time-consuming. This thesis introduces a novel, model-centric strategy for understanding and mitigating the safety issues of PLMs by leveraging model interpretations. It aims to comprehensively understand how PLMs encode "harmful phenomena" such as stereotypes and cultural misalignments, and to use this understanding to mitigate safety issues efficiently, minimizing resource and time costs. This is particularly relevant for large, over-parameterized language models. Furthermore, this research explores enhancing the consistency and robustness of interpretation methods through small, heuristically constructed control datasets. These improvements in interpretation methods are expected to increase the effectiveness of reducing safety issues in PLMs, contributing to their positive societal impact in an era of widespread global use.
