Date of Award

Spring 6-9-2024

Document Type

Thesis (Ph.D.)

Department or Program

Computer Science

First Advisor

Soroush Vosoughi


The field of Natural Language Processing (NLP) has undergone a significant transformation with the emergence of large language models (LMs). These models have enabled the development of human-like conversational assistants (e.g., OpenAI's ChatGPT) and expert-level AI software engineering agents (e.g., Devin from Cognition Lab). However, these models face a fundamental challenge rooted in how they are trained. They are predominantly trained on vast datasets scraped from the web, and their self-supervised learning objective (predicting missing tokens) unintentionally perpetuates the biases, inaccuracies, and sensitive information present in that data. These issues give rise to what are termed misaligned behaviors, which pose a significant hurdle to building reliable and trustworthy AI systems.

The goal of "aligning language models with the human world" is to mitigate these challenges by ensuring that language models align more closely with human knowledge and societal values. This thesis introduces new training paradigms that enable language models to learn from grounded experiences or simulated social interactions, bringing their behavior closer to human expectations. It also explores "scalable oversight", an area focused on leveraging the superior capabilities of AI models to help humans ensure that LMs operate consistently with human values and knowledge. Through these contributions, this thesis advances the development of more reliable, accurate, and socially aligned LMs, addressing one of the key challenges in NLP.