Author ORCID Identifier

https://orcid.org/0009-0008-5167-9048

Date of Award

Spring 5-20-2026

Document Type

Thesis (Master's)

Department or Program

Computer Science

First Advisor

Soroush Vosoughi

Abstract

Vision Language Models (VLMs) have become popular tools for everyday use across many domains. However, these models exhibit notable flaws, including poor visual acuity, weak geometric reasoning, and internalized biases. This thesis focuses on improving model performance in visual acuity and geometric reasoning. We study tasks in which a VLM is asked, for example, to determine whether two circles are touching or to count the shapes in an image. We contribute a schema-based framework for improving model performance on these tasks. The framework comprises three key components: (1) a proof-of-concept pipeline adaptable for inference, (2) a Reinforcement Learning framework designed to optimize the schemas, and (3) a model-based verification and refinement pipeline that iteratively improves schema quality. We evaluate this approach on both closed-source and open-source models, showing significant performance gains and evidence that the schema-based concept is effective. Our code is available at https://github.com/carlosguealv/vlms_symb_project.

Available for download on Thursday, May 20, 2027
