Author ORCID Identifier

https://orcid.org/0009-0008-4516-0135

Date of Award

1-2025

Document Type

Thesis (Master's)

Department or Program

Computer Science

First Advisor

Michael Casey

Second Advisor

Tim Tregubov

Third Advisor

Elizabeth Murnane

Abstract

This master's thesis introduces OpenMUSE (Open Multimodal Unified Sound Engine), a platform that demonstrates the potential of open-source AI music generation by integrating state-of-the-art deep learning models into a unified system. By combining ten open-source models, including MusicGen, AudioLDM2, and custom-trained text-to-symbolic music generation models, OpenMUSE provides a user-friendly interface that empowers artists to produce complex, adaptive musical compositions. The system enhances accessibility through a simple web interface and natural language controls, and improves controllability through features such as melody conditioning and semantic audio editing. Specifically, OpenMUSE offers a digital audio workstation (DAW)-inspired interface that lowers the technical barriers to music production while providing AI-powered generation and editing capabilities.
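
As a concrete illustration of the kind of model the platform integrates, the following is a minimal sketch of text-to-music generation with MusicGen through the Hugging Face transformers API. The checkpoint name, prompt, and token budget are assumptions chosen for demonstration, not OpenMUSE's actual configuration.

```python
# Minimal sketch: text-to-music generation with MusicGen via Hugging Face
# transformers. The checkpoint, prompt, and token budget below are
# illustrative assumptions, not details taken from the thesis.
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

# Encode the natural-language prompt into model inputs.
inputs = processor(
    text=["lo-fi hip hop beat with mellow piano"],
    padding=True,
    return_tensors="pt",
)

# Generate roughly 5 seconds of audio (256 tokens at a 50 Hz frame rate).
audio = model.generate(**inputs, max_new_tokens=256)

# Write the mono waveform to disk at the audio codec's sampling rate.
rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("musicgen_out.wav", rate=rate, data=audio[0, 0].numpy())
```

A unified platform like the one described would wrap calls of this shape behind a single interface, so that switching among generation models does not change the user-facing workflow.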

Quantitative and qualitative evaluations demonstrate significant improvements in both objective audio quality metrics and user satisfaction compared with existing single-model approaches. User studies validate that integrating specialized models for different aspects of the music creation process, from melody generation to stem separation, enhances creative control without sacrificing ease of use. The system's architecture shows how thoughtful interface design and model integration can make sophisticated AI music tools accessible to users of varying technical expertise.

Key contributions include:

  • A web interface that integrates multiple open-source music generation models, providing intuitive controls for practical music creation workflows.
  • A custom-trained text-to-symbolic music generation model that translates textual descriptions into symbolic musical representations in MIDI format (a minimal decoding sketch follows this list).
  • A unified system that bridges different modalities (text, audio, image) and distills the capabilities of various models into five core music creation tasks: accompaniment generation, text-to-music generation, audio editing, stem extraction, and text-to-MIDI generation.
  • Detailed documentation of the attempts and challenges in integrating and optimizing various AI models for this music generation platform.

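To make the text-to-MIDI contribution above concrete, here is a minimal sketch of the final decoding step such a pipeline might use: converting generated note events into a standard MIDI file with the pretty_midi library. The (pitch, start, duration) event format, the velocity value, and the output filename are illustrative assumptions, not details taken from the thesis model.

```python
# Hypothetical sketch of the last step in a text-to-MIDI pipeline: turning
# decoded note events into a standard MIDI file with pretty_midi. The
# (pitch, onset, duration) event format is an assumed illustration, not the
# thesis model's actual output vocabulary.
import pretty_midi

# Example decoded events: (MIDI pitch, onset seconds, duration seconds).
note_events = [(60, 0.0, 0.5), (64, 0.5, 0.5), (67, 1.0, 1.0)]  # C, E, G

pm = pretty_midi.PrettyMIDI()
piano = pretty_midi.Instrument(program=0)  # program 0 = Acoustic Grand Piano

for pitch, start, duration in note_events:
    piano.notes.append(
        pretty_midi.Note(velocity=96, pitch=pitch, start=start, end=start + duration)
    )

pm.instruments.append(piano)
pm.write("generated.mid")  # symbolic output, editable in any DAW
```

Emitting symbolic MIDI rather than raw audio keeps the output editable after generation, which aligns with the DAW-inspired workflow described in the abstract.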