IFML Seminar: 09/19/25 - Speech Generation and Sound Understanding in the Era of Large Language Models

David Harwath, Assistant Professor, Computer Science, UT Austin

The University of Texas at Austin
Gates Dell Complex (GDC 6.302)
2317 Speedway
Austin, TX 78712
United States

Event Registration

Abstract: LLMs have not only revolutionized text-based natural language processing, but their multimodal extensions have also proven to be extremely powerful models for vision, speech, and natural sounds. By tokenizing inputs from disparate modalities and mapping them into the same input/output space as text, these models can learn not only how to reason over multimodal inputs, but also how to generate new speech, audio, and visual outputs guided by text instructions, which multiplies the number of capabilities they have. In my talk, I will discuss several recent works in this direction from my lab at UT Austin.

The first part of my talk will describe our work on VoiceCraft, a neural codec language model capable of performing voice-cloning text-to-speech synthesis, as well as targeted edits of speech recordings in which words can be arbitrarily inserted, deleted, or substituted in the waveform itself. I will also discuss our recent work on text-controllable TTS models that can manipulate not only basic attributes of the speech signal such as pitch and speaking rate, but also more abstract, higher-level vocal styles such as "husky", "nasal", "sleepy", and so forth.

In the second part of my talk, I will discuss our work on spatial sound understanding. I will introduce SpatialSoundQA, a dataset containing 800,000 ambisonic waveforms and accompanying question-answer pairs, which can be used to train and evaluate models on their ability to answer questions such as “Is the sound of the telephone further to the left than the sound of the barking dog?” I will also describe our BAT model, an extension of the LLaMA LLM that is capable of taking spatial audio recordings as input and reasoning about them using natural language. 
