IFML Seminar: 09/19/25 - Speech Generation and Sound Understanding in the Era of Large Language Models

David Harwath, Assistant Professor, Computer Science, UT Austin

12:15 - 1:15pm

The University of Texas at Austin
Gates Dell Complex (GDC 6.302)
2317 Speedway
Austin, TX 78712
United States

Abstract: LLMs have not only revolutionized text-based natural language processing, but their multimodal extensions have proven to be extremely powerful models for vision, speech, and natural sounds. By tokenizing inputs from disparate modalities and mapping them into the same input/output space as text, these models are capable of learning not only how to reason over multimodal inputs, but also how to generate new speech, audio, and visual outputs guided by text instructions, multiplying the number of capabilities these models have. In my talk, I will discuss several recent works in this direction from my lab at UT Austin.

The first part of my talk will describe our work on VoiceCraft, a neural codec language model capable of performing voice-cloning text-to-speech synthesis, as well as targeted edits of speech recordings where words can be arbitrarily inserted, deleted, or substituted in the waveform itself. I will also discuss our recent work on text-controllable TTS models that can manipulate not only basic attributes of the speech signal such as pitch and speaking rate, but also more abstract and higher-level vocal styles such as "husky", "nasal", "sleepy", and so forth.

In the second part of my talk, I will discuss our work on spatial sound understanding. I will introduce SpatialSoundQA, a dataset containing 800,000 ambisonic waveforms and accompanying question-answer pairs, which can be used to train and evaluate models on their ability to answer questions such as “Is the sound of the telephone further to the left than the sound of the barking dog?” I will also describe our BAT model, an extension of the LLaMA LLM that is capable of taking spatial audio recordings as input and reasoning about them using natural language.

Bio: David Harwath is an assistant professor in the computer science department at UT Austin, where he leads the Speech, Audio, and Language Technologies (SALT) Lab. His group's research focuses on developing novel machine learning methods applied to speech, audio, and multimodal data for tasks such as automatic speech recognition, text-to-speech synthesis, and acoustic scene analysis. He has received the NSF CAREER award (2023) and an ASRU best paper nomination (2015), and was awarded the 2018 George M. Sprowls Award for best computer science PhD thesis at MIT. He holds a B.S. in electrical engineering from UIUC (2010), an S.M. in computer science from MIT (2013), and a Ph.D. in computer science from MIT (2018).
