Large Language Models Working with Audio
Weiran Wang, assistant professor at the University of Iowa, has defined his career by exploring machine learning and speech processing. Google is helping fund his research on advancing audio comprehension within Large Language Models. Large Language Models, or LLMs, are the models that AI systems use to interpret user inputs and generate responses.
Professor Wang’s research explores audio-based Large Language Models and the “hallucinations” they can produce. While LLMs are designed to process complex audio inputs and generate human-like responses, they may “hallucinate” confident outputs that are factually incorrect or untethered from the original audio input. For the average user, and even for professionals, these errors can be deceptive. Because the AI presents its “hallucinations” with authority, users may accept incorrect information without realizing that it is imaginary.
Reducing Hallucinations
An example of an AI “hallucination” could appear when an LLM processes the sound of a glass breaking. The danger isn’t simply that the AI could mislabel the sound. Even when the input is a single noise, a conversational LLM could try to elaborate, describing a burglar shattering a window or someone knocking a glass over at a party. In these instances, the AI “hears” a story that doesn’t exist. Instead of remaining grounded in the factual acoustic data, the model invents contexts and events to explain the sound.
Professor Wang’s research focuses on preventing these “phantom” narratives and limiting the AI’s responses to facts present in the audio. By reducing the frequency of these “hallucinations,” Wang’s work aims to increase the reliability of audio-based models. As these systems become more grounded, users and professionals can deploy them to effectively handle more complex tasks.
Old Models Versus New Models
Before Google funded his current research, Professor Wang’s work centered on specialized audio processing models. Those models were designed for narrow, predefined tasks: they took an audio input and returned a specific output, based strictly on how researchers trained them. Professor Wang describes those traditional systems as “black boxes” in terms of their flexibility. Unlike modern LLMs, which users can guide through natural language, the older models have their instructions baked into their functionality. Researchers can’t easily alter the models’ behavior or ask them to perform new tasks after training.
One example of a black box would be a system programmed to add a label or timestamp when it recognizes a specific sound, or a system that extracts common threads between two formats, like audio and text. These systems may work well for these tasks, but they can’t adapt to new directions from users. This lack of adaptability limits their utility.
Professor Wang’s current work involves generative, audio-based LLMs. Generative systems understand human language, so users can guide the system as it works, tailoring it to do exactly what they want. For example, the same generative LLM can provide subtitles for audio, compare text and audio, and create timestamps for sounds—depending on how the user prompts the system.
Generative LLMs are more adaptable, but they can still “hallucinate” false results. Professor Wang’s work seeks to improve the reliability of these systems and overcome obstacles that past models couldn’t handle.
Advancing the Future of LLM Research
LLM research spans many types of models, including both audio-based and video-language systems. The communities behind them face similar issues with hallucinations, but the video-language community is much larger than the one focused on audio. Professor Wang explains that he and other researchers are building a foundation for the audio-based LLM field by borrowing from the literature that the faster-moving video-language community is producing.
Before researchers can eliminate unreliable hallucinations from LLMs, they need robust ways to track these issues. A significant phase of Professor Wang’s work involves designing specialized standards to evaluate when hallucinations appear. This gives researchers and the broader scientific community a metric to systematically reduce the number of hallucinations and improve the trustworthiness of AI.
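As a loose illustration of what such an evaluation standard might measure, the sketch below computes a simple “hallucination rate”: the fraction of events a model claims to hear that are not supported by a human reference annotation of the recording. The event labels and the metric itself are invented for this example; they are not drawn from Professor Wang’s actual benchmarks.

```python
# Hypothetical sketch of a hallucination-rate metric for an audio LLM.
# Event labels and outputs are illustrative, not real benchmark data.

def hallucination_rate(predicted_events, reference_events):
    """Fraction of predicted audio events not supported by the
    reference annotation of the recording."""
    reference = set(reference_events)
    unsupported = [e for e in predicted_events if e not in reference]
    return len(unsupported) / len(predicted_events) if predicted_events else 0.0

# Reference annotation: what is actually audible in the clip.
reference = ["glass_breaking"]
# Model output: the grounded event plus two invented ones.
predicted = ["glass_breaking", "burglar_entering", "party_chatter"]

print(hallucination_rate(predicted, reference))  # → 0.6666666666666666
```

A shared metric like this, however simple, is what lets different research groups compare systems and track whether a new training method actually reduces hallucinations.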
The goal of Professor Wang and other researchers is to teach an LLM to align the sound of glass breaking with an accurate understanding of glass breaking. As LLMs become more reliable and adaptable, they provide better experiences for users and researchers alike.
To further advance LLMs, Professor Wang is exploring the use of Reinforcement Learning, or RL. Much as positive reinforcement encourages certain behaviors in people, reinforcement learning uses a system of “rewards,” expressed as mathematical scores, to train the AI. The model is rewarded when it generates outputs grounded in the factual audio and penalized when it hallucinates. Models that learn from their own generations through reinforcement learning are effectively able to self-correct. Professor Wang is especially interested in pushing the limits of reinforcement learning algorithms and optimizing them to handle complex modern AI systems.
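To make the reward-and-penalty idea concrete, here is a minimal sketch of the kind of scoring signal such training might use: each event the model reports earns a positive reward if the audio’s reference annotation supports it and a penalty if it is invented. The function name, labels, and weights are assumptions made for illustration, not Professor Wang’s actual method.

```python
# Toy sketch of a reinforcement-learning-style reward signal that
# favors outputs grounded in the audio and penalizes hallucinations.
# Names and weights are illustrative assumptions only.

def grounding_reward(predicted_events, reference_events,
                     reward=1.0, penalty=-1.0):
    """Score one model output: +reward for each event supported by
    the audio's reference annotation, +penalty for each invented one."""
    reference = set(reference_events)
    score = 0.0
    for event in predicted_events:
        score += reward if event in reference else penalty
    return score

reference = ["glass_breaking"]
print(grounding_reward(["glass_breaking"], reference))                      # → 1.0
print(grounding_reward(["glass_breaking", "burglar_entering"], reference))  # → 0.0
```

In a full RL setup this score would feed back into the model’s training, nudging it toward grounded descriptions and away from invented narratives.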
Exciting Progress on the Horizon
For those who are interested in learning more about audio-based AI, or AI in general, Professor Wang suggests looking beyond the classroom.
“What we’re doing is relatively new,” he says, “so it’s hard to find traditional textbook instructions. Instead, I recommend following the latest research articles from top machine learning conferences. Archive sites are the best way to keep up with rapid progress in the field. I frequently search them for new papers myself.”
He also notes that undergraduates who are interested in joining this line of research are welcome to reach out to him directly.
While Professor Wang is dedicated to solving the fundamental challenges of audio LLMs, he is equally passionate about applying these advancements outside of computer science. AI can help other scientific fields process complex data more efficiently, and those fields push researchers to build models that can handle unique obstacles. For example, Professor Wang is collaborating with the University of Iowa’s chemistry department as a key member of the interdisciplinary FACET team, which is leveraging AI to accelerate breakthroughs in complex chemistry research, specifically within the realm of radiochemistry.
The UIowa Department of Computer Science is proud to be home to faculty who are thought leaders and innovators in AI, like Weiran Wang, whose research keeps the university on the leading edge of technological advancement.