Amazon today introduced Nova Sonic, an advanced speech-to-speech model that lets developers build applications that converse in human-like voices in real time. Amazon claims the new model offers industry-leading price-performance and low latency.
Typically, developing a voice-enabled application requires developers to chain together multiple models in a pipeline:
- Speech recognition model for converting audio to text.
- Large Language Model (LLM) for understanding and generating responses.
- Text-to-speech model for converting the response back into audio.
This approach is not only complex but also tends to lose important acoustic context, such as tone, prosody, and speaking style, because that information is discarded when the audio is reduced to text between stages.
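To make those hand-offs concrete, here is a minimal sketch of such a cascaded pipeline. The three stage functions are hypothetical placeholders, not real model calls; in an actual application each would invoke a separate speech recognition, LLM, and text-to-speech model.

```python
# Sketch of the traditional cascaded voice pipeline described above.
# All three stage functions are placeholders standing in for real models.

def transcribe(audio: bytes) -> str:
    """Speech recognition stage: audio in, plain text out (placeholder)."""
    return "what is the weather today"

def generate_reply(user_text: str) -> str:
    """LLM stage: understand the text and produce a response (placeholder)."""
    return f"You asked: '{user_text}'. Here is an answer."

def synthesize(reply_text: str) -> bytes:
    """Text-to-speech stage: text in, audio out (placeholder)."""
    return reply_text.encode("utf-8")  # stand-in for synthesized audio

def cascaded_turn(audio_in: bytes) -> bytes:
    # Each hop passes only plain text, which is where acoustic details
    # such as tone and prosody get dropped.
    text = transcribe(audio_in)
    reply = generate_reply(text)
    return synthesize(reply)

if __name__ == "__main__":
    print(cascaded_turn(b"\x00\x01"))
```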

Nova Sonic addresses this challenge by integrating speech understanding and generation into a single model. This unified approach lets the model capture the tone, style, and other acoustic characteristics of the audio input, producing more natural dialogue. It also decides when it is appropriate to respond and handles interruptions (barge-ins) more gracefully.
Nova Sonic supports both male and female voices in a variety of English accents, including American and British. Developers can access the model through Amazon Bedrock via a two-way (bidirectional) streaming API that also supports function calling. The model includes built-in safeguards such as content moderation and watermarking.
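The sketch below illustrates the general shape of such a two-way streaming exchange: audio is pushed upstream while transcripts, synthesized speech, and tool-use events stream back. The class, method, and event names here are illustrative placeholders, not the actual AWS SDK surface; consult the Amazon Bedrock documentation for the real API.

```python
# Illustrative only: a hypothetical session wrapper showing the shape of a
# bidirectional streaming conversation. Names below are placeholders, not
# the real Amazon Bedrock SDK.

import asyncio
from dataclasses import dataclass

@dataclass
class Event:
    kind: str          # e.g. "text", "audio", "tool_call" (illustrative)
    payload: bytes

class FakeNovaSonicSession:
    """Placeholder standing in for a real bidirectional streaming session."""

    async def send_audio(self, chunk: bytes) -> None:
        # A real session would push microphone audio upstream here.
        self._last = chunk

    async def events(self):
        # A real session would yield transcripts, audio, and function-calling
        # events as they arrive from the model.
        yield Event("text", b"partial transcript...")
        yield Event("audio", b"\x00\x01 synthesized speech bytes")

async def converse(session: FakeNovaSonicSession, mic_chunks: list[bytes]) -> None:
    for chunk in mic_chunks:
        await session.send_audio(chunk)       # stream audio up...
    async for event in session.events():      # ...while responses stream back
        print(event.kind, len(event.payload), "bytes")

if __name__ == "__main__":
    asyncio.run(converse(FakeNovaSonicSession(), [b"\x00" * 320]))
```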
In the same space, OpenAI last month announced a new generation of speech-to-text models, gpt-4o-transcribe and gpt-4o-mini-transcribe, with significant improvements in word error rate, language recognition, and accuracy over the earlier Whisper models.