OpenAI’s latest API adds emotion to its more natural voices
OpenAI’s latest development is an updated speech-to-speech model that the company says produces more accurate, natural, and expressive speech. The model, gpt-realtime, is now generally available to developers and enterprises through the Realtime API, after initially being introduced as a beta option last October.

The company says the model is well suited to real-world situations, including customer support, personal assistance, and education. Partners already using the API include StubHub, Zillow, T-Mobile, Lemonade, and Oscar Health.

Developers can deploy gpt-realtime voice agents for tasks such as reading detailed disclaimers, repeating alphanumeric sequences, and switching between languages mid-sentence.

The OpenAI team walked through the update in a short livestream on Thursday, explaining how gpt-realtime has improved since its initial introduction and what it offers developers and enterprise users.

The boosted audio quality gives gpt-realtime a more natural-sounding voice output. The model attends to emotion, inflection, and pacing, creating a better listening experience in chats and applications built on it, and improving how closely the agent follows direction.

One demo showed the model telling a story with the inflection of five different emotions: fear, sadness, curiosity, joy, and excitement. The delivery mimicked an adult telling a bedtime story, and the emotional shading added context that made the story easier to follow. Another prompt showed the model mirroring the instructor, switching from English to Spanish to French while answering their question.

OpenAI also announced two new voices, Marin and Cedar, which will be exclusive to gpt-realtime. The speech improvements will additionally roll out to the eight voices already available in the company’s ecosystem.

OpenAI said in a press release that its work on improving the intelligence and comprehension of gpt-realtime shows in benchmarks: the model scored 82.8% accuracy on Big Bench Audio, compared with 65.6% for the December 2024 model. Similarly, on the company’s instruction-following benchmark, gpt-realtime scored 30.5% on MultiChallenge audio, up from the December 2024 model’s 20.6%.
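As a quick sanity check on the reported scores, the relative gains work out as follows (a simple calculation, not an OpenAI-published figure):

```python
def relative_gain(new: float, old: float) -> float:
    """Percentage improvement of the new score over the old one."""
    return (new - old) / old * 100

# Big Bench Audio: 65.6% -> 82.8%, a ~26% relative improvement
big_bench_gain = relative_gain(82.8, 65.6)

# MultiChallenge audio: 20.6% -> 30.5%, a ~48% relative improvement
multichallenge_gain = relative_gain(30.5, 20.6)
```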

Other notable updates to gpt-realtime include remote MCP (Model Context Protocol) support, which gives sessions access to external tools without manually connecting to a server; Session Initiation Protocol (SIP) support, which enables phone-calling capability within applications; and the ability to save and reuse prompts across Realtime API sessions.
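To illustrate the remote MCP feature, a minimal sketch of a session-configuration event is shown below. The field names follow the MCP tool shape in OpenAI’s published docs, but this is an assumption about the current Realtime API event format, and the server label and URL are placeholders; check the official API reference before relying on it.

```python
import json

# Hypothetical session.update payload pointing a Realtime session at a
# remote MCP server, so the agent can use its tools without a manual
# server connection. Field names are assumptions based on OpenAI's
# published MCP tool shape.
session_update = {
    "type": "session.update",
    "session": {
        "tools": [
            {
                "type": "mcp",
                "server_label": "example-tools",          # placeholder label
                "server_url": "https://example.com/mcp",  # placeholder URL
                "require_approval": "never",
            }
        ]
    },
}

# In practice this JSON would be sent over the Realtime WebSocket connection.
payload = json.dumps(session_update)
```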

The Realtime API is priced at $32 per 1 million audio input tokens ($0.40 per 1 million cached input tokens) and $64 per 1 million audio output tokens. This is 20% less than gpt-4o-realtime-preview, OpenAI noted. Developers can also manage costs with intelligent token limits and control over multi-turn conversation context.
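At those listed rates, a session’s cost can be estimated with simple arithmetic. The token counts below are illustrative assumptions, not measured figures:

```python
# Rates from the article, in USD per 1 million audio tokens
INPUT_PER_M = 32.00
CACHED_INPUT_PER_M = 0.40
OUTPUT_PER_M = 64.00

def session_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a Realtime session at the listed rates."""
    return (input_tokens * INPUT_PER_M
            + cached_tokens * CACHED_INPUT_PER_M
            + output_tokens * OUTPUT_PER_M) / 1_000_000

# Example: 50k fresh input, 200k cached input, 30k output tokens
# -> $1.60 + $0.08 + $1.92, roughly $3.60
cost = session_cost(50_000, 200_000, 30_000)
```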

