OpenAI can clone voices with just 15 seconds of audio

March 31 (UPI) – A new artificial intelligence model unveiled by ChatGPT developer OpenAI can clone a person’s voice from just a few seconds’ worth of audio, the company announced as it shared preliminary findings from its exploration of the technology’s capabilities.

The artificial intelligence model, called Voice Engine, needs only a single 15-second audio sample to generate speech that closely resembles the original speaker’s voice, OpenAI announced in a blog post on Friday. The technology was first developed in late 2022 and has been used to power the preset voices available in the Text-to-Speech API as well as the ChatGPT Voice and Read Aloud features.
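For readers curious about the publicly available side of this, the preset voices mentioned above are reachable today through OpenAI’s documented Text-to-Speech API. The sketch below shows a minimal call to that endpoint using the official Python SDK; the “tts-1” model and “alloy” voice are documented presets, not the unreleased 15-second cloning capability, and the example text is illustrative only.

from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Request speech synthesis with one of the preset voices (no voice cloning involved).
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Force is a push or pull that can change the motion of an object.",
)

# Save the returned audio to an MP3 file.
response.stream_to_file("speech.mp3")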

The technology has been tested with OpenAI’s corporate partners, with striking results. Among the examples shared was a touching audio recording of a young girl treated at the Norman Prince Neurosciences Institute thanking doctors Fatima Mirza, Rohaid Ali and Konstantina Svokos.

The girl lost the ability to speak normally because of a vascular brain tumor. While she can still form words and sentences, her voice no longer sounds the way it used to. Doctors used an audio clip she had recorded for a school project to recreate her original voice, so she no longer sounds impaired when she speaks.

“We are taking a cautious and informed approach to wider release as there could be misuse of synthetic voices,” the company said. “We hope to spark a dialogue about the responsible use of synthetic voices and how society can adapt to these new possibilities.”

OpenAI, which has not released the model as a standalone product or a more comprehensive tool, said it had begun testing its capabilities privately with a “small group of trusted partners” and was “impressed by the potential applications.” But the company said it was continuing to have discussions about whether and how to deploy the technology at scale.

OpenAI said Voice Engine could be used, among other things, to provide reading assistance to non-readers and children. The company has partnered with Age of Learning, an educational technology company that is using the technology to create scripted educational content.

OpenAI released a 15-second audio sample in which a male speaker defines “force” in a physics context. The model was then applied to other topics, allowing the AI to generate audio on biology, chemistry, reading and math.

HeyGen, another adopter of the technology, is an AI-based visual storytelling platform that works with other companies to create human-like avatars for product marketing and sales demonstrations. It uses Voice Engine to translate the audio in its videos.

“When used for translation, Voice Engine preserves the native accent of the original speaker: for example, generating English with an audio sample from a French speaker will produce French-accented speech,” OpenAI said.

As a source clip, the company released an audio recording of an American-sounding woman speaking English. The clip was then translated into Spanish, Mandarin, German, French and Japanese – all in the voice of the original speaker.

Voice Engine has also been used to assist non-speaking people through Livox, a Brazilian company developing an AI-based alternative communication app that lets non-speaking users speak with voices powered by Voice Engine.

“For example, a non-speaking person can have a unique voice that is not robotic and sounds exactly the same in multiple languages,” Livox said on social media. “We hope Livox users will be able to access these voices soon!”

The news comes after OpenAI unveiled its video generation model Sora, which can create realistic videos from a text prompt. Critics are increasingly concerned about the implications of artificial intelligence models, including their ability to create audio and video deepfakes.