Microsoft’s AI voice cloning technology is so good you can’t use it

Microsoft’s research team has unveiled VALL-E 2, a new AI speech synthesis system capable of generating “human-level” voices indistinguishable from the source using just a few seconds of audio.

“[VALL-E 2 is] the latest development in neural codec language models, representing a milestone in zero-shot text-to-speech (TTS) synthesis and achieving human parity for the first time,” the research paper states. The system builds on its predecessor, VALL-E, which was introduced in early 2023. Neural codec language models represent speech as sequences of discrete codes, which the model learns to predict much like text tokens.
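
As a rough illustration of that idea, the toy sketch below turns a waveform into a sequence of integer codes and back. Real neural codecs use learned encoders rather than the uniform quantizer shown here, and nothing in this snippet reflects Microsoft’s actual implementation.

```python
import numpy as np

# Toy illustration of the "speech as code sequences" idea behind neural codec
# language models. A real codec uses a learned neural encoder; this uniform
# quantizer is only a stand-in for the concept.
codebook = np.linspace(-1.0, 1.0, 256)  # 256 hypothetical code values

def encode(waveform: np.ndarray) -> np.ndarray:
    # Map each audio sample to the index of its nearest codebook entry,
    # yielding integer tokens a language model could predict like text.
    return np.argmin(np.abs(waveform[:, None] - codebook[None, :]), axis=1)

def decode(codes: np.ndarray) -> np.ndarray:
    # Reconstruct an approximate waveform from the code indices.
    return codebook[codes]

wave = np.sin(np.linspace(0, 2 * np.pi, 16))
codes = encode(wave)
print(codes[:8])          # a short sequence of integer "speech tokens"
print(decode(codes)[:4])  # approximate reconstruction of the original samples
```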

What sets VALL-E 2 apart from other voice cloning techniques, according to the team, is its “repetition aware sampling” method and its adaptive switching between sampling techniques. These strategies improve the stability of the output and address the most common failure modes of conventional generative voice models.
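
In essence, repetition aware sampling is nucleus sampling that falls back to random sampling over the full distribution when the chosen token starts to dominate the recent output. Below is a minimal sketch of that idea; the window size, threshold, and function names are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def repetition_aware_sample(probs, prev_tokens, top_p=0.9, window=10, max_ratio=0.5):
    """Pick the next codec token; the parameter values here are assumptions."""
    # Nucleus (top-p) sampling: keep the smallest set of high-probability
    # tokens whose cumulative mass reaches top_p, then sample among them.
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]
    token = int(np.random.choice(keep, p=probs[keep] / probs[keep].sum()))

    # Repetition check: if the sampled token already dominates the recent
    # window, fall back to random sampling over the full distribution so the
    # decoder is less likely to get stuck looping on the same sound.
    recent = list(prev_tokens)[-window:]
    if recent and recent.count(token) / len(recent) > max_ratio:
        token = int(np.random.choice(len(probs), p=probs))
    return token

# Toy usage: a random distribution over 32 hypothetical codec tokens and a
# history that has been repeating token 5.
probs = np.random.dirichlet(np.ones(32))
print(repetition_aware_sample(probs, [5, 5, 5, 5, 5]))
```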

“VALL-E 2 consistently synthesizes high-quality speech, even for sentences that have traditionally been challenging due to their complexity or repetitive phrases,” the researchers wrote, noting that the technology could help generate speech for people who have lost the ability to speak.

As impressive as the tool is, it is not being made available to the public.

“We currently have no plans to incorporate VALL-E 2 into a product or expand public access,” Microsoft said in its ethics statement, noting that such tools pose risks, including voice imitation without consent and the use of convincing AI voices in fraud and other criminal activity.

The research team emphasized the need for a standard method of digitally tagging AI-generated content and acknowledged that detecting such content with high precision remains a challenge.

“If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a model for detecting synthesized speech,” they wrote.

Still, VALL-E 2’s results are strikingly accurate compared to other tools. In a series of tests conducted by the research team, VALL-E 2 outperformed human benchmarks in the robustness, naturalness, and similarity of the generated speech.

Image: Microsoft

VALL-E 2 was able to achieve these results with just three seconds of audio. However, the research team found that “using 10-second speech samples resulted in even better quality.”

Microsoft isn’t the only AI company to have demonstrated a cutting-edge AI model without releasing it. Meta’s Voicebox and OpenAI’s Voice Engine are two impressive voice cloning models, but they face similar restrictions.

“There are many exciting use cases for generative speech models, but because of the potential risk of misuse, we are not currently making the Voicebox model or code publicly available,” a Meta AI spokesperson told Decrypt last year.

OpenAI has likewise said it wants to resolve safety concerns before bringing its synthetic voice model to market.

“In line with our approach to AI safety and our voluntary commitments, we have decided to preview this technology at this time, but not make it generally available,” OpenAI said in an official blog post.

This call for ethical guidelines is spreading throughout the AI community, especially as regulators begin to raise concerns about the impact of generative AI on our daily lives.

Edited by Ryan Ozawa.