
Microsoft’s AI reaches parity with the human voice, public release put on hold

Illustration: an AI language model generating text, created by Wes Cockx as part of the Visualising AI project.

Microsoft claims its new AI text-to-speech (TTS) generator has reached “human levels” as it can now accurately imitate a human voice.

VALL-E 2 succeeds the first-generation model released in January 2023 and is designed to reproduce a speaker’s voice from just a few seconds of sample audio, according to a research paper published by Microsoft researchers.

“VALL-E 2 can produce precise, natural speech in the exact voice of the original speaker, comparable to human performance,” the researchers wrote in a blog post.

Outperforming competitors

The US Sun reported that VALL-E 2 outperformed genuine audio samples from established speech datasets such as LibriSpeech and VCTK in evaluations of speaker similarity, naturalness and speech quality.

In addition, the new AI tool showed promising performance in generating not only simple but also complex sentences using zero-shot learning, meaning VALL-E 2 does not need prior examples of a particular voice or sentence in its training data in order to reproduce them.

“VALL-E 2 is the latest development in speech models using neural codecs and marks a milestone in zero-shot text-to-speech synthesis, achieving human parity for the first time. In addition, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that have traditionally been challenging due to their complexity or repetitive phrases,” the researchers explained.

Removing repetitions

To build the second-generation model, the researchers applied two new techniques that improve speech synthesis enough to reach human levels.

The first is Repetition Aware Sampling, which addresses the decoding problems caused by repeated tokens. A token is a small unit of sound or text, and runs of repeated tokens can trip up AI models in much the same way that alliteration-heavy tongue-twisters make humans stutter.
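
In rough terms, the idea can be sketched in a few lines of Python. The function and parameter names below (window, max_repeats, top_p) are illustrative and do not come from Microsoft’s implementation; the paper describes the mechanism only at a high level.

# Simplified sketch of repetition aware sampling (not Microsoft's code).
# Default to nucleus (top-p) sampling; if the chosen token already repeats
# heavily in the recent decoding history, resample from the full distribution
# to break out of the loop.
import random

def nucleus_sample(probs, top_p=0.8):
    # Sample from the smallest set of top-ranked tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in ranked:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return random.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]

def repetition_aware_sample(probs, history, window=10, max_repeats=3, top_p=0.8):
    # Pick the next codec token; fall back to plain random sampling on heavy repetition.
    token = nucleus_sample(probs, top_p)
    if history[-window:].count(token) >= max_repeats:
        token = random.choices(range(len(probs)), weights=probs, k=1)[0]
    return token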

The second technique, Grouped Code Modeling, also works at the token level: it packs the codec tokens into groups that are processed together, shortening each input sequence VALL-E 2 has to handle and thereby speeding up generation.
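
A minimal sketch of the grouping idea, again with illustrative numbers rather than the paper’s actual configuration, shows why the sequence gets shorter:

# Consecutive codec tokens are packed into fixed-size groups, so the model
# predicts one group per step instead of one token per step.
def group_tokens(tokens, group_size=2):
    return [tuple(tokens[i:i + group_size]) for i in range(0, len(tokens), group_size)]

codes = [101, 102, 103, 104, 105, 106]
print(group_tokens(codes))  # [(101, 102), (103, 104), (105, 106)]
# Six tokens become three decoding steps, roughly halving generation time.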

Not yet ready for the public

Despite the breakthrough, and its potential to help people with aphasia or amyotrophic lateral sclerosis, Microsoft stressed that the new AI tool will not yet be made available to the public, classifying the technology as a research project rather than a product.

The decision stems from ongoing concerns about voice impersonation and spoofing fraud such as vishing (voice phishing), risks that continue to hold back the release of such AI tools.

“Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public,” the researchers explained in an ethics statement at the end of the blog. “Although VALL-E 2 can speak in the speaker’s voice… it may pose potential risks if the model is misused, such as spoofing voice recognition or imitating a specific speaker.”

This is not the first time Microsoft has backed off from launching a technology over security and privacy concerns. The company has already put its AI-powered Recall feature on hold after it faced disputes and privacy concerns among its target customers.

OpenAI has faced similar obstacles, restricting some of its models and developing a deepfake detector to help users distinguish AI-generated images from human-created ones.