Microsoft develops AI that imitates the “exact voice” of humans – but says releasing it is too dangerous

Microsoft’s new AI speech generator VALL-E 2 is so convincing that it reproduces an “exact human voice” from just a few seconds of audio, but is not being released due to the risk of misuse

VALL-E 2 matched or exceeded human speech quality in benchmarks (Getty Images)

With VALL-E 2, Microsoft has developed a groundbreaking speech generator based on artificial intelligence (AI) that is so realistic that it will never be made publicly available.

This text-to-speech (TTS) wonder can mimic a human voice using just a few seconds of audio. According to a paper published on June 17 on the preprint server arXiv, VALL-E 2 was able to produce “precise, natural speech in the exact voice of the original speaker, comparable to human performance.”

“VALL-E 2 is the latest advancement in speech models with neural codecs and marks a milestone in zero-shot text-to-speech (TTS) synthesis, achieving human parity for the first time,” the researchers write in the paper.

VALL-E 2 is purely a research project, with no plans to integrate it into a product (Getty Images)

Human parity in this context means that the speech generated by VALL-E 2 matches or exceeds the quality of human speech in the benchmarks used by Microsoft. “In addition, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that have traditionally been challenging due to their complexity or repetitive phrases,” the researchers add.

The model introduces two main innovations: repetition aware sampling and grouped code modeling. The first takes repeated tokens in the output generated so far into account, which makes speech generation more stable. The second organizes the audio codes into groups, which shortens the sequences the model has to process and makes generation more efficient.
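As a rough illustration of the two features named above, the following Python sketch shows one plausible reading of them. It assumes that repetition aware sampling starts from nucleus (top-p) sampling and falls back to sampling from the full distribution when the chosen token dominates a recent window of output, and that grouped code modeling splits the sequence of audio codes into fixed-size groups. The function names, window size, thresholds and toy values are illustrative assumptions, not Microsoft's code or settings.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample an index from the smallest set of tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(probs, history, rng, top_p=0.8, window=10, max_ratio=0.5):
    """Nucleus sampling with a fallback when the pick repeats too often.

    window and max_ratio are hypothetical values chosen for this sketch.
    """
    token = nucleus_sample(probs, top_p, rng)
    recent = list(history)[-window:]
    # If the chosen token already dominates the recent output, resample
    # from the full distribution so that decoding can leave the loop.
    if recent and recent.count(token) / len(recent) > max_ratio:
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


def group_codes(codes, group_size):
    """Split a code sequence into consecutive groups, shortening the sequence."""
    return [codes[i:i + group_size] for i in range(0, len(codes), group_size)]
```

For example, `group_codes([1, 2, 3, 4, 5, 6], 2)` yields three groups of two codes each, so a model working on groups would take three steps over this sequence instead of six.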

Microsoft’s groundbreaking artificial intelligence (AI) speech generator, VALL-E 2, will not be made available to the public due to risks of misuse (Getty Images)

The researchers further explained: “Our experiments conducted on the LibriSpeech and VCTK datasets showed that VALL-E 2 outperforms previous zero-shot TTS systems in terms of speech robustness, naturalness, and speaker similarity,” adding: “It is the first of its kind to reach human levels on these benchmarks.”

In the paper, they also pointed out that the quality of VALL-E 2's output depends on the length and quality of the voice prompts, as well as on environmental factors such as background noise.