Microsoft develops AI that imitates the “exact voice” of humans – but says releasing it is too dangerous
![Microsoft develops AI that imitates the “exact voice” of humans](https://i2-prod.dailystar.co.uk/incoming/article33215841.ece/ALTERNATES/s1200/3_AI-chatbot-Artificial-Intelligence-digital-concept.jpg)
Microsoft’s new AI speech generator VALL-E 2 is so convincing that it reproduces an “exact human voice” from just a few seconds of audio, but is not being released due to the risk of misuse
With VALL-E 2, Microsoft has developed an AI-based speech generator so realistic that the company has no plans to make it publicly available.
This text-to-speech (TTS) system can mimic a human voice using just a few seconds of audio. According to a paper published on June 17 on the preprint server arXiv, VALL-E 2 was able to produce “precise, natural speech in the exact voice of the original speaker, comparable to human performance.”
“VALL-E 2 is the latest advancement in speech models with neural codecs and marks a milestone in zero-shot text-to-speech (TTS) synthesis, achieving human parity for the first time,” the researchers write in the paper.
Human parity in this context means that the speech generated by VALL-E 2 matches or exceeds the quality of human speech in the benchmarks used by Microsoft. The researchers add: “In addition, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that have traditionally been challenging due to their complexity or repetitive phrases.”
The model introduces two main innovations: Repetition Aware Sampling and Grouped Code Modeling. Together they improve the fluency and efficiency of speech generation, the first by taking repetition in the generated output into account and the second by optimizing how input sequences are processed.
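The article does not spell out how either feature works, so the following Python sketch is only one plausible illustration of the two ideas, not Microsoft's implementation. It assumes that repetition-aware sampling means drawing a token with nucleus (top-p) sampling and falling back to sampling from the full distribution when that token already dominates a recent window, and that grouped code modeling means packing consecutive codec codes into fixed-size groups so the model handles a shorter sequence. All function names, parameters, and default values here are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Sample a token index from the smallest set of tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    total = 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(
    probs: Sequence[float],
    history: List[int],
    top_p: float = 0.8,       # assumed value, for illustration only
    window: int = 10,         # assumed value, for illustration only
    threshold: float = 0.5,   # assumed value, for illustration only
    rng: Optional[random.Random] = None,
) -> int:
    """Nucleus-sample a token; if it is over-represented recently, resample freely."""
    rng = rng or random.Random()
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > threshold:
            # Fall back to the full distribution to help break out of a loop.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


def group_codes(codes: Sequence[int], group_size: int) -> List[Tuple[int, ...]]:
    """Pack consecutive codes into groups, shortening the sequence to be modeled."""
    return [tuple(codes[i:i + group_size]) for i in range(0, len(codes), group_size)]
```

In this sketch, a history full of one token makes the sampler abandon the restricted top-p set for that step, which is one simple way to discourage the endless repetition that can affect autoregressive speech models.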
The researchers further explained: “Our experiments conducted on the LibriSpeech and VCTK datasets showed that VALL-E 2 outperforms previous zero-shot TTS systems in terms of speech robustness, naturalness, and speaker similarity,” adding: “It is the first of its kind to reach human levels on these benchmarks.”
In the paper, they also pointed out that the quality of VALL-E 2 output depends on the length and quality of the voice prompts as well as environmental factors such as background noise.
Despite these capabilities, Microsoft will not release VALL-E 2, citing “potential risks of misusing the model, such as spoofing voice recognition or imitating a specific speaker.” Other AI companies, such as OpenAI, have imposed similar restrictions on their speech technology.
“VALL-E 2 is a research project only. We currently have no plans to integrate VALL-E 2 into a product or expand access to the public,” the researchers explained in a blog post.
However, they suggested that the technology could “synthesize speech while preserving the speaker’s identity and use it for educational learning, entertainment, journalistic purposes, self-authored content, accessibility features, interactive voice response systems, translations, chatbots, etc.”
They continued: “If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker consents to the use of his or her voice, as well as a model for recognizing synthetic speech.”