Microsoft’s VALL-E can faithfully reproduce a voice after listening to a three-second recording

news7f, 01/11/2023

A team of researchers at Microsoft has demonstrated a new AI system capable of mimicking a person’s voice after hearing just a three-second audio recording. The team describes the development of the new app in a paper posted on the arXiv preprint server. They have also posted a website demonstrating the app’s capabilities.

Artificial intelligence applications typically require training on huge amounts of data. But in this new effort, the team at Microsoft has shown that this is not always the case.

The new app is built using Meta’s EnCodec audio compression technology and was originally intended as a way to improve the quality of phone conversations. Subsequent work shows it is capable of much more: not only can it imitate a voice, it can also simulate the tone and even the sound of the environment in which the original recording was made.

Of course, Microsoft hasn’t eliminated the need for a huge dataset; instead, the researchers moved where it is used. The app was taught to “listen” to a string of words and then reproduce how they sounded using Meta’s Libri-light dataset, which has over 60,000 hours of recordings made by 7,000 English speakers.
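To put that shift in perspective, the figures quoted above can be compared with a short calculation. This is only a back-of-the-envelope sketch: it treats "over 60,000 hours" as exactly 60,000 and assumes the recordings are spread evenly across speakers, which the article does not claim. The function names are illustrative and do not come from Microsoft's work.

```python
# Figures quoted in the article. "Over 60,000 hours" is rounded down to
# 60,000 here, so the results are rough lower bounds, not exact values.
TRAINING_HOURS = 60_000
TRAINING_SPEAKERS = 7_000
PROMPT_SECONDS = 3

SECONDS_PER_HOUR = 3600


def average_hours_per_speaker(total_hours, speakers):
    """Average training audio per speaker, assuming an even split (an assumption)."""
    return total_hours / speakers


def training_to_prompt_ratio(total_hours, prompt_seconds):
    """How many times longer the training audio is than the voice prompt."""
    return (total_hours * SECONDS_PER_HOUR) / prompt_seconds


if __name__ == "__main__":
    per_speaker = average_hours_per_speaker(TRAINING_HOURS, TRAINING_SPEAKERS)
    ratio = training_to_prompt_ratio(TRAINING_HOURS, PROMPT_SECONDS)
    print(f"Average training audio per speaker: {per_speaker:.2f} hours")
    print(f"Training audio is {ratio:,.0f} times longer than the prompt")
```

Under these assumptions, the training data averages roughly eight and a half hours per speaker and is about 72 million times longer than the three-second sample needed to imitate a new voice.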

The examples Microsoft provided demonstrate that the system works much better for some voices than others, and that the system has problems with accents. But since the app is still in its early stages, it is likely that its functionality will improve over time.

Microsoft has not made the source code for VALL-E public and likely will not do so, noting that it could be used in irresponsible ways, such as producing hoax recordings of politicians. When combined with deepfake video, the results could take “hearsay” to the next level. Microsoft’s example has shown what is possible; therefore, it seems likely that similar systems from others will emerge soon.

More information:
Chengyi Wang et al., Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers, arXiv (2023). DOI: 10.48550/arxiv.2301.02111

Journal information:
arXiv

Citation: Microsoft’s VALL-E can faithfully reproduce a voice after listening to a three-second recording (2023, 11 January) retrieved 11 January 2023 from https://techxplore.com/news/2023-01-microsoft-vall-e-faithfully-voice.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for informational purposes only.