Microsoft’s VALL-E can faithfully reproduce a voice after listening to a three-second recording


Overview of VALL-E. Unlike the previous pipeline (e.g., phoneme → mel-spectrogram → waveform), VALL-E’s pipeline is phoneme → discrete code → waveform. VALL-E generates discrete audio codec codes based on phoneme and acoustic code prompts, corresponding to the target content and the speaker’s voice. VALL-E directly enables various speech synthesis applications, such as zero-shot TTS, speech editing, and content creation in combination with other generative AI models such as GPT-3 [Brown et al., 2020]. Credit: arXiv (2023). DOI: 10.48550/arxiv.2301.02111

A team of researchers at Microsoft has demonstrated a new AI system capable of mimicking a person’s voice when given only a three-second audio recording of it. The team describes the development of the new app in a paper posted on the arXiv preprint server. They have also posted a website demonstrating the capabilities of the application.

Artificial intelligence applications typically require training on huge amounts of data. In this new effort, the team at Microsoft has shown that this is not always the case.

The new app is built using Meta’s EnCodec audio compression technology, which was originally intended as a way to improve the quality of phone conversations. Subsequent work has shown that it is capable of much more: not only can it imitate a voice, it can also simulate the tone and even the sound of the environment in which the original recording was made.
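To give a rough sense of what a three-second recording looks like to a codec-based model, the sketch below counts the discrete codes such a prompt would turn into. The figures are assumptions taken from the configuration reported in the VALL-E paper (75 code frames per second, 8 codebooks of 1,024 entries each), not from this article, and the class and function names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecConfig:
    """Hypothetical container for codec settings (assumed values, not from the article)."""
    frames_per_second: int = 75   # assumed: code frames emitted per second of audio
    num_codebooks: int = 8        # assumed: discrete codes emitted per frame
    codebook_size: int = 1024     # assumed: number of entries in each codebook


def prompt_code_count(seconds: float, cfg: CodecConfig) -> int:
    """Number of discrete codes that represent `seconds` of audio."""
    frames = round(seconds * cfg.frames_per_second)
    return frames * cfg.num_codebooks


def bitrate_bps(cfg: CodecConfig) -> int:
    """Bits per second implied by the codec settings."""
    # A codebook with 1,024 entries needs 10 bits to index one entry.
    bits_per_code = (cfg.codebook_size - 1).bit_length()
    return cfg.frames_per_second * cfg.num_codebooks * bits_per_code


if __name__ == "__main__":
    cfg = CodecConfig()
    print(prompt_code_count(3.0, cfg))  # 1800
    print(bitrate_bps(cfg))             # 6000
```

Under these assumed settings, a three-second prompt amounts to 1,800 discrete codes, which is the kind of short token sequence a language model can be conditioned on.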

Of course, Microsoft hasn’t eliminated the need for a huge dataset; instead, the researchers moved where it is used. The app is taught to “listen” to a string of words and then reproduce how it sounds, using Meta’s Libri-light dataset, which contains over 60,000 hours of speech recorded by 7,000 English speakers.

The examples Microsoft provided demonstrate that the system works much better for some voices than others, and that the system has problems with accents. But since the app is still in its early stages, it is likely that its functionality will improve over time.

Microsoft has not made the source code for VALL-E public and likely will not do so, noting that it could be used in irresponsible ways, such as producing hoax recordings of politicians. When combined with deepfake video, the results could take “hearsay” to the next level. Microsoft’s example has shown what is possible; it therefore seems likely that similar systems from others will emerge soon.

More information:
Chengyi Wang et al., Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers, arXiv (2023). DOI: 10.48550/arxiv.2301.02111

Journal information:
arXiv


© 2023 Science X Network

Citation: Microsoft’s VALL-E can faithfully reproduce a voice after listening to a three-second recording (2023, 11 January) retrieved 11 January 2023 from https://techxplore.com/news/2023-01-microsoft-vall-e-faithfully-voice.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for informational purposes only.
