What a privilege it is to be among the last of the wholly human.
I feel sure that, in some touchable future, the artists formerly known as humans will be moving melanges of flesh and chips.
Perhaps, then, I shouldn't be surprised when Microsoft researchers arrive to hasten that unnerving future.
It all seems very innocent, very scientific. The title of the researchers' paper is creatively opaque: "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers."
What do you imagine this might mean? A new, faster way for a machine to transcribe your words?
The researchers' abstract begins benignly enough. It uses lots of words, phrases, and acronyms unfamiliar to many human language models. It explains that the neural codec language model in question is called VALL-E.
Surely a name to soften your heart. What could be so terrifying about a technology that sounds almost like the cute little robot from a heartwarming movie?
Well, perhaps this: "VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt."
I've often wished I could prompt my own capacity for learning. Instead, I have to wait for it to emerge.
And what emerged from that last sentence of the researchers' abstract was shivers. Microsoft's big brain now needs only 3 seconds of you saying something in order to fake longer sentences, and perhaps whole speeches, that weren't spoken by you but sound very much like you.
I won't delve too deeply into the science, as neither of us would benefit from that.
I'll merely mention that VALL-E draws on a sound library assembled by one of the world's most trusted and admired companies: Meta. Called LibriLight, it's a repository of 7,000 people talking for a total of 60,000 hours.
Naturally, I've listened to VALL-E's work.
I listened to a man speaking for 3 seconds. Then I listened to 8 seconds of VALL-E's version of him intoning: "They moved thereafter cautiously about the hut, groping before and about them to find something to show that Warrenton had fulfilled his mission."
I defy you to spot many, if any, differences.
It's true that many of the prompts sound like stilted excerpts from 18th-century literature. Sample: "This humane and righteous father thus comforted his unfortunate daughter, and her mother, embracing her again, did all she could to ease her feelings."
But what could I do other than listen to more of the examples the researchers presented? Some of VALL-E's versions are more suspect than others. The diction doesn't feel quite right. It feels stitched together.
Still, the overall effect is genuinely frightening.
You've been warned, of course. You know that when scammers call you, you shouldn't speak to them, in case they record you and then recreate your speech patterns to make your disembodied voice order unconscionably expensive products.
This, though, seems another level of sophistication. Perhaps I've watched too many episodes of Peacock's "The Capture," in which deepfakes are presented as a natural tool of government. Perhaps I really shouldn't worry, because today's Microsoft is a good, inoffensive company.
Still, the idea that anyone could easily be fooled into believing I said something I didn't, and never would, doesn't leave me comfortable. Especially when the researchers claim they can also reproduce the "emotion and acoustic environment" of those first 3 seconds of speech.
You'll be relieved, then, that the researchers are alive to this distressing possibility. They offer: "Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker."
The solution? Building a detection system, say the researchers.
Which may leave one or two people wondering: "So why are you doing this at all?"
Quite often in technology, the answer is: “Because we can.”