AI-generated art is appearing everywhere, but that’s only the beginning. Microsoft recently unveiled a new artificial intelligence tool called VALL-E, which is similar to DALL-E but for voices. After listening to just three seconds of audio, VALL-E can imitate the speaker’s voice.
If that sounds terrifying, that’s because it is, and that’s not all. According to AITopics, Microsoft’s new tool can also match a speaker’s emotion and tone, something many voice AI tools struggle with. The team trained VALL-E on roughly 60,000 hours of English speech data. It demonstrated in-context learning abilities and could even reproduce, in the target voice, words it had never heard that speaker say.
The report says that VALL-E is capable of prompt-based text-to-speech (TTS), follows context, and doesn’t need pre-designed acoustic features or any additional structure engineering to deliver a high-quality audio sample. In short, the tool needs only about three seconds of a voice to imitate it.
There are several audio examples from the tool on GitHub. Some sound great, while others have a robotic tone and aren’t all that impressive. But when it works, it works very well. These are still early days for VALL-E, and the results will likely improve over time. Plus, if the team used larger samples, the output would likely be more accurate.
It’s important to note that VALL-E isn’t available to the public, at least not yet, so we can all let out a sigh of relief. If it is ever released, it’ll undoubtedly raise a slew of security, social, and ethical concerns, to say the least. While this technology certainly sounds impressive, it’s also pretty wild.
via Windows Central