“The biggest loss caused by AI will be the complete destruction of trust in anything you see or hear,” says an article from Wired, in what is probably one of the most profound sentences I’ve ever read in a mainstream article.
Nearly every kind of entity that controls our world (corporations, academics and governments) is already “working furiously” to analyze and understand the human voice well enough to replicate it. Just as innumerable power struggles over particular technologies have occurred in the past century or so, a race to decode the human voice now appears to be underway.
Imagine a world where the sound of a person’s voice is no longer solid proof that the person is speaking: where you could hear a family member speak from another part of the house and think they are there when they are not. A robot is copying their voice, and they are somewhere else.
Countries such as the US, China and Estonia have moved into this territory, and companies nearing the power of small countries, such as Facebook, Google, Apple and Amazon, are trying to perfectly mimic individuals’ voices.
It is already not very difficult to create an artificial voice and have it absorb and reproduce words and phrases, as smartphone assistants like Siri currently do. According to Wired:
“Making a natural-sounding voice involves algorithms that are far more complex and computationally expensive. But that technology is available now.
As any speech pathologist will attest, the human voice is far more than vocal-chord vibrations. These vibrations are caused by air escaping our lungs and forcing open our vocal folds, a process that produces tones as unique as a fingerprint because of the thousands of waveforms that are conjured simultaneously and in chorus. But a voice’s uniqueness is also tied to qualities we rarely consider, such as intonation, inflection and pacing.”
The factors listed above, and more, contribute to the mosaic that is an individual’s voice.
Basically, if a government or institution has the money, it can pay researchers to pursue the arduous task of cataloguing every factor that makes a human voice what it is (inflection, pacing, intonation and so on), and then develop technology that detects each of those qualities in a person’s voice and replicates them.
One piece of software being pioneered by Adobe, Project Voco, is being called the “Photoshop of soundwaves”.
It works by substituting waveforms for pixels, essentially building a bridge between voice recordings so that the imitated human voice sounds natural.
Adobe believes that if enough of one person’s speech can be recorded (or mined through surveillance, of course), artificial speech could simply be cut and pasted into a recording.
Adobe’s early results with the software are said to sound eerie.
Continuing from the Wired article, which is surprisingly accurate:
“By 2018, a nefarious actor may easily be able to create a good enough vocal impersonation to trick, confuse, enrage or mobilise the public. Most citizens around the world will be simply unable to discern the difference between a fake Trump or Putin soundbite and the real thing.
When you consider the widespread distrust of the media, institutions and expert gatekeepers, audio fakery could be more than disruptive. It could start wars. Imagine the consequences of manufactured audio of a world leader making bellicose remarks, supported by doctored video. In 2018, will citizens – or military generals – be able to determine that it’s fake?”