Maybe you’ve played around with chatbots like ChatGPT and Bard, or image generators like Dall-E. If you thought they blurred the line between AI and human intelligence, you haven’t seen – or heard – anything yet.
For the past few months, Wall Street Journal columnist Joanna Stern has been testing Synthesia, a tool that creates AI avatars from recorded video and audio (also known as deepfakes). Type anything and your video avatar says it.
Stern does a lot of voice and video work, so she thought AI could make her more productive and eliminate some of the hard work, as the technology promises to do. She recorded about 30 minutes of video and nearly two hours of audio that Synthesia would use to train the clone. A few weeks later, AI Joanna was ready.
She wondered whether AI, paired with ChatGPT-generated text, could replace her real self in videos, meetings, and phone calls. Eventually, AI Joanna might write columns and host her videos. For now, she’s at her best illustrating the double-edged sword of generative AI voice and video tools.
Video takes a lot of work and money: hair, makeup, wardrobe, cameras, lighting, microphones. Synthesia promises to eliminate that work, which is why corporations are already using it.
Why pay actors to star in a live-action version when AI can do it all? Synthesia charges $1,000 a year to create and maintain a custom avatar, plus an additional monthly subscription fee. It also offers stock avatars for a lower monthly cost.
Stern asked ChatGPT to generate a TikTok script about an iOS tip, written in the voice of Joanna Stern. She entered the script into Synthesia, clicked “generate” and suddenly “she” was talking, but without hand gestures or facial expressions. For short sentences, the avatar can be quite convincing. The longer the text, the more its bot nature shows.
On TikTok, the avatar’s computer-generated quality was less noticeable. Still, some viewers quickly caught on.
The bot was far more obvious on work video calls. Stern downloaded clips of the avatar making common meeting remarks (“Hey everyone!” “Sorry I got muted.”), then used software to feed them into Google Meet. AI Joanna’s perfect posture and lack of humor were clear giveaways.
It will all get better, though. Synthesia has a few avatars in beta that can nod up and down, raise their eyebrows, and more.
For phone calls, the columnist used a voice generated by ElevenLabs, a developer of AI speech software. She gathered about 90 minutes of her voice from previous videos and uploaded it to the tool, with no studio visit required. In less than two minutes, the tool had cloned her voice. In ElevenLabs’ web-based tool, you type any text, click Generate, and within seconds “your” voice says it out loud. Creating a voice clone with ElevenLabs starts at $5 per month.
Compared with Synthesia’s Joanna, the ElevenLabs voice sounds more human, with better intonation and flow.
When Stern called her sister, with whom she speaks several times a week, her sister said the bot sounded like Stern but noted that it never paused to take a breath. When she called her dad and asked for his Social Security number, he only knew something was up because it sounded like a recording.
The ElevenLabs voice was so good that it fooled the voice biometric system on her credit card.
Stern prepared AI Joanna to say several things the system would ask for, then called customer service. At the biometric step, when the automated system asked for her name and address, AI Joanna responded. Upon hearing the bot’s voice, the system recognized it as her and immediately connected her to a representative. When the newspaper’s video intern called and did his best impression of Stern, the automated system requested additional verification.
A spokeswoman for the card provider said the bank uses voice biometrics, along with other tools, to verify that callers are who they say they are.
She added that the feature is intended to let customers identify themselves quickly and securely, but that customers must provide additional information to complete transactions and other financial requests.
More worryingly, ElevenLabs made a very good clone with few safeguards in the way. All a user has to do is click a button saying they have the “necessary rights or consents” to upload audio files and create the clone, and that they will not use it for fraudulent purposes.
This means that anyone on the internet could gather hours of another person’s voice and clone it for their own use. The Federal Trade Commission (FTC) is already warning about AI voice-related scams.
In Synthesia’s case, the company requires that audio and video include verbal consent.
ElevenLabs only allows cloning on paid accounts, so any use of a cloned voice that violates company policies can be traced back to the account holder, said company co-founder Mati Staniszewski.
The company is working on an authentication tool so that people can send any audio to verify that it was created with ElevenLabs technology.
Both systems allowed the columnist to generate some horrible things with her voice, including death threats.
A spokesperson for Synthesia said that Stern’s account was designated for use by a news organization, meaning her avatar could say words and phrases that might otherwise be filtered out.
The company said that its moderators flagged and deleted her problematic phrases afterward. When her account was switched to the standard type, she was no longer able to generate those same phrases.
Staniszewski said ElevenLabs can identify all content made with its software. If the content violates the company’s terms of service, he added, ElevenLabs can ban its source account and, in case of violation of the law, assist the authorities.
Hany Farid, a digital forensics expert at the University of California, Berkeley, said it was very difficult to detect synthetic audio and video. “Not only can I generate these things, I can bombard the internet with them,” he said, adding that you can’t turn everyone into an AI detective.
However, there is the Content Authenticity Initiative, led by Adobe. Its more than 1,000 members, including media and technology companies, academics, and others, aim to create an embedded “nutrition label” for media. Photos, videos, and audio on the internet may one day come with verifiable information attached. Synthesia is a member of the initiative.