Researcher Says 2026 Will Be the Year You’re Fooled by a Deepfake. Voice Cloning Has Crossed the ‘Indistinguishable Threshold’

During 2025, deepfakes advanced dramatically. AI-generated faces, voices, and full-body performances that imitate real people improved in quality far beyond what many experts anticipated just a few years ago. They were also increasingly used to deceive people.

In many everyday situations, particularly low-resolution video calls and media shared on social platforms, their realism is now high enough to reliably fool non-expert viewers. In practical terms, synthetic media has become indistinguishable from authentic recordings for ordinary people and, in some instances, even for institutions.

The escalation is not limited to quality. The quantity of deepfakes has grown exponentially: One cybersecurity firm estimates an increase from approximately 500,000 online deepfakes in 2023 to around 8 million in 2025, with annual growth approaching 900%.

I am a computer scientist who studies deepfakes and other synthetic media. From my vantage point, 2026 will be different: Deepfakes are becoming synthetic performers capable of reacting to people in real time.

Video: Almost anyone can now create a deepfake video. https://www.youtube.com/embed/2DhHxitgzX0?wmode=transparent&start=0

Dramatic improvements

Several technical changes underlie this escalation. First, video realism made a substantial leap thanks to purpose-built video generation models. These models produce videos with coherent motion, consistent identities of the people depicted, and content that makes sense from one frame to the next. They separate the information representing a person’s identity from the information describing motion, so the same motion can be applied to different identities, and the same identity can be driven by many different motions.
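The identity/motion separation can be illustrated with a toy sketch. Everything here is hypothetical and greatly simplified (real models learn these embeddings with neural networks); the point is only that one motion sequence can drive any identity, and vice versa:

```python
# Toy sketch of identity/motion disentanglement.
# All names (render_frame, identity vectors) are illustrative assumptions,
# not any real model's API.
import numpy as np

def render_frame(identity: np.ndarray, motion: np.ndarray) -> np.ndarray:
    # A "frame" here is just identity features displaced by motion features.
    return identity + motion

def generate_video(identity: np.ndarray, motion_sequence: list) -> list:
    # Because identity and motion are separate inputs, the same motion
    # sequence can drive any identity, and any identity can take any motion.
    return [render_frame(identity, m) for m in motion_sequence]

# Two distinct identities, one shared motion sequence.
alice = np.array([1.0, 0.0])
bob = np.array([0.0, 1.0])
walk = [np.array([0.1 * t, 0.0]) for t in range(3)]

video_a = generate_video(alice, walk)  # Alice performing the walk
video_b = generate_video(bob, walk)    # Bob performing the same walk
```

In this toy setup the frame-to-frame change (the motion) is identical for both identities, which is exactly what lets a deepfake transplant one person's motion onto another person's appearance.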

These models generate stable, coherent faces without the flicker, warping, or structural distortions around the eyes and jawline that once served as reliable forensic evidence of deepfakes.
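Those older temporal artifacts were coarse enough that even a naive metric could flag them. The sketch below is a toy illustration of that idea, not a real detector (the function name and score are my own inventions; real forensic systems use learned features over face regions):

```python
# Toy "flicker" metric: mean absolute frame-to-frame pixel change.
# Older deepfakes showed unnaturally large values around the eyes and
# jawline; modern generators no longer exhibit this. Hypothetical sketch.
import numpy as np

def flicker_score(frames: list) -> float:
    # Average the absolute difference between consecutive frames.
    diffs = [np.abs(b - a).mean() for a, b in zip(frames, frames[1:])]
    return float(np.mean(diffs))

# A perfectly stable clip vs. one that alternates brightness each frame.
stable = [np.full((4, 4), 0.5) for _ in range(5)]
flickery = [np.full((4, 4), 0.5 + 0.2 * (t % 2)) for t in range(5)]

print(flicker_score(stable))    # 0.0
print(flicker_score(flickery))  # 0.2
```

The article's point is that this entire class of cue has disappeared: current models score as low on such metrics as genuine footage, which is why pixel-level scrutiny alone no longer works.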

Second, voice cloning has crossed what I would term the “indistinguishable threshold.” A few seconds of audio are now sufficient to generate a convincing clone of a person’s voice – complete with natural intonation, rhythm, emphasis, emotion, pauses, and breathing noise. This ability is already fueling large-scale fraud: Some major retailers report receiving numerous voice-cloning fraud attempts per day. The perceptual cues that once revealed synthetic voices have largely vanished.

Third, consumer tools have pushed the technical barrier almost to zero. Upgrades to video generation tools from OpenAI and Google, along with a wave of startups, mean that anyone can describe an idea, have a large language model such as OpenAI’s ChatGPT or Google’s Gemini draft a script, and turn that script into a finished video. AI agents can automate the entire process. The capacity to generate coherent, storyline-driven deepfakes at scale has effectively been handed to the public.

This combination of surging quantity and personas that are nearly indistinguishable from real humans creates serious risks, especially in a media environment where people’s attention is fragmented and content spreads faster than it can be verified. Deepfakes have already caused real-world harm, spreading before people have the opportunity to realize what is happening.

Video: AI researcher Hany Farid explains how deepfakes work and how good they are becoming. https://www.youtube.com/embed/syNN38cu3Vw?wmode=transparent&start=0

The future is real time

Looking ahead, the trend for next year is clear: Deepfakes are moving toward real-time synthesis that captures the subtleties of a human’s appearance closely enough to evade detection systems. The focus is shifting from static visual realism to temporal and behavioral coherence: models that generate live, interactive video rather than pre-rendered clips.

Identity modeling is converging into unified systems that capture not only how a person looks but also how they behave. The result goes beyond “this resembles person X” to “this behaves like person X over time.” I anticipate entire video-call participants being synthesized in real time; interactive AI-driven actors whose faces, voices, and mannerisms instantly adapt to a prompt; and scammers using responsive avatars instead of fixed videos.

As these capabilities mature, the perceptual gap between synthetic and authentic human media will continue to narrow. The meaningful line of defense will shift away from human judgment toward infrastructure-level protections: secure provenance, such as cryptographically signed media and AI content tools that adopt provenance specifications, along with multimodal forensic tools such as those my lab develops.
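Cryptographic signing is the core of such provenance schemes. The sketch below is a deliberately simplified stand-in: real provenance standards use certificate-based signatures bound to capture devices and editing history, but an HMAC over the media bytes shows the basic sign-and-verify idea (the key and function names are hypothetical):

```python
# Simplified provenance sketch: a keyed signature over media bytes.
# Assumption: an HMAC with a shared secret stands in for the
# certificate-based signatures real provenance standards use.
import hashlib
import hmac

SIGNING_KEY = b"publisher-secret-key"  # hypothetical publisher key

def sign_media(media_bytes: bytes) -> bytes:
    # The publisher attaches this tag when the media is created.
    return hmac.new(SIGNING_KEY, media_bytes, hashlib.sha256).digest()

def verify_media(media_bytes: bytes, tag: bytes) -> bool:
    # Any change to the bytes, however small, invalidates the tag.
    return hmac.compare_digest(sign_media(media_bytes), tag)

original = b"frame data from a genuine recording"
tag = sign_media(original)

print(verify_media(original, tag))         # True: untouched media verifies
print(verify_media(original + b"!", tag))  # False: tampered media fails
```

The design point is that verification no longer depends on a human judging pixels: authenticity becomes a property checked by infrastructure, which is exactly the shift the article describes.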

Mere scrutiny of pixels will no longer be sufficient.

Professor of Computer Science and Engineering; Director, UB Media Forensic Lab

This article is republished from The Conversation under a Creative Commons license.