Deepfakes
- Manyanshi Joshi

Deepfakes are synthetic media—usually videos, images, or audio—created using artificial intelligence to make it look or sound like someone is saying or doing something they never actually did.
How they work
Deepfakes rely on techniques from Machine Learning, especially Neural Networks. A common approach uses Generative Adversarial Networks, where:
One AI generates fake content
Another AI tries to detect if it’s fake
They improve together until the result looks realistic
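The adversarial loop above can be sketched in a few lines of pure Python. This is a deliberately tiny one-dimensional caricature, not a real GAN: here the "discriminator" merely tracks what real data looks like and the "generator" nudges its output to fool it, whereas real systems train two neural networks against each other with gradient descent.

```python
import random

random.seed(0)

REAL_MEAN = 1.0   # centre of the "real data" distribution
g = -2.0          # the generator's current output value
d = 0.0           # the discriminator's current estimate of "real"

for step in range(200):
    real = random.gauss(REAL_MEAN, 0.1)   # a real sample
    # Discriminator: refine its notion of what real data looks like.
    d += 0.05 * (real - d)
    # Generator: shift its output toward whatever the discriminator
    # currently accepts as real, i.e. try to fool it.
    g += 0.05 * (d - g)

print(round(g, 2))   # ends up near REAL_MEAN
```

The feedback structure is the point: neither side is told the answer directly, yet the generator's output converges on realistic territory.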
Common uses
Not all deepfakes are harmful. They’re used in:
Movies and TV (de-aging actors, visual effects)
Video games and virtual avatars
Education and historical recreations
But they’re also used for:
Misinformation and fake news
Scams and impersonation (voice cloning, fake calls)
Non-consensual content
Why they matter
Deepfakes blur the line between real and fake media. This creates risks like:
Damaging reputations
Influencing elections or public opinion
Making it harder to trust video/audio evidence
How to spot a deepfake (basic clues)
Unnatural blinking or facial movements
Mismatched lighting or shadows
Blurry edges around the face
Audio that doesn’t perfectly match lip movement
Here’s a high-level walkthrough of how deepfakes are typically created. This stays at a conceptual level and avoids operational misuse details.
How Deepfakes Are Made — Step by Step
1. Collect training data
Creators gather lots of images or video clips of:
The target person (whose face or voice will appear)
The source actor (the person performing the actions)
The AI needs many angles, lighting conditions, and facial expressions to learn realistic patterns.
Examples:
Interviews
Social media videos
Photos
Podcasts (for voice cloning)
2. Extract and align faces
Software detects faces frame-by-frame in videos.
The system:
Finds eyes, nose, mouth
Rotates and crops the face
Standardizes size and position
This creates a clean dataset the AI can learn from.
This step uses concepts from Computer Vision.
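As a toy illustration of the alignment step, the rotation needed to level the eyes can be computed from two landmark positions. The pixel coordinates below are made up for the example:

```python
import math

# Hypothetical landmark positions (in pixels) for one detected face.
left_eye = (110, 140)
right_eye = (190, 120)

# Angle of the eye line relative to horizontal; rotating the crop by
# the negative of this angle makes the eyes level, which standardises
# every face in the training dataset.
dx = right_eye[0] - left_eye[0]
dy = right_eye[1] - left_eye[1]
angle_deg = math.degrees(math.atan2(dy, dx))

# Midpoint between the eyes, used as the rotation origin.
center = ((left_eye[0] + right_eye[0]) / 2,
          (left_eye[1] + right_eye[1]) / 2)

print(round(angle_deg, 1), center)  # → -14.0 (150.0, 130.0)
```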
3. Train the AI model
The core model learns how a face behaves.
Common architectures include:
Autoencoder
Generative Adversarial Networks
Diffusion-based image generators
The AI studies:
Facial structure
Expressions
Lighting
Skin texture
Movement patterns
Conceptually, the system tries to learn a transformation:
f(source face)→target facef(\text{source face}) \rightarrow \text{target face}f(source face)→target face
Training can take hours or days depending on:
Dataset size
Video quality
Hardware power
4. Generate swapped frames
Once trained, the AI processes each video frame:
Detects the source face
Predicts a transformed version
Generates a synthetic face
The output tries to preserve:
Expression
Head pose
Eye direction
Emotion
while changing identity.
5. Blend the fake face into the video
The generated face is composited onto the original frame.
Extra processing helps realism:
Color matching
Edge smoothing
Motion stabilization
Lighting correction
Without this step, deepfakes often look obviously fake.
6. Synchronize audio (optional)
For speaking videos:
Voice cloning models imitate tone and speech patterns
Lip-sync models align mouth movements to audio
This uses techniques from:
Speech Processing
Text-to-Speech
7. Render the final video
The processed frames are recombined into a finished video file.
Higher realism usually requires:
High-resolution generation
Frame consistency
Temporal smoothing across frames
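Temporal smoothing can be illustrated with a simple exponential moving average over toy "frames" (real pipelines use far more sophisticated methods, but the idea of blending each frame with its history is the same):

```python
def temporal_smooth(frames, alpha=0.7):
    """Exponential moving average across frames: each output frame blends
    the current frame with the smoothed history, which damps
    frame-to-frame flicker at the cost of slight motion lag."""
    smoothed = [frames[0]]
    for frame in frames[1:]:
        prev = smoothed[-1]
        smoothed.append([alpha * p + (1 - alpha) * q
                         for p, q in zip(frame, prev)])
    return smoothed

# Toy frames: each frame is a list of pixel intensities.
# The second pixel flickers between 100 and 140.
frames = [[100, 100], [100, 140], [100, 100], [100, 140]]
out = temporal_smooth(frames)
print([round(f[1]) for f in out])  # → [100, 128, 108, 131]
```

The flicker amplitude shrinks from 40 to roughly 23, which is exactly the kind of stabilisation that hides the last traces of per-frame generation.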
Why Older Deepfakes Looked Bad
Early deepfakes often had:
Flickering faces
Strange blinking
Warped teeth
Inconsistent lighting
Modern AI models are much better because:
Training datasets are larger
GPUs are faster
Diffusion models improved realism
How Detection Works
Researchers look for:
Biological inconsistencies
Compression artifacts
Unrealistic eye reflections
Frame-to-frame anomalies
Some detectors analyze tiny frequency patterns invisible to humans.
Ethical and legal issues
Many countries now regulate malicious deepfakes involving:
Fraud
Election misinformation
Non-consensual explicit content
Identity impersonation
Legitimate film/VFX use is generally treated differently from deceptive use.
Professionals detect deepfakes by combining human review, AI analysis, forensic techniques, and metadata investigation. No single method is perfect, so experts usually layer multiple checks together.
1. Visual forensic analysis
Investigators examine frames for inconsistencies humans often miss.
Common clues:
Uneven lighting on the face
Blurry boundaries around hair or jawline
Warped glasses, earrings, or teeth
Inconsistent reflections in eyes
Skin texture changing between frames
They also check whether facial movement follows natural biomechanics.
Example concept:
Frame Consistency(t) ≈ Frame Consistency(t + 1)
Real videos usually maintain stable patterns across adjacent frames.
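A minimal version of this adjacent-frame check, using made-up pixel lists as frames:

```python
def frame_diff(a, b):
    """Mean absolute per-pixel difference between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def flag_anomalies(frames, threshold=30):
    """Flag frame indices where the frame-to-frame change jumps past the
    threshold — a crude adjacent-frame consistency check. Note that both
    entering and leaving a glitched frame get flagged."""
    return [i for i in range(1, len(frames))
            if frame_diff(frames[i - 1], frames[i]) > threshold]

# Toy frames: a stable video with one glitched frame at index 2.
frames = [[10, 10, 10], [12, 11, 10], [90, 95, 80], [13, 12, 11]]
print(flag_anomalies(frames))  # → [2, 3]
```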
2. Temporal analysis (motion over time)
A fake frame may look convincing alone, but motion reveals problems.
Detection systems analyze:
Eye blinking frequency
Lip synchronization
Head movement continuity
Micro-expressions
Natural muscle motion
Older deepfakes often failed here because each frame was generated too independently.
This area uses techniques from Signal Processing and Optical Flow.
3. AI-based detectors
Modern detectors train AI against other AI.
A detector learns statistical fingerprints left by generators:
Pixel distribution anomalies
Frequency-domain artifacts
Unrealistic texture synthesis
Compression mismatches
Conceptually:
P(fake | x) > P(real | x)
where the model estimates whether media is likely fake.
Researchers often use architectures from:
Convolutional Neural Network
Transformer
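The decision rule above can be sketched with a toy one-feature Bayes classifier. The "artifact score" feature and its two Gaussian class distributions are invented for illustration; real detectors learn far richer features from pixels:

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return (math.exp(-((x - mean) ** 2) / (2 * std ** 2))
            / (std * math.sqrt(2 * math.pi)))

def p_fake_given_x(x, prior_fake=0.5):
    """Bayes' rule over two assumed class-conditional densities:
    fake media scores high on the artifact feature, real media low."""
    pf = gaussian_pdf(x, mean=0.8, std=0.15) * prior_fake
    pr = gaussian_pdf(x, mean=0.2, std=0.15) * (1 - prior_fake)
    return pf / (pf + pr)

print(round(p_fake_given_x(0.75), 3))  # high artifact score → likely fake
print(round(p_fake_given_x(0.25), 3))  # low artifact score → likely real
```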
4. Frequency-domain analysis
Humans mostly notice spatial patterns, but detectors inspect hidden mathematical patterns.
Using transforms like:
F(ω) = ∫ f(t) e^(−iωt) dt   (integrating over all time t)
experts analyze frequency signatures produced during AI generation.
AI-generated media can leave:
Repeating noise structures
Unnatural high-frequency details
Generator-specific fingerprints
This comes from the Fourier Transform.
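A naive discrete Fourier transform is enough to show the idea: injecting a high-frequency pattern into a smooth signal leaves obvious extra energy in the upper frequency bins.

```python
import math

def dft_magnitudes(signal):
    """Naive discrete Fourier transform, returning the magnitude of each
    frequency bin (O(n^2); real tools use the FFT, but the maths is
    identical)."""
    n = len(signal)
    mags = []
    for k in range(n):
        re = sum(x * math.cos(-2 * math.pi * k * i / n)
                 for i, x in enumerate(signal))
        im = sum(x * math.sin(-2 * math.pi * k * i / n)
                 for i, x in enumerate(signal))
        mags.append(math.hypot(re, im))
    return mags

n = 32
# A smooth signal vs. one with an injected high-frequency pattern of the
# kind a generator can leave behind.
smooth = [math.sin(2 * math.pi * i / n) for i in range(n)]
spiky = [s + 0.5 * math.sin(2 * math.pi * 8 * i / n)
         for i, s in enumerate(smooth)]

def hf_energy(sig):
    """Total magnitude in the higher-frequency bins (4 .. n/2)."""
    return sum(dft_magnitudes(sig)[4:n // 2])

print(hf_energy(smooth), hf_energy(spiky))
```

The smooth signal has essentially zero high-frequency energy, while the doctored one carries a large spike at bin 8, a simplified version of the "frequency fingerprints" detectors hunt for.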
5. Metadata and provenance checks
Professionals inspect:
File creation history
Editing software traces
Camera metadata
Compression history
Upload timestamps
A “phone video” that lacks expected smartphone metadata can raise suspicion.
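A trivial sketch of such a check (the field names below are illustrative, not any real container format):

```python
# Fields a genuine smartphone video would typically carry.
# These key names are made up for the example.
EXPECTED_KEYS = {"device_model", "creation_time", "gps", "encoder"}

def metadata_red_flags(metadata):
    """Return which expected fields are missing. Absence is not proof of
    fakery — metadata can be stripped legitimately — but it raises
    suspicion and prompts deeper forensic checks."""
    return sorted(EXPECTED_KEYS - set(metadata))

suspect = {"creation_time": "2024-05-01T10:00:00Z", "encoder": "ffmpeg"}
print(metadata_red_flags(suspect))  # → ['device_model', 'gps']
```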
6. Source verification
Journalists and investigators often verify:
Original upload source
Reverse image/video search
Whether the scene existed before
Geolocation and weather consistency
This is common in Bellingcat-style investigations.
7. Biological signal analysis
Advanced systems analyze subtle human physiological signals:
Blood-flow color changes in skin
Heart-rate patterns from tiny facial color variations
Natural breathing rhythms
AI generators often fail to reproduce these perfectly.
8. Watermarks and cryptographic signatures
Some companies embed invisible authenticity markers into real media.
Efforts include:
Content provenance systems
Cryptographic signing
Camera authenticity standards
Organizations like the Coalition for Content Provenance and Authenticity (C2PA) work on this.
Why detection is difficult
Detection is an arms race.
As generators improve:
Artifacts disappear
Motion becomes smoother
Voice synthesis improves
Lighting realism increases
So detectors constantly retrain against newer models.
Important reality
Even professionals sometimes cannot conclusively prove a sophisticated deepfake from visual inspection alone. That’s why investigators increasingly rely on:
provenance,
trusted capture systems,
and chain-of-custody evidence,
not just image analysis.
Both GANs and diffusion models generate realistic AI images, videos, or audio—but they do it in very different ways.
Core idea
GANs: “Generator vs Detective”
A Generative Adversarial Network (GAN) uses two neural networks competing against each other:
Generator → creates fake images
Discriminator → tries to detect fakes
The generator improves by trying to fool the discriminator.
Conceptually:
min_G max_D V(D, G) = E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))]
Think of it like:
a counterfeiter vs
a detective.
Over time, both become highly skilled.
Diffusion models: “Destroy then rebuild”
A Diffusion Model works differently.
The model:
Gradually adds noise to real images
Learns how to reverse that noise process
Generates new images by turning random noise into coherent pictures
Conceptually:
x_t = √(1 − β_t) · x_{t−1} + √(β_t) · ε
Then the model learns the reverse process:
p_θ(x_{t−1} | x_t)
A diffusion model is more like:
repeatedly refining static
until an image appears.
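The forward "destroy" half can be simulated directly. This toy version noises a single pixel value; with a variance-preserving schedule, the original signal decays toward 0 while the overall variance settles near 1 (pure noise):

```python
import math
import random

random.seed(0)

def forward_noise(x0, betas):
    """Forward diffusion on a single value: each step mixes in Gaussian
    noise via x_t = sqrt(1 - b) * x_{t-1} + sqrt(b) * eps."""
    x = x0
    for b in betas:
        eps = random.gauss(0.0, 1.0)
        x = math.sqrt(1 - b) * x + math.sqrt(b) * eps
    return x

betas = [0.05] * 200   # a simple fixed noise schedule
samples = [forward_noise(1.0, betas) for _ in range(2000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
# After enough steps the signal is destroyed: mean ≈ 0, variance ≈ 1.
print(round(mean, 2), round(var, 2))
```

The model's training job is then to learn the reverse of exactly this process.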
Visual intuition
GAN
Random noise → generator → fake image
Discriminator says:
“looks fake”
“looks real”
Generator improves from feedback.
Diffusion
Random noise → slightly cleaner → clearer → detailed image → final result
The image emerges progressively.
Main differences
| Feature | GANs | Diffusion Models |
| --- | --- | --- |
| Generation style | Competitive game | Gradual denoising |
| Speed | Usually faster | Usually slower |
| Stability during training | Harder | More stable |
| Image quality | Sharp but sometimes flawed | Extremely realistic |
| Diversity | Can collapse to similar outputs | Better diversity |
| Modern popularity | Declining somewhat | Dominant today |
Why GANs were revolutionary
GANs created the first truly convincing:
fake faces,
face swaps,
synthetic humans,
early deepfakes.
Projects like StyleGAN became famous for ultra-realistic fake faces.
Why diffusion models became dominant
Diffusion models power many modern systems because they:
produce more consistent images,
handle text prompts better,
scale effectively.
Examples include:
Stable Diffusion
DALL·E
Midjourney
These models are especially strong at:
photorealism,
artistic generation,
complex compositions.
In deepfakes specifically
GAN-based deepfakes
Older systems often:
swapped faces directly,
generated individual frames,
struggled with temporal consistency.
Diffusion-based deepfakes
Newer systems:
generate smoother details,
preserve lighting better,
create more realistic skin and motion,
improve frame coherence.
This is one reason modern AI video is advancing rapidly.
Weaknesses
GAN weaknesses
Training instability
“Mode collapse” (repeating similar outputs)
Hard balancing between networks
Example idea:
G(z₁) ≈ G(z₂) ≈ G(z₃)
where many inputs generate nearly identical outputs.
Diffusion weaknesses
Computationally expensive
Slower generation
Requires many denoising steps
Though newer techniques are speeding this up.
Simple analogy
GAN
A student artist competes against an art critic.
Diffusion
A sculptor slowly removes noise from a block of static until a picture appears.
Diffusion models are connected to thermodynamics because they mathematically resemble physical diffusion processes—the same kinds of processes that describe:
heat spreading,
smoke dispersing,
molecules moving randomly in fluids.
Modern AI diffusion models borrow equations and ideas from Statistical Mechanics and stochastic thermodynamics.
The core intuition
In physics:
A highly ordered system naturally becomes more disordered over time.
Examples:
Ice melts
Perfume spreads through air
Heat equalizes in a room
This trend toward disorder is related to:
Entropy
Diffusion models imitate this process
Forward process: adding noise
A diffusion model gradually destroys an image by adding random noise step-by-step.
Eventually:
Image → static noise
Mathematically:
x_t = √(1 − β_t) · x_{t−1} + √(β_t) · ε
where:
x_t = noisy image at step t
β_t = noise amount at step t
ε = random Gaussian noise
This resembles physical diffusion:
particles spreading randomly,
information becoming disordered.
Thermodynamics connection
In thermodynamics, systems evolve toward maximum entropy.
Diffusion models deliberately push images toward a high-entropy state:
pure randomness.
Conceptually:
S = −k_B Σ_i p_i ln p_i
This is the famous entropy equation from Ludwig Boltzmann.
As noise increases:
structure disappears,
entropy rises.
The reverse process is the magic
Physics says:
diffusion naturally increases disorder,
reversing it exactly is extremely difficult.
But diffusion models learn an approximate reverse process.
They learn:
Noise → slightly less noisy → recognizable structure → image
Mathematically:
p_θ(x_{t−1} | x_t)
The AI estimates:
“Given this noisy image, what cleaner image likely came before it?”
Why this resembles statistical physics
The model treats image generation probabilistically.
Instead of storing one exact image, it learns:
probability distributions,
transitions between states,
stochastic trajectories.
This closely mirrors:
Brownian motion,
particle diffusion,
stochastic differential equations.
Brownian motion connection
The forward noising process resembles:
Brownian Motion
where particles randomly drift over time.
The mathematical framework often uses:
dx=f(x,t)dt+g(t)dWtdx = f(x,t)dt + g(t)dW_tdx=f(x,t)dt+g(t)dWt
This is a stochastic differential equation (SDE):
deterministic drift term,
random noise term.
These equations are common in:
thermodynamics,
quantum mechanics,
financial mathematics.
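Such an SDE can be simulated with the standard Euler–Maruyama scheme. The drift and noise terms below define a toy mean-reverting (Ornstein–Uhlenbeck-style) process, chosen purely for illustration:

```python
import random

random.seed(0)

def euler_maruyama(x0, drift, diffusion, dt=0.01, steps=1000):
    """Simulate dx = f(x,t)dt + g(t)dW_t with the Euler-Maruyama scheme:
    each step adds deterministic drift plus a scaled Brownian increment."""
    x, t = x0, 0.0
    for _ in range(steps):
        dW = random.gauss(0.0, dt ** 0.5)   # Brownian increment
        x += drift(x, t) * dt + diffusion(t) * dW
        t += dt
    return x

# Drift pulls x back toward 0 while noise keeps kicking it around —
# a standard toy model of a diffusing particle.
end = euler_maruyama(x0=5.0,
                     drift=lambda x, t: -1.0 * x,
                     diffusion=lambda t: 0.3)
print(round(end, 2))  # far from the start of 5.0, jittering near 0
```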
Why reversing diffusion works at all
Real thermodynamic systems lose information.
But diffusion models train on massive datasets and learn statistical structure:
faces,
textures,
lighting,
object relationships.
So during denoising, the model reconstructs likely structures—not the original exact image.
That’s why generated images are new creations rather than recovered originals.
Energy landscape intuition
Another physics analogy:
Imagine a landscape of possible images.
Random noise sits in chaotic high-energy regions.
Realistic images occupy stable low-energy regions.
The model learns how to “flow downhill” toward realistic states.
This idea relates to:
energy-based models,
free energy minimization,
equilibrium systems.
Why physicists became interested
Many researchers noticed that diffusion models:
behave like nonequilibrium thermodynamic systems,
can be analyzed with statistical mechanics tools,
resemble physical reversibility problems.
Some papers directly connect them to:
the Fokker–Planck Equation,
Langevin dynamics,
entropy production.
Simple analogy
Imagine:
You repeatedly smear ink across a painting until it becomes gray static.
Then an AI learns how to reverse the smearing process gradually.
That reversal process is mathematically related to reversing diffusion in physics.
The surprising part
Diffusion models work because:
the forward destruction process is simple,
but the learned reverse process captures incredibly rich statistical structure.
That combination turned out to be extraordinarily powerful for AI generation.
Face-swapping and voice cloning are both forms of synthetic media, but they operate on completely different kinds of data and AI problems.
Face-swapping modifies visual identity in images/video.
Voice cloning imitates someone’s speech characteristics in audio.
They overlap in deepfakes, but technically they use different pipelines.
Core difference
| Technology | Input | Output |
| --- | --- | --- |
| Face-swapping | Video/images | Synthetic face |
| Voice cloning | Audio samples | Synthetic speech |
1. Face-swapping
Face-swapping replaces one person’s face with another while preserving:
expressions,
head movement,
eye direction,
lighting.
The AI learns facial geometry and appearance.
This relies heavily on:
Computer Vision
image generation models
facial landmark tracking
Simplified pipeline
Step A — Detect the face
The system identifies:
eyes,
nose,
mouth,
jawline.
This creates a facial map.
Step B — Encode facial features
The model compresses facial information into a latent representation.
Conceptually:
z = E(x)
where:
x = input face
E = encoder
z = latent features
Step C — Generate target identity
A decoder reconstructs the target person’s face while preserving expression.
Conceptually:
x̂ = D(z)
Step D — Blend into video
The generated face is composited onto the original frame.
Extra processing handles:
lighting,
skin tone,
shadows,
temporal consistency.
2. Voice cloning
Voice cloning reproduces:
tone,
pitch,
cadence,
accent,
speaking style.
Unlike face-swapping, this is mostly an audio signal problem.
It draws from:
Speech Processing
Text-to-Speech
Simplified pipeline
Step A — Analyze voice samples
The model studies:
frequency patterns,
pronunciation,
rhythm,
timbre.
Audio is transformed into representations like spectrograms.
Step B — Build speaker embedding
The system creates a compact mathematical representation of the speaker’s identity.
Conceptually:
s = f(audio)
where:
s = speaker embedding
This embedding captures what makes a voice unique.
Step C — Generate speech
The model combines:
text,
speaker embedding,
speech synthesis.
Conceptually:
Speech = G(text, s)
Step D — Vocoder converts to waveform
A vocoder transforms the generated representation into actual audio waves.
This produces realistic speech output.
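As a very rough stand-in for a speaker embedding, even simple signal statistics separate voices somewhat. The sketch below measures energy and zero-crossing rate (a crude pitch proxy) on synthetic tones; real systems learn neural embeddings from spectrograms instead:

```python
import math

def crude_embedding(samples, rate=8000):
    """A toy stand-in for a speaker embedding: mean energy plus
    zero-crossing rate (roughly the fundamental frequency in Hz for a
    pure tone). Real embeddings are learned, high-dimensional vectors."""
    energy = sum(s * s for s in samples) / len(samples)
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)
    zcr = crossings * rate / (2 * len(samples))
    return (round(energy, 3), round(zcr))

def tone(freq, rate=8000, secs=0.5):
    """Synthesize a pure sine tone at the given frequency."""
    return [math.sin(2 * math.pi * freq * i / rate)
            for i in range(int(rate * secs))]

low_voice = crude_embedding(tone(120))    # ~120 Hz fundamental
high_voice = crude_embedding(tone(220))   # ~220 Hz fundamental
print(low_voice, high_voice)
```

Even this crude feature vector distinguishes a low voice from a high one, hinting at why compact embeddings can capture speaker identity.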
Major technical difference
Face-swapping
Mostly spatial:
pixels,
geometry,
visual consistency.
The challenge:
maintaining realism frame-by-frame.
Voice cloning
Mostly temporal:
sound over time,
phoneme transitions,
speech dynamics.
The challenge:
preserving natural timing and prosody.
Why voice cloning can be easier
Humans are extremely sensitive to faces.
Small visual mistakes are noticeable:
eyes,
teeth,
skin movement.
But many people are less sensitive to subtle vocal inaccuracies.
So modern voice cloning often becomes convincing faster with less data.
Why video deepfakes are harder
Video requires:
face generation,
motion consistency,
lip synchronization,
lighting continuity,
audio alignment.
Errors accumulate across frames.
That’s why realistic AI video is much more computationally difficult than static images or speech.
Common AI architectures
| Task | Common Models |
| --- | --- |
| Face-swapping | GANs, diffusion models, autoencoders |
| Voice cloning | Transformers, autoregressive speech models, diffusion audio models |
Detection differences
Detecting face-swaps
Experts look for:
visual artifacts,
frame inconsistency,
lighting errors.
Detecting voice clones
Experts analyze:
unnatural frequency patterns,
prosody anomalies,
phase inconsistencies,
spectrogram artifacts.
Real-world uses
Legitimate uses
Film dubbing
Accessibility tools
AI assistants
Language translation
Visual effects
Harmful uses
Fraud calls
Identity impersonation
Fake political speeches
Scams
Non-consensual deepfakes
Combined deepfakes
Modern systems increasingly combine:
face-swapping,
voice cloning,
lip synchronization,
gesture synthesis.
This creates fully synthetic video personas that can appear highly realistic.
Video generation is much harder than image generation because a video is not just a sequence of good-looking images—it must also maintain consistent motion, identity, physics, lighting, and timing across time.
An image model only solves:
“What should this frame look like?”
A video model must solve:
“What should every frame look like, and how should they evolve coherently over time?”
1. Time adds a massive extra dimension
An image is spatial:
width,
height,
color channels.
A video adds:
time.
Conceptually:
Video(x, y, t)
Instead of generating one frame, the model generates hundreds or thousands while preserving continuity.
That dramatically increases complexity.
2. Temporal consistency is extremely difficult
A single realistic frame is not enough.
The model must keep stable:
faces,
clothing,
backgrounds,
object positions,
lighting,
shadows.
If tiny inconsistencies appear between frames, humans immediately notice:
flickering,
morphing faces,
jumping objects,
unstable hands.
This is called temporal coherence.
3. Motion is fundamentally harder than appearance
Generating a realistic image mostly requires understanding:
texture,
shape,
composition.
Generating video additionally requires modeling:
velocity,
acceleration,
momentum,
causality.
Conceptually:
x_{t+1} = x_t + v_t · Δt
The system must predict how objects evolve over time.
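A minimal version of that prediction is constant-velocity extrapolation from the last two observed positions:

```python
def predict_next(positions, dt=1.0):
    """Constant-velocity extrapolation: estimate velocity from the last
    two observed positions and project one time step forward."""
    (x0, y0), (x1, y1) = positions[-2], positions[-1]
    vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
    return (x1 + vx * dt, y1 + vy * dt)

# An object moving right and slightly down across three frames.
track = [(0, 0), (2, -1), (4, -2)]
print(predict_next(track))  # → (6.0, -3.0)
```

Real video models learn far richer dynamics (acceleration, deformation, occlusion), but this is the simplest case of "how should the next frame evolve?"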
4. Humans are highly sensitive to motion errors
People tolerate minor image imperfections.
But humans are extraordinarily sensitive to:
unnatural movement,
broken eye motion,
impossible physics,
inconsistent gait.
Tiny timing errors make generated video feel “off.”
This is related to:
Optical Flow
biological motion perception.
5. Memory requirements explode
An image model may process one frame.
A video model must remember previous frames to maintain consistency.
The model often needs to track:
object identity,
scene geometry,
motion trajectories,
camera movement.
Attention across many frames becomes computationally huge.
Transformer attention scales roughly like:
O(n²)
As frame count increases, computation grows rapidly.
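The quadratic growth is easy to see numerically (the tokens-per-frame figure is an arbitrary example):

```python
def attention_cost(frames, tokens_per_frame=256):
    """Self-attention compares every token with every other token, so
    cost grows with the square of the total sequence length."""
    n = frames * tokens_per_frame
    return n * n

# Doubling the clip length quadruples the attention cost.
short_clip = attention_cost(16)
long_clip = attention_cost(32)
print(long_clip // short_clip)  # → 4
```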
6. Physics consistency is difficult
Videos implicitly encode real-world physics.
The AI must learn:
gravity,
collisions,
fluid behavior,
body mechanics,
cloth movement.
Bad physics instantly reveals fake video.
Examples:
fingers merging,
impossible shadows,
floating objects,
inconsistent reflections.
7. Identity preservation is hard
In images:
a face only needs to look correct once.
In video:
the same identity must remain stable across many frames and angles.
Otherwise:
facial features drift,
eyes change shape,
hair changes length,
expressions warp.
This is one reason hands and faces are especially challenging.
8. Video has vastly more data
A single HD image:
maybe a few MB.
A short HD video:
thousands of frames,
gigabytes of information.
Training video models requires:
enormous datasets,
huge GPU clusters,
massive memory bandwidth.
9. Noise accumulation causes instability
In diffusion-based video generation, errors compound over time.
A tiny inconsistency in one frame may grow worse in later frames.
Conceptually:
ε_{t+1} = ε_t + δ
Small temporal noise can snowball into visible artifacts.
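A toy version of that compounding, using an assumed fixed drift per frame:

```python
def accumulate(drift_per_frame, frames):
    """Uncorrected per-frame error compounds step by step:
    eps_{t+1} = eps_t + delta."""
    eps = 0.0
    history = []
    for _ in range(frames):
        eps += drift_per_frame
        history.append(eps)
    return history

# A 0.5% drift per frame is invisible at first
# but obvious a hundred frames later.
errors = accumulate(0.005, 100)
print(round(errors[9], 3), round(errors[99], 3))  # → 0.05 0.5
```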
10. Camera movement complicates everything
The model must distinguish between:
object motion,
camera motion.
That requires understanding:
perspective,
depth,
occlusion,
scene geometry.
A rotating camera can completely change how objects appear.
11. Audio synchronization adds another layer
For talking videos:
lip movement,
facial muscles,
speech timing,
emotional expression
must align precisely.
Humans detect even tiny lip-sync errors.
Why modern AI video improved recently
Recent progress came from:
diffusion transformers,
larger datasets,
better motion conditioning,
latent video representations,
improved temporal attention.
Systems now model both:
spatial structure,
temporal dynamics.
Simple analogy
Image generation
Like painting a single convincing photograph.
Video generation
Like directing, animating, lighting, and filming an entire moving world consistently over time.
The deeper reason
Reality itself is temporally structured.
Video generation requires the AI to learn:
not just appearance,
but how the world changes.
That is a much more difficult modeling problem.
Conclusion on Deepfakes
Deepfakes represent one of the most powerful and disruptive applications of modern artificial intelligence. Using technologies such as Generative Adversarial Networks and diffusion models, AI systems can now create highly realistic synthetic faces, voices, images, and videos that are often difficult to distinguish from real media.
On the positive side, deepfake technology has valuable applications in:
film and visual effects,
education,
accessibility,
language translation,
virtual assistants,
gaming and digital avatars.
However, the same technology also creates serious risks:
misinformation,
identity fraud,
political manipulation,
impersonation scams,
non-consensual explicit content,
erosion of trust in digital evidence.
As deepfake quality improves, detecting fake media becomes increasingly challenging. Researchers now rely on:
AI-based forensic tools,
temporal and visual analysis,
metadata verification,
provenance systems,
cryptographic authenticity methods.
The development of deepfakes has created an ongoing technological arms race between:
generation systems,
and detection systems.
More broadly, deepfakes raise important ethical, legal, and social questions about:
privacy,
consent,
authenticity,
and trust in the digital age.
The future impact of deepfakes will depend not only on advances in AI, but also on:
responsible regulation,
public awareness,
media literacy,
and the development of trustworthy verification technologies.
In essence, deepfakes demonstrate both the extraordinary creative potential and the significant societal challenges of modern artificial intelligence.
Thanks for reading!