Deepfakes
- Manyanshi Joshi

Deepfakes are synthetic media—usually videos, images, or audio—created using artificial intelligence to make it look or sound like someone is saying or doing something they never actually did.
How they work
Deepfakes rely on techniques from Machine Learning, especially Neural Networks. A common approach uses Generative Adversarial Networks, where:
One AI generates fake content
Another AI tries to detect if it’s fake
They improve together until the result looks realistic
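The adversarial loop above can be sketched in a few lines of pure Python. This is a deliberately tiny one-dimensional caricature, not a real GAN: here the "discriminator" merely tracks what real data looks like and the "generator" nudges its output to fool it, whereas real systems train two neural networks against each other with gradient descent.

```python
import random

random.seed(0)

REAL_MEAN = 1.0   # centre of the "real data" distribution
g = -2.0          # the generator's current output value
d = 0.0           # the discriminator's current estimate of "real"

for step in range(200):
    real = random.gauss(REAL_MEAN, 0.1)   # a real sample
    # Discriminator: refine its notion of what real data looks like.
    d += 0.05 * (real - d)
    # Generator: shift its output toward whatever the discriminator
    # currently accepts as real, i.e. try to fool it.
    g += 0.05 * (d - g)

print(round(g, 2))   # ends up near REAL_MEAN
```

The feedback structure is the point: neither side is told the answer directly, yet the generator's output converges on realistic territory.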
Common uses
Not all deepfakes are harmful. They’re used in:
Movies and TV (de-aging actors, visual effects)
Video games and virtual avatars
Education and historical recreations
But they’re also used for:
Misinformation and fake news
Scams and impersonation (voice cloning, fake calls)
Non-consensual content
Why they matter
Deepfakes blur the line between real and fake media. This creates risks like:
Damaging reputations
Influencing elections or public opinion
Making it harder to trust video/audio evidence
How to spot a deepfake (basic clues)
Unnatural blinking or facial movements
Mismatched lighting or shadows
Blurry edges around the face
Audio that doesn’t perfectly match lip movement
Here’s a high-level walkthrough of how deepfakes are typically created. This stays at a conceptual level and avoids operational misuse details.
How Deepfakes Are Made — Step by Step
1. Collect training data
Creators gather lots of images or video clips of:
The target person (whose face or voice will appear)
The source actor (the person performing the actions)
The AI needs many angles, lighting conditions, and facial expressions to learn realistic patterns.
Examples:
Interviews
Social media videos
Photos
Podcasts (for voice cloning)
2. Extract and align faces
Software detects faces frame-by-frame in videos.
The system:
Finds eyes, nose, mouth
Rotates and crops the face
Standardizes size and position
This creates a clean dataset the AI can learn from.
This step uses concepts from Computer Vision.
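As a toy illustration of the alignment step, the rotation needed to level the eyes can be computed from two landmark positions. The pixel coordinates below are made up for the example:

```python
import math

# Hypothetical landmark positions (in pixels) for one detected face.
left_eye = (110, 140)
right_eye = (190, 120)

# Angle of the eye line relative to horizontal; rotating the crop by
# the negative of this angle makes the eyes level, which standardises
# every face in the training dataset.
dx = right_eye[0] - left_eye[0]
dy = right_eye[1] - left_eye[1]
angle_deg = math.degrees(math.atan2(dy, dx))

# Midpoint between the eyes, used as the rotation origin.
center = ((left_eye[0] + right_eye[0]) / 2,
          (left_eye[1] + right_eye[1]) / 2)

print(round(angle_deg, 1), center)  # → -14.0 (150.0, 130.0)
```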
3. Train the AI model
The core model learns how a face behaves.
Common architectures include:
Autoencoder
Generative Adversarial Networks
Diffusion-based image generators
The AI studies:
Facial structure
Expressions
Lighting
Skin texture
Movement patterns
Conceptually, the system tries to learn a transformation:
f(source face)→target facef(\text{source face}) \rightarrow \text{target face}f(source face)→target face
Training can take hours or days depending on:
Dataset size
Video quality
Hardware power
4. Generate swapped frames
Once trained, the AI processes each video frame:
Detects the source face
Predicts a transformed version
Generates a synthetic face
The output tries to preserve:
Expression
Head pose
Eye direction
Emotion
while changing identity.
5. Blend the fake face into the video
The generated face is composited onto the original frame.
Extra processing helps realism:
Color matching
Edge smoothing
Motion stabilization
Lighting correction
Without this step, deepfakes often look obviously fake.
6. Synchronize audio (optional)
For speaking videos:
Voice cloning models imitate tone and speech patterns
Lip-sync models align mouth movements to audio
This uses techniques from:
Speech Processing
Text-to-Speech
7. Render the final video
The processed frames are recombined into a finished video file.
Higher realism usually requires:
High-resolution generation
Frame consistency
Temporal smoothing across frames
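Temporal smoothing can be illustrated with a simple exponential moving average over toy "frames" (real pipelines use far more sophisticated methods, but the idea of blending each frame with its history is the same):

```python
def temporal_smooth(frames, alpha=0.7):
    """Exponential moving average across frames: each output frame blends
    the current frame with the smoothed history, which damps
    frame-to-frame flicker at the cost of slight motion lag."""
    smoothed = [frames[0]]
    for frame in frames[1:]:
        prev = smoothed[-1]
        smoothed.append([alpha * p + (1 - alpha) * q
                         for p, q in zip(frame, prev)])
    return smoothed

# Toy frames: each frame is a list of pixel intensities.
# The second pixel flickers between 100 and 140.
frames = [[100, 100], [100, 140], [100, 100], [100, 140]]
out = temporal_smooth(frames)
print([round(f[1]) for f in out])  # → [100, 128, 108, 131]
```

The flicker amplitude shrinks from 40 to roughly 23, which is exactly the kind of stabilisation that hides the last traces of per-frame generation.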
Why Older Deepfakes Looked Bad
Early deepfakes often had:
Flickering faces
Strange blinking
Warped teeth
Inconsistent lighting
Modern AI models are much better because:
Training datasets are larger
GPUs are faster
Diffusion models improved realism
How Detection Works
Researchers look for:
Biological inconsistencies
Compression artifacts
Unrealistic eye reflections
Frame-to-frame anomalies
Some detectors analyze tiny frequency patterns invisible to humans.
Ethical and legal issues
Many countries now regulate malicious deepfakes involving:
Fraud
Election misinformation
Non-consensual explicit content
Identity impersonation
Legitimate film/VFX use is generally treated differently from deceptive use.
Professionals detect deepfakes by combining human review, AI analysis, forensic techniques, and metadata investigation. No single method is perfect, so experts usually layer multiple checks together.
1. Visual forensic analysis
Investigators examine frames for inconsistencies humans often miss.
Common clues:
Uneven lighting on the face
Blurry boundaries around hair or jawline
Warped glasses, earrings, or teeth
Inconsistent reflections in eyes
Skin texture changing between frames
They also check whether facial movement follows natural biomechanics.
Example concept:
Frame Consistency(t) ≈ Frame Consistency(t + 1)
Real videos usually maintain stable patterns across adjacent frames.
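A minimal version of this adjacent-frame check, using made-up pixel lists as frames:

```python
def frame_diff(a, b):
    """Mean absolute per-pixel difference between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def flag_anomalies(frames, threshold=30):
    """Flag frame indices where the frame-to-frame change jumps past the
    threshold — a crude adjacent-frame consistency check. Note that both
    entering and leaving a glitched frame get flagged."""
    return [i for i in range(1, len(frames))
            if frame_diff(frames[i - 1], frames[i]) > threshold]

# Toy frames: a stable video with one glitched frame at index 2.
frames = [[10, 10, 10], [12, 11, 10], [90, 95, 80], [13, 12, 11]]
print(flag_anomalies(frames))  # → [2, 3]
```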
2. Temporal analysis (motion over time)
A fake frame may look convincing alone, but motion reveals problems.
Detection systems analyze:
Eye blinking frequency
Lip synchronization
Head movement continuity
Micro-expressions
Natural muscle motion
Older deepfakes often failed here because each frame was generated too independently.
This area uses techniques from Signal Processing and Optical Flow.
3. AI-based detectors
Modern detectors train AI against other AI.
A detector learns statistical fingerprints left by generators:
Pixel distribution anomalies
Frequency-domain artifacts
Unrealistic texture synthesis
Compression mismatches
Conceptually:
P(fake | x) > P(real | x)
where the model estimates whether media is likely fake.
Researchers often use architectures from:
Convolutional Neural Network
Transformer
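The decision rule above can be sketched with a toy one-feature Bayes classifier. The "artifact score" feature and its two Gaussian class distributions are invented for illustration; real detectors learn far richer features from pixels:

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return (math.exp(-((x - mean) ** 2) / (2 * std ** 2))
            / (std * math.sqrt(2 * math.pi)))

def p_fake_given_x(x, prior_fake=0.5):
    """Bayes' rule over two assumed class-conditional densities:
    fake media scores high on the artifact feature, real media low."""
    pf = gaussian_pdf(x, mean=0.8, std=0.15) * prior_fake
    pr = gaussian_pdf(x, mean=0.2, std=0.15) * (1 - prior_fake)
    return pf / (pf + pr)

print(round(p_fake_given_x(0.75), 3))  # high artifact score → likely fake
print(round(p_fake_given_x(0.25), 3))  # low artifact score → likely real
```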
4. Frequency-domain analysis
Humans mostly notice spatial patterns, but detectors inspect hidden mathematical patterns.
Using transforms like:
F(ω) = ∫ f(t) e^(−iωt) dt   (integrating over all time t)
experts analyze frequency signatures produced during AI generation.
AI-generated media can leave:
Repeating noise structures
Unnatural high-frequency details
Generator-specific fingerprints
This comes from the Fourier Transform.
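A naive discrete Fourier transform is enough to show the idea: injecting a high-frequency pattern into a smooth signal leaves obvious extra energy in the upper frequency bins.

```python
import math

def dft_magnitudes(signal):
    """Naive discrete Fourier transform, returning the magnitude of each
    frequency bin (O(n^2); real tools use the FFT, but the maths is
    identical)."""
    n = len(signal)
    mags = []
    for k in range(n):
        re = sum(x * math.cos(-2 * math.pi * k * i / n)
                 for i, x in enumerate(signal))
        im = sum(x * math.sin(-2 * math.pi * k * i / n)
                 for i, x in enumerate(signal))
        mags.append(math.hypot(re, im))
    return mags

n = 32
# A smooth signal vs. one with an injected high-frequency pattern of the
# kind a generator can leave behind.
smooth = [math.sin(2 * math.pi * i / n) for i in range(n)]
spiky = [s + 0.5 * math.sin(2 * math.pi * 8 * i / n)
         for i, s in enumerate(smooth)]

def hf_energy(sig):
    """Total magnitude in the higher-frequency bins (4 .. n/2)."""
    return sum(dft_magnitudes(sig)[4:n // 2])

print(hf_energy(smooth), hf_energy(spiky))
```

The smooth signal has essentially zero high-frequency energy, while the doctored one carries a large spike at bin 8, a simplified version of the "frequency fingerprints" detectors hunt for.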
5. Metadata and provenance checks
Professionals inspect:
File creation history
Editing software traces
Camera metadata
Compression history
Upload timestamps
A “phone video” that lacks expected smartphone metadata can raise suspicion.
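A trivial sketch of such a check (the field names below are illustrative, not any real container format):

```python
# Fields a genuine smartphone video would typically carry.
# These key names are made up for the example.
EXPECTED_KEYS = {"device_model", "creation_time", "gps", "encoder"}

def metadata_red_flags(metadata):
    """Return which expected fields are missing. Absence is not proof of
    fakery — metadata can be stripped legitimately — but it raises
    suspicion and prompts deeper forensic checks."""
    return sorted(EXPECTED_KEYS - set(metadata))

suspect = {"creation_time": "2024-05-01T10:00:00Z", "encoder": "ffmpeg"}
print(metadata_red_flags(suspect))  # → ['device_model', 'gps']
```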
6. Source verification
Journalists and investigators often verify:
Original upload source
Reverse image/video search
Whether the scene existed before
Geolocation and weather consistency
This is common in Bellingcat-style investigations.
7. Biological signal analysis
Advanced systems analyze subtle human physiological signals:
Blood-flow color changes in skin
Heart-rate patterns from tiny facial color variations
Natural breathing rhythms
AI generators often fail to reproduce these perfectly.
8. Watermarks and cryptographic signatures
Some companies embed invisible authenticity markers into real media.
Efforts include:
Content provenance systems
Cryptographic signing
Camera authenticity standards
Organizations like the Coalition for Content Provenance and Authenticity (C2PA) work on this.
Why detection is difficult
Detection is an arms race.
As generators improve:
Artifacts disappear
Motion becomes smoother
Voice synthesis improves
Lighting realism increases
So detectors constantly retrain against newer models.
Important reality
Even professionals sometimes cannot conclusively prove a sophisticated deepfake from visual inspection alone. That’s why investigators increasingly rely on:
provenance,
trusted capture systems,
and chain-of-custody evidence,
not just image analysis.
Both GANs and diffusion models generate realistic AI images, videos, or audio—but they do it in very different ways.
Core idea
GANs: “Generator vs Detective”
A Generative Adversarial Network (GAN) uses two neural networks competing against each other:
Generator → creates fake images
Discriminator → tries to detect fakes
The generator improves by trying to fool the discriminator.
Conceptually:
min_G max_D V(D, G) = E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))]
Think of it like:
a counterfeiter vs
a detective.
Over time, both become highly skilled.
Diffusion models: “Destroy then rebuild”
A Diffusion Model works differently.
The model:
Gradually adds noise to real images
Learns how to reverse that noise process
Generates new images by turning random noise into coherent pictures
Conceptually:
x_t = √(1 − β_t) · x_{t−1} + √(β_t) · ε
Then the model learns the reverse process:
p_θ(x_{t−1} | x_t)
A diffusion model is more like:
repeatedly refining static
until an image appears.
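The forward "destroy" half can be simulated directly. This toy version noises a single pixel value; with a variance-preserving schedule, the original signal decays toward 0 while the overall variance settles near 1 (pure noise):

```python
import math
import random

random.seed(0)

def forward_noise(x0, betas):
    """Forward diffusion on a single value: each step mixes in Gaussian
    noise via x_t = sqrt(1 - b) * x_{t-1} + sqrt(b) * eps."""
    x = x0
    for b in betas:
        eps = random.gauss(0.0, 1.0)
        x = math.sqrt(1 - b) * x + math.sqrt(b) * eps
    return x

betas = [0.05] * 200   # a simple fixed noise schedule
samples = [forward_noise(1.0, betas) for _ in range(2000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
# After enough steps the signal is destroyed: mean ≈ 0, variance ≈ 1.
print(round(mean, 2), round(var, 2))
```

The model's training job is then to learn the reverse of exactly this process.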
Visual intuition
GAN
Random noise → generator → fake image
Discriminator says:
“looks fake”
“looks real”
Generator improves from feedback.
Diffusion
Random noise → slightly cleaner → clearer → detailed image → final result
The image emerges progressively.
Main differences
| Feature | GANs | Diffusion Models |
| --- | --- | --- |
| Generation style | Competitive game | Gradual denoising |
| Speed | Usually faster | Usually slower |
| Stability during training | Harder | More stable |
| Image quality | Sharp but sometimes flawed | Extremely realistic |
| Diversity | Can collapse to similar outputs | Better diversity |
| Modern popularity | Declining somewhat | Dominant today |
Why GANs were revolutionary
GANs created the first truly convincing:
fake faces,
face swaps,
synthetic humans,
early deepfakes.
Projects like StyleGAN became famous for ultra-realistic fake faces.
Why diffusion models became dominant
Diffusion models power many modern systems because they:
produce more consistent images,
handle text prompts better,
scale effectively.
Examples include:
Stable Diffusion
DALL·E
Midjourney
These models are especially strong at:
photorealism,
artistic generation,
complex compositions.
In deepfakes specifically
GAN-based deepfakes
Older systems often:
swapped faces directly,
generated individual frames,
struggled with temporal consistency.
Diffusion-based deepfakes
Newer systems:
generate smoother details,
preserve lighting better,
create more realistic skin and motion,
improve frame coherence.
This is one reason modern AI video is advancing rapidly.
Weaknesses
GAN weaknesses
Training instability
“Mode collapse” (repeating similar outputs)
Hard balancing between networks
Example idea:
G(z₁) ≈ G(z₂) ≈ G(z₃)
where many inputs generate nearly identical outputs.
Diffusion weaknesses
Computationally expensive
Slower generation
Requires many denoising steps
Though newer techniques are speeding this up.
Simple analogy
GAN
A student artist competes against an art critic.
Diffusion
A sculptor slowly removes noise from a block of static until a picture appears.
Diffusion models are connected to thermodynamics because they mathematically resemble physical diffusion processes—the same kinds of processes that describe:
heat spreading,
smoke dispersing,
molecules moving randomly in fluids.
Modern AI diffusion models borrow equations and ideas from Statistical Mechanics and stochastic thermodynamics.
The core intuition
In physics:
A highly ordered system naturally becomes more disordered over time.
Examples:
Ice melts
Perfume spreads through air
Heat equalizes in a room
This trend toward disorder is related to:
Entropy
Diffusion models imitate this process
Forward process: adding noise
A diffusion model gradually destroys an image by adding random noise step-by-step.
Eventually:
Image → static noise
Mathematically:
x_t = √(1 − β_t) · x_{t−1} + √(β_t) · ε
where:
x_t = noisy image at step t
β_t = noise amount at step t
ε = random Gaussian noise
This resembles physical diffusion:
particles spreading randomly,
information becoming disordered.
Thermodynamics connection
In thermodynamics, systems evolve toward maximum entropy.
Diffusion models deliberately push images toward a high-entropy state:
pure randomness.
Conceptually:
S = −k_B Σ_i p_i ln p_i
This is the famous entropy equation from Ludwig Boltzmann.
As noise increases:
structure disappears,
entropy rises.
The reverse process is the magic
Physics says:
diffusion naturally increases disorder,
reversing it exactly is extremely difficult.
But diffusion models learn an approximate reverse process.
They learn:
Noise → slightly less noisy → recognizable structure → image
Mathematically:
p_θ(x_{t−1} | x_t)
The AI estimates:
“Given this noisy image, what cleaner image likely came before it?”
Why this resembles statistical physics
The model treats image generation probabilistically.
Instead of storing one exact image, it learns:
probability distributions,
transitions between states,
stochastic trajectories.
This closely mirrors:
Brownian motion,
particle diffusion,
stochastic differential equations.
Brownian motion connection
The forward noising process resembles:
Brownian Motion
where particles randomly drift over time.
The mathematical framework often uses:
dx=f(x,t)dt+g(t)dWtdx = f(x,t)dt + g(t)dW_tdx=f(x,t)dt+g(t)dWt
This is a stochastic differential equation (SDE):
deterministic drift term,
random noise term.
These equations are common in:
thermodynamics,
quantum mechanics,
financial mathematics.
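Such an SDE can be simulated with the standard Euler–Maruyama scheme. The drift and noise terms below define a toy mean-reverting (Ornstein–Uhlenbeck-style) process, chosen purely for illustration:

```python
import random

random.seed(0)

def euler_maruyama(x0, drift, diffusion, dt=0.01, steps=1000):
    """Simulate dx = f(x,t)dt + g(t)dW_t with the Euler-Maruyama scheme:
    each step adds deterministic drift plus a scaled Brownian increment."""
    x, t = x0, 0.0
    for _ in range(steps):
        dW = random.gauss(0.0, dt ** 0.5)   # Brownian increment
        x += drift(x, t) * dt + diffusion(t) * dW
        t += dt
    return x

# Drift pulls x back toward 0 while noise keeps kicking it around —
# a standard toy model of a diffusing particle.
end = euler_maruyama(x0=5.0,
                     drift=lambda x, t: -1.0 * x,
                     diffusion=lambda t: 0.3)
print(round(end, 2))  # far from the start of 5.0, jittering near 0
```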
Why reversing diffusion works at all
Real thermodynamic systems lose information.
But diffusion models train on massive datasets and learn statistical structure:
faces,
textures,
lighting,
object relationships.
So during denoising, the model reconstructs likely structures—not the original exact image.
That’s why generated images are new creations rather than recovered originals.
Energy landscape intuition
Another physics analogy:
Imagine a landscape of possible images.
Random noise sits in chaotic high-energy regions.
Realistic images occupy stable low-energy regions.
The model learns how to “flow downhill” toward realistic states.
This idea relates to:
energy-based models,
free energy minimization,
equilibrium systems.
Why physicists became interested
Many researchers noticed that diffusion models:
behave like nonequilibrium thermodynamic systems,
can be analyzed with statistical mechanics tools,
resemble physical reversibility problems.
Some papers directly connect them to:
the Fokker–Planck Equation,
Langevin dynamics,
entropy production.
Simple analogy
Imagine:
You repeatedly smear ink across a painting until it becomes gray static.
Then an AI learns how to reverse the smearing process gradually.
That reversal process is mathematically related to reversing diffusion in physics.
The surprising part
Diffusion models work because:
the forward destruction process is simple,
but the learned reverse process captures incredibly rich statistical structure.
That combination turned out to be extraordinarily powerful for AI generation.
Face-swapping and voice cloning are both forms of synthetic media, but they operate on completely different kinds of data and AI problems.
Face-swapping modifies visual identity in images/video.
Voice cloning imitates someone’s speech characteristics in audio.
They overlap in deepfakes, but technically they use different pipelines.
Core difference
| Technology | Input | Output |
| --- | --- | --- |
| Face-swapping | Video/images | Synthetic face |
| Voice cloning | Audio samples | Synthetic speech |
1. Face-swapping
Face-swapping replaces one person’s face with another while preserving:
expressions,
head movement,
eye direction,
lighting.
The AI learns facial geometry and appearance.
This relies heavily on:
Computer Vision
image generation models
facial landmark tracking
Simplified pipeline
Step A — Detect the face
The system identifies:
eyes,
nose,
mouth,
jawline.
This creates a facial map.
Step B — Encode facial features
The model compresses facial information into a latent representation.
Conceptually:
z = E(x)
where:
x = input face
E = encoder
z = latent features
Step C — Generate target identity
A decoder reconstructs the target person’s face while preserving expression.
Conceptually:
x̂ = D(z)
Step D — Blend into video
The generated face is composited onto the original frame.
Extra processing handles:
lighting,
skin tone,
shadows,
temporal consistency.
2. Voice cloning
Voice cloning reproduces:
tone,
pitch,
cadence,
accent,
speaking style.
Unlike face-swapping, this is mostly an audio signal problem.
It draws from:
Speech Processing
Text-to-Speech
Simplified pipeline
Step A — Analyze voice samples
The model studies:
frequency patterns,
pronunciation,
rhythm,
timbre.
Audio is transformed into representations like spectrograms.
Step B — Build speaker embedding
The system creates a compact mathematical representation of the speaker’s identity.
Conceptually:
s = f(audio)
where:
s = speaker embedding
This embedding captures what makes a voice unique.
Step C — Generate speech
The model combines:
text,
speaker embedding,
speech synthesis.
Conceptually:
Speech = G(text, s)
Step D — Vocoder converts to waveform
A vocoder transforms the generated representation into actual audio waves.
This produces realistic speech output.
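As a very rough stand-in for a speaker embedding, even simple signal statistics separate voices somewhat. The sketch below measures energy and zero-crossing rate (a crude pitch proxy) on synthetic tones; real systems learn neural embeddings from spectrograms instead:

```python
import math

def crude_embedding(samples, rate=8000):
    """A toy stand-in for a speaker embedding: mean energy plus
    zero-crossing rate (roughly the fundamental frequency in Hz for a
    pure tone). Real embeddings are learned, high-dimensional vectors."""
    energy = sum(s * s for s in samples) / len(samples)
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)
    zcr = crossings * rate / (2 * len(samples))
    return (round(energy, 3), round(zcr))

def tone(freq, rate=8000, secs=0.5):
    """Synthesize a pure sine tone at the given frequency."""
    return [math.sin(2 * math.pi * freq * i / rate)
            for i in range(int(rate * secs))]

low_voice = crude_embedding(tone(120))    # ~120 Hz fundamental
high_voice = crude_embedding(tone(220))   # ~220 Hz fundamental
print(low_voice, high_voice)
```

Even this crude feature vector distinguishes a low voice from a high one, hinting at why compact embeddings can capture speaker identity.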
Major technical difference
Face-swapping
Mostly spatial:
pixels,
geometry,
visual consistency.
The challenge:
maintaining realism frame-by-frame.
Voice cloning
Mostly temporal:
sound over time,
phoneme transitions,
speech dynamics.
The challenge:
preserving natural timing and prosody.
Why voice cloning can be easier
Humans are extremely sensitive to faces.
Small visual mistakes are noticeable:
eyes,
teeth,
skin movement.
But many people are less sensitive to subtle vocal inaccuracies.
So modern voice cloning often becomes convincing faster with less data.
Why video deepfakes are harder
Video requires:
face generation,
motion consistency,
lip synchronization,
lighting continuity,
audio alignment.
Errors accumulate across frames.
That’s why realistic AI video is much more computationally difficult than static images or speech.
Common AI architectures
| Task | Common Models |
| --- | --- |
| Face-swapping | GANs, diffusion models, autoencoders |
| Voice cloning | Transformers, autoregressive speech models, diffusion audio models |
Detection differences
Detecting face-swaps
Experts look for:
visual artifacts,
frame inconsistency,
lighting errors.
Detecting voice clones
Experts analyze:
unnatural frequency patterns,
prosody anomalies,
phase inconsistencies,
spectrogram artifacts.
Real-world uses
Legitimate uses
Film dubbing
Accessibility tools
AI assistants
Language translation
Visual effects
Harmful uses
Fraud calls
Identity impersonation
Fake political speeches
Scams
Non-consensual deepfakes
Combined deepfakes
Modern systems increasingly combine:
face-swapping,
voice cloning,
lip synchronization,
gesture synthesis.
This creates fully synthetic video personas that can appear highly realistic.
Video generation is much harder than image generation because a video is not just a sequence of good-looking images—it must also maintain consistent motion, identity, physics, lighting, and timing across time.
An image model only solves:
“What should this frame look like?”
A video model must solve:
“What should every frame look like, and how should they evolve coherently over time?”
1. Time adds a massive extra dimension
An image is spatial:
width,
height,
color channels.
A video adds:
time.
Conceptually:
Video(x, y, t)
Instead of generating one frame, the model generates hundreds or thousands while preserving continuity.
That dramatically increases complexity.
2. Temporal consistency is extremely difficult
A single realistic frame is not enough.
The model must keep stable:
faces,
clothing,
backgrounds,
object positions,
lighting,
shadows.
If tiny inconsistencies appear between frames, humans immediately notice:
flickering,
morphing faces,
jumping objects,
unstable hands.
This is called temporal coherence.
3. Motion is fundamentally harder than appearance
Generating a realistic image mostly requires understanding:
texture,
shape,
composition.
Generating video additionally requires modeling:
velocity,
acceleration,
momentum,
causality.
Conceptually:
x_{t+1} = x_t + v_t · Δt
The system must predict how objects evolve over time.
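A minimal version of that prediction is constant-velocity extrapolation from the last two observed positions:

```python
def predict_next(positions, dt=1.0):
    """Constant-velocity extrapolation: estimate velocity from the last
    two observed positions and project one time step forward."""
    (x0, y0), (x1, y1) = positions[-2], positions[-1]
    vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
    return (x1 + vx * dt, y1 + vy * dt)

# An object moving right and slightly down across three frames.
track = [(0, 0), (2, -1), (4, -2)]
print(predict_next(track))  # → (6.0, -3.0)
```

Real video models learn far richer dynamics (acceleration, deformation, occlusion), but this is the simplest case of "how should the next frame evolve?"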
4. Humans are highly sensitive to motion errors
People tolerate minor image imperfections.
But humans are extraordinarily sensitive to:
unnatural movement,
broken eye motion,
impossible physics,
inconsistent gait.
Tiny timing errors make generated video feel “off.”
This is related to:
Optical Flow
biological motion perception.
5. Memory requirements explode
An image model may process one frame.
A video model must remember previous frames to maintain consistency.
The model often needs to track:
object identity,
scene geometry,
motion trajectories,
camera movement.
Attention across many frames becomes computationally huge.
Transformer attention scales roughly like:
O(n²)
As frame count increases, computation grows rapidly.
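The quadratic growth is easy to see numerically (the tokens-per-frame figure is an arbitrary example):

```python
def attention_cost(frames, tokens_per_frame=256):
    """Self-attention compares every token with every other token, so
    cost grows with the square of the total sequence length."""
    n = frames * tokens_per_frame
    return n * n

# Doubling the clip length quadruples the attention cost.
short_clip = attention_cost(16)
long_clip = attention_cost(32)
print(long_clip // short_clip)  # → 4
```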
6. Physics consistency is difficult
Videos implicitly encode real-world physics.
The AI must learn:
gravity,
collisions,
fluid behavior,
body mechanics,
cloth movement.
Bad physics instantly reveals fake video.
Examples:
fingers merging,
impossible shadows,
floating objects,
inconsistent reflections.
7. Identity preservation is hard
In images:
a face only needs to look correct once.
In video:
the same identity must remain stable across many frames and angles.
Otherwise:
facial features drift,
eyes change shape,
hair changes length,
expressions warp.
This is one reason hands and faces are especially challenging.
8. Video has vastly more data
A single HD image:
maybe a few MB.
A short HD video:
thousands of frames,
gigabytes of information.
Training video models requires:
enormous datasets,
huge GPU clusters,
massive memory bandwidth.
9. Noise accumulation causes instability
In diffusion-based video generation, errors compound over time.
A tiny inconsistency in one frame may grow worse in later frames.
Conceptually:
ε_{t+1} = ε_t + δ
Small temporal noise can snowball into visible artifacts.
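A toy version of that compounding, using an assumed fixed drift per frame:

```python
def accumulate(drift_per_frame, frames):
    """Uncorrected per-frame error compounds step by step:
    eps_{t+1} = eps_t + delta."""
    eps = 0.0
    history = []
    for _ in range(frames):
        eps += drift_per_frame
        history.append(eps)
    return history

# A 0.5% drift per frame is invisible at first
# but obvious a hundred frames later.
errors = accumulate(0.005, 100)
print(round(errors[9], 3), round(errors[99], 3))  # → 0.05 0.5
```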
10. Camera movement complicates everything
The model must distinguish between:
object motion,
camera motion.
That requires understanding:
perspective,
depth,
occlusion,
scene geometry.
A rotating camera can completely change how objects appear.
11. Audio synchronization adds another layer
For talking videos:
lip movement,
facial muscles,
speech timing,
emotional expression
must align precisely.
Humans detect even tiny lip-sync errors.
Why modern AI video improved recently
Recent progress came from:
diffusion transformers,
larger datasets,
better motion conditioning,
latent video representations,
improved temporal attention.
Systems now model both:
spatial structure,
temporal dynamics.
Simple analogy
Image generation
Like painting a single convincing photograph.
Video generation
Like directing, animating, lighting, and filming an entire moving world consistently over time.
The deeper reason
Reality itself is temporally structured.
Video generation requires the AI to learn:
not just appearance,
but how the world changes.
That is a much more difficult modeling problem.
Conclusion on Deepfakes
Deepfakes represent one of the most powerful and disruptive applications of modern artificial intelligence. Using technologies such as Generative Adversarial Networks and diffusion models, AI systems can now create highly realistic synthetic faces, voices, images, and videos that are often difficult to distinguish from real media.
On the positive side, deepfake technology has valuable applications in:
film and visual effects,
education,
accessibility,
language translation,
virtual assistants,
gaming and digital avatars.
However, the same technology also creates serious risks:
misinformation,
identity fraud,
political manipulation,
impersonation scams,
non-consensual explicit content,
erosion of trust in digital evidence.
As deepfake quality improves, detecting fake media becomes increasingly challenging. Researchers now rely on:
AI-based forensic tools,
temporal and visual analysis,
metadata verification,
provenance systems,
cryptographic authenticity methods.
The development of deepfakes has created an ongoing technological arms race between:
generation systems,
and detection systems.
More broadly, deepfakes raise important ethical, legal, and social questions about:
privacy,
consent,
authenticity,
and trust in the digital age.
The future impact of deepfakes will depend not only on advances in AI, but also on:
responsible regulation,
public awareness,
media literacy,
and the development of trustworthy verification technologies.
In essence, deepfakes demonstrate both the extraordinary creative potential and the significant societal challenges of modern artificial intelligence.
Thanks for reading!