Adversarial examples are inputs crafted to mislead ML models — small, often imperceptible perturbations that cause misclassification. Originally a research curiosity, adversarial examples have practical implications for any production ML system: face recognition, content moderation, fraud detection, medical imaging. This article covers the techniques and defences.
The classic example
An image classifier labels a panda image as “panda” with 57.7% confidence. Add carefully computed noise, imperceptible to humans, and the same model labels the perturbed image as “gibbon” with 99.3% confidence (Goodfellow et al., 2015).
The attacker computes the noise by gradient-based optimisation against the model: with access to the model's internals, the gradient of the loss shows which direction in input space crosses the decision boundary.
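To make the idea concrete, here is a minimal sketch of a single gradient-sign step (the fast gradient sign method) against a toy logistic-regression classifier. The weights, input values, and epsilon are made up for illustration and do not come from any real model. The point is that a tiny change per feature adds up across many features.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a toy logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(w, b, x, y, eps):
    """One gradient-sign step that increases the loss for the true label y."""
    p = predict(w, b, x)
    # For the logistic loss, d(loss)/d(x_i) = (p - y) * w_i
    grad = [(p - y) * wi for wi in w]
    # Move every feature by eps in the direction that raises the loss
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

# Hypothetical 100-feature model and input (illustrative values only)
w = [0.5 if i % 2 == 0 else -0.5 for i in range(100)]
b = 0.0
x = [0.52 if i % 2 == 0 else 0.48 for i in range(100)]  # true label: 1
y = 1

x_adv = fgsm(w, b, x, y, eps=0.05)
print("clean score:      ", round(predict(w, b, x), 3))
print("adversarial score:", round(predict(w, b, x_adv), 3))
```

No feature moves by more than 0.05, yet the score crosses the 0.5 decision threshold because all 100 small changes push the logit the same way.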
Variants
White-box attacks
The attacker knows the model architecture and weights, computes the gradient of the loss with respect to the input, and perturbs the input to maximise the loss while staying within an L_∞ or L_2 norm bound.
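# PGD (Projected Gradient Descent), the most common white-box attack
# For each step:
# 1. Compute the gradient of loss(model(x_adversarial), label) with respect to x_adversarial
#    (label = true class for an untargeted attack, target class for a targeted one)
# 2. Step along the gradient (its sign for L_∞): ascend the loss if untargeted, descend it if targeted
# 3. Project back into the epsilon-ball around the original input
# 4. Repeat
# Libraries: cleverhans, foolbox, Adversarial Robustness Toolbox (IBM)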
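The steps above can be sketched in plain Python for the untargeted L_∞ case. This is a toy version against a hand-written logistic-regression model, not the API of any of the libraries listed. The names `pgd_linf` and `input_gradient`, the weights, and the step sizes are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def input_gradient(w, b, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]

def pgd_linf(w, b, x, y, eps, step, n_steps):
    """Untargeted L-infinity PGD: ascend the loss, then project into the eps-ball."""
    x_adv = list(x)
    for _ in range(n_steps):
        grad = input_gradient(w, b, x_adv, y)
        # Signed ascent step on the true-label loss
        x_adv = [xa + step * ((g > 0) - (g < 0)) for xa, g in zip(x_adv, grad)]
        # Project back into the eps-ball around the original input
        x_adv = [min(max(xa, xo - eps), xo + eps) for xa, xo in zip(x_adv, x)]
        # Keep features in the valid input range [0, 1]
        x_adv = [min(max(xa, 0.0), 1.0) for xa in x_adv]
    return x_adv

# Hypothetical model and input (illustrative values only)
w, b = [2.0, -3.0, 1.0], 0.1
x, y = [0.6, 0.2, 0.5], 1

x_adv = pgd_linf(w, b, x, y, eps=0.25, step=0.05, n_steps=10)
print("clean score:      ", round(predict(w, b, x), 3))
print("adversarial score:", round(predict(w, b, x_adv), 3))
```

For a real network the gradient would come from automatic differentiation rather than a closed-form expression, and practical implementations usually add a random start inside the epsilon-ball.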
Black-box attacks
Attacker doesn’t have model weights but can query the model and observe outputs. Three approaches:
- Transfer attacks — train a substitute model on similar data; craft adversarial examples against substitute; they often transfer to target model
- Score-based — query model for prediction probabilities; estimate gradient numerically
- Decision-based — only see model’s top-1 prediction; iterative boundary attacks (Boundary Attack, HopSkipJump)
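As a rough illustration of the score-based approach, the sketch below estimates a gradient with central finite differences, using nothing but the scores a model returns. The function `black_box_score` is a hypothetical stand-in for a remote model; its hidden weights are used only inside that function, never by the attacker code.

```python
import math

def estimate_gradient(query, x, h=1e-4):
    """Central finite differences using only model outputs (2 queries per feature)."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_plus[i] += h
        x_minus = list(x)
        x_minus[i] -= h
        grad.append((query(x_plus) - query(x_minus)) / (2 * h))
    return grad

def black_box_score(x):
    """Hypothetical black box: the attacker sees only the returned score."""
    hidden_w = [2.0, -3.0, 1.0]
    z = sum(wi * xi for wi, xi in zip(hidden_w, x)) + 0.1
    return 1.0 / (1.0 + math.exp(-z))

x = [0.6, 0.2, 0.5]
grad = estimate_gradient(black_box_score, x)
print([round(g, 4) for g in grad])
```

The query cost grows linearly with the input dimension, which is why practical score-based attacks on images rely on cheaper estimators such as random-direction sampling rather than per-pixel differences.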
Physical-world attacks
- Adversarial patches — printable patterns that, when held in front of a camera, fool object detectors
- Adversarial glasses — fool face recognition (Sharif et al.)
- Adversarial road signs — stickers that fool autonomous-vehicle classifiers
Text adversarial examples
For NLP models and LLMs, small text perturbations can change the classification:
- Synonym substitution preserving meaning
- Character-level edits (typos, homoglyphs)
- Insertion of irrelevant words
- Paraphrasing
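A character-level perturbation is easy to sketch. The example below swaps a few Latin letters for visually similar Cyrillic ones and shows the result slipping past a naive keyword filter. The filter, the blocklist, and the small homoglyph table are hypothetical and stand in for a real text classifier.

```python
# Latin -> Cyrillic look-alikes (a small illustrative subset)
HOMOGLYPHS = {
    "a": "\u0430",
    "e": "\u0435",
    "o": "\u043e",
    "p": "\u0440",
    "c": "\u0441",
}

def homoglyph_perturb(text):
    """Replace selected Latin letters with look-alike Cyrillic code points."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

def naive_filter(text, blocklist=("free money",)):
    """Hypothetical keyword filter: flags text containing a blocked phrase."""
    return any(term in text.lower() for term in blocklist)

original = "claim your free money now"
perturbed = homoglyph_perturb(original)

print(naive_filter(original))   # flagged
print(naive_filter(perturbed))  # not flagged, though it looks the same to a reader
```

The two strings have the same length and render almost identically, but they share no blocked substring at the code-point level.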
Defences
Adversarial training
Include adversarial examples in the training data. The model learns to be robust against the attack class it was trained against.
# Pseudocode (PyTorch-style); generate_adversarial is a placeholder, e.g. PGD
for x, y in training_data:
    x_adv = generate_adversarial(model, x, y)
    loss = loss_fn(model(x_adv), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Effective, but it trades some clean-data accuracy for robustness, is computationally expensive, and does not reliably generalise to attack types unseen during training.
Defensive distillation
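Train a student model on the softened softmax outputs (soft labels produced at a high temperature) of a teacher model. This shrinks the gradients an attacker relies on, making simple gradient attacks less effective. Largely abandoned: it was broken by adaptive attacks, notably those of Carlini and Wagner.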
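The softening step is a softmax with a temperature parameter. The sketch below shows how a high temperature flattens a teacher's output distribution. The logits and the temperature of 20 are illustrative assumptions, not values from any particular model.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; higher means softer."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [8.0, 2.0, 0.5]  # hypothetical teacher outputs
hard = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=20.0)

print([round(p, 3) for p in hard])
print([round(p, 3) for p in soft])
```

At temperature 1 nearly all the probability mass sits on the top class. At temperature 20 the ranking is unchanged but the distribution is much flatter, and these flatter targets are what the student is trained on.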
Input transformations
Pre-process inputs (compression, blurring, noise reduction) to disrupt adversarial perturbations. Effective against weak attackers; defeated by adaptive attackers.
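One simple transformation of this kind is bit-depth reduction, a crude form of lossy compression. The sketch below quantises pixel values so that a perturbation smaller than half a quantisation step disappears. The pixel values and the perturbation are hand-picked for illustration, and the helper `reduce_bit_depth` is an assumption, not a library function.

```python
def reduce_bit_depth(pixels, bits=3):
    """Quantise pixel values in [0, 1] onto an evenly spaced grid with 2**bits levels."""
    steps = 2 ** bits - 1
    return [round(p * steps) / steps for p in pixels]

# Hypothetical pixel values and a small +/-0.02 perturbation
clean = [0.14, 0.43, 0.86, 0.29]
perturbation = [0.02, -0.02, 0.02, -0.02]
adversarial = [c + d for c, d in zip(clean, perturbation)]

print(reduce_bit_depth(clean))
print(reduce_bit_depth(adversarial))
```

Both inputs quantise to the same values here, so the model downstream sees no difference. That only holds because the perturbation is small and the pixels are not near a rounding boundary. An attacker who knows about the quantisation step can craft perturbations that survive it, which is the adaptive-attacker failure mode described above.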
Detection
Train a separate classifier to detect adversarial inputs. Cat-and-mouse — attacker can craft examples that evade detection too.
Real-world impact
- Content moderation — adversarial examples bypass image / text moderation systems
- Face recognition — fooled by adversarial glasses, makeup patterns
- Fraud detection — transaction patterns crafted to evade ML detectors
- Medical AI: adversarial perturbations in imaging cause misclassification (demonstrated in research; no production attacks publicly known)
- Autonomous systems — research-stage but consequential
The Indian context
- DigiYatra (face-recognition airport entry) — adversarial robustness considerations
- Aadhaar biometric systems — UIDAI continuously updates against spoof attacks
- Bank fraud detection — adversarial transaction patterns are a real concern in production
The takeaway
Adversarial examples are a mature research area, and their production impact is increasing. Adversarial training is the primary defence; combined with input validation and ensemble methods, it improves robustness, but robustness is never absolute. For high-stakes ML deployments, threat-model the adversary. For low-stakes uses such as content recommendation, heavy investment in robustness is rarely worth the cost.