Adversarial ML Examples: Attacks and Defences

Manish Garg, Associate of (ISC)² · RingSafe
Apr 25, 2026

Adversarial examples are inputs crafted to mislead ML models — small, often imperceptible perturbations that cause misclassification. Originally a research curiosity, adversarial examples have practical implications for any production ML system: face recognition, content moderation, fraud detection, medical imaging. This article covers the techniques and defences.

The classic example

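Image classifier: panda image → “panda” with 57.7% confidence. Add carefully computed noise (imperceptible to humans) → the same image is classified as “gibbon” with 99.3% confidence. These are the figures reported by Goodfellow et al. in the paper that popularised the example.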

The attacker computes the noise by gradient-based optimisation against the model. With access to the model's gradients, the attacker can work out which direction in input space crosses the decision boundary with the smallest change to the input.
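A toy version of this effect can be sketched in a few lines. This is an illustration, not the original experiment: it assumes a hypothetical linear classifier with random weights (`w`, `b`) over 10,000 "pixels" and applies the fast gradient sign method (FGSM), a one-step gradient attack. All names and numbers are made up for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """One-step L_inf attack on a logistic model p = sigmoid(w.x + b).

    For cross-entropy loss the gradient w.r.t. the input is (p - y) * w,
    so each feature moves by eps in the direction that raises the loss.
    """
    p = sigmoid(w @ x + b)
    grad = (p - y) * w
    return x + eps * np.sign(grad)

rng = np.random.default_rng(0)
d = 10_000                                # hypothetical number of "pixels"
w = rng.normal(size=d) / np.sqrt(d)       # toy weights, not a trained model
b = -0.5 * w.sum()                        # centres the decision boundary on mid-grey
x = 0.5 + 0.005 * np.sign(w)              # input the model weakly labels class 1
y = 1.0

x_adv = fgsm(x, y, w, b, eps=0.05)

p_clean = sigmoid(w @ x + b)
p_adv = sigmoid(w @ x_adv + b)
print(f"clean P(class 1) = {p_clean:.3f}")
print(f"adversarial P(class 1) = {p_adv:.3f}")
print(f"largest per-pixel change = {np.abs(x_adv - x).max():.3f}")
```

No single pixel changes by more than 0.05 on a 0 to 1 scale, but the small changes all push the logit the same way and add up across 10,000 dimensions. That accumulation is one common explanation for why high-dimensional inputs such as images are easy to attack.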

Variants

White-box attacks

Attacker knows the model architecture and weights. Computes gradient of loss with respect to input; perturbs input to maximise loss while staying within an L_∞ or L_2 norm bound.

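# PGD (Projected Gradient Descent), a widely used white-box attack
# Start from the original input (optionally a random point in the epsilon-ball), then for each step:
# 1. Compute the gradient of loss(model(x_adversarial), true_label) with respect to x_adversarial
# 2. Step to raise that loss (sign of the gradient for L_∞, normalised gradient for L_2);
#    for a targeted attack, step the other way to lower the loss on target_class
# 3. Project back into the epsilon-ball around the original input
# 4. Repeat for a fixed number of steps

# Libraries: cleverhans, foolbox, Adversarial Robustness Toolbox (ART, originally from IBM)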
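The four steps above can be written out as a minimal NumPy sketch. It assumes a hypothetical four-feature logistic model (`w`, `b`) whose input gradient has a closed form; for a real network the gradient would come from the framework's autograd or from one of the libraries listed. The function and parameter names are illustrative, not any library's API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad_wrt_input(x, y, w, b):
    """Gradient of binary cross-entropy w.r.t. the input of a logistic model."""
    return (sigmoid(w @ x + b) - y) * w

def pgd_linf(x, y, w, b, eps=0.1, step=0.025, n_steps=10):
    """Untargeted L_inf PGD: repeatedly ascend the loss on the true label y."""
    x_adv = x.copy()
    for _ in range(n_steps):
        g = loss_grad_wrt_input(x_adv, y, w, b)     # 1. gradient at the current point
        x_adv = x_adv + step * np.sign(g)           # 2. signed ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)    # 3. project into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)            #    and keep a valid input range
    return x_adv                                    # 4. the loop is the "repeat"

# Hypothetical model and input, chosen only for illustration.
w = np.array([2.0, -3.0, 1.0, -1.0])
b = 0.0
x = np.array([0.6, 0.3, 0.5, 0.4])
y = 1.0

x_adv = pgd_linf(x, y, w, b)
print("clean P(class 1):      ", sigmoid(w @ x + b))
print("adversarial P(class 1):", sigmoid(w @ x_adv + b))
```

On a linear model like this one the iterations simply walk to the corner of the epsilon-ball, so PGD gains nothing over a single signed step. The projection and repetition matter for non-linear models, where the gradient direction changes along the way.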

Black-box attacks

Attacker doesn’t have model weights but can query the model and observe outputs. Three approaches:

  • Transfer attacks — train a substitute model on similar data; craft adversarial examples against substitute; they often transfer to target model
  • Score-based — query model for prediction probabilities; estimate gradient numerically
  • Decision-based — only see model’s top-1 prediction; iterative boundary attacks (Boundary Attack, HopSkipJump)

Physical-world attacks

  • Adversarial patches — printable patterns that, when held in front of a camera, fool object detectors
  • Adversarial glasses — fool face recognition (Sharif et al.)
  • Adversarial road signs — stickers that fool autonomous-vehicle classifiers

Text adversarial examples

For NLP models and LLMs, small text perturbations can change the classification:

  • Synonym substitution preserving meaning
  • Character-level edits (typos, homoglyphs)
  • Insertion of irrelevant words
  • Paraphrasing
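The character-level case is the easiest to demonstrate. The sketch below is a toy: `BLOCKLIST` and `is_flagged` stand in for a real text classifier, and `HOMOGLYPHS` is a small hand-picked table of Latin letters and their Cyrillic lookalikes. All three are hypothetical names introduced for this example.

```python
# Hypothetical lookalike table: Latin letter -> visually similar Cyrillic letter.
HOMOGLYPHS = {
    "a": "\u0430",
    "c": "\u0441",
    "e": "\u0435",
    "o": "\u043e",
    "p": "\u0440",
}

def homoglyph_attack(text):
    """Replace the first swappable letter of every word with a lookalike."""
    words = []
    for word in text.split(" "):
        for i, ch in enumerate(word):
            if ch in HOMOGLYPHS:
                word = word[:i] + HOMOGLYPHS[ch] + word[i + 1:]
                break
        words.append(word)
    return " ".join(words)

BLOCKLIST = {"free", "prize"}        # toy stand-in for a text classifier

def is_flagged(text):
    """Flag the message if any whitespace-separated token is blocklisted."""
    return any(tok in BLOCKLIST for tok in text.lower().split())

msg = "claim your free prize"
adv = homoglyph_attack(msg)

print(is_flagged(msg), is_flagged(adv))
print(len(msg) == len(adv))
```

The perturbed message looks the same to a human reader but no longer matches any blocklisted token. Standard Unicode normalisation such as NFKC does not map Cyrillic letters to Latin ones, so a defence needs an explicit confusables mapping or a model trained on such variants.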

Defences

Adversarial training

Generate adversarial examples against the current model during training and train on them. The model becomes robust to the attack class it was trained against.

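# PyTorch-style sketch; assumes model, optimizer and train_loader already exist.
# generate_adversarial is a placeholder for an attack such as PGD.
import torch.nn.functional as F

for x, y in train_loader:
    x_adv = generate_adversarial(model, x, y)   # craft examples against the current weights
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)     # loss on the perturbed batch
    loss.backward()
    optimizer.step()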

Effective, but it trades some clean-data accuracy for robustness, is computationally expensive (every training step runs an attack first), and often generalises poorly to attack types or norm bounds it was not trained on.

Defensive distillation

Train a student model on the temperature-softened softmax outputs of a teacher model. The intent is to smooth gradients and make gradient-based attacks less effective. Largely abandoned: adaptive attacks, notably those by Carlini and Wagner, broke it.
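"Softened" here means dividing the logits by a temperature T before the softmax, which flattens the output distribution the student is trained on. A minimal sketch of that one step, with made-up logits and an arbitrarily chosen T = 20:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                               # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [8.0, 2.0, 0.5]      # hypothetical teacher output for one input

hard = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=20.0)

print("T=1: ", [round(p, 3) for p in hard])
print("T=20:", [round(p, 3) for p in soft])
```

At T = 1 almost all the probability mass sits on the top class; at T = 20 the ranking is unchanged but the distribution is much flatter, and those flatter targets are what the student learns from.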

Input transformations

Pre-process inputs (compression, blurring, noise reduction) to disrupt adversarial perturbations. Effective against weak attackers; defeated by adaptive attackers.
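Bit-depth reduction is one simple transformation of this kind. The sketch below assumes a toy 4×4 "image" whose clean pixel values already sit on a 3-bit grid, and uses small random noise as a stand-in for an adversarial perturbation; both are illustrative choices, not a realistic attack.

```python
import numpy as np

def reduce_bit_depth(x, bits=3):
    """Quantise values in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

rng = np.random.default_rng(0)
clean = rng.integers(0, 8, size=(4, 4)) / 7.0          # toy image on the 3-bit grid
delta = rng.uniform(-0.03, 0.03, size=clean.shape)     # stand-in for a small perturbation
perturbed = np.clip(clean + delta, 0.0, 1.0)

restored = reduce_bit_depth(perturbed, bits=3)

print("perturbation removed:", np.allclose(restored, clean))
```

The perturbation disappears only because it is smaller than half a quantisation step (1/14, about 0.07). An adaptive attacker who knows about the transformation can use a larger step or optimise through it, which is why this works only against weak attackers.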

Detection

Train a separate classifier to detect adversarial inputs. Cat-and-mouse — attacker can craft examples that evade detection too.

Real-world impact

  • Content moderation — adversarial examples bypass image / text moderation systems
  • Face recognition — fooled by adversarial glasses, makeup patterns
  • Fraud detection — transaction patterns crafted to evade ML detectors
  • Medical AI — adversarial perturbations in imaging cause misclassification (research; not yet known in production attacks)
  • Autonomous systems — research-stage but consequential

The Indian context

  • DigiYatra (face-recognition airport entry) — adversarial robustness considerations
  • Aadhaar biometric systems — UIDAI continuously updates against spoof attacks
  • Bank fraud detection — adversarial transaction patterns are a real concern in production

The takeaway

Adversarial examples are a mature research area, and their production impact is increasing. Adversarial training is the primary defence; combined with input validation and ensemble methods, robustness improves but is never absolute. For high-stakes ML deployments, threat-model the adversary. For low-stakes ones (such as content recommendation), the cost-benefit rarely justifies heavy investment in robustness.
