Module 4 · Fine-tuning & Custom Models

Manish Garg Associate of (ISC)² · RingSafe
Apr 25, 2026

Last updated: April 29, 2026

When APIs aren’t enough — train, evaluate, deploy custom models on your own infra. LoRA, vLLM, evals, and the cost trade-offs.

Most teams reach for fine-tuning too early. This module is about when fine-tuning genuinely beats prompt engineering + RAG, how to do it cost-effectively (LoRA on a single GPU), and how to ship the resulting model to production.

When to fine-tune (and when not to)

Fine-tune when:

  • You need consistent output structure at high volume that prompt engineering can’t reliably enforce
  • You work in a narrow domain (legal, medical, security) where the base model lacks the specialist vocabulary
  • API cost at your volume exceeds ₹2-3L (lakh) per month; a fine-tuned, self-hosted model could be up to 10× cheaper
  • You need data privacy: fine-tuning and serving on-prem keeps your data away from third-party API providers

Don’t fine-tune when:

  • Prompt engineering hasn’t been seriously tried (most teams skip this)
  • You don’t have at least 200-500 high-quality training examples
  • You don’t have an eval harness to measure improvement
  • The use case is rapidly evolving (fine-tuning is point-in-time)

LoRA + QLoRA — fine-tuning on one GPU

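Full fine-tuning of a 70B model needs ~280GB of VRAM for the fp16 weights and gradients alone, before optimizer states. That means at least 4× A100 80GB GPUs, and in practice a larger cluster. Out of reach for most teams. Enter LoRA (Low-Rank Adaptation):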

  • Train a small “adapter” layer instead of all model weights
  • Only ~0.1-1% of total params are trainable
  • Fits a 7B model on one consumer GPU (24GB VRAM)
  • QLoRA combines LoRA + 4-bit quantisation: fine-tune 70B on a single A100 (~₹150/hour on RunPod)

The resulting adapter is 50-200MB, versus ~14GB for the full fp16 weights of a 7B model. You can swap adapters at inference time, so one base model can serve multiple specialisations.
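
To see where the "0.1-1% of params" figure comes from, here is a back-of-the-envelope sketch. LoRA adds two small matrices per adapted weight matrix, so the trainable count is rank × (d_in + d_out). The layer shapes, rank, and choice of adapted matrices below are illustrative assumptions for a 7B-class model, not values from this module.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns two small matrices, A (rank x d_in) and B (d_out x rank),
    and leaves the original weight frozen.
    """
    return rank * (d_in + d_out)


# Assumed, illustrative shapes for a 7B-class model (not from the article).
HIDDEN = 4096
NUM_LAYERS = 32
ADAPTED_MATRICES_PER_LAYER = 2  # e.g. the query and value projections
RANK = 16
BASE_PARAMS = 7_000_000_000

trainable = (
    lora_param_count(HIDDEN, HIDDEN, RANK)
    * ADAPTED_MATRICES_PER_LAYER
    * NUM_LAYERS
)
fraction = trainable / BASE_PARAMS
adapter_mb_fp16 = trainable * 2 / 1e6  # 2 bytes per fp16 parameter

print(f"trainable params: {trainable:,}")
print(f"share of base model: {fraction:.2%}")
print(f"adapter size at fp16: {adapter_mb_fp16:.0f} MB")
```

With these assumptions the adapter is small even by LoRA standards. Raising the rank or adapting more matrices per layer (for example the MLP projections) moves it toward the 50-200MB range.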

Dataset preparation — the unsexy 80%

Fine-tuning quality is determined by data quality. Bad examples in = bad model out. The standard format:

{"messages": [
  {"role": "system", "content": "You are a security analyst..."},
  {"role": "user", "content": "Triage this CVE: ..."},
  {"role": "assistant", "content": "{\"severity\": \"HIGH\", ...}"}
]}
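
Malformed records are a common source of silent training problems, so it can help to validate every JSONL line before training. The checker below is a minimal sketch using only the standard library. The specific rules (allowed roles, last message from the assistant) are assumptions about a typical chat-format dataset, not requirements of any particular trainer.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unknown role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty content")

    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message must come from the assistant")
    return problems


# Usage: one well-formed record and one broken record.
good = json.dumps({"messages": [
    {"role": "user", "content": "Triage this CVE: ..."},
    {"role": "assistant", "content": json.dumps({"severity": "HIGH"})},
]})
bad = json.dumps({"messages": [{"role": "user", "content": ""}]})

print(validate_record(good))  # []
print(validate_record(bad))
```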

Rules:

  • Minimum 200 examples; 500-1000 ideal; 5000+ if domain is complex
  • Diversity matters more than volume — cover all the input variations you’ll see
  • Include edge-case examples explicitly; the model learns only what you show it
  • Hand-curate or LLM-generate then human-review — never raw scraped data
  • 80/20 split for train/eval; never train on eval
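
The 80/20 rule above can be sketched as a deterministic split. This version assumes each example is a hashable value such as a raw JSONL line. It also drops exact duplicates first, since a duplicate that lands on both sides would leak eval data into training. The function name and the seed are illustrative choices.

```python
import random


def split_train_eval(examples, eval_fraction=0.2, seed=42):
    """Deduplicate, shuffle deterministically, and split into (train, eval)."""
    unique = list(dict.fromkeys(examples))  # drops exact duplicates, keeps order
    rng = random.Random(seed)               # fixed seed makes the split reproducible
    rng.shuffle(unique)
    n_eval = max(1, round(len(unique) * eval_fraction))
    return unique[n_eval:], unique[:n_eval]


# Usage: 100 distinct lines plus one accidental duplicate.
lines = [f"example-{i}" for i in range(100)] + ["example-0"]
train, held_out = split_train_eval(lines)
print(len(train), len(held_out))  # 80 20
```

Near-duplicates (the same email with a different greeting, say) are not caught by this and need separate handling.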

Eval harnesses — non-negotiable

Without evals, fine-tuning is throwing darts in the dark. Build an eval suite BEFORE you start training:

  1. 50-200 held-out test cases, each with input + expected output
  2. Deterministic graders where possible (regex, exact match, JSON-schema validation)
  3. LLM-as-judge for fuzzy outputs (with care — judges have biases)
  4. Run eval before training (baseline), after each epoch, after every config change

If your fine-tune doesn’t measurably beat the base model on your eval suite, you’ve gained nothing.
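
A deterministic grader for the CVE-triage format shown earlier might look like the sketch below. The schema (a required `severity` key with four allowed values) and the `model_fn` callable are hypothetical. In a real harness `model_fn` would call the base or fine-tuned model.

```python
import json

REQUIRED_KEYS = {"severity"}  # hypothetical schema for the triage task
ALLOWED_SEVERITIES = {"LOW", "MEDIUM", "HIGH", "CRITICAL"}


def grade(output: str, expected: dict) -> bool:
    """Pass only if output is valid JSON, matches the schema, and agrees on severity."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict) or not REQUIRED_KEYS.issubset(parsed):
        return False
    if parsed["severity"] not in ALLOWED_SEVERITIES:
        return False
    return parsed["severity"] == expected["severity"]


def run_eval(model_fn, cases) -> float:
    """Return the pass rate of model_fn over (prompt, expected) cases."""
    if not cases:
        raise ValueError("eval suite is empty")
    passed = sum(grade(model_fn(prompt), expected) for prompt, expected in cases)
    return passed / len(cases)


# Usage with a stub "model" that always answers HIGH.
cases = [
    ("Triage CVE A", {"severity": "HIGH"}),
    ("Triage CVE B", {"severity": "LOW"}),
]


def stub_model(prompt: str) -> str:
    return '{"severity": "HIGH"}'


baseline = run_eval(stub_model, cases)
print(f"pass rate: {baseline:.0%}")  # pass rate: 50%
```

Running the same `run_eval` against the base model and then against each checkpoint gives the before/after comparison the list above calls for.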

Quantisation — fitting more on less

Model weights are typically stored in 16-bit precision (fp16 or bf16). You can compress them to 8-bit, 4-bit, or even 2-bit at the cost of some accuracy:

Format                VRAM (7B model)   Quality drop
fp16                  ~14GB             None (baseline)
int8 / GGUF Q8_0      ~7GB              ~1%
int4 / GGUF Q4_K_M    ~4GB              ~3-5%
2-bit                 ~2.5GB            ~10%+ (often not worth it)

4-bit is the sweet spot. A 7B model at Q4_K_M runs on a Mac M2 8GB or a ₹40K consumer GPU.
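
The VRAM column follows from simple arithmetic: parameters × bits per weight ÷ 8. The sketch below computes the weights-only footprint. It ignores the KV cache and runtime overhead, and real formats such as Q4_K_M store slightly more than their nominal bits per weight, which is why the table figures sit a little above these numbers.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_7B = 7e9

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4), ("2-bit", 2)]:
    print(f"{name:>5}: {weight_memory_gb(PARAMS_7B, bits):.2f} GB")
```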

Serving — vLLM is the answer

Hosting your fine-tuned model:

  • vLLM: among the fastest open-source inference servers. PagedAttention + continuous batching often give 5-10× higher throughput than naive one-request-at-a-time serving, depending on workload.
  • TGI (Hugging Face): an alternative, slightly behind vLLM in throughput
  • SGLang: newer, very fast for structured-output workloads
  • llama.cpp: for CPU/Mac; slow, but works without a GPU

Example: serving the base model plus a LoRA adapter with vLLM:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules my-adapter=/path/to/adapter \
  --max-model-len 8192
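
Once the server is up, it exposes an OpenAI-compatible HTTP API. The client below is a minimal standard-library sketch. It assumes the server is running locally on vLLM's default port 8000 and that the adapter is registered as `my-adapter`, as in the command above. The helper names are illustrative.

```python
import json
import urllib.request

# Assumptions: vLLM is running locally on its default port, and the adapter
# was registered under the name "my-adapter" as in the command above.
BASE_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(model: str, user_prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,  # adapter name selects the LoRA; base model name skips it
        "messages": [{"role": "user", "content": user_prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding for repeatable outputs
    }


def send_chat_request(payload: dict, url: str = BASE_URL) -> str:
    """POST the payload to the server and return the assistant's reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request("my-adapter", "Triage this CVE: ...")
# reply = send_chat_request(payload)  # requires the vLLM server to be running
```

Sending the same prompt with the base model's name in `model` is a quick way to compare adapter and base outputs side by side.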

Cost reality check

Fine-tuning a 7B model on 5K examples with LoRA:

  • Training: ~4 hours on 1× L4 GPU (RunPod ~₹50/hour) = ₹200 one-time
  • Storage: adapter is ~100MB, basically free
  • Inference: 1× L4 serves ~50 RPS; 24×7 hosting costs ~₹35K/month
  • Equivalent GPT-4o cost at same volume: ~₹3L/month

If you’re paying ₹3L/month for OpenAI and your task is bounded, fine-tuning + self-hosting saves roughly 90%. If you’re paying ₹30K/month, self-hosting at ~₹35K/month saves nothing and only adds operational burden.
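
The break-even arithmetic can be checked in a few lines. This sketch uses the module's own figures (₹50/hour for an L4, ₹3L and ₹30K monthly API bills) and assumes a 30-day month. It counts GPU rental only, not engineering time.

```python
HOURS_PER_MONTH = 24 * 30  # assumes a 30-day month of 24x7 hosting


def self_host_monthly_cost(gpu_rate_per_hour: float) -> float:
    """Monthly GPU rental cost in rupees for one always-on GPU."""
    return gpu_rate_per_hour * HOURS_PER_MONTH


def monthly_saving(api_cost: float, gpu_rate_per_hour: float) -> tuple[float, float]:
    """Return (absolute saving in rupees, saving as a fraction of the API bill)."""
    hosting = self_host_monthly_cost(gpu_rate_per_hour)
    saving = api_cost - hosting
    return saving, saving / api_cost


for api_bill in (300_000, 30_000):
    saving, share = monthly_saving(api_bill, gpu_rate_per_hour=50)
    print(f"API bill ₹{api_bill:,}: saving ₹{saving:,.0f} ({share:.0%})")
```

A negative saving means self-hosting costs more than the API at that volume.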

Your project for Module 4

Fine-tune a small open model (Llama-3.1-8B, per step 3) to classify phishing emails:

  1. Get public datasets: the Nazario phishing corpus for phishing and the Enron corpus for legitimate mail (~10K each)
  2. Format as instruction-tuning JSONL
  3. LoRA fine-tune Llama-3.1-8B with Unsloth on a RunPod L4 (~₹200, ~4 hours)
  4. Eval against held-out 1000 examples — target 96%+ F1
  5. Serve via vLLM on the same GPU
  6. Benchmark inference cost vs GPT-4o-mini API on same eval set
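
For step 4, F1 can be computed without any dependencies. The sketch below treats "phishing" as the positive class. The label strings and the function name are illustrative assumptions.

```python
def precision_recall_f1(y_true, y_pred, positive="phishing"):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Usage on a toy example: 2 true positives, 1 false positive, 1 false negative.
truth = ["phishing", "phishing", "phishing", "legit", "legit"]
preds = ["phishing", "phishing", "legit", "phishing", "legit"]
precision, recall, f1 = precision_recall_f1(truth, preds)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Reporting precision and recall alongside F1 shows whether misses or false alarms are driving the score, which matters for a phishing filter.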

Summary

  • Fine-tune only after prompt engineering + RAG hits a ceiling.
  • LoRA / QLoRA enables fine-tuning on one consumer GPU; full fine-tuning is rarely needed.
  • Data quality dominates; 500 hand-curated > 50,000 scraped.
  • Eval harness is non-negotiable; baseline before training.
  • 4-bit quantisation is the sweet spot for inference.
  • vLLM is the default open-source inference server.
  • Self-hosting saves 90% of API cost at high volume; otherwise it’s a burden.