Last updated: April 29, 2026
Most teams reach for fine-tuning too early. This module is about when fine-tuning genuinely beats prompt engineering + RAG, how to do it cost-effectively (LoRA on a single GPU), and how to ship the resulting model to production.
When to fine-tune (and when not to)
Fine-tune when:
- You need consistent output structure at high volume that prompt engineering can’t reliably enforce
- You have a narrow domain whose vocabulary the base model lacks (legal, medical, security)
- API cost at your volume exceeds ₹2-3L/month; a self-hosted fine-tuned model could be around 10× cheaper
- You need data privacy: fine-tuning and serving on-prem keeps your data away from third-party API providers
Don’t fine-tune when:
- Prompt engineering hasn’t been seriously tried (most teams skip this)
- You don’t have at least 200-500 high-quality training examples
- You don’t have an eval harness to measure improvement
- The use case is rapidly evolving (fine-tuning is point-in-time)
LoRA + QLoRA — fine-tuning on one GPU
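Full fine-tuning of a 70B model is out of reach for most teams. The fp16 weights alone take ~140GB, and gradients plus Adam optimizer state push training memory to roughly 1TB (about 16 bytes per parameter with mixed-precision Adam). That means a multi-node cluster of A100 80GB GPUs, not a single machine. Enter LoRA (Low-Rank Adaptation):
- Freeze the base weights and train small low-rank “adapter” matrices added to selected layers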
- Only ~0.1-1% of total params are trainable
- Fits a 7B model on one consumer GPU (24GB VRAM)
- QLoRA combines LoRA with 4-bit quantisation of the frozen base model: fine-tune 70B on a single A100 80GB (~₹150/hour on RunPod)
For a 7B model, the resulting adapter is 50-200MB instead of the ~14GB of full fp16 weights. You can swap adapters at inference time, which gives you multiple specialisations from one base model.
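To see where the "~0.1-1% of params" and "50-200MB" figures come from, here is a rough back-of-the-envelope sketch. The model shape (32 layers, hidden size 4096, four square attention projections adapted per layer) and the rank of 64 are illustrative assumptions, not properties of any specific checkpoint.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer.

    LoRA learns two low-rank matrices, A (rank x d_in) and B (d_out x rank),
    while the original weight matrix stays frozen.
    """
    return rank * (d_in + d_out)


# Illustrative assumptions only: a 7B-class model with 32 layers,
# hidden size 4096, and adapters on four square projections per layer.
HIDDEN = 4096
LAYERS = 32
ADAPTED_PER_LAYER = 4
RANK = 64
TOTAL_PARAMS = 7_000_000_000

trainable = LAYERS * ADAPTED_PER_LAYER * lora_params(HIDDEN, HIDDEN, RANK)
fraction = trainable / TOTAL_PARAMS
adapter_mb = trainable * 2 / 1e6  # 2 bytes per parameter when saved in fp16

print(f"trainable params: {trainable:,}")
print(f"fraction of model: {fraction:.2%}")
print(f"adapter size: {adapter_mb:.0f} MB")
```

Under these assumptions the adapter has about 67M trainable parameters, just under 1% of the model, and weighs roughly 134MB. A lower rank or fewer adapted layers shrinks both numbers proportionally.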
Dataset preparation — the unsexy 80%
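Fine-tuning quality is determined by data quality: bad examples in, bad model out. A common format is chat-style JSONL, with one object like this per line of the file (pretty-printed here for readability):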
{"messages": [
{"role": "system", "content": "You are a security analyst..."},
{"role": "user", "content": "Triage this CVE: ..."},
{"role": "assistant", "content": "{\"severity\": \"HIGH\", ...}"}
]}
Rules:
- Minimum 200 examples; 500-1000 ideal; 5000+ if domain is complex
- Diversity matters more than volume — cover all the input variations you’ll see
- Include “edge case” examples explicitly; the model only learns what you show it
- Hand-curate, or LLM-generate and then human-review; never use raw scraped data
- 80/20 split for train/eval; never train on eval
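The formatting and splitting rules above can be sketched in a few lines of standard-library Python. The helper names (`to_record`, `split_train_eval`, `write_jsonl`) and the placeholder examples are hypothetical, and real data would come from your curated set rather than a loop.

```python
import json
import random


def to_record(system: str, user: str, assistant: str) -> dict:
    """Wrap one example in the chat-message structure shown above."""
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]}


def split_train_eval(records: list, eval_fraction: float = 0.2, seed: int = 0):
    """Shuffle deterministically, then hold out eval_fraction for evaluation."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


def write_jsonl(path: str, records: list) -> None:
    """Write one JSON object per line, the layout JSONL loaders expect."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Placeholder examples purely for illustration.
records = [
    to_record("You are a security analyst.",
              f"placeholder input {i}",
              json.dumps({"severity": "HIGH"}))
    for i in range(10)
]
train_set, eval_set = split_train_eval(records)
print(len(train_set), "train /", len(eval_set), "eval")
```

Calling `write_jsonl("train.jsonl", train_set)` and `write_jsonl("eval.jsonl", eval_set)` then produces the two files. Fixing the seed keeps the split stable across runs, so an example never drifts from eval into train.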
Eval harnesses — non-negotiable
Without evals, fine-tuning is throwing darts in the dark. Build an eval suite BEFORE you start training:
- 50-200 held-out test cases, each with input + expected output
- Deterministic graders where possible (regex, exact match, JSON-schema validation)
- LLM-as-judge for fuzzy outputs (with care — judges have biases)
- Run eval before training (baseline), after each epoch, after every config change
If your fine-tune doesn’t measurably beat the base model on your eval suite, you’ve gained nothing.
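As a minimal sketch of a deterministic grader, the code below checks that a model's output parses as JSON, contains the required keys, and matches the expected values. The two "models" are stand-in stub functions and the test cases are placeholders; in practice `model_fn` would call your base or fine-tuned model.

```python
import json


def grade_json_output(raw: str, expected: dict, required_keys: set) -> bool:
    """Pass only if raw parses as a JSON object, has every required key,
    and agrees with the expected value for each of those keys."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict) or not required_keys.issubset(parsed.keys()):
        return False
    return all(parsed[key] == expected[key] for key in required_keys)


def run_eval(model_fn, cases: list, required_keys: set) -> float:
    """Return the fraction of cases the model passes."""
    passed = sum(
        grade_json_output(model_fn(case["input"]), case["expected"], required_keys)
        for case in cases
    )
    return passed / len(cases)


# Placeholder held-out cases and stub models, for illustration only.
cases = [
    {"input": "placeholder input 1", "expected": {"severity": "HIGH"}},
    {"input": "placeholder input 2", "expected": {"severity": "LOW"}},
]


def baseline_stub(prompt: str) -> str:
    return "Severity: HIGH"  # free text, so it fails the JSON check


def tuned_stub(prompt: str) -> str:
    return '{"severity": "HIGH"}'  # valid JSON, correct for one case only


baseline_score = run_eval(baseline_stub, cases, {"severity"})
tuned_score = run_eval(tuned_stub, cases, {"severity"})
print(f"baseline: {baseline_score:.0%}, tuned: {tuned_score:.0%}")
```

Running the same `run_eval` call before training and after each epoch gives you a single comparable number per checkpoint.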
Quantisation — fitting more on less
Model weights are typically stored in 16-bit precision (fp16 or bf16). You can compress them to 8-bit, 4-bit, or even 2-bit at the cost of some accuracy:
| Format | VRAM (7B model) | Quality drop |
|---|---|---|
| fp16 | ~14GB | None (baseline) |
| int8 / GGUF Q8_0 | ~7GB | ~1% |
| int4 / GGUF Q4_K_M | ~4GB | ~3-5% |
| 2-bit | ~2.5GB | ~10%+ (often not worth it) |
4-bit is the sweet spot. A 7B model at Q4_K_M runs on a Mac M2 8GB or a ₹40K consumer GPU.
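The VRAM column in the table follows from simple arithmetic: parameters × bits per weight ÷ 8. The sketch below is a lower bound for weights only. It ignores the KV cache and activations, and it assumes exactly 16, 8, 4 or 2 bits per weight, whereas real formats store extra scaling data and often keep some tensors at higher precision.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # a 7B model, as in the table

for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4), ("2-bit", 2)]:
    print(f"{label:>5}: {weight_memory_gb(N_PARAMS, bits):.2f} GB")
```

The fp16 and int8 rows match the table directly (14GB and 7GB). The idealised 4-bit and 2-bit figures (3.5GB and 1.75GB) come out lower than the table's values, which is consistent with the per-block overhead that practical quantisation formats carry.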
Serving — vLLM is the answer
Hosting your fine-tuned model:
- vLLM: one of the fastest open-source inference servers. PagedAttention plus continuous batching can give 5-10× higher throughput than naive one-request-at-a-time serving.
- TGI (Hugging Face): an alternative, generally slightly behind vLLM in throughput
- SGLang: newer, very fast for structured-output workloads
- llama.cpp: for CPU/Mac; slower, but works without a GPU
Example: serving the base model plus a LoRA adapter with vLLM:
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-lora \
--lora-modules my-adapter=/path/to/adapter \
--max-model-len 8192
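Once the server is up, it exposes an OpenAI-compatible HTTP API, and the adapter can be selected by passing the name registered in `--lora-modules` as the model. The client below is a minimal standard-library sketch. It assumes the server is listening on vLLM's default `localhost:8000`; the function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default host and port


def build_chat_request(model: str, user_text: str,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": model,  # base model name, or the LoRA adapter name
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, user_text: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(model, user_text)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server from the command above running, `chat("my-adapter", "Triage this CVE: ...")` would route the request through the adapter, while passing the base model's name would bypass it.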
Cost reality check
Fine-tuning a 7B model on 5K examples with LoRA:
- Training: ~4 hours on 1× L4 GPU (RunPod ~₹50/hour) = ₹200 one-time
- Storage: adapter is ~100MB, basically free
- Inference: 1× L4 serving ~50 RPS costs ~₹36K/month for 24×7 hosting (₹50/hour × 720 hours)
- Equivalent GPT-4o cost at same volume: ~₹3L/month
If you’re paying ₹3L/month for OpenAI and your task is bounded, fine-tuning plus self-hosting saves close to 90%. If you’re paying ₹30K/month, self-hosting costs about as much as the API itself, so there is no saving to offset the operational burden.
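The break-even logic above is easy to check with your own numbers. This sketch uses the figures from this section (₹50/hour for an L4, ₹3L and ₹30K monthly API bills) and assumes a 30-day month with the GPU running continuously.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, GPU on 24x7


def self_hosting_monthly(gpu_rate_per_hour: float) -> float:
    """Monthly GPU rental cost in rupees."""
    return gpu_rate_per_hour * HOURS_PER_MONTH


def saving_fraction(api_monthly: float, hosting_monthly: float) -> float:
    """Share of the API bill saved by self-hosting (negative means a loss)."""
    return 1 - hosting_monthly / api_monthly


hosting = self_hosting_monthly(50)
for api_bill in (300_000, 30_000):
    saving = saving_fraction(api_bill, hosting)
    print(f"API bill ₹{api_bill:,}: hosting ₹{hosting:,.0f}, saving {saving:.0%}")
```

At a ₹3L bill the saving works out to 88%; at ₹30K the result is negative, meaning the GPU alone costs more than the API before any engineering time is counted.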
Your project for Module 4
Fine-tune an 8B model to classify phishing emails:
- Get a public phishing dataset (Nazario phishing corpus, Enron for legitimate, ~10K each)
- Format as instruction-tuning JSONL
- LoRA fine-tune Llama-3.1-8B with Unsloth on a RunPod L4 (~₹200, ~4 hours)
- Evaluate against 1,000 held-out examples; target 96%+ F1
- Serve via vLLM on the same GPU
- Benchmark inference cost vs GPT-4o-mini API on same eval set
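For the F1 target in the evaluation step, a small scoring function is enough. The sketch below treats "phishing" as the positive class; the label strings and the five-example demo are illustrative assumptions, not output from a real model.

```python
def precision_recall_f1(y_true: list, y_pred: list, positive: str = "phishing"):
    """Compute precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Tiny illustrative example: 2 true positives, 1 false positive, 1 false negative.
y_true = ["phishing", "phishing", "legit", "legit", "phishing"]
y_pred = ["phishing", "legit", "legit", "phishing", "phishing"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Scoring the fine-tuned model and the GPT-4o-mini baseline with the same function on the same held-out set keeps the comparison fair.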
Summary
- Fine-tune only after prompt engineering + RAG hits a ceiling.
- LoRA / QLoRA enables fine-tuning on one consumer GPU; full fine-tuning is rarely needed.
- Data quality dominates; 500 hand-curated > 50,000 scraped.
- Eval harness is non-negotiable; baseline before training.
- 4-bit quantisation is the sweet spot for inference.
- vLLM is the default open-source inference server.
- Self-hosting saves 90% of API cost at high volume; otherwise it’s a burden.