Last updated: April 29, 2026

When APIs aren’t enough — train, evaluate, deploy custom models on your own infra. LoRA, vLLM, evals, and the cost trade-offs.

Most teams reach for fine-tuning too early. This module is about when fine-tuning genuinely beats prompt engineering + RAG, how to do it cost-effectively (LoRA on a single GPU), and how to ship the resulting model to production.

When to fine-tune (and when not to)

Fine-tune when:

You need consistent output structure at high volume that prompt engineering can’t reliably enforce
You have a narrow domain where the base model lacks the vocabulary (legal, medical, security)
API cost at your volume exceeds ₹2-3L/month; a self-hosted fine-tuned model could be around 10× cheaper
You need data privacy: fine-tuning and serving on-prem keeps data away from third-party API providers

Don’t fine-tune when:

Prompt engineering hasn’t been seriously tried (most teams skip this)
You don’t have at least 200-500 high-quality training examples
You don’t have an eval harness to measure improvement
The use case is rapidly evolving (fine-tuning is point-in-time)

LoRA + QLoRA — fine-tuning on one GPU

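Full fine-tuning of a 70B model needs ~140GB of VRAM just to hold the weights in fp16. Add gradients and Adam optimizer states (roughly 16 bytes per parameter with mixed-precision training) and the total passes 1TB before activations, which means a multi-node GPU cluster. Out of reach for most teams. Enter LoRA (Low-Rank Adaptation):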

Train small low-rank “adapter” matrices on top of frozen base weights instead of updating all model weights
Only ~0.1-1% of total params are trainable
Fine-tuning a 7B model fits on one consumer GPU (24GB VRAM)
QLoRA combines LoRA with 4-bit quantisation of the frozen base model: fine-tune 70B on a single A100 80GB (~₹150/hour on RunPod)

The resulting adapter is typically 50-200MB, versus ~14GB for the full fp16 weights of a 7B model. You can swap adapters at inference time, so one base model can serve multiple specialisations.
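To see where the 0.1-1% and 50-200MB figures can come from, here is a rough sketch that counts LoRA parameters for a 7B-class model. The layer dimensions (hidden size 4096, MLP size 11008, 32 blocks) match Llama-2-7B; the rank of 16 and the choice to adapt every attention and MLP projection are assumptions made for illustration, not settings prescribed by this module.

```python
# Rough LoRA size estimate. Every constant below is an illustrative assumption.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two small matrices to a d_in x d_out weight:
    A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

HIDDEN = 4096        # hidden size (Llama-2-7B-like, assumed)
MLP = 11008          # MLP intermediate size (assumed)
LAYERS = 32          # number of transformer blocks (assumed)
RANK = 16            # LoRA rank (assumed)
BASE_PARAMS = 6.7e9  # approximate parameter count of the base model

# Four attention projections (q, k, v, o) and three MLP projections per block.
per_layer = (
    4 * lora_params(HIDDEN, HIDDEN, RANK)
    + 2 * lora_params(HIDDEN, MLP, RANK)  # gate and up projections
    + 1 * lora_params(MLP, HIDDEN, RANK)  # down projection
)
total = per_layer * LAYERS
adapter_mb = total * 2 / 1e6  # 2 bytes per parameter when saved in fp16

print(f"trainable params: {total:,}")
print(f"fraction of base: {total / BASE_PARAMS:.2%}")
print(f"adapter size:     {adapter_mb:.0f} MB")
```

Under these assumptions the adapter has about 40M parameters, roughly 0.6% of the base model and about 80MB on disk. A lower rank or fewer adapted projections shrinks it further.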

Dataset preparation — the unsexy 80%

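Fine-tuning quality is determined by data quality. Bad examples in = bad model out. A common format is chat-style JSONL: one JSON object per line, each holding a full conversation (pretty-printed here for readability):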

{"messages": [
  {"role": "system", "content": "You are a security analyst..."},
  {"role": "user", "content": "Triage this CVE: ..."},
  {"role": "assistant", "content": "{\"severity\": \"HIGH\", ...}"}
]}

Rules:

Minimum 200 examples; 500-1000 ideal; 5000+ if domain is complex
Diversity matters more than volume — cover all the input variations you’ll see
Include “edge case” examples explicitly; the model learns only from what you show it
Hand-curate, or LLM-generate then human-review; never use raw scraped data
80/20 split for train/eval; never train on the eval set
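The format and rules above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a full pipeline: the helper names (`make_example`, `is_valid`, `split_train_eval`, `to_jsonl`) are hypothetical, and the ten records are placeholder data standing in for a real hand-reviewed set.

```python
import json
import random

REQUIRED_ROLES = ["system", "user", "assistant"]

def make_example(system: str, user: str, assistant: str) -> dict:
    """Build one chat-format training record."""
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]}

def is_valid(example: dict) -> bool:
    """Require exactly system/user/assistant messages with non-empty text."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not all(isinstance(m, dict) for m in messages):
        return False
    if [m.get("role") for m in messages] != REQUIRED_ROLES:
        return False
    return all(isinstance(m.get("content"), str) and m["content"].strip() for m in messages)

def split_train_eval(examples: list, eval_fraction: float = 0.2, seed: int = 42):
    """Shuffle deterministically, then hold out a fraction for evaluation."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]

def to_jsonl(examples: list) -> str:
    """Serialise records as JSONL: one compact JSON object per line."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples) + "\n"

# Placeholder data, purely illustrative.
examples = [
    make_example(
        "You are a security analyst...",
        f"Triage this CVE: placeholder description {i}",
        json.dumps({"severity": "HIGH" if i % 2 else "LOW"}),
    )
    for i in range(10)
]
examples = [e for e in examples if is_valid(e)]
train, eval_set = split_train_eval(examples)
train_jsonl = to_jsonl(train)  # write this string to a file such as train.jsonl
print(len(train), len(eval_set))
```

With ten records and a 0.2 eval fraction this prints `8 2`. Fixing the shuffle seed keeps the eval set stable across runs, which matters when you compare results between training configurations.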

Eval harnesses — non-negotiable

Without evals, fine-tuning is throwing darts in the dark. Build an eval suite BEFORE you start training:

50-200 held-out test cases, each with input + expected output
Deterministic graders where possible (regex, exact match, JSON-schema validation)
LLM-as-judge for fuzzy outputs (with care — judges have biases)
Run the eval before training (baseline), after each epoch, and after every config change

If your fine-tune doesn’t measurably beat the base model on your eval suite, you’ve gained nothing.
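A deterministic eval harness can be very small. The sketch below grades JSON outputs by field match and compares a baseline against a tuned model. Both "models" are hypothetical stub functions standing in for real inference calls, and the four test cases are invented for illustration.

```python
import json
from typing import Callable

def grade_json_fields(output: str, expected: dict) -> bool:
    """Deterministic grader: output must parse as a JSON object and
    contain every expected field with the expected value."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(
        parsed.get(key) == value for key, value in expected.items()
    )

def run_eval(model: Callable[[str], str], cases: list) -> float:
    """Return the fraction of cases the model passes."""
    passed = sum(grade_json_fields(model(c["input"]), c["expected"]) for c in cases)
    return passed / len(cases)

# Invented held-out cases; a real suite would have 50-200 of these.
CASES = [
    {"input": "remote code execution, no auth", "expected": {"severity": "HIGH"}},
    {"input": "verbose error message", "expected": {"severity": "LOW"}},
    {"input": "SQL injection in login form", "expected": {"severity": "HIGH"}},
    {"input": "missing security header", "expected": {"severity": "LOW"}},
]

def base_model_stub(prompt: str) -> str:
    """Stand-in for the untuned model: always answers HIGH."""
    return '{"severity": "HIGH"}'

def tuned_model_stub(prompt: str) -> str:
    """Stand-in for the fine-tuned model: a trivial keyword rule."""
    low = "header" in prompt or "verbose" in prompt
    return json.dumps({"severity": "LOW" if low else "HIGH"})

baseline_score = run_eval(base_model_stub, CASES)
tuned_score = run_eval(tuned_model_stub, CASES)
print(f"baseline: {baseline_score:.0%}, tuned: {tuned_score:.0%}")
```

In practice you would replace the stubs with calls to the base and fine-tuned models and keep the grader and cases fixed, so the two scores are directly comparable.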

Quantisation — fitting more on less

Model weights are usually stored in 16-bit floating point (fp16 or bf16). You can compress them to 8-bit, 4-bit, or even 2-bit at the cost of some accuracy:

Format	VRAM for weights (7B model)	Approx. quality drop
fp16	~14GB	None (baseline)
int8 / GGUF Q8_0	~7GB	~1%
int4 / GGUF Q4_K_M	~4GB	~3-5%
2-bit	~2.5GB	~10%+ (often not worth it)

4-bit is the sweet spot. A 7B model at Q4_K_M runs on an 8GB Mac M2 (memory gets tight at longer contexts) or a ₹40K consumer GPU.
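The VRAM column follows from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below computes that idealised lower bound for a 7B model; the function name is a hypothetical helper.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Idealised memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4), ("2-bit", 2)]:
    print(f"{name:>5}: {weight_memory_gb(7e9, bits):.2f} GB")
```

This gives 14.00, 7.00, 3.50 and 1.75 GB. Real quantised files come out somewhat larger than the ideal 4-bit and 2-bit numbers because formats store per-block scale factors and often keep some tensors at higher precision. At runtime the KV cache and activations add further memory on top of the weights.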

Serving — vLLM is the answer

Hosting your fine-tuned model:

vLLM: the default open-source inference server. PagedAttention + continuous batching can give 5-10× higher throughput than naive one-request-at-a-time serving.
TGI (Hugging Face): an alternative, generally slightly behind vLLM in throughput
SGLang: newer, very fast for structured-output workloads
llama.cpp: for CPU/Mac; slower, but works without a GPU

Example: serving a base model plus a LoRA adapter with vLLM:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules my-adapter=/path/to/adapter \
  --max-model-len 8192
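Once the server is up, it exposes an OpenAI-compatible HTTP API, and the adapter is selected by passing the name given to `--lora-modules` as the `model` field. The client below is a minimal standard-library sketch. It assumes the server runs locally on vLLM's default port 8000; the helper names and the `max_tokens` and `temperature` values are illustrative choices.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed host and port; adjust as needed

def build_chat_request(model: str, user_text: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,  # adapter name from --lora-modules, or the base model id
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # deterministic output suits classification tasks
    }

def chat(payload: dict, base_url: str = BASE_URL) -> str:
    """POST the payload to the server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

payload = build_chat_request("my-adapter", "Triage this CVE: ...")
# reply = chat(payload)  # uncomment once the vLLM server is running
```

Sending the base model id instead of `my-adapter` in the same payload gets you the untuned model from the same server, which is handy for side-by-side comparisons.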

Cost reality check

Fine-tuning a 7B model on 5K examples with LoRA:

Training: ~4 hours on 1× L4 GPU (RunPod ~₹50/hour) = ₹200 one-time
Storage: adapter is ~100MB, basically free
Inference: 1× L4 at ~₹50/hour is ~₹35K/month for 24×7 hosting and serves ~50 RPS (actual throughput depends heavily on prompt and output length)
Equivalent GPT-4o cost at same volume: ~₹3L/month

If you’re paying ₹3L/month for OpenAI and your task is bounded, fine-tuning + self-hosting saves close to 90%. If you’re paying ₹30K/month, there is no saving at all: the GPU alone costs about as much as the API bill, before you count the operational burden.
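The break-even arithmetic is easy to rerun with your own numbers. The sketch below uses the figures from this section (₹50/hour for the GPU, API bills of ₹3L and ₹30K per month). The function names are hypothetical, and engineering and on-call time are deliberately left out.

```python
def monthly_hosting_cost(rate_per_hour: float, hours_per_day: int = 24, days: int = 30) -> float:
    """Cost of keeping one GPU running around the clock for a month."""
    return rate_per_hour * hours_per_day * days

def savings_fraction(api_cost: float, self_host_cost: float) -> float:
    """Fraction of the API bill saved by self-hosting (negative means it costs more)."""
    return 1 - self_host_cost / api_cost

gpu_monthly = monthly_hosting_cost(50)  # one L4 at ~₹50/hour
high_volume = savings_fraction(300_000, gpu_monthly)
low_volume = savings_fraction(30_000, gpu_monthly)

print(f"GPU hosting:        ₹{gpu_monthly:,.0f}/month")
print(f"vs ₹3L API bill:    {high_volume:.0%} saved")
print(f"vs ₹30K API bill:   {low_volume:.0%} saved")
```

With a 30-day month the GPU costs ₹36,000, which is an 88% saving against a ₹3L API bill and a 20% increase against a ₹30K one.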

Your project for Module 4

Fine-tune a 7-8B model to classify phishing emails:

Get public datasets: the Nazario phishing corpus for phishing and the Enron corpus for legitimate email (~10K each)
Format as instruction-tuning JSONL
LoRA fine-tune Llama-3.1-8B with Unsloth on a RunPod L4 (~₹200, ~4 hours)
Evaluate against 1,000 held-out examples; target 96%+ F1
Serve via vLLM on the same GPU
Benchmark inference cost vs the GPT-4o-mini API on the same eval set
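For the F1 target in the evaluation step, a small scorer is enough. The sketch below computes precision, recall and F1 for the phishing class in plain Python; the label strings and the eight sample predictions are made up for illustration.

```python
def precision_recall_f1(y_true: list, y_pred: list, positive: str = "phishing"):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 4 phishing and 4 legitimate emails.
y_true = ["phishing"] * 4 + ["legit"] * 4
y_pred = ["phishing", "phishing", "phishing", "legit",
          "legit", "legit", "legit", "phishing"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here one phishing email is missed and one legitimate email is flagged, so precision, recall and F1 all come out at 0.75. Reporting precision and recall alongside F1 shows which kind of error dominates.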

Summary

Fine-tune only after prompt engineering + RAG hits a ceiling.
LoRA / QLoRA enable fine-tuning on a single GPU; full fine-tuning is rarely needed.
Data quality dominates; 500 hand-curated > 50,000 scraped.
Eval harness is non-negotiable; baseline before training.
4-bit quantisation is the sweet spot for inference.
vLLM is the default open-source inference server.
Self-hosting can save close to 90% of API cost at high volume; at low volume it’s a burden with no saving.

Module 4 · Fine-tuning & Custom Models