In 2023 self-hosting an LLM required a $10K GPU. In 2026 a 7B-parameter model runs at conversational speed on a Mac M2, a Ryzen with 32GB RAM, or a $400 used Tesla P40. The capability gap between a self-hosted Llama-3.1-8B-Instruct and GPT-4o-mini is small enough for most internal use cases. This module is the practical setup guide: pick hardware, pick runtime, pick model, expose an API, and harden the deployment.
Hardware budget — what you actually need
For a 7B model in 4-bit quantisation (the common practical choice), plan for 8GB of VRAM as a minimum and 16GB as the comfortable target. A used RTX 3060 12GB ($230) or an M2 Mac Mini 16GB ($600) handles it well. For 13B models: 16GB+ VRAM, for example an RTX 4070 Ti Super (the plain RTX 4070 Ti has only 12GB) or an M3 Pro. For 70B models: dual GPU or M2 Ultra (192GB unified memory). For 120B-405B: serious infrastructure ($20K+). Speed: 7B-Q4 on an RTX 3060 ≈ 50 tokens/sec (faster than reading speed); the same model on an M2 ≈ 30 tok/s. The point where self-hosting beats the API on cost depends on volume. For an internal team of 50 doing 100 queries/day each, a self-hosted Mac Mini pays for itself in about 4 months versus the GPT-4o-mini API.
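The sizing rules above can be checked with a little arithmetic. The sketch below estimates weight memory plus KV cache and compares it with the available VRAM. The layer count, KV-head count, and head size are the published Llama-3.1-8B values; the fp16 cache and the 1 GB runtime headroom are assumptions, and the function names are hypothetical.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(context_tokens: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV cache size in GB: one key and one value vector per layer per token.

    Defaults follow the Llama-3.1-8B config; an fp16 cache is assumed.
    """
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 1e9


def fits(vram_gb: float, params_billion: float, bits_per_weight: float,
         context_tokens: int, headroom_gb: float = 1.0):
    """Return (fits?, estimated GB needed). headroom_gb is an assumed runtime overhead."""
    need = (weights_gb(params_billion, bits_per_weight)
            + kv_cache_gb(context_tokens)
            + headroom_gb)
    return need <= vram_gb, round(need, 2)


if __name__ == "__main__":
    # An 8B model at a nominal 4 bits per weight with an 8K context on an 8 GB card.
    print(fits(8, 8, 4, 8192))
```

Treat the result as a first filter only: real runtimes add their own buffers, and quantised formats use slightly more than their nominal bit width.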
Three runtimes — pick the right one
Ollama is the easiest. Run curl -fsSL https://ollama.com/install.sh | sh, then ollama pull llama3.1:8b, and you are done. It ships a REST API on port 11434. Best for: developers, quick prototyping, single-user. Limitation: no built-in batching, mediocre throughput.
llama.cpp is the lowest-level option. It is written in C++, builds on Apple Silicon and CUDA, and uses the GGUF model format. Best for: edge deployment, embedded devices, when you need full control. Compile: git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && cmake -B build && cmake --build build --config Release. Metal is enabled by default on Apple Silicon; add -DGGML_CUDA=ON to the first cmake command for NVIDIA. Older guides use make LLAMA_METAL=1 or LLAMA_CUBLAS=1, but current releases have replaced the Makefile build and those flag names.
vLLM is production-grade: PagedAttention, continuous batching, OpenAI-compatible API. Best for: serving multiple concurrent users at scale. Install and run with pip install vllm; vllm serve meta-llama/Llama-3.1-8B-Instruct. Throughput is 5-10× Ollama at high concurrency.
Choosing a model in 2026
For general English chat: Llama-3.1-8B-Instruct or Mistral-Small-Instruct. Both are excellent and openly licensed (Llama community license; Mistral Apache). For code: Qwen-2.5-Coder-7B or DeepSeek-Coder-V2-Lite. For Indian languages: Sarvam-2B (Bharti-trained, supports 10 Indian languages), Krutrim-Spectre-2B. For RAG: any 7B+ instruct model handles it; embedding models matter more (BGE-M3, Jina-v3). For agents: Llama-3.3-70B or Qwen-2.5-72B if you have the hardware; smaller models struggle with multi-step planning. Quantisation: Q4_K_M is the 80/20 default, with minimal quality loss at roughly 30% of the FP16 memory footprint (a little over half of Q8). Use Q8 if quality is critical and FP16 if you want to fine-tune.
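To make the quantisation trade-off concrete, the sketch below compares weight memory across formats for an 8B model. The bits-per-weight figures are approximate effective values assumed for illustration (real GGUF files vary a little by architecture), and the names are hypothetical.

```python
# Approximate effective bits per weight. These are assumed round figures,
# not exact values for any particular model file.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5, "FP16": 16.0}


def footprint_gb(params_billion: float) -> dict[str, float]:
    """Weight memory in GB (10^9 bytes) for each format, excluding KV cache."""
    return {name: params_billion * bits / 8 for name, bits in BITS_PER_WEIGHT.items()}


if __name__ == "__main__":
    sizes = footprint_gb(8)
    for name, gb in sizes.items():
        share = gb / sizes["FP16"]
        print(f"{name:7s} {gb:5.2f} GB  ({share:.0%} of FP16)")
```

The KV cache and runtime buffers come on top of these numbers, so leave a margin when matching a format to a card.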
Production hardening checklist
Before exposing your LLM to anything beyond localhost:
(1) Reverse proxy with TLS: caddy or nginx in front of Ollama/vLLM.
(2) Authentication: Ollama has no auth by default. Have the proxy check Authorization: Bearer tokens before forwarding.
(3) Rate limiting: per-user token quotas, enforced at the proxy or API gateway. vLLM's server accepts an --api-key, but do not rely on it for per-user quotas.
(4) Input length cap: reject requests over 32K tokens to prevent context-flood DoS.
(5) Output length cap: set max_tokens=4096 to bound response time.
(6) Logging: NEVER log full prompts in production (PII). Log hashes + token counts.
(7) Network: bind to localhost or a private VPC address, never 0.0.0.0 on a public interface. Most leaked self-hosted LLM endpoints in 2025 were this mistake.
(8) Monitoring: Prometheus exporter; alert on token-per-second collapse (model crash) or queue depth spike (being scraped).
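Item (6) is easy to get wrong, so here is a minimal sketch of a log record that carries a keyed hash of the prompt and the token counts but never the prompt text. The key, field names, and function name are hypothetical; the token counts are assumed to come from the serving layer's usage data rather than being computed here.

```python
import hashlib
import hmac
import json
import time

# Hypothetical secret; in practice load it from a secret store, not source code.
LOG_HASH_KEY = b"replace-with-a-secret-from-your-vault"


def log_record(user_id: str, prompt: str, prompt_tokens: int, completion_tokens: int) -> str:
    """Build a JSON log line with a keyed prompt hash and token counts only."""
    # A keyed hash (HMAC) makes it harder to confirm guesses of short prompts
    # than a bare SHA-256 would.
    digest = hmac.new(LOG_HASH_KEY, prompt.encode("utf-8"), hashlib.sha256).hexdigest()
    record = {
        "ts": int(time.time()),
        "user": user_id,
        "prompt_hash": digest,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
    }
    return json.dumps(record)


if __name__ == "__main__":
    print(log_record("user-42", "Summarise the attached payroll file", 7, 120))
```

Identical prompts still produce identical hashes, which is useful for spotting replayed or scraped requests without storing the content.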
Cost comparison and break-even
GPT-4o pricing in 2026: $2.50/M input tokens, $10/M output. A modest internal-tools deployment (50 users × 100 queries/day × 1500 tokens average) comes to ~225M tokens/month. API cost: ~$1,400/mo. Self-hosted: $600 hardware (one-time, M2 Mini) + $20/mo electricity = $840 in the first year, $240/year after. Break-even comes within a month at this scale. Where API still wins: bursty workloads (model sits idle 95% of time), need for GPT-4o quality on edge cases, multilingual at scale (better support than open models for low-resource Indian languages). Where self-hosted wins: predictable steady volume, sensitive data (DPDP, regulatory), latency-critical (<100ms first token).
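The break-even arithmetic can be reproduced in a few lines. The prices, user counts, and hardware cost are the figures quoted above; the even split between input and output tokens and the 30-day month are assumptions chosen because they land close to the quoted ~$1,400/mo. The first-year figure computed here is hardware plus twelve months of electricity.

```python
USERS = 50
QUERIES_PER_USER_PER_DAY = 100
TOKENS_PER_QUERY = 1500
DAYS_PER_MONTH = 30            # assumption

INPUT_PRICE_PER_M = 2.50       # USD per million input tokens
OUTPUT_PRICE_PER_M = 10.00     # USD per million output tokens
INPUT_SHARE = 0.5              # assumption: half the tokens are input

HARDWARE_USD = 600
ELECTRICITY_USD_PER_MONTH = 20

monthly_tokens = USERS * QUERIES_PER_USER_PER_DAY * TOKENS_PER_QUERY * DAYS_PER_MONTH
blended_price_per_m = INPUT_SHARE * INPUT_PRICE_PER_M + (1 - INPUT_SHARE) * OUTPUT_PRICE_PER_M
api_monthly_usd = monthly_tokens / 1e6 * blended_price_per_m

first_year_self_hosted_usd = HARDWARE_USD + 12 * ELECTRICITY_USD_PER_MONTH
break_even_months = HARDWARE_USD / (api_monthly_usd - ELECTRICITY_USD_PER_MONTH)

if __name__ == "__main__":
    print(f"tokens/month:        {monthly_tokens:,}")
    print(f"API cost/month:      ${api_monthly_usd:,.2f}")
    print(f"self-hosted, year 1: ${first_year_self_hosted_usd:,}")
    print(f"break-even:          {break_even_months:.2f} months")
```

Changing INPUT_SHARE shows how sensitive the comparison is to the workload: an input-heavy RAG pipeline costs far less on the API than a generation-heavy one at the same token volume.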
Hands-on: Ollama → OpenAI-compatible API in 5 minutes
Step 1: curl -fsSL https://ollama.com/install.sh | sh.
Step 2: ollama pull llama3.1:8b-instruct-q4_K_M.
Step 3: ollama serve (runs on :11434; skip this if the installer already started Ollama as a background service).
Step 4: test with curl http://localhost:11434/api/chat -d '{"model":"llama3.1:8b-instruct-q4_K_M","messages":[{"role":"user","content":"hello"}],"stream":false}'. The model name must match the tag pulled in Step 2, otherwise Ollama reports the model as missing. For OpenAI compatibility, Ollama also exposes /v1/chat/completions in the OpenAI request and response shape, so existing OpenAI SDKs work with base_url="http://localhost:11434/v1".
Step 5: harden with Caddy: caddy reverse-proxy --from :8443 --to :11434, then add basic auth in a Caddyfile.
Step 6: point your existing app at http://localhost:8443 and you have replaced your OpenAI dependency with a private LLM in 5 minutes.
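For readers who prefer not to install an SDK, the OpenAI-compatible endpoint can be called with the standard library alone. This is a sketch under the assumptions stated in the steps above: Ollama listening on localhost:11434 and the llama3.1:8b-instruct-q4_K_M tag already pulled. The function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible prefix


def build_chat_request(model: str, user_message: str,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a chat completion request in the OpenAI shape."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, user_message: str) -> str:
    """Send the request and return the assistant text. Needs a running server."""
    request = build_chat_request(model, user_message)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("llama3.1:8b-instruct-q4_K_M", "hello"))
```

Because the request shape is the OpenAI one, switching this client to a vLLM server should only require changing BASE_URL and the model name.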
Step-by-step: 30-minute Ollama install + first secure setup
Mac/Linux: curl -fsSL https://ollama.com/install.sh | sh (review the script first; never blind-pipe curl in 2026). Pull a small model: ollama pull llama3.2:3b. Test: ollama run llama3.2:3b "Explain prompt injection in one paragraph". By default, Ollama serves on localhost:11434 with no authentication. That is safe only while the port stays local; the moment someone rebinds it to reach it remotely, it becomes a production incident waiting to happen. Lock it down: (1) keep it bound to localhost (OLLAMA_HOST=127.0.0.1:11434) and never set 0.0.0.0 on a cloud VM; (2) put it behind a reverse proxy (Caddy or nginx) with bearer-token auth and rate limiting; (3) explicitly firewall port 11434 from public IPs; (4) for production GPU servers, use vLLM with proper auth instead of Ollama. The internet is full of unauthenticated Ollama instances on public IPs being abused by random visitors. Search Shodan to confirm; you will see thousands. Do not be one of them.
When local LLMs make sense vs API — the decision matrix
Six factors govern the build-vs-buy decision for LLMs. (1) Data sensitivity: regulated PII, source code, M&A docs — local. Public marketing copy — API is fine. (2) Volume: under 100k requests/day — API is cheaper. Above 1M/day — local breakeven likely positive. (3) Latency: under 500ms target with consistent SLA — local with co-located GPU. Latency-tolerant batch jobs — API. (4) Capability: bleeding-edge reasoning — frontier APIs (GPT-4o, Claude 3.5, Gemini 2.0). Routine summarisation/extraction — Llama 3.3 / Qwen 2.5 are fine. (5) Compliance: DPDP cross-border data flow restrictions, RBI data-localisation, EU AI Act high-risk classification — local often forced. (6) Engineering bandwidth: a self-hosted GPU stack needs 0.25-1 FTE of platform engineering ongoing. Most early-stage teams should start with API + abstract behind LiteLLM, then migrate sensitive workloads to local as scale and budget allow. The hybrid model is the dominant 2026 architecture.
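Some teams encode a matrix like this as a routing rule so that each workload gets a consistent answer. The sketch below covers four of the six factors using the thresholds quoted above. The order in which the factors are checked, the class and field names, and the "benchmark both" outcome for mid-range volume are all assumptions, not part of the matrix itself.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    sensitive_data: bool            # regulated PII, source code, M&A docs
    residency_required: bool        # DPDP / RBI / EU AI Act constraints
    needs_frontier_reasoning: bool  # bleeding-edge reasoning quality
    requests_per_day: int


def choose_backend(w: Workload) -> str:
    """Return 'local', 'api', or 'benchmark both' for a workload."""
    # Assumed priority: hard constraints first, then capability, then volume.
    if w.sensitive_data or w.residency_required:
        return "local"
    if w.needs_frontier_reasoning:
        return "api"
    if w.requests_per_day < 100_000:
        return "api"
    if w.requests_per_day > 1_000_000:
        return "local"
    return "benchmark both"


if __name__ == "__main__":
    print(choose_backend(Workload(False, False, False, 20_000)))
    print(choose_backend(Workload(True, False, True, 20_000)))
```

Keeping the rule in code next to an abstraction layer such as LiteLLM makes the later migration of individual workloads a configuration change rather than a rewrite.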
Choosing the right base model in 2026 — open-source landscape map
A practitioner's guide for late 2026. General-purpose chat: Llama 3.3 70B (Meta, Llama community licence, which carries restrictions), Qwen 2.5 72B (Alibaba, Apache 2.0), Mistral Large (Mistral, commercial licence). Code generation: DeepSeek-Coder-V2, Qwen 2.5 Coder, CodeLlama 70B. Reasoning-heavy: DeepSeek R1 family (open MIT), QwQ 32B (Apache). Small / on-device: Llama 3.2 3B, Phi-4 (Microsoft), Gemma 2 9B (Google). Vision-language: Llama 3.2 Vision, Qwen 2.5 VL, Pixtral. Multilingual / Indic: AI4Bharat IndicTrans2 / IndicBERT, Krutrim (Ola), Sarvam-1, OpenHathi (Sarvam-1 fine-tunes are gaining traction for Hindi/Tamil/Bengali). How to pick: (1) match licence to use case: Llama has commercial restrictions for very large users, Apache/MIT models do not; (2) match size to hardware: 70B (quantised) on an A100 80GB, 8-13B on a single 24GB GPU, 3B on a laptop; (3) test on YOUR eval set, not generic benchmarks; benchmark numbers are gamed. Indian-language use cases especially benefit from local models, because frontier APIs are weaker on Hindi reasoning than on English. Updates: the open model landscape moves fast; revisit choices every 6 months. Subscribe to "Hugging Face Daily Papers" and "Latent Space" for signal.
Hardware and command reference — local LLM operations cheat sheet
Hardware sizing in 2026. Phone / laptop / dev: 8 GB RAM = run 1B-3B models (Llama 3.2 1B, Phi-3-mini); 16 GB = 7B-8B comfortably; 32 GB = 13B; 64 GB unified Apple Silicon = 30B-70B at 4-bit. Single GPU: RTX 4090 / 5090 (24-32 GB) = 13B at 8-bit or around 30B at 4-bit (13B fp16 needs about 26 GB for weights alone, and 70B at 4-bit needs about 40 GB, so it runs only with partial CPU offload); A10 (24 GB) cloud = same as a 24 GB consumer card; A100 80 GB = 70B at 4-bit with plenty of room (8-bit is a tight fit, and fp16 at about 140 GB does not fit on one card) or multiple parallel 13Bs; H100 = 70B quantised with batching headroom. Multi-GPU: 2× A100 / 2× H100 = 70B with full context + batching; 8× H100 = ready for 405B Llama if you really need it.
Cost reference (2026 cloud rates): A10 ~$0.50-1/hr, A100 ~$1.50-3/hr, H100 ~$2.50-5/hr (Lambda Labs / RunPod / Modal cheaper than AWS / GCP).
Ollama essentials: ollama list, ollama pull qwen2.5:7b-instruct, ollama run qwen2.5:7b-instruct "explain CSRF", ollama serve (REST API on :11434), OLLAMA_HOST=127.0.0.1:11434 OLLAMA_KEEP_ALIVE=10m ollama serve.
vLLM essentials: pip install vllm; vllm serve meta-llama/Llama-3.2-3B-Instruct --host 127.0.0.1 --port 8000 --api-key sk-yourkey.
llama.cpp: ./llama-server -m model.gguf --port 8080 -c 8192 -ngl 99.
Quantisation cheat: Q4_K_M = best general-purpose; Q5_K_M = quality bump for ~25% more RAM; Q8_0 = near-lossless; fp16 = baseline.
Common failure modes: OOM (reduce batch size or context, or quantise further); slow inference (verify the GPU is actually used: nvidia-smi); wrong tokenizer (a mismatch between model and tokenizer files breaks output silently); model loading hangs (a firewall is blocking the Hugging Face download; once the weights are in the local cache, set HF_HUB_OFFLINE=1).
Bench your setup: vllm bench throughput --model your-model --num-prompts 100 --input-len 512 --output-len 128. Recent vLLM releases group benchmarks under subcommands (latency, throughput, serve); check vllm bench --help for the options in your version.
Helpful references to keep around: github.com/ollama/ollama/blob/main/docs/api.md, github.com/vllm-project/vllm/tree/main/examples.
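For a quick speed check on Ollama without a separate benchmark tool, the final response of a non-streaming /api/chat or /api/generate call reports eval_count (generated tokens) and eval_duration (in nanoseconds). The sketch below turns those two fields into tokens per second; the sample values are made up for illustration and the function name is hypothetical.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama final response.

    eval_count is the number of generated tokens and eval_duration is the
    generation time in nanoseconds.
    """
    return response["eval_count"] * 1e9 / response["eval_duration"]


if __name__ == "__main__":
    # Made-up numbers shaped like the relevant part of a real response.
    sample = {"eval_count": 290, "eval_duration": 5_800_000_000}
    print(f"{tokens_per_second(sample):.1f} tok/s")
```

This measures a single stream only. Throughput under concurrent load is a different number and is what the vLLM benchmark is for.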
FAQ
Will a self-hosted Llama be as good as GPT-4o?
For most chat, summarisation, classification, and code-completion tasks: yes, within 5-10% of GPT-4o quality. For complex reasoning, agentic planning, and very long contexts (>128K tokens): GPT-4o and Claude still lead. Test on YOUR specific tasks — benchmarks are a poor proxy for production fit.
Can I run Llama-3-70B on a MacBook?
A 70B model at Q4 is roughly 40GB of weights, so it does not fit in 32GB. With 64GB+ unified memory (M2 Max or a newer Max-class chip): yes at Q4 quant, slowly (~5 tok/s). M2 Ultra 128GB+: yes, comfortably. Other Macs: no. For 70B on PC, you need 48GB+ VRAM (dual used RTX 3090, ~$1,400) or 64GB system RAM with CPU inference (very slow).
Is self-hosting an LLM compliant with DPDP?
It can be, and it is easier than with an API. Data never leaves your infrastructure, there is no cross-border transfer issue, and you control retention and erasure. The compliance burden shifts to standard infosec controls (encryption, access logs, breach reporting) rather than vendor management.
What hardware do I need for a useful local LLM?
For experimentation: any laptop with 16 GB RAM runs 3B-8B models via Ollama. For production-quality 70B-class models: 2× A100 / H100 (~$30k each new, or roughly $3-10/hr for the pair at the cloud rates quoted above). For mid-tier 7B-13B production: a single A10 (~$0.5-1/hr cloud) is usually sufficient. Consumer hardware (RTX 4090 / 5090) runs 13B-30B comfortably when quantised and is dramatically cheaper for non-cloud experimentation.
How does Ollama compare to vLLM for production?
Ollama: developer-friendly, single-user, no batching, roughly 30-50 tok/s for a single stream of a 7B Q4 model on a commodity GPU. vLLM: production-grade, continuous batching for 5-10× higher throughput at high concurrency, OpenAI-compatible API, the standard choice for self-hosting at scale. Use Ollama for laptops and dev; switch to vLLM once you serve more than one user.
⚖️ Legal: Use AI security techniques only on systems you own or have explicit written authorisation to test. In India, unauthorised access is punishable under IT Act §66 (up to 3 years + fine). Pair AI red-teaming with signed Statement of Work or Rules of Engagement before testing.