Claude AI Infrastructure: GPUs, Context Windows, Scaling, and Inference Systems

Manish Garg, Associate of (ISC)² · RingSafe
May 17, 2026

Introduction

The infrastructure behind Claude (GPUs, inference engines, networking, the prompt-caching layer) shapes what you can do with it and what it costs. Engineers who understand that infrastructure make better deployment choices. Anthropic's stack is proprietary, so this article maps the publicly known serving principles that apply to Claude and comparable frontier models.

Background

LLM inference infrastructure matured rapidly between 2023 and 2026. The first generation (Hugging Face pipelines, vanilla PyTorch serving) gave way to optimized inference engines (vLLM, TGI, TensorRT-LLM, SGLang) with continuous batching, paged attention, and prefix caching. Anthropic, OpenAI, and Google built proprietary serving stacks on similar principles.

By 2026, the public cloud GPU stack is dominated by NVIDIA H100, H200, and B200 for frontier model serving. AMD MI300X and TPU v5p have meaningful share. Self-hosted serving has become viable for many enterprises.

Theory & Concepts

Inference vs training. Training computes the weights once; inference uses them on every request. A training run is the larger one-off cost, but for production workloads inference is the dominant ongoing spend.

Prefill vs decode. Prefill processes all input tokens in one parallel pass (compute-bound, fast per token). Decode generates output tokens one at a time (sequential, bound by memory bandwidth and latency). The two phases call for different optimisation strategies.

KV cache. The stored key and value tensors from attention over previous tokens. They are reused when generating each next token, which is why per-token latency after the first token is much lower than time to first token.
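A minimal sketch of the idea, assuming a single attention head and a stand-in `project` function in place of learned K/V projections (both are illustrative, not how any production model is implemented). It counts how many projections are needed with and without a cache and checks that the outputs match.

```python
import math

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def project(x, scale):
    """Hypothetical stand-in for a learned K or V projection: scales the input."""
    return [scale * xi for xi in x]

class KVCache:
    """Keeps the K and V vectors of every token seen so far."""
    def __init__(self):
        self.keys, self.values = [], []
        self.projections = 0  # number of tokens projected so far

    def append(self, x):
        self.keys.append(project(x, 0.5))
        self.values.append(project(x, 2.0))
        self.projections += 1

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

# Without a cache: every step re-projects the whole prefix.
recomputed = 0
no_cache_out = []
for t in range(len(tokens)):
    keys = [project(x, 0.5) for x in tokens[: t + 1]]
    values = [project(x, 2.0) for x in tokens[: t + 1]]
    recomputed += t + 1
    no_cache_out.append(attend(tokens[t], keys, values))

# With a cache: each token is projected once and reused afterwards.
cache = KVCache()
cache_out = []
for x in tokens:
    cache.append(x)
    cache_out.append(attend(x, cache.keys, cache.values))

print(recomputed, cache.projections)
```

Without the cache the projection work grows quadratically with sequence length; with it, the work grows linearly, at the price of holding the cache in GPU memory.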

Continuous batching. Inference server technique: add new requests to an in-flight batch as old ones finish. 5-10× throughput improvement over static batching.
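The scheduling difference can be sketched with a toy simulation. It assumes each request needs a known number of decode steps and that one step advances every running request by one token; the function names and request lengths are illustrative only.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """A finished request's slot is refilled from the queue before the next step."""
    queue = list(lengths)
    running = []  # remaining decode steps for each in-flight request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        steps += 1
        # Advance every request by one token; drop the ones that just finished.
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lengths, 4))
print(continuous_batching_steps(lengths, 4))
```

On this workload the static scheduler needs 200 decode steps and the continuous one 110, because short requests no longer wait for the longest member of their batch. The gain in a real server depends on the length distribution and on memory limits that this toy model ignores.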

Paged attention. vLLM’s KV cache management innovation, modelled on OS virtual memory. Enables high-throughput serving with many concurrent requests.
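The virtual-memory analogy can be illustrated with a toy allocator. This is a sketch of the bookkeeping only, with a hypothetical `BlockAllocator` class; it is not vLLM's implementation and stores no tensors.

```python
class BlockAllocator:
    """Toy paged KV allocator: fixed-size blocks plus a per-sequence block table."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids not in use
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # all owned blocks are full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("a")  # 17 tokens need two 16-token blocks
alloc.append_token("b")      # a second sequence takes a third block
print(alloc.tables, len(alloc.free))
```

Because a sequence only ever holds the blocks it has actually filled, memory is not reserved up front for the maximum context length, which is what lets many concurrent requests share one GPU.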

Speculative decoding. Use a small fast model to propose N tokens; the big model verifies in parallel. 2-3× faster generation at same quality.
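A minimal sketch of greedy speculative decoding, assuming two hypothetical deterministic "models" (`target_next` and `draft_next`) that stand in for the large and small networks. In a real system the verification loop is a single batched forward pass of the large model; here it is written as a plain loop.

```python
def target_next(context):
    """Hypothetical large model: deterministic next token from the context."""
    return (sum(context) * 31 + len(context)) % 50

def draft_next(context):
    """Hypothetical small model: agrees with the target except at every 4th position."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 50

def greedy_generate(prompt, num_tokens):
    """Baseline: one target-model pass per generated token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out

def speculative_generate(prompt, num_tokens, k=4):
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < num_tokens:
        # 1. The draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(k):
            token = draft_next(ctx)
            proposal.append(token)
            ctx.append(token)
        # 2. The target model checks all k positions (one pass in a real system).
        target_passes += 1
        ctx = list(out)
        for token in proposal:
            expected = target_next(ctx)
            ctx.append(expected)      # always keep the target's own token
            if token != expected:
                break                 # discard the rest of the proposal
        else:
            ctx.append(target_next(ctx))  # all accepted: one bonus token
        out = ctx
    return out[:len(prompt) + num_tokens], target_passes

tokens, passes = speculative_generate([1, 2, 3], 20)
print(passes, tokens == greedy_generate([1, 2, 3], 20))
```

The output is identical to plain greedy decoding by construction, which is the "same quality" property; the saving comes from needing fewer large-model passes than generated tokens. The actual speed-up depends on how often the draft model's guesses are accepted.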

Technical Deep Dive

GPU memory math. A 70B-parameter model at FP16 needs ~140 GB for weights alone (70 billion parameters × 2 bytes). The KV cache for long contexts adds substantially more. This is why frontier models require H100 (80 GB), H200 (141 GB), or B200 (up to 192 GB) class accelerators, and often multi-GPU sharding (tensor parallelism).
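The arithmetic can be written out as a small sketch. The layer count, KV-head count, and head dimension below are assumptions matching the published Llama 3.1 70B shape (80 layers, 8 KV heads, head dimension 128); the sizes of Claude models are not public.

```python
def weight_gb(params_billions, bytes_per_param):
    """Weights only: 1e9 params x bytes per param, expressed in GB (1e9 bytes)."""
    return params_billions * bytes_per_param

def kv_cache_gb(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """KV cache for `tokens` positions; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9

print(weight_gb(70, 2))                   # FP16 weights
print(weight_gb(70, 1))                   # 8-bit weights
print(kv_cache_gb(80, 8, 128, 32768))     # one 32K-token sequence at FP16
```

Under these assumptions a single 32K-token sequence needs roughly 10.7 GB of KV cache on top of 140 GB of FP16 weights, so concurrency, context length, and precision all trade against each other within the same memory budget.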

Frontier-model serving. Models like Claude Opus likely run on multi-GPU clusters with tensor + pipeline parallelism. The model is split across GPUs; each request flows through the partition.

Anthropic’s stack. Proprietary, but the principles match the open-source state of the art: optimized attention kernels, continuous batching, paged-style KV management, prompt caching, speculative decoding.

Prompt caching mechanics. When a prompt has a cacheable prefix (a system prompt or shared RAG context), the KV cache for that prefix is computed once and reused across requests that share the prefix. Reads of the cached portion are charged at ~10% of the normal input-token rate; the first request that writes the cache pays a premium over the normal rate.
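A rough cost sketch makes the effect concrete. The 10% read rate comes from the text above; the 1.25× write multiplier, the $1 per million tokens price, and the assumption that every request after the first hits the cache are illustrative and should be checked against current pricing.

```python
def input_cost_usd(prefix_tokens, suffix_tokens, requests, price_per_mtok,
                   use_cache, read_mult=0.10, write_mult=1.25):
    """Input-token cost for `requests` calls sharing one prefix.

    Assumes the first request writes the cache and all later ones read it.
    """
    if not use_cache:
        tokens = (prefix_tokens + suffix_tokens) * requests
    else:
        tokens = (prefix_tokens * write_mult                      # one cache write
                  + prefix_tokens * read_mult * (requests - 1)    # cache reads
                  + suffix_tokens * requests)                     # uncached user input
    return tokens * price_per_mtok / 1_000_000

uncached = input_cost_usd(10_000, 200, 1_000, 1.0, use_cache=False)
cached = input_cost_usd(10_000, 200, 1_000, 1.0, use_cache=True)
print(round(uncached, 4), round(cached, 4))
```

With a 10,000-token prefix and 200-token user turns, the cached run costs about 12% of the uncached one in this model. Real savings are lower when the cache expires between requests.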

Batching APIs. Anthropic's Message Batches API queues requests for asynchronous processing at half the per-token rate, with results returned within a 24-hour processing window. The provider can pack batches efficiently, amortising fixed costs.
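A sketch of preparing a batch, assuming a document-classification workload. The helper `build_batch_requests`, the prompt wording, and the `doc-N` identifiers are hypothetical; the submission call is left commented out so the example runs without network access or an API key.

```python
def build_batch_requests(documents, model="claude-haiku-4-5", max_tokens=256):
    """Return one batch entry per document, each keyed by a unique custom_id."""
    return [
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [
                    {"role": "user", "content": f"Classify this document:\n\n{doc}"}
                ],
            },
        }
        for i, doc in enumerate(documents)
    ]

requests = build_batch_requests(["Invoice for March services", "Password reset request"])
print(len(requests), requests[0]["custom_id"])

# Submitting needs the anthropic package and ANTHROPIC_API_KEY:
# import anthropic
# batch = anthropic.Anthropic().messages.batches.create(requests=requests)
```

The `custom_id` is what lets you match results back to inputs, since batch results are not guaranteed to arrive in submission order.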

Practical Implementation

For self-hosted serving (the comparison point), a minimal vLLM setup:

pip install vllm

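# Assumes two GPUs with enough combined memory for FP16 weights plus KV cache
# (for example 2x 141 GB); on 80 GB cards, raise --tensor-parallel-size or quantise.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --max-num-seqs 256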

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 256
    }'

For Claude (managed):

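import anthropic

# Placeholder inputs for illustration; substitute your own prompt and query.
# Prompts shorter than the model's minimum cacheable length are not cached.
LARGE_SYSTEM_PROMPT = "You are a support assistant. <long, stable instructions go here>"
user_input = "How do I reset my password?"

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-haiku-4-5",
    # Mark the stable system prompt as a cacheable prefix.
    system=[{
        "type": "text",
        "text": LARGE_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": user_input}],
    max_tokens=512,
)

print(response.content[0].text)
# usage reports cache writes and cache reads separately from regular input tokens.
print(response.usage)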

Enterprise Use Cases

High-volume chat (millions of queries/day). Prompt caching is mandatory. Without it, system-prompt cost dominates. With it, the cached portion is nearly free.

Long-document analysis. Long-context tiers (Opus 4.7 1M tokens). Prefill cost dominates; decoding is short. Different optimization profile than chat.

Agentic workflows. Many short calls per task. Latency matters; first-token latency is the user experience. Streaming is essential.

Batch processing. Overnight document processing, classification, eval runs. Batch API is the cost-optimal path.

Real-time decisioning. Sub-second latency requirements. Haiku tier; streaming; possibly local caching of common responses.

Cybersecurity Perspective

Infrastructure security considerations:

Network isolation. Only the AI gateway should reach Claude’s endpoint. Egress controls at the firewall level.

Credential management. API keys in a secret manager (Vault, AWS Secrets Manager). Rotated regularly. Never in code or environment files committed to git.

Data in transit. TLS everywhere. Anthropic enforces it; verify your client doesn’t downgrade.

Logging discipline. Logs of prompts and responses are sensitive. They contain user inputs (potentially PII). Same controls as the source data — encryption at rest, access logging, retention policy.
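One way to reduce what ends up in logs is to redact obvious identifiers before writing them. The sketch below assumes two simple regular expressions for emails and phone numbers; they are illustrative and will miss many PII formats, so they complement rather than replace the storage controls above.

```python
import re

# Deliberately simple patterns; real PII detection needs broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s-]{8,}\d")

def redact(text):
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(redact("Mail a.b@example.com or call +91 98765 43210"))
```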

Provider security. Verify Anthropic’s SOC 2, ISO 27001, and your specific contract terms — zero-retention, processor agreement, breach-notification SLA.

For self-hosted alternatives. GPU operations security: container hardening, model weight integrity (checksum verification, signed weights), network policies.
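Checksum verification of model weights can be sketched with the standard library. The function names are hypothetical, and the expected digest is assumed to come from a trusted channel such as the publisher's signed release notes.

```python
import hashlib
import hmac

def sha256_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so multi-GB weight shards fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_weights(path, expected_hex):
    """Return True only if the file's SHA-256 matches the expected digest."""
    return hmac.compare_digest(sha256_file(path), expected_hex.lower())
```

Running `verify_weights` on each shard before loading it turns silent corruption or tampering into a startup failure. A checksum alone does not prove who produced the file; that is what signed weights add.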

Performance & Scaling

Latency targets.

  • First-token: under 1s for chat UX
  • Output throughput: 30-60 tokens/sec for streaming chat readability
  • End-to-end for typical chat: 2-5s
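These targets relate to each other through simple arithmetic. The sketch below assumes a steady decode rate and ignores network and queueing time; the 150-token reply length is an illustrative assumption.

```python
def end_to_end_seconds(ttft_s, output_tokens, tokens_per_sec):
    """Time to first token plus steady-state decode time for the reply."""
    return ttft_s + output_tokens / tokens_per_sec

# A 150-token reply with 0.8 s to first token at 50 tokens/sec.
print(end_to_end_seconds(0.8, 150, 50))
```

Under these assumptions the reply completes in about 3.8 s, inside the 2-5 s range. With streaming, the user starts reading after the first-token latency rather than after the full duration.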

Throughput. Provider-side tier limits scale with usage and contract. For very high concurrency, dedicated capacity is negotiable.

Cost optimization. Prompt caching (10× reduction on cached tokens). Multi-tier routing (Haiku for simple, Opus for hard). Batch API (50% reduction for non-urgent). Token caps (prevent runaway generation).
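How routing and batching combine can be sketched as a blended-rate calculation. The prices ($1 and $15 per million tokens) and traffic shares are hypothetical placeholders, not published rates; only the 50% batch discount comes from the text above.

```python
def monthly_cost_usd(tokens_millions, price_small, price_large, share_small,
                     batch_share=0.0, batch_discount=0.5):
    """Blend two tiers by traffic share, then discount the batched fraction."""
    blended = share_small * price_small + (1 - share_small) * price_large
    realtime = tokens_millions * (1 - batch_share) * blended
    batched = tokens_millions * batch_share * blended * (1 - batch_discount)
    return realtime + batched

all_large = monthly_cost_usd(1_000, 1.0, 15.0, share_small=0.0)
routed = monthly_cost_usd(1_000, 1.0, 15.0, share_small=0.8)
routed_batched = monthly_cost_usd(1_000, 1.0, 15.0, share_small=0.8, batch_share=0.5)
print(round(all_large), round(routed), round(routed_batched))
```

With these placeholder numbers the bill falls from about $15,000 to $3,800 with routing and to $2,850 once half the traffic moves to batch. The levers multiply, so the order in which you adopt them matters less than adopting all that apply.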

Self-hosted comparison. vLLM on H100s achieves ~2000 tokens/sec aggregate for Llama 3.1 70B at 8-bit. Cost per million tokens depends on GPU utilisation; well-tuned deployments are 30-50% cheaper than managed APIs at scale. Operations overhead is real.

Real-World Examples

Indian SaaS: 200M tokens/day on Claude Haiku via the Message Batches API for non-urgent classification, plus Sonnet for real-time chat. Total monthly bill ~$8K. Pre-optimisation it was $32K.

Indian fintech: self-hosted Llama 3.1 70B on a 4× H100 cluster colocated in Mumbai. Inference cost ~₹2.5/M tokens (compute-only, excluding ops). Used for PII-touching workloads only; Claude handles non-PII.

US health-tech: hybrid. Claude for the clinician-facing summarisation; self-hosted Llama for low-priority offline transcription. Latency-tier routing.

Future Implications

Three trends to watch:

  1. Inference costs continue to drop. GPU performance per dollar improves ~30%/year; algorithmic improvements (speculative decoding, MoE, quantisation) add another factor.
  2. India compute capacity grows. Mumbai/Hyderabad GPU regions and India-DC providers (Yotta, E2E, NxtGen) build out. Self-hosting becomes operationally easier.
  3. Edge inference. Small but capable models (3-8B params) running on consumer hardware for latency-sensitive or privacy-sensitive tasks.

RingSafe Analysis

Three observations from production engagements:

  1. Prompt caching is the cheapest performance lever. Most teams either don't use it or use it incorrectly. Half a day of engineering to use it correctly pays for itself in weeks.
  2. Self-hosted is now a real choice, not just an aspiration. For Indian enterprises with DPDP exposure, the math is workable on commodity GPU. The open question is the operational cost, not the compute.
  3. Multi-tier routing beats single-tier every time. Workload classifiers that route 80% of traffic to Haiku save more than any other optimisation.
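A workload classifier can start as a few heuristics. The sketch below is an assumption-heavy illustration: the keyword list, the word-count threshold, and the tier labels "small" and "large" are all hypothetical, and production routers are usually trained or tuned on real traffic.

```python
# Hypothetical signals that a request needs the larger tier.
HARD_HINTS = ("prove", "refactor", "architecture", "multi-step", "legal")

def choose_tier(prompt, max_simple_words=200):
    """Send long or hint-matching prompts to the large tier, the rest to the small one."""
    text = prompt.lower()
    if len(text.split()) > max_simple_words or any(h in text for h in HARD_HINTS):
        return "large"
    return "small"

print(choose_tier("What are your opening hours?"))
print(choose_tier("Refactor this service into a cleaner architecture."))
```

A heuristic like this is easy to audit and cheap to run. Measuring how often the small tier's answers are escalated or rejected shows whether the thresholds are set correctly.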

Key Takeaways

  • Claude’s infrastructure is GPU clusters with continuous batching, paged-style KV management, prompt caching, speculative decoding.
  • Prompt caching reduces cached-token cost by ~90%. Mandatory for high-volume workloads.
  • Self-hosted (vLLM on H100s) is now viable; trades managed cost for ops cost.
  • Multi-tier routing typically halves API spend.
  • For India, self-hosted on Mumbai GPU is workable for DPDP-sensitive workloads.

Conclusion

Claude’s infrastructure is opaque from the API side, but the principles are public. Engineers who understand prompt caching, batching, tier routing, and KV cache mechanics make deployment choices that ship cheaper and faster. The infrastructure is mature; the engineering question is now optimisation discipline.

Deep dive: RingSafe’s AI Practitioner Path for production infrastructure patterns.

FAQ

Q: What GPUs run Claude?
A: Anthropic doesn’t publish specifics. Frontier-model serving in 2026 typically uses NVIDIA H100/H200/B200 clusters with tensor parallelism.

Q: How does prompt caching work?
A: Static prompt prefixes are KV-cached on the provider side. Subsequent requests reusing the prefix pay ~10% of normal token cost.

Q: Is self-hosting Claude possible?
A: No — Anthropic doesn’t release weights. For self-hosting, use Llama, Mistral, or Qwen.

Q: What’s the cost difference between managed Claude and self-hosted Llama?
A: Self-hosted on H100s at high utilisation is ~30-50% cheaper than equivalent managed APIs at scale, before ops overhead.
