Last updated: April 29, 2026

How LLMs actually work — tokenisation, context windows, embeddings, and the cost economics every Indian practitioner needs to know.

If you’ve used ChatGPT and felt it was “magic,” you’re missing 80% of the picture. This module strips away the mystery — by the end, you’ll explain LLMs to a non-technical founder, read AI job descriptions critically, and know when AI is the wrong tool.

What is a Large Language Model, really?

An LLM is a neural network trained on a massive amount of text to predict the next token. That’s it. There is no understanding, no consciousness, no reasoning module under the hood. The model has learned statistical patterns: given "The capital of France is", the most probable next token is " Paris".

This sounds reductive, but it explains nearly every behaviour you’ll encounter — including why LLMs hallucinate. They’re optimising for “plausible-looking next token,” not “correct answer.” When the training data is sparse on a topic, the model still produces fluent-looking text, but it’s now educated guessing.
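To make "predict the next token" concrete, here is a minimal sketch of the final step of a model call: turning raw scores into probabilities and picking a token. The scores in `toy_logits` are made up for illustration; a real model produces one score for every token in its vocabulary, and production systems often sample from the distribution instead of always taking the top token.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical scores for the token that follows "The capital of France is".
toy_logits = {" Paris": 9.0, " Lyon": 5.5, " London": 4.0, " pizza": 0.5}

probs = softmax(toy_logits)
next_token = max(probs, key=probs.get)  # greedy decoding: take the top token
print(repr(next_token), round(probs[next_token], 3))
```

Notice that nothing in this step checks whether the chosen token is true. It only checks which token scores highest.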

The token — the actual unit

An LLM doesn’t read characters or words. It reads tokens. A token is roughly 4 characters of English text, but the rules are weirder than that:

"hello" = 1 token
"ChatGPT" = 2 tokens ("Chat" + "GPT")
"DPDP" = 1 or 2 tokens depending on the model
A space matters — " Paris" (with leading space) is often a different token than "Paris"
Indian-language text typically uses 2-3× more tokens than English for the same content

The process of converting text → tokens is called tokenisation. GPT-4 uses BPE (byte-pair encoding); Claude and Gemini use similar but not identical tokenisers, so the same text yields different token counts on different providers. The same text can cost ~10% more on Claude than GPT-4 just because of tokenisation differences.
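The core idea behind byte-pair encoding can be sketched in a few lines: start from single characters and repeatedly merge the most frequent adjacent pair. This is a toy illustration on a made-up three-word corpus, not the tokeniser any provider actually ships; production tokenisers work on bytes and apply tens of thousands of learned merges.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the commonest."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word -> frequency. Each word starts as a tuple of characters.
corpus = {"low": 5, "lower": 2, "lowest": 3}
words = {tuple(word): freq for word, freq in corpus.items()}

for _ in range(2):  # two merge rounds
    words = merge_pair(words, most_frequent_pair(words))

print(words)
```

After two rounds the frequent fragment "low" has become a single symbol, while the rarer endings are still split into characters. The same mechanism is why common English words tend to be one token and rarer strings get broken into several.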

Why this matters in practice

Every API charges by tokens, for both input (your prompt) and output (the response). If you’re processing 10,000 customer emails per day at 500 tokens each, that’s 5M tokens/day. At GPT-4o pricing (~$2.50 per 1M input tokens), that’s about $12.50/day, or roughly ₹31,000/month at ₹83 per USD, just for input. Get this calculation wrong and you’ve burned a quarter of your runway by month-end.
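It helps to keep this arithmetic in a small function so you can rerun it as volumes and prices change. The sketch below assumes a 30-day month and an exchange rate of ₹83 per USD; both are assumptions, so adjust them to your own numbers.

```python
def monthly_input_cost_inr(items_per_day, tokens_per_item, usd_per_million,
                           usd_to_inr=83.0, days=30):
    """Estimate monthly input-token spend in rupees."""
    tokens_per_day = items_per_day * tokens_per_item
    usd_per_day = tokens_per_day / 1_000_000 * usd_per_million
    return usd_per_day * days * usd_to_inr

# 10,000 emails/day at 500 tokens each, priced at $2.50 per 1M input tokens.
cost = monthly_input_cost_inr(10_000, 500, 2.50)
print(f"~₹{cost:,.0f} per month for input tokens")
```

Output tokens are billed separately and usually at a higher rate, so treat this as a floor, not a total.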

Context window — the working memory

The context window is how many tokens the model can “see” in a single call. Past conversations, system prompts, retrieved documents — everything must fit. Current sizes:

Model	Context window
GPT-4o	128K tokens
Claude 3.5 Sonnet	200K tokens
Gemini 2.0	2M tokens
Llama 3.1 70B (open)	128K tokens

Bigger windows let you stuff more context (full documents, long conversation history, big code repos) but cost proportionally more per call. Most production apps don’t need 200K — they need 8-16K used wisely.
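One common way to use a small window wisely is to keep the system prompt and only the most recent messages that fit a token budget. The sketch below is one possible approach, not a standard API: `estimate_tokens` and `fit_to_budget` are hypothetical helper names, and the token estimate uses the rough 4-characters-per-token rule instead of a real tokeniser.

```python
def estimate_tokens(text):
    """Crude estimate using the ~4 characters per token rule of thumb."""
    return max(1, len(text) // 4)

def fit_to_budget(system_prompt, messages, budget):
    """Keep the newest messages that fit once the system prompt is counted."""
    remaining = budget - estimate_tokens(system_prompt)
    kept = []
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > remaining:
            break  # everything older is dropped as well
        kept.append(message)
        remaining -= cost
    return list(reversed(kept))  # restore chronological order

history = ["a" * 400, "b" * 400, "c" * 400]  # ~100 tokens each
trimmed = fit_to_budget("s" * 200, history, budget=260)
print(len(trimmed), "of", len(history), "messages kept")
```

In a real app you would swap the estimate for an actual token count and might summarise dropped messages instead of discarding them.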

Embeddings — turning text into geometry

An embedding is a fixed-size vector (1536 numbers for OpenAI’s text-embedding-3-small, for example) that represents the meaning of a piece of text. Two texts that are semantically similar produce vectors that are close together in this 1536-dimensional space.
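"Close together" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors so the numbers stay readable; real embeddings come from an embedding model and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" for three pieces of customer feedback.
toy_vectors = {
    "refund not received": [0.9, 0.1, 0.0],
    "where is my money back": [0.8, 0.2, 0.1],
    "app crashes on login": [0.0, 0.2, 0.9],
}
query_vector = [0.85, 0.15, 0.05]  # pretend embedding of "refund status?"

scores = {text: cosine_similarity(query_vector, vec)
          for text, vec in toy_vectors.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
for text in ranked:
    print(f"{scores[text]:.3f}  {text}")
```

The two refund-related texts score close to 1.0 and the login complaint scores near 0, even though none of them share the query's exact words.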

This unlocks:

Semantic search — find documents by meaning, not keywords
RAG (Retrieval-Augmented Generation) — fetch the most relevant chunks of your private docs, paste them into the LLM prompt
Recommendation — “find articles similar to this one”
Clustering — group customer feedback by topic automatically

Embeddings are cheap. OpenAI’s text-embedding-3-small costs ~$0.02 per 1M tokens, which is over 100× cheaper than GPT-4o’s ~$2.50 per 1M input tokens for the same text.

Why LLMs hallucinate

An LLM produces fluent-sounding answers even when it doesn’t know. There’s no built-in “I’m uncertain about this” signal. The model picks the most-probable next token, full stop. Hallucinations cluster around:

Specific numbers — dates, version numbers, statistics, rupee amounts
Recent events — anything past the training cutoff (GPT-4o cuts off Oct 2023; Claude 3.5 around Apr 2024)
Citations — paper titles, URLs, ID numbers
Niche domains — Indian regulations, regional law, company-specific facts

The fix: RAG + grounding. Don’t ask the model what it knows. Give it the relevant document and ask it to answer based on that document only. We’ll cover this deeply in Module 3.
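The grounding half of that fix is mostly prompt assembly. Here is a minimal sketch, assuming you have already retrieved the relevant chunks; `build_grounded_prompt` is a hypothetical helper name and the exact instruction wording is only one reasonable choice.

```python
def build_grounded_prompt(question, chunks):
    """Assemble a prompt that restricts the model to the supplied excerpts."""
    numbered = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer using ONLY the excerpts below. "
        "If the answer is not in them, reply \"I don't know\".\n\n"
        f"Excerpts:\n{numbered}\n\n"
        f"Question: {question}"
    )

prompt = build_grounded_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 14 days of purchase."],  # made-up policy text
)
print(prompt)
```

Numbering the excerpts lets you ask the model to cite which one it used, which makes wrong answers easier to spot.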

When NOT to use AI

“Just use ChatGPT for it” has become the default suggestion for everything. It’s often wrong. Don’t use an LLM when:

The task is deterministic — extracting a phone number from text? Use regex. AI is overkill at 100× the cost.
The data must be exact — financial calculations, legal citations, compliance fact-checking. Use code or DB.
The volume is huge and unit cost matters — classifying 50M log lines per day? Use a small purpose-built classifier, not GPT-4.
The information is recent — today’s stock price, tomorrow’s flight status. Hit an API.
Privacy is non-negotiable — patient records, undisclosed M&A info. Run a local model or don’t use AI at all.
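For the deterministic case in the first item, a regex really is enough. The pattern below is a sketch that assumes 10-digit Indian mobile numbers starting with 6-9, with an optional +91 prefix and no spaces inside the number; adjust it to the formats your data actually contains.

```python
import re

# Assumed format: optional "+91" (plus optional space or hyphen), then 10 digits
# starting with 6-9. Lookarounds stop matches inside longer digit runs.
PHONE = re.compile(r"(?<!\d)(?:\+91[\s-]?)?[6-9]\d{9}(?!\d)")

def extract_phones(text):
    """Return every phone-number-like match in the text, in order."""
    return PHONE.findall(text)

sample = "Call 9876543210 or +91-9123456789; order ID 1234567890123."
phones = extract_phones(sample)
print(phones)
```

This runs in microseconds, costs nothing per call, and gives the same answer every time, which is exactly what you want for extraction tasks like this.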

The senior engineer’s superpower is knowing when AI is the right tool — and just as importantly, when it isn’t.

Cost economics — the part nobody tells you

Pricing as of Q2 2026 (subject to change, always check current):

Model	Input ($/1M tokens)	Output ($/1M tokens)	Use case
GPT-4o	$2.50	$10.00	High-quality general
GPT-4o-mini	$0.15	$0.60	Routine classification
Claude 3.5 Sonnet	$3.00	$15.00	Coding, long-context
Claude 3.5 Haiku	$0.80	$4.00	Mid-tier general
Gemini 2.0 Flash	$0.10	$0.40	Cheapest at scale

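Real-world rule of thumb: a chatbot with 1,000 daily active users sending 5 messages each, with 800-token system prompt + 200-token user message + 400-token reply, uses about 150M input and 60M output tokens over a 30-day month. At the prices above and ₹83 per USD (and without resending conversation history), that costs roughly:

GPT-4o: ~₹81,000/month
GPT-4o-mini: ~₹4,900/month
Gemini Flash: ~₹3,200/month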
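You can reproduce this kind of estimate straight from the pricing table. The sketch below assumes a 30-day month, ₹83 per USD, and that each call sends only the system prompt and the latest user message (no conversation history); change any of those and the totals move accordingly.

```python
PRICES = {  # model: (input $/1M tokens, output $/1M tokens), from the table above
    "GPT-4o": (2.50, 10.00),
    "GPT-4o-mini": (0.15, 0.60),
    "Gemini 2.0 Flash": (0.10, 0.40),
}

def monthly_cost_inr(model, users=1_000, msgs_per_user=5, input_tokens=1_000,
                     output_tokens=400, days=30, usd_to_inr=83.0):
    """Monthly spend in rupees for one model under the stated assumptions."""
    calls = users * msgs_per_user * days
    in_price, out_price = PRICES[model]
    usd = calls * (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return usd * usd_to_inr

for model in PRICES:
    print(f"{model}: ~₹{monthly_cost_inr(model):,.0f}/month")
```

Resending chat history with every call inflates the input side quickly, so measure real token usage before trusting any estimate like this.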

Choosing the wrong model is the difference between a profitable feature and one that bankrupts the runway.

Your project for Module 1

Build a Python script (~30 lines) that:

Takes any text input
Tokenises it using OpenAI’s tiktoken library
Prints how many tokens, characters, and words
Estimates cost in ₹ for one GPT-4o input call (assume USD-INR 83)

Code skeleton:

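import tiktoken

USD_PER_1M_INPUT = 2.50  # GPT-4o input price, $ per 1M tokens
USD_TO_INR = 83          # assumed exchange rate

enc = tiktoken.encoding_for_model("gpt-4o")  # load the tokeniser used by GPT-4o
text = input("Paste text: ")
tokens = enc.encode(text)  # list of integer token IDs
cost_inr = (len(tokens) / 1_000_000) * USD_PER_1M_INPUT * USD_TO_INR
print(f"Tokens: {len(tokens)}")
# TODO: also print the character count and the word count
print(f"Cost (1 call): ₹{cost_inr:.4f}")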

Run it on three different texts: a paragraph of English, a paragraph of Hindi/Marathi, and a code snippet. Note how token counts compare to character counts. You’ll likely see the Indian-language text use noticeably more tokens than the English for the same content (the 2-3× range mentioned earlier), and that multiplier flows straight into your bill.

Summary

LLMs predict the next token; that’s it. No magic, no understanding.
Tokens are the billing unit. ~4 chars per token, but it varies by model and language.
Context window is your working memory limit — manage it deliberately.
Embeddings turn text into geometry; they unlock semantic search and RAG.
Hallucinations cluster around specifics — numbers, citations, recency. Use RAG to ground answers.
Don’t reach for AI when regex or code does the job 100× cheaper.
Pick the cheapest model that meets your quality bar; going by the pricing table, the gap between top and budget tiers can be 25-40× on cost.

Check your understanding

Module Quiz · 20 questions

Pass with 80%+ to mark this module complete. Unlimited retries. Each question shows an explanation.

Module 1 · AI Foundations — Tokens, Context & Cost