Last updated: April 29, 2026
If you’ve used ChatGPT and felt it was “magic,” you’re missing 80% of the picture. This module strips away the mystery — by the end, you’ll explain LLMs to a non-technical founder, read AI job descriptions critically, and know when AI is the wrong tool.
What is a Large Language Model, really?
An LLM is a neural network trained on a massive amount of text to predict the next token. That’s it. There is no understanding, no consciousness, no reasoning module under the hood. The model has learned statistical patterns: given "The capital of France is", the most probable next token is " Paris".
This sounds reductive, but it explains nearly every behaviour you’ll encounter — including why LLMs hallucinate. They’re optimising for “plausible-looking next token,” not “correct answer.” When the training data is sparse on a topic, the model still produces fluent-looking text, but it’s now educated guessing.
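To make "predict the next token" concrete, here is a minimal sketch that uses a toy word-count model instead of a neural network. The corpus and the function names `train_bigram` and `next_token_probs` are invented for illustration; a real LLM works on subword tokens, conditions on the whole context, and has billions of learned parameters.

```python
from collections import Counter, defaultdict

def train_bigram(corpus: str) -> dict:
    """Count which word follows which in a whitespace-split toy corpus."""
    follows = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def next_token_probs(follows: dict, prev: str) -> dict:
    """Turn raw follow-counts into a probability distribution."""
    counts = follows[prev]
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

# Toy training data (hypothetical): France appears twice, Italy once
corpus = (
    "the capital of France is Paris . "
    "the capital of Italy is Rome . "
    "the capital of France is Paris ."
)
model = train_bigram(corpus)
probs = next_token_probs(model, "is")   # distribution over words seen after "is"
best = max(probs, key=probs.get)        # greedy choice: the most probable next word
print(probs, best)
```

Because this toy only looks at the single previous word, it would also complete "the capital of Italy is" with "Paris": fluent, statistically plausible, and wrong. That is a crude miniature of the hallucination behaviour described above.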
The token — the actual unit
An LLM doesn’t read characters or words. It reads tokens. A token is roughly 4 characters of English text, but the rules are weirder than that:
- "hello" = 1 token
- "ChatGPT" = 2 tokens ("Chat" + "GPT")
- "DPDP" = 1 or 2 tokens depending on the model
- A space matters: " Paris" (with leading space) is often a different token than "Paris"
- Indian-language text typically uses 2-3× more tokens than English for the same content
The process of converting text → tokens is called tokenisation. GPT-4 uses BPE (byte-pair encoding); Claude and Gemini use their own tokenisers with different vocabularies, so the same text produces different token counts on each. The same text can cost ~10% more on Claude than GPT-4 just because of tokenisation differences.
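The core idea behind BPE can be sketched in a few lines: start from single characters and repeatedly merge the most frequent adjacent pair into a new symbol. This is a simplified illustration, not how any production tokeniser is implemented. Real tokenisers work on bytes, apply pre-tokenisation rules, and learn tens of thousands of merges. The helper names and the four-word corpus are assumptions made for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common one."""
    pairs = Counter()
    for symbols in words:
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in one word with the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy version)."""
    words = [list(w) for w in corpus]   # start from individual characters
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = [merge_pair(w, pair) for w in words]
    return merges, words

merges, words = learn_bpe(["low", "lower", "lowest", "slow"], num_merges=2)
print(merges)
print(words)
```

After two merges the frequent fragment "low" has become a single symbol, while rarer endings are still split into characters. The same mechanism is why common English words are often one token and rarer strings are split into several.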
Why this matters in practice
Every API charges by tokens, both input (your prompt) and output (the response). If you’re processing 10,000 customer emails per day at 500 tokens each, that’s 5M tokens/day, or 150M tokens over a 30-day month. At GPT-4o pricing (~$2.50 per 1M input tokens), that’s about $375, or roughly ₹31,000/month at USD-INR 83, just for input. Output tokens cost more and come on top. Get this calculation wrong and the bill quietly eats into your runway.
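The arithmetic is worth putting in a function so you can rerun it whenever volumes or prices change. This is a sketch under stated assumptions: a 30-day month, a USD-INR rate of 83, and input tokens only. The function name and its parameters are illustrative.

```python
def monthly_input_cost_inr(items_per_day, tokens_per_item, usd_per_million,
                           usd_inr=83.0, days=30):
    """Estimate the monthly input-token cost in rupees (output tokens not included)."""
    tokens_per_month = items_per_day * tokens_per_item * days
    cost_usd = tokens_per_month / 1_000_000 * usd_per_million
    return cost_usd * usd_inr

# 10,000 emails/day, 500 tokens each, $2.50 per 1M input tokens
cost = monthly_input_cost_inr(10_000, 500, 2.50)
print(f"₹{cost:,.0f}/month")
```

Changing `usd_per_million` to a budget-tier price shows how sensitive the total is to model choice.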
Context window — the working memory
The context window is how many tokens the model can “see” in a single call. Past conversations, system prompts, retrieved documents — everything must fit. Current sizes:
| Model | Context window |
|---|---|
| GPT-4o | 128K tokens |
| Claude 3.5 Sonnet | 200K tokens |
| Gemini 2.0 | 2M tokens |
| Llama 3.1 70B (open) | 128K tokens |
Bigger windows let you stuff more context (full documents, long conversation history, big code repos) but cost proportionally more per call. Most production apps don’t need 200K — they need 8-16K used wisely.
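One common way to use a small window wisely is to keep the system prompt and only the most recent messages that fit a token budget. The sketch below assumes the rough 4-characters-per-token heuristic instead of a real tokeniser, and the names `estimate_tokens` and `fit_to_budget` are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate using the ~4 characters per token heuristic."""
    return max(1, len(text) // 4)

def fit_to_budget(system_prompt, history, budget_tokens):
    """Keep the system prompt plus the newest messages that fit in the budget."""
    remaining = budget_tokens - estimate_tokens(system_prompt)
    kept = []
    for message in reversed(history):      # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > remaining:
            break                          # older messages are dropped
        kept.append(message)
        remaining -= cost
    return [system_prompt] + kept[::-1]    # restore chronological order

history = ["a" * 400, "b" * 400, "c" * 400]   # about 100 tokens each
context = fit_to_budget("s" * 80, history, budget_tokens=230)
print(len(context))
```

In a real app you would count tokens with the provider's tokeniser and might summarise dropped messages instead of discarding them.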
Embeddings — turning text into geometry
An embedding is a fixed-size vector (1536 numbers for OpenAI’s text-embedding-3-small; other models use other sizes) that represents the meaning of a piece of text. Two texts that are semantically similar produce vectors that are close together in that vector space.
This unlocks:
- Semantic search — find documents by meaning, not keywords
- RAG (Retrieval-Augmented Generation) — fetch the most relevant chunks of your private docs, paste them into the LLM prompt
- Recommendation — “find articles similar to this one”
- Clustering — group customer feedback by topic automatically
Embeddings are cheap. OpenAI’s text-embedding-3-small costs ~$0.02 per 1M tokens — about 100× cheaper than running GPT-4 on the same text.
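"Close together" is usually measured with cosine similarity. The sketch below ranks documents against a query using hand-written 3-dimensional vectors as stand-ins for real embeddings. The vectors and document names are made up; in practice they would come from an embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real ones have hundreds or thousands of dimensions
docs = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "office holiday list": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]   # stands in for "how do I get my money back?"

# Semantic search: sort documents by similarity to the query, best first
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]),
                reverse=True)
print(ranked)
```

The query shares no keywords with "refund policy", yet it ranks first because the vectors point in nearly the same direction. RAG applies the same ranking step to chunks of your own documents.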
Why LLMs hallucinate
An LLM produces fluent-sounding answers even when it doesn’t know. There’s no built-in “I’m uncertain about this” signal. The model picks the most-probable next token, full stop. Hallucinations cluster around:
- Specific numbers — dates, version numbers, statistics, rupee amounts
- Recent events — anything past the training cutoff (GPT-4o cuts off Oct 2023; Claude 3.5 around Apr 2024)
- Citations — paper titles, URLs, ID numbers
- Niche domains — Indian regulations, regional law, company-specific facts
The fix: RAG + grounding. Don’t ask the model what it knows. Give it the relevant document and ask it to answer based on that document only. We’ll cover this deeply in Module 3.
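At its simplest, grounding is a prompt-construction step. The sketch below shows one possible wording; the function name, the `<document>` delimiters, and the refusal sentence are illustrative choices, not a required format for any provider.

```python
def build_grounded_prompt(document: str, question: str) -> str:
    """Wrap a retrieved document and a question into a grounded prompt."""
    return (
        "Answer the question using ONLY the document below. "
        "If the answer is not in the document, reply exactly: "
        "\"Not found in the provided document.\"\n\n"
        f"<document>\n{document}\n</document>\n\n"
        f"Question: {question}"
    )

prompt = build_grounded_prompt(
    "Refunds are processed within 7 days.",   # hypothetical retrieved chunk
    "How long do refunds take?",
)
print(prompt)
```

Giving the model an explicit way to say the answer is missing reduces the pressure to invent one, though it does not eliminate hallucinations.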
When NOT to use AI
“Just use ChatGPT for it” has become the default suggestion for everything. It’s often wrong. Don’t use an LLM when:
- The task is deterministic — extracting a phone number from text? Use regex. AI is overkill at 100× the cost.
- The data must be exact — financial calculations, legal citations, compliance fact-checking. Use code or DB.
- The volume is huge and unit cost matters — classifying 50M log lines per day? Use a small purpose-built classifier, not GPT-4.
- The information is recent — today’s stock price, tomorrow’s flight status. Hit an API.
- Privacy is non-negotiable — patient records, undisclosed M&A info. Run a local model or don’t use AI at all.
The senior engineer’s superpower is knowing when AI is the right tool — and just as importantly, when it isn’t.
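As an illustration of the deterministic case, here is a regex-only phone number extractor. It assumes Indian mobile numbers (10 digits starting with 6-9, optionally prefixed with +91 or 0); the pattern is a sketch for that assumed format and would need adjusting for landlines or other countries.

```python
import re

# Assumed format: 10-digit Indian mobile number, optional "+91" or "0" prefix.
# The lookarounds stop the pattern from matching inside longer digit runs.
PHONE_RE = re.compile(r"(?<!\d)(?:\+91[\s-]?|0)?([6-9]\d{9})(?!\d)")

def extract_phone_numbers(text):
    """Return the 10-digit numbers found in `text`, without any prefix."""
    return PHONE_RE.findall(text)

sample = ("Call me on +91 9876543210 or 09123456789. "
          "Order ID 123456789012 is not a phone.")
numbers = extract_phone_numbers(sample)
print(numbers)
```

This runs locally in microseconds, costs nothing per call, and gives the same answer every time, which is exactly what a deterministic task needs.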
Cost economics — the part nobody tells you
Pricing as of Q2 2026 (subject to change; always check the providers’ current rates):
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Use case |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | High-quality general |
| GPT-4o-mini | $0.15 | $0.60 | Routine classification |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Coding, long-context |
| Claude 3.5 Haiku | $0.80 | $4.00 | Mid-tier general |
| Gemini 2.0 Flash | $0.10 | $0.40 | Cheapest at scale |
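Real-world rule of thumb: a chatbot with 1,000 daily active users sending 5 messages each, with 800-token system prompt + 200-token user message + 400-token reply, comes to 150M input and 60M output tokens over a 30-day month. At USD-INR 83, with no conversation history or prompt caching, that costs roughly:
- GPT-4o: ~₹81,000/month
- GPT-4o-mini: ~₹4,900/month
- Gemini Flash: ~₹3,200/month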
Choosing the wrong model is the difference between a profitable feature and one that bankrupts the runway.
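The chatbot estimate can be reproduced with a short calculation. The sketch below copies prices from the table above and assumes a 30-day month, USD-INR 83, no prompt caching, and no conversation history resent with each call. Different assumptions (fewer active days, cached system prompts, another exchange rate) shift the totals considerably, so treat the output as an order-of-magnitude check.

```python
# (input, output) prices in USD per 1M tokens, copied from the pricing table
PRICES_USD_PER_MILLION = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o-mini": (0.15, 0.60),
    "Gemini 2.0 Flash": (0.10, 0.40),
}

def chatbot_monthly_cost_inr(model, users=1_000, msgs_per_user=5,
                             input_tokens=1_000, output_tokens=400,
                             usd_inr=83.0, days=30):
    """Monthly cost in rupees; input_tokens = 800 system + 200 user by default."""
    in_price, out_price = PRICES_USD_PER_MILLION[model]
    calls = users * msgs_per_user * days
    per_call_usd = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return calls * per_call_usd * usd_inr

costs = {m: chatbot_monthly_cost_inr(m) for m in PRICES_USD_PER_MILLION}
for model, inr in costs.items():
    print(f"{model}: ₹{inr:,.0f}/month")
```

Note that output tokens dominate the GPT-4o bill here even though replies are shorter than prompts, because output is priced at four times the input rate.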
Your project for Module 1
Build a Python script (~30 lines) that:
- Takes any text input
- Tokenises it using OpenAI’s tiktoken library
- Prints how many tokens, characters, and words
- Estimates cost in ₹ for one GPT-4o input call (assume USD-INR 83)
Code skeleton:
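```python
import tiktoken

USD_PER_MILLION_INPUT = 2.50  # GPT-4o input price from the pricing table
USD_INR = 83                  # assumed exchange rate

enc = tiktoken.encoding_for_model("gpt-4o")  # needs a recent tiktoken release
text = input("Paste text: ")
tokens = enc.encode(text)
cost_inr = (len(tokens) / 1_000_000) * USD_PER_MILLION_INPUT * USD_INR
print(f"Tokens: {len(tokens)}")
print(f"Cost (1 call): ₹{cost_inr:.4f}")
```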
Run it on three different texts: a paragraph of English, a paragraph of Hindi/Marathi, and a code snippet. Note how token counts compare to character counts. You’ll see Indian-language text uses ~2.5× more tokens, which changes the economics of any product serving Indian-language users.
Summary
- LLMs predict the next token; that’s it. No magic, no understanding.
- Tokens are the billing unit. ~4 chars per token, but it varies by model and language.
- Context window is your working memory limit — manage it deliberately.
- Embeddings turn text into geometry; they unlock semantic search and RAG.
- Hallucinations cluster around specifics — numbers, citations, recency. Use RAG to ground answers.
- Don’t reach for AI when regex or code does the job 100× cheaper.
- Pick the cheapest model that meets your quality bar; the gap between top and budget tiers can be 30-50× on cost.