---
title: "What Is a Large Language Model (LLM)?"
date: "2026-03-15"
description: "A clear, practical explanation of what Large Language Models are, how they work under the hood — from training to tokenization to inference — and why they matter for software engineers."
tags: ["AI Fundamentals", "LLM"]
---

# What Is a Large Language Model (LLM)?
Large Language Models (LLMs) are the engine behind tools like ChatGPT, Claude, and Gemini. If you're a software engineer in 2026, you've almost certainly used one. But what's actually happening inside?
This article breaks down LLMs from first principles — no PhD required.
## What Is a Language Model?
At its core, a language model is a system that assigns probabilities to sequences of words (or tokens). Given the text "The sky is", a language model might predict:
- "blue" — 40% probability
- "clear" — 25% probability
- "falling" — 2% probability
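To make that concrete, here is a toy bigram model — a minimal sketch, not a real LLM — that estimates next-word probabilities by counting adjacent word pairs in a tiny corpus. The corpus and resulting numbers are made up purely for illustration:

```python
from collections import Counter, defaultdict

# Toy bigram language model: estimate P(next word | previous word) by
# counting adjacent pairs. Corpus is illustrative, not from a real model.
corpus = (
    "the sky is blue . the sky is clear . the sea is blue . "
    "the grass is green ."
).split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_probs(word):
    """Return P(next | word) as a dict, from bigram counts."""
    total = sum(counts[word].values())
    return {w: c / total for w, c in counts[word].items()}

probs = next_word_probs("is")
# "blue" follows "is" in 2 of its 4 occurrences, so P("blue" | "is") = 0.5
```

An LLM does the same thing in spirit — assign a probability to each possible continuation — but with a learned neural network conditioning on thousands of previous tokens instead of one.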
A Large Language Model is simply a language model trained on massive datasets (trillions of tokens) with billions of parameters. Scale changes everything — emergent capabilities like reasoning, code generation, and instruction-following appear at sufficient scale.
## How Are LLMs Trained?
Training an LLM happens in stages:
### 1. Pre-training (Self-supervised Learning)
The model is trained on a huge corpus of text — books, code, web pages, academic papers. The objective is simple:
> Predict the next token given all previous tokens.
This is called causal language modeling. The model sees "The quick brown fox", and must predict "jumps". It gets a score (cross-entropy loss) and adjusts its weights using backpropagation and gradient descent.
After billions of iterations across trillions of tokens, the model internalizes grammar, facts, reasoning patterns, and world knowledge.
### 2. Instruction Tuning (Supervised Fine-Tuning)
Raw pre-trained models are great at text completion, but not great at following instructions. Fine-tuning on high-quality (prompt, response) pairs teaches the model to be a helpful assistant.
### 3. RLHF (Reinforcement Learning from Human Feedback)
Human raters rank model outputs. A reward model is trained on those rankings, and the LLM is then fine-tuned using PPO (Proximal Policy Optimization) to maximize the reward. This aligns the model with human preferences — making it more helpful and less harmful.
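The reward model is commonly trained with a pairwise (Bradley-Terry-style) loss on those rankings; here is a minimal sketch with illustrative reward values:

```python
import math

# Pairwise reward-model loss used in RLHF (a common formulation; sketch).
# Given a human preference "chosen beats rejected", the loss
# -log(sigmoid(r_chosen - r_rejected)) pushes the reward model to
# score the preferred response higher.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen, r_rejected):
    return -math.log(sigmoid(r_chosen - r_rejected))

# When the reward model already agrees with the human ranking, loss is small;
# when it disagrees, loss is large.
good = preference_loss(r_chosen=2.0, r_rejected=-1.0)   # margin +3.0
bad = preference_loss(r_chosen=-1.0, r_rejected=2.0)    # margin -3.0
```

The trained reward model then serves as the objective that PPO maximizes during the final fine-tuning stage.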
## Tokenization
LLMs don't process text character-by-character or word-by-word. They use tokens — subword chunks produced by algorithms like Byte Pair Encoding (BPE).
For example, the word "unbelievable" might be split into: ["un", "believ", "able"]
This lets the model handle rare words by breaking them into familiar subwords, while common words remain a single token. Most LLMs have vocabularies of ~50,000–100,000 tokens.
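A greedy merge procedure in the spirit of BPE can be sketched like this. The merge rules below are hand-written to reproduce the example split; real tokenizers learn them from corpus statistics (and usually operate on bytes rather than characters):

```python
def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with the merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hand-written merge rules (illustrative; real tokenizers learn these).
merges = [("u", "n"), ("a", "b"), ("ab", "l"), ("abl", "e"),
          ("b", "e"), ("be", "l"), ("bel", "i"), ("beli", "e"), ("belie", "v")]

tokens = list("unbelievable")   # start from individual characters
for pair in merges:
    tokens = apply_merge(tokens, pair)
# tokens is now ["un", "believ", "able"]
```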
Context window refers to how many tokens the model can process at once. GPT-4 launched with an 8K-token window; Claude 3 supports up to 200K. Larger context windows mean more information the model can reason over simultaneously.
## The Transformer Architecture
All modern LLMs are built on the Transformer architecture (introduced by Google in "Attention Is All You Need", 2017).
Key components:
- Embeddings — each token is converted to a dense vector (e.g., 4096 dimensions)
- Self-attention — every token can "attend" to every other token, learning relationships
- Multi-head attention — run attention multiple times in parallel (different "heads" catch different relationship types)
- Feed-forward layers — apply non-linear transformations to each token position
- Layer normalization — stabilize training
- Stacking — repeat N times (GPT-3: 96 layers, 175B parameters)
Self-attention is the magic: it lets the model relate "it" back to "the cat" in "The cat sat on the mat. It was fluffy." — regardless of distance.
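Here is scaled dot-product attention for a single head, sketched in NumPy with random illustrative weights. Real models use learned projection matrices, and causal LLMs also apply a mask so tokens cannot attend to future positions (omitted here for brevity):

```python
import numpy as np

# Scaled dot-product self-attention for one head (minimal sketch).
def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # how strongly each token attends to each other token
    # Softmax over each row, computed stably.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))                    # 4 token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
# out.shape == (4, 8): one context-mixed vector per token position
```

Each output row is a blend of all the value vectors, weighted by learned relevance — this is exactly how "it" can pick up information from "the cat" many tokens away.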
## Inference
Once trained, running the model (inference) works like this:
- Your prompt is tokenized
- Each token becomes an embedding vector
- The vectors pass through all transformer layers
- The final layer produces a probability distribution over the vocabulary
- A token is sampled (using temperature / top-p sampling)
- The new token is appended and the process repeats (autoregressive generation)
This is why LLMs generate text one token at a time.
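The loop above can be sketched end to end. Here `fake_model` is a stand-in for a real forward pass, returning made-up logits over a toy vocabulary:

```python
import math, random

# Autoregressive generation with temperature sampling (sketch).
vocab = ["the", "sky", "is", "blue", "."]

def fake_model(tokens):
    # Hypothetical logits; a real LLM computes these from its weights.
    follow = {"the": "sky", "sky": "is", "is": "blue", "blue": "."}
    nxt = follow.get(tokens[-1], "the")
    return [5.0 if w == nxt else 0.0 for w in vocab]

def sample(logits, temperature=1.0):
    # Divide logits by temperature, then softmax and sample.
    scaled = [z / temperature for z in logits]
    exp = [math.exp(z - max(scaled)) for z in scaled]
    probs = [e / sum(exp) for e in exp]
    return random.choices(vocab, weights=probs)[0]

random.seed(0)
tokens = ["the"]
while tokens[-1] != "." and len(tokens) < 10:
    logits = fake_model(tokens)                      # forward pass
    tokens.append(sample(logits, temperature=0.1))   # sample, append, repeat
# With low temperature this almost always yields: the sky is blue .
```

Lowering the temperature sharpens the distribution toward the highest-logit token; raising it flattens the distribution and makes output more varied.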
## Using LLMs in Code
Here's how to call an LLM via the Anthropic SDK in Python:
```python
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Explain the transformer architecture in 3 bullet points."
        }
    ]
)

print(message.content[0].text)
```

And here's a simple streaming example:
```python
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a haiku about neural networks."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```

## Why Do LLMs Hallucinate?
LLMs generate the statistically most likely next token — they don't "know" things in the way humans do. When asked about something outside their training data (or something ambiguous), they confidently generate plausible-sounding text that may be incorrect.
Mitigations:
- RAG (Retrieval-Augmented Generation) — ground the model with retrieved documents
- Tool use — let the model call external APIs or databases
- Temperature tuning — lower temperature = more deterministic, less creative
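To give a flavor of the first mitigation, here is a toy RAG sketch: retrieve the most relevant document by word overlap and prepend it to the prompt so the model answers from provided text rather than memory. Real systems use embedding similarity and a vector store; the documents and scorer here are purely illustrative:

```python
# Toy retrieval-augmented generation (RAG) sketch.
documents = [
    "The transformer architecture was introduced by Google in 2017.",
    "BPE tokenizers split rare words into familiar subwords.",
    "RLHF aligns models with human preferences using a reward model.",
]

def retrieve(query, docs):
    """Return the document sharing the most words with the query."""
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

def build_prompt(question):
    # Ground the model: put the retrieved document in front of the question.
    context = retrieve(question, documents)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("When was the transformer architecture introduced?")
```

The grounded prompt is then sent to the LLM as usual — the model's job shifts from recalling facts to reading them.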
## Summary
| Concept | Description |
|---|---|
| Token | Subword unit; basic unit of LLM input/output |
| Context window | Max tokens the model can process at once |
| Pre-training | Learn from massive text corpora |
| RLHF | Align model to human preferences |
| Inference | Autoregressive token-by-token generation |
| Hallucination | Model generates plausible but incorrect text |
LLMs are remarkable but imperfect. Understanding how they work — not just how to prompt them — makes you a significantly more effective engineer when building AI-powered applications.