Understanding LLMs and Transformers: A Developer's Guide

ai · machine-learning · transformers · 2025-12-04 · 11 min read

Table of Contents

  1. The Big Picture: What LLMs Actually Do
  2. From Words to Numbers: Embeddings
  3. Tokenization: Breaking Text Into Pieces
  4. The Transformer Architecture
  5. Attention: The Key Innovation
  6. How LLMs Are Trained
  7. From Base Model to ChatGPT: Fine-tuning
  8. Practical Implications for Developers
  9. Resources for Deeper Learning

1. The Big Picture: What LLMs Actually Do

At its core, a Large Language Model does one thing: predict the next word.

That's it. Given a sequence of words, it estimates the probability of what word comes next. But this simple task, when scaled up with billions of parameters and trained on vast amounts of text, produces surprisingly capable systems.

The Core Mechanic

Think of autocomplete on your phone keyboard. When you type "I'm running", your keyboard might suggest "late" or "fast" or "out". An LLM does the same thing, but:

  1. It considers much more context (thousands of words, not just a few)
  2. It's been trained on billions of pages of text
  3. It uses a sophisticated architecture (the Transformer) to understand relationships between words

How Responses Are Generated

When you ask ChatGPT a question, it doesn't "know" the answer in the way you might think. Instead:

  1. Your input becomes the beginning of a text sequence
  2. The model predicts the most likely next word
  3. That word is added to the sequence
  4. The model predicts the next word after that
  5. Repeat until the response is complete

This is called autoregressive generation—each new word depends on all the words that came before.
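
In code, the loop looks roughly like the sketch below. Note that `model` and `tokenizer` are hypothetical stand-ins for any causal language model interface, and real systems usually sample from the probability distribution (controlled by a temperature setting) rather than always taking the single most likely token:

```python
# Minimal sketch of greedy autoregressive generation.
# `model` and `tokenizer` are hypothetical stand-ins, not a specific library's API.
def generate(model, tokenizer, prompt, max_new_tokens=100):
    tokens = tokenizer.encode(prompt)              # text -> list of token IDs
    for _ in range(max_new_tokens):
        probs = model.next_token_probs(tokens)     # one probability per vocabulary entry
        next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
        tokens.append(next_token)                  # the prediction becomes context
        if next_token == tokenizer.eos_token_id:   # model signals "response complete"
            break
    return tokenizer.decode(tokens)
```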

The Surprising Emergent Behavior

Here's what's remarkable: a model trained simply to predict text, at sufficient scale, develops what appear to be reasoning capabilities, world knowledge, and even something that looks like understanding. Whether this constitutes "real" intelligence is a philosophical debate, but the practical capabilities are undeniable.


2. From Words to Numbers: Embeddings

Computers work with numbers, not words. So how do we represent words mathematically?

The Naive Approach: One-Hot Encoding

You could represent each word as a one-hot vector: a vector as long as the vocabulary, with a 1 at that word's position and 0s everywhere else:

  • "cat" = [1, 0, 0, ...]
  • "dog" = [0, 1, 0, ...]
  • "running" = [0, 0, 1, ...]
  • etc.

But this has a fatal flaw: every vector is orthogonal to every other, so no relationships are captured. Nothing tells us that "cat" and "dog" are similar (both animals) while "running" is different (a verb).

The Breakthrough: Word Embeddings

Word embeddings represent each word as a vector (a list of numbers, typically 100-1000 dimensions). The key insight: words that appear in similar contexts should have similar vectors.

For example, "king" and "queen" appear in similar contexts, so their vectors are close together in the high-dimensional space. Same with "Paris" and "France".

The Famous Example

Word embeddings capture semantic relationships mathematically:

vector("king") - vector("man") + vector("woman") ≈ vector("queen")

This works because the vectors encode the relationships between concepts. The "royalty" direction plus the "female" direction gives you "queen".
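
The toy NumPy example below illustrates the idea with made-up 3-dimensional vectors (real embeddings have hundreds of learned dimensions, but the arithmetic is the same):

```python
import numpy as np

# Toy embedding arithmetic with invented 3-D vectors.
# Dimensions loosely mean (royalty, male, female) -- for illustration only.
king  = np.array([0.9, 0.8, 0.1])
man   = np.array([0.1, 0.8, 0.1])
woman = np.array([0.1, 0.1, 0.9])
queen = np.array([0.9, 0.1, 0.9])

result = king - man + woman            # remove "male", add "female"

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(result, queen))           # 1.0 -- the nearest vector is "queen"
```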

Word2Vec: How It's Learned

Word2Vec (2013) was a breakthrough algorithm that learns embeddings by:

  1. Skip-gram: Given a word, predict its context words
  2. CBOW (Continuous Bag of Words): Given context words, predict the center word

Through training on millions of text examples, the model learns vectors that capture semantic meaning.
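
To make the skip-gram objective concrete, here's a small sketch of how (center, context) training pairs are extracted from a sentence, with a window of 2 words on each side:

```python
# Sketch of skip-gram training-pair extraction: for each center word,
# every word within the window becomes a context word to predict.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "cat", "sat", "on", "the", "mat"]))
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat'), ...]
```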

In Modern LLMs

Today's LLMs don't use static Word2Vec embeddings. Instead, they learn contextual embeddings—the same word gets different vectors depending on its context. "Bank" in "river bank" vs "bank account" gets different representations.


3. Tokenization: Breaking Text Into Pieces

Before text enters an LLM, it must be broken into discrete units called tokens. This isn't as simple as splitting on spaces.

Why Not Just Use Words?

Several problems with word-level tokenization:

  1. Vocabulary explosion: English has 100,000+ words. Add misspellings, names, technical terms, and you're at millions
  2. Out-of-vocabulary: What happens with new words or typos?
  3. Morphology: "run", "running", "runner" share meaning but would be separate tokens

Subword Tokenization

Modern LLMs use subword tokenization—breaking words into meaningful pieces:

  • "unbelievable" → ["un", "believ", "able"]
  • "tokenization" → ["token", "ization"]

This balances vocabulary size with expressiveness.

Byte-Pair Encoding (BPE)

BPE is the most common algorithm, used by GPT models:

  1. Start with individual characters as tokens
  2. Find the most frequent adjacent pair
  3. Merge that pair into a new token
  4. Repeat until you reach the desired vocabulary size (typically 50,000-100,000 tokens)

Example evolution:

  • Characters: "l", "o", "w", "e", "r", "n", "s", "t"...
  • After merges: "low", "er", "new", "est"...
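
Here's a toy implementation of the merge loop, using the classic "low / lower / newest / widest" corpus (the naive string replace is fine for this illustration, though real tokenizers are more careful):

```python
from collections import Counter

# One BPE merge step: count adjacent symbol pairs, merge the most frequent.
def most_frequent_pair(words):
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in words.items()}

# Toy corpus: word -> frequency, symbols separated by spaces.
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(3):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(pair, "->", list(words))
# ('e', 's'), then ('es', 't'), then ('l', 'o') -- "est" emerges as a token
```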

Practical Implications

Understanding tokenization helps you:

  1. Estimate costs: APIs charge per token, not per word
  2. Understand limits: Context windows are measured in tokens (e.g., 128k for GPT-4 Turbo)
  3. Debug issues: Some words tokenize unexpectedly, affecting model behavior

A rough heuristic: 1 token ≈ 4 characters or 100 tokens ≈ 75 words in English.
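
You can check token counts for your own text with OpenAI's open-source tiktoken library; cl100k_base is the encoding used by GPT-4-era models:

```python
import tiktoken  # OpenAI's tokenizer library: pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-4-era models
text = "Understanding tokenization helps you estimate costs."
tokens = enc.encode(text)
print(len(tokens), "tokens for", len(text), "characters")
print(enc.decode(tokens) == text)            # round-trips back to the original text
```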


4. The Transformer Architecture

The Transformer, introduced in the 2017 paper "Attention Is All You Need," revolutionized NLP. Before it, models processed text sequentially (one word at a time). Transformers process all words simultaneously, enabling massive parallelization.

The Basic Structure

A Transformer has two main parts:

  1. Encoder: Reads and understands the input
  2. Decoder: Generates the output

For pure language models like GPT, only the decoder is used. For translation tasks, both are used.

The Processing Pipeline

Here's what happens to your input:

Input Text
    ↓
Tokenization (text → token IDs)
    ↓
Embedding (token IDs → vectors)
    ↓
Positional Encoding (add position information)
    ↓
Multiple Transformer Layers
    ↓
Output Probabilities (for each possible next token)

Inside a Transformer Layer

Each layer has:

  1. Multi-Head Attention: Lets tokens look at and gather information from other tokens
  2. Feed-Forward Network: Processes each token's representation independently
  3. Residual Connections: Add the layer's input to its output (helps with training)
  4. Layer Normalization: Stabilizes the values
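
Putting those four pieces together, here's a minimal PyTorch sketch of one layer (a simplified pre-norm variant; real implementations add dropout, causal masking, and other details):

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """Minimal pre-norm decoder-style layer: a sketch, not a production design."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),   # feed-forward expands...
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),   # ...then projects back down
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)       # self-attention: Q, K, V from same sequence
        x = x + attn_out                       # residual connection around attention
        x = x + self.ff(self.norm2(x))         # residual connection around feed-forward
        return x

layer = TransformerLayer()
out = layer(torch.randn(1, 10, 512))           # (batch, sequence, embedding dim)
print(out.shape)                               # torch.Size([1, 10, 512])
```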

Modern LLMs stack many of these layers:

  • GPT-2: 12-48 layers
  • GPT-3: 96 layers
  • GPT-4: likely 100+ layers (not publicly confirmed)

Positional Encoding: Adding Word Order

Since Transformers process all tokens at once, they don't inherently know word order. "The cat sat on the mat" and "mat the on sat cat the" would look identical.

Positional encoding solves this by adding position information to each embedding. There are different approaches:

  1. Sinusoidal: Original paper used sine/cosine functions of different frequencies
  2. Learned: Let the model learn position embeddings
  3. RoPE (Rotary Position Embedding): Modern approach used by Llama and others
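
Here's a sketch of the original sinusoidal approach, which gives every position a unique pattern of sine and cosine values that gets added to the token embeddings:

```python
import numpy as np

# Sinusoidal positional encoding from the original Transformer paper:
#   PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
#   PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

pe = positional_encoding(seq_len=6, d_model=8)
print(pe.shape)  # (6, 8) -- one vector per position, added to each embedding
```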

5. Attention: The Key Innovation

Attention is the mechanism that makes Transformers work. It's how the model decides which words are relevant to which other words.

The Intuition

Consider: "The animal didn't cross the street because it was too tired."

What does "it" refer to? A human easily knows it's "the animal" (not "the street"). The attention mechanism allows the model to make this same connection—when processing "it", the model attends strongly to "animal".

Query, Key, Value

Attention uses three vectors for each token:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information do I provide?"

The process:

  1. Each token's Query is compared to all tokens' Keys
  2. Higher matches → higher attention scores
  3. Attention scores determine how much of each token's Value to incorporate

The Math (Simplified)

Attention(Q, K, V) = softmax(Q × K^T / √d) × V

Where:

  • Q × K^T: Dot product gives similarity scores
  • √d: Scaling factor (d = dimension size)
  • softmax: Converts to probabilities (sum to 1)
  • × V: Weight the values by attention scores
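
The whole formula fits in a few lines of NumPy. This sketch runs it on a toy sequence of 3 tokens with dimension 4:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)    # similarity between queries and keys
    weights = softmax(scores)        # each row sums to 1
    return weights @ V               # weighted mix of values

# 3 tokens, dimension 4, random toy values
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(attention(Q, K, V).shape)      # (3, 4): one updated vector per token
```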

Multi-Head Attention

Instead of one attention mechanism, Transformers use multiple "heads" (typically 8-64). Each head can focus on different types of relationships:

  • Head 1 might focus on syntactic relationships
  • Head 2 might focus on semantic similarity
  • Head 3 might focus on coreference (like the "it" → "animal" example)

The outputs from all heads are concatenated and projected back to the original dimension.

Self-Attention vs Cross-Attention

  • Self-Attention: Tokens attend to other tokens in the same sequence
  • Cross-Attention: Tokens in one sequence attend to tokens in another (used in encoder-decoder models)

Causal Masking

In language models, there's a constraint: when predicting the next word, you can only look at previous words, not future ones. This is enforced through causal masking—attention scores to future positions are set to -∞ (becoming 0 after softmax).
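
The mask itself is just a matrix of -∞ values above the diagonal, added to the attention scores before the softmax:

```python
import numpy as np

# Causal mask for 4 tokens: position i may attend only to positions <= i.
seq_len = 4
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)  # -inf above the diagonal
print(mask)
# Adding this to the attention scores before softmax zeroes out attention
# to future positions (softmax turns -inf into a weight of 0).
```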


6. How LLMs Are Trained

Training an LLM happens in stages, each with different goals and techniques.

Stage 1: Pre-training

Goal: Learn language patterns and world knowledge from vast amounts of text.

Process:

  1. Gather training data (web pages, books, code—often 1+ trillion tokens)
  2. For each text, predict the next token given previous tokens
  3. Compare prediction to actual next token (cross-entropy loss)
  4. Update model weights via backpropagation
  5. Repeat billions of times
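
In code, the core objective (steps 2-4) looks roughly like this sketch, with random tensors standing in for a real model's output and real text:

```python
import torch
import torch.nn.functional as F

# Sketch of the pre-training objective: at every position, predict the
# next token and score the prediction with cross-entropy loss.
vocab_size, seq_len = 1000, 8
logits = torch.randn(1, seq_len, vocab_size)      # model output (stand-in)
tokens = torch.randint(vocab_size, (1, seq_len))  # the actual text (stand-in)

# Shift by one: logits at position t predict the token at position t+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss)  # lower loss = better next-token predictions; backprop updates weights
```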

Resources: This is extremely expensive—GPT-4 training likely cost $100M+ in compute.

Result: A "base model" that can complete text but isn't aligned with user intent. Ask it a question and it might continue the question rather than answer it.

Stage 2: Supervised Fine-Tuning (SFT)

Goal: Teach the model to follow instructions and have conversations.

Process:

  1. Create a dataset of (prompt, ideal response) pairs
  2. Human contractors write or edit responses
  3. Fine-tune the model to produce these responses

Result: The model learns the format of helpful responses.

Stage 3: Reinforcement Learning from Human Feedback (RLHF)

Goal: Further align the model with human preferences.

Process:

  1. Generate multiple responses to the same prompt
  2. Human raters rank responses (best to worst)
  3. Train a "reward model" to predict human preferences
  4. Use reinforcement learning (PPO algorithm) to optimize the LLM against this reward model
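
The reward model in step 3 is commonly trained with a pairwise preference loss: push the score of the human-preferred response above the rejected one. A minimal sketch, where `reward_model` is a hypothetical function returning one scalar score per response:

```python
import torch.nn.functional as F

# Pairwise (Bradley-Terry style) loss for training the reward model:
# the preferred response should score higher than the rejected one.
def reward_loss(reward_model, chosen, rejected):
    r_chosen = reward_model(chosen)      # scalar score per preferred response
    r_rejected = reward_model(rejected)  # scalar score per rejected response
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```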

Result: A model that produces responses humans tend to prefer.

Recent Developments

  • Constitutional AI (Anthropic): Uses AI feedback in addition to human feedback
  • DPO (Direct Preference Optimization): Simpler alternative to RLHF
  • DeepSeek-R1: Showed chain-of-thought reasoning can emerge from pure RL

7. From Base Model to ChatGPT: Fine-tuning

Understanding the progression from base model to chat model helps explain both capabilities and limitations.

Base Models vs Instruction-Tuned Models

Base Model (e.g., GPT-3 base):

  • Trained only to predict next tokens
  • Completes text in the style of training data
  • Not naturally helpful—might continue your question with more questions
  • Still contains all the knowledge, just doesn't know how to use it conversationally

Instruction-Tuned Model (e.g., ChatGPT):

  • Fine-tuned to follow instructions
  • Knows to answer questions, not continue them
  • Refuses harmful requests
  • Better at structured outputs

The Chat Format

Modern chat models are trained on conversations with specific roles:

System: You are a helpful assistant.
User: What is the capital of France?
Assistant: The capital of France is Paris.

The model learns to continue the conversation appropriately for each role.
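
In API terms, the conversation is passed as a list of role-tagged messages. Here's a sketch in the style of the OpenAI Python client (the model name is illustrative; other providers use very similar formats):

```python
# Role-tagged messages as most chat APIs expect them.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)
print(response.choices[0].message.content)
```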

Fine-Tuning for Specific Tasks

Companies fine-tune base models for their needs:

  1. Domain adaptation: Train on legal documents, medical texts, etc.
  2. Task-specific: Train for code generation, summarization, etc.
  3. Safety tuning: Additional training to prevent harmful outputs

LoRA and Efficient Fine-Tuning

Full fine-tuning is expensive (all parameters updated). Techniques like LoRA (Low-Rank Adaptation) make it practical:

  • Freeze original weights
  • Add small trainable matrices (typically 0.1% of parameters)
  • Train only these additions
  • Effective for many use cases at a fraction of the cost
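
Here's a minimal sketch of the idea applied to a single linear layer (libraries like HuggingFace's peft apply this across a whole model for you):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of LoRA: a frozen base layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                   # freeze original weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # B=0: starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the scaled low-rank correction
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"{trainable} trainable of {total} total parameters")  # 8,192 of 270,848
```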

8. Practical Implications for Developers

Understanding how LLMs work informs how to use them effectively.

Prompt Engineering

Why prompts matter:

  • The model predicts what text would likely follow your prompt
  • Better prompts create contexts where useful responses are likely
  • "Few-shot" examples work because the model continues the established pattern

Effective techniques:

  • Chain-of-thought: "Let's think step by step"
  • Few-shot examples: Show the format you want
  • Role prompting: "You are an expert in..."
  • Structured output: Request JSON, markdown tables, etc.
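
For example, a few-shot prompt simply demonstrates the pattern and lets the model continue it:

```python
# A few-shot prompt: the examples establish a pattern the model continues.
prompt = """Classify the sentiment of each review as positive or negative.

Review: "Absolutely loved it, would buy again."
Sentiment: positive

Review: "Broke after two days. Waste of money."
Sentiment: negative

Review: "Exceeded my expectations in every way."
Sentiment:"""
# The most likely continuation is " positive" -- the model follows the pattern.
```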

Understanding Limitations

Hallucinations: The model generates plausible-sounding but false information because:

  • It predicts likely text, not necessarily true text
  • Training data contains errors
  • The model has no mechanism to verify facts

Context windows: Limited by architecture and training:

  • Attention is O(n²) in sequence length
  • Longer contexts → higher costs and latency
  • Information retrieval degrades with very long contexts

Tokenization quirks:

  • Arithmetic often fails (numbers split into unexpected token chunks)
  • Some languages tokenize inefficiently
  • Code indentation can consume many tokens

When to Use Different Approaches

Use LLMs directly when:

  • Tasks benefit from language understanding
  • Exact accuracy isn't critical
  • Creative/generative outputs needed

Augment with tools when:

  • Calculations needed (use code interpreter)
  • Current information required (use search/retrieval)
  • Structured data operations (use databases)

Consider alternatives when:

  • Simple pattern matching suffices
  • Deterministic behavior required
  • Cost sensitivity is high

The Future: Where Things Are Heading

Current trends:

  • Longer context windows: 100k+ tokens becoming standard
  • Multimodal: Text + images + audio + video
  • Smaller, more efficient models: Llama 3, Phi, etc.
  • Better reasoning: Chain-of-thought, tree search, etc.
  • Agents: LLMs as orchestrators of tools and workflows

9. Resources for Deeper Learning

3Blue1Brown's Neural Networks Series (Free, ~4 hours total) https://www.3blue1brown.com/topics/neural-networks

The best visual explanations available:

  • "Large Language Models explained briefly" - Quick intro
  • "Transformers (Chapter 5)" - Visual architecture walkthrough
  • "Attention in transformers (Chapter 6)" - The key mechanism explained
  • "How LLMs Store Facts (Chapter 7)" - Deep dive into MLPs

Andrej Karpathy's YouTube Videos (Free) https://karpathy.ai/

  • "Deep Dive into LLMs like ChatGPT" (3.5 hours) - Comprehensive technical overview
  • "How I Use LLMs" (2 hours) - Practical applications
  • "Intro to Large Language Models" (1 hour) - Accessible introduction

Blog Posts (Print-Friendly for Reading)

The Illustrated Transformer by Jay Alammar https://jalammar.github.io/illustrated-transformer/ The classic visual explanation of Transformers. Features clear diagrams and step-by-step breakdowns. This is probably the single best resource for understanding the architecture.

The Illustrated GPT-2 by Jay Alammar https://jalammar.github.io/illustrated-gpt2/ Builds on the Transformer article to explain GPT specifically.

What Is ChatGPT Doing and Why Does It Work? by Stephen Wolfram https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/ Deep philosophical and technical exploration. Longer read but excellent for building intuition.

Books

"Build a Large Language Model (From Scratch)" by Sebastian Raschka Hands-on PyTorch implementation. Best for developers who want to understand by building.

"Hands-On Large Language Models" by Jay Alammar and Maarten Grootendorst Practical guide using HuggingFace, LangChain, etc. Good for applying LLMs in projects.

"What Is ChatGPT Doing and Why Does It Work?" by Stephen Wolfram Book version of the essay, expanded with additional material.

Interactive Resources

HuggingFace Tokenizer Playground Explore how different tokenizers break down text

Transformer Explainer (Georgia Tech) Interactive visualization of attention patterns

Academic Papers (If You Want to Go Deeper)

"Attention Is All You Need" (2017) - The original Transformer paper https://arxiv.org/abs/1706.03762

"Language Models are Few-Shot Learners" (2020) - GPT-3 paper

"Training Language Models to Follow Instructions" (2022) - InstructGPT paper


Quick Reference: Key Terms

| Term | Definition |
|------|------------|
| Token | The basic unit of text the model works with (roughly 4 characters) |
| Embedding | A vector representation of a token capturing its meaning |
| Attention | Mechanism allowing tokens to exchange information |
| Context window | Maximum tokens the model can process at once |
| Parameters | The model's learned weights (GPT-3: 175B, GPT-4: ~1T estimated) |
| Fine-tuning | Additional training on specific data/tasks |
| Prompt | The input text given to generate a response |
| Inference | Running the trained model to generate outputs |
| Hallucination | Confident but incorrect model outputs |
| RLHF | Reinforcement Learning from Human Feedback |


Suggested Reading Order

For a 2-hour flight:

  1. Read Sections 1-5 of this guide (the core concepts)
  2. Skim Jay Alammar's "Illustrated Transformer" for visuals

For a 4-hour flight:

  1. Complete this entire guide
  2. Read Jay Alammar's "Illustrated Transformer" in full
  3. Read Stephen Wolfram's essay if time permits

Post-flight deep dive:

  1. Watch 3Blue1Brown's video series
  2. Watch Andrej Karpathy's "Deep Dive into LLMs"
  3. Explore the books if you want to build/code