Understanding LLMs and Transformers: A Developer's Guide

ai · machine-learning · transformers · 2025-12-04 · 11 min read

Table of Contents

  1. The Big Picture: What LLMs Actually Do
  2. From Words to Numbers: Embeddings
  3. Tokenization: Breaking Text Into Pieces
  4. The Transformer Architecture
  5. Attention: The Key Innovation
  6. How LLMs Are Trained
  7. From Base Model to ChatGPT: Fine-tuning
  8. Practical Implications for Developers
  9. Resources for Deeper Learning

1. The Big Picture: What LLMs Actually Do

At its core, a Large Language Model does one thing: predict the next word.

That's it. Given a sequence of words, it estimates the probability of what word comes next. But this simple task, when scaled up with billions of parameters and trained on vast amounts of text, produces surprisingly capable systems.

The Core Mechanic

Think of autocomplete on your phone keyboard. When you type "I'm running", your keyboard might suggest "late" or "fast" or "out". An LLM does the same thing, but:

  1. It considers much more context (thousands of words, not just a few)
  2. It's been trained on billions of pages of text
  3. It uses a sophisticated architecture (the Transformer) to understand relationships between words

How Responses Are Generated

When you ask ChatGPT a question, it doesn't "know" the answer in the way you might think. Instead:

  1. Your input becomes the beginning of a text sequence
  2. The model predicts the most likely next word
  3. That word is added to the sequence
  4. The model predicts the next word after that
  5. Repeat until the response is complete

This is called autoregressive generation—each new word depends on all the words that came before.
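
In code, the loop looks roughly like the sketch below. Note that `model` and `tokenizer` are hypothetical stand-ins for any causal language model interface, and real systems usually sample from the probability distribution (controlled by a temperature setting) rather than always taking the single most likely token:

```python
# Minimal sketch of greedy autoregressive generation.
# `model` and `tokenizer` are hypothetical stand-ins, not a specific library's API.
def generate(model, tokenizer, prompt, max_new_tokens=100):
    tokens = tokenizer.encode(prompt)              # text -> list of token IDs
    for _ in range(max_new_tokens):
        probs = model.next_token_probs(tokens)     # one probability per vocabulary entry
        next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
        tokens.append(next_token)                  # the prediction becomes context
        if next_token == tokenizer.eos_token_id:   # model signals "response complete"
            break
    return tokenizer.decode(tokens)
```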

The Surprising Emergent Behavior

Here's what's remarkable: a model trained simply to predict text, at sufficient scale, develops what appear to be reasoning capabilities, world knowledge, and even something that looks like understanding. Whether this constitutes "real" intelligence is a philosophical debate, but the practical capabilities are undeniable.


2. From Words to Numbers: Embeddings

Computers work with numbers, not words. So how do we represent words mathematically?

The Naive Approach: One-Hot Encoding

You could represent each word as a one-hot vector: a vector as long as the vocabulary, with a 1 at that word's position and 0s everywhere else:

  • "cat" = [1, 0, 0, ...]
  • "dog" = [0, 1, 0, ...]
  • "running" = [0, 0, 1, ...]
  • etc.

But this has a fatal flaw: every vector is orthogonal to every other, so no relationships are captured. Nothing tells us that "cat" and "dog" are similar (both animals) while "running" is different (a verb).

The Breakthrough: Word Embeddings

Word embeddings represent each word as a vector (a list of numbers, typically 100-1000 dimensions). The key insight: words that appear in similar contexts should have similar vectors.

For example, "king" and "queen" appear in similar contexts, so their vectors are close together in the high-dimensional space. Same with "Paris" and "France".

The Famous Example

Word embeddings capture semantic relationships mathematically:

vector("king") - vector("man") + vector("woman") ≈ vector("queen")

This works because the vectors encode the relationships between concepts. The "royalty" direction plus the "female" direction gives you "queen".
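
The toy NumPy example below illustrates the idea with made-up 3-dimensional vectors (real embeddings have hundreds of learned dimensions, but the arithmetic is the same):

```python
import numpy as np

# Toy embedding arithmetic with invented 3-D vectors.
# Dimensions loosely mean (royalty, male, female) -- for illustration only.
king  = np.array([0.9, 0.8, 0.1])
man   = np.array([0.1, 0.8, 0.1])
woman = np.array([0.1, 0.1, 0.9])
queen = np.array([0.9, 0.1, 0.9])

result = king - man + woman            # remove "male", add "female"

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(result, queen))           # 1.0 -- the nearest vector is "queen"
```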

Word2Vec: How It's Learned

Word2Vec (2013) was a breakthrough algorithm that learns embeddings by:

  1. Skip-gram: Given a word, predict its context words
  2. CBOW (Continuous Bag of Words): Given context words, predict the center word

Through training on millions of text examples, the model learns vectors that capture semantic meaning.
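
To make the skip-gram objective concrete, here's a small sketch of how (center, context) training pairs are extracted from a sentence, with a window of 2 words on each side:

```python
# Sketch of skip-gram training-pair extraction: for each center word,
# every word within the window becomes a context word to predict.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "cat", "sat", "on", "the", "mat"]))
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat'), ...]
```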

In Modern LLMs

Today's LLMs don't use static Word2Vec embeddings. Instead, they learn contextual embeddings—the same word gets different vectors depending on its context. "Bank" in "river bank" vs "bank account" gets different representations.


3. Tokenization: Breaking Text Into Pieces

Before text enters an LLM, it must be broken into discrete units called tokens. This isn't as simple as splitting on spaces.

Why Not Just Use Words?

Several problems with word-level tokenization:

  1. Vocabulary explosion: English has 100,000+ words. Add misspellings, names, technical terms, and you're at millions
  2. Out-of-vocabulary: What happens with new words or typos?
  3. Morphology: "run", "running", "runner" share meaning but would be separate tokens

Subword Tokenization

Modern LLMs use subword tokenization—breaking words into meaningful pieces:

  • "unbelievable" → ["un", "believ", "able"]
  • "tokenization" → ["token", "ization"]

This balances vocabulary size with expressiveness.

Byte-Pair Encoding (BPE)

BPE is the most common algorithm, used by GPT models:

  1. Start with individual characters as tokens
  2. Find the most frequent adjacent pair
  3. Merge that pair into a new token
  4. Repeat until you reach the desired vocabulary size (typically 50,000-100,000 tokens)

Example evolution:

  • Characters: "l", "o", "w", "e", "r", "n", "s", "t"...
  • After merges: "low", "er", "new", "est"...
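
Here's a toy implementation of the merge loop, using the classic "low / lower / newest / widest" corpus (the naive string replace is fine for this illustration, though real tokenizers are more careful):

```python
from collections import Counter

# One BPE merge step: count adjacent symbol pairs, merge the most frequent.
def most_frequent_pair(words):
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in words.items()}

# Toy corpus: word -> frequency, symbols separated by spaces.
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(3):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(pair, "->", list(words))
# ('e', 's'), then ('es', 't'), then ('l', 'o') -- "est" emerges as a token
```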

Practical Implications

Understanding tokenization helps you:

  1. Estimate costs: APIs charge per token, not per word
  2. Understand limits: Context windows are measured in tokens (e.g., 128k for GPT-4 Turbo)
  3. Debug issues: Some words tokenize unexpectedly, affecting model behavior

A rough heuristic: 1 token ≈ 4 characters or 100 tokens ≈ 75 words in English.
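
You can check token counts for your own text with OpenAI's open-source tiktoken library; cl100k_base is the encoding used by GPT-4-era models:

```python
import tiktoken  # OpenAI's tokenizer library: pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-4-era models
text = "Understanding tokenization helps you estimate costs."
tokens = enc.encode(text)
print(len(tokens), "tokens for", len(text), "characters")
print(enc.decode(tokens) == text)            # round-trips back to the original text
```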


4. The Transformer Architecture

The Transformer, introduced in the 2017 paper "Attention Is All You Need," revolutionized NLP. Before it, models processed text sequentially (one word at a time). Transformers process all words simultaneously, enabling massive parallelization.

The Basic Structure

A Transformer has two main parts:

  1. Encoder: Reads and understands the input
  2. Decoder: Generates the output

For pure language models like GPT, only the decoder is used. For translation tasks, both are used.

The Processing Pipeline

Here's what happens to your input:

Input Text
    ↓
Tokenization (text → token IDs)
    ↓
Embedding (token IDs → vectors)
    ↓
Positional Encoding (add position information)
    ↓
Multiple Transformer Layers
    ↓
Output Probabilities (for each possible next token)

Inside a Transformer Layer

Each layer has:

  1. Multi-Head Attention: Lets tokens look at and gather information from other tokens
  2. Feed-Forward Network: Processes each token's representation independently
  3. Residual Connections: Add the layer's input to its output (helps with training)
  4. Layer Normalization: Stabilizes the values
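
Putting those four pieces together, here's a minimal PyTorch sketch of one layer (a simplified pre-norm variant; real implementations add dropout, causal masking, and other details):

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """Minimal pre-norm decoder-style layer: a sketch, not a production design."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),   # feed-forward expands...
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),   # ...then projects back down
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)       # self-attention: Q, K, V from same sequence
        x = x + attn_out                       # residual connection around attention
        x = x + self.ff(self.norm2(x))         # residual connection around feed-forward
        return x

layer = TransformerLayer()
out = layer(torch.randn(1, 10, 512))           # (batch, sequence, embedding dim)
print(out.shape)                               # torch.Size([1, 10, 512])
```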

Modern LLMs stack many of these layers:

  • GPT-2: 12-48 layers
  • GPT-3: 96 layers
  • GPT-4: likely 100+ layers (not publicly confirmed)

Positional Encoding: Adding Word Order

Since Transformers process all tokens at once, they don't inherently know word order. "The cat sat on the mat" and "mat the on sat cat the" would look identical.

Positional encoding solves this by adding position information to each embedding. There are different approaches:

  1. Sinusoidal: Original paper used sine/cosine functions of different frequencies
  2. Learned: Let the model learn position embeddings
  3. RoPE (Rotary Position Embedding): Modern approach used by Llama and others
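
Here's a sketch of the original sinusoidal approach, which gives every position a unique pattern of sine and cosine values that gets added to the token embeddings:

```python
import numpy as np

# Sinusoidal positional encoding from the original Transformer paper:
#   PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
#   PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

pe = positional_encoding(seq_len=6, d_model=8)
print(pe.shape)  # (6, 8) -- one vector per position, added to each embedding
```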

5. Attention: The Key Innovation

Attention is the mechanism that makes Transformers work. It's how the model decides which words are relevant to which other words.

The Intuition

Consider: "The animal didn't cross the street because it was too tired."

What does "it" refer to? A human easily knows it's "the animal" (not "the street"). The attention mechanism allows the model to make this same connection—when processing "it", the model attends strongly to "animal".

Query, Key, Value

Attention uses three vectors for each token:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information do I provide?"

The process:

  1. Each token's Query is compared to all tokens' Keys
  2. Higher matches → higher attention scores
  3. Attention scores determine how much of each token's Value to incorporate

The Math (Simplified)

Attention(Q, K, V) = softmax(Q × K^T / √d) × V

Where:

  • Q × K^T: Dot product gives similarity scores
  • √d: Scaling factor (d = dimension size)
  • softmax: Converts to probabilities (sum to 1)
  • × V: Weight the values by attention scores
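
The whole formula fits in a few lines of NumPy. This sketch runs it on a toy sequence of 3 tokens with dimension 4:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)    # similarity between queries and keys
    weights = softmax(scores)        # each row sums to 1
    return weights @ V               # weighted mix of values

# 3 tokens, dimension 4, random toy values
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(attention(Q, K, V).shape)      # (3, 4): one updated vector per token
```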

Multi-Head Attention

Instead of one attention mechanism, Transformers use multiple "heads" (typically 8-64). Each head can focus on different types of relationships:

  • Head 1 might focus on syntactic relationships
  • Head 2 might focus on semantic similarity
  • Head 3 might focus on coreference (like the "it" → "animal" example)

The outputs from all heads are concatenated and projected back to the original dimension.

Self-Attention vs Cross-Attention

  • Self-Attention: Tokens attend to other tokens in the same sequence
  • Cross-Attention: Tokens in one sequence attend to tokens in another (used in encoder-decoder models)

Causal Masking

In language models, there's a constraint: when predicting the next word, you can only look at previous words, not future ones. This is enforced through causal masking—attention scores to future positions are set to -∞ (becoming 0 after softmax).
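
The mask itself is just a matrix of -∞ values above the diagonal, added to the attention scores before the softmax:

```python
import numpy as np

# Causal mask for 4 tokens: position i may attend only to positions <= i.
seq_len = 4
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)  # -inf above the diagonal
print(mask)
# Adding this to the attention scores before softmax zeroes out attention
# to future positions (softmax turns -inf into a weight of 0).
```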


6. How LLMs Are Trained

Training an LLM happens in stages, each with different goals and techniques.

Stage 1: Pre-training

Goal: Learn language patterns and world knowledge from vast amounts of text.

Process:

  1. Gather training data (web pages, books, code—often 1+ trillion tokens)
  2. For each text, predict the next token given previous tokens
  3. Compare prediction to actual next token (cross-entropy loss)
  4. Update model weights via backpropagation
  5. Repeat billions of times
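
In code, the core objective (steps 2-4) looks roughly like this sketch, with random tensors standing in for a real model's output and real text:

```python
import torch
import torch.nn.functional as F

# Sketch of the pre-training objective: at every position, predict the
# next token and score the prediction with cross-entropy loss.
vocab_size, seq_len = 1000, 8
logits = torch.randn(1, seq_len, vocab_size)      # model output (stand-in)
tokens = torch.randint(vocab_size, (1, seq_len))  # the actual text (stand-in)

# Shift by one: logits at position t predict the token at position t+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss)  # lower loss = better next-token predictions; backprop updates weights
```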

Resources: This is extremely expensive—GPT-4 training likely cost $100M+ in compute.

Result: A "base model" that can complete text but isn't aligned with user intent. Ask it a question and it might continue the question rather than answer it.

Stage 2: Supervised Fine-Tuning (SFT)

Goal: Teach the model to follow instructions and have conversations.

Process:

  1. Create a dataset of (prompt, ideal response) pairs
  2. Human contractors write or edit responses
  3. Fine-tune the model to produce these responses

Result: The model learns the format of helpful responses.

Stage 3: Reinforcement Learning from Human Feedback (RLHF)

Goal: Further align the model with human preferences.

Process:

  1. Generate multiple responses to the same prompt
  2. Human raters rank responses (best to worst)
  3. Train a "reward model" to predict human preferences
  4. Use reinforcement learning (PPO algorithm) to optimize the LLM against this reward model
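
The reward model in step 3 is commonly trained with a pairwise preference loss: push the score of the human-preferred response above the rejected one. A minimal sketch, where `reward_model` is a hypothetical function returning one scalar score per response:

```python
import torch.nn.functional as F

# Pairwise (Bradley-Terry style) loss for training the reward model:
# the preferred response should score higher than the rejected one.
def reward_loss(reward_model, chosen, rejected):
    r_chosen = reward_model(chosen)      # scalar score per preferred response
    r_rejected = reward_model(rejected)  # scalar score per rejected response
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```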

Result: A model that produces responses humans tend to prefer.

Recent Developments

  • Constitutional AI (Anthropic): Uses AI feedback in addition to human feedback
  • DPO (Direct Preference Optimization): Simpler alternative to RLHF
  • DeepSeek-R1: Showed chain-of-thought reasoning can emerge from pure RL

7. From Base Model to ChatGPT: Fine-tuning

Understanding the progression from base model to chat model helps explain both capabilities and limitations.

Base Models vs Instruction-Tuned Models

Base Model (e.g., GPT-3 base):

  • Trained only to predict next tokens
  • Completes text in the style of training data
  • Not naturally helpful—might continue your question with more questions
  • Still contains all the knowledge, just doesn't know how to use it conversationally

Instruction-Tuned Model (e.g., ChatGPT):

  • Fine-tuned to follow instructions
  • Knows to answer questions, not continue them
  • Refuses harmful requests
  • Better at structured outputs

The Chat Format

Modern chat models are trained on conversations with specific roles:

System: You are a helpful assistant.
User: What is the capital of France?
Assistant: The capital of France is Paris.

The model learns to continue the conversation appropriately for each role.
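
In API terms, the conversation is passed as a list of role-tagged messages. Here's a sketch in the style of the OpenAI Python client (the model name is illustrative; other providers use very similar formats):

```python
# Role-tagged messages as most chat APIs expect them.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)
print(response.choices[0].message.content)
```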

Fine-Tuning for Specific Tasks

Companies fine-tune base models for their needs:

  1. Domain adaptation: Train on legal documents, medical texts, etc.
  2. Task-specific: Train for code generation, summarization, etc.
  3. Safety tuning: Additional training to prevent harmful outputs

LoRA and Efficient Fine-Tuning

Full fine-tuning is expensive (all parameters updated). Techniques like LoRA (Low-Rank Adaptation) make it practical:

  • Freeze original weights
  • Add small trainable matrices (typically 0.1% of parameters)
  • Train only these additions
  • Effective for many use cases at a fraction of the cost
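
Here's a minimal sketch of the idea applied to a single linear layer (libraries like HuggingFace's peft apply this across a whole model for you):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of LoRA: a frozen base layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                   # freeze original weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # B=0: starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the scaled low-rank correction
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"{trainable} trainable of {total} total parameters")  # 8,192 of 270,848
```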

8. Practical Implications for Developers

Understanding how LLMs work informs how to use them effectively.

Prompt Engineering

Why prompts matter:

  • The model predicts what text would likely follow your prompt
  • Better prompts create contexts where useful responses are likely
  • "Few-shot" examples work because the model continues the established pattern

Effective techniques:

  • Chain-of-thought: "Let's think step by step"
  • Few-shot examples: Show the format you want
  • Role prompting: "You are an expert in..."
  • Structured output: Request JSON, markdown tables, etc.
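
For example, a few-shot prompt simply demonstrates the pattern and lets the model continue it:

```python
# A few-shot prompt: the examples establish a pattern the model continues.
prompt = """Classify the sentiment of each review as positive or negative.

Review: "Absolutely loved it, would buy again."
Sentiment: positive

Review: "Broke after two days. Waste of money."
Sentiment: negative

Review: "Exceeded my expectations in every way."
Sentiment:"""
# The most likely continuation is " positive" -- the model follows the pattern.
```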

Understanding Limitations

Hallucinations: The model generates plausible-sounding but false information because:

  • It predicts likely text, not necessarily true text
  • Training data contains errors
  • The model has no mechanism to verify facts

Context windows: Limited by architecture and training:

  • Attention is O(n²) in sequence length
  • Longer contexts → higher costs and latency
  • Information retrieval degrades with very long contexts

Tokenization quirks:

  • Arithmetic often fails (numbers split into unexpected token chunks)
  • Some languages tokenize inefficiently
  • Code indentation can consume many tokens

When to Use Different Approaches

Use LLMs directly when:

  • Tasks benefit from language understanding
  • Exact accuracy isn't critical
  • Creative/generative outputs needed

Augment with tools when:

  • Calculations needed (use code interpreter)
  • Current information required (use search/retrieval)
  • Structured data operations (use databases)

Consider alternatives when:

  • Simple pattern matching suffices
  • Deterministic behavior required
  • Cost sensitivity is high

The Future: Where Things Are Heading

Current trends:

  • Longer context windows: 100k+ tokens becoming standard
  • Multimodal: Text + images + audio + video
  • Smaller, more efficient models: Llama 3, Phi, etc.
  • Better reasoning: Chain-of-thought, tree search, etc.
  • Agents: LLMs as orchestrators of tools and workflows

9. Resources for Deeper Learning

3Blue1Brown's Neural Networks Series (Free, ~4 hours total) https://www.3blue1brown.com/topics/neural-networks

The best visual explanations available:

  • "Large Language Models explained briefly" - Quick intro
  • "Transformers (Chapter 5)" - Visual architecture walkthrough
  • "Attention in transformers (Chapter 6)" - The key mechanism explained
  • "How LLMs Store Facts (Chapter 7)" - Deep dive into MLPs

Andrej Karpathy's YouTube Videos (Free) https://karpathy.ai/

  • "Deep Dive into LLMs like ChatGPT" (3.5 hours) - Comprehensive technical overview
  • "How I Use LLMs" (2 hours) - Practical applications
  • "Intro to Large Language Models" (1 hour) - Accessible introduction

Blog Posts (Print-Friendly for Reading)

The Illustrated Transformer by Jay Alammar https://jalammar.github.io/illustrated-transformer/ The classic visual explanation of Transformers. Features clear diagrams and step-by-step breakdowns. This is probably the single best resource for understanding the architecture.

The Illustrated GPT-2 by Jay Alammar https://jalammar.github.io/illustrated-gpt2/ Builds on the Transformer article to explain GPT specifically.

What Is ChatGPT Doing and Why Does It Work? by Stephen Wolfram https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/ Deep philosophical and technical exploration. Longer read but excellent for building intuition.

Books

"Build a Large Language Model (From Scratch)" by Sebastian Raschka Hands-on PyTorch implementation. Best for developers who want to understand by building.

"Hands-On Large Language Models" by Jay Alammar and Maarten Grootendorst Practical guide using HuggingFace, LangChain, etc. Good for applying LLMs in projects.

"What Is ChatGPT Doing and Why Does It Work?" by Stephen Wolfram Book version of the essay, expanded with additional material.

Interactive Resources

HuggingFace Tokenizer Playground Explore how different tokenizers break down text

Transformer Explainer (Georgia Tech) Interactive visualization of attention patterns

Academic Papers (If You Want to Go Deeper)

"Attention Is All You Need" (2017) - The original Transformer paper https://arxiv.org/abs/1706.03762

"Language Models are Few-Shot Learners" (2020) - GPT-3 paper

"Training Language Models to Follow Instructions" (2022) - InstructGPT paper


Quick Reference: Key Terms

| Term | Definition |
|------|------------|
| Token | The basic unit of text the model works with (roughly 4 characters) |
| Embedding | A vector representation of a token capturing its meaning |
| Attention | Mechanism allowing tokens to exchange information |
| Context window | Maximum tokens the model can process at once |
| Parameters | The model's learned weights (GPT-3: 175B, GPT-4: ~1T estimated) |
| Fine-tuning | Additional training on specific data/tasks |
| Prompt | The input text given to generate a response |
| Inference | Running the trained model to generate outputs |
| Hallucination | Confident but incorrect model outputs |
| RLHF | Reinforcement Learning from Human Feedback |


Suggested Reading Order

For a 2-hour flight:

  1. Read Sections 1-5 of this guide (the core concepts)
  2. Skim Jay Alammar's "Illustrated Transformer" for visuals

For a 4-hour flight:

  1. Complete this entire guide
  2. Read Jay Alammar's "Illustrated Transformer" in full
  3. Read Stephen Wolfram's essay if time permits

Post-flight deep dive:

  1. Watch 3Blue1Brown's video series
  2. Watch Andrej Karpathy's "Deep Dive into LLMs"
  3. Explore the books if you want to build/code