Complete architecture analysis of Karpathy's $100 LLM training pipeline
~8,000 lines • GPT-2 in 3 hours • $72
One dial controls everything: --depth. Set depth=26 for GPT-2 capability. All hyperparameters (width, heads, LR, batch size, weight decay) are computed automatically using scaling laws.
nanochat is Karpathy's complete LLM training harness — from tokenization to chat UI in ~8,000 lines of clean PyTorch.
The core architecture distills lessons from GPT-3, Llama, PaLM, and Gemma into ~300 lines.
| Component | Choice | Why |
|---|---|---|
| Position encoding | RoPE | O(1) memory, better length extrapolation |
| Normalization | RMSNorm (no params) | 30% faster than LayerNorm |
| Attention | QK-norm + GQA + sliding window | Stability + memory + context |
| Activation | ReLU² | 15% faster than GELU, 95% of SwiGLU quality |
| Linear layers | No bias | Redundant with RMSNorm |
| Embeddings | Untied + Value Embeddings | ResFormer-style value residual |
Uses a configurable pattern string that tiles across layers:

```python
window_pattern: str = "SSSL"  # S = short (half context), L = long (full)
# The final layer always uses full context regardless
```
Reduces memory while maintaining full context access through alternating patterns.
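As a sketch, tiling such a pattern across layers might look like this (the helper name and exact rule are assumptions for illustration, not nanochat's literal code):

```python
def layer_window_sizes(n_layer, seq_len, pattern="SSSL"):
    """Tile the window pattern across layers: 'S' gets half context,
    'L' gets full context; the final layer is always forced to full."""
    sizes = [seq_len if pattern[i % len(pattern)] == "L" else seq_len // 2
             for i in range(n_layer)]
    sizes[-1] = seq_len  # final layer always sees the full context
    return sizes
```

For a 6-layer model at context 2048 this gives half-size windows everywhere except the "L" slots and the forced-long final layer.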
Alternating layers get a learned value embedding that mixes into the attention values:
```python
# Value residual with input-dependent gating
if ve is not None:
    gate = 2 * torch.sigmoid(self.ve_gate(x[..., :32]))  # gate in (0, 2)
    v = v + gate.unsqueeze(-1) * ve
```
This provides a shortcut path for the model to directly access token embeddings in deeper layers.
Rotary Position Embeddings encode position through 2D rotations rather than learned embeddings:
```python
def apply_rotary_emb(x, cos, sin):
    # x shape: (B, T, H, D) - FA3's native layout
    d = x.shape[3] // 2
    x1, x2 = x[..., :d], x[..., d:]  # split into rotation pairs
    y1 = x1 * cos + x2 * sin         # rotate each pair
    y2 = x1 * (-sin) + x2 * cos
    return torch.cat([y1, y2], 3)

# Applied to queries and keys before QK-norm
q, k = apply_rotary_emb(q, cos, sin), apply_rotary_emb(k, cos, sin)
q, k = norm(q), norm(k)  # QK norm after RoPE
```
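The cos/sin tables are precomputed from per-pair inverse frequencies. A NumPy sketch (helper names and the common base-10000 frequency convention are assumptions) shows the mechanics; note that the rotation leaves vector norms unchanged, which is why it composes cleanly with the QK-norm applied afterwards:

```python
import numpy as np

def rope_tables(seq_len, head_dim, base=10000.0):
    # One frequency per rotation pair -> head_dim // 2 angles per position
    inv_freq = base ** (-np.arange(head_dim // 2) * 2.0 / head_dim)
    angles = np.outer(np.arange(seq_len), inv_freq)  # (T, D/2)
    return np.cos(angles), np.sin(angles)

def apply_rotary(x, cos, sin):
    # Same half-split pairing as above: (x1, x2) rotated by position angle
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    return np.concatenate([x1 * cos + x2 * sin, -x1 * sin + x2 * cos], axis=-1)
```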
QK normalization is applied to queries and keys after RoPE to prevent attention logit explosion:
- Without QK norm: query norms drift from ~0.5 to ~5.0 during training
- With QK norm: query norms stabilize around ~1.0 throughout
Eliminates training instabilities without hyperparameter tuning.
```python
class MLP(nn.Module):
    def forward(self, x):
        x = self.c_fc(x)
        x = F.relu(x).square()  # ReLU squared
        x = self.c_proj(x)
        return x
```
Advantages: ~15% faster than GELU while retaining ~95% of SwiGLU's quality (per the table above), with no third gate projection to store or compute.
```python
def norm(x):
    """Purely functional RMSNorm with no learnable params."""
    return F.rms_norm(x, (x.size(-1),))
```

Computes `rms = sqrt(mean(x²) + eps)` and returns `x / rms`.
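The same computation written out in NumPy, as a sanity check of the formula (a sketch, not the library call):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # x / sqrt(mean(x^2) + eps) over the last dim; no learnable scale or bias
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
```

Output rows come out with unit RMS regardless of input scale.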
```python
softcap = 15
logits = softcap * torch.tanh(logits / softcap)
```
Bounds outputs to [-15, 15] via smooth saturation, preventing numerical instability in cross-entropy loss.
Muon (MomentUm Orthogonalized) is the key optimizer innovation. It applies SGD with Nesterov momentum, then orthogonalizes each 2D parameter update via Polar Express iteration.
Traditional optimizers treat all parameters identically. But transformer parameters are 2D matrices with geometric structure. Orthogonal updates:
nanochat uses Polar Express for orthogonalization — faster convergence than Newton-Schulz with better numerical properties:
```python
# Pre-computed (a, b, c) coefficients, one triple per iteration
polar_express_coeffs = [
    (8.156, -22.483, 15.879),
    (4.043, -2.809, 0.500),
    (3.892, -2.772, 0.506),
    (3.286, -2.368, 0.464),
    (2.347, -1.710, 0.423),
]

X = g.bfloat16()
X = X / (X.norm(dim=(-2, -1), keepdim=True) * 1.02 + 1e-6)  # spectral norm below 1
for a, b, c in polar_express_coeffs[:ns_steps]:
    A = X @ X.mT
    B = b * A + c * (A @ A)
    X = a * X + B @ X  # quintic polynomial applied to the singular values
```
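A NumPy transcription of the loop (float64 instead of bfloat16, coefficients copied from above) shows the effect: whatever the input's singular values, the output's are driven toward 1, i.e. the update is approximately orthogonalized:

```python
import numpy as np

POLAR_EXPRESS_COEFFS = [
    (8.156, -22.483, 15.879),
    (4.043, -2.809, 0.500),
    (3.892, -2.772, 0.506),
    (3.286, -2.368, 0.464),
    (2.347, -1.710, 0.423),
]

def orthogonalize(g):
    # Frobenius-based normalization keeps the spectral norm safely below 1
    X = g / (np.linalg.norm(g) * 1.02 + 1e-6)
    for a, b, c in POLAR_EXPRESS_COEFFS:
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X  # quintic in the singular values
    return X
```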
The iteration runs entirely in bfloat16 and uses only matrix multiplies, so it maps well onto tensor cores; an approximate orthogonalization is sufficient in practice.

After orthogonalization, Muon's output still has non-uniform scales across neurons. NorMuon fixes this with per-neuron adaptive learning rates:
```python
# Track a per-neuron (per-row) second moment
v_mean = g.float().square().mean(dim=red_dim, keepdim=True)
second_momentum_buffer.lerp_(v_mean, 1 - beta2)

# Scale each neuron's update by the inverse sqrt of its second moment
step_size = second_momentum_buffer.clamp_min(1e-10).rsqrt()
g = g * step_size
```
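The effect is easiest to see in a NumPy sketch (function name and EMA bookkeeping are illustrative): rows of wildly different magnitude come out with equal RMS:

```python
import numpy as np

def per_neuron_normalize(g, v_buf, beta2=0.95, eps=1e-10):
    # EMA of the per-row second moment, then rescale by its inverse sqrt
    v_mean = np.mean(g * g, axis=1, keepdims=True)
    v_buf = beta2 * v_buf + (1 - beta2) * v_mean
    return g / np.sqrt(np.maximum(v_buf, eps)), v_buf
```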
Standard weight decay pulls parameters toward zero regardless of gradient direction. Cautious weight decay only applies when the gradient and parameter are aligned:
```python
# Only decay weights whose sign agrees with their gradient
mask = (g * params) >= 0
params.sub_(lr * g + lr * wd * params * mask)
```
This prevents decay from fighting the gradient, improving training stability.
nanochat uses a single fused MuonAdamW optimizer:
- Muon for 2D matrix parameters (transformer blocks)
- AdamW for 1D/0D parameters (embeddings, scalars)
Why different treatment? Embeddings receive sparse one-hot gradients and need adaptive methods. Weight matrices benefit from orthogonalization.
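A minimal routing rule capturing that split might look like this (the name-based embedding check is an assumption about how the partition could be keyed, not nanochat's literal code):

```python
def route_param(name, ndim):
    # 2D weight matrices -> Muon; embeddings and 1D/0D tensors -> AdamW
    if ndim == 2 and "emb" not in name:
        return "muon"
    return "adamw"
```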
| Metric | Muon vs AdamW |
|---|---|
| Training speed | ~35% faster convergence |
| Memory per param | One momentum buffer plus a factored per-neuron second moment (vs AdamW's two full buffers) |
| Stability | Runs in bfloat16 |
nanochat's training is a "speedrun" — train a working ChatGPT in one go across multiple stages:
| Config | Hardware | Time | Cost | Result |
|---|---|---|---|---|
| Speedrun | 8×H100 | ~3 hours | ~$72 | GPT-2 capability |
| Robust | 8×H100 | ~42 hours | ~$1000 | Better quality |
In 2019, training GPT-2 cost ~$43,000. nanochat achieves the same for 0.17% of the cost — a ~600× reduction.
Optional FP8 support on Hopper+ GPUs for faster training:
```bash
python -m scripts.base_train --fp8 --fp8-recipe=tensorwise
```
Converts Linear layers to Float8Linear with tensorwise (faster) or rowwise (more accurate) scaling.
Muon momentum warms up over 300 steps to prevent early instability:
```python
def get_muon_momentum(it):
    frac = min(it / 300, 1)
    return (1 - frac) * 0.85 + frac * 0.95  # ramps 0.85 -> 0.95
```
The optional final stage applies reinforcement learning on GSM8K with a simplified GRPO routine. Key simplifications vs canonical PPO-style RLHF:
Advantages are computed as a simple mean shift (r − μ) rather than the z-score (r − μ)/σ. This reduces the algorithm to REINFORCE with group-relative advantages — dramatically less code than PPO-style RLHF.
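The group-relative advantage is then a one-liner (sketch):

```python
def group_relative_advantages(rewards):
    # Advantage = reward minus the group mean; no sigma division
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]
```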
nanochat automatically calculates hyperparameters from --depth using empirically-derived scaling laws:
| Hyperparameter | Formula | Source |
|---|---|---|
| Model dimension | depth × aspect_ratio (default 64) | Width ∝ depth |
| Training tokens | ratio × scaling_params (default 10.5×) | Chinchilla ~20× |
| Batch size | B_ref × (D/D_ref)^0.383 | Power Lines |
| Learning rate | η_ref × √(B/B_ref) | AdamW √N scaling |
| Weight decay | λ_ref × √(B/B_ref) × (D_ref/D) | T_epoch |
All scaling is relative to depth-12 with tuned hyperparameters: B_ref = 524,288 tokens, known-good LRs. Formulas extrapolate to larger depths via muP-style transfer.
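Putting the table's formulas together (the reference LR and weight-decay values below are placeholders rather than nanochat's tuned numbers, and the scaling variable D is taken to be depth for illustration):

```python
import math

def derive_hparams(depth, aspect_ratio=64, depth_ref=12,
                   batch_ref=524288, lr_ref=0.02, wd_ref=0.1):
    model_dim = depth * aspect_ratio                  # width proportional to depth
    batch = batch_ref * (depth / depth_ref) ** 0.383  # Power Lines batch scaling
    lr = lr_ref * math.sqrt(batch / batch_ref)        # sqrt-batch LR scaling
    wd = wd_ref * math.sqrt(batch / batch_ref) * (depth_ref / depth)
    return {"model_dim": model_dim, "batch": round(batch), "lr": lr, "wd": wd}
```

At the reference depth 12 this reproduces the reference settings exactly; `--depth 26` scales everything up automatically.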
Linear warmup → constant → linear warmdown (not cosine):
```python
def get_lr_multiplier(it):
    if it < warmup_iters:
        return (it + 1) / warmup_iters  # linear warmup
    elif it <= num_iterations - warmdown_iters:
        return 1.0  # constant phase
    else:  # linear warmdown to final_lr_frac
        progress = (num_iterations - it) / warmdown_iters
        return progress + (1 - progress) * final_lr_frac
```
Community competition to train GPT-2 capability as fast as possible on 8×H100. Target: beat GPT-2 (1.6B) CORE score of 0.256525.
| Date | Time | CORE Score | Who |
|---|---|---|---|
| Feb 2026 | ~2.8 hours | 0.260 | Community record |
| Oct 2025 | ~3.0 hours | 0.257 | Karpathy baseline |