Complete architecture analysis of Karpathy's $100 LLM training pipeline
~8,000 lines • GPT-2 in 3 hours • $72
One dial controls everything: --depth. Set depth=26 for GPT-2 capability. All hyperparameters (width, heads, LR, batch size, weight decay) are computed automatically using scaling laws.
nanochat is Karpathy's complete LLM training harness — from tokenization to chat UI in ~8,000 lines of clean PyTorch.
The core architecture distills lessons from GPT-3, Llama, PaLM, and Gemma into ~300 lines.
| Component | Choice | Why |
|---|---|---|
| Position encoding | RoPE | O(1) memory, better length extrapolation |
| Normalization | RMSNorm (no params) | 30% faster than LayerNorm |
| Attention | QK-norm + GQA + sliding window | Stability + memory + context |
| Activation | ReLU² | 15% faster than GELU, 95% of SwiGLU quality |
| Linear layers | No bias | Redundant with RMSNorm |
| Embeddings | Untied + Value Embeddings | ResFormer-style value residual |
Uses a configurable pattern string that tiles across layers:

```python
window_pattern: str = "SSSL"  # S = short (half context), L = long (full)
# The final layer always uses full context regardless
```
Reduces memory while maintaining full context access through alternating patterns.
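As a sketch, tiling such a pattern across layers might look like this (the helper name and exact rule are assumptions for illustration, not nanochat's literal code):

```python
def layer_window_sizes(n_layer, seq_len, pattern="SSSL"):
    """Tile the window pattern across layers: 'S' gets half context,
    'L' gets full context; the final layer is always forced to full."""
    sizes = [seq_len if pattern[i % len(pattern)] == "L" else seq_len // 2
             for i in range(n_layer)]
    sizes[-1] = seq_len  # final layer always sees the full context
    return sizes
```

For a 6-layer model at context 2048 this gives half-size windows everywhere except the "L" slots and the forced-long final layer.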
Alternating layers get a learned value embedding that mixes into the attention values:
```python
# Value residual with input-dependent gating
if ve is not None:
    gate = 2 * torch.sigmoid(self.ve_gate(x[..., :32]))  # gate in (0, 2)
    v = v + gate.unsqueeze(-1) * ve
```
This provides a shortcut path for the model to directly access token embeddings in deeper layers.
Rotary Position Embeddings encode position through 2D rotations rather than learned embeddings:
```python
def apply_rotary_emb(x, cos, sin):
    # x shape: (B, T, H, D) - FA3's native layout
    d = x.shape[3] // 2
    x1, x2 = x[..., :d], x[..., d:]  # split into rotation pairs
    y1 = x1 * cos + x2 * sin         # rotate each pair
    y2 = x1 * (-sin) + x2 * cos
    return torch.cat([y1, y2], 3)

# Applied to queries and keys before QK-norm
q, k = apply_rotary_emb(q, cos, sin), apply_rotary_emb(k, cos, sin)
q, k = norm(q), norm(k)  # QK norm after RoPE
```
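The cos/sin tables are precomputed from per-pair inverse frequencies. A NumPy sketch (helper names and the common base-10000 frequency convention are assumptions) shows the mechanics; note that the rotation leaves vector norms unchanged, which is why it composes cleanly with the QK-norm applied afterwards:

```python
import numpy as np

def rope_tables(seq_len, head_dim, base=10000.0):
    # One frequency per rotation pair -> head_dim // 2 angles per position
    inv_freq = base ** (-np.arange(head_dim // 2) * 2.0 / head_dim)
    angles = np.outer(np.arange(seq_len), inv_freq)  # (T, D/2)
    return np.cos(angles), np.sin(angles)

def apply_rotary(x, cos, sin):
    # Same half-split pairing as above: (x1, x2) rotated by position angle
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    return np.concatenate([x1 * cos + x2 * sin, -x1 * sin + x2 * cos], axis=-1)
```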
QK normalization is applied to queries and keys after RoPE to prevent attention logit explosion:
- Without QK norm: query norms drift from ~0.5 to ~5.0 during training
- With QK norm: query norms stabilize around ~1.0 throughout
Eliminates training instabilities without hyperparameter tuning.
```python
class MLP(nn.Module):
    def forward(self, x):
        x = self.c_fc(x)
        x = F.relu(x).square()  # ReLU squared
        x = self.c_proj(x)
        return x
```
Advantages: ~15% faster than GELU while retaining ~95% of SwiGLU's quality (per the table above), with no third gate projection to store or compute.
```python
def norm(x):
    """Purely functional RMSNorm with no learnable params."""
    return F.rms_norm(x, (x.size(-1),))
```

Computes `rms = sqrt(mean(x²) + eps)` and returns `x / rms`.
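The same computation written out in NumPy, as a sanity check of the formula (a sketch, not the library call):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # x / sqrt(mean(x^2) + eps) over the last dim; no learnable scale or bias
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
```

Output rows come out with unit RMS regardless of input scale.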
```python
softcap = 15
logits = softcap * torch.tanh(logits / softcap)
```
Bounds outputs to [-15, 15] via smooth saturation, preventing numerical instability in cross-entropy loss.
Muon (MomentUm Orthogonalized) is the key optimizer innovation. It applies SGD with Nesterov momentum, then orthogonalizes each 2D parameter update via Polar Express iteration.
Traditional optimizers treat all parameters identically. But transformer parameters are 2D matrices with geometric structure. Orthogonal updates:
nanochat uses Polar Express for orthogonalization — faster convergence than Newton-Schulz with better numerical properties:
```python
# Pre-computed (a, b, c) coefficients, one triple per iteration
polar_express_coeffs = [
    (8.156, -22.483, 15.879),
    (4.043, -2.809, 0.500),
    (3.892, -2.772, 0.506),
    (3.286, -2.368, 0.464),
    (2.347, -1.710, 0.423),
]

X = g.bfloat16()
X = X / (X.norm(dim=(-2, -1), keepdim=True) * 1.02 + 1e-6)  # spectral norm below 1
for a, b, c in polar_express_coeffs[:ns_steps]:
    A = X @ X.mT
    B = b * A + c * (A @ A)
    X = a * X + B @ X  # quintic polynomial applied to the singular values
```
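A NumPy transcription of the loop (float64 instead of bfloat16, coefficients copied from above) shows the effect: whatever the input's singular values, the output's are driven toward 1, i.e. the update is approximately orthogonalized:

```python
import numpy as np

POLAR_EXPRESS_COEFFS = [
    (8.156, -22.483, 15.879),
    (4.043, -2.809, 0.500),
    (3.892, -2.772, 0.506),
    (3.286, -2.368, 0.464),
    (2.347, -1.710, 0.423),
]

def orthogonalize(g):
    # Frobenius-based normalization keeps the spectral norm safely below 1
    X = g / (np.linalg.norm(g) * 1.02 + 1e-6)
    for a, b, c in POLAR_EXPRESS_COEFFS:
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X  # quintic in the singular values
    return X
```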
The iteration runs entirely in bfloat16 and uses only matrix multiplies, so it maps well onto tensor cores; an approximate orthogonalization is sufficient in practice.

After orthogonalization, Muon's output still has non-uniform scales across neurons. NorMuon fixes this with per-neuron adaptive learning rates:
```python
# Track a per-neuron (per-row) second moment
v_mean = g.float().square().mean(dim=red_dim, keepdim=True)
second_momentum_buffer.lerp_(v_mean, 1 - beta2)

# Scale each neuron's update by the inverse sqrt of its second moment
step_size = second_momentum_buffer.clamp_min(1e-10).rsqrt()
g = g * step_size
```
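The effect is easiest to see in a NumPy sketch (function name and EMA bookkeeping are illustrative): rows of wildly different magnitude come out with equal RMS:

```python
import numpy as np

def per_neuron_normalize(g, v_buf, beta2=0.95, eps=1e-10):
    # EMA of the per-row second moment, then rescale by its inverse sqrt
    v_mean = np.mean(g * g, axis=1, keepdims=True)
    v_buf = beta2 * v_buf + (1 - beta2) * v_mean
    return g / np.sqrt(np.maximum(v_buf, eps)), v_buf
```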
Standard weight decay pulls parameters toward zero regardless of gradient direction. Cautious weight decay only applies when the gradient and parameter are aligned:
```python
# Only decay weights whose sign agrees with their gradient
mask = (g * params) >= 0
params.sub_(lr * g + lr * wd * params * mask)
```
This prevents decay from fighting the gradient, improving training stability.
nanochat uses a single fused MuonAdamW optimizer:
- Muon for 2D matrix parameters (transformer blocks)
- AdamW for 1D/0D parameters (embeddings, scalars)
Why different treatment? Embeddings receive sparse one-hot gradients and need adaptive methods. Weight matrices benefit from orthogonalization.
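A minimal routing rule capturing that split might look like this (the name-based embedding check is an assumption about how the partition could be keyed, not nanochat's literal code):

```python
def route_param(name, ndim):
    # 2D weight matrices -> Muon; embeddings and 1D/0D tensors -> AdamW
    if ndim == 2 and "emb" not in name:
        return "muon"
    return "adamw"
```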
| Metric | Muon vs AdamW |
|---|---|
| Training speed | ~35% faster convergence |
| Memory per param | One momentum buffer plus a factored per-neuron second moment (vs AdamW's two full buffers) |
| Stability | Runs in bfloat16 |
nanochat's training is a "speedrun" — train a working ChatGPT in one go across multiple stages:
| Config | Hardware | Time | Cost | Result |
|---|---|---|---|---|
| Speedrun | 8×H100 | ~3 hours | ~$72 | GPT-2 capability |
| Robust | 8×H100 | ~42 hours | ~$1000 | Better quality |
In 2019, training GPT-2 cost ~$43,000. nanochat achieves the same for 0.17% of the cost — a ~600× reduction.
Optional FP8 support on Hopper+ GPUs for faster training:
```bash
python -m scripts.base_train --fp8 --fp8-recipe=tensorwise
```
Converts Linear layers to Float8Linear with tensorwise (faster) or rowwise (more accurate) scaling.
Muon momentum warms up over 300 steps to prevent early instability:
```python
def get_muon_momentum(it):
    frac = min(it / 300, 1)
    return (1 - frac) * 0.85 + frac * 0.95  # ramps 0.85 -> 0.95
```
The optional final stage applies reinforcement learning on GSM8K with a simplified GRPO routine. Key simplifications vs canonical PPO-style RLHF:
Advantages are computed as a simple mean shift (r − μ) rather than the z-score (r − μ)/σ. This reduces the algorithm to REINFORCE with group-relative advantages — dramatically less code than PPO-style RLHF.
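The group-relative advantage is then a one-liner (sketch):

```python
def group_relative_advantages(rewards):
    # Advantage = reward minus the group mean; no sigma division
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]
```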
nanochat automatically calculates hyperparameters from --depth using empirically-derived scaling laws:
| Hyperparameter | Formula | Source |
|---|---|---|
| Model dimension | depth × aspect_ratio (default 64) | Width ∝ depth |
| Training tokens | ratio × scaling_params (default 10.5×) | Chinchilla ~20× |
| Batch size | B_ref × (D/D_ref)^0.383 | Power Lines |
| Learning rate | η_ref × √(B/B_ref) | AdamW √N scaling |
| Weight decay | λ_ref × √(B/B_ref) × (D_ref/D) | T_epoch |
All scaling is relative to depth-12 with tuned hyperparameters: B_ref = 524,288 tokens, known-good LRs. Formulas extrapolate to larger depths via muP-style transfer.
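Putting the table's formulas together (the reference LR and weight-decay values below are placeholders rather than nanochat's tuned numbers, and the scaling variable D is taken to be depth for illustration):

```python
import math

def derive_hparams(depth, aspect_ratio=64, depth_ref=12,
                   batch_ref=524288, lr_ref=0.02, wd_ref=0.1):
    model_dim = depth * aspect_ratio                  # width proportional to depth
    batch = batch_ref * (depth / depth_ref) ** 0.383  # Power Lines batch scaling
    lr = lr_ref * math.sqrt(batch / batch_ref)        # sqrt-batch LR scaling
    wd = wd_ref * math.sqrt(batch / batch_ref) * (depth_ref / depth)
    return {"model_dim": model_dim, "batch": round(batch), "lr": lr, "wd": wd}
```

At the reference depth 12 this reproduces the reference settings exactly; `--depth 26` scales everything up automatically.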
Linear warmup → constant → linear warmdown (not cosine):
```python
def get_lr_multiplier(it):
    if it < warmup_iters:
        return (it + 1) / warmup_iters  # linear warmup
    elif it <= num_iterations - warmdown_iters:
        return 1.0  # constant phase
    else:  # linear warmdown to final_lr_frac
        progress = (num_iterations - it) / warmdown_iters
        return progress + (1 - progress) * final_lr_frac
```
Community competition to train GPT-2 capability as fast as possible on 8×H100. Target: beat GPT-2 (1.6B) CORE score of 0.256525.
| Date | Time | CORE Score | Who |
|---|---|---|---|
| Feb 2026 | ~2.8 hours | 0.260 | Community record |
| Oct 2025 | ~3.0 hours | 0.257 | Karpathy baseline |