nanochat Source Code Deep Dive

Complete architecture analysis of Karpathy's $100 LLM training pipeline
~8,000 lines • GPT-2 in 3 hours • $72

Core insight

One dial controls everything: --depth. Set depth=26 for GPT-2 capability. All hyperparameters (width, heads, LR, batch size, weight decay) are computed automatically using scaling laws.

Repository Structure

nanochat is Karpathy's complete LLM training harness — from tokenization to chat UI in ~8,000 lines of clean PyTorch.

nanochat/
gpt.py — Transformer architecture (~300 lines)
optim.py — MuonAdamW fused optimizer
flash_attention.py — FA3 on Hopper, SDPA fallback
dataloader.py — Distributed data loading
engine.py — Inference with KV cache
tokenizer.py — Rust BPE wrapper
core_eval.py — DCLM CORE evaluation
ui.html — Chat interface
scripts/
base_train.py — Pretraining loop
chat_sft.py — Supervised fine-tuning
chat_rl.py — GRPO reinforcement learning
runs/
speedrun.sh — GPT-2 benchmark ($72)

gpt.py — Architecture Deep Dive

The core architecture distills lessons from GPT-3, Llama, PaLM, and Gemma into ~300 lines.

Key Architectural Choices

| Component | Choice | Why |
|---|---|---|
| Position encoding | RoPE | O(1) memory, better length extrapolation |
| Normalization | RMSNorm (no params) | 30% faster than LayerNorm |
| Attention | QK-norm + GQA + sliding window | Stability + memory + context |
| Activation | ReLU² | 15% faster than GELU, 95% of SwiGLU quality |
| Linear layers | No bias | Redundant with RMSNorm |
| Embeddings | Untied + value embeddings | ResFormer-style value residual |

Sliding Window Attention

Uses a configurable pattern string that tiles across layers:

window_pattern: str = "SSSL"  # S=short (half context), L=long (full)
# Final layer always uses full context regardless

Reduces memory while maintaining full context access through alternating patterns.
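A minimal sketch of how such a pattern string could expand into per-layer window sizes (hypothetical helper; nanochat's actual implementation may differ):

```python
def layer_window_sizes(n_layer, seq_len, pattern="SSSL"):
    """Tile the pattern across layers: S = half context, L = full context.
    The final layer is forced to full context regardless of the pattern."""
    sizes = []
    for i in range(n_layer):
        kind = pattern[i % len(pattern)]
        sizes.append(seq_len if kind == "L" else seq_len // 2)
    sizes[-1] = seq_len  # final layer always sees the full context
    return sizes

# e.g. layer_window_sizes(8, 2048)
# -> [1024, 1024, 1024, 2048, 1024, 1024, 1024, 2048]
```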

Value Embeddings (ResFormer)

Alternating layers get a learned value embedding that mixes into the attention values:

# Value residual with input-dependent gating
if ve is not None:
    gate = 2 * torch.sigmoid(self.ve_gate(x[..., :32]))  # (0, 2)
    v = v + gate.unsqueeze(-1) * ve

This provides a shortcut path for the model to directly access token embeddings in deeper layers.

RoPE Implementation

Rotary Position Embeddings encode position through 2D rotations rather than learned embeddings:

def apply_rotary_emb(x, cos, sin):
    # x shape: (B, T, H, D) - FA3's native layout
    d = x.shape[3] // 2
    x1, x2 = x[..., :d], x[..., d:]  # split into pairs
    y1 = x1 * cos + x2 * sin         # rotate
    y2 = x1 * (-sin) + x2 * cos
    return torch.cat([y1, y2], 3)

# Applied to queries and keys before QK-norm
q, k = apply_rotary_emb(q, cos, sin), apply_rotary_emb(k, cos, sin)
q, k = norm(q), norm(k)  # QK norm after RoPE

Key details:

  • Tensors stay in FA3's native (B, T, H, D) layout
  • The rotation pairs the two halves (x[..., :d] with x[..., d:]) rather than interleaving adjacent dims
  • RoPE is applied to queries and keys only, and QK-norm runs after it

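The cos/sin tables that apply_rotary_emb consumes can be precomputed once from the usual RoPE frequencies. A plain-Python sketch, assuming the standard base-10000 convention (the helper name is illustrative, not nanochat's):

```python
import math

def rope_tables(seq_len, head_dim, base=10000.0):
    """Precompute RoPE cos/sin tables of shape (seq_len, head_dim // 2).
    Frequency i rotates the pair (x[i], x[i + d/2]) by angle t * base**(-2i/d)."""
    d = head_dim // 2
    inv_freq = [base ** (-2 * i / head_dim) for i in range(d)]
    cos = [[math.cos(t * f) for f in inv_freq] for t in range(seq_len)]
    sin = [[math.sin(t * f) for f in inv_freq] for t in range(seq_len)]
    return cos, sin
```

Position 0 yields cos = 1, sin = 0 everywhere, i.e. no rotation for the first token.
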
QK Normalization

Applied after RoPE to prevent attention logit explosion.

Empirical effect:

  • Without QK norm: query norms drift from ~0.5 to ~5.0 during training
  • With QK norm: query norms stay near 1.0 throughout
  • Training instabilities are eliminated without any hyperparameter tuning

ReLU² Activation

class MLP(nn.Module):
    def forward(self, x):
        x = self.c_fc(x)
        x = F.relu(x).square()  # ReLU squared
        x = self.c_proj(x)
        return x

Advantages:

  • ~15% faster than GELU while retaining ~95% of SwiGLU's quality
  • No gating matrix, so the MLP keeps the simple two-matrix shape
  • The gradient is just 2·ReLU(x): cheap and stable

RMSNorm (Parameter-Free)

def norm(x):
    """Purely functional RMSNorm with no learnable params."""
    return F.rms_norm(x, (x.size(-1),))

Computes: rms = sqrt(mean(x²) + eps); return x / rms
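The same formula in plain Python, to make it concrete (a sketch, not the torch kernel; the explicit eps value is illustrative):

```python
import math

def rms_norm(x, eps=1e-6):
    """x / sqrt(mean(x^2) + eps), with no learnable scale or bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]
```

The output keeps each vector's direction but rescales it to unit RMS.
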

Logits Softcapping

softcap = 15
logits = softcap * torch.tanh(logits / softcap)

Bounds outputs to [-15, 15] via smooth saturation, preventing numerical instability in cross-entropy loss.

Muon Optimizer

Muon (MomentUm Orthogonalized by Newton-Schulz) is the key optimizer innovation. It applies SGD with Nesterov momentum, then orthogonalizes each 2D parameter update (here via Polar Express iteration rather than the eponymous Newton-Schulz).

Why Orthogonalization?

Traditional optimizers treat all parameters identically. But transformer parameters are 2D matrices with geometric structure. Orthogonal updates:

  • Equalize the update's singular values, so no single direction dominates
  • Move every direction of the weight matrix at a uniform rate
  • Amount to matrix-aware preconditioning that elementwise methods like AdamW cannot provide

Polar Express Iteration

nanochat uses Polar Express for orthogonalization — faster convergence than Newton-Schulz with better numerical properties:

# Pre-computed coefficients for 5 iterations
polar_express_coeffs = [
    (8.156, -22.483, 15.879),
    (4.043, -2.809, 0.500),
    (3.892, -2.772, 0.506),
    (3.286, -2.368, 0.464),
    (2.347, -1.710, 0.423),
]

X = g.bfloat16()
X = X / (X.norm(dim=(-2, -1), keepdim=True) * 1.02 + 1e-6)

for a, b, c in polar_express_coeffs[:ns_steps]:
    A = X @ X.mT
    B = b * A + c * (A @ A)
    X = a * X + B @ X

Key properties:

  • Runs entirely in bfloat16
  • Uses five fixed, pre-computed coefficient triples, so there is no per-step tuning
  • The 1.02 factor in the pre-normalization adds headroom so singular values start safely below 1

NorMuon Variance Reduction

After orthogonalization, Muon's output has non-uniform scales across neurons. NorMuon fixes this with per-neuron adaptive learning rates:

# Track per-neuron second moment (red_dim is the reduced, fan-in dimension)
v_mean = g.float().square().mean(dim=red_dim, keepdim=True)
second_momentum_buffer.lerp_(v_mean, 1 - beta2)

# Scale by inverse sqrt of variance
step_size = second_momentum_buffer.clamp_min(1e-10).rsqrt()
g = g * step_size

Cautious Weight Decay

Standard weight decay pulls parameters toward zero regardless of gradient direction. Cautious weight decay only applies when the gradient and parameter are aligned:

# Only decay weights aligned with gradient direction
mask = (g * params) >= 0
params.sub_(lr * g + lr * wd * params * mask)

This prevents decay from fighting the gradient, improving training stability.
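A scalar illustration of the mask (a sketch of the rule, not the fused kernel; parameter values are made up):

```python
def cautious_step(p, g, lr=0.1, wd=0.1):
    """Apply weight decay only when gradient and parameter agree in sign."""
    mask = 1.0 if g * p >= 0 else 0.0
    return p - lr * g - lr * wd * p * mask

# Aligned case: the gradient already shrinks |p|, so decay helps it along.
# Opposed case: the gradient grows |p|; decay is skipped rather than fighting it.
```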

Dual-Optimizer Setup

# Single fused MuonAdamW optimizer
# Muon for 2D matrix parameters (transformer blocks)
# AdamW for 1D/0D parameters (embeddings, scalars)

Why different treatment? Embeddings receive sparse one-hot gradients and need adaptive methods. Weight matrices benefit from orthogonalization.
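In code, the split amounts to a partition over parameter shapes plus an exception list for embeddings and the head. A sketch with illustrative name filters (nanochat's actual grouping lives in optim.py and may differ):

```python
def partition_params(named_shapes):
    """Route hidden 2D matrices to Muon; embeddings, the LM head, and
    1D/0D params (norm gains, scalars) to AdamW. Name filters here are
    illustrative, not nanochat's exact rules."""
    muon, adamw = [], []
    for name, shape in named_shapes.items():
        is_matrix = len(shape) == 2
        is_embedding = "wte" in name or "lm_head" in name
        (muon if is_matrix and not is_embedding else adamw).append(name)
    return muon, adamw
```

Note the embedding matrices are 2D but still go to AdamW, since their sparse one-hot gradients suit adaptive methods better than orthogonalization.
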

Performance Gains

| Metric | Muon vs AdamW |
|---|---|
| Training speed | ~35% faster convergence |
| Memory per param | Factored second moment (reduced) |
| Stability | Runs in bfloat16 |

Training Pipeline

nanochat's training is a "speedrun" — train a working ChatGPT in one go across multiple stages:

TRAINING STAGES
1. Tokenizer: Train Rust BPE tokenizer (65K vocab)
2. Pretraining: FineWeb-EDU dataset with Muon
3. Mid-training: SmolTalk + MMLU aux + GSM8K with tool use tags
4. SFT: Supervised fine-tuning on chat format
5. RL (optional): Simplified GRPO on GSM8K
6. Evaluation: DCLM CORE score benchmark

Cost Breakdown

| Config | Hardware | Time | Cost | Result |
|---|---|---|---|---|
| Speedrun | 8×H100 | ~3 hours | ~$72 | GPT-2 capability |
| Robust | 8×H100 | ~42 hours | ~$1000 | Better quality |

In 2019, training GPT-2 cost ~$43,000. nanochat achieves the same for 0.17% of the cost — a ~600× reduction.

FP8 Training

Optional FP8 support on Hopper+ GPUs for faster training:

python -m scripts.base_train --fp8 --fp8-recipe=tensorwise

Converts Linear layers to Float8Linear with tensorwise (faster) or rowwise (more accurate) scaling.

Momentum Schedule

Muon momentum warms up over 300 steps to prevent early instability:

def get_muon_momentum(it):
    frac = min(it / 300, 1)
    return (1 - frac) * 0.85 + frac * 0.95  # 0.85 → 0.95

GRPO (Simplified RL)

The optional final stage applies reinforcement learning on GSM8K with a simplified GRPO routine. Key simplifications vs canonical PPO-style RLHF:

Simplifications vs canonical RLHF
  • No trust region (no reference model needed)
  • No KL regularization
  • On-policy: no PPO ratios or clipping
  • Token-level DAPO-style normalization
  • Mean-shift advantage: (r - μ) instead of z-score (r - μ)/σ

This simplifies to REINFORCE with group-relative advantages — dramatically less code than PPO-style RLHF.
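Concretely, the group-relative advantage is just a mean shift over the completions sampled for one prompt (a sketch of the simplification described above; the binary-reward example is illustrative):

```python
def group_advantages(rewards):
    """GRPO-style advantages in the mean-shift variant: subtract the
    group mean, skip the z-score's std-dev division."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]

# 4 samples for one GSM8K prompt, binary correctness reward:
# group_advantages([1.0, 0.0, 0.0, 1.0]) -> [0.5, -0.5, -0.5, 0.5]
```

Advantages always sum to zero within a group, so correct samples are reinforced exactly as much as incorrect ones are suppressed.
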

Scaling Laws

nanochat automatically calculates hyperparameters from --depth using empirically-derived scaling laws:

| Hyperparameter | Formula | Source |
|---|---|---|
| Model dimension | depth × aspect_ratio (default 64) | Width ∝ depth |
| Training tokens | ratio × scaling_params (default 10.5×) | Chinchilla ~20× |
| Batch size | B_ref × (D/D_ref)^0.383 | Power Lines |
| Learning rate | η_ref × √(B/B_ref) | AdamW √N scaling |
| Weight decay | λ_ref × √(B/B_ref) × (D_ref/D) | T_epoch |

Reference model: d12

All scaling is relative to depth-12 with tuned hyperparameters: B_ref = 524,288 tokens, known-good LRs. Formulas extrapolate to larger depths via muP-style transfer.
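The table's formulas can be sketched end to end. Reference constants come from the text; the cubic parameter-count proxy and the lr_ref/wd_ref placeholders are assumptions of this sketch, not nanochat's code:

```python
def derived_hparams(depth, lr_ref=1.0, wd_ref=1.0):
    """Derive width, batch size, and LR/WD multipliers from --depth,
    relative to the d12 reference. lr_ref/wd_ref stand in for the tuned
    d12 values, which this sketch does not know."""
    aspect_ratio = 64
    depth_ref, batch_ref = 12, 524288          # d12 reference, tokens/step
    model_dim = depth * aspect_ratio           # width proportional to depth
    # Proxy for relative parameter count: params ~ depth * dim^2, so ~ depth^3
    # (an assumption; the table's D may be the exact parameter count)
    rel_params = (depth / depth_ref) ** 3
    batch = batch_ref * rel_params ** 0.383    # B_ref x (D/D_ref)^0.383
    lr = lr_ref * (batch / batch_ref) ** 0.5   # eta_ref x sqrt(B/B_ref)
    wd = wd_ref * (batch / batch_ref) ** 0.5 / rel_params  # includes D_ref/D
    return model_dim, batch, lr, wd
```

At depth=12 everything reduces to the reference values; depth=26 scales width to 1664 and grows batch size and LR along the power laws.
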

LR Schedule

Linear warmup → constant → linear warmdown (not cosine):

def get_lr_multiplier(it):
    if it < warmup_iters:
        return (it + 1) / warmup_iters    # warmup
    elif it <= num_iterations - warmdown_iters:
        return 1.0                         # constant
    else:
        progress = (num_iterations - it) / warmdown_iters
        return progress + (1 - progress) * final_lr_frac  # warmdown

GPT-2 Speedrun Leaderboard

Community competition to train GPT-2 capability as fast as possible on 8×H100. Target: beat GPT-2 (1.6B) CORE score of 0.256525.

| Date | Time | CORE Score | Who |
|---|---|---|---|
| Feb 2026 | ~2.8 hours | 0.260 | Community record |
| Oct 2025 | ~3.0 hours | 0.257 | Karpathy baseline |