Architecture, internals, and training pipelines for frontier open models
Last updated: 2026-02-17
Open source LLMs now match or exceed proprietary models: MiniMax M2.5 matches Claude Opus on SWE-Bench at 1/20th the cost, while MiniMax-01's Lightning Attention enables 4M-token context at linear complexity.
The open source AI ecosystem has exploded. Models that were unthinkable two years ago are now freely available under Apache 2.0 or similar licenses.
| Model | Params | Active | Context | Release | License |
|---|---|---|---|---|---|
| Qwen 3.5 | 397B | 17B | 1M | Feb 2026 | Apache 2.0 |
| MiniMax M2.5 | 230B | 10B | 200K | Feb 2026 | MIT |
| DeepSeek R1 | 671B | 37B | 128K | Jan 2025 | MIT |
| MiniMax-01 | 456B | 46B | 4M | Jan 2025 | Apache 2.0 |
| Llama 4 Maverick | 400B | 17B | 1M | Apr 2025 | Llama 4 |
| Llama 4 Scout | 109B | 17B | 10M | Apr 2025 | Llama 4 |
| Qwen3 235B | 235B | 22B | 256K | May 2025 | Apache 2.0 |
All frontier open models now use Mixture of Experts (MoE). Qwen 3.5 has 397B total params but activates only 17B per token — getting 400B-level capability at 17B inference cost.
| Figure | What it shows |
|---|---|
| $5.6M | DeepSeek V3 training cost (vs $100M+ for GPT-4) |
| $534K | MiniMax M1 training cost (10× cheaper than DeepSeek R1) |
| 80.2% | MiniMax M2.5 SWE-Bench score (within 0.6% of Claude Opus at 1/20th the cost) |
| $72 | Cost to train GPT-2 capability with nanochat |
Qwen is Alibaba's flagship open source LLM series. Qwen3 and Qwen 3.5 represent the current state of the art in open models.
Alibaba's multimodal MoE flagship. 397B params, 17B activated per token.
| Benchmark | Score | Notes |
|---|---|---|
| MMLU-Pro | 87.8% | Knowledge/reasoning |
| MathVista | 90.3% | Visual math |
| MMMU | 85.0% | Multimodal understanding |
Qwen 3.5's 250K vocabulary encodes Chinese, math symbols, and code more compactly than prior models. Fine-tuners report 15-25% lower token counts on technical datasets — meaning lower inference costs.
| Model | Type | Params | Activated | Context |
|---|---|---|---|---|
| Qwen3-235B-A22B | MoE | 235B | 22B | 256K→1M |
| Qwen3-30B-A3B | MoE | 30B | 3B | 256K |
| Qwen3-32B | Dense | 32B | 32B | 128K |
| Qwen3-14B | Dense | 14B | 14B | 128K |
| Qwen3-8B | Dense | 8B | 8B | 128K |
| Qwen3-4B | Dense | 4B | 4B | 32K |
| Qwen3-1.7B | Dense | 1.7B | 1.7B | 32K |
| Qwen3-0.6B | Dense | 0.6B | 0.6B | 32K |
Qwen3 integrates thinking mode (complex, multi-step reasoning) and non-thinking mode (rapid, context-driven) into a unified framework.
DeepSeek is a Chinese AI lab that shocked the industry with efficient training and novel MoE architecture. DeepSeek R1 demonstrated that reasoning capabilities can emerge from pure RL.
| Metric | DeepSeek V3 | GPT-4 (est.) |
|---|---|---|
| GPU hours | 2.79M H800 | ~20M+ H100 |
| Cost | $5.6M | $50-100M |
| MFU | 23% | ~40% |
During R1-Zero training, models spontaneously developed self-reflection: "Wait, let me reconsider..." — emerging purely from RL without being taught. This suggests reasoning can be discovered, not just imitated.
R1's reasoning capabilities distilled into smaller, efficient models:
| Model | Base | Params | Notes |
|---|---|---|---|
| R1-Distill-Qwen-1.5B | Qwen | 1.5B | Edge-deployable reasoning |
| R1-Distill-Qwen-7B | Qwen | 7B | Strong reasoning transfer |
| R1-Distill-Llama-8B | Llama | 8B | Llama base + R1 reasoning |
Llama is Meta's open model family that democratized large language models. Llama 4 (2025) marked Meta's shift to MoE architecture.
| Model | Total | Active | Experts | Context |
|---|---|---|---|---|
| Llama 4 Scout | 109B | 17B | 16 | 10M |
| Llama 4 Maverick | 400B | 17B | 128 | 1M |
| Llama 4 Behemoth* | ~2T | 288B | 16 | — |
*Behemoth announced but not released; Maverick was distilled from it.
Dense architecture, still highly relevant for fine-tuning and deployment:
| Model | Params | Context | Architecture |
|---|---|---|---|
| Llama 3.1 405B | 405B | 128K | Dense |
| Llama 3.1 70B | 70B | 128K | Dense |
| Llama 3.1 8B | 8B | 128K | Dense |
Llama 3.1 used dense architecture to "maximize training stability." Llama 4 switched to MoE for efficiency. The industry consensus has shifted: MoE is now the default for frontier models.
MiniMax is a Shanghai-based AI company that pioneered Lightning Attention — a linear-complexity attention mechanism enabling 4M-token context. Their models match Claude Opus on coding benchmarks at 1/20th the cost.
Lightning Attention splits attention into intra-block (dense) and inter-block (linear) operations. Complexity drops from O(n²) to O(nd² + nBd). Result: 4M context at constant memory per token.
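The split described above can be sketched in a few lines. The following is a minimal numpy illustration of the block decomposition for unnormalized causal linear attention, not MiniMax's implementation; the block size and shapes are assumptions:

```python
import numpy as np

def lightning_attention(q, k, v, block=64):
    """Block-wise linear attention sketch (no softmax, causal):
    intra-block uses a dense masked product; inter-block folds all
    history into a running d x d state, so cost is O(n*B*d + n*d^2)."""
    n, d = q.shape
    out = np.zeros_like(v)
    kv = np.zeros((d, d))          # running sum of k_j^T v_j from past blocks
    for s in range(0, n, block):
        e = min(s + block, n)
        qb, kb, vb = q[s:e], k[s:e], v[s:e]
        # inter-block: attend to all previous blocks via the compressed state
        inter = qb @ kv
        # intra-block: dense causal attention within the block
        scores = np.tril(qb @ kb.T)
        intra = scores @ vb
        out[s:e] = inter + intra
        kv += kb.T @ vb            # fold this block into the running state
    return out
```

Each token's output combines a dense product within its own block with a single read of the compressed d×d state, which is why the total cost grows linearly in sequence length.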
First open model to match Claude Opus on SWE-Bench. 80% of internal MiniMax tasks now completed by M2.5.
| Benchmark | M2.5 | Claude Opus 4.6 |
|---|---|---|
| SWE-Bench Verified | 80.2% | 80.8% |
| Multi-SWE-Bench | 51.3% | 50.3% |
| Cost per task | $0.15 | $3.00 |
First model to demonstrate Lightning Attention at scale. 100% Needle-in-Haystack at 4M tokens.
| Benchmark | Score |
|---|---|
| MMLU | 88.5% |
| GSM8K | 94.8% |
| Needle-in-Haystack (4M) | 100% |
First open-weight hybrid-attention reasoning model. Uses 25% of DeepSeek R1's FLOPs at 100K tokens.
| Metric | M1 | DeepSeek R1 |
|---|---|---|
| Hardware | 512 H800s × 3 weeks | 2.79M H800 hours |
| Cost | $534K | $5.6M |
| FLOPs at 100K gen | 25% | 100% |
Standard attention is O(n²): doubling the context quadruples the compute. Lightning Attention is O(n): doubling the context doubles the compute. At 4M tokens, this is the difference between "impossible" and "affordable".
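Concretely, with an assumed head dimension of d = 128, the per-layer cost ratio between the two regimes is simply n/d:

```python
# Back-of-envelope attention cost per layer (illustrative head dim)
d = 128                        # assumed head dimension
for n in (128_000, 4_000_000): # context lengths
    quadratic = n * n * d      # standard attention: n x n score matrix
    linear = n * d * d         # linear attention: d x d state per token
    print(f"n={n:>9,}: quadratic costs {quadratic // linear:,}x linear")
```

At 128K context the gap is already three orders of magnitude; at 4M it exceeds 30,000×.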
MoE is the defining architecture of 2025-2026 frontier models. It enables massive parameter counts with efficient inference.
Instead of using all parameters for every token, MoE routes each token to a subset of "expert" networks.
| Model | Total | Active | Experts | Routing |
|---|---|---|---|---|
| Qwen 3.5 | 397B | 17B | — | Gated Delta Networks |
| DeepSeek R1 | 671B | 37B | 256+1 | Top-8 + shared |
| MiniMax-01 | 456B | 46B | 32 | Top-2 + Lightning Attention |
| MiniMax M2.5 | 230B | 10B | — | Sparse MoE |
| Llama 4 Maverick | 400B | 17B | 128 | Sparse |
| Llama 4 Scout | 109B | 17B | 16 | Sparse |
| Mixtral 8x7B | 47B | 13B | 8 | Top-2 |
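The routing pattern these models share can be sketched as follows. The gate, expert shapes, and softmax over the selected experts are illustrative assumptions, not any specific model's router:

```python
import numpy as np

def moe_forward(x, w_gate, experts, top_k=2):
    """Sketch of sparse MoE routing: each token is sent to its top_k
    experts only, weighted by a softmax over the selected gate scores."""
    logits = x @ w_gate                        # (tokens, n_experts) router scores
    top = np.argsort(-logits, axis=-1)[:, :top_k]
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        gates = np.exp(sel - sel.max())
        gates /= gates.sum()                   # softmax over selected experts
        for gate, e in zip(gates, top[t]):
            out[t] += gate * experts[e](x[t])  # only top_k experts run per token
    return out
```

With top_k = 2 of 8 experts, each token touches roughly a quarter of the FFN parameters, which is the mechanism behind the large total/active gap in the table.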
For complete architecture details, codebase map, and implementation guide, see the dedicated /nanochat page.
Andrej Karpathy's nanochat (October 2025) is the simplest complete harness for training LLMs from scratch. It covers every stage: tokenization, pretraining, SFT, RL, evaluation, and chat UI. Train GPT-2 capability for ~$72 in 3 hours.
One dial controls everything: depth. Set --depth=26 for GPT-2 capability. All other hyperparameters (width, heads, learning rate, batch size, weight decay) are computed automatically using scaling laws.
| Config | Hardware | Time | Cost | Result |
|---|---|---|---|---|
| Speedrun | 8×H100 | 3 hours | ~$72 | GPT-2 capability |
| Robust | 8×H100 | 42 hours | ~$1000 | Better quality |
In 2019, training GPT-2 cost ~$43,000. nanochat achieves the same for 0.2% of the cost.
Modern transformer with carefully selected components:
Combined optimizer: Muon for matrix params, AdamW for embeddings/scalars.
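At Muon's core is a Newton-Schulz iteration that approximately orthogonalizes each gradient matrix before the update. A numpy sketch of that step, using the quintic coefficients from the public Muon write-up (everything else here is illustrative):

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Push a gradient matrix toward the nearest orthogonal matrix via a
    quintic Newton-Schulz iteration, as Muon does before each update.
    Singular values land near 1 rather than exactly at 1."""
    a, b, c = 3.4445, -4.7750, 2.0315   # published quintic coefficients
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize so iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                          # iterate on the wide orientation
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x
```

Muon applies this only to 2D weight matrices; embeddings and scalar parameters stay on AdamW, which is the split the line above describes.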
nanochat automatically calculates optimal hyperparameters from depth using empirically-derived scaling laws:
| Hyperparameter | Formula | Reference |
|---|---|---|
| Model dim | depth × aspect_ratio (default 64) | Width ∝ depth |
| Training tokens | target_param_data_ratio × params (default 10.5×) | Chinchilla ~20× |
| Batch size | B_ref × (D/D_ref)^0.383 | Power Lines |
| Learning rate | η_ref × √(B/B_ref) | AdamW scaling |
| Weight decay | λ_ref × √(B/B_ref) × (D_ref/D) | T_epoch paper |
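The formulas in the table chain together into a small derivation routine. The reference constants below (b_ref, d_ref, lr_ref, wd_ref) and the 12·depth·dim² parameter estimate are illustrative assumptions, not nanochat's actual defaults:

```python
import math

def derive_hparams(depth, aspect_ratio=64, data_ratio=10.5,
                   b_ref=524_288, d_ref=40e9, lr_ref=0.02, wd_ref=0.1):
    """Depth-driven hyperparameter derivation in the spirit of the table
    above: width from depth, data from params, batch/lr/wd from data."""
    model_dim = depth * aspect_ratio              # width proportional to depth
    params = 12 * depth * model_dim ** 2          # rough transformer param count
    tokens = data_ratio * params                  # data sized from params
    batch = b_ref * (tokens / d_ref) ** 0.383     # Power-Lines batch scaling
    lr = lr_ref * math.sqrt(batch / b_ref)        # AdamW sqrt-batch scaling
    wd = wd_ref * math.sqrt(batch / b_ref) * (d_ref / tokens)
    return dict(model_dim=model_dim, params=params, tokens=tokens,
                batch=round(batch), lr=lr, wd=wd)

print(derive_hparams(26))  # the --depth=26 speedrun configuration
```

The point of the design is visible here: one integer in, a mutually consistent configuration out.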
Community competition to train GPT-2 capability as fast as possible. Target: beat GPT-2 (1.6B) CORE score of 0.256525.
nanoGPT (2022) is Karpathy's original minimal GPT implementation. While superseded by nanochat, it remains valuable for learning and as a baseline.
The entire codebase fits in two files:
| File | Lines | Purpose |
|---|---|---|
| model.py | ~300 | GPT model definition |
| train.py | ~300 | Training loop |
A practical guide to training your own LLM using open source tools.
| Paper | Year | Key Contribution |
|---|---|---|
| Qwen3 Technical Report | 2025 | Complete training methodology for frontier open model |
| DeepSeek-R1 | 2025 | GRPO, emergent reasoning from pure RL |
| Chinchilla | 2022 | Optimal data/param ratio (~20×) |
| LoRA | 2021 | Parameter-efficient fine-tuning |
| DPO | 2023 | Reward-free preference learning |
| QLoRA | 2023 | 4-bit quantization + LoRA. 70B on single GPU. |
Distillation transfers knowledge from large "teacher" models to smaller "student" models, achieving strong performance at lower inference cost.
DeepSeek-R1-Distill models are fine-tuned on samples generated by DeepSeek-R1, transferring advanced reasoning capabilities to smaller architectures.
| Model | Base | Params | Notable Result |
|---|---|---|---|
| R1-Distill-Qwen-1.5B | Qwen2.5-1.5B | 1.5B | Edge-deployable reasoning |
| R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 7B | Outperforms QwQ-32B-Preview |
| R1-Distill-Qwen-14B | Qwen2.5-14B | 14B | Strong reasoning transfer |
| R1-Distill-Qwen-32B | Qwen2.5-32B | 32B | Outperforms OpenAI o1-mini |
DeepSeek-R1-Distill-Qwen-7B outperforms QwQ-32B-Preview despite being 4.5× smaller. The distillation captures reasoning patterns, not just final answers.
Example configuration for distilling from Qwen3-235B:
```yaml
# config.yaml
project_name: qwen-distill
model: Qwen/Qwen3-8B
output_path: ./output
sequence_length: 8192

dataset:
  train_dataset:
    repo_id: arcee-ai/Qwen3-235B-Logits-Packed-8192
    split: train
    prepacked: true

teacher:
  kind: dataset  # Pre-captured logits

loss_functions:
  - function: cross_entropy
    weight: 0.5
  - function: kl
    weight: 0.5
    temperature: 1.0
```
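The two weighted loss terms in that configuration combine into a standard distillation objective. A numpy sketch of the math, not the distillation library's implementation:

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distill_loss(student_logits, teacher_logits, labels,
                 ce_weight=0.5, kl_weight=0.5, temperature=1.0):
    """Weighted sum of cross-entropy on hard labels and KL divergence to
    the teacher's (temperature-softened) distribution, mirroring the
    0.5 / 0.5 split in the config above."""
    s = log_softmax(student_logits / temperature)
    t = log_softmax(teacher_logits / temperature)
    ce = -np.take_along_axis(log_softmax(student_logits),
                             labels[:, None], axis=-1).mean()
    kl = (np.exp(t) * (t - s)).sum(axis=-1).mean() * temperature ** 2
    return ce_weight * ce + kl_weight * kl
```

The KL term is what transfers the teacher's full output distribution rather than just its argmax answers.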
Quantization reduces model precision (FP32 → INT8/INT4) for faster inference on consumer hardware.
| Method | Quality | Best For | Ecosystem |
|---|---|---|---|
| AWQ | ~95% | GPU inference, creative tasks | vLLM, TGI |
| GGUF | ~92% | CPU/Apple Silicon, Ollama | llama.cpp, LM Studio |
| GPTQ | ~90% | Raw CUDA throughput | ExLlama, AutoGPTQ |
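All three methods share one underlying idea: store low-precision integers plus per-row scales. A purely illustrative symmetric per-row INT8 sketch (AWQ and GPTQ add activation-aware scaling and error correction on top of this):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-row INT8 quantization: keep int8 weights plus one
    float scale per row, so each row's max maps to +/-127."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate weights; rounding error is at most scale / 2."""
    return q.astype(np.float32) * scale
```

This is why INT8 roughly quarters memory versus FP32: 1 byte per weight plus a negligible per-row scale.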
Official Qwen documentation covers GPTQ, AWQ, and GGUF quantization:
```shell
# Convert the HF checkpoint to GGUF with llama.cpp
python convert-hf-to-gguf.py Qwen/Qwen3-8B \
  --outfile Qwen3-8B-F16.gguf \
  --outtype f16

# Quantize to Q4_K_M
./llama-quantize Qwen3-8B-F16.gguf Qwen3-8B-Q4_K_M.gguf Q4_K_M
```
To improve quantized model quality, apply the AWQ scales before GGUF conversion: when calling model.quantize() with AutoAWQ, pass export_compatible=True.
Qwen3 models are available pre-quantized on HuggingFace and LM Studio; see the official Qwen quantization guide for details.
Deep dives into model architectures and training codebases.
MLA compresses key-value tensors into lower-dimensional latent space, dramatically reducing KV cache memory:
Code location: inference/model.py:396
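The compression step can be sketched as follows; the dimensions and weight names below are illustrative, not DeepSeek's actual shapes:

```python
import numpy as np

class LatentKVCache:
    """Sketch of the MLA idea: cache one low-rank latent c_t = x_t @ w_down
    per token instead of full per-head K/V, and re-expand with w_uk / w_uv
    only at attention time."""
    def __init__(self, d_model=1024, d_latent=64, d_head=128, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)
        self.w_uk = rng.normal(size=(d_latent, d_head)) / np.sqrt(d_latent)
        self.w_uv = rng.normal(size=(d_latent, d_head)) / np.sqrt(d_latent)
        self.cache = []                       # d_latent floats per token

    def append(self, x_t):
        self.cache.append(x_t @ self.w_down)  # compress before caching

    def keys_values(self):
        c = np.stack(self.cache)              # (t, d_latent)
        return c @ self.w_uk, c @ self.w_uv   # expand only when attending
```

The memory win comes from caching d_latent floats per token instead of d_head floats per head per token; the up-projections are shared weights, not per-token state.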
Qwen3 is integrated into HuggingFace transformers. Key implementation files:
Karpathy's nanochat distills modern transformer innovations into ~300 lines; see the dedicated nanochat deep dive.
| Component | nanochat Choice | Why |
|---|---|---|
| Positions | RoPE | O(1) memory, length extrapolation |
| Norm | RMSNorm (no params) | 30% faster than LayerNorm |
| Activation | ReLU² | 15% faster than GELU |
| Attention | QK-norm + GQA | Stability + efficiency |
| Optimizer | Muon + AdamW | 35% faster training |