Open Source LLMs

Architecture, internals, and training pipelines for frontier open models
Last updated: 2026-02-17

In one sentence

Open source LLMs now match or exceed proprietary models — MiniMax M2.5 matches Claude Opus on SWE-Bench at 1/20th cost, while MiniMax-01's Lightning Attention enables 4M-token context at linear complexity.

2026 Model Landscape

The open source AI ecosystem has exploded. Models that were unthinkable two years ago are now freely available under Apache 2.0 or similar licenses.

Frontier Open Models

| Model | Params | Active | Context | Release | License |
|---|---|---|---|---|---|
| Qwen 3.5 | 397B | 17B | 1M | Feb 2026 | Apache 2.0 |
| MiniMax M2.5 | 230B | 10B | 200K | Feb 2026 | MIT |
| DeepSeek R1 | 671B | 37B | 128K | Jan 2025 | MIT |
| MiniMax-01 | 456B | 46B | 4M | Jan 2025 | Apache 2.0 |
| Llama 4 Maverick | 400B | 17B | 1M | Apr 2025 | Llama 4 |
| Llama 4 Scout | 109B | 17B | 10M | Apr 2025 | Llama 4 |
| Qwen3 235B | 235B | 22B | 256K | May 2025 | Apache 2.0 |
The efficiency revolution

All frontier open models now use Mixture of Experts (MoE). Qwen 3.5 has 397B total params but activates only 17B per token — getting 400B-level capability at 17B inference cost.

Key trends (2025-2026)

4 numbers that matter

  • $5.6M: DeepSeek V3 training cost (vs $100M+ for GPT-4)
  • $534K: MiniMax M1 training cost (10× cheaper than DeepSeek R1)
  • 80.2%: MiniMax M2.5 SWE-Bench score (within 0.6% of Claude Opus at 1/20th cost)
  • $72: Cost to train GPT-2 capability with nanochat

Qwen (Alibaba)

Qwen is Alibaba's flagship open source LLM series. Qwen3 and Qwen 3.5 represent the current state of the art in open models.

Qwen 3.5 (February 2026)

Qwen 3.5-397B-A17B Latest

Alibaba's multimodal MoE flagship. 397B params, 17B activated per token.

Architecture

  • Total params: 397 billion
  • Activated params: 17 billion per token
  • Architecture: Sparse MoE with Gated Delta Networks for linear-complexity attention
  • Context: 1M tokens (hosted), extendable via RoPE scaling
  • Vocabulary: 250K tokens (up from 152K in Qwen3)
  • Languages: 201 languages and dialects (69% increase over Qwen3)

Multimodal Design

  • Native multimodal: Text, image, video fused from first pretraining stage
  • Early fusion: Image patches injected directly into transformer layers (not adapter-based)
  • Resolution: Up to 1344×1344 pixels, pixel-perfect UI detection

Performance

| Benchmark | Score | Notes |
|---|---|---|
| MMLU-Pro | 87.8% | Knowledge/reasoning |
| MathVista | 90.3% | Visual math |
| MMMU | 85.0% | Multimodal understanding |

Efficiency

  • 8.6× faster decoding at 32K context vs Qwen3-Max
  • 19× faster at 256K context
  • 60% lower cost vs predecessors at equivalent capability
The vocabulary multiplier

Qwen 3.5's 250K vocabulary encodes Chinese, math symbols, and code more compactly than prior models. Fine-tuners report 15-25% lower token counts on technical datasets — meaning lower inference costs.

Qwen3 Family (May 2025)

| Model | Type | Params | Activated | Context |
|---|---|---|---|---|
| Qwen3-235B-A22B | MoE | 235B | 22B | 256K→1M |
| Qwen3-30B-A3B | MoE | 30B | 3B | 256K |
| Qwen3-32B | Dense | 32B | 32B | 128K |
| Qwen3-14B | Dense | 14B | 14B | 128K |
| Qwen3-8B | Dense | 8B | 8B | 128K |
| Qwen3-4B | Dense | 4B | 4B | 32K |
| Qwen3-1.7B | Dense | 1.7B | 1.7B | 32K |
| Qwen3-0.6B | Dense | 0.6B | 0.6B | 32K |

Key Innovations

Thinking Mode Unification

Qwen3 integrates thinking mode (complex, multi-step reasoning) and non-thinking mode (rapid, context-driven) into a unified framework.

  • Single model: No need to switch between separate reasoning and chat models
  • Thinking budget: Users can allocate compute adaptively per task
  • Latency/performance trade-off: Control via inference-time parameter
Qwen Model Links
Qwen3-235B — Flagship MoE. 22B activated.
Qwen3-4B — Best for fine-tuning. ARC Prize base.
Qwen3-8B — Matches Qwen2.5-14B. Great multilingual.
Technical Report — Full architecture details. arXiv.

DeepSeek

DeepSeek is a Chinese AI lab that shocked the industry with efficient training and novel MoE architecture. DeepSeek R1 demonstrated that reasoning capabilities can emerge from pure RL.

DeepSeek R1 (January 2025)

DeepSeek-R1-671B Reasoning

Architecture

  • Total params: 671 billion
  • Activated params: 37 billion per token
  • Experts: 1 shared expert (always active) + 256 routed experts (8 active per token)
  • Layers: 61 transformer layers
  • Context: 128K tokens

Training Methodology

  • R1-Zero: Pure RL without SFT — reasoning emerged naturally
  • GRPO: Group Relative Policy Optimization (no value network)
  • Reward: Simple regex-based format + accuracy rewards
  • Emergent behaviors: Self-verification, reflection, "aha moments"
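The group-relative advantage at the heart of GRPO can be sketched in a few lines (illustrative numpy, not DeepSeek's implementation): each prompt is sampled several times, and each completion's advantage is its reward z-scored within that group, so no learned value network is needed.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO advantage sketch: rewards for G completions of the same prompt
    are normalized against the group itself -- no value network."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)   # z-score within the group

# One prompt, 4 sampled answers: two correct (reward 1), two wrong (reward 0)
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# Correct completions get positive advantage, wrong ones negative,
# and the group advantage sums to zero by construction.
```

The policy gradient then upweights tokens from above-average completions and downweights the rest, with the simple regex/accuracy reward supplying the scalars.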

Training Cost

| Metric | DeepSeek V3 | GPT-4 (est.) |
|---|---|---|
| GPU hours | 2.79M H800 | ~20M+ H100 |
| Cost | $5.6M | $50-100M |
| MFU | 23% | ~40% |
The "Aha Moment"

During R1-Zero training, models spontaneously developed self-reflection: "Wait, let me reconsider..." — emerging purely from RL without being taught. This suggests reasoning can be discovered, not just imitated.

DeepSeek R1 Distilled Models

R1's reasoning capabilities distilled into smaller, efficient models:

| Model | Base | Params | Notes |
|---|---|---|---|
| R1-Distill-Qwen-1.5B | Qwen | 1.5B | Edge-deployable reasoning |
| R1-Distill-Qwen-7B | Qwen | 7B | Strong reasoning transfer |
| R1-Distill-Llama-8B | Llama | 8B | Llama base + R1 reasoning |

DeepSeek V3 Architecture Innovations

MoE ARCHITECTURE
Shared expert: 1 always-activated expert per layer
Captures common patterns, reduces routing overhead
Routed experts: 256 experts, 8 selected per token
3.125% activation ratio (8/256)
All-experts layers: 3 layers activate all experts
Enables cross-expert knowledge sharing

Llama (Meta)

Llama is Meta's open model family that democratized large language models. Llama 4 (2025) marked Meta's shift to MoE architecture.

Llama 4 (April 2025)

Llama 4 Scout & Maverick MoE

Model Variants

| Model | Total | Active | Experts | Context |
|---|---|---|---|---|
| Llama 4 Scout | 109B | 17B | 16 | 10M |
| Llama 4 Maverick | 400B | 17B | 128 | 1M |
| Llama 4 Behemoth* | ~2T | 288B | 16 | n/a |

*Behemoth announced but not released; Maverick was distilled from it.

Key Features

  • 10M context: Scout supports 10 million tokens — orders of magnitude beyond prior LLMs
  • Multimodal: Text and image input, text output
  • Multilingual: 12 languages supported
  • MoE architecture: Both Scout and Maverick use sparse experts

Llama 3.1 (July 2024)

Dense architecture, still highly relevant for fine-tuning and deployment:

| Model | Params | Context | Architecture |
|---|---|---|---|
| Llama 3.1 405B | 405B | 128K | Dense |
| Llama 3.1 70B | 70B | 128K | Dense |
| Llama 3.1 8B | 8B | 128K | Dense |
Dense vs MoE

Llama 3.1 used dense architecture to "maximize training stability." Llama 4 switched to MoE for efficiency. The industry consensus has shifted: MoE is now the default for frontier models.

MiniMax

MiniMax is a Shanghai-based AI company that pioneered Lightning Attention — a linear-complexity attention mechanism enabling 4M-token context. Their models match Claude Opus on coding benchmarks at 1/20th the cost.

Core innovation

Lightning Attention splits attention into intra-block (dense) and inter-block (linear) operations. Complexity drops from O(n²) to O(nd² + nBd). Result: 4M context at constant memory per token.

MiniMax-M2.5 (February 2026)

MiniMax-M2.5 Coding

First open model to match Claude Opus on SWE-Bench. 80% of internal MiniMax tasks now completed by M2.5.

Architecture

  • Total params: 230B (MoE)
  • Active params: 10B per token
  • Context: 200K tokens
  • Output: 57.3 tokens/sec
  • License: MIT

Benchmark Performance

| Benchmark | M2.5 | Claude Opus 4.6 |
|---|---|---|
| SWE-Bench Verified | 80.2% | 80.8% |
| Multi-SWE-Bench | 51.3% | 50.3% |
| Cost per task | $0.15 | $3.00 |

MiniMax-01 Series (January 2025)

MiniMax-Text-01 4M Context

First model to demonstrate Lightning Attention at scale. 100% Needle-in-Haystack at 4M tokens.

Architecture

  • Total params: 456B
  • Active params: 45.9B per token
  • Experts: 32 (Top-2 routing)
  • Layers: 80 (7 Lightning + 1 Softmax per 8)
  • Hidden size: 6144
  • MoE hidden: 9216
  • Attention heads: 64 × 128 dim
  • RoPE base: 10,000,000 (for extreme context)
  • Vocab: 200,064 tokens
  • Context: 1M train → 4M inference

Benchmark Performance

| Benchmark | Score |
|---|---|
| MMLU | 88.5% |
| GSM8K | 94.8% |
| Needle-in-Haystack (4M) | 100% |

MiniMax-M1 (June 2025)

MiniMax-M1 Reasoning

First open-weight hybrid-attention reasoning model. Uses 25% of DeepSeek R1's FLOPs at 100K tokens.

Training Cost

| Metric | M1 | DeepSeek R1 |
|---|---|---|
| Hardware | 512 H800s × 3 weeks | 2.79M H800 hours |
| Cost | $534K | $5.6M |
| FLOPs at 100K gen | 25% | 100% |
Why Lightning Attention matters

Standard attention is O(n²): doubling the context quadruples the compute. Lightning Attention is O(n): doubling the context merely doubles it. At 4M tokens, this is the difference between "impossible" and "affordable".

Lightning Attention Architecture

LIGHTNING ATTENTION
Intra-block: Dense matmul within small blocks (uses Tensor Cores)
O_intra = [(Q · K^T) ⊙ M] · V, where M is the causal decay mask
Inter-block: Linear approximation between blocks
Rolling accumulator, minimal state transfer
Complexity: O(nd² + nBd) — linear in sequence length
Hybrid: 7 Lightning + 1 Softmax per 8 layers
Softmax layers preserve fine-grained interactions
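The intra/inter-block split above can be sketched in numpy (a simplified sketch: no decay mask, no interleaved softmax layers) and checked against the quadratic form of the same linear attention:

```python
import numpy as np

def naive_linear_attention(Q, K, V):
    """O(n^2) reference: out[t] = sum_{s<=t} (q_t . k_s) v_s (no softmax)."""
    n = len(Q)
    scores = (Q @ K.T) * np.tril(np.ones((n, n)))   # causal mask over all pairs
    return scores @ V

def blockwise_linear_attention(Q, K, V, B=4):
    """Lightning-style split: dense matmul within each block, plus a rolling
    K^T V accumulator between blocks. Linear in sequence length n."""
    n, d = Q.shape
    out = np.zeros_like(V)
    S = np.zeros((d, V.shape[1]))                   # state from earlier blocks
    for start in range(0, n, B):
        q, k, v = Q[start:start+B], K[start:start+B], V[start:start+B]
        mask = np.tril(np.ones((len(q), len(q))))   # causal mask inside block
        intra = ((q @ k.T) * mask) @ v              # dense path (Tensor Cores)
        inter = q @ S                               # linear path: one matmul
        out[start:start+B] = intra + inter
        S += k.T @ v                                # roll the state forward
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((12, 8)) for _ in range(3))
assert np.allclose(naive_linear_attention(Q, K, V),
                   blockwise_linear_attention(Q, K, V, B=4))
```

The inter-block state S has fixed size d×d regardless of sequence length, which is why memory per token stays constant at 4M-token context.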

Model Links

MiniMax Models
M2.5 — Coding flagship. 230B/10B MoE. MIT license.
M1-80k — Reasoning. 456B/46B. Apache 2.0.
MiniMax-01 Paper — Lightning Attention architecture details.
MiniMax-M1 Paper — Test-time compute scaling.

Mixture of Experts Architecture

MoE is the defining architecture of 2025-2026 frontier models. It enables massive parameter counts with efficient inference.

How MoE Works

Sparse Expert Routing

Instead of using all parameters for every token, MoE routes each token to a subset of "expert" networks.

Basic Components

  • Router network: Small network that decides which experts to use
  • Expert FFNs: Multiple parallel feedforward networks
  • Top-K selection: Only K experts (typically 2-8) process each token

Example: DeepSeek R1

  • 256 routed experts per layer
  • 8 experts selected per token (3.125% activation)
  • 1 shared expert always active
  • Result: 671B params, 37B activated
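The routing scheme above can be sketched in a few lines (illustrative numpy with toy dimensions; real routers also need load balancing and batched dispatch):

```python
import numpy as np

def moe_forward(x, router_W, experts, shared_expert, k=2):
    """Route one token through a top-k MoE layer (sketch).
    x: (d,) hidden state; router_W: (d, n_experts); experts: list of callables;
    shared_expert: always-active callable, as in DeepSeek's design."""
    logits = x @ router_W                      # routing scores, one per expert
    topk = np.argsort(logits)[-k:]             # indices of the k best experts
    gates = np.exp(logits[topk] - logits[topk].max())
    gates /= gates.sum()                       # softmax over selected experts
    out = shared_expert(x)                     # shared expert: always on
    for g, i in zip(gates, topk):              # only k of n_experts run
        out = out + g * experts[i](x)
    return out

rng = np.random.default_rng(0)
d, n_experts = 16, 8
make = lambda: (lambda W: (lambda x: x @ W))(rng.standard_normal((d, d)) * 0.1)
experts = [make() for _ in range(n_experts)]
shared = make()
router_W = rng.standard_normal((d, n_experts))
y = moe_forward(rng.standard_normal(d), router_W, experts, shared, k=2)
```

With k=2 of 8 experts plus the shared one, only 3 of 9 FFNs execute per token, which is the whole efficiency argument in miniature.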

MoE Comparison Across Models

| Model | Total | Active | Experts | Routing |
|---|---|---|---|---|
| Qwen 3.5 | 397B | 17B | n/a | Gated Delta Networks |
| DeepSeek R1 | 671B | 37B | 256+1 | Top-8 + shared |
| MiniMax-01 | 456B | 46B | 32 | Top-2 + Lightning Attention |
| MiniMax M2.5 | 230B | 10B | n/a | Sparse MoE |
| Llama 4 Maverick | 400B | 17B | 128 | Sparse |
| Llama 4 Scout | 109B | 17B | 16 | Sparse |
| Mixtral 8x7B | 47B | 13B | 8 | Top-2 |

Why MoE?

Advantages
✓ Higher capacity at same inference cost
✓ Better scaling — add experts cheaply
✓ Specialization — experts can learn different skills
✓ Training efficiency — more params without proportional compute
Challenges
✗ Load balancing — need even expert utilization
✗ Memory — all experts must fit in memory
✗ Training stability — requires careful tuning
✗ MFU — harder to achieve high utilization

nanochat: Build Your Own LLM for $100

Full nanochat Deep Dive

For complete architecture details, codebase map, and implementation guide, see the dedicated /nanochat page.

Andrej Karpathy's nanochat (October 2025) is the simplest complete harness for training LLMs from scratch. It covers every stage: tokenization, pretraining, SFT, RL, evaluation, and chat UI. Train GPT-2 capability for ~$72 in 3 hours.

Core philosophy

One dial controls everything: depth. Set --depth=26 for GPT-2 capability. All other hyperparameters (width, heads, learning rate, batch size, weight decay) are computed automatically using scaling laws.

What nanochat Includes

COMPLETE TRAINING PIPELINE
1. Tokenization: Rust BPE tokenizer (high performance)
2. Pretraining: FineWeb-EDU dataset
3. Mid-training: SmolTalk + MMLU aux + GSM8K with tool use tags
4. SFT: Supervised fine-tuning
5. RL: Simplified GRPO on GSM8K (optional)
6. Evaluation: DCLM CORE score
7. Inference: Engine with KV cache, Python interpreter
8. Chat UI: Talk to your model like ChatGPT

Cost & Performance

| Config | Hardware | Time | Cost | Result |
|---|---|---|---|---|
| Speedrun | 8×H100 | 3 hours | ~$72 | GPT-2 capability |
| Robust | 8×H100 | 42 hours | ~$1000 | Better quality |

In 2019, training GPT-2 cost ~$43,000. nanochat achieves the same for 0.2% of the cost.

Architecture Principles

GPT Architecture

Modern transformer with carefully selected components:

  • Rotary Embeddings (RoPE) — no learned position embeddings
  • QK Normalization — stabilizes attention scores
  • Group-Query Attention (GQA) — memory-efficient attention
  • ReLU² activation — relu(x)² in MLP for sparse, efficient computation
  • RMSNorm — no learnable params, faster than LayerNorm
  • No bias — removed from all linear layers
  • Untied embeddings — separate token embedding and lm_head weights
  • Flash Attention 3 — H100-optimized with sliding window support
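As a sketch of why RoPE needs no learned position table: rotating query/key coordinate pairs by position-dependent angles makes attention scores depend only on the relative offset between positions (illustrative numpy, single head vector):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary position embedding for one head vector (sketch).
    Dimension pairs (x_i, x_{i+d/2}) are rotated by pos * base^(-i/(d/2))."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequency
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
# Rotation preserves norms...
assert np.isclose(np.linalg.norm(rope(q, 5)), np.linalg.norm(q))
# ...and q.k scores depend only on the relative offset (here 3 in both cases):
assert np.isclose(rope(q, 10) @ rope(k, 7), rope(q, 103) @ rope(k, 100))
```

The relative-offset property is also what makes context extension via RoPE base/frequency scaling possible.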
Muon Optimizer

Combined optimizer: Muon for matrix params, AdamW for embeddings/scalars.

  • Muon (MomentUm Orthogonalized by Newton-Schulz)
    • Standard SGD-momentum + orthogonalization post-processing
    • Polar Express iteration for fast orthogonalization in bfloat16
    • NorMuon variance reduction — per-neuron adaptive learning rate
    • Cautious weight decay — only decays weights aligned with gradient
  • Distributed (ZeRO-2 style): Reduce-scatter → compute → all-gather
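The orthogonalization step can be sketched with the classic cubic Newton-Schulz iteration (a simplification: nanochat's Polar Express variant is a faster quintic iteration run in bfloat16):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=10):
    """Approximate the nearest orthogonal matrix to G -- the post-processing
    Muon applies to the SGD-momentum update of each matrix parameter.
    Cubic iteration: X <- 1.5 X - 0.5 X X^T X."""
    X = G / np.linalg.norm(G)        # Frobenius-normalize so singular values < 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# Well-conditioned test matrix with known singular values
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((6, 6)))
Vt, _ = np.linalg.qr(rng.standard_normal((6, 6)))
G = U @ np.diag([1.0, 0.9, 0.8, 0.7, 0.6, 0.5]) @ Vt
O = newton_schulz_orthogonalize(G)
assert np.allclose(O @ O.T, np.eye(6), atol=1e-4)   # near-orthogonal result
```

Each iteration pushes every singular value toward 1, so the update's direction per neuron is kept while its conditioning is discarded.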

Scaling Laws Applied

nanochat automatically calculates optimal hyperparameters from depth using empirically-derived scaling laws:

| Hyperparameter | Formula | Reference |
|---|---|---|
| Model dim | depth × aspect_ratio (default 64) | Width ∝ depth |
| Training tokens | target_param_data_ratio × params (default 10.5×) | Chinchilla ~20× |
| Batch size | B_ref × (D/D_ref)^0.383 | Power Lines |
| Learning rate | η_ref × √(B/B_ref) | AdamW scaling |
| Weight decay | λ_ref × √(B/B_ref) × (D_ref/D) | T_epoch paper |
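Sketching how the single depth dial expands into a full configuration (the reference constants B_ref, D_ref, lr_ref, wd_ref below are illustrative placeholders, not nanochat's actual values):

```python
import math

def hparams_from_depth(depth, aspect_ratio=64, data_ratio=10.5,
                       B_ref=524288, D_ref=3.2e9, lr_ref=0.02, wd_ref=0.1):
    """Derive hyperparameters from depth using the scaling-law formulas above.
    Reference constants are placeholders for illustration."""
    model_dim = depth * aspect_ratio
    params = 12 * depth * model_dim ** 2          # rough transformer param count
    tokens = data_ratio * params                  # Chinchilla-style data budget
    batch = B_ref * (tokens / D_ref) ** 0.383     # Power Lines batch scaling
    lr = lr_ref * math.sqrt(batch / B_ref)        # sqrt-scaling for AdamW
    wd = wd_ref * math.sqrt(batch / B_ref) * (D_ref / tokens)
    return dict(model_dim=model_dim, params=params, tokens=tokens,
                batch_tokens=batch, lr=lr, weight_decay=wd)

h = hparams_from_depth(depth=26)   # the speedrun setting
# depth=26 -> model_dim 1664, roughly 0.86B parameters
```

The point is the coupling: touch depth, and data budget, batch size, learning rate, and weight decay all move consistently.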

Leaderboard: GPT-2 Speedrun

Community competition to train GPT-2 capability as fast as possible. Target: beat GPT-2 (1.6B) CORE score of 0.256525.

nanochat Resources
GitHub — Main repository. ~8000 lines of clean code.
Discussion — Karpathy's intro and community Q&A.
MarkTechPost — Overview article.

nanoGPT: The Original Minimal GPT

nanoGPT (2022) is Karpathy's original minimal GPT implementation. While superseded by nanochat, it remains valuable for learning and as a baseline.

Deprecated but foundational
  • nanoGPT is now deprecated — use nanochat for new projects
  • Still valuable for learning the basics of GPT training
  • Spawned the nanoGPT speedrun competition

Code Structure

The entire codebase fits in two files:

| File | Lines | Purpose |
|---|---|---|
| model.py | ~300 | GPT model definition |
| train.py | ~300 | Training loop |

Training Performance

Evolution

KARPATHY'S LLM JOURNEY
minGPT (2020) — Educational, prioritizes readability
nanoGPT (2022) — Minimal but fast, "teeth over education"
llm.c (2024) — Pure C/CUDA implementation
nanochat (2025) — Full stack: pretrain to chat UI
nanoGPT Resources
nanoGPT — Original repo. Good for learning.
build-nanogpt — Video lecture on building from scratch.
llm.c — Pure C/CUDA implementation.
Zero to Hero — Karpathy's video series (YouTube).

Training From Scratch

A practical guide to training your own LLM using open source tools.

Decision: What to Train?

Which approach should I use?
  • Learning: Start with nanoGPT or nanochat tutorials
  • Production small model: Fine-tune Qwen3-4B or Llama 3.1 8B
  • Reasoning task: Try R1-Distill models or GRPO training
  • Custom from scratch: Use nanochat with your data
  • Research: Explore TRM-style recursive architectures

Training Stack

Fine-Tuning Tools
PEFT — HuggingFace. LoRA, QLoRA, adapters. Most accessible.
Unsloth — 2× faster fine-tuning. 70% less memory.
Axolotl — Config-based fine-tuning. Many methods.
LitGPT — Lightning AI. Pretrain + fine-tune.
RL Training
TRL — HuggingFace. PPO, DPO, GRPO, KTO, ORPO.
OpenRLHF — Ray + vLLM. Scales to 70B+.
veRL — ByteDance. GRPO, VAPO. Scales to 671B.
From-Scratch Training
nanochat — Full pipeline. $100 for GPT-2 capability.
llm.c — Pure C/CUDA. No Python dependencies.
torchtitan — PyTorch native. Llama architecture.
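Most of the fine-tuning tools above build on LoRA; the core idea fits in a few lines (illustrative numpy, hypothetical dimensions):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """LoRA sketch: freeze W, learn a low-rank update.
    y = x W^T + (alpha/r) * x A^T B^T, with A: (r, d_in), B: (d_out, r)."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 1024, 1024, 8
W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init
x = rng.standard_normal(d_in)
y = lora_forward(x, W, A, B)
assert np.allclose(y, x @ W.T)              # zero-init B: model starts unchanged
trainable = A.size + B.size                 # 16,384 vs 1,048,576 frozen
assert trainable / W.size < 0.02            # <2% of the full weight's params
```

The zero-initialized B matrix is the standard trick: training starts from exactly the pretrained model, and only the tiny A/B factors receive gradients.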

Learning Path

Recommended learning path
1. Watch "Let's build GPT" (2h): Karpathy builds a GPT from scratch. Best introduction.
2. Run nanoGPT on Shakespeare (30 min): Train your first model. Even works on CPU.
3. Fine-tune Qwen3-4B with Unsloth (1h): Learn LoRA fine-tuning on a real model.
4. Try DPO with TRL (2h): Add preference alignment to your model.
5. Run nanochat speedrun ($72): Train GPT-2 capability from scratch.

Key Papers

| Paper | Year | Key Contribution |
|---|---|---|
| Qwen3 Technical Report | 2025 | Complete training methodology for frontier open model |
| DeepSeek-R1 | 2025 | GRPO, emergent reasoning from pure RL |
| Chinchilla | 2022 | Optimal data/param ratio (~20×) |
| LoRA | 2021 | Parameter-efficient fine-tuning |
| DPO | 2023 | Reward-free preference learning |
| QLoRA | 2023 | 4-bit quantization + LoRA. 70B on single GPU. |

Model Distillation

Distillation transfers knowledge from large "teacher" models to smaller "student" models, achieving strong performance at lower inference cost.

DeepSeek R1 Distillation to Qwen

DeepSeek-R1-Distill models are fine-tuned on samples generated by DeepSeek-R1, transferring advanced reasoning capabilities to smaller architectures.

R1-Distill Models DeepSeek

Available Qwen-Based Models

| Model | Base | Params | Notable Result |
|---|---|---|---|
| R1-Distill-Qwen-1.5B | Qwen2.5-1.5B | 1.5B | Edge-deployable reasoning |
| R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 7B | Outperforms QwQ-32B-Preview |
| R1-Distill-Qwen-14B | Qwen2.5-14B | 14B | Strong reasoning transfer |
| R1-Distill-Qwen-32B | Qwen2.5-32B | 32B | Outperforms OpenAI o1-mini |

Distillation Process

  • Data generation: R1 generates chain-of-thought reasoning traces
  • SFT: Student fine-tuned on teacher CoT outputs
  • Optional KL loss: Temperature-scaled KL divergence for soft targets
  • Result: 7B model matches/beats specialized 32B reasoning models
Key finding

DeepSeek-R1-Distill-Qwen-7B outperforms QwQ-32B-Preview despite being 4.5× smaller. The distillation captures reasoning patterns, not just final answers.
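The SFT-plus-optional-KL recipe can be sketched as a combined loss (illustrative numpy, not DistillKit's API):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, target,
                 w_ce=0.5, w_kl=0.5, T=1.0):
    """Distillation sketch: hard-label cross-entropy plus temperature-scaled
    KL divergence toward the teacher's output distribution."""
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    ce = -np.log(softmax(student_logits)[target] + 1e-12)   # hard label term
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))) * T * T
    return w_ce * ce + w_kl * kl

student = np.array([2.0, 0.5, -1.0])   # toy per-token logits
teacher = np.array([3.0, 0.2, -2.0])
loss = distill_loss(student, teacher, target=0)
# A student that already matches the teacher pays no KL penalty:
assert distill_loss(teacher, teacher, 0) < distill_loss(student, teacher, 0)
```

The KL term is what transfers the teacher's full output distribution (its "soft targets") rather than just its top answer, which is why logit-based distillation outperforms plain SFT on teacher outputs.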

Distillation Tools

Production-Ready Toolkits
DistillKit — Arcee AI. Logit + hidden state distillation. Pre-captured Qwen3-235B datasets.
LLaMA-Factory — Unified fine-tuning. LoRA, QLoRA, DPO, KTO support.
Axolotl — Config-based training. Community-driven defaults.
EasyDistill — ModelScope toolkit for LLM knowledge distillation.

DIY Distillation with DistillKit

Example configuration for distilling from Qwen3-235B:

# config.yaml
project_name: qwen-distill
model: Qwen/Qwen3-8B
output_path: ./output
sequence_length: 8192

dataset:
  train_dataset:
    repo_id: arcee-ai/Qwen3-235B-Logits-Packed-8192
    split: train
    prepacked: true

teacher:
  kind: dataset  # Pre-captured logits

loss_functions:
  - function: cross_entropy
    weight: 0.5
  - function: kl
    weight: 0.5
    temperature: 1.0

Distillation Methods Compared

Logit-Based
✓ Transfers output distribution
✓ Better benchmark scores
✓ Can use pre-captured data
✗ Large storage for logits
Hidden-State
✓ Transfers "thinking process"
✓ Better generalization
✗ Requires architecture matching
✗ More complex to implement

Quantization

Quantization reduces model precision (FP32 → INT8/INT4) for faster inference on consumer hardware.

Methods Compared

| Method | Quality | Best For | Ecosystem |
|---|---|---|---|
| AWQ | ~95% | GPU inference, creative tasks | vLLM, TGI |
| GGUF | ~92% | CPU/Apple Silicon, Ollama | llama.cpp, LM Studio |
| GPTQ | ~90% | Raw CUDA throughput | ExLlama, AutoGPTQ |
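A minimal sketch of naive symmetric 4-bit group quantization (illustrative numpy; real GPTQ/AWQ pipelines exist precisely to reduce the error this naive rounding introduces):

```python
import numpy as np

def quantize_int4(W, group_size=32):
    """Symmetric 4-bit group quantization sketch: each group of weights
    shares one FP scale; values round to integers in [-8, 7]."""
    flat = W.reshape(-1, group_size)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0   # per-group scale
    q = np.clip(np.round(flat / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int4(W)
W_hat = dequantize(q, scale, W.shape)
err = np.abs(W - W_hat).mean() / np.abs(W).mean()
assert err < 0.2   # naive symmetric int4 loses roughly 10% mean relative error
# 4-bit ints + one scale per 32 weights is ~4.5 bits/weight vs 32 for FP32
```

That residual error is the motivation for the AWQ-style activation-aware scaling mentioned in the tip below: protecting the weights that matter most to activations recovers most of the lost quality.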

Quantizing Qwen3

Official Qwen documentation covers GPTQ, AWQ, and GGUF quantization:

# GGUF with llama.cpp
python convert-hf-to-gguf.py Qwen/Qwen3-8B \
  --outfile Qwen3-8B-F16.gguf \
  --outtype f16

# Quantize to Q4_K_M
./llama-quantize Qwen3-8B-F16.gguf Qwen3-8B-Q4_K_M.gguf Q4_K_M
AWQ scale tip

To improve quantized model quality, apply AWQ scale before GGUF conversion. When running model.quantize() with AutoAWQ, add export_compatible=True.

Pre-Quantized Models

Qwen3 models are available pre-quantized on HuggingFace and LM Studio. Official quantization guide →

Source Code Analysis

Deep dives into model architectures and training codebases.

DeepSeek V3/R1 Architecture

Multi-head Latent Attention (MLA) DeepSeek Innovation

MLA compresses key-value tensors into lower-dimensional latent space, dramatically reducing KV cache memory:

  • Compression: h_t → latent vector with dimension d_c ≪ (n_h × d_h)
  • Recovery: Map latent back to full dimension for attention computation
  • Parameters: n_h=128 heads, d_h=128 per-head dim, d_c=512 KV compression
  • RoPE handling: "Decoupled RoPE" to maintain position sensitivity

Code location: inference/model.py:396
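The compression arithmetic can be checked directly, alongside a tiny numeric sketch of the compress-then-recover path (illustrative numpy with random weights and toy dimensions; the real model also adds decoupled RoPE dimensions on top of the latent):

```python
import numpy as np

def mla_cache_ratio(n_h, d_h, d_c):
    """Per-token KV-cache compression: full K+V (2 * n_h * d_h floats)
    vs one shared latent of dimension d_c."""
    return (2 * n_h * d_h) / d_c

# DeepSeek's published numbers: 128 heads x 128 dims, d_c = 512
assert mla_cache_ratio(128, 128, 512) == 64.0   # 64x smaller KV cache

# Toy compress -> recover sketch (random weights, illustrative only)
rng = np.random.default_rng(0)
n_h, d_h, d_c, seq = 4, 8, 16, 10
d_model = n_h * d_h
W_down = rng.standard_normal((d_model, d_c))    # compression projection
W_uk = rng.standard_normal((d_c, d_model))      # recover keys
W_uv = rng.standard_normal((d_c, d_model))      # recover values
h = rng.standard_normal((seq, d_model))
latent = h @ W_down            # cached: (seq, d_c) instead of (seq, 2*d_model)
K, V = latent @ W_uk, latent @ W_uv   # reconstructed when attention runs
assert K.shape == (seq, d_model) and V.shape == (seq, d_model)
```

Only the latent is cached per token; K and V are rematerialized on the fly, trading a little extra compute for a much smaller cache at long context.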

DeepSeekMoE Routing Sparse Experts
  • Fine-grained experts: More experts than standard MoE for better knowledge decomposition
  • Shared experts: Some experts always active for common patterns
  • All-experts layers: 3 layers activate all experts for cross-expert knowledge sharing
  • Auxiliary-loss-free balancing: Novel load balancing without explicit loss term

Qwen3 Architecture

Qwen3 is integrated into HuggingFace transformers. Key implementation files:

Code Locations
modeling_qwen3.py — HuggingFace model implementation
QwenLM/Qwen3 — Official repository with examples
arXiv:2505.09388 — Technical report with full specs

nanochat Architecture

Karpathy's nanochat distills modern transformer innovations into ~300 lines. Full deep dive →

| Component | nanochat Choice | Why |
|---|---|---|
| Positions | RoPE | O(1) memory, length extrapolation |
| Norm | RMSNorm (no params) | 30% faster than LayerNorm |
| Activation | ReLU² | 15% faster than GELU |
| Attention | QK-norm + GQA | Stability + efficiency |
| Optimizer | Muon + AdamW | 35% faster training |