Open Source LLMs

Architecture, internals, and training pipelines for frontier open models
Last updated: 2026-02-17

In one sentence

Open source LLMs now match or exceed proprietary models — MiniMax M2.5 matches Claude Opus on SWE-Bench at 1/20th cost, while MiniMax-01's Lightning Attention enables 4M-token context at linear complexity.

2026 Model Landscape

The open source AI ecosystem has exploded. Models that were unthinkable two years ago are now freely available under Apache 2.0 or similar licenses.

Frontier Open Models

| Model | Params | Active | Context | Release | License |
|---|---|---|---|---|---|
| Qwen 3.5 | 397B | 17B | 1M | Feb 2026 | Apache 2.0 |
| MiniMax M2.5 | 230B | 10B | 200K | Feb 2026 | MIT |
| DeepSeek R1 | 671B | 37B | 128K | Jan 2025 | MIT |
| MiniMax-01 | 456B | 46B | 4M | Jan 2025 | Apache 2.0 |
| Llama 4 Maverick | 400B | 17B | 1M | Apr 2025 | Llama 4 |
| Llama 4 Scout | 109B | 17B | 10M | Apr 2025 | Llama 4 |
| Qwen3 235B | 235B | 22B | 256K | May 2025 | Apache 2.0 |
The efficiency revolution

All frontier open models now use Mixture of Experts (MoE). Qwen 3.5 has 397B total params but activates only 17B per token — getting 400B-level capability at 17B inference cost.

Key trends (2025-2026)

4 numbers that matter

  • $5.6M: DeepSeek V3 training cost (vs $100M+ for GPT-4)
  • $534K: MiniMax M1 training cost (10× cheaper than DeepSeek R1)
  • 80.2%: MiniMax M2.5 SWE-Bench score (within 0.6% of Claude Opus at 1/20th cost)
  • $72: Cost to train GPT-2 capability with nanochat

Qwen (Alibaba)

Qwen is Alibaba's flagship open source LLM series. Qwen3 and Qwen 3.5 represent the current state of the art in open models.

Qwen 3.5 (February 2026)

Qwen 3.5-397B-A17B Latest

Alibaba's multimodal MoE flagship. 397B params, 17B activated per token.

Architecture

  • Total params: 397 billion
  • Activated params: 17 billion per token
  • Architecture: Sparse MoE with Gated Delta Networks for linear-complexity attention
  • Context: 1M tokens (hosted), extendable via RoPE scaling
  • Vocabulary: 250K tokens (up from 152K in Qwen3)
  • Languages: 201 languages and dialects (69% increase over Qwen3)

Multimodal Design

  • Native multimodal: Text, image, video fused from first pretraining stage
  • Early fusion: Image patches injected directly into transformer layers (not adapter-based)
  • Resolution: Up to 1344×1344 pixels, pixel-perfect UI detection

Performance

| Benchmark | Score | Notes |
|---|---|---|
| MMLU-Pro | 87.8% | Knowledge/reasoning |
| MathVista | 90.3% | Visual math |
| MMMU | 85.0% | Multimodal understanding |

Efficiency

  • 8.6× faster decoding at 32K context vs Qwen3-Max
  • 19× faster at 256K context
  • 60% lower cost vs predecessors at equivalent capability
The vocabulary multiplier

Qwen 3.5's 250K vocabulary encodes Chinese, math symbols, and code more compactly than prior models. Fine-tuners report 15-25% lower token counts on technical datasets — meaning lower inference costs.

Qwen3 Family (May 2025)

| Model | Type | Params | Activated | Context |
|---|---|---|---|---|
| Qwen3-235B-A22B | MoE | 235B | 22B | 256K→1M |
| Qwen3-30B-A3B | MoE | 30B | 3B | 256K |
| Qwen3-32B | Dense | 32B | 32B | 128K |
| Qwen3-14B | Dense | 14B | 14B | 128K |
| Qwen3-8B | Dense | 8B | 8B | 128K |
| Qwen3-4B | Dense | 4B | 4B | 32K |
| Qwen3-1.7B | Dense | 1.7B | 1.7B | 32K |
| Qwen3-0.6B | Dense | 0.6B | 0.6B | 32K |

Key Innovations

Thinking Mode Unification

Qwen3 integrates thinking mode (complex, multi-step reasoning) and non-thinking mode (rapid, context-driven) into a unified framework.

  • Single model: No need to switch between separate reasoning and chat models
  • Thinking budget: Users can allocate compute adaptively per task
  • Latency/performance trade-off: Control via inference-time parameter
Qwen Model Links
Qwen3-235B — Flagship MoE. 22B activated.
Qwen3-4B — Best for fine-tuning. ARC Prize base.
Qwen3-8B — Matches Qwen2.5-14B. Great multilingual.
Technical Report — Full architecture details. arXiv.

DeepSeek

DeepSeek is a Chinese AI lab that shocked the industry with efficient training and novel MoE architecture. DeepSeek R1 demonstrated that reasoning capabilities can emerge from pure RL.

DeepSeek R1 (January 2025)

DeepSeek-R1-671B Reasoning

Architecture

  • Total params: 671 billion
  • Activated params: 37 billion per token
  • Experts: 1 shared expert (always active) + 256 routed experts (8 active per token)
  • Layers: 61 transformer layers
  • Context: 128K tokens

Training Methodology

  • R1-Zero: Pure RL without SFT — reasoning emerged naturally
  • GRPO: Group Relative Policy Optimization (no value network)
  • Reward: Simple regex-based format + accuracy rewards
  • Emergent behaviors: Self-verification, reflection, "aha moments"
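The group-relative advantage at the heart of GRPO can be sketched in a few lines (illustrative numpy, not DeepSeek's implementation): each prompt is sampled several times, and each completion's advantage is its reward z-scored within that group, so no learned value network is needed.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO advantage sketch: rewards for G completions of the same prompt
    are normalized against the group itself -- no value network."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)   # z-score within the group

# One prompt, 4 sampled answers: two correct (reward 1), two wrong (reward 0)
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# Correct completions get positive advantage, wrong ones negative,
# and the group advantage sums to zero by construction.
```

The policy gradient then upweights tokens from above-average completions and downweights the rest, with the simple regex/accuracy reward supplying the scalars.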

Training Cost

| Metric | DeepSeek V3 | GPT-4 (est.) |
|---|---|---|
| GPU hours | 2.79M H800 | ~20M+ H100 |
| Cost | $5.6M | $50-100M |
| MFU | 23% | ~40% |
The "Aha Moment"

During R1-Zero training, models spontaneously developed self-reflection: "Wait, let me reconsider..." — emerging purely from RL without being taught. This suggests reasoning can be discovered, not just imitated.

DeepSeek R1 Distilled Models

R1's reasoning capabilities distilled into smaller, efficient models:

| Model | Base | Params | Notes |
|---|---|---|---|
| R1-Distill-Qwen-1.5B | Qwen | 1.5B | Edge-deployable reasoning |
| R1-Distill-Qwen-7B | Qwen | 7B | Strong reasoning transfer |
| R1-Distill-Llama-8B | Llama | 8B | Llama base + R1 reasoning |

DeepSeek V3 Architecture Innovations

MoE ARCHITECTURE
Shared expert: 1 always-activated expert per layer
Captures common patterns, reduces routing overhead
Routed experts: 256 experts, 8 selected per token
3.125% activation ratio (8/256)
All-experts layers: 3 layers activate all experts
Enables cross-expert knowledge sharing

Llama (Meta)

Llama is Meta's open model family that democratized large language models. Llama 4 (2025) marked Meta's shift to MoE architecture.

Llama 4 (April 2025)

Llama 4 Scout & Maverick MoE

Model Variants

| Model | Total | Active | Experts | Context |
|---|---|---|---|---|
| Llama 4 Scout | 109B | 17B | 16 | 10M |
| Llama 4 Maverick | 400B | 17B | 128 | 1M |
| Llama 4 Behemoth* | ~2T | 288B | 16 | n/a |

*Behemoth announced but not released; Maverick was distilled from it.

Key Features

  • 10M context: Scout supports 10 million tokens — orders of magnitude beyond prior LLMs
  • Multimodal: Text and image input, text output
  • Multilingual: 12 languages supported
  • MoE architecture: Both Scout and Maverick use sparse experts

Llama 3.1 (July 2024)

Dense architecture, still highly relevant for fine-tuning and deployment:

| Model | Params | Context | Architecture |
|---|---|---|---|
| Llama 3.1 405B | 405B | 128K | Dense |
| Llama 3.1 70B | 70B | 128K | Dense |
| Llama 3.1 8B | 8B | 128K | Dense |
Dense vs MoE

Llama 3.1 used dense architecture to "maximize training stability." Llama 4 switched to MoE for efficiency. The industry consensus has shifted: MoE is now the default for frontier models.

MiniMax

MiniMax is a Shanghai-based AI company that pioneered Lightning Attention — a linear-complexity attention mechanism enabling 4M-token context. Their models match Claude Opus on coding benchmarks at 1/20th the cost.

Core innovation

Lightning Attention splits attention into intra-block (dense) and inter-block (linear) operations. Complexity drops from O(n²) to O(nd² + nBd). Result: 4M context at constant memory per token.

MiniMax-M2.5 (February 2026)

MiniMax-M2.5 Coding

First open model to match Claude Opus on SWE-Bench. 80% of internal MiniMax tasks now completed by M2.5.

Architecture

  • Total params: 230B (MoE)
  • Active params: 10B per token
  • Context: 200K tokens
  • Output: 57.3 tokens/sec
  • License: MIT

Benchmark Performance

| Benchmark | M2.5 | Claude Opus 4.6 |
|---|---|---|
| SWE-Bench Verified | 80.2% | 80.8% |
| Multi-SWE-Bench | 51.3% | 50.3% |
| Cost per task | $0.15 | $3.00 |

MiniMax-01 Series (January 2025)

MiniMax-Text-01 4M Context

First model to demonstrate Lightning Attention at scale. 100% Needle-in-Haystack at 4M tokens.

Architecture

  • Total params: 456B
  • Active params: 45.9B per token
  • Experts: 32 (Top-2 routing)
  • Layers: 80 (7 Lightning + 1 Softmax per 8)
  • Hidden size: 6144
  • MoE hidden: 9216
  • Attention heads: 64 × 128 dim
  • RoPE base: 10,000,000 (for extreme context)
  • Vocab: 200,064 tokens
  • Context: 1M train → 4M inference

Benchmark Performance

| Benchmark | Score |
|---|---|
| MMLU | 88.5% |
| GSM8K | 94.8% |
| Needle-in-Haystack (4M) | 100% |

MiniMax-M1 (June 2025)

MiniMax-M1 Reasoning

First open-weight hybrid-attention reasoning model. Uses 25% of DeepSeek R1's FLOPs at 100K tokens.

Training Cost

| Metric | M1 | DeepSeek R1 |
|---|---|---|
| Hardware | 512 H800s × 3 weeks | 2.79M H800 hours |
| Cost | $534K | $5.6M |
| FLOPs at 100K gen | 25% | 100% |
Why Lightning Attention matters

Standard attention is O(n²): doubling the context quadruples the compute. Lightning Attention is O(n): doubling the context merely doubles it. At 4M tokens, this is the difference between "impossible" and "affordable".

Lightning Attention Architecture

LIGHTNING ATTENTION
Intra-block: Dense matmul within small blocks (uses Tensor Cores)
O_intra = [(Q · K^T) ⊙ M] · V, where M is the causal decay mask
Inter-block: Linear approximation between blocks
Rolling accumulator, minimal state transfer
Complexity: O(nd² + nBd) — linear in sequence length
Hybrid: 7 Lightning + 1 Softmax per 8 layers
Softmax layers preserve fine-grained interactions
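The intra/inter-block split above can be sketched in numpy (a simplified sketch: no decay mask, no interleaved softmax layers) and checked against the quadratic form of the same linear attention:

```python
import numpy as np

def naive_linear_attention(Q, K, V):
    """O(n^2) reference: out[t] = sum_{s<=t} (q_t . k_s) v_s (no softmax)."""
    n = len(Q)
    scores = (Q @ K.T) * np.tril(np.ones((n, n)))   # causal mask over all pairs
    return scores @ V

def blockwise_linear_attention(Q, K, V, B=4):
    """Lightning-style split: dense matmul within each block, plus a rolling
    K^T V accumulator between blocks. Linear in sequence length n."""
    n, d = Q.shape
    out = np.zeros_like(V)
    S = np.zeros((d, V.shape[1]))                   # state from earlier blocks
    for start in range(0, n, B):
        q, k, v = Q[start:start+B], K[start:start+B], V[start:start+B]
        mask = np.tril(np.ones((len(q), len(q))))   # causal mask inside block
        intra = ((q @ k.T) * mask) @ v              # dense path (Tensor Cores)
        inter = q @ S                               # linear path: one matmul
        out[start:start+B] = intra + inter
        S += k.T @ v                                # roll the state forward
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((12, 8)) for _ in range(3))
assert np.allclose(naive_linear_attention(Q, K, V),
                   blockwise_linear_attention(Q, K, V, B=4))
```

The inter-block state S has fixed size d×d regardless of sequence length, which is why memory per token stays constant at 4M-token context.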

Model Links

MiniMax Models
M2.5 — Coding flagship. 230B/10B MoE. MIT license.
M1-80k — Reasoning. 456B/46B. Apache 2.0.
MiniMax-01 Paper — Lightning Attention architecture details.
MiniMax-M1 Paper — Test-time compute scaling.

Mixture of Experts Architecture

MoE is the defining architecture of 2025-2026 frontier models. It enables massive parameter counts with efficient inference.

How MoE Works

Sparse Expert Routing

Instead of using all parameters for every token, MoE routes each token to a subset of "expert" networks.

Basic Components

  • Router network: Small network that decides which experts to use
  • Expert FFNs: Multiple parallel feedforward networks
  • Top-K selection: Only K experts (typically 2-8) process each token

Example: DeepSeek R1

  • 256 routed experts per layer
  • 8 experts selected per token (3.125% activation)
  • 1 shared expert always active
  • Result: 671B params, 37B activated
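The routing scheme above can be sketched in a few lines (illustrative numpy with toy dimensions; real routers also need load balancing and batched dispatch):

```python
import numpy as np

def moe_forward(x, router_W, experts, shared_expert, k=2):
    """Route one token through a top-k MoE layer (sketch).
    x: (d,) hidden state; router_W: (d, n_experts); experts: list of callables;
    shared_expert: always-active callable, as in DeepSeek's design."""
    logits = x @ router_W                      # routing scores, one per expert
    topk = np.argsort(logits)[-k:]             # indices of the k best experts
    gates = np.exp(logits[topk] - logits[topk].max())
    gates /= gates.sum()                       # softmax over selected experts
    out = shared_expert(x)                     # shared expert: always on
    for g, i in zip(gates, topk):              # only k of n_experts run
        out = out + g * experts[i](x)
    return out

rng = np.random.default_rng(0)
d, n_experts = 16, 8
make = lambda: (lambda W: (lambda x: x @ W))(rng.standard_normal((d, d)) * 0.1)
experts = [make() for _ in range(n_experts)]
shared = make()
router_W = rng.standard_normal((d, n_experts))
y = moe_forward(rng.standard_normal(d), router_W, experts, shared, k=2)
```

With k=2 of 8 experts plus the shared one, only 3 of 9 FFNs execute per token, which is the whole efficiency argument in miniature.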

MoE Comparison Across Models

| Model | Total | Active | Experts | Routing |
|---|---|---|---|---|
| Qwen 3.5 | 397B | 17B | n/a | Gated Delta Networks |
| DeepSeek R1 | 671B | 37B | 256+1 | Top-8 + shared |
| MiniMax-01 | 456B | 46B | 32 | Top-2 + Lightning Attention |
| MiniMax M2.5 | 230B | 10B | n/a | Sparse MoE |
| Llama 4 Maverick | 400B | 17B | 128 | Sparse |
| Llama 4 Scout | 109B | 17B | 16 | Sparse |
| Mixtral 8x7B | 47B | 13B | 8 | Top-2 |

Why MoE?

Advantages
✓ Higher capacity at same inference cost
✓ Better scaling — add experts cheaply
✓ Specialization — experts can learn different skills
✓ Training efficiency — more params without proportional compute
Challenges
✗ Load balancing — need even expert utilization
✗ Memory — all experts must fit in memory
✗ Training stability — requires careful tuning
✗ MFU — harder to achieve high utilization

nanochat: Build Your Own LLM for $100

Full nanochat Deep Dive

For complete architecture details, codebase map, and implementation guide, see the dedicated /nanochat page.

Andrej Karpathy's nanochat (October 2025) is the simplest complete harness for training LLMs from scratch. It covers every stage: tokenization, pretraining, SFT, RL, evaluation, and chat UI. Train GPT-2 capability for ~$72 in 3 hours.

Core philosophy

One dial controls everything: depth. Set --depth=26 for GPT-2 capability. All other hyperparameters (width, heads, learning rate, batch size, weight decay) are computed automatically using scaling laws.

What nanochat Includes

COMPLETE TRAINING PIPELINE
1. Tokenization: Rust BPE tokenizer (high performance)
2. Pretraining: FineWeb-EDU dataset
3. Mid-training: SmolTalk + MMLU aux + GSM8K with tool use tags
4. SFT: Supervised fine-tuning
5. RL: Simplified GRPO on GSM8K (optional)
6. Evaluation: DCLM CORE score
7. Inference: Engine with KV cache, Python interpreter
8. Chat UI: Talk to your model like ChatGPT

Cost & Performance

| Config | Hardware | Time | Cost | Result |
|---|---|---|---|---|
| Speedrun | 8×H100 | 3 hours | ~$72 | GPT-2 capability |
| Robust | 8×H100 | 42 hours | ~$1000 | Better quality |

In 2019, training GPT-2 cost ~$43,000. nanochat achieves the same for 0.2% of the cost.

Architecture Principles

GPT Architecture

Modern transformer with carefully selected components:

  • Rotary Embeddings (RoPE) — no learned position embeddings
  • QK Normalization — stabilizes attention scores
  • Group-Query Attention (GQA) — memory-efficient attention
  • ReLU² activation — relu(x)² in MLP for sparse, efficient computation
  • RMSNorm — no learnable params, faster than LayerNorm
  • No bias — removed from all linear layers
  • Untied embeddings — separate token embedding and lm_head weights
  • Flash Attention 3 — H100-optimized with sliding window support
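As a sketch of why RoPE needs no learned position table: rotating query/key coordinate pairs by position-dependent angles makes attention scores depend only on the relative offset between positions (illustrative numpy, single head vector):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary position embedding for one head vector (sketch).
    Dimension pairs (x_i, x_{i+d/2}) are rotated by pos * base^(-i/(d/2))."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequency
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
# Rotation preserves norms...
assert np.isclose(np.linalg.norm(rope(q, 5)), np.linalg.norm(q))
# ...and q.k scores depend only on the relative offset (here 3 in both cases):
assert np.isclose(rope(q, 10) @ rope(k, 7), rope(q, 103) @ rope(k, 100))
```

The relative-offset property is also what makes context extension via RoPE base/frequency scaling possible.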
Muon Optimizer

Combined optimizer: Muon for matrix params, AdamW for embeddings/scalars.

  • Muon (MomentUm Orthogonalized by Newton-Schulz)
    • Standard SGD-momentum + orthogonalization post-processing
    • Polar Express iteration for fast orthogonalization in bfloat16
    • NorMuon variance reduction — per-neuron adaptive learning rate
    • Cautious weight decay — only decays weights aligned with gradient
  • Distributed (ZeRO-2 style): Reduce-scatter → compute → all-gather
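The orthogonalization step can be sketched with the classic cubic Newton-Schulz iteration (a simplification: nanochat's Polar Express variant is a faster quintic iteration run in bfloat16):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=10):
    """Approximate the nearest orthogonal matrix to G -- the post-processing
    Muon applies to the SGD-momentum update of each matrix parameter.
    Cubic iteration: X <- 1.5 X - 0.5 X X^T X."""
    X = G / np.linalg.norm(G)        # Frobenius-normalize so singular values < 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# Well-conditioned test matrix with known singular values
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((6, 6)))
Vt, _ = np.linalg.qr(rng.standard_normal((6, 6)))
G = U @ np.diag([1.0, 0.9, 0.8, 0.7, 0.6, 0.5]) @ Vt
O = newton_schulz_orthogonalize(G)
assert np.allclose(O @ O.T, np.eye(6), atol=1e-4)   # near-orthogonal result
```

Each iteration pushes every singular value toward 1, so the update's direction per neuron is kept while its conditioning is discarded.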

Scaling Laws Applied

nanochat automatically calculates optimal hyperparameters from depth using empirically-derived scaling laws:

| Hyperparameter | Formula | Reference |
|---|---|---|
| Model dim | depth × aspect_ratio (default 64) | Width ∝ depth |
| Training tokens | target_param_data_ratio × params (default 10.5×) | Chinchilla ~20× |
| Batch size | B_ref × (D/D_ref)^0.383 | Power Lines |
| Learning rate | η_ref × √(B/B_ref) | AdamW scaling |
| Weight decay | λ_ref × √(B/B_ref) × (D_ref/D) | T_epoch paper |
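Sketching how the single depth dial expands into a full configuration (the reference constants B_ref, D_ref, lr_ref, wd_ref below are illustrative placeholders, not nanochat's actual values):

```python
import math

def hparams_from_depth(depth, aspect_ratio=64, data_ratio=10.5,
                       B_ref=524288, D_ref=3.2e9, lr_ref=0.02, wd_ref=0.1):
    """Derive hyperparameters from depth using the scaling-law formulas above.
    Reference constants are placeholders for illustration."""
    model_dim = depth * aspect_ratio
    params = 12 * depth * model_dim ** 2          # rough transformer param count
    tokens = data_ratio * params                  # Chinchilla-style data budget
    batch = B_ref * (tokens / D_ref) ** 0.383     # Power Lines batch scaling
    lr = lr_ref * math.sqrt(batch / B_ref)        # sqrt-scaling for AdamW
    wd = wd_ref * math.sqrt(batch / B_ref) * (D_ref / tokens)
    return dict(model_dim=model_dim, params=params, tokens=tokens,
                batch_tokens=batch, lr=lr, weight_decay=wd)

h = hparams_from_depth(depth=26)   # the speedrun setting
# depth=26 -> model_dim 1664, roughly 0.86B parameters
```

The point is the coupling: touch depth, and data budget, batch size, learning rate, and weight decay all move consistently.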

Leaderboard: GPT-2 Speedrun

Community competition to train GPT-2 capability as fast as possible. Target: beat GPT-2 (1.6B) CORE score of 0.256525.

nanochat Resources
GitHub — Main repository. ~8000 lines of clean code.
Discussion — Karpathy's intro and community Q&A.
MarkTechPost — Overview article.

nanoGPT: The Original Minimal GPT

nanoGPT (2022) is Karpathy's original minimal GPT implementation. While superseded by nanochat, it remains valuable for learning and as a baseline.

Deprecated but foundational
  • nanoGPT is now deprecated — use nanochat for new projects
  • Still valuable for learning the basics of GPT training
  • Spawned the nanoGPT speedrun competition

Code Structure

The entire codebase fits in two files:

| File | Lines | Purpose |
|---|---|---|
| model.py | ~300 | GPT model definition |
| train.py | ~300 | Training loop |

Training Performance

Evolution

KARPATHY'S LLM JOURNEY
minGPT (2020) — Educational, prioritizes readability
nanoGPT (2022) — Minimal but fast, "teeth over education"
llm.c (2024) — Pure C/CUDA implementation
nanochat (2025) — Full stack: pretrain to chat UI
nanoGPT Resources
nanoGPT — Original repo. Good for learning.
build-nanogpt — Video lecture on building from scratch.
llm.c — Pure C/CUDA implementation.
Zero to Hero — Karpathy's video series (YouTube).

Training From Scratch

A practical guide to training your own LLM using open source tools.

Decision: What to Train?

Which approach should I use?
  • Learning: Start with nanoGPT or nanochat tutorials
  • Production small model: Fine-tune Qwen3-4B or Llama 3.1 8B
  • Reasoning task: Try R1-Distill models or GRPO training
  • Custom from scratch: Use nanochat with your data
  • Research: Explore TRM-style recursive architectures

Training Stack

Fine-Tuning Tools
PEFT — HuggingFace. LoRA, QLoRA, adapters. Most accessible.
Unsloth — 2× faster fine-tuning. 70% less memory.
Axolotl — Config-based fine-tuning. Many methods.
LitGPT — Lightning AI. Pretrain + fine-tune.
RL Training
TRL — HuggingFace. PPO, DPO, GRPO, KTO, ORPO.
OpenRLHF — Ray + vLLM. Scales to 70B+.
veRL — ByteDance. GRPO, VAPO. Scales to 671B.
From-Scratch Training
nanochat — Full pipeline. $100 for GPT-2 capability.
llm.c — Pure C/CUDA. No Python dependencies.
torchtitan — PyTorch native. Llama architecture.
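Most of the fine-tuning tools above build on LoRA; the core idea fits in a few lines (illustrative numpy, hypothetical dimensions):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """LoRA sketch: freeze W, learn a low-rank update.
    y = x W^T + (alpha/r) * x A^T B^T, with A: (r, d_in), B: (d_out, r)."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 1024, 1024, 8
W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init
x = rng.standard_normal(d_in)
y = lora_forward(x, W, A, B)
assert np.allclose(y, x @ W.T)              # zero-init B: model starts unchanged
trainable = A.size + B.size                 # 16,384 vs 1,048,576 frozen
assert trainable / W.size < 0.02            # <2% of the full weight's params
```

The zero-initialized B matrix is the standard trick: training starts from exactly the pretrained model, and only the tiny A/B factors receive gradients.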

Learning Path

Recommended learning path
1. Watch "Let's build GPT" (2h): Karpathy builds a GPT from scratch. Best introduction.
2. Run nanoGPT on Shakespeare (30 min): Train your first model. Even works on CPU.
3. Fine-tune Qwen3-4B with Unsloth (1h): Learn LoRA fine-tuning on a real model.
4. Try DPO with TRL (2h): Add preference alignment to your model.
5. Run nanochat speedrun ($72): Train GPT-2 capability from scratch.

Key Papers

| Paper | Year | Key Contribution |
|---|---|---|
| Qwen3 Technical Report | 2025 | Complete training methodology for frontier open model |
| DeepSeek-R1 | 2025 | GRPO, emergent reasoning from pure RL |
| Chinchilla | 2022 | Optimal data/param ratio (~20×) |
| LoRA | 2021 | Parameter-efficient fine-tuning |
| DPO | 2023 | Reward-free preference learning |
| QLoRA | 2023 | 4-bit quantization + LoRA. 70B on single GPU. |

Model Distillation

Distillation transfers knowledge from large "teacher" models to smaller "student" models, achieving strong performance at lower inference cost.

DeepSeek R1 Distillation to Qwen

DeepSeek-R1-Distill models are fine-tuned on samples generated by DeepSeek-R1, transferring advanced reasoning capabilities to smaller architectures.

R1-Distill Models DeepSeek

Available Qwen-Based Models

| Model | Base | Params | Notable Result |
|---|---|---|---|
| R1-Distill-Qwen-1.5B | Qwen2.5-1.5B | 1.5B | Edge-deployable reasoning |
| R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 7B | Outperforms QwQ-32B-Preview |
| R1-Distill-Qwen-14B | Qwen2.5-14B | 14B | Strong reasoning transfer |
| R1-Distill-Qwen-32B | Qwen2.5-32B | 32B | Outperforms OpenAI o1-mini |

Distillation Process

  • Data generation: R1 generates chain-of-thought reasoning traces
  • SFT: Student fine-tuned on teacher CoT outputs
  • Optional KL loss: Temperature-scaled KL divergence for soft targets
  • Result: 7B model matches/beats specialized 32B reasoning models
Key finding

DeepSeek-R1-Distill-Qwen-7B outperforms QwQ-32B-Preview despite being 4.5× smaller. The distillation captures reasoning patterns, not just final answers.
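The SFT-plus-optional-KL recipe can be sketched as a combined loss (illustrative numpy, not DistillKit's API):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, target,
                 w_ce=0.5, w_kl=0.5, T=1.0):
    """Distillation sketch: hard-label cross-entropy plus temperature-scaled
    KL divergence toward the teacher's output distribution."""
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    ce = -np.log(softmax(student_logits)[target] + 1e-12)   # hard label term
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))) * T * T
    return w_ce * ce + w_kl * kl

student = np.array([2.0, 0.5, -1.0])   # toy per-token logits
teacher = np.array([3.0, 0.2, -2.0])
loss = distill_loss(student, teacher, target=0)
# A student that already matches the teacher pays no KL penalty:
assert distill_loss(teacher, teacher, 0) < distill_loss(student, teacher, 0)
```

The KL term is what transfers the teacher's full output distribution (its "soft targets") rather than just its top answer, which is why logit-based distillation outperforms plain SFT on teacher outputs.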

Distillation Tools

Production-Ready Toolkits
DistillKit — Arcee AI. Logit + hidden state distillation. Pre-captured Qwen3-235B datasets.
LLaMA-Factory — Unified fine-tuning. LoRA, QLoRA, DPO, KTO support.
Axolotl — Config-based training. Community-driven defaults.
EasyDistill — ModelScope toolkit for LLM knowledge distillation.

DIY Distillation with DistillKit

Example configuration for distilling from Qwen3-235B:

# config.yaml
project_name: qwen-distill
model: Qwen/Qwen3-8B
output_path: ./output
sequence_length: 8192

dataset:
  train_dataset:
    repo_id: arcee-ai/Qwen3-235B-Logits-Packed-8192
    split: train
    prepacked: true

teacher:
  kind: dataset  # Pre-captured logits

loss_functions:
  - function: cross_entropy
    weight: 0.5
  - function: kl
    weight: 0.5
    temperature: 1.0

Distillation Methods Compared

Logit-Based
✓ Transfers output distribution
✓ Better benchmark scores
✓ Can use pre-captured data
✗ Large storage for logits
Hidden-State
✓ Transfers "thinking process"
✓ Better generalization
✗ Requires architecture matching
✗ More complex to implement

Quantization

Quantization reduces model precision (FP32 → INT8/INT4) for faster inference on consumer hardware.

Methods Compared

| Method | Quality | Best For | Ecosystem |
|---|---|---|---|
| AWQ | ~95% | GPU inference, creative tasks | vLLM, TGI |
| GGUF | ~92% | CPU/Apple Silicon, Ollama | llama.cpp, LM Studio |
| GPTQ | ~90% | Raw CUDA throughput | ExLlama, AutoGPTQ |
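A minimal sketch of naive symmetric 4-bit group quantization (illustrative numpy; real GPTQ/AWQ pipelines exist precisely to reduce the error this naive rounding introduces):

```python
import numpy as np

def quantize_int4(W, group_size=32):
    """Symmetric 4-bit group quantization sketch: each group of weights
    shares one FP scale; values round to integers in [-8, 7]."""
    flat = W.reshape(-1, group_size)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0   # per-group scale
    q = np.clip(np.round(flat / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int4(W)
W_hat = dequantize(q, scale, W.shape)
err = np.abs(W - W_hat).mean() / np.abs(W).mean()
assert err < 0.2   # naive symmetric int4 loses roughly 10% mean relative error
# 4-bit ints + one scale per 32 weights is ~4.5 bits/weight vs 32 for FP32
```

That residual error is the motivation for the AWQ-style activation-aware scaling mentioned in the tip below: protecting the weights that matter most to activations recovers most of the lost quality.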

Quantizing Qwen3

Official Qwen documentation covers GPTQ, AWQ, and GGUF quantization:

# GGUF with llama.cpp
python convert-hf-to-gguf.py Qwen/Qwen3-8B \
  --outfile Qwen3-8B-F16.gguf \
  --outtype f16

# Quantize to Q4_K_M
./llama-quantize Qwen3-8B-F16.gguf Qwen3-8B-Q4_K_M.gguf Q4_K_M
AWQ scale tip

To improve quantized model quality, apply AWQ scale before GGUF conversion. When running model.quantize() with AutoAWQ, add export_compatible=True.

Pre-Quantized Models

Qwen3 models are available pre-quantized on HuggingFace and LM Studio. Official quantization guide →

Source Code Analysis

Deep dives into model architectures and training codebases.

DeepSeek V3/R1 Architecture

Multi-head Latent Attention (MLA) DeepSeek Innovation

MLA compresses key-value tensors into lower-dimensional latent space, dramatically reducing KV cache memory:

  • Compression: h_t → latent vector with dimension d_c ≪ (n_h × d_h)
  • Recovery: Map latent back to full dimension for attention computation
  • Parameters: n_h=128 heads, d_h=128 per-head dim, d_c=512 KV compression
  • RoPE handling: "Decoupled RoPE" to maintain position sensitivity

Code location: inference/model.py:396
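The compression arithmetic can be checked directly, alongside a tiny numeric sketch of the compress-then-recover path (illustrative numpy with random weights and toy dimensions; the real model also adds decoupled RoPE dimensions on top of the latent):

```python
import numpy as np

def mla_cache_ratio(n_h, d_h, d_c):
    """Per-token KV-cache compression: full K+V (2 * n_h * d_h floats)
    vs one shared latent of dimension d_c."""
    return (2 * n_h * d_h) / d_c

# DeepSeek's published numbers: 128 heads x 128 dims, d_c = 512
assert mla_cache_ratio(128, 128, 512) == 64.0   # 64x smaller KV cache

# Toy compress -> recover sketch (random weights, illustrative only)
rng = np.random.default_rng(0)
n_h, d_h, d_c, seq = 4, 8, 16, 10
d_model = n_h * d_h
W_down = rng.standard_normal((d_model, d_c))    # compression projection
W_uk = rng.standard_normal((d_c, d_model))      # recover keys
W_uv = rng.standard_normal((d_c, d_model))      # recover values
h = rng.standard_normal((seq, d_model))
latent = h @ W_down            # cached: (seq, d_c) instead of (seq, 2*d_model)
K, V = latent @ W_uk, latent @ W_uv   # reconstructed when attention runs
assert K.shape == (seq, d_model) and V.shape == (seq, d_model)
```

Only the latent is cached per token; K and V are rematerialized on the fly, trading a little extra compute for a much smaller cache at long context.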

DeepSeekMoE Routing Sparse Experts
  • Fine-grained experts: More experts than standard MoE for better knowledge decomposition
  • Shared experts: Some experts always active for common patterns
  • All-experts layers: 3 layers activate all experts for cross-expert knowledge sharing
  • Auxiliary-loss-free balancing: Novel load balancing without explicit loss term

Qwen3 Architecture

Qwen3 is integrated into HuggingFace transformers. Key implementation files:

Code Locations
modeling_qwen3.py — HuggingFace model implementation
QwenLM/Qwen3 — Official repository with examples
arXiv:2505.09388 — Technical report with full specs

nanochat Architecture

Karpathy's nanochat distills modern transformer innovations into ~300 lines. Full deep dive →

| Component | nanochat Choice | Why |
|---|---|---|
| Positions | RoPE | O(1) memory, length extrapolation |
| Norm | RMSNorm (no params) | 30% faster than LayerNorm |
| Activation | ReLU² | 15% faster than GELU |
| Attention | QK-norm + GQA | Stability + efficiency |
| Optimizer | Muon + AdamW | 35% faster training |