
Graphify: Knowledge Graphs for Agent Systems

Transform codebases into queryable knowledge graphs. 71.5x fewer tokens per query.
Research completed: 2026-04-07

TL;DR: Graphify builds knowledge graphs from code, docs, papers, and images using a two-pass approach: deterministic AST extraction (tree-sitter) + semantic analysis (Claude vision). Every relationship is classified as EXTRACTED, INFERRED (with confidence), or AMBIGUOUS. Leiden community detection clusters by edge density, not embeddings. Result: query the graph structure instead of reading raw files—71.5x token reduction. Applied to our agent research hub, this means: "show me all isolation techniques" returns a connected subgraph (~700 tokens) vs reading 7 systems × docs (~50k tokens).

Core Concepts

Two-Pass Architecture

Pass 1 - Deterministic: tree-sitter extracts code structure (classes, functions, imports, call graphs) without LLM involvement. Fast, local, reproducible.

Pass 2 - Semantic: Claude subagents analyze docs, PDFs, images to extract concepts and relationships. Parallel execution, merged into NetworkX graph.
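Pass 1's output can be pictured as a flat edge list. The real system uses tree-sitter; this minimal sketch substitutes Python's stdlib ast module (an assumption for illustration only) to show the shape of deterministic extraction:

```python
import ast

def extract_structure(source: str) -> list[tuple[str, str, str]]:
    """Pass 1 sketch: deterministic (src, relation, dst) edges from code.
    Graphify uses tree-sitter; stdlib ast stands in here for illustration."""
    edges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            # Module-level imports become IMPORTS edges.
            target = getattr(node, "module", None) or node.names[0].name
            edges.append(("<module>", "IMPORTS", target))
        elif isinstance(node, ast.ClassDef):
            # Methods defined directly on a class become DEFINES edges.
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    edges.append((node.name, "DEFINES", item.name))
    return edges

sample = "import os\n\nclass AuthModule:\n    def login(self):\n        pass\n"
edges = extract_structure(sample)
# Every edge here is EXTRACTED: no LLM was involved in this pass.
```

Because Pass 1 is pure parsing, it is fast, local, and reproducible; Pass 2 then layers semantic edges on top of this skeleton.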

Relationship Classification

EXTRACTED: Directly discovered from source (imports, function calls, explicit mentions).

INFERRED: Reasonable deduction by Claude with confidence score 0.0-1.0. Example: "similar to X" → (A, X, SIMILAR_TO, 0.7).

AMBIGUOUS: Unclear or contradictory, flagged for human review.
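The three-way classification maps naturally onto a tagged edge record. A minimal sketch, with field names that are assumptions rather than Graphify's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    relation: str
    kind: str                # "EXTRACTED" | "INFERRED" | "AMBIGUOUS"
    confidence: float = 1.0  # only meaningful for INFERRED edges

def needs_review(edge: Edge, floor: float = 0.5) -> bool:
    """AMBIGUOUS edges always go to a human; weak inferences do too."""
    return edge.kind == "AMBIGUOUS" or (
        edge.kind == "INFERRED" and edge.confidence < floor
    )

# "similar to X" from the example above becomes:
similar = Edge("A", "X", "SIMILAR_TO", "INFERRED", 0.7)
```

The review threshold is a tunable knob: lowering it trades human effort for graph recall.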

Topology-Based Clustering

Leiden community detection identifies clusters based on graph edge density, not vector embeddings. The graph structure itself is the similarity signal. Discovers natural modules in code, shared patterns across systems.
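Leiden itself usually comes from the leidenalg package on top of python-igraph; to keep this sketch dependency-free, a toy label-propagation pass stands in below. It is not equivalent to Leiden, but it illustrates the same principle: communities emerge from edge density alone, with no embeddings involved.

```python
import random
from collections import Counter, defaultdict

def label_propagation(edges, rounds=10, seed=0):
    """Toy community detection: each node repeatedly adopts the
    majority label among its neighbors. A stand-in for Leiden."""
    rng = random.Random(seed)
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    labels = {n: n for n in adj}   # every node starts in its own community
    nodes = list(adj)
    for _ in range(rounds):
        rng.shuffle(nodes)
        for n in nodes:
            counts = Counter(labels[m] for m in adj[n])
            labels[n] = counts.most_common(1)[0][0]
    groups = defaultdict(set)
    for n, lab in labels.items():
        groups[lab].add(n)
    return list(groups.values())

# Two dense triangles joined by a single bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("x", "y"), ("y", "z"), ("x", "z"), ("c", "x")]
communities = label_propagation(edges)
```

Dense regions pull labels together while the sparse bridge resists merging; Leiden does the same job with modularity optimization and well-connectedness guarantees.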

Always-On Integration

PreToolUse hooks let AI assistants query the graph before reading raw files. Git integration rebuilds graph on commits/branch switches. Platform-agnostic: Claude Code, Codex, OpenCode, OpenClaw, Factory Droid.
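In Claude Code, such a hook would live in settings.json. A hedged sketch of the shape; the `graphify query` command is hypothetical, and the hook here only surfaces graph context before a read rather than blocking it:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Read|Grep|Glob",
        "hooks": [
          { "type": "command", "command": "graphify query --stdin" }
        ]
      }
    ]
  }
}
```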

Token Efficiency: The 71.5x Claim

Scenario: Developer asks "How does the authentication module work?"

Without Graphify: LLM must read auth.py (2.3k tokens), user.py (1.8k), session.py (1.5k), middleware.py (900 tokens), config.py (600 tokens), tests (3k tokens) = ~10k tokens to find relevant context.

With Graphify: Query graph for "authentication" → returns subgraph: AuthModule → uses → {JWTHandler, SessionStore, UserModel} → called_by → {LoginEndpoint, RefreshEndpoint}. Graph response: ~140 tokens. Then read ONLY the 2-3 relevant files suggested by graph.

Result: 140 tokens (query) + 4k (targeted reads) = 4,140 tokens vs 10k. But averaged across many queries, graph structure reuse compounds: same subgraph answers "who calls auth?", "what depends on sessions?", etc. 71.5x is aggregate efficiency across diverse queries.
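The "query the subgraph, then read only what it points at" flow can be sketched as a plain breadth-first walk over an adjacency map. Node and relation names below are illustrative, taken from the auth example above:

```python
from collections import deque

def neighborhood(graph, start, hops=2):
    """Return all (src, relation, dst) edges within `hops` of `start`.
    Answering from this small slice replaces reading whole files."""
    seen = {start}
    frontier = deque([(start, 0)])
    edges = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for rel, dst in graph.get(node, []):
            edges.append((node, rel, dst))
            if dst not in seen:
                seen.add(dst)
                frontier.append((dst, depth + 1))
    return edges

graph = {
    "AuthModule": [("uses", "JWTHandler"), ("uses", "SessionStore")],
    "JWTHandler": [("called_by", "LoginEndpoint")],
}
result = neighborhood(graph, "AuthModule")
```

The returned edge list is what costs ~140 tokens to serialize; the LLM then decides which 2-3 files behind those nodes are worth reading in full.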

Application to Agent Research Hub

Our /agents hub analyzes 7 systems across ~30 pages. Graphify transforms this into a queryable knowledge base.

Graph Element | Example Nodes | Example Relationships
Systems | OpenClaw, Muaddib, ElizaOS, Hermes, NemoClaw, Cline, Claude Code | NemoClaw extends OpenClaw
Patterns | Network isolation, doom loops, blueprints, context compaction (13 total) | Stripe implements network_isolation
Concepts | Subagents, memory, tools, security, routing, plugins | tools concept applies_to all_systems
Tech | tree-sitter, pgvector, QEMU, Landlock, seccomp, vis.js | ElizaOS uses pgvector
Implementations | Ring buffer hash, 70/20/10 truncation, Leiden clustering | doom_loop_detection seen_in brainpro

Community Detection Results

Running Leiden on our agent research graph would likely discover these clusters based on edge density:

Isolation Cluster

Muaddib, NemoClaw, Stripe, network isolation, QEMU, Landlock, seccomp. Dense connections around security and untrusted code execution.

Scale Cluster

OpenClaw, ElizaOS, pgvector, 24+ channels, plugin marketplace, multi-tenancy. Systems optimizing for audience reach.

Coding Agents Cluster

Cline, Claude Code, doom loops, context compaction, MCP, interactive workflows. Single-user developer tools.

Memory Cluster

ElizaOS pgvector, OpenClaw hybrid search, Muaddib 3-tier chronicles, NanoClaw CLAUDE.md. Different memory strategies.

Query Examples

Query: "Which systems prioritize multi-tenancy?"

Graph response: OpenClaw (24+ channels, one process), ElizaOS (PostgreSQL horizontal scale), Hermes (11+ channels). Muaddib explicitly rejects multi-tenancy (QEMU VMs are single-user).

Token cost: ~200 tokens vs 15k reading all system overviews.

Query: "Show me all isolation techniques and their tradeoffs."

Graph response: {Network isolation: cuts network, keeps filesystem} → Stripe. {QEMU micro-VMs: strongest isolation, max 8 concurrent} → Muaddib. {Landlock + seccomp + netns: sandboxes whole bot, not per-user} → NemoClaw. {bwrap: per-execution sandbox} → Claude Code.

Token cost: ~400 tokens vs 25k reading security sections across 7 systems.

Implementation Roadmap

Phase 1: Build Graph

Input: existing markdown in refs/, patterns.md, comparison.md. tree-sitter for code examples. Claude analyzes research notes. Manual seed: SYSTEMS.toml with metadata.
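The manual seed could look like the fragment below. The field names are assumptions about a plausible shape, not a documented schema:

```toml
# SYSTEMS.toml — hand-written seed metadata merged into the graph
[systems.openclaw]
kind = "system"
focus = "scale"

[systems.nemoclaw]
kind = "system"
extends = ["openclaw"]   # becomes an EXTRACTED edge: NemoClaw extends OpenClaw
```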

Phase 2: Classify Relationships

EXTRACTED from explicit statements. INFERRED from implicit connections (Claude analyzes "similar to", "inspired by"). Confidence scoring. Flag AMBIGUOUS for review.

Phase 3: Community Detection

Run Leiden clustering. Validate discovered clusters against intuitive categorization. Visualize with color-coded communities.

Phase 4: Web Integration

graph.html with vis.js interactive visualization. Click nodes, filter relationships, search. Embed on main hub page. Link to deep-dive pages.

Phase 5: LLM Integration

Export graph.json. AGENTS_GRAPH.md with query patterns. PreToolUse hooks: query graph before file reads. Telegram bot: graph queries from chat.
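Exporting the NetworkX graph to plain JSON keeps downstream consumers (hooks, the Telegram bot) dependency-free. A stdlib-only sketch; the node/edge schema here is an assumption for illustration, not Graphify's documented format:

```python
import json

def export_graph(nodes, edges):
    """Serialize the graph for graph.json. Schema is illustrative:
    nodes as {"id": ...}, edges as {"src", "rel", "dst", "kind"}."""
    payload = {
        "nodes": [{"id": n} for n in sorted(nodes)],
        "edges": [
            {"src": s, "rel": r, "dst": d, "kind": k}
            for s, r, d, k in edges
        ],
    }
    return json.dumps(payload, indent=2)

doc = export_graph(
    nodes={"AuthModule", "JWTHandler"},
    edges=[("AuthModule", "uses", "JWTHandler", "EXTRACTED")],
)
```

Keeping the `kind` tag (EXTRACTED/INFERRED/AMBIGUOUS) in the export lets consumers filter out low-trust edges at query time.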

Key Insights

Why Not Graphs? Claude Code's Haiku Explore Agent

Claude Code handles 34M codebase searches/week with Haiku Explore agents, not knowledge graphs. Understanding why reveals the tradeoffs.

⚡ Zero Cold Start

Graph: 5-30 min build time before first query. User waits or works with stale graph.

Explore: Instant. First query answered in 3-5 sec on fresh clone. Zero friction onboarding wins.

🔄 Always Fresh vs Staleness

Graph: Represents codebase at time T. Code changes → stale. Options: auto-rebuild (expensive), manual update (users forget), git hooks (miss uncommitted changes).

Explore: Always queries live filesystem. Sees uncommitted WIP, current branch, temp files. Zero sync needed.

📈 Stateless Scales Linearly

At 34M runs/week: Haiku costs $17k/week (34M × 2k tokens × $0.25/M). Graph maintenance: 100k graphs × churn × rebuild cost >> $17k.
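The per-week figure checks out arithmetically. The inputs are the document's round numbers, not measured values:

```python
runs_per_week = 34_000_000   # Explore agent invocations per week
tokens_per_run = 2_000       # assumed average Haiku tokens per search
usd_per_mtok = 0.25          # assumed Haiku price per million tokens

weekly_cost = runs_per_week * tokens_per_run / 1_000_000 * usd_per_mtok
print(f"${weekly_cost:,.0f}/week")  # $17,000/week
```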

Anthropic's calculus: Stateless agents scale O(queries). Graphs scale O(users × codebases × churn).

🎯 Grep Beats Inference

Graph INFERRED relationships: a 0.7 confidence score implies roughly a 30% chance the edge is wrong. "Similar to X" based on what evidence?

Explore + ripgrep: Exact matches, 100% precision. Regex support, line-level accuracy. LLM interprets, doesn't search.

🔍 Query Flexibility

Graphs excel: Structural queries ("trace call graph A→B→C"). Transitive relationships.

Graphs fail: "Find TODO mentioning 'security'", regex patterns, git-aware queries, exact line matches. Explore handles all.

💰 ROI for Typical Usage

Graph build: 500k tokens ($0.125). Break-even: 10 queries. Average session: 3-5 queries.

Reality: For 95% of users, graph ROI is negative. Power users (100+ queries) are <5%.

The 71.5x Claim: Cherry-Picked?

Graphify's metric: Query "How does auth work?" → Without graph: read 6 files (10k tokens). With graph: 140 tokens. 71.5x savings!

Assumptions: (1) User would read ALL 6 files (unlikely—they'd grep first), (2) Graph correctly identifies relevant files (assumes perfect recall), (3) Graph query alone answers question (usually you still read code).

Realistic comparison: Grep "auth" → 20 matches, read 2-3 files → 5k tokens. Graph query (140) + read 2-3 files → 5.14k tokens. Real savings: ~0%.

Where 71.5x holds: Power users, 100+ queries, never reading raw code (just querying structure). Rare.

When Graphs Win

Use Case | Why Graphs Win | Claude Code Alternative
Architectural queries | Transitive relationships (A→B→C→D): trace call chains, module dependencies. | Multiple Explore calls run iteratively (3-5 queries, 15 sec, $0.05 vs 1 graph query, instant, $0.0002). But only ~1% of queries need this.
Cross-codebase patterns | Relationships span repos: "How do 7 agent systems handle memory?" | Claude Code is scoped to a single repo. Graphify's sweet spot: multi-repo research hubs (our /agents use case).
Concept discovery | Leiden clustering surfaces structure humans didn't label. | Doesn't try to solve this. User asks "explain architecture" → Sonnet reads and synthesizes. Not automated discovery.

Anthropic's Design Philosophy

Takeaway: Graphify and Claude Code solve different problems. Graphify: multi-repo research, architecture discovery, power users. Claude Code: instant onboarding, always-fresh, millions of users, typical usage (3-5 queries/session). For our /agents hub (7 systems, cross-cutting analysis), graphs win. For single-codebase daily workflow, Haiku Explore wins.

Sources: Claude Code Subagent Docs, DEV Community Analysis, Claude Code leaked source analysis.

Tech Stack

Component | Technology | Purpose
Code parsing | tree-sitter | Deterministic AST extraction without LLM
Semantic analysis | Claude vision | Extract concepts from docs, PDFs, images
Graph structure | NetworkX | Python graph library, persistence to JSON
Clustering | Leiden algorithm | Community detection via modularity optimization
Visualization | vis.js | Interactive HTML graph rendering
Platforms | Claude Code, Codex, OpenCode, etc. | PreToolUse hooks, platform-agnostic integration