Befunge-98

Code as Geography

EsoLang-Bench 2026

11.2% success — The highest-scoring esoteric language. Spatial reasoning still defeats LLMs.

The Question

What if code were two-dimensional?

Befunge is a stack-based language created by Chris Pressey in 1993; Befunge-98 is its revision under the Funge-98 specification (1998). Source code is laid out on a 2D grid, and an instruction pointer navigates this space, moving in the cardinal directions.

Consequence

Your program isn't a sequence—it's a navigable space. Control flow becomes geography. Code wraps around edges, crosses itself, moves in all four directions.

How It Works

The Playfield

2D Grid: Toroidal space (edges wrap—no boundaries)

Pointer: Position (x, y) + direction (dx, dy)

Execution: Run instruction, move by delta, repeat
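The pointer model above can be sketched in a few lines of Python. This is only an illustration: the names `Pointer` and `advance` are hypothetical, and the fixed 80x25 grid is an assumption borrowed from Befunge-93 (Befunge-98 does not fix the playfield size).

```python
from dataclasses import dataclass

# Assumed grid size (Befunge-93's classic dimensions), used only for this sketch.
WIDTH, HEIGHT = 80, 25


@dataclass
class Pointer:
    x: int = 0
    y: int = 0
    dx: int = 1  # the pointer starts out heading right
    dy: int = 0


def advance(ip: Pointer) -> None:
    """Move one cell along the current delta, wrapping at the edges."""
    ip.x = (ip.x + ip.dx) % WIDTH
    ip.y = (ip.y + ip.dy) % HEIGHT


ip = Pointer()
ip.dx, ip.dy = -1, 0  # turn left while standing at (0, 0)
advance(ip)
print(ip.x, ip.y)  # 79 0: stepping off the left edge lands on the right edge
```

The modulo arithmetic is what makes the space toroidal: there is no position from which a step leaves the grid.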

Navigation

> < ^ v — Cardinal directions

_ | — Conditional branches (horizontal/vertical)

# — Bridge (skip next cell)

? — Random direction
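As a rough sketch of how these commands act on the pointer, the function below maps one navigation character to a new direction. The name `navigate` and its return shape are hypothetical, and the sketch assumes y grows downward and that popping an empty stack yields 0.

```python
import random

# Deltas as (dx, dy); y is assumed to grow downward.
DIRECTIONS = {">": (1, 0), "<": (-1, 0), "^": (0, -1), "v": (0, 1)}


def navigate(cmd, delta, stack):
    """Return (new_delta, skip) for a single navigation command.

    skip is True when the pointer should jump over the next cell.
    """
    if cmd in DIRECTIONS:
        return DIRECTIONS[cmd], False
    if cmd == "_":  # horizontal branch: zero goes right, nonzero goes left
        value = stack.pop() if stack else 0
        return ((1, 0) if value == 0 else (-1, 0)), False
    if cmd == "|":  # vertical branch: zero goes down, nonzero goes up
        value = stack.pop() if stack else 0
        return ((0, 1) if value == 0 else (0, -1)), False
    if cmd == "#":  # bridge: keep the direction, skip one cell
        return delta, True
    if cmd == "?":  # pick one of the four cardinal directions at random
        return random.choice(list(DIRECTIONS.values())), False
    return delta, False  # not a navigation command: direction unchanged


print(navigate("_", (0, 1), [0]))  # ((1, 0), False)
print(navigate("_", (0, 1), [7]))  # ((-1, 0), False)
```

Note that the two conditionals are the only navigation commands that consume a stack value, which is where control flow and data meet.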

Stack + Self-Modification

Stack: Push/pop, arithmetic, I/O

Self-mod: g (get), p (put) — rewrite code during execution
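To make g and p concrete, here is a minimal sketch that treats the grid as a list of rows. The helper names `pop`, `get`, and `put` are hypothetical, bounds checking is omitted, and Befunge-98's storage offset is ignored.

```python
def pop(stack):
    """Pop a value; an empty stack is assumed to yield 0."""
    return stack.pop() if stack else 0


def get(grid, stack):
    """g: pop y, then x; push the character code stored at (x, y)."""
    y, x = pop(stack), pop(stack)
    stack.append(ord(grid[y][x]))


def put(grid, stack):
    """p: pop y, then x, then a value; write that value into (x, y)."""
    y, x, value = pop(stack), pop(stack), pop(stack)
    grid[y][x] = chr(value)


grid = [list("12+.@")]      # a one-row program that would add 1 and 2
stack = [ord("*"), 2, 0]    # pushed in the order: value, x, y
put(grid, stack)
print("".join(grid[0]))     # 12*.@ (the addition has become a multiplication)

stack = [0, 0]              # x = 0, y = 0
get(grid, stack)
print(stack)                # [49], the character code of "1"
```

Because code and data share the same grid, the same two operations serve as both a memory store and a way to rewrite instructions.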

Example: Hello World

    >              v
    v "Hello World"<
    >:v
    ^,_@

Flow: Start (0,0) → right → down → left → push string → loop printing → @ exits
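The flow can be checked with a tiny interpreter. What follows is a sketch that supports only the handful of commands this program uses, not a Befunge-98 implementation. The function name `run`, the step limit, and the exact column alignment of the four-row layout are assumptions, and the grid is padded to a rectangle so that wrapping is simple modulo arithmetic.

```python
def run(source: str, max_steps: int = 10_000) -> str:
    """Run a small Befunge subset and return everything it printed."""
    lines = source.split("\n")
    width = max(len(line) for line in lines)
    grid = [list(line.ljust(width)) for line in lines]  # pad rows to a rectangle
    height = len(grid)
    x = y = 0
    dx, dy = 1, 0  # start at (0, 0) heading right
    stack, out, string_mode = [], [], False
    moves = {">": (1, 0), "<": (-1, 0), "^": (0, -1), "v": (0, 1)}

    def pop():
        return stack.pop() if stack else 0  # empty stack yields 0

    for _ in range(max_steps):
        cmd = grid[y][x]
        if string_mode:
            if cmd == '"':
                string_mode = False
            else:
                stack.append(ord(cmd))  # push each character's code
        elif cmd == '"':
            string_mode = True
        elif cmd in moves:
            dx, dy = moves[cmd]
        elif cmd == "_":  # zero goes right, nonzero goes left
            dx, dy = (1, 0) if pop() == 0 else (-1, 0)
        elif cmd == ":":  # duplicate the top of the stack
            top = pop()
            stack += [top, top]
        elif cmd == ",":  # print the top of the stack as a character
            out.append(chr(pop()))
        elif cmd == "@":
            return "".join(out)
        x, y = (x + dx) % width, (y + dy) % height
    raise RuntimeError("step limit reached")


HELLO = "\n".join([
    ">" + " " * 14 + "v",   # the v sits directly above the < on the next row
    'v "Hello World"<',
    ">:v",
    "^,_@",
])
print(run(HELLO))  # Hello World
```

The string is read right to left, so its characters are pushed in reverse and "H" ends up on top of the stack. The loop exits once the stack is empty, because duplicating an empty stack yields 0 and `_` then sends the pointer right into `@`.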

Why It Matters

1. Best LLM Performance (Still Terrible)

At 11.2%, Befunge is the only esoteric language above 10%. Why? Its stack-based model shares patterns with Forth/PostScript. But spatial navigation still defeats pattern matching—11.2% is the ceiling, not the baseline.

2. Code as Navigable Space

Your program is a map. Control flow = geography. Topology matters. The toroidal wrap-around means code can loop infinitely through space. This collapses the distinction between code structure and program behavior.

3. Spatial Reasoning Gap

LLMs can predict Python loops (trained on millions of examples). They can't trace paths through 2D grids (requires genuine spatial understanding). To execute Befunge, you must track (x, y, direction, stack, grid state) simultaneously—multi-dimensional state tracking that transformers struggle with.

The Research

EsoLang-Bench (2026) tested Befunge across 80 problems. Best result: GPT-5.2 at 11.2%.

Why Highest? Stack operations share patterns with mainstream languages. The 2D navigation is novel, but the computational model isn't completely alien.

Why Still Low? Spatial reasoning requires visualizing the grid and tracing execution paths. LLMs operate on sequential token streams, not spatial layouts.

Self-Modification: The p command can rewrite any cell at run time, so static analysis of the source text is unreliable. Code at time T ≠ code at time T+1.

The Limit

Even with partial pattern-matching success, spatial reasoning remains fundamentally difficult. An 11.2% success rate is more consistent with occasional lucky guesses than with systematic capability.

Connection to Other Languages

Each language tests different cognitive capabilities:

Brainfuck: 6.2% — 1D tape, sequential loops

Whitespace: 0.0% — Can't tokenize invisible syntax

Unlambda: 1.2% — Can't compose functions

Befunge: 11.2% — Can navigate 2D (barely)

Befunge vs Brainfuck: adding one spatial dimension transforms the challenge. A line becomes a plane. Sequential reasoning becomes geographical navigation.