Code as Geography
11.2% success — The highest-scoring esoteric language. Spatial reasoning still defeats LLMs.
What if code were two-dimensional?
Befunge (created by Chris Pressey in 1993 and later extended by the Funge-98 specification) is a stack-based language whose source code is laid out on a 2D grid. An instruction pointer navigates this grid, moving in one of the four cardinal directions.
Your program isn't a sequence—it's a navigable space. Control flow becomes geography. Code wraps around edges, crosses itself, moves in all four directions.
2D Grid: Toroidal space (edges wrap—no boundaries)
Pointer: Position (x, y) + direction (dx, dy)
Execution: Run instruction, move by delta, repeat
> < ^ v — Cardinal directions
_ | — Conditional branches (horizontal/vertical)
# — Bridge (skip next cell)
? — Random direction
Stack: Push/pop, arithmetic, I/O
Self-mod: g (get), p (put) — rewrite code during execution
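The rules above are compact enough to sketch as a tiny interpreter. This is a minimal illustration, not a conforming implementation: it assumes Befunge-93 semantics, covers only the instructions listed here plus `:` (duplicate), `.` and `,` (output), `"` (string mode) and `@` (halt), and leaves out `g` and `p`. The names `run`, `DIRECTIONS`, `COUNTDOWN` and the `max_steps` cap are hypothetical choices made for this example.

```python
import random

DIRECTIONS = {">": (1, 0), "<": (-1, 0), "^": (0, -1), "v": (0, 1)}


def run(source: str, max_steps: int = 10_000) -> str:
    """Run a small Befunge-93 subset and return everything it printed."""
    lines = source.split("\n")
    width = max(len(line) for line in lines)
    grid = [line.ljust(width) for line in lines]  # pad rows into a rectangle
    height = len(grid)

    x, y, dx, dy = 0, 0, 1, 0  # start top-left, heading right
    stack: list[int] = []
    out: list[str] = []
    string_mode = False

    def pop() -> int:
        return stack.pop() if stack else 0  # an empty stack yields 0

    for _ in range(max_steps):  # step cap guards against endless loops
        c = grid[y][x]
        if string_mode:
            if c == '"':
                string_mode = False
            else:
                stack.append(ord(c))
        elif c == '"':
            string_mode = True
        elif c in "0123456789":
            stack.append(int(c))
        elif c in "+-*":
            b, a = pop(), pop()
            stack.append(a + b if c == "+" else a - b if c == "-" else a * b)
        elif c in DIRECTIONS:
            dx, dy = DIRECTIONS[c]
        elif c == "?":
            dx, dy = random.choice(list(DIRECTIONS.values()))
        elif c == "_":  # horizontal branch: right on zero, left otherwise
            dx, dy = (1, 0) if pop() == 0 else (-1, 0)
        elif c == "|":  # vertical branch: down on zero, up otherwise
            dx, dy = (0, 1) if pop() == 0 else (0, -1)
        elif c == "#":  # bridge: take one extra step over the next cell
            x, y = (x + dx) % width, (y + dy) % height
        elif c == ":":
            top = pop()
            stack += [top, top]
        elif c == ".":
            out.append(f"{pop()} ")
        elif c == ",":
            out.append(chr(pop()))
        elif c == "@":
            break
        x, y = (x + dx) % width, (y + dy) % height  # toroidal wrap
    return "".join(out)


# A two-row countdown loop: the second row carries the pointer back left.
COUNTDOWN = "3>:.1-:v\n" + " ^" + " " * 5 + "_@"
print(run(COUNTDOWN))  # prints "3 2 1 "
```

The loop in `COUNTDOWN` exists only as a shape on the grid: `v` drops the pointer to the second row, `_` sends it left while the counter is nonzero, and `^` lifts it back to the `>` that restarts the first row.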
At 11.2%, Befunge is the only esoteric language above 10%. Why? Its stack-based model shares patterns with Forth and PostScript. But spatial navigation still defeats pattern matching: 11.2% is the best score any model reached, not a typical one.
Your program is a map. Control flow = geography. Topology matters. The toroidal wrap-around means code can loop infinitely through space. This collapses the distinction between code structure and program behavior.
LLMs can predict Python loops (trained on millions of examples). They can't trace paths through 2D grids (requires genuine spatial understanding). To execute Befunge, you must track (x, y, direction, stack, grid state) simultaneously—multi-dimensional state tracking that transformers struggle with.
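To make that bookkeeping concrete, here is a small sketch that records the pointer state before every step of a four-by-two loop. It is an illustration under stated assumptions: `State`, `trace`, `DIRS` and `LOOP` are hypothetical names, only digits, `+` and the four arrows are handled, and the grid is treated as read-only, so the "grid state" component is left out.

```python
from dataclasses import dataclass, field

DIRS = {">": (1, 0), "<": (-1, 0), "^": (0, -1), "v": (0, 1)}


@dataclass
class State:
    x: int = 0
    y: int = 0
    dx: int = 1
    dy: int = 0
    stack: list[int] = field(default_factory=list)


def trace(grid: list[str], steps: int) -> list[tuple]:
    """Return (x, y, dx, dy, stack) as seen before each of `steps` instructions."""
    height, width = len(grid), len(grid[0])
    s = State()
    history = []
    for _ in range(steps):
        history.append((s.x, s.y, s.dx, s.dy, tuple(s.stack)))
        c = grid[s.y][s.x]
        if c in "0123456789":
            s.stack.append(int(c))
        elif c == "+":
            b = s.stack.pop() if s.stack else 0
            a = s.stack.pop() if s.stack else 0
            s.stack.append(a + b)
        elif c in DIRS:
            s.dx, s.dy = DIRS[c]
        s.x = (s.x + s.dx) % width   # wrap horizontally
        s.y = (s.y + s.dy) % height  # wrap vertically
    return history


# The pointer circles clockwise, adding 1 to a running total on each lap.
LOOP = [">1+v",
        "^  <"]

for snapshot in trace(LOOP, 12):
    print(snapshot)
```

Every field of the snapshot changes at a different rhythm: position every step, direction at each corner, the stack once per lap. A reader, human or model, has to hold all of them at once to predict the next cell.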
EsoLang-Bench (2026) tested Befunge across 80 problems. Best result: GPT-5.2 at 11.2%.
Why Highest? Stack operations share patterns with mainstream languages. The 2D navigation is novel, but the computational model isn't completely alien.
Why Still Low? Spatial reasoning requires visualizing the grid and tracing execution paths. LLMs operate on sequential token streams, not spatial layouts.
Self-Modification: The p command can overwrite any cell while the program runs, so static analysis of the source is unreliable. The code at step T may differ from the code at step T+1.
Even with partial pattern-matching success, spatial reasoning remains difficult. An 11.2% pass rate suggests occasional success, not systematic capability.
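A short sketch of what that looks like in practice, assuming Befunge-93 semantics in which `p` pops y, then x, then a value and writes that value into cell (x, y), and `g` pops y and x and pushes the cell's character code. The helper names `put` and `get` are hypothetical, and the stack is set by hand to what the sample program would hold just before its `p` executes.

```python
def put(grid: list[list[str]], stack: list[int]) -> None:
    """Befunge 'p': pop y, x, value and store chr(value) at cell (x, y)."""
    y, x, value = stack.pop(), stack.pop(), stack.pop()
    grid[y][x] = chr(value)


def get(grid: list[list[str]], stack: list[int]) -> None:
    """Befunge 'g': pop y, x and push the character code stored at (x, y)."""
    y, x = stack.pop(), stack.pop()
    stack.append(ord(grid[y][x]))


# Read statically, this one-row program computes 9 - 2.
grid = [list('92"+"80p-.@')]
before = "".join(grid[0])

# Stack just before 'p' runs: operands 9 and 2, then value '+', x = 8, y = 0.
stack = [9, 2, ord("+"), 8, 0]
put(grid, stack)
after = "".join(grid[0])

print(before)  # 92"+"80p-.@
print(after)   # 92"+"80p+.@
```

By the time the pointer reaches column 8, the `-` it would have executed has become a `+`, so the program adds 9 and 2 instead of subtracting. The source text alone gives the wrong answer.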
Each language tests different cognitive capabilities:
Brainfuck: 6.2% — 1D tape, sequential loops
Whitespace: 0.0% — Can't tokenize invisible syntax
Unlambda: 1.2% — Can't compose functions
Befunge: 11.2% — Can navigate 2D (barely)
Befunge vs Brainfuck: adding one spatial dimension transforms the challenge. A line becomes a plane. Sequential reasoning becomes geographical navigation.