How agent frameworks deal with images, screenshots, PDFs, and non-text content
TL;DR: save files to disk, give the agent a read tool, and let it choose when to look. Images enter context only on demand, as virtual tokens from a ViT rather than as text. The Files API is the cloud equivalent; inline base64 is fine for one-offs.
**Inline base64.** Encode the file and send it in the message payload. The model's vision encoder handles it before tokenization, so it doesn't eat context as raw text: a ViT emits compact virtual tokens, not string tokens. This is the default everywhere.
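A minimal sketch of the inline pattern, using the image content-block shape from Anthropic's Messages API. The placeholder bytes stand in for a real screenshot, and no request is actually sent:

```python
import base64

# Placeholder bytes standing in for a real PNG screenshot.
FAKE_PNG = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16

def image_block(data: bytes, media_type: str = "image/png") -> dict:
    """Build an inline base64 image content block."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(data).decode("ascii"),
        },
    }

message = {
    "role": "user",
    "content": [
        image_block(FAKE_PNG),
        {"type": "text", "text": "What does this screenshot show?"},
    ],
}
# Pass `message` to client.messages.create(...); the vision encoder,
# not the text tokenizer, consumes the image data.
```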
**Files API (`file_id`).** Upload once, get back an ID, and reference it across turns, avoiding re-uploads. Anthropic's Files API (beta) and OpenAI's file endpoints both work this way. Good for multi-turn conversations about the same document.
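Sketch of the reuse pattern, assuming the file-reference block shape from Anthropic's Files API beta. The upload call is shown only as a comment, and `file_abc123` is a placeholder ID:

```python
def file_image_block(file_id: str) -> dict:
    # Reference an already-uploaded file by ID instead of re-sending bytes.
    return {"type": "image", "source": {"type": "file", "file_id": file_id}}

# Upload once, then reuse the returned ID across turns, e.g.:
#   uploaded = client.beta.files.upload(file=open("report.png", "rb"))
#   file_id = uploaded.id
turn = {
    "role": "user",
    "content": [
        file_image_block("file_abc123"),  # placeholder file_id
        {"type": "text", "text": "Summarize the chart on page 2."},
    ],
}
```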
**Disk + read tool.** Files land on the filesystem, and the system prompt tells the agent it has a tool to load or view them. The tool converts to base64 on demand, when the agent decides to look. This is how Claude Code and the Anthropic Agent SDK's built-in tools work. The cloud-native equivalent is the Files API: same idea, no local filesystem needed.
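A toy version of that read tool. The function name and dispatch logic are hypothetical; real implementations such as Claude Code's Read tool do more (PDF page handling, notebooks, size limits):

```python
import base64
import mimetypes
import tempfile
from pathlib import Path

def read_file_tool(path: str) -> dict:
    """Hypothetical on-demand read tool: images come back as base64
    image blocks, everything else as plain text."""
    p = Path(path)
    media_type, _ = mimetypes.guess_type(p.name)
    data = p.read_bytes()
    if media_type and media_type.startswith("image/"):
        return {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64.b64encode(data).decode("ascii"),
            },
        }
    return {"type": "text", "text": data.decode("utf-8", errors="replace")}

# Demo: a text file round-trips as a text block.
demo = Path(tempfile.mkdtemp()) / "notes.txt"
demo.write_text("meeting at 10am")
block = read_file_tool(str(demo))
```

The key design point: nothing enters context until the agent calls the tool, so a directory of fifty screenshots costs zero tokens until one is actually read.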
| Pattern | When |
|---|---|
| Base64 inline | One-off images, screenshots, single-turn analysis |
| Files API (file_id) | Same doc referenced across turns |
| Disk + read tool | Agent workflows where files accumulate over time |
| Multimodal RAG | Large doc corpora with vision needs (LlamaIndex etc.) |
Images don't tokenize as base64 text: a Vision Transformer produces compact virtual tokens, so the real cost depends on each model's image-tiling math, not the string length. Even so, loading many images at once fills context fast.
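As a concrete illustration, a back-of-the-envelope estimate using Anthropic's published approximation (tokens ≈ width × height / 750, for images within the provider's size cap); other providers use different tile-based formulas:

```python
def approx_image_tokens(width_px: int, height_px: int) -> int:
    # Anthropic's rule of thumb: tokens ~= (width * height) / 750.
    # Assumes the image already fits the provider's size cap.
    return (width_px * height_px) // 750

# A 1280x800 screenshot: ~1365 virtual tokens, regardless of file size.
screenshot_tokens = approx_image_tokens(1280, 800)
```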
The pattern of saving to disk and giving the agent a read tool is dominant in production agent frameworks:
- **Claude Code / Anthropic Agent SDK:** built-in TextEditor, Bash, and Read tools. Read natively handles images (→ base64), PDFs (→ pages), and notebooks; the agent decides when to look.
- **Computer-use agents:** `computer_use` and shell tools write screenshots to disk, which loop back as model input on the next turn. Tool outputs can include multimodal content, though support is experimental.
- **LlamaIndex:** Agentic Document Workflows (2025) combine document processing, retrieval, structured output, and orchestration. Best for heavy document corpora and multimodal RAG.
- **Flag-based frameworks:** a `multimodal=True` flag on agents; the framework converts local paths and URLs under the hood.
- **Tool-result images:** tool outputs can include multimodal content; screenshot agents encode to base64 in the tool response.
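Sketch of that last pattern: a screenshot tool whose result block carries an image back to the model. The capture call is stubbed out, the block shape follows Anthropic's tool-result format, and `toolu_xyz` is a placeholder ID:

```python
import base64

def take_screenshot() -> bytes:
    # Stand-in for a real capture call (headless browser, OS API, etc.).
    return b"\x89PNG\r\n\x1a\n" + b"\x00" * 16

def screenshot_tool_result(tool_use_id: str) -> dict:
    """Return a tool result whose content is an image block, so the
    screenshot loops back into the model as input on the next turn."""
    img_b64 = base64.b64encode(take_screenshot()).decode("ascii")
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": [{
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": img_b64,
            },
        }],
    }

result = screenshot_tool_result("toolu_xyz")  # placeholder tool_use_id
```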