How agent frameworks deal with images, screenshots, PDFs, and non-text content
TL;DR: save files to disk, give the agent a read tool, and let it choose when to look. Images enter context only on demand, as virtual tokens from a ViT rather than as text. The Files API is the cloud equivalent; inline base64 is fine for one-offs.
**Inline base64.** Encode the file and send it in the message payload. The model's vision encoder handles it before tokenization, so it doesn't eat context as raw text: a ViT emits compact virtual tokens, not string tokens. This is the default everywhere.
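A minimal sketch of the inline pattern, using the image content-block shape from Anthropic's Messages API. The placeholder bytes stand in for a real screenshot, and no request is actually sent:

```python
import base64

# Placeholder bytes standing in for a real PNG screenshot.
FAKE_PNG = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16

def image_block(data: bytes, media_type: str = "image/png") -> dict:
    """Build an inline base64 image content block."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(data).decode("ascii"),
        },
    }

message = {
    "role": "user",
    "content": [
        image_block(FAKE_PNG),
        {"type": "text", "text": "What does this screenshot show?"},
    ],
}
# Pass `message` to client.messages.create(...); the vision encoder,
# not the text tokenizer, consumes the image data.
```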
**Files API (`file_id`).** Upload once, get back an ID, and reference it across turns, avoiding re-uploads. Anthropic's Files API (beta) and OpenAI's file endpoints both work this way. Good for multi-turn conversations about the same document.
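Sketch of the reuse pattern, assuming the file-reference block shape from Anthropic's Files API beta. The upload call is shown only as a comment, and `file_abc123` is a placeholder ID:

```python
def file_image_block(file_id: str) -> dict:
    # Reference an already-uploaded file by ID instead of re-sending bytes.
    return {"type": "image", "source": {"type": "file", "file_id": file_id}}

# Upload once, then reuse the returned ID across turns, e.g.:
#   uploaded = client.beta.files.upload(file=open("report.png", "rb"))
#   file_id = uploaded.id
turn = {
    "role": "user",
    "content": [
        file_image_block("file_abc123"),  # placeholder file_id
        {"type": "text", "text": "Summarize the chart on page 2."},
    ],
}
```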
**Disk + read tool.** Files land on the filesystem, and the system prompt tells the agent it has a tool to load or view them. The tool converts to base64 on demand, when the agent decides to look. This is how Claude Code and the Anthropic Agent SDK's built-in tools work. The cloud-native equivalent is the Files API: same idea, no local filesystem needed.
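A toy version of that read tool. The function name and dispatch logic are hypothetical; real implementations such as Claude Code's Read tool do more (PDF page handling, notebooks, size limits):

```python
import base64
import mimetypes
import tempfile
from pathlib import Path

def read_file_tool(path: str) -> dict:
    """Hypothetical on-demand read tool: images come back as base64
    image blocks, everything else as plain text."""
    p = Path(path)
    media_type, _ = mimetypes.guess_type(p.name)
    data = p.read_bytes()
    if media_type and media_type.startswith("image/"):
        return {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64.b64encode(data).decode("ascii"),
            },
        }
    return {"type": "text", "text": data.decode("utf-8", errors="replace")}

# Demo: a text file round-trips as a text block.
demo = Path(tempfile.mkdtemp()) / "notes.txt"
demo.write_text("meeting at 10am")
block = read_file_tool(str(demo))
```

The key design point: nothing enters context until the agent calls the tool, so a directory of fifty screenshots costs zero tokens until one is actually read.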
| Pattern | When |
|---|---|
| Base64 inline | One-off images, screenshots, single-turn analysis |
| Files API (file_id) | Same doc referenced across turns |
| Disk + read tool | Agent workflows where files accumulate over time |
| Multimodal RAG | Large doc corpora with vision needs (LlamaIndex etc.) |
Images don't tokenize as base64 text: a Vision Transformer produces compact virtual tokens, so the real cost depends on each model's image-tiling math, not the string length. Even so, loading many images at once fills context fast.
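As a concrete illustration, a back-of-the-envelope estimate using Anthropic's published approximation (tokens ≈ width × height / 750, for images within the provider's size cap); other providers use different tile-based formulas:

```python
def approx_image_tokens(width_px: int, height_px: int) -> int:
    # Anthropic's rule of thumb: tokens ~= (width * height) / 750.
    # Assumes the image already fits the provider's size cap.
    return (width_px * height_px) // 750

# A 1280x800 screenshot: ~1365 virtual tokens, regardless of file size.
screenshot_tokens = approx_image_tokens(1280, 800)
```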
The pattern of saving to disk and giving the agent a read tool is dominant in production agent frameworks:
- **Claude Code / Anthropic Agent SDK:** built-in TextEditor, Bash, and Read tools. Read natively handles images (→ base64), PDFs (→ pages), and notebooks; the agent decides when to look.
- **Computer-use agents:** `computer_use` and shell tools write screenshots to disk, which loop back as model input on the next turn. Tool outputs can include multimodal content, though support is experimental.
- **LlamaIndex:** Agentic Document Workflows (2025) combine document processing, retrieval, structured output, and orchestration. Best for heavy document corpora and multimodal RAG.
- **Flag-based frameworks:** a `multimodal=True` flag on agents; the framework converts local paths and URLs under the hood.
- **Tool-result images:** tool outputs can include multimodal content; screenshot agents encode to base64 in the tool response.
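Sketch of that last pattern: a screenshot tool whose result block carries an image back to the model. The capture call is stubbed out, the block shape follows Anthropic's tool-result format, and `toolu_xyz` is a placeholder ID:

```python
import base64

def take_screenshot() -> bytes:
    # Stand-in for a real capture call (headless browser, OS API, etc.).
    return b"\x89PNG\r\n\x1a\n" + b"\x00" * 16

def screenshot_tool_result(tool_use_id: str) -> dict:
    """Return a tool result whose content is an image block, so the
    screenshot loops back into the model as input on the next turn."""
    img_b64 = base64.b64encode(take_screenshot()).decode("ascii")
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": [{
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": img_b64,
            },
        }],
    }

result = screenshot_tool_result("toolu_xyz")  # placeholder tool_use_id
```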