voice

Every surface so far — chat, files, the web — has been text. Voice adds a modality without changing the agent: it stays a text process, and the conversion happens at the edges. Voice in: a user sends an audio note, Whisper turns it into text before the agent reads anything. Voice out: the agent calls send_voice, the TTS daemon turns text into an ogg file, and the chat platform delivers it as a real voice note. The agent never handles audio; it reads and writes text, and the daemons translate on either side.

voice in — transcription

When a voice note arrives at an adapter (Telegram, WhatsApp, Discord), the adapter downloads the audio and hands it to routd. If WHISPER_BASE_URL is set and VOICE_TRANSCRIPTION_ENABLED=true, routd POSTs the audio to that Whisper server before the agent ever sees the message. The transcript comes back as plain text and gets glued onto the attachment tag the agent sees in its prompt:

<attachment path="/home/node/media/20260515/note.ogg"
            mime="audio/ogg"
            transcript="hey can you check the inbox">

The agent reads the transcript= attribute like any other text and never re-transcribes the audio. The raw audio stays on disk in media/<date>/ in case a skill ever needs the bytes.
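The gate and the resulting tag can be sketched in a few lines. This is an illustration under stated assumptions, not routd's code: the function names transcription_enabled and attachment_tag are hypothetical, and the attribute escaping is a guess at how a transcript containing quotes would be embedded.

```python
from typing import Mapping, Optional
from xml.sax.saxutils import quoteattr


def transcription_enabled(env: Mapping[str, str]) -> bool:
    """True only when both settings named in the text are present."""
    return bool(env.get("WHISPER_BASE_URL")) and env.get("VOICE_TRANSCRIPTION_ENABLED") == "true"


def attachment_tag(path: str, mime: str, transcript: Optional[str] = None) -> str:
    """Render the attachment tag; transcript= appears only when one exists."""
    attrs = [f"path={quoteattr(path)}", f"mime={quoteattr(mime)}"]
    if transcript is not None:
        # Assumption: the transcript is escaped like any XML attribute value.
        attrs.append(f"transcript={quoteattr(transcript)}")
    return "<attachment " + " ".join(attrs) + ">"
```

With transcription disabled, the same helper would produce the tag without transcript=, and the agent would see only the path and MIME type.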

voice out — synthesis

The agent decides to reply with voice by calling the send_voice MCP tool with the chat’s JID, the text, and an optional voice name. routd handles the rest:

1. Pick the voice. An explicit argument wins; otherwise voice: from the persona’s PERSONA.md frontmatter; otherwise the TTS_VOICE env.
2. Compute sha256(text + voice + model). If <data_dir>/tts/<hash>.ogg already exists, reuse it without re-synthesizing.
3. On a miss, POST to TTS_BASE_URL at the OpenAI-shaped path /v1/audio/speech. The bundled ttsd daemon forwards to a Kokoro-FastAPI container running locally.
4. Hand the ogg file to the channel adapter, which sends it using the platform’s native voice primitive.
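The voice precedence and the cache lookup from the steps above can be sketched as follows. The names pick_voice and cache_path are hypothetical, and the exact bytes fed to SHA-256 are an assumption: the text only says sha256(text + voice + model), so plain UTF-8 concatenation is used here.

```python
import hashlib
from pathlib import Path
from typing import Optional


def pick_voice(explicit: Optional[str],
               persona_voice: Optional[str],
               env_voice: Optional[str]) -> Optional[str]:
    """Return the first voice that is set: argument, then persona, then env."""
    for candidate in (explicit, persona_voice, env_voice):
        if candidate:
            return candidate
    return None


def cache_path(data_dir: str, text: str, voice: str, model: str) -> Path:
    """Location of the cached ogg for this (text, voice, model) triple."""
    # Assumption: the three strings are concatenated as-is before hashing.
    digest = hashlib.sha256((text + voice + model).encode("utf-8")).hexdigest()
    return Path(data_dir) / "tts" / f"{digest}.ogg"
```

A caller would check cache_path(...).exists() before synthesizing. One consequence of concatenating without a separator is that different triples can share a key (for example, text "ab" with voice "c" and text "a" with voice "bc"); whether the real implementation guards against this is not stated here.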

Text is capped at 5000 characters. Because the TTS endpoint is OpenAI-compatible, swapping in Piper, Coqui, or OpenAI cloud only requires changing TTS_BACKEND_URL on ttsd.
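For readers unfamiliar with the OpenAI speech endpoint, here is a sketch of what a request to /v1/audio/speech looks like. The helper name build_speech_request is hypothetical. The JSON fields (model, input, voice, response_format) follow the public OpenAI API shape, but the payload routd actually sends is not specified here, and rejecting over-long text (as opposed to truncating it) is an assumption.

```python
import json
import urllib.request

MAX_TTS_CHARS = 5000  # cap stated in the text


def build_speech_request(base_url: str, text: str, voice: str, model: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-shaped speech endpoint."""
    if len(text) > MAX_TTS_CHARS:
        # Assumption: over-long text is rejected; the real behavior may differ.
        raise ValueError(f"text exceeds {MAX_TTS_CHARS} characters")
    body = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "opus",  # assumption: Opus output for voice notes
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to urllib.request.urlopen would perform the call and return the audio bytes in the response body. Because only the base URL varies, the same request works against any backend that implements this path.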

platform support

Platform	How `send_voice` is delivered
Telegram	`sendVoice` — push-to-talk UI
WhatsApp	audio with `ptt:true` via Baileys
Discord	`audio/ogg` attachment
Mastodon, Reddit, Bluesky, LinkedIn, X, email	Unsupported — `chanlib.ErrUnsupported`, agent falls back to `send`

when does the agent voice back?

The persona picks. A voice-first persona (think a phone-only assistant) always voices. A text-first persona only voices when the user voiced first — matching modality is friendlier than auto-converting every message to audio. The agent makes the choice in its prompt logic; the system doesn’t auto-convert.
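The rule is small enough to state as a function, although in arizuko it lives in the persona’s prompt and not in code. The sketch below only makes the two cases explicit; the function name and the mode strings "voice-first" and "text-first" are illustrative labels, not configuration values.

```python
def should_voice(persona_mode: str, user_sent_voice: bool) -> bool:
    """Decide whether a reply should go out as a voice note."""
    if persona_mode == "voice-first":
        return True  # always voices
    if persona_mode == "text-first":
        return user_sent_voice  # match the user's modality
    raise ValueError(f"unknown persona mode: {persona_mode!r}")
```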

failure mode

If ttsd is down, its /health returns 503 and routd flips TTS_ENABLED=false for that turn. send_voice returns chanlib.ErrUnsupported; the agent’s wrapper sees that and falls back to a plain text send. The user gets a text reply instead of silence.
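The fallback can be sketched as a thin wrapper around the two tools. This is an assumption-laden illustration: the ErrUnsupported class stands in for chanlib.ErrUnsupported, and reply, send_voice, and send are hypothetical callables passed in by the caller, not the real tool signatures.

```python
from typing import Callable


class ErrUnsupported(Exception):
    """Stand-in for chanlib.ErrUnsupported in this sketch."""


def reply(chat_jid: str,
          text: str,
          send_voice: Callable[[str, str], None],
          send: Callable[[str, str], None]) -> str:
    """Try a voice note first; fall back to plain text if voice is unsupported."""
    try:
        send_voice(chat_jid, text)
        return "voice"
    except ErrUnsupported:
        # Covers both an unsupported platform and TTS being disabled for the turn.
        send(chat_jid, text)
        return "text"
```

The same path handles platforms without a voice primitive and a TTS outage, which is why the user sees a text reply in both cases.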

go deeper

Canonical spec: specs/5/T-voice-synthesis.md. The voice: frontmatter that pins a per-agent voice lives in PERSONA.md — see ant for the folder layout. For how the JIDs in send_voice(chatJid, …) are shaped, see jid.