voice
Every surface so far — chat, files, the web — has been text. Voice adds a modality without changing the agent: it stays a text process, and the conversion happens at the edges. Voice in: a user sends an audio note, Whisper turns it into text before the agent reads anything. Voice out: the agent calls send_voice, the TTS daemon turns text into an ogg file, and the chat platform delivers it as a real voice note. The agent never handles audio; it reads and writes text, and the daemons translate on either side.
voice in — transcription
When a voice note arrives at an adapter (Telegram, WhatsApp, Discord), the adapter downloads the audio and hands it to routd. If WHISPER_BASE_URL is set and VOICE_TRANSCRIPTION_ENABLED=true, routd POSTs the audio to that Whisper server before the agent ever sees the message. The transcript comes back as plain text and gets glued onto the attachment tag the agent sees in its prompt:
<attachment path="/home/node/media/20260515/note.ogg"
mime="audio/ogg"
transcript="hey can you check the inbox">
The agent reads the transcript attribute like any other text and never re-transcribes. The raw audio stays on disk in media/<date>/ in case a skill ever needs the bytes.
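As a rough illustration of the gate and the resulting tag, here is a minimal Python sketch. It is not routd's code: the function names are hypothetical, the exact string comparison against "true" is an assumption, and omitting the transcript attribute when transcription is off is also an assumption.

```python
from xml.sax.saxutils import quoteattr


def transcription_enabled(env):
    """True only when both settings named in the text are present.

    Assumption: the flag is compared against the literal string "true".
    """
    return bool(env.get("WHISPER_BASE_URL")) and env.get("VOICE_TRANSCRIPTION_ENABLED") == "true"


def attachment_tag(path, mime, transcript=None):
    """Render the attachment tag the agent sees in its prompt.

    Assumption: the transcript attribute is simply left out when no
    transcript is available.
    """
    attrs = [f"path={quoteattr(path)}", f"mime={quoteattr(mime)}"]
    if transcript is not None:
        attrs.append(f"transcript={quoteattr(transcript)}")
    return "<attachment " + " ".join(attrs) + ">"
```

Calling attachment_tag with the path, MIME type, and transcript from the example above produces the same three attributes on a single line.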
voice out — synthesis
The agent decides to reply with voice by calling the send_voice MCP tool with the chat’s JID, the text, and an optional voice name. routd handles the rest:
- Pick the voice. Explicit argument wins; otherwise voice: from the persona’s PERSONA.md frontmatter; otherwise TTS_VOICE env.
- Hash sha256(text + voice + model). If <data_dir>/tts/<hash>.ogg already exists, reuse it without re-synthesizing.
- On miss, POST to TTS_BASE_URL at the OpenAI-shaped path /v1/audio/speech. The bundled ttsd daemon forwards to a Kokoro-FastAPI container running locally.
- Hand the ogg file to the channel adapter; it sends it using the platform’s native voice primitive.
Text is capped at 5000 characters. The TTS endpoint is OpenAI-compatible, so swapping in Piper, Coqui, or OpenAI cloud is just changing TTS_BACKEND_URL on ttsd.
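The voice precedence, cache key, and cache lookup can be sketched in Python as below. This is an illustration under stated assumptions, not routd's implementation: all function names are hypothetical, the hash input is assumed to be plain string concatenation encoded as UTF-8 with a hex digest as the file name, and rejecting over-long text (rather than truncating it) is a guess. The request body uses the model, input, and voice fields of the OpenAI speech API.

```python
import hashlib
from pathlib import Path

MAX_TEXT_CHARS = 5000  # cap stated in the text


def pick_voice(explicit, persona_voice, env_voice):
    """Explicit argument, then PERSONA.md frontmatter, then TTS_VOICE."""
    for candidate in (explicit, persona_voice, env_voice):
        if candidate:
            return candidate
    # Assumption: having no voice configured anywhere is an error.
    raise ValueError("no voice configured")


def cache_path(data_dir, text, voice, model):
    """Return <data_dir>/tts/<hash>.ogg for this text, voice, and model.

    Assumption: the three strings are concatenated as-is and hashed as UTF-8.
    """
    digest = hashlib.sha256((text + voice + model).encode("utf-8")).hexdigest()
    return Path(data_dir) / "tts" / f"{digest}.ogg"


def plan_synthesis(data_dir, base_url, text, voice, model):
    """Decide between reusing a cached file and POSTing to the TTS endpoint."""
    if len(text) > MAX_TEXT_CHARS:
        # Assumption: over-long text is rejected, not truncated.
        raise ValueError("text exceeds 5000 characters")
    path = cache_path(data_dir, text, voice, model)
    if path.exists():
        return {"cached": True, "path": path}
    return {
        "cached": False,
        "path": path,
        "url": base_url.rstrip("/") + "/v1/audio/speech",
        "json": {"model": model, "input": text, "voice": voice},
    }
```

Because the key covers text, voice, and model together, changing any one of them yields a new file, while repeating the same reply in the same voice costs no synthesis.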
platform support
| Platform | How send_voice is delivered |
|---|---|
| Telegram | sendVoice — push-to-talk UI |
| WhatsApp | audio with ptt:true via Baileys |
| Discord | audio/ogg attachment |
| Mastodon, Reddit, Bluesky, LinkedIn, X, email | Unsupported — chanlib.ErrUnsupported, agent falls back to send |
when does the agent voice back?
The persona picks. A voice-first persona (think a phone-only assistant) always voices. A text-first persona only voices when the user voiced first — matching modality is friendlier than auto-converting every message to audio. The agent makes the choice in its prompt logic; the system doesn’t auto-convert.
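Written out as a function, the rule looks like the Python sketch below. It is purely illustrative: per the text, the real decision lives in the agent's prompt logic rather than in code, and the mode strings "voice-first" and "text-first" are hypothetical labels.

```python
def should_reply_with_voice(persona_mode, user_sent_voice):
    """Voice-first personas always voice; others mirror the user's modality."""
    if persona_mode == "voice-first":
        return True
    # Text-first: voice back only when the incoming message was a voice note.
    return bool(user_sent_voice)
```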
failure mode
If ttsd is down, its /health returns 503 and routd flips TTS_ENABLED=false for that turn. send_voice returns chanlib.ErrUnsupported; the agent’s wrapper sees that and falls back to a plain text send. The user gets a text reply instead of silence.
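The fallback in the agent's wrapper might look like this Python sketch. The names are hypothetical: ErrUnsupported stands in for chanlib.ErrUnsupported (modelled here as an exception rather than a returned error), and send_voice and send are passed in as callables so the example stays self-contained.

```python
class ErrUnsupported(Exception):
    """Stand-in for chanlib.ErrUnsupported (hypothetical Python mirror)."""


def reply(chat_jid, text, send_voice, send):
    """Try a voice reply first; fall back to plain text when unsupported.

    Returns which modality was actually delivered.
    """
    try:
        send_voice(chat_jid, text)
        return "voice"
    except ErrUnsupported:
        # Covers both an unsupported platform and ttsd being down.
        send(chat_jid, text)
        return "text"
```

The same path handles a platform without a voice primitive and a TTS outage, so the agent needs only one fallback branch.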
go deeper
Canonical spec: specs/5/T-voice-synthesis.md. The voice: frontmatter that pins a per-agent voice lives in PERSONA.md — see ant for the folder layout. For how the JIDs in send_voice(chatJid, …) are shaped, see jid.