concepts / voice
Voice in: a user sends an audio note, and Whisper turns it into text before the agent reads anything. Voice out: the agent calls send_voice, the TTS daemon turns the text into an ogg file, and the chat platform delivers it as a real voice note.
When a voice note arrives at an adapter (Telegram, WhatsApp, Discord), the adapter downloads the audio and hands it to gated. If WHISPER_BASE_URL is set and VOICE_TRANSCRIPTION_ENABLED=true, the gateway POSTs the audio to that Whisper server before the agent ever sees the message. The transcript comes back as plain text and gets glued onto the attachment tag the agent sees in its prompt:
<attachment path="/home/node/media/20260515/note.ogg"
mime="audio/ogg"
transcript="hey can you check the inbox">
The agent reads the transcript= attribute like any other text and never re-transcribes. The raw audio is still on disk in media/<date>/ if a skill ever needs the bytes.
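The inbound path can be sketched in a few lines of Python. This is an illustration only, not the gateway's actual code: the function names `transcription_enabled` and `attachment_tag` are hypothetical, the exact-match check against the string "true" is an assumption, and the Whisper request itself is left out because the source does not describe its shape.

```python
import os
from xml.sax.saxutils import quoteattr


def transcription_enabled(env=os.environ):
    """True only when both settings named in the text are present.

    Assumption: the flag must be the literal string "true".
    """
    return bool(env.get("WHISPER_BASE_URL")) and env.get("VOICE_TRANSCRIPTION_ENABLED") == "true"


def attachment_tag(path, mime, transcript=None):
    """Render the attachment tag; transcript= is added only when one exists."""
    attrs = [f"path={quoteattr(path)}", f"mime={quoteattr(mime)}"]
    if transcript is not None:
        # quoteattr escapes &, <, > and picks safe quoting for the value.
        attrs.append(f"transcript={quoteattr(transcript)}")
    return "<attachment " + " ".join(attrs) + ">"
```

Escaping the transcript matters because spoken text can contain quotes or angle brackets that would otherwise break the tag.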
The agent decides to reply with voice by calling the send_voice MCP tool with the chat’s JID, the text, and an optional voice name. gated handles the rest:
1. Pick the voice. Use voice: from the persona’s PERSONA.md frontmatter; otherwise the TTS_VOICE env var.
2. Check the cache. The key is sha256(text + voice + model). If <data_dir>/tts/<hash>.ogg already exists, reuse it with no re-synthesis.
3. Synthesize. On a cache miss, send the text to TTS_BASE_URL at the OpenAI-shaped path /v1/audio/speech. The bundled ttsd daemon forwards to a Kokoro-FastAPI container running locally.

Text is capped at 5000 characters. The TTS endpoint is OpenAI-compatible, so swapping in Piper, Coqui, or OpenAI cloud only requires changing TTS_BACKEND_URL on ttsd.
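The voice resolution and cache lookup described above might look roughly like the following Python sketch. The helper names (`resolve_voice`, `cache_path`, `plan_synthesis`) are hypothetical, plain string concatenation before hashing is an assumption about how the three inputs are combined, and raising an error on over-long text is also an assumption, since the source only states that a cap exists. The optional voice argument to send_voice is left out because its precedence is not specified.

```python
import hashlib
import os
from pathlib import Path

MAX_TTS_CHARS = 5000  # cap stated in the text


def resolve_voice(persona_voice, env=os.environ):
    """Persona frontmatter wins; otherwise fall back to the TTS_VOICE env var."""
    return persona_voice or env.get("TTS_VOICE")


def cache_path(data_dir, text, voice, model):
    """Build <data_dir>/tts/<hash>.ogg from sha256(text + voice + model)."""
    # Assumption: the three strings are concatenated as-is and UTF-8 encoded.
    digest = hashlib.sha256((text + voice + model).encode("utf-8")).hexdigest()
    return Path(data_dir) / "tts" / f"{digest}.ogg"


def plan_synthesis(data_dir, text, voice, model):
    """Return ("reuse" | "synthesize", path) without calling the TTS server."""
    if len(text) > MAX_TTS_CHARS:
        # Assumption: over-long text is rejected rather than truncated.
        raise ValueError(f"text exceeds {MAX_TTS_CHARS} characters")
    path = cache_path(data_dir, text, voice, model)
    return ("reuse" if path.exists() else "synthesize", path)
```

Because the key covers text, voice, and model, changing any one of them produces a new file instead of replaying stale audio.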
| Platform | How send_voice is delivered |
|---|---|
| Telegram | sendVoice — push-to-talk UI |
| WhatsApp | audio with ptt:true via Baileys |
| Discord | audio/ogg attachment |
| Mastodon, Reddit, Bluesky, LinkedIn, X, email | Unsupported — chanlib.ErrUnsupported, agent falls back to send |
The persona picks. A voice-first persona (think a phone-only assistant) always voices. A text-first persona only voices when the user voiced first — matching modality is friendlier than auto-converting every message to audio. The agent makes the choice in its prompt logic; the system doesn’t auto-convert.
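The rule of thumb in this paragraph can be written down as a tiny decision function. This only illustrates the policy: in the system described here the choice lives in the agent's prompt logic, not in code, and the mode labels "voice-first" and "text-first" as well as the function name are hypothetical.

```python
def should_voice(persona_mode, user_sent_voice):
    """Decide whether a reply should go out as a voice note.

    persona_mode: "voice-first" or "text-first" (illustrative labels).
    user_sent_voice: True if the incoming message was a voice note.
    """
    if persona_mode == "voice-first":
        return True  # a voice-first persona always voices
    if persona_mode == "text-first":
        return user_sent_voice  # match the user's modality
    raise ValueError(f"unknown persona mode: {persona_mode!r}")
```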
If ttsd is down, its /health returns 503 and gated flips TTS_ENABLED=false for that turn. send_voice returns chanlib.ErrUnsupported; the agent’s wrapper sees that and falls back to a plain text send. The user gets a text reply instead of silence.
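A minimal Python sketch of that fallback, under stated assumptions: `ErrUnsupported` stands in for chanlib.ErrUnsupported, the `outbox` list stands in for real delivery, and the function signatures are hypothetical rather than the actual tool interface.

```python
class ErrUnsupported(Exception):
    """Stand-in for chanlib.ErrUnsupported."""


# Platforms the delivery table lists as supporting voice notes.
VOICE_PLATFORMS = {"telegram", "whatsapp", "discord"}


def send_voice(platform, chat_jid, text, tts_enabled, outbox):
    """Queue a voice note, or raise if voice is unavailable this turn."""
    if not tts_enabled or platform not in VOICE_PLATFORMS:
        raise ErrUnsupported(f"voice not available on {platform}")
    outbox.append(("voice", chat_jid, text))


def send(chat_jid, text, outbox):
    """Queue a plain text message."""
    outbox.append(("text", chat_jid, text))


def reply(platform, chat_jid, text, tts_enabled, outbox):
    """Try voice first; fall back to text so the user never gets silence."""
    try:
        send_voice(platform, chat_jid, text, tts_enabled, outbox)
    except ErrUnsupported:
        send(chat_jid, text, outbox)
```

The same wrapper covers both failure cases: a platform without voice support and a turn where TTS is disabled because ttsd is unhealthy.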
Canonical spec: specs/5/T-voice-synthesis.md. The voice: frontmatter that pins a per-agent voice lives in PERSONA.md — see ant for the folder layout. For how the JIDs in send_voice(chatJid, …) are shaped, see jid.