voice

Voice in: a user sends an audio note, and Whisper turns it into text before the agent reads anything. Voice out: the agent calls send_voice, the TTS daemon turns the text into an ogg file, and the chat platform delivers it as a real voice note.

voice in — transcription

When a voice note arrives at an adapter (Telegram, WhatsApp, Discord), the adapter downloads the audio and hands it to gated. If WHISPER_BASE_URL is set and VOICE_TRANSCRIPTION_ENABLED=true, the gateway POSTs the audio to that Whisper server before the agent ever sees the message. The transcript comes back as plain text and is added as a transcript= attribute on the attachment tag the agent sees in its prompt:

<attachment path="/home/node/media/20260515/note.ogg"
            mime="audio/ogg"
            transcript="hey can you check the inbox">

The agent reads the transcript= attribute like any other text and never re-transcribes. The raw audio stays on disk in media/<date>/ in case a skill ever needs the bytes.
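A minimal sketch of the gating check and the tag rendering described above. The function names transcription_enabled and attachment_tag are hypothetical, and escaping with quoteattr is an assumption about how the tag might be rendered, not a description of gated's actual code.

```python
from xml.sax.saxutils import quoteattr


def transcription_enabled(env):
    """True only when both settings named above are present in env."""
    has_server = bool(env.get("WHISPER_BASE_URL"))
    return has_server and env.get("VOICE_TRANSCRIPTION_ENABLED") == "true"


def attachment_tag(path, mime, transcript=None):
    """Render the attachment tag; transcript= is added only when one exists."""
    attrs = [("path", path), ("mime", mime)]
    if transcript is not None:
        attrs.append(("transcript", transcript))
    # quoteattr escapes &, <, > and picks a quote style that is safe for the value.
    rendered = " ".join(f"{name}={quoteattr(value)}" for name, value in attrs)
    return f"<attachment {rendered}>"
```

When transcription is disabled, the same helper can be called without a transcript, and the agent then sees only path and mime.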

voice out — synthesis

The agent decides to reply with voice by calling the send_voice MCP tool with the chat’s JID, the text, and an optional voice name. gated handles the rest:

  1. Pick the voice. Explicit argument wins; otherwise voice: from the persona’s PERSONA.md frontmatter; otherwise TTS_VOICE env.
  2. Hash sha256(text + voice + model). If <data_dir>/tts/<hash>.ogg already exists, reuse it — no re-synth.
  3. On miss, POST to TTS_BASE_URL at the OpenAI-shaped path /v1/audio/speech. The bundled ttsd daemon forwards to a Kokoro-FastAPI container running locally.
  4. Hand the ogg file to the channel adapter; it sends it using the platform’s native voice primitive.
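Steps 1 and 2 can be sketched as two small functions. This is an illustration under stated assumptions: the names pick_voice and cache_path are hypothetical, and the hash input is taken to be the three strings concatenated as UTF-8 with no separator, which is one reading of sha256(text + voice + model).

```python
import hashlib
from pathlib import Path


def pick_voice(explicit, persona_voice, env_voice):
    """Step 1: explicit argument, then PERSONA.md voice:, then TTS_VOICE."""
    for candidate in (explicit, persona_voice, env_voice):
        if candidate:
            return candidate
    return None  # nothing configured anywhere


def cache_path(data_dir, text, voice, model):
    """Step 2: derive <data_dir>/tts/<hash>.ogg from text, voice and model."""
    # Assumption: plain concatenation, encoded as UTF-8, with no separator.
    digest = hashlib.sha256((text + voice + model).encode("utf-8")).hexdigest()
    return Path(data_dir) / "tts" / f"{digest}.ogg"


def cached_or_none(data_dir, text, voice, model):
    """Return the cached file if it exists, otherwise None (a cache miss)."""
    path = cache_path(data_dir, text, voice, model)
    return path if path.exists() else None
```

Because the voice and model are part of the key, the same sentence spoken in a different voice lands in a different file rather than reusing stale audio.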

Text is capped at 5000 characters. Because the TTS endpoint is OpenAI-compatible, swapping in Piper, Coqui, or OpenAI cloud only means changing TTS_BACKEND_URL on ttsd; gated keeps posting to TTS_BASE_URL.
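For illustration, the request on a cache miss might be built like this. The JSON fields model, input, voice and response_format follow the OpenAI speech API shape; the function name speech_request, the choice of "opus" as the response format, and rejecting over-long text (rather than truncating it) are assumptions, since the text above only says that a cap exists.

```python
import json
import urllib.request

MAX_TEXT_CHARS = 5000  # cap stated in the text above


def speech_request(base_url, text, voice, model):
    """Build, but do not send, the POST to <base_url>/v1/audio/speech."""
    if len(text) > MAX_TEXT_CHARS:
        # Assumption: over-long text is rejected; the real behaviour may differ.
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    body = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "opus",  # assumed value for ogg output
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would perform the call; the response body is the audio to write into the cache path.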

platform support

Platform                                       How send_voice is delivered
Telegram                                       sendVoice (push-to-talk UI)
WhatsApp                                       audio with ptt:true via Baileys
Discord                                        audio/ogg attachment
Mastodon, Reddit, Bluesky, LinkedIn, X, email  Unsupported: returns chanlib.ErrUnsupported, agent falls back to send
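The table can be read as a lookup with an error for everything not listed. The sketch below is a Python stand-in, not the adapters' code: ErrUnsupported mimics chanlib.ErrUnsupported, and the lowercase platform keys and the name voice_delivery are hypothetical.

```python
class ErrUnsupported(Exception):
    """Stand-in for chanlib.ErrUnsupported."""


# Delivery per platform, copied from the table above.
VOICE_DELIVERY = {
    "telegram": "sendVoice",
    "whatsapp": "audio with ptt:true",
    "discord": "audio/ogg attachment",
}


def voice_delivery(platform):
    """Return how a voice note is delivered, or raise if the platform has no voice primitive."""
    try:
        return VOICE_DELIVERY[platform.lower()]
    except KeyError:
        raise ErrUnsupported(platform) from None
```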

when does the agent voice back?

The persona picks. A voice-first persona (think a phone-only assistant) always voices. A text-first persona only voices when the user voiced first — matching modality is friendlier than auto-converting every message to audio. The agent makes the choice in its prompt logic; the system doesn’t auto-convert.
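Written out as a rule, the two persona styles above reduce to a few lines. This only restates the policy; the real decision lives in the agent's prompt, and the mode labels "voice-first" and "text-first" as well as the function name should_voice are hypothetical.

```python
def should_voice(persona_mode, user_voiced):
    """Restate the persona rule: voice-first always voices, text-first mirrors the user."""
    if persona_mode == "voice-first":
        return True
    # Text-first: match the user's modality.
    return bool(user_voiced)
```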

failure mode

If ttsd is down, its /health returns 503 and gated flips TTS_ENABLED=false for that turn. send_voice returns chanlib.ErrUnsupported; the agent’s wrapper sees that and falls back to a plain text send. The user gets a text reply instead of silence.
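The fallback can be sketched as a thin wrapper around the two tools. ErrUnsupported here is a Python stand-in for chanlib.ErrUnsupported, and reply, send_voice and send are hypothetical callables passed in for illustration; the agent's actual wrapper may look different.

```python
class ErrUnsupported(Exception):
    """Stand-in for chanlib.ErrUnsupported."""


def reply(send_voice, send, chat_jid, text):
    """Try a voice reply first; on ErrUnsupported, send plain text instead."""
    try:
        send_voice(chat_jid, text)
        return "voice"
    except ErrUnsupported:
        # TTS is down or the platform has no voice primitive: never leave the user with silence.
        send(chat_jid, text)
        return "text"
```

Only ErrUnsupported triggers the fallback in this sketch; any other exception still propagates, so unrelated failures are not silently turned into text replies.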

go deeper

Canonical spec: specs/5/T-voice-synthesis.md. The voice: frontmatter that pins a per-agent voice lives in PERSONA.md — see ant for the folder layout. For how the JIDs in send_voice(chatJid, …) are shaped, see jid.