ttsd

What it is

ttsd is a thin reverse proxy in front of a text-to-speech backend. It accepts POST /v1/audio/speech in the OpenAI shape ({model, voice, input, response_format}), forwards the request verbatim to the configured backend, and streams audio bytes back to the caller.

It also exposes GET /v1/voices (a passthrough for backends that list voices) and a GET /health that returns 503 when the upstream is unreachable.
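A caller-side sketch of the request shape may make the contract concrete. It assumes ttsd is listening on localhost:8880, as in the standalone example further down; the names SpeechRequest, encodeRequest, and synthesize are illustrative and not taken from the arizuko source.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

// SpeechRequest mirrors the four OpenAI-shaped fields named above.
// The type name is hypothetical.
type SpeechRequest struct {
	Model          string `json:"model"`
	Voice          string `json:"voice"`
	Input          string `json:"input"`
	ResponseFormat string `json:"response_format"`
}

// encodeRequest serialises the request body as JSON.
func encodeRequest(req SpeechRequest) ([]byte, error) {
	return json.Marshal(req)
}

// synthesize posts req to baseURL/v1/audio/speech and copies the
// returned audio bytes to w as they arrive.
func synthesize(baseURL string, req SpeechRequest, w io.Writer) error {
	body, err := encodeRequest(req)
	if err != nil {
		return err
	}
	resp, err := http.Post(baseURL+"/v1/audio/speech", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	_, err = io.Copy(w, resp.Body)
	return err
}

func main() {
	req := SpeechRequest{Model: "kokoro", Voice: "af_sky", Input: "hello", ResponseFormat: "mp3"}
	var audio bytes.Buffer
	// Base URL is an assumption: the port used in the standalone example.
	if err := synthesize("http://localhost:8880", req, &audio); err != nil {
		fmt.Fprintln(os.Stderr, "tts request failed:", err)
		return
	}
	if err := os.WriteFile("hello.mp3", audio.Bytes(), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Because ttsd forwards the body verbatim, the same client should work unchanged against any backend that speaks /v1/audio/speech.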

Why it exists

arizuko speaks one TTS protocol — OpenAI’s. ttsd pins that contract at the daemon boundary so the gateway and the send_voice MCP tool never see the choice of backend. Default is the bundled Kokoro-FastAPI container; flip TTS_BACKEND_URL to Piper, Coqui, OpenAI cloud, or anything else that speaks /v1/audio/speech, and no caller code changes.

It also normalises the health signal. Kokoro’s readiness probe and OpenAI’s 401 on an unauthenticated HEAD would each need a custom check in gated; ttsd hides that and reports one {status:"ok"|"disconnected"} shape matching every other arizuko adapter.

The loop

One request, one upstream call. ttsd runs an httputil.NewSingleHostReverseProxy against TTS_BACKEND_URL:

  1. Receive POST /v1/audio/speech with the OpenAI body.
  2. Director copies URL.Path verbatim onto the backend URL.
  3. Stream the response (audio bytes, any Content-Type) back to the caller.
  4. On dial error, return 502 tts backend unreachable.
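The steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the ttsd source: the function name newProxy, the listen address, and the choice to also rewrite the Host header are hypothetical. Note that ReverseProxy.ErrorHandler fires on any backend transport error, not only dial failures.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
)

// newProxy builds a single-host reverse proxy for the given backend URL.
func newProxy(backend string) (*httputil.ReverseProxy, error) {
	target, err := url.Parse(backend)
	if err != nil {
		return nil, err
	}
	proxy := httputil.NewSingleHostReverseProxy(target)

	// Point the request at the backend and leave r.URL.Path untouched,
	// so /v1/audio/speech reaches the backend under the same path.
	proxy.Director = func(r *http.Request) {
		r.URL.Scheme = target.Scheme
		r.URL.Host = target.Host
		r.Host = target.Host // assumption: backends expect their own Host header
	}

	// Transport failures (including dial errors) become a 502.
	proxy.ErrorHandler = func(w http.ResponseWriter, r *http.Request, err error) {
		log.Printf("proxy error: %v", err)
		http.Error(w, "tts backend unreachable", http.StatusBadGateway)
	}
	return proxy, nil
}

func main() {
	backend := os.Getenv("TTS_BACKEND_URL")
	if backend == "" {
		log.Fatal("TTS_BACKEND_URL is not set")
	}
	proxy, err := newProxy(backend)
	if err != nil {
		log.Fatal(err)
	}
	// Listen address is an assumption taken from the standalone example.
	// This sketch forwards every path; the real daemon also serves its
	// own /health.
	log.Fatal(http.ListenAndServe(":8880", proxy))
}
```

ReverseProxy streams the response body back to the caller as it arrives, so no extra code is needed for step 3.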

The /health handler probes the backend’s /health with a 3-second timeout, falling back to a HEAD / for backends that don’t expose one. The result is used by gated to gate send_voice: when the probe returns 503, the tool returns chanlib.ErrUnsupported and the agent falls back to a plain text reply.
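One way the health probe could look is sketched below. Several details are assumptions rather than documented behaviour: the fallback triggers only on 404 or 405 from the backend's /health, any HTTP response to HEAD / counts as reachable (which would cover a 401 from an authenticated API), and the single 3-second budget covers both probes. The names probe, backendUp, statusFor, and healthHandler are hypothetical.

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"net/http"
	"os"
	"time"
)

const probeTimeout = 3 * time.Second

// probe issues one request and returns the status code.
func probe(ctx context.Context, method, url string) (int, error) {
	req, err := http.NewRequestWithContext(ctx, method, url, nil)
	if err != nil {
		return 0, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	resp.Body.Close()
	return resp.StatusCode, nil
}

// backendUp tries GET /health first, then HEAD / for backends that do
// not expose a health endpoint (assumed to answer 404 or 405).
func backendUp(ctx context.Context, base string) bool {
	code, err := probe(ctx, http.MethodGet, base+"/health")
	if err != nil {
		return false
	}
	if code >= 200 && code < 300 {
		return true
	}
	if code != http.StatusNotFound && code != http.StatusMethodNotAllowed {
		return false
	}
	_, err = probe(ctx, http.MethodHead, base+"/")
	return err == nil // any HTTP response means the backend is reachable
}

// statusFor maps reachability to the normalised status and HTTP code.
func statusFor(up bool) (string, int) {
	if up {
		return "ok", http.StatusOK
	}
	return "disconnected", http.StatusServiceUnavailable
}

// healthHandler reports {"status":"ok"} or {"status":"disconnected"}.
func healthHandler(base string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), probeTimeout)
		defer cancel()
		status, code := statusFor(backendUp(ctx, base))
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(map[string]string{"status": status})
	}
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/health", healthHandler(os.Getenv("TTS_BACKEND_URL")))
	// Listen address is an assumption taken from the standalone example.
	log.Fatal(http.ListenAndServe(":8880", mux))
}
```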

How it fits

agent
   |  send_voice(chat_jid, text, voice)
   v
 gated  --->  POST TTS_BASE_URL/v1/audio/speech
   |
   v
 ttsd  --->  POST TTS_BACKEND_URL/v1/audio/speech
   |
   v
 kokoro / piper / openai cloud
   |
   v  audio bytes (ogg, mp3, …)
 gated caches at <data_dir>/tts/<hash>.ogg, hands to adapter

Inputs: HTTP from gated (or any other OpenAI-shaped TTS caller). Outputs: audio bytes from the upstream backend. Hard deps: a reachable backend at TTS_BACKEND_URL.
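The cache path in the diagram, <data_dir>/tts/<hash>.ogg, implies a deterministic key per request. The sketch below shows one plausible derivation; which fields gated actually hashes, and with which algorithm, is not stated here, so SHA-256 over model, voice, and input is purely an assumption, and cacheKey and cachePath are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// cacheKey hashes the request fields that determine the audio output.
// The field set and the hash function are assumptions for illustration.
func cacheKey(model, voice, input string) string {
	h := sha256.New()
	for _, field := range []string{model, voice, input} {
		h.Write([]byte(field))
		h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") differ
	}
	return hex.EncodeToString(h.Sum(nil))
}

// cachePath builds <data_dir>/tts/<hash>.ogg for a request.
func cachePath(dataDir, model, voice, input string) string {
	return filepath.Join(dataDir, "tts", cacheKey(model, voice, input)+".ogg")
}

func main() {
	fmt.Println(cachePath("/var/lib/arizuko", "kokoro", "af_sky", "hello"))
}
```

With a key like this, repeating the same text in the same voice would hit the cache and skip the round trip through ttsd entirely.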

Concepts: concepts/voice covers the full voice-in / voice-out flow including transcription, voice selection, caching, and per-platform delivery.

Standalone usage

Yes. ttsd has no DB, no auth, no admin UI — just an env-configured reverse proxy. Front it with proxyd or any other auth layer when exposing it.

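# bundled Kokoro backend, on a shared network so ttsd can reach it by name
# (inside the ttsd container, localhost would point at ttsd itself)
docker network create tts
docker run -d --name kokoro --network tts -p 8881:8880 \
  ghcr.io/remsky/kokoro-fastapi-cpu:latest
docker run -d --name ttsd --network tts -p 8880:8880 \
  -e TTS_BACKEND_URL=http://kokoro:8880 \
  arizuko-ttsd:latest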

# probe
curl -s localhost:8880/health
curl -s -X POST localhost:8880/v1/audio/speech \
  -H 'content-type: application/json' \
  -d '{"model":"kokoro","voice":"af_sky","input":"hello","response_format":"mp3"}' \
  --output hello.mp3

Key env vars

Full list and defaults in reference/env.

Go deeper