9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -11,6 +11,15 @@ an entry here.

### 2

- §4.0 (new) — one audio-format rule for all audio payloads: the
container is the single source of truth (self-describing WAV / MP3 /
OGG / FLAC); `mime` and `sample_rate` are OPTIONAL advisory fields on
`ovos.audio.queue`, `ovos.audio.play_sound`, and `ovos.audio.speech`,
and both become REQUIRED only for headerless raw streams (raw PCM).
- §5.3 — `ovos.audio.is_speaking` is answered via the `response`
derivation (OVOS-MSG-1 §5.3), on `ovos.audio.is_speaking.response`.
- References OVOS-TRANSFORM-1 by its title, *Transformer Plugins
Specification*.
- The audio output service: the rendering pipeline (dialog-transformer
chain, TTS synthesis, TTS-transformer chain, playback queue), the
sequential playback queue shared by speech (`ovos.utterance.speak`) and
50 changes: 45 additions & 5 deletions audio-out.md
@@ -19,7 +19,7 @@ It builds on three companion specifications:
- the *Bus Message Specification* (OVOS-MSG-1) — the envelope,
routing keys, session carrier, and derivations every Message
defined here travels in;
- the *Transformer Injection Point Specification*
- the *Transformer Plugins Specification*
(OVOS-TRANSFORM-1) — the dialog-transformer and TTS-transformer
chains that run before and after TTS synthesis.

@@ -188,6 +188,32 @@ instant-sound mechanism:
queue: they play over whatever is currently scheduled, MAY overlap
each other, and are not stoppable.

### 4.0 Audio format

Every audio payload in this specification — whether referenced by
`uri` or carried inline as base64 in `audio` — follows one format
rule:

- **The container is the single source of truth.** Audio payloads
SHOULD be delivered in a self-describing container (e.g. WAV, MP3,
OGG, FLAC). Such containers identify themselves through their
leading bytes and carry the sample rate and channel layout in their
own headers; the payload therefore needs no out-of-band format
declaration.
- **`mime` is OPTIONAL and advisory.** A producer MAY set `mime` as a
hint; a consumer MAY use it to skip container detection. When the
hint and the payload's container disagree, the container is
authoritative.
- **`sample_rate` is OPTIONAL.** It is meaningful only when the
container does not carry the sample rate itself.
- **Headerless raw streams are the exception.** When the payload is a
raw, containerless stream (e.g. raw PCM), nothing in the bytes
describes the format, so the producer MUST set both `mime` and
`sample_rate`.

These rules apply uniformly to `ovos.audio.queue` (§4.1),
`ovos.audio.play_sound` (§4.2), and `ovos.audio.speech` (§4.3).
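
A consumer-side reading of the rule above can be sketched as follows. This is a minimal illustration, not part of the specification: the function names (`sniff_container`, `wav_sample_rate`, `resolve_format`) and the MIME strings are assumptions chosen for the example, and only the WAV header is actually parsed for a sample rate (via Python's standard `wave` module). A real consumer would normally let its decoder report the rate for the other containers.

```python
import io
import wave

# Leading-byte signatures for some of the containers named in §4.0.
# The MIME strings are illustrative labels, not values the spec mandates.
_SIGNATURES = (
    (b"OggS", "audio/ogg"),
    (b"fLaC", "audio/flac"),
    (b"ID3", "audio/mpeg"),  # MP3 that starts with an ID3v2 tag
)


def sniff_container(data: bytes):
    """Return an illustrative MIME label for a known container, else None."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "audio/wav"
    for magic, label in _SIGNATURES:
        if data.startswith(magic):
            return label
    # Bare MP3 frame sync (11 set bits). Raw PCM can collide with this by
    # chance, so a production consumer may want a stricter check.
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "audio/mpeg"
    return None


def wav_sample_rate(data: bytes):
    """Read the sample rate from a WAV header, or None if it cannot be parsed."""
    try:
        with wave.open(io.BytesIO(data), "rb") as wav:
            return wav.getframerate()
    except (wave.Error, EOFError):
        return None


def resolve_format(data: bytes, mime=None, sample_rate=None):
    """Apply the §4.0 rule: container first, hints only as a fallback."""
    container = sniff_container(data)
    if container is not None:
        # The container is authoritative, so a disagreeing `mime` is ignored.
        rate = wav_sample_rate(data) if container == "audio/wav" else None
        if rate is None:
            rate = sample_rate  # hint kept only when the header was not read
        return {"mime": container, "sample_rate": rate}
    # Headerless raw stream: nothing in the bytes describes the format.
    if mime is None or sample_rate is None:
        raise ValueError("headerless stream requires both mime and sample_rate")
    return {"mime": mime, "sample_rate": sample_rate}
```

With this sketch, a WAV payload sent with a mismatched `mime` hint still resolves to the WAV label, while raw PCM sent without both fields is rejected.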

### 4.1 Scheduled playback queue

This queue holds TTS speech (from `ovos.utterance.speak`, §3.2)
@@ -215,9 +241,13 @@ participants and their audio is delivered via
|-------|------|----------|---------|
| `uri` | string | no | URI referencing the audio data. |
| `audio` | string | no | Base64-encoded audio data, used when the audio source is on a different host (alternative to `uri`). |
| `mime` | string | no* | Advisory MIME type of the audio (§4.0). |
| `sample_rate` | number | no* | Sample rate in Hz (§4.0). |
| `listen` | bool | no | When `true`, re-opens the user input channel after this item plays (§4.4). |

Exactly one of `uri` or `audio` MUST be present.
Exactly one of `uri` or `audio` MUST be present. \* `mime` and
`sample_rate` are REQUIRED when the payload is a headerless raw
stream (§4.0).
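
The "exactly one of `uri` or `audio`" constraint could be checked on receipt along these lines. This is a hedged sketch: `check_audio_source` is a hypothetical helper name, and the choice to return the decoded bytes for inline audio (and `None` for a `uri`) is an assumption made for the example, not behaviour the specification prescribes.

```python
import base64
import binascii


def check_audio_source(data: dict):
    """Enforce 'exactly one of uri or audio' on a Message data dict.

    Returns the decoded bytes for inline audio, or None when the payload
    is referenced by `uri` (the caller would fetch it separately).
    """
    has_uri = "uri" in data
    has_audio = "audio" in data
    if has_uri == has_audio:
        # Either both fields are present or neither is.
        raise ValueError("exactly one of 'uri' or 'audio' must be present")
    if has_uri:
        return None
    audio = data["audio"]
    if not isinstance(audio, str):
        raise ValueError("'audio' must be a base64 string")
    try:
        return base64.b64decode(audio, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("'audio' is not valid base64") from exc
```

The same check fits `ovos.audio.play_sound`, which carries the identical constraint.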

### 4.2 Instant sounds

@@ -232,8 +262,12 @@ affected by stop signals (§6).
|-------|------|----------|---------|
| `uri` | string | no | URI referencing the audio data. |
| `audio` | string | no | Base64-encoded audio data, used when the audio source is on a different host (alternative to `uri`). |
| `mime` | string | no* | Advisory MIME type of the audio (§4.0). |
| `sample_rate` | number | no* | Sample rate in Hz (§4.0). |

Exactly one of `uri` or `audio` MUST be present.
Exactly one of `uri` or `audio` MUST be present. \* `mime` and
`sample_rate` are REQUIRED when the payload is a headerless raw
stream (§4.0).

### 4.3 Synthesised audio delivery — `ovos.audio.speech`

@@ -245,8 +279,13 @@ for decoding and playing it.
| Field | Type | Required | Meaning |
|-------|------|----------|---------|
| `audio` | string | yes | Base64-encoded synthesised audio. |
| `mime` | string | no* | Advisory MIME type of the audio (§4.0). |
| `sample_rate` | number | no* | Sample rate in Hz (§4.0). |
| `listen` | bool | no | When `true`, the client SHOULD re-open its microphone after playback. |

\* `mime` and `sample_rate` are REQUIRED when the payload is a
headerless raw stream (§4.0).
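
On the producer side, the headerless case might look like the sketch below. It builds only the `data` fields from the table above for a raw PCM payload; `build_raw_speech_data` is a hypothetical helper, and the `"audio/L16"` value in the usage line is an illustrative assumption, since the specification does not name particular MIME strings.

```python
import base64


def build_raw_speech_data(pcm: bytes, mime: str, sample_rate: int, listen: bool = False) -> dict:
    """Build `ovos.audio.speech` data for a headerless raw stream.

    Both `mime` and `sample_rate` are mandatory parameters here because
    nothing in raw PCM bytes describes the format (§4.0).
    """
    if not mime or sample_rate <= 0:
        raise ValueError("raw streams need a mime type and a positive sample_rate")
    data = {
        "audio": base64.b64encode(pcm).decode("ascii"),
        "mime": mime,
        "sample_rate": sample_rate,
    }
    if listen:
        # Optional field: only set when the client should re-open its microphone.
        data["listen"] = True
    return data


# Usage with illustrative values (16 kHz raw PCM, hypothetical MIME label).
example = build_raw_speech_data(b"\x00\x01\x02\x03", "audio/L16", 16000)
```

A producer sending a self-describing container instead could omit `mime` and `sample_rate` entirely.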

The session is identified via `context.session` as usual. A bridge
(OVOS-BRIDGE-1 §4.2.4) subscribes by `session_id` or `destination`
and relays this message to the client.
@@ -317,7 +356,8 @@ Message; the service answers for that session only. An absent or
`"default"` `session_id` asks about the device-local default session
(OVOS-SESSION-1 §3.1); it is not a wildcard over all sessions.

The service replies with:
The service **MUST** reply via the `response` derivation
(OVOS-MSG-1 §5.3) — on `ovos.audio.is_speaking.response` — with:

```json
{ "speaking": true }
@@ -411,7 +451,7 @@ The audio output service **MAY** scope its response to that session.
— the pipeline iteration, `ovos.utterance.speak`, and `ovos.utterance.handled`.
- *Bus Message Specification* (OVOS-MSG-1) — the envelope and
derivations used for all bus communication.
- *Transformer Injection Point Specification* (OVOS-TRANSFORM-1) —
- *Transformer Plugins Specification* (OVOS-TRANSFORM-1) —
the dialog-transformer and TTS-transformer chains that plug into
the rendering pipeline.
- *Stop Pipeline Plugin Specification* (OVOS-STOP-1) — the universal