[SDKv2] Add speech result types. #746
Draft: skottmckay wants to merge 3 commits into main from skottmckay/AudioOutputTypes
129 changes: 129 additions & 0 deletions
sdk_v2/cpp/src/inferencing/generative/audio/SPEECH_TYPES.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,129 @@ | ||
| # Speech Types — Design | ||
|
|
||
| Native SDK types returned by `AudioSession` for speech-to-text (file and live), | ||
| translation, and ASR scenarios. References: OpenAI `verbose_json` / Realtime | ||
| transcription events, Azure Speech SDK recognition results. | ||
|
|
||
| ## Design rules | ||
|
|
||
| - **One set of types covers transcription, translation, and ASR.** Task | ||
| selection (transcribe vs translate, target language) is a Request parameter, | ||
| not a type variant. `text` is the recognized-or-translated string either way. | ||
| - **One shared segment type** for both streaming events and final-result entries, | ||
| discriminated by `kind`. | ||
| - **No event wrapper, no `event_id`, no segment `id`.** Ordering is a property of | ||
| the callback channel; segment identity is implicit in stream order (zero-or-more | ||
| `kPartial` for the current segment, then one `kFinal` closes it). A web service | ||
| above the SDK can add envelope/sequence metadata. | ||
| - **`text` on `kPartial` is the cumulative current hypothesis for the segment** | ||
| (Azure-style), not a delta since the last event. A delta is recoverable by | ||
| diffing against the previous hypothesis. | ||
| - **`utterance_start` is a boolean on the segment.** Knowable at emission time | ||
| (VAD says "speech started" → producer tags the first `kPartial` of the new | ||
| segment). There is no `utterance_end` field: end-of-utterance can't be known | ||
| when the `kFinal` is emitted without delaying it by the silence threshold. | ||
| Instead, end is implicit — the next `utterance_start` marks it (consumer | ||
| infers end at the previous `kFinal.end_time`), a future `kSilence` event | ||
| marks it explicitly, or the final `SpeechResult` marks it for file | ||
| transcription. | ||
| - **Time as `int64_t` milliseconds.** Must survive the C ABI. Typedef'd so the | ||
| unit is legible and changeable in one place. | ||
| - **Two C ABI item types** — one for streaming segments, one for the final | ||
| aggregate. Both additive to existing items. | ||
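The streaming rules above (zero or more `kPartial`, then one `kFinal`, with cumulative text on partials) can be sketched from the consumer side. This is a minimal illustration, not part of the PR: `HypothesisDelta`, `DiffHypothesis`, and `TranscriptAccumulator` are hypothetical names, the `SpeechSegment` here is a trimmed stand-in, and joining finalized segments with a single space is an assumption the document does not make.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Trimmed stand-ins for the types proposed in this document.
enum class SpeechSegmentKind : int { kNone = 0, kPartial = 1, kFinal = 2 };

struct SpeechSegment {
  SpeechSegmentKind kind = SpeechSegmentKind::kNone;
  std::string text;  // for kPartial: cumulative current hypothesis
  bool utterance_start = false;
};

// Hypothetical: how to turn the previous hypothesis into the new one.
struct HypothesisDelta {
  std::size_t keep;    // leading bytes of the previous hypothesis that are unchanged
  std::string append;  // bytes to append after truncating to `keep`
};

// Byte-wise common-prefix diff. A production version would likely snap
// `keep` back to a UTF-8 code point boundary; that is omitted here.
HypothesisDelta DiffHypothesis(const std::string& prev, const std::string& next) {
  const std::size_t n = std::min(prev.size(), next.size());
  std::size_t keep = 0;
  while (keep < n && prev[keep] == next[keep]) ++keep;
  return {keep, next.substr(keep)};
}

// Hypothetical consumer: each kPartial replaces the current hypothesis,
// and each kFinal closes the segment and commits its text.
class TranscriptAccumulator {
 public:
  HypothesisDelta OnSegment(const SpeechSegment& seg) {
    HypothesisDelta delta = DiffHypothesis(current_, seg.text);
    if (seg.kind == SpeechSegmentKind::kFinal) {
      if (!committed_.empty()) committed_ += ' ';  // assumed separator
      committed_ += seg.text;
      current_.clear();
    } else if (seg.kind == SpeechSegmentKind::kPartial) {
      current_ = seg.text;
    }
    return delta;
  }

  // Committed text followed by the still-changing hypothesis, if any.
  std::string Display() const {
    if (committed_.empty() || current_.empty()) return committed_ + current_;
    return committed_ + ' ' + current_;
  }

 private:
  std::string committed_;
  std::string current_;
};
```

Because a later hypothesis can rewrite earlier words, the delta is a truncate-then-append pair rather than a pure suffix.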
|
|
||
| ## Types | ||
|
|
||
| ```cpp | ||
| namespace fl { | ||
|
|
||
| using DurationMs = std::int64_t; // milliseconds; C ABI-safe | ||
|
|
||
| enum class SpeechSegmentKind : int { | ||
| kNone = 0, // entry in a final aggregate result | ||
| kPartial = 1, // streaming: hypothesis for the current segment; may change | ||
| kFinal = 2, // streaming: segment is stable, or an entry in the final result | ||
| }; | ||
|
|
||
| struct SpeechWord { | ||
| std::string text; | ||
| std::optional<DurationMs> start_time; | ||
| std::optional<DurationMs> end_time; | ||
| std::optional<float> confidence; // 0..1 | ||
| std::optional<std::string> speaker_id; | ||
| }; | ||
|
|
||
| struct SpeechSegment { | ||
| SpeechSegmentKind kind = SpeechSegmentKind::kNone; | ||
|
|
||
| std::string text; // for kPartial: cumulative current hypothesis | ||
| std::optional<DurationMs> start_time; | ||
| std::optional<DurationMs> end_time; | ||
|
|
||
| // Utterance start signal — tagged on the first kPartial of a new utterance. | ||
| // Knowable at emission time. End-of-utterance is implicit (see design rules). | ||
| bool utterance_start = false; | ||
|
|
||
| std::vector<SpeechWord> words; // word-timestamp opt-in | ||
|
|
||
| // Future / opt-in. Included here for visibility in review. | ||
| // We should only add fields that we expect to use as the C API types need to be ABI stable, | ||
| // so we can't remove anything added. | ||
| std::optional<float> confidence; // 0..1 aggregate | ||
| std::optional<std::string> language; // per-segment, for code-switching | ||
| std::optional<std::string> speaker_id; | ||
| std::optional<std::int32_t> channel; | ||
| // we could maybe use something more generic if we want to report these things instead of having per-value fields | ||
| // e.g. shared float[] of fixed size and an enum saying which value is in which slot. | ||
| std::optional<float> avg_logprob; // Whisper-family diagnostic | ||
| std::optional<float> no_speech_prob; // Whisper-family diagnostic | ||
| std::optional<float> compression_ratio; // Whisper-family diagnostic | ||
|
Review comment (Collaborator, Author): Remove these.
|
||
| }; | ||
|
|
||
| struct SpeechResult { | ||
| std::string text; // concatenated final transcript | ||
| std::optional<std::string> language; // detected source language | ||
| std::optional<DurationMs> duration; // total audio duration | ||
| std::vector<SpeechSegment> segments; // entries are kFinal or kNone | ||
| }; | ||
|
|
||
| } // namespace fl | ||
| ``` | ||
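With the types above, the "end of utterance is implicit" rule can be sketched as a consumer-side pass over a stream: an utterance closes at the `end_time` of the last `kFinal` seen before the next `utterance_start`, or at the end of the stream. `UtteranceSpan` and `InferUtterances` are hypothetical names, and treating the first segment as opening an utterance even when `utterance_start` is unset is an assumption made only for this illustration.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

using DurationMs = std::int64_t;  // milliseconds

enum class SpeechSegmentKind : int { kNone = 0, kPartial = 1, kFinal = 2 };

// Trimmed copy of the proposed SpeechSegment: only the fields this sketch reads.
struct SpeechSegment {
  SpeechSegmentKind kind = SpeechSegmentKind::kNone;
  std::optional<DurationMs> start_time;
  std::optional<DurationMs> end_time;
  bool utterance_start = false;
};

// Hypothetical consumer-side record; not part of the proposed types.
struct UtteranceSpan {
  std::optional<DurationMs> start;
  std::optional<DurationMs> end;
};

std::vector<UtteranceSpan> InferUtterances(const std::vector<SpeechSegment>& stream) {
  std::vector<UtteranceSpan> spans;
  UtteranceSpan current;
  bool open = false;
  for (const SpeechSegment& seg : stream) {
    // A new utterance_start implicitly ends the utterance that was open.
    if (seg.utterance_start && open) {
      spans.push_back(current);
      open = false;
    }
    // Assumption: a segment arriving with no utterance open starts one.
    if (!open) {
      current = UtteranceSpan{seg.start_time, std::nullopt};
      open = true;
    }
    // The latest kFinal's end_time is the best known end so far.
    if (seg.kind == SpeechSegmentKind::kFinal) current.end = seg.end_time;
  }
  if (open) spans.push_back(current);  // end of stream closes the last utterance
  return spans;
}
```

An utterance that spans several segments keeps its first start and takes the end of its last `kFinal`.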
|
|
||
| ## C ABI item types | ||
|
|
||
| ```c | ||
| FOUNDRY_LOCAL_ITEM_SPEECH_SEGMENT = 31, // pushed via streaming callback | ||
| FOUNDRY_LOCAL_ITEM_SPEECH_RESULT = 32, // final aggregate in response.items | ||
| ``` | ||
|
|
||
| `TextItem` remains the trivial fallback for `response_format: "text"`. | ||
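Since `std::string` and `std::optional` cannot cross a C boundary as-is, the streaming item would presumably carry a flattened view of `SpeechSegment`. The sketch below shows one possible shape using presence flags; `SpeechSegmentC` and `ToC` are hypothetical and do not reflect the SDK's actual C structs, which this document does not show.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <type_traits>

using DurationMs = std::int64_t;  // milliseconds; C ABI-safe

enum class SpeechSegmentKind : int { kNone = 0, kPartial = 1, kFinal = 2 };

// Trimmed copy of the proposed C++ type.
struct SpeechSegment {
  SpeechSegmentKind kind = SpeechSegmentKind::kNone;
  std::string text;
  std::optional<DurationMs> start_time;
  std::optional<DurationMs> end_time;
  bool utterance_start = false;
};

// Hypothetical C-compatible mirror: fixed-width integers, a borrowed
// NUL-terminated string, and explicit presence flags instead of optionals.
struct SpeechSegmentC {
  std::int32_t kind;
  const char* text;            // borrowed; valid only while the source segment lives
  std::int64_t start_time_ms;  // meaningful only if has_start_time != 0
  std::int64_t end_time_ms;    // meaningful only if has_end_time != 0
  std::uint8_t has_start_time;
  std::uint8_t has_end_time;
  std::uint8_t utterance_start;
};

// The mirror must stay a plain C-layout type to be usable across the ABI.
static_assert(std::is_standard_layout<SpeechSegmentC>::value, "C layout required");
static_assert(std::is_trivially_copyable<SpeechSegmentC>::value, "must be memcpy-safe");

SpeechSegmentC ToC(const SpeechSegment& seg) {
  SpeechSegmentC out{};
  out.kind = static_cast<std::int32_t>(seg.kind);
  out.text = seg.text.c_str();
  out.start_time_ms = seg.start_time.value_or(0);
  out.end_time_ms = seg.end_time.value_or(0);
  out.has_start_time = seg.start_time.has_value() ? 1 : 0;
  out.has_end_time = seg.end_time.has_value() ? 1 : 0;
  out.utterance_start = seg.utterance_start ? 1 : 0;
  return out;
}
```

Presence flags keep 0 ms a valid timestamp. A sentinel such as -1 would also work, but it would make the unit typedef carry two meanings.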
|
|
||
| ## V1 scope | ||
|
|
||
| Populated in the initial implementation: | ||
|
|
||
| - `SpeechSegmentKind`: `kNone`, `kPartial`, `kFinal` | ||
| - `SpeechSegment`: `kind`, `text`, `start_time`, `end_time`, | ||
| `utterance_start` (defaulted; populated when computable) | ||
| - `SpeechResult`: `text`, `language`, `duration`, `segments` | ||
|
|
||
| Defined in the header but unpopulated until a producer exists: | ||
|
|
||
| - `SpeechWord` and `SpeechSegment::words` (word-timestamp opt-in) | ||
| - `confidence` (segment and word) | ||
| - `avg_logprob`, `no_speech_prob`, `compression_ratio` (Whisper diagnostics) | ||
| - `language` / `speaker_id` / `channel` on segment | ||
| - `speaker_id` on word | ||
|
|
||
| ## Growth headroom (not built) | ||
|
|
||
| - **Diarization**: `speaker_id` already present on word and segment. | ||
| - **Multi-channel audio**: `channel` already present on segment. | ||
| - **N-best alternatives**: future `std::vector<SpeechAlternative> alternatives` | ||
| on `SpeechSegment`. | ||
| - **OpenAI `verbose_json` compatibility**: handled by a | ||
| `ToOpenAIVerboseJson(const SpeechResult&)` adapter in | ||
| `contracts/audio_transcriptions.*`, not by changing native types. | ||
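A rough sketch of what the `ToOpenAIVerboseJson` adapter might look like follows. Only the function name and parameter come from this document. The `std::string` return type, the helpers `JsonString` and `Seconds`, and the subset of `verbose_json` fields emitted are assumptions. Timestamps are converted from milliseconds to seconds, and the segment `id` that the native types deliberately omit is synthesized from the segment's index.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <optional>
#include <string>
#include <vector>

using DurationMs = std::int64_t;  // milliseconds

// Trimmed copies of the proposed types.
struct SpeechSegment {
  std::string text;
  std::optional<DurationMs> start_time;
  std::optional<DurationMs> end_time;
};

struct SpeechResult {
  std::string text;
  std::optional<std::string> language;
  std::optional<DurationMs> duration;
  std::vector<SpeechSegment> segments;
};

// Hypothetical helper: quotes a string and escapes what JSON requires.
// Other bytes (including UTF-8 sequences) pass through unchanged.
std::string JsonString(const std::string& s) {
  std::string out = "\"";
  for (unsigned char c : s) {
    switch (c) {
      case '"':  out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      case '\n': out += "\\n"; break;
      case '\r': out += "\\r"; break;
      case '\t': out += "\\t"; break;
      default:
        if (c < 0x20) {
          char buf[8];
          std::snprintf(buf, sizeof buf, "\\u%04x", c);
          out += buf;
        } else {
          out += static_cast<char>(c);
        }
    }
  }
  return out + "\"";
}

// Hypothetical helper: non-negative milliseconds to seconds, three decimals.
std::string Seconds(DurationMs ms) {
  char buf[32];
  std::snprintf(buf, sizeof buf, "%lld.%03lld",
                static_cast<long long>(ms / 1000),
                static_cast<long long>(ms % 1000));
  return buf;
}

std::string ToOpenAIVerboseJson(const SpeechResult& r) {
  std::string j = "{";
  if (r.language) j += "\"language\":" + JsonString(*r.language) + ",";
  if (r.duration) j += "\"duration\":" + Seconds(*r.duration) + ",";
  j += "\"text\":" + JsonString(r.text) + ",\"segments\":[";
  for (std::size_t i = 0; i < r.segments.size(); ++i) {
    const SpeechSegment& s = r.segments[i];
    if (i > 0) j += ",";
    j += "{\"id\":" + std::to_string(i);  // id synthesized from stream order
    if (s.start_time) j += ",\"start\":" + Seconds(*s.start_time);
    if (s.end_time) j += ",\"end\":" + Seconds(*s.end_time);
    j += ",\"text\":" + JsonString(s.text) + "}";
  }
  return j + "]}";
}
```

Keeping the conversion in an adapter means the unit change (milliseconds to seconds) and the synthesized `id` stay out of the ABI-stable native types.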
|
|
||
| Multi-target translation in a single pass is intentionally out of scope — | ||
| that's a server-side concern, not a local-inferencing one. | ||
Review comment: Also going to remove `confidence`, given that's available at the word level.