feat(room-io): add jsonFormat option for timed transcription output #1305
Merged

toubatbrian merged 7 commits into main · Apr 30, 2026
Conversation
Port of livekit/agents#5472. Adds `jsonFormat` to `RoomOutputOptions`; when enabled, chunks published on the `lk.transcription` datastream topic are serialized as newline-delimited JSON objects with `text` and `start_time`/`end_time` fields when the chunk is a `TimedString`. The `TranscriptionSynchronizer` now emits `TimedString` items with `end_time` reflecting synchronized playback timing so subscribers can align chunks against playback without extra bookkeeping.
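As a rough subscriber-side sketch of consuming this format (the field names follow the description above; the helper name and sample payload are illustrative, not code from the PR):

```typescript
// Shape of each newline-delimited JSON chunk on `lk.transcription`
// when `jsonFormat` is enabled (snake_case on the wire).
interface TimedChunk {
  text: string;
  start_time?: number;
  end_time?: number;
  confidence?: number;
  start_time_offset?: number;
}

// Hypothetical helper: split a received payload on newlines and parse
// each non-empty line as one JSON chunk.
function parseTranscriptionChunks(payload: string): TimedChunk[] {
  return payload
    .split('\n')
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as TimedChunk);
}

// Example: two timed chunks followed by an untimed trailing fragment.
const chunks = parseTranscriptionChunks(
  '{"text":"Hello ","start_time":0,"end_time":0.4}\n' +
    '{"text":"world","start_time":0.4,"end_time":0.8}\n' +
    '{"text":"."}\n',
);
```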
🦋 Changeset detected — latest commit: 64aac55. The changes in this PR will be included in the next version bump; this PR includes changesets to release 28 packages.
…medString for trailing sentence fragment

- `ParticipantTranscriptionOutput` now stores the JSON-encoded payload (not the raw text) as `latestText` when `jsonFormat` is enabled, so the non-delta `FINAL=true` flush publishes the same newline-delimited JSON shape as interim chunks. Without this, `userTranscriptOutput` (which uses `isDeltaStream: false`) broke line-by-line JSON parsers on the terminal message. Mirrors the Python behavior in `_output.py`, where `_latest_text` is reassigned to the encoded payload in place of the raw `_latest_text = text` assignment.
- `SegmentSynchronizerImpl` now wraps the trailing sentence fragment (anything after the last word) as a `TimedString` with `endTime`, matching every other emission on the same output stream.
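A minimal sketch of the `latestText` fix described above, with hypothetical names (`TranscriptionState`, `capture`, `flushFinal`); the real logic lives in `_output.ts`:

```typescript
// Why latestText must hold the encoded payload: a non-delta stream
// (isDeltaStream: false) republishes the latest text wholesale on the
// FINAL=true flush, so it must already be in NDJSON wire shape.
class TranscriptionState {
  latestText = '';

  capture(text: string, jsonFormat: boolean): string {
    const payload = jsonFormat ? JSON.stringify({ text }) + '\n' : text;
    this.latestText = payload; // store the encoded payload, not the raw text
    return payload;
  }

  flushFinal(): string {
    return this.latestText; // terminal message keeps the same JSON shape
  }
}

const state = new TranscriptionState();
state.capture('hello', true);
```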
toubatbrian (Contributor, Author) commented on Apr 28, 2026:
@toubatbrian — done in 979bafa, the changeset is now in place.

Generated by Claude Code
theomonnom approved these changes on Apr 30, 2026.
Summary
Ports livekit/agents#5472 — "feat(room-io): add json_format option for timed transcription output" — into agents-js. When `jsonFormat` is enabled on `RoomOutputOptions`, every chunk published by `ParticipantTranscriptionOutput` on the `lk.transcription` datastream topic is emitted as a newline-delimited JSON object instead of a raw string. Each object has:

- `text`: the transcript chunk
- `start_time`: seconds since the segment started (only set if the chunk is a `TimedString` with `startTime`)
- `end_time`: seconds since the segment started (only set if the chunk is a `TimedString` with `endTime`)
- `confidence`: optional STT confidence (when present)
- `start_time_offset`: optional segment offset (when present)

Subscribers can parse the stream line by line (each chunk ends with `\n`).

Ported changes

- `RoomOutputOptions.jsonFormat: boolean` (`agents/src/voice/room_io/room_io.ts`) — new option, defaults to `false`.
- `ParticipantTranscriptionOutput` (`agents/src/voice/room_io/_output.ts`) — now takes an optional `ParticipantTranscriptionOutputOptions` fourth constructor argument with `{ jsonFormat?: boolean }`. Overrides `captureText` so the JSON serialization happens before `handleCaptureText`.
- `TranscriptionSynchronizer` (`agents/src/voice/transcription/synchronizer.ts`) — `SegmentSynchronizerImpl` now writes `TimedString` items (with `endTime` reflecting synchronized playback timing) to its output stream instead of plain strings. The downstream `TextOutput.captureText(string | TimedString)` contract already accepted both, so the only downstream whose behavior materially changes is `ParticipantTranscriptionOutput` when `jsonFormat: true`.
- Changeset (`.changeset/room-io-json-transcription.md`) — minor bump for `@livekit/agents`.
- `// Ref: python …` comments on every ported line, per the agents-js porting guide in `CLAUDE.md`.

Implementation nuances (JS vs Python)
Cases where strict code-level parity was not practical:
- `TextOutputOptions` vs `RoomOutputOptions` — Python has a dedicated `TextOutputOptions` dataclass on `RoomOptions`; agents-js keeps all room output settings inline on `RoomOutputOptions`. `jsonFormat` is therefore added directly to `RoomOutputOptions` and threaded into `createTranscriptionOutput` through `this.outputOptions.jsonFormat`, rather than routed through a nested options object.
- Python builds a `livekit.protocol.agent_pb.TimedString` protobuf and serializes it via `MessageToDict(preserving_proto_field_name=True)`. The JS port emits the same wire shape directly (`{ text, start_time?, end_time?, confidence?, start_time_offset? }`) without introducing a `@livekit/protocol` runtime dependency on the JS side. Field names remain snake_case so the output matches Python byte-for-byte for consumers.
- agents-js uses ms internally (`startWallTime = Date.now()`), so the synchronizer divides by `1000` when stamping `endTime` on the emitted `TimedString`, keeping the JSON output on the same seconds-based scale as Python.
- `SegmentSynchronizerImpl.mainTask` has a small fallback (`if (textCursor < sentence.length)`) that emits leftover whitespace/punctuation between the last word and the end of a sentence. Python has no direct equivalent because its word splitting handles this differently. To keep behavior conservative, the leftover chunk is still emitted as a plain string (i.e. `{"text": " "}` with no timing when `jsonFormat: true`) rather than a fabricated `TimedString`.
- The `uv.lock`/`pyproject.toml` bumps (new `cerebras`/`krisp`/`runway` workspace entries and `livekit-protocol>=1.1.6`) were intentionally not ported; those are Python dependency-manager changes.

Test plan
- `pnpm build:agents` — passes
- `pnpm lint` — passes (no new warnings introduced)
- `pnpm exec vitest run src/voice/room_io src/voice/transcription` — 23/23 pass
- Manual check: enable `jsonFormat: true` on `RoomOutputOptions`, subscribe to `lk.transcription` from a client, and confirm each message is a parseable JSON object with `text` and `end_time` (and `start_time` when the upstream STT provides it).

cc @toubatbrian @livekit/agent-devs for review.
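To recap the wire-shape and time-unit points from the Implementation nuances section, a small sketch; both helpers (`toWirePayload`, `elapsedSeconds`) and the `TimedString` shape here are illustrative assumptions, not the PR's actual code:

```typescript
// Illustrative TimedString shape (the real one lives in agents-js).
interface TimedString {
  text: string;
  startTime?: number; // seconds
  endTime?: number;   // seconds
}

// Map a chunk to the snake_case wire shape Python produces via
// MessageToDict(preserving_proto_field_name=True).
function toWirePayload(chunk: string | TimedString): string {
  if (typeof chunk === 'string') {
    // Plain strings (e.g. trailing fragments) carry no timing fields.
    return JSON.stringify({ text: chunk }) + '\n';
  }
  const obj: { text: string; start_time?: number; end_time?: number } = {
    text: chunk.text,
  };
  if (chunk.startTime !== undefined) obj.start_time = chunk.startTime;
  if (chunk.endTime !== undefined) obj.end_time = chunk.endTime;
  return JSON.stringify(obj) + '\n'; // newline-delimited
}

// agents-js tracks wall time in ms (Date.now()); the emitted TimedString
// stays on Python's seconds scale, hence the division by 1000.
function elapsedSeconds(startWallTimeMs: number, nowMs: number): number {
  return (nowMs - startWallTimeMs) / 1000;
}
```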
Generated by Claude Code