feat(stt): add speaker diarization support to STT interface and proxy by russellmartin-livekit · Pull Request #5283 · livekit/agents

russellmartin-livekit · 2026-03-30T23:29:32Z

Related gateway change: https://github.com/livekit/agent-gateway/pull/557

Changes in `agents`

Options: Updates DeepgramOptions and AssemblyaiOptions to explicitly type diarization flags (diarize and speaker_labels).
Capabilities: Modifies the STT inference wrapper to dynamically set the diarization capability based on the provided extra_kwargs during initialization and update_options.
Data Models: Adds speaker_id to TimedString and populates it from the inference proxy's response.
Testing: Adds tests to verify diarization capability detection.
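
The capability change in the list above can be illustrated with a minimal sketch. It is not the code from this PR: `SketchSTT`, `Capabilities`, `has_diarization`, and `DIARIZATION_KEYS` are hypothetical names. Only the `diarize` and `speaker_labels` flags, `extra_kwargs`, `update_options(extra=...)`, and `capabilities.diarization` come from the description and tests.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Provider flags named in the PR description: Deepgram's "diarize" and
# AssemblyAI's "speaker_labels". Grouping them in one tuple is an assumption.
DIARIZATION_KEYS = ("diarize", "speaker_labels")


@dataclass
class Capabilities:
    # Hypothetical stand-in for the real STT capabilities object.
    diarization: bool = False


def has_diarization(extra_kwargs: dict[str, Any]) -> bool:
    """Return True if any known diarization flag is set to a truthy value."""
    return any(bool(extra_kwargs.get(key)) for key in DIARIZATION_KEYS)


class SketchSTT:
    """Hypothetical wrapper that derives the capability from extra kwargs."""

    def __init__(self, extra_kwargs: Optional[dict[str, Any]] = None) -> None:
        self._extra: dict[str, Any] = dict(extra_kwargs or {})
        self.capabilities = Capabilities(diarization=has_diarization(self._extra))

    def update_options(self, *, extra: Optional[dict[str, Any]] = None) -> None:
        # Merge the new options, then recompute the capability so that a
        # later {"diarize": False} switches it back off.
        if extra:
            self._extra.update(extra)
        self.capabilities.diarization = has_diarization(self._extra)
```

Recomputing the flag from the merged options on every `update_options` call keeps the capability in sync in both directions, which is the behavior the toggle tests in this PR check.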

Fixes AGT-2608

Slack thread: https://live-kit.slack.com/archives/C06TN33TV44/p1772573869144129?thread_ts=1771977322.899519&cid=C06TN33TV44

https://claude.ai/code/session_01VRKQuBXiq8BHKr9AiJ6uEw

The inference STT capabilities.diarization was hardcoded to False, which caused MultiSpeakerAdapter to not work since it checks capabilities.diarization before enabling diarization.

This change:
- Adds diarize option to DeepgramOptions TypedDict
- Adds speaker_labels option to AssemblyaiOptions TypedDict
- Detects diarization params in extra_kwargs and sets capabilities
- Updates capabilities when update_options() is called with diarization
- Adds comprehensive tests for diarization capability detection

Fixes AGT-2608

Slack thread: https://live-kit.slack.com/archives/C06TN33TV44/p1772573869144129?thread_ts=1771977322.899519&cid=C06TN33TV44

https://claude.ai/code/session_01VRKQuBXiq8BHKr9AiJ6uEw

CLAassistant · 2026-03-30T23:29:41Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ russellmartin-livekit
❌ claude
You have signed the CLA already but the status is still pending? Let us recheck it.

theomonnom · 2026-04-02T02:34:23Z

tests/test_inference_stt_fallback.py

+class TestSTTDiarizationCapabilities:
+    """Tests for STT diarization capability detection from extra_kwargs."""
+
+    def test_no_diarization_by_default(self):
+        """Without diarization params, capabilities.diarization is False."""
+        stt = _make_stt()
+        assert stt.capabilities.diarization is False
+
+    def test_diarization_enabled_with_deepgram_diarize(self):
+        """Deepgram's 'diarize' param enables diarization capability."""
+        stt = _make_stt(extra_kwargs={"diarize": True})
+        assert stt.capabilities.diarization is True
+
+    def test_diarization_disabled_with_diarize_false(self):
+        """Deepgram's 'diarize: False' keeps diarization capability False."""
+        stt = _make_stt(extra_kwargs={"diarize": False})
+        assert stt.capabilities.diarization is False
+
+    def test_diarization_enabled_with_assemblyai_speaker_labels(self):
+        """AssemblyAI's 'speaker_labels' param enables diarization capability."""
+        stt = _make_stt(model="assemblyai/universal-streaming", extra_kwargs={"speaker_labels": True})
+        assert stt.capabilities.diarization is True
+
+    def test_diarization_disabled_with_speaker_labels_false(self):
+        """AssemblyAI's 'speaker_labels: False' keeps diarization capability False."""
+        stt = _make_stt(model="assemblyai/universal-streaming", extra_kwargs={"speaker_labels": False})
+        assert stt.capabilities.diarization is False
+
+    def test_diarization_with_other_extra_kwargs(self):
+        """Diarization works alongside other extra_kwargs."""
+        stt = _make_stt(extra_kwargs={"diarize": True, "punctuate": True, "smart_format": True})
+        assert stt.capabilities.diarization is True
+
+    def test_update_options_enables_diarization(self):
+        """update_options with diarization params enables diarization capability."""
+        stt = _make_stt()
+        assert stt.capabilities.diarization is False
+        stt.update_options(extra={"diarize": True})
+        assert stt.capabilities.diarization is True
+
+    def test_update_options_disables_diarization(self):
+        """update_options can disable diarization by setting params to False."""
+        stt = _make_stt(extra_kwargs={"diarize": True})
+        assert stt.capabilities.diarization is True
+        stt.update_options(extra={"diarize": False})
+        assert stt.capabilities.diarization is False


Those tests aren't really useful?

yeah it's overkill, removed most of them

This update introduces the ability to determine the speaker ID in the SpeechData object when processing speech events. If the speaker is not provided and the speech is final, the system will infer the speaker based on the most common speaker ID from the recognized words. Additionally, several outdated tests related to diarization capabilities have been removed, and the test for toggling diarization has been updated for clarity. This change improves the accuracy of speaker identification in speech recognition scenarios.
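
The fallback described in this commit message (use the most common word-level speaker ID when a final transcript has no speaker) can be sketched roughly as below. The function name `infer_speaker_id` and its parameters are hypothetical and do not come from the PR; the sketch only assumes that each recognized word may carry an optional string speaker ID.

```python
from collections import Counter
from typing import Optional, Sequence


def infer_speaker_id(
    speaker_id: Optional[str],
    word_speaker_ids: Sequence[Optional[str]],
    is_final: bool,
) -> Optional[str]:
    """Pick a speaker for a transcript segment.

    An explicit speaker_id always wins. Otherwise, for final transcripts
    only, fall back to the most frequent speaker among the words.
    """
    if speaker_id is not None or not is_final:
        return speaker_id

    # Ignore words that have no speaker attached.
    counts = Counter(s for s in word_speaker_ids if s is not None)
    if not counts:
        return None

    # most_common(1) returns a one-element list of (value, count) pairs.
    return counts.most_common(1)[0][0]
```

For example, `infer_speaker_id(None, ["0", "1", "1"], is_final=True)` selects `"1"`, while the same call with `is_final=False` leaves the speaker unset.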

theomonnom · 2026-04-02T20:55:50Z

livekit-agents/livekit/agents/inference/stt.py

            end_time=self.start_time_offset + data.get("start", 0) + data.get("duration", 0),
            confidence=data.get("confidence", 1.0),
            text=text,
+            speaker_id=f"S{speaker}" if speaker is not None else None,


What's tricky with diarization in our inference API is that we need a way for the speaker_id to be consistent across every provider.

Maybe some of them aren't even int, but str?

I could standardize the inference side or I could just pass through whatever we get from the provider

I think the gateway should standardize it.
It would be OK if it was extra, but everything inside the core "inference" API should be the same across every provider

Agreed, my latest changes standardize it on the gateway side.

Standardized on the gateway side to return string ints so it stays compatible with plugins that require strings. Also added support for word-level diarization.
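
One way the "string ints" normalization discussed in this thread could look is sketched below. This is an assumption for illustration, not the gateway's actual implementation: `SpeakerIdNormalizer` is a hypothetical class, and assigning IDs in order of first appearance is only one possible policy.

```python
from typing import Optional, Union


class SpeakerIdNormalizer:
    """Map provider-specific speaker labels to stable string integers.

    Labels are numbered in order of first appearance within one session,
    so an int label such as 0 and a letter label such as "A" both become
    "0" if they are the first speaker seen.
    """

    def __init__(self) -> None:
        self._mapping: dict[str, str] = {}

    def normalize(self, raw: Optional[Union[int, str]]) -> Optional[str]:
        if raw is None:
            return None
        key = str(raw)
        if key not in self._mapping:
            # The next free index becomes the normalized ID.
            self._mapping[key] = str(len(self._mapping))
        return self._mapping[key]
```

Keeping one normalizer per session means the same provider label always maps to the same string, which is the consistency property the review asks for.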

This update introduces a new speaker_id attribute to the TimedString class, allowing for better tracking of speaker information in speech recognition scenarios. The STT processing logic has been updated to utilize this new attribute, enhancing the accuracy of speaker identification in the SpeechData object.

This update modifies the speaker_id attribute in the TimedString class to be of type str instead of int. The STT processing logic has been adjusted accordingly to ensure compatibility with this change, enhancing the handling of speaker identification in speech recognition scenarios.

This update simplifies the extraction of speaker_id in the SpeechData object by directly using the value from the input data and words, removing unnecessary conversions. This change enhances code clarity and maintains compatibility with recent updates to the TimedString class.

russellmartin-livekit requested review from a team, adrian-cowham and theomonnom March 30, 2026 23:29

russellmartin-livekit self-assigned this Mar 30, 2026


theomonnom reviewed Apr 2, 2026


russellmartin-livekit changed the title ~~fix(inference): set STT capabilities.diarization from extra_kwargs~~ feat(stt): add speaker diarization support to STT interface and proxy Apr 4, 2026

russellmartin-livekit requested a review from theomonnom April 7, 2026 16:45

russellmartin-livekit wants to merge 5 commits into main from claude/slack-support-diarization-stt-providers-cWpcE


4 participants
