
feat(smallestai): add Pulse STT with real-time streaming and batch transcription#5312

Open
harshitajain165 wants to merge 8 commits into livekit:main from harshitajain165:smallest-stt

Conversation

@harshitajain165 commented Apr 2, 2026

Summary

This PR adds speech-to-text support to the existing livekit-plugins-smallestai package via the Smallest AI Pulse STT API, complementing the Lightning TTS integration that already exists.

  • Streaming (SpeechStream): real-time transcription over WebSocket with interim and final transcripts, ~64ms TTFT
  • Batch (_recognize_impl): pre-recorded transcription via HTTP POST
  • Word-level timestamps: per-word start/end/confidence included by default (word_timestamps=True)
  • Speaker diarization: opt-in via diarize=True
  • Configurable end-of-utterance timeout: eou_timeout_ms (100–10,000ms, default 800ms)
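
As a sketch of how these options might hang together, here is a hypothetical mirror of the constructor options listed above with the documented `eou_timeout_ms` range enforced (names follow this PR description; the plugin's real signature may differ):

```python
from dataclasses import dataclass

@dataclass
class PulseSTTOptions:
    """Hypothetical mirror of the options above, for illustration only."""
    encoding: str = "linear16"
    word_timestamps: bool = True   # per-word start/end/confidence by default
    diarize: bool = False          # speaker diarization is opt-in
    eou_timeout_ms: int = 800     # end-of-utterance timeout, 100-10,000 ms

    def __post_init__(self) -> None:
        # Reject values outside the range stated in the PR description.
        if not 100 <= self.eou_timeout_ms <= 10_000:
            raise ValueError("eou_timeout_ms must be between 100 and 10,000 ms")
```

With this shape, `PulseSTTOptions()` yields the documented defaults, while `PulseSTTOptions(eou_timeout_ms=50)` raises at construction time.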

Implementation notes

  • Follows the same patterns as other STT plugins in the repo
  • API field names (transcript, is_final, is_last, finalize message) verified against docs.smallest.ai
  • START_OF_SPEECH is inferred from the first non-empty transcript since the Pulse API does not emit a dedicated speech-start event
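
The START_OF_SPEECH inference described in the last note can be sketched as a small generator (illustrative only, not the plugin's actual code):

```python
from collections.abc import Iterable, Iterator

def infer_events(transcripts: Iterable[str]) -> Iterator[str]:
    # The Pulse API emits no dedicated speech-start event, so synthesize a
    # START_OF_SPEECH marker just before the first non-empty transcript.
    started = False
    for text in transcripts:
        if not text.strip():
            continue  # skip empty transcripts; no speech yet
        if not started:
            started = True
            yield "START_OF_SPEECH"
        yield f"TRANSCRIPT:{text}"
```

For example, feeding `["", " ", "hel", "hello"]` yields one `START_OF_SPEECH` followed by the two non-empty transcripts.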

Test plan

  • test_recognize[livekit.plugins.smallestai] passes
  • test_stream[livekit.plugins.smallestai] passes
  • ruff format and ruff check pass
  • mypy --strict passes

@devin-ai-integration bot (Contributor) left a comment


Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.


Comment on lines +521 to +528
else:
    self._event_ch.send_nowait(
        stt.SpeechEvent(
            type=stt.SpeechEventType.INTERIM_TRANSCRIPT,
            request_id=self._session_id,
            alternatives=alts,
        )
    )
@devin-ai-integration bot (Contributor) commented Apr 2, 2026


🔴 STT capability declares interim_results=False but code emits INTERIM_TRANSCRIPT events

The STT constructor at line 139 declares interim_results=False in STTCapabilities, but _process_stream_event at lines 517-524 emits stt.SpeechEventType.INTERIM_TRANSCRIPT events whenever the server returns a non-final transcript (is_final=False). The Smallest AI Pulse API does return partial transcripts (the schema comment at line 475 says transcript is "partial or final text"), so the capability should be True. This mismatch causes incorrect behavior in the FallbackAdapter (livekit-agents/livekit/agents/stt/fallback_adapter.py:80) which uses all(t.capabilities.interim_results for t in stt) to compose capabilities — it would incorrectly report that the combined STT doesn't support interim results even if the other STT does.


@CLAassistant commented Apr 2, 2026

CLA assistant check
All committers have signed the CLA.

devin-ai-integration[bot]

This comment was marked as resolved.

@tinalenguyen (Member) left a comment


hi, thank you for the PR! i have a few notes, could you:

  • address all of the devin comments, especially the one regarding interim transcripts
  • remove smallest ai from the test files, as we do not have a smallestai api key for testing as of yet
  • sign the CLA if possible

@harshitajain165 (Author)

Hey @tinalenguyen,

Thanks for the comment. I'm addressing the Devin comments, removing Smallest AI from the test files, and signing the CLA. I'll keep you posted once all are done.

@harshitajain165 force-pushed the smallest-stt branch 2 times, most recently from fc0b7bb to 26ba81c on April 8, 2026 at 07:14
@harshitajain165 (Author)

recheck

@harshitajain165 (Author)

Hey @tinalenguyen,

The Devin comments have been incorporated, Smallest AI has been removed from the test files, and I have signed the CLA too. Please feel free to re-review and take this forward.

…support

Adds speech-to-text support to the existing Smallest AI plugin via the
Waves Pulse API, covering both real-time WebSocket streaming and
pre-recorded HTTP batch transcription.
- Add lightning-v3.1 as the new default model (80+ voices, ~100ms latency)
- Remove deprecated lightning and lightning-large models
- Update base URL to api.smallest.ai/waves/v1
- Simplify endpoint to get_speech for all models (removes get_speech_long_text)
- Add alaw encoding support (v3.1)
- Restrict consistency/similarity/enhancement params to lightning-v2 only
- Remove unused `interim_results` option from STT (constructor, options
  dataclass, and update_options). The Pulse API does not support
  server-side interim filtering and the plugin never honoured the flag.
  STTCapabilities now declares interim_results=False.
- Remove smallestai from test_stt.py and test_tts.py since there is no
  Smallest AI API key available in CI.
- Remove spurious TTS warning about consistency/similarity/enhancement
  params that fired on every default TTS() instantiation. The downstream
  _to_smallest_options already correctly excludes those params for
  non-v2 models.

Made-with: Cursor
@tinalenguyen (Member)

@harshitajain165 Thank you for iterating on the feedback!

For the STT, I printed out the received events and it does seem that interim results are emitted. Is there a setting to pass to the API, or does the API always send interim results? If it always does, I would set interim_results to True and disregard the Devin comment. I also noticed that the final transcript often begins with leading spaces. Is that expected?
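
One possible explanation for the leading spaces is that some streaming STT backends prefix each partial chunk with a space so concatenated partials stay word-separated; whether Pulse does this is an assumption. If so, a minimal normalization on the final event would be:

```python
def normalize_final_transcript(transcript: str) -> str:
    # Hypothetical fix: strip any separator whitespace the backend prepends
    # to transcript chunks before emitting the FINAL_TRANSCRIPT event.
    return transcript.lstrip()

print(normalize_final_transcript("  hello world"))  # "hello world"
```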

Also, when testing the TTS, I keep facing this error:
failed to synthesize speech: message='Bad Request (400)', status_code=400, retryable=False, retrying in 2.0s

encoding: STTEncoding | str = "linear16",
word_timestamps: bool = True,
diarize: bool = False,
eou_timeout_ms: int = 800,
Member

Should this be 0?

With our end-of-turn detection model, we should prioritize minimizing latency to receive transcripts.


4 participants