feat(mistral): support voxtral realtime streaming stt & modernize mistral plugin #5289
Conversation
@theomonnom or @longcw can you review this? 🙌
@jeanprbt I tested the new Voxtral realtime STT path in both a normal voice-agent flow and a transcription-only flow. In both cases, the realtime STT path produced no transcript events.

One useful detail: I also did not observe any provider-side interim/delta transcript events during speech. The only provider events I saw were the session lifecycle events. So from my side, this does not currently look like only an end-of-turn/finalization issue. It looks like the end-to-end realtime STT path is not producing transcript events at all in these setups, even though the provider connection itself appears healthy. Do you have a minimal example or a known-good setup for this PR that I can use to validate the intended behavior?
@zkewal I can't manage to reproduce your issue. Here's a working setup on my side, based on this current PR. I added `logger.debug(event.text)` at line 367 in the plugin's transcript handling, and used the following agent:

```python
import logging

from dotenv import load_dotenv

from livekit.agents import (
    Agent,
    AgentServer,
    AgentSession,
    JobContext,
    JobProcess,
    TurnHandlingOptions,
    cli,
)
from livekit.plugins import mistralai, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

logger = logging.getLogger("basic-agent")

load_dotenv()


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="""You are a helpful voice AI assistant. The user is interacting with you via voice, even if you perceive the conversation as text.
            You eagerly assist users with their questions by providing information from your extensive knowledge.
            Your responses are concise, to the point, and without any complex formatting or punctuation including emojis, asterisks, or other symbols.
            You are curious, friendly, and have a sense of humor.""",
        )


server = AgentServer()


def prewarm(proc: JobProcess) -> None:
    proc.userdata["vad"] = silero.VAD.load()


server.setup_fnc = prewarm


@server.rtc_session()
async def entrypoint(ctx: JobContext) -> None:
    ctx.log_context_fields = {
        "room": ctx.room.name,
    }

    session: AgentSession = AgentSession(
        stt=mistralai.STT(model="voxtral-mini-transcribe-realtime-2602"),
        llm=mistralai.LLM(),
        tts=mistralai.TTS(response_format="mp3"),
        vad=ctx.proc.userdata["vad"],
        turn_handling=TurnHandlingOptions(
            turn_detection=MultilingualModel(),
        ),
        preemptive_generation=True,
    )

    await session.start(
        agent=Assistant(),
        room=ctx.room,
    )


if __name__ == "__main__":
    cli.run_app(server)
```

If you have a Mistral API key, this agent runs fully on Mistral models.
davidzhao left a comment:

lgtm. good work cleaning up the other pieces too
> client: Optional pre-configured MistralAI client instance.
> api_key: Your Mistral AI API key. If not provided, will use the MISTRAL_API_KEY environment variable.
> model: The Mistral AI model to use for transcription, default is "voxtral-mini-latest".
> language: The language code to use for transcription (e.g., "fr" for French), default is "en".
does the model support not auto-detecting language?
Good catch, I just updated the PR with better language handling.
Batch models
If the optional `language` is not specified, nothing is passed to the batch model; otherwise, it is passed along for better transcription accuracy. The `SpeechEvent` contains the model's detected language if available, else the optionally passed `language`, else an empty `LanguageCode`.
Streaming models
Since streaming models do not support a `language` argument (i.e. they always auto-detect the language), the argument passed to `stream()` is only used as a fallback for the `language` attribute of `INTERIM_TRANSCRIPT` events, when the streaming model did not return the detected language in a `TranscriptionStreamLanguage` event.
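The resulting preference order, for both batch and streaming events, can be sketched as a small helper (hypothetical names, not the plugin's actual code):

```python
from typing import Optional


def resolve_language(
    detected: Optional[str],   # model-detected language (e.g. from a TranscriptionStreamLanguage event)
    requested: Optional[str],  # the optional `language` argument supplied by the caller
) -> str:
    """Pick the language to attach to a SpeechEvent.

    Preference order: model-detected language, then the caller-supplied
    language, then an empty language code.
    """
    if detected:
        return detected
    if requested:
        return requested
    return ""


# Detection wins over the requested language.
print(resolve_language("fr", "en"))   # fr
# No detection: fall back to the caller-supplied language.
print(resolve_language(None, "en"))   # en
# Neither: an empty LanguageCode.
print(resolve_language(None, None))   # (empty string)
```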
@theomonnom do you think you could merge this PR?
Summary

Modernizes the `livekit-plugins-mistralai` plugin with Voxtral Realtime streaming STT support, a `context_bias` argument for batch STT, and an up-to-date models list.

Changes
Voxtral Realtime streaming STT

Adds full support for `voxtral-mini-transcribe-realtime-2602`, Mistral's streaming speech-to-text model. The implementation opens a persistent WebSocket connection (via `mistralai`'s `RealtimeTranscription` API), streams audio frames, and emits `START_OF_SPEECH`, `INTERIM_TRANSCRIPT`, `FINAL_TRANSCRIPT`, and `END_OF_SPEECH` events.

Internal VAD for flush-based endpointing
Voxtral Realtime does not perform server-side endpointing: it never sends a `TranscriptionStreamDone` event on its own. Without that event, no `FINAL_TRANSCRIPT` is produced, the pipeline has no committed text, and end-of-turn detection never triggers.

To solve this, the plugin runs an internal Silero VAD stream alongside the Mistral WebSocket. Every audio frame is fed to both the Mistral connection and the VAD. When the VAD detects `END_OF_SPEECH` (default: 550 ms of silence), the plugin calls `connection.flush_audio()`, which forces the server to emit `TranscriptionStreamDone`. The plugin then emits `FINAL_TRANSCRIPT` followed by `END_OF_SPEECH`, making the Mistral STT behave like any other streaming STT with native endpointing from the pipeline's perspective.

Silero VAD is auto-loaded with default settings when no VAD is provided. Users can pass a custom VAD instance (e.g. with a shorter silence threshold) for finer control.
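The flush-on-silence flow can be illustrated with a self-contained sketch; `StubVAD` and `StubConnection` are stand-ins for the real Silero VAD and Mistral WebSocket connection, not the plugin's actual classes:

```python
# Illustrative simulation of flush-based endpointing (stub classes only).

class StubVAD:
    """Flags end of speech after enough consecutive silent frames."""

    def __init__(self, silence_frames: int = 3) -> None:
        self._silence_frames = silence_frames
        self._silent_run = 0

    def push_frame(self, frame: dict) -> bool:
        self._silent_run = 0 if frame["speech"] else self._silent_run + 1
        return self._silent_run == self._silence_frames  # end of speech?


class StubConnection:
    """Accumulates audio; flush_audio() finalizes the pending transcript."""

    def __init__(self) -> None:
        self.pending = ""

    def push_frame(self, frame: dict) -> None:
        if frame["speech"]:
            self.pending += frame["text"]

    def flush_audio(self) -> str:
        # The real server would emit TranscriptionStreamDone here.
        done, self.pending = self.pending, ""
        return done


def run_pipeline(frames: list) -> list:
    """Feed each frame to both the connection and the VAD; on end of
    speech, flush and emit FINAL_TRANSCRIPT then END_OF_SPEECH."""
    vad, conn, events = StubVAD(), StubConnection(), []
    for frame in frames:
        conn.push_frame(frame)
        if vad.push_frame(frame):  # VAD detected end of speech
            events.append(("FINAL_TRANSCRIPT", conn.flush_audio()))
            events.append(("END_OF_SPEECH", None))
    return events


frames = [{"speech": True, "text": "hello "}, {"speech": True, "text": "world"}]
frames += [{"speech": False, "text": ""}] * 3  # silence triggers the flush
print(run_pipeline(frames))
# → [('FINAL_TRANSCRIPT', 'hello world'), ('END_OF_SPEECH', None)]
```

The key design point the sketch mirrors is that the finalization signal comes from the client-side VAD, not from the server, so downstream consumers see the usual `FINAL_TRANSCRIPT`/`END_OF_SPEECH` ordering.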
Batch STT: `context_bias` argument

Adds a `context_bias` parameter (up to 100 words or phrases) to guide the model toward correct spelling of names, domain-specific vocabulary, or uncommon terms. Supported in both the `STT()` constructor and `update_options()`. Only compatible with batch models.

LLM: `temperature` & `max_completion_tokens` arguments
temperature&max_completion_tokensargumentsAdds
temperatureandmax_completion_tokensto theupdate_options()function.Updated models list
Refreshes the `ChatModels`, `STTModels`, and `TTSModels` type literals to reflect currently available Mistral models and adds `-latest` aliases.

Other cleanups
`NotGivenOr` usage, docstrings.