feat(mistral): support voxtral realtime streaming stt & modernize mistral plugin#5289

Merged
davidzhao merged 8 commits into livekit:main from jeanprbt:jean/feat/add-voxtral-realtime-support on Apr 7, 2026

Conversation

@jeanprbt (Contributor) commented Mar 31, 2026

Summary

Modernizes the livekit-plugins-mistralai plugin with Voxtral Realtime streaming STT support, a context_bias argument for batch STT, and an up-to-date models list.

Changes

Voxtral Realtime streaming STT

This solves issue #4754.

Adds full support for voxtral-mini-transcribe-realtime-2602, Mistral's streaming speech-to-text model. The implementation opens a persistent WebSocket connection (via mistralai's RealtimeTranscription API), streams audio frames, and emits START_OF_SPEECH, INTERIM_TRANSCRIPT, FINAL_TRANSCRIPT, and END_OF_SPEECH events.

Internal VAD for flush-based endpointing

Voxtral Realtime does not perform server-side endpointing — it never sends a TranscriptionStreamDone event on its own. Without that event, no FINAL_TRANSCRIPT is produced, the pipeline has no committed text, and end-of-turn detection never triggers.

To solve this, the plugin runs an internal Silero VAD stream alongside the Mistral WebSocket. Every audio frame is fed to both the Mistral connection and the VAD. When the VAD detects END_OF_SPEECH (default: 550 ms of silence), the plugin calls connection.flush_audio(), which forces the server to emit TranscriptionStreamDone. The plugin then emits FINAL_TRANSCRIPT followed by END_OF_SPEECH, making the Mistral STT behave like any other streaming STT with native endpointing from the pipeline's perspective.
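The flush-based endpointing described above can be sketched as a small loop that fans each frame out to both sinks and flushes on silence. The classes below are simplified stand-ins, not the actual mistralai or Silero APIs (`send_audio`, `flush_audio`, and `is_end_of_speech` are hypothetical names for illustration):

```python
class StubConnection:
    """Stand-in for the Mistral realtime WebSocket connection (hypothetical API)."""

    def __init__(self):
        self.frames = []
        self.flushed = 0

    def send_audio(self, frame):
        self.frames.append(frame)

    def flush_audio(self):
        # In the real plugin, this forces the server to emit TranscriptionStreamDone.
        self.flushed += 1


class SilenceVAD:
    """Stand-in for Silero: declares end of speech after N consecutive silent frames."""

    def __init__(self, silence_frames=3):
        self.silence_frames = silence_frames
        self._silent = 0

    def is_end_of_speech(self, frame):
        self._silent = self._silent + 1 if frame == b"\x00" else 0
        return self._silent == self.silence_frames


def run_pipeline(frames, conn, vad):
    """Every frame goes to both the connection and the VAD;
    a VAD end-of-speech event triggers flush_audio()."""
    for frame in frames:
        conn.send_audio(frame)
        if vad.is_end_of_speech(frame):
            conn.flush_audio()


conn, vad = StubConnection(), SilenceVAD(silence_frames=3)
run_pipeline([b"\x01", b"\x01", b"\x00", b"\x00", b"\x00", b"\x00"], conn, vad)
print(conn.flushed)  # 1: flushed exactly once, on the third consecutive silent frame
```

After the flush, the real plugin converts the resulting TranscriptionStreamDone into a FINAL_TRANSCRIPT followed by END_OF_SPEECH.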

Silero VAD is auto-loaded with default settings when no VAD is provided. Users can pass a custom VAD instance (e.g. with a shorter silence threshold) for finer control:

stt = mistralai.STT(
    model="voxtral-mini-transcribe-realtime-2602",
    vad=silero.VAD.load(min_silence_duration=0.3),
)

Batch STT — context_bias argument

Adds a context_bias parameter (up to 100 words or phrases) to guide the model toward the correct spelling of names, domain-specific vocabulary, or uncommon terms. Supported in both the STT() constructor and update_options(). Only compatible with batch models.
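The 100-entry limit suggests a simple guard when options are set. A minimal sketch of what such validation might look like (the helper name is hypothetical, not the plugin's actual code):

```python
def validate_context_bias(phrases):
    """Validate a context_bias list: at most 100 entries, empty strings dropped."""
    if len(phrases) > 100:
        raise ValueError(f"context_bias supports at most 100 entries, got {len(phrases)}")
    # Normalize whitespace and drop empty entries before sending them upstream.
    return [p.strip() for p in phrases if p.strip()]


print(validate_context_bias(["LiveKit", " Voxtral ", ""]))  # ['LiveKit', 'Voxtral']
```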

LLM — temperature & max_completion_tokens arguments

Adds temperature and max_completion_tokens to the update_options() function.

Updated models list

Refreshes the ChatModels, STTModels, and TTSModels type literals to reflect currently available Mistral models, and adds -latest aliases.

Other cleanups

  • Consistent constructor signatures across STT/TTS/LLM (parameter ordering, NotGivenOr usage, docstrings).
  • Updated README with usage examples for all three services including Voxtral Realtime.
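The NotGivenOr convention mentioned above is a sentinel pattern that distinguishes "argument not passed" from an explicit None. A minimal self-contained sketch of the idea (names simplified, not the actual livekit-agents types):

```python
class NotGiven:
    """Sentinel type: falsy, and distinct from None."""

    def __bool__(self):
        return False

    def __repr__(self):
        return "NOT_GIVEN"


NOT_GIVEN = NotGiven()


class Options:
    def __init__(self):
        self.temperature = 0.7

    def update(self, *, temperature=NOT_GIVEN):
        # Only overwrite when the caller actually passed a value;
        # an explicit None would still count as "given".
        if temperature is not NOT_GIVEN:
            self.temperature = temperature


opts = Options()
opts.update()                 # no argument: temperature stays 0.7
opts.update(temperature=0.2)  # explicit value: temperature updated
print(opts.temperature)  # 0.2
```

This is what lets update_options() accept only the fields the caller wants to change.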


@jeanprbt jeanprbt force-pushed the jean/feat/add-voxtral-realtime-support branch from 2f51ae9 to fc46a83 Compare March 31, 2026 16:46

@jeanprbt jeanprbt force-pushed the jean/feat/add-voxtral-realtime-support branch from fc46a83 to cd5e5eb Compare April 1, 2026 07:32

@jeanprbt jeanprbt force-pushed the jean/feat/add-voxtral-realtime-support branch from cd5e5eb to 90eae5e Compare April 1, 2026 07:59
@jeanprbt jeanprbt force-pushed the jean/feat/add-voxtral-realtime-support branch from 90eae5e to 093c9ee Compare April 2, 2026 07:50

@jeanprbt (Contributor, Author) commented Apr 2, 2026

@theomonnom or @longcw can you review this? 🙌


@zkewal commented Apr 2, 2026

@jeanprbt I tested the new Voxtral realtime STT path in both a normal voice-agent flow and a transcription-only flow.

In both cases:

  • the websocket connected successfully
  • but speaking into the mic never produced a committed user transcript
  • and in the voice-agent case, the agent never responded

One useful detail: I also did not observe any provider-side interim/delta transcript events during speech. The only provider events I saw were the session lifecycle events (session.created / session.updated), but no transcription.text.delta and no transcription.done.

So from my side, this does not currently look like only an end-of-turn/finalization issue. It looks like the end-to-end realtime STT path is not producing transcript events at all in these setups, even though the provider connection itself appears healthy.

Do you have a minimal example or a known-good setup for this PR that I can use to validate the intended behavior?


@jeanprbt (Contributor, Author) commented Apr 2, 2026

@zkewal I can't reproduce your issue. Here's a working setup on my side, based on the current state of this PR.

I added the following at line 367 in livekit-plugins-mistralai/livekit/plugins/mistralai/stt.py, so that I can see the streamed transcript in real time when running my agent in console mode.

logger.debug(event.text)

In agent.py:

import logging

from dotenv import load_dotenv

from livekit.agents import (
    Agent,
    AgentServer,
    AgentSession,
    JobContext,
    JobProcess,
    TurnHandlingOptions,
    cli,
)
from livekit.plugins import mistralai, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

logger = logging.getLogger("basic-agent")

load_dotenv()


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="""You are a helpful voice AI assistant. The user is interacting with you via voice, even if you perceive the conversation as text.
            You eagerly assist users with their questions by providing information from your extensive knowledge.
            Your responses are concise, to the point, and without any complex formatting or punctuation including emojis, asterisks, or other symbols.
            You are curious, friendly, and have a sense of humor.""",
        )


server = AgentServer()


def prewarm(proc: JobProcess) -> None:
    proc.userdata["vad"] = silero.VAD.load()


server.setup_fnc = prewarm


@server.rtc_session()
async def entrypoint(ctx: JobContext) -> None:
    ctx.log_context_fields = {
        "room": ctx.room.name,
    }
    session: AgentSession = AgentSession(
        stt=mistralai.STT(model="voxtral-mini-transcribe-realtime-2602"),
        llm=mistralai.LLM(),
        tts=mistralai.TTS(response_format="mp3"),
        vad=ctx.proc.userdata["vad"],
        turn_handling=TurnHandlingOptions(
            turn_detection=MultilingualModel(),
        ),
        preemptive_generation=True,
    )

    await session.start(
        agent=Assistant(),
        room=ctx.room,
    )


if __name__ == "__main__":
    cli.run_app(server)

If you have a Mistral API key, this agent runs fully on Mistral models. Use uv run agent.py console --log-level debug to speak directly in your terminal and see the logs. If it does not work, tell me what you get (and make sure your terminal has microphone access).

@jeanprbt jeanprbt force-pushed the jean/feat/add-voxtral-realtime-support branch from 093c9ee to bf4988e Compare April 3, 2026 13:23

@davidzhao (Member) left a comment

lgtm. good work cleaning up the other pieces too

client: Optional pre-configured MistralAI client instance.
api_key: Your Mistral AI API key. If not provided, will use the MISTRAL_API_KEY environment variable.
model: The Mistral AI model to use for transcription, default is "voxtral-mini-latest".
language: The language code to use for transcription (e.g., "fr" for French), default is "en".
@davidzhao (Member) commented:
does the model support not auto-detecting language?

@jeanprbt (Contributor, Author) replied Apr 7, 2026:

Good catch, I just updated the PR with better language handling.

Batch models

If the optional language is not specified, nothing is passed to the batch model; otherwise, it is passed along for better transcription accuracy. The SpeechEvent contains the model's detected language if present, else the optionally passed language, else an empty LanguageCode.

Streaming models

Since streaming models do not support a language argument (they always auto-detect the language), the argument passed to stream is only used as a fallback for the language attribute of INTERIM_TRANSCRIPT events, when the streaming model did not return the detected language in a TranscriptionStreamLanguage event.
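Under those rules, the language attached to an event resolves through a simple fallback chain. A minimal sketch (the function name is hypothetical, not the plugin's actual code):

```python
def resolve_language(detected, requested):
    """Pick the event language: model-detected first, then the caller's
    requested language, then an empty string as a neutral LanguageCode."""
    return detected or requested or ""


print(resolve_language(None, "fr"))  # 'fr': no detection, fall back to the request
print(resolve_language("en", "fr"))  # 'en': detection wins over the request
print(resolve_language(None, None))  # '': nothing known, empty LanguageCode
```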

@jeanprbt jeanprbt force-pushed the jean/feat/add-voxtral-realtime-support branch from dd8ad6a to 3aaab3e Compare April 7, 2026 07:42
@jeanprbt jeanprbt force-pushed the jean/feat/add-voxtral-realtime-support branch from 3aaab3e to 5901f4e Compare April 7, 2026 08:54

@jeanprbt (Contributor, Author) commented Apr 7, 2026

@theomonnom do you think you could merge this PR?

@davidzhao davidzhao merged commit 72c1abc into livekit:main Apr 7, 2026
13 checks passed
3 participants