feat(mistral): support voxtral realtime streaming stt & modernize mistral plugin #5289
Conversation
@theomonnom or @longcw can you review this? 🙌
@jeanprbt I tested the new Voxtral realtime STT path in both a normal voice-agent flow and a transcription-only flow. In both cases, the realtime STT path produced no transcript events.

One useful detail: I also did not observe any provider-side interim/delta transcript events during speech. The only provider events I saw were the session lifecycle events. So from my side, this does not currently look like only an end-of-turn/finalization issue. It looks like the end-to-end realtime STT path is not producing transcript events at all in these setups, even though the provider connection itself appears healthy. Do you have a minimal example or a known-good setup for this PR that I can use to validate the intended behavior?
@zkewal I can't manage to reproduce your issue. Here's a working setup on my side, based on this current PR. I added `logger.debug(event.text)` at line 367 in the plugin's transcript handling, and used the following agent:

```python
import logging

from dotenv import load_dotenv

from livekit.agents import (
    Agent,
    AgentServer,
    AgentSession,
    JobContext,
    JobProcess,
    TurnHandlingOptions,
    cli,
)
from livekit.plugins import mistralai, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

logger = logging.getLogger("basic-agent")

load_dotenv()


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="""You are a helpful voice AI assistant. The user is interacting with you via voice, even if you perceive the conversation as text.
            You eagerly assist users with their questions by providing information from your extensive knowledge.
            Your responses are concise, to the point, and without any complex formatting or punctuation including emojis, asterisks, or other symbols.
            You are curious, friendly, and have a sense of humor.""",
        )


server = AgentServer()


def prewarm(proc: JobProcess) -> None:
    proc.userdata["vad"] = silero.VAD.load()


server.setup_fnc = prewarm


@server.rtc_session()
async def entrypoint(ctx: JobContext) -> None:
    ctx.log_context_fields = {
        "room": ctx.room.name,
    }

    session: AgentSession = AgentSession(
        stt=mistralai.STT(model="voxtral-mini-transcribe-realtime-2602"),
        llm=mistralai.LLM(),
        tts=mistralai.TTS(response_format="mp3"),
        vad=ctx.proc.userdata["vad"],
        turn_handling=TurnHandlingOptions(
            turn_detection=MultilingualModel(),
        ),
        preemptive_generation=True,
    )

    await session.start(
        agent=Assistant(),
        room=ctx.room,
    )


if __name__ == "__main__":
    cli.run_app(server)
```

If you have a Mistral API key, this agent runs fully on Mistral models.
davidzhao left a comment:

lgtm. good work cleaning up the other pieces too
> client: Optional pre-configured MistralAI client instance.
> api_key: Your Mistral AI API key. If not provided, will use the MISTRAL_API_KEY environment variable.
> model: The Mistral AI model to use for transcription, default is "voxtral-mini-latest".
> language: The language code to use for transcription (e.g., "fr" for French), default is "en".
does the model support not auto-detecting language?
Good catch, I just updated the PR with better language handling.
Batch models
If the optional `language` is not specified, nothing is passed to the batch model; otherwise, it is passed along for better transcription accuracy. The `SpeechEvent` contains the model's detected language if available, else the optionally passed `language`, else an empty `LanguageCode`.
Streaming models
Since streaming models do not support a `language` argument (i.e. they always auto-detect the language), the argument passed to `stream()` is only used as a fallback for the `language` attribute of `INTERIM_TRANSCRIPT` events, when the streaming model did not return the detected language in a `TranscriptionStreamLanguage` event.
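The resulting preference order, for both batch and streaming events, can be sketched as a small helper (hypothetical names, not the plugin's actual code):

```python
from typing import Optional


def resolve_language(
    detected: Optional[str],   # model-detected language (e.g. from a TranscriptionStreamLanguage event)
    requested: Optional[str],  # the optional `language` argument supplied by the caller
) -> str:
    """Pick the language to attach to a SpeechEvent.

    Preference order: model-detected language, then the caller-supplied
    language, then an empty language code.
    """
    if detected:
        return detected
    if requested:
        return requested
    return ""


# Detection wins over the requested language.
print(resolve_language("fr", "en"))   # fr
# No detection: fall back to the caller-supplied language.
print(resolve_language(None, "en"))   # en
# Neither: an empty LanguageCode.
print(resolve_language(None, None))   # (empty string)
```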
@theomonnom do you think you could merge this PR?
Summary

Modernizes the `livekit-plugins-mistralai` plugin with Voxtral Realtime streaming STT support, a `context_bias` argument for batch STT, and an up-to-date models list.

Changes
Voxtral Realtime streaming STT

Adds full support for `voxtral-mini-transcribe-realtime-2602`, Mistral's streaming speech-to-text model. The implementation opens a persistent WebSocket connection (via `mistralai`'s `RealtimeTranscription` API), streams audio frames, and emits `START_OF_SPEECH`, `INTERIM_TRANSCRIPT`, `FINAL_TRANSCRIPT`, and `END_OF_SPEECH` events.

Internal VAD for flush-based endpointing
Voxtral Realtime does not perform server-side endpointing: it never sends a `TranscriptionStreamDone` event on its own. Without that event, no `FINAL_TRANSCRIPT` is produced, the pipeline has no committed text, and end-of-turn detection never triggers.

To solve this, the plugin runs an internal Silero VAD stream alongside the Mistral WebSocket. Every audio frame is fed to both the Mistral connection and the VAD. When the VAD detects `END_OF_SPEECH` (default: 550 ms of silence), the plugin calls `connection.flush_audio()`, which forces the server to emit `TranscriptionStreamDone`. The plugin then emits `FINAL_TRANSCRIPT` followed by `END_OF_SPEECH`, making the Mistral STT behave like any other streaming STT with native endpointing from the pipeline's perspective.

Silero VAD is auto-loaded with default settings when no VAD is provided. Users can pass a custom VAD instance (e.g. with a shorter silence threshold) for finer control.
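The flush-on-silence flow can be illustrated with a self-contained sketch; `StubVAD` and `StubConnection` are stand-ins for the real Silero VAD and Mistral WebSocket connection, not the plugin's actual classes:

```python
# Illustrative simulation of flush-based endpointing (stub classes only).

class StubVAD:
    """Flags end of speech after enough consecutive silent frames."""

    def __init__(self, silence_frames: int = 3) -> None:
        self._silence_frames = silence_frames
        self._silent_run = 0

    def push_frame(self, frame: dict) -> bool:
        self._silent_run = 0 if frame["speech"] else self._silent_run + 1
        return self._silent_run == self._silence_frames  # end of speech?


class StubConnection:
    """Accumulates audio; flush_audio() finalizes the pending transcript."""

    def __init__(self) -> None:
        self.pending = ""

    def push_frame(self, frame: dict) -> None:
        if frame["speech"]:
            self.pending += frame["text"]

    def flush_audio(self) -> str:
        # The real server would emit TranscriptionStreamDone here.
        done, self.pending = self.pending, ""
        return done


def run_pipeline(frames: list) -> list:
    """Feed each frame to both the connection and the VAD; on end of
    speech, flush and emit FINAL_TRANSCRIPT then END_OF_SPEECH."""
    vad, conn, events = StubVAD(), StubConnection(), []
    for frame in frames:
        conn.push_frame(frame)
        if vad.push_frame(frame):  # VAD detected end of speech
            events.append(("FINAL_TRANSCRIPT", conn.flush_audio()))
            events.append(("END_OF_SPEECH", None))
    return events


frames = [{"speech": True, "text": "hello "}, {"speech": True, "text": "world"}]
frames += [{"speech": False, "text": ""}] * 3  # silence triggers the flush
print(run_pipeline(frames))
# → [('FINAL_TRANSCRIPT', 'hello world'), ('END_OF_SPEECH', None)]
```

The key design point the sketch mirrors is that the finalization signal comes from the client-side VAD, not from the server, so downstream consumers see the usual `FINAL_TRANSCRIPT`/`END_OF_SPEECH` ordering.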
Batch STT: `context_bias` argument

Adds a `context_bias` parameter (up to 100 words or phrases) to guide the model toward correct spelling of names, domain-specific vocabulary, or uncommon terms. Supported in both the `STT()` constructor and `update_options()`. Only compatible with batch models.

LLM: `temperature` & `max_completion_tokens` arguments
temperature&max_completion_tokensargumentsAdds
temperatureandmax_completion_tokensto theupdate_options()function.Updated models list
Refreshes the `ChatModels`, `STTModels`, and `TTSModels` type literals to reflect currently available Mistral models and adds `-latest` aliases.

Other cleanups
`NotGivenOr` usage, docstrings.