Conversation
Pull request overview
Adds a new livekit-plugins-hf integration to run local HuggingFace LLM inference within the LiveKit Agents framework, with optional TurboQuant-GPU KV cache compression for faster/cheaper generation.
Changes:
- Introduces a new HuggingFace `LLM` + `LLMStream` implementation with streaming generation and an optional TurboQuant path.
- Adds packaging metadata (`pyproject.toml`), versioning, and plugin registration boilerplate.
- Adds a README documenting install and basic usage.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| livekit-plugins/livekit-plugins-hf/README.md | Documents installation and usage examples for the new plugin and TurboQuant extra. |
| livekit-plugins/livekit-plugins-hf/pyproject.toml | Defines the new Python package, dependencies, and optional TurboQuant dependency group. |
| `livekit-plugins/livekit-plugins-hf/livekit/plugins/hf/__init__.py` | Exposes public API, registers the plugin with livekit.agents.Plugin. |
| livekit-plugins/livekit-plugins-hf/livekit/plugins/hf/llm.py | Core implementation of local HF inference + streaming + TurboQuant generation and tool-call parsing. |
| livekit-plugins/livekit-plugins-hf/livekit/plugins/hf/log.py | Adds a module logger. |
| livekit-plugins/livekit-plugins-hf/livekit/plugins/hf/version.py | Adds plugin version constant for packaging/registration. |
```python
            await self._run_turboquant()
        else:
            await self._run_streamer()
    except Exception as e:
```
LLMStream._run catches all exceptions and wraps them into APIConnectionError, which will also convert asyncio.CancelledError into a connection error and prevent normal stream cancellation. Special-case CancelledError to re-raise, and avoid blanket-wrapping non-APIError exceptions (or at least preserve the original message/exception type).
Suggested change:
```diff
-except Exception as e:
+except asyncio.CancelledError:
+    raise
+except APIConnectionError:
+    raise
+except OSError as e:
```
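Expanding on the suggestion above, a minimal sketch of how the `_run` handler could look. The `use_turboquant` option name is an assumption about the surrounding code, and `APIConnectionError`/`APIError` are assumed to be importable from `livekit.agents` and to accept a message string:

```python
import asyncio

from livekit.agents import APIConnectionError, APIError


async def _run(self) -> None:
    try:
        if self._opts.use_turboquant:  # assumed option name
            await self._run_turboquant()
        else:
            await self._run_streamer()
    except asyncio.CancelledError:
        # normal stream cancellation: let it propagate untouched
        raise
    except APIError:
        # already a typed agents error: don't re-wrap it
        raise
    except Exception as e:
        # preserve the original message and chain the original exception
        raise APIConnectionError(f"local HF inference failed: {e}") from e
```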
```python
try:
    template_kwargs["tools"] = self._tool_schemas
except Exception:
    logger.warning("model tokenizer does not support tools in chat template")

input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
```
This try/except does not protect against tokenizers that don't accept a tools kwarg: the TypeError will be raised by apply_chat_template(), not by assigning into template_kwargs. Wrap the apply_chat_template() call (or retry without tools) so models without tool-template support still work when tools are provided.
Suggested change:
```diff
-try:
-    template_kwargs["tools"] = self._tool_schemas
-except Exception:
-    logger.warning("model tokenizer does not support tools in chat template")
-input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
+template_kwargs["tools"] = self._tool_schemas
+try:
+    input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
+except TypeError:
+    if "tools" not in template_kwargs:
+        raise
+    logger.warning("model tokenizer does not support tools in chat template")
+    template_kwargs = {k: v for k, v in template_kwargs.items() if k != "tools"}
+    input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
```
```python
    stop_event.set()
    raise
finally:
    await gen_future
```
On cancellation, the finally block awaits gen_future directly; if the task is cancelled, this await can be cancelled too, leaving the background generate() thread running without being awaited/cleaned up. Consider shielding the await (or cancelling/handling the executor future) to ensure cleanup completes reliably.
Suggested change:
```diff
-await gen_future
+try:
+    await asyncio.shield(gen_future)
+except asyncio.CancelledError:
+    stop_event.set()
+    await asyncio.shield(gen_future)
+    raise
```
```python
input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
input_ids = tokenizer(input_text, return_tensors="pt")["input_ids"].to(model.device)
prompt_tokens = input_ids.shape[1]
```
The TurboQuant path tokenizes only input_ids and drops attention_mask. Some HF models require an attention mask even for single prompts, and it's needed if padding is ever introduced; include attention_mask from the tokenizer and pass it into the model forward step (at least for the first prompt pass).
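A possible shape for the prompt pass, sketched under the assumption that the prompt is encoded once and fed to an initial forward call; `tokenizer`, `model`, and `input_text` are as in the quoted snippet, the rest of the names are illustrative:

```python
# Encode once, keeping both tensors the tokenizer returns.
encoded = tokenizer(input_text, return_tensors="pt").to(model.device)
input_ids = encoded["input_ids"]
attention_mask = encoded["attention_mask"]
prompt_tokens = input_ids.shape[1]

# First (prompt) pass: pass the mask explicitly so models that require it
# behave correctly, and so any future padding is handled.
output = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    use_cache=True,
)
```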
```python
if opts.temperature > 0:
    logits = logits / opts.temperature
    probs = torch.softmax(logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1).squeeze(-1)
else:
```
TurboQuant sampling applies temperature but ignores configured top_p and repetition_penalty, so outputs will differ from the standard streamer path for the same options. Consider implementing nucleus sampling and repetition penalty (or explicitly document that these options aren't supported for TurboQuant mode).
```python
name = data.get("name", "")
arguments = data.get("arguments") or data.get("parameters") or {}
if isinstance(arguments, dict):
    arguments = json.dumps(arguments)
```
_parse_tool_calls appends a FunctionToolCall even when the parsed JSON has no name (defaulting to empty string). Skip entries without a valid tool name to avoid emitting invalid tool calls downstream.
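One way to structure the guard is a small helper that _parse_tool_calls could call per parsed JSON object; the helper name is illustrative and not part of the PR, and callers would only append a FunctionToolCall when it returns a value:

```python
import json
import logging

logger = logging.getLogger(__name__)


def _extract_call(data: dict) -> tuple[str, str] | None:
    """Return (name, arguments_json) for a parsed tool-call dict, or None if unusable."""
    name = data.get("name", "")
    arguments = data.get("arguments") or data.get("parameters") or {}
    if isinstance(arguments, dict):
        arguments = json.dumps(arguments)

    if not name:
        # entries without a tool name cannot be dispatched; skip them instead
        # of emitting an invalid FunctionToolCall downstream
        logger.warning("skipping tool call without a name: %s", data)
        return None
    return name, arguments
```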
```python
if token_text is sentinel:
    break
if token_text:
    full_text += token_text
    self._event_ch.send_nowait(
```
Building full_text via repeated string concatenation in the token loop is O(n^2) and can be noticeably slower for long generations. Accumulate token strings in a list and join once at the end (and use the joined value for tool parsing / token counting).
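A sketch of the list-then-join pattern, assuming tokens arrive through an asyncio.Queue terminated by a sentinel object as the quoted loop suggests; chunk emission is elided:

```python
import asyncio

_SENTINEL = object()


async def _collect_tokens(queue: asyncio.Queue) -> str:
    """Accumulate streamed token strings in a list and join once at the end."""
    chunks: list[str] = []
    while True:
        token_text = await queue.get()
        if token_text is _SENTINEL:
            break
        if token_text:
            chunks.append(token_text)
            # ... emit the ChatChunk for token_text here, as the existing loop does ...

    # O(n) total work instead of O(n^2) repeated concatenation; reuse the joined
    # value for tool-call parsing and token counting.
    return "".join(chunks)
```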
```python
try:
    template_kwargs["tools"] = self._tool_schemas
except Exception:
    logger.warning("model tokenizer does not support tools in chat template")

input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
```
Same issue as above: adding tools to template_kwargs won't throw, but apply_chat_template() will if the tokenizer doesn't support a tools kwarg. Wrap the apply_chat_template() call (or retry without tools) to avoid failing tool-enabled chats on unsupported models.
Suggested change:
```diff
-try:
-    template_kwargs["tools"] = self._tool_schemas
-except Exception:
-    logger.warning("model tokenizer does not support tools in chat template")
-input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
+template_kwargs["tools"] = self._tool_schemas
+try:
+    input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
+except TypeError:
+    if "tools" not in template_kwargs:
+        raise
+    logger.warning("model tokenizer does not support tools in chat template")
+    template_kwargs = {k: v for k, v in template_kwargs.items() if k != "tools"}
+    input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
```
```python
if self._tool_schemas:
    try:
        template_kwargs["tools"] = self._tool_schemas
    except Exception:
        logger.warning("model tokenizer does not support tools in chat template")
```
🔴 try/except wraps a dict assignment that can never fail, leaving tool-template errors unhandled
The try/except on lines 234-238 wraps template_kwargs["tools"] = self._tool_schemas, a simple dict key assignment that can never raise an exception. The intent was clearly to catch errors from models whose chat template doesn't support the tools parameter, but the actual call that can fail, tokenizer.apply_chat_template(self._messages, **template_kwargs) on line 240, is outside the try/except. As a result, if a user passes tools to a model whose template doesn't support them, the exception propagates unhandled (eventually wrapped as APIConnectionError by the outer handler in _run), and the warning on line 238 is never logged. The identical dead-code pattern exists in _run_turboquant at lines 341-345.
Prompt for agents
The try/except block at lines 234-238 (and the identical block at 341-345 in _run_turboquant) wraps a dict assignment that cannot fail. The intent is to gracefully handle models whose chat template does not support the tools parameter. The fix should move the try/except to wrap the tokenizer.apply_chat_template call instead. For example, in _run_streamer around line 240, and in _run_turboquant around line 347, the apply_chat_template call should be wrapped: first try calling it with tools in template_kwargs, and if that raises an exception (e.g. Jinja TemplateError), log the warning and retry apply_chat_template without the tools key. Apply this fix in both _run_streamer and _run_turboquant methods.
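Since the retry is needed in both _run_streamer and _run_turboquant, one option is to factor it into a small shared helper; a sketch, with the helper name being illustrative rather than part of the PR:

```python
def _apply_chat_template(self, tokenizer, template_kwargs: dict) -> str:
    """Apply the chat template, retrying without `tools` if the template rejects it."""
    try:
        return tokenizer.apply_chat_template(self._messages, **template_kwargs)
    except Exception:
        if "tools" not in template_kwargs:
            raise
        logger.warning("model tokenizer does not support tools in chat template")
        retry_kwargs = {k: v for k, v in template_kwargs.items() if k != "tools"}
        return tokenizer.apply_chat_template(self._messages, **retry_kwargs)
```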
```python
# Sample next token
logits = output.logits[:, -1, :]
if opts.temperature > 0:
    logits = logits / opts.temperature
    probs = torch.softmax(logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1).squeeze(-1)
else:
    next_token = logits.argmax(dim=-1)
```
🔴 TurboQuant generation path ignores top_p and repetition_penalty settings
In _forward_step (lines 364-371), the manual token sampling only applies temperature scaling to logits but never applies top_p (nucleus sampling) filtering or repetition_penalty. In contrast, the standard _run_streamer path passes both top_p (line 266) and repetition_penalty (line 261) to HF's generate(), which handles them internally. This means users who enable turboquant get silently different sampling behavior: top_p and repetition_penalty parameters are accepted but have no effect, leading to lower-quality or unexpected generation output.
Prompt for agents
In _forward_step (around lines 364-371 of llm.py), the manual sampling logic only applies temperature. It needs to also implement top_p (nucleus sampling) and repetition_penalty to match the behavior of the standard _run_streamer path. For top_p: after applying temperature and computing softmax probabilities, sort tokens by probability descending, compute the cumulative sum, and zero out tokens whose cumulative probability exceeds opts.top_p before sampling with torch.multinomial. For repetition_penalty: before temperature scaling, apply the repetition penalty to logits of previously generated tokens (tokens in the input_ids / generated so far). This requires tracking previously generated token IDs across steps, which may mean passing an additional parameter to _forward_step or using a closure variable.
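A hedged sketch of what the sampling step could look like, following the approach described above (repetition penalty in the style of HF's RepetitionPenaltyLogitsProcessor, then temperature, then nucleus filtering); the helper name and the exact tensors handed in are assumptions about the surrounding _forward_step code:

```python
import torch


def _sample_next_token(
    logits: torch.Tensor,         # [batch, vocab] logits at the last position
    generated_ids: torch.Tensor,  # [batch, seq] prompt + previously generated token ids
    temperature: float,
    top_p: float,
    repetition_penalty: float,
) -> torch.Tensor:
    # Repetition penalty: divide positive logits of already-seen tokens by the
    # penalty and multiply negative ones, as HF's processor does.
    if repetition_penalty != 1.0:
        seen = torch.gather(logits, 1, generated_ids)
        seen = torch.where(seen > 0, seen / repetition_penalty, seen * repetition_penalty)
        logits = logits.scatter(1, generated_ids, seen)

    if temperature <= 0:
        return logits.argmax(dim=-1)

    logits = logits / temperature
    probs = torch.softmax(logits, dim=-1)

    # Nucleus (top-p) filtering: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, zero out the rest, renormalize, then sample.
    if top_p < 1.0:
        sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        remove = cumulative - sorted_probs > top_p
        sorted_probs = sorted_probs.masked_fill(remove, 0.0)
        probs = torch.zeros_like(probs).scatter(1, sorted_idx, sorted_probs)
        probs = probs / probs.sum(dim=-1, keepdim=True)

    return torch.multinomial(probs, num_samples=1).squeeze(-1)
```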
Thanks! Can you address the Devin PR reviews? Also, do you think there is a way to avoid downloading the model at runtime? (vs the …
What does this PR do?
#5351