
Feat: hf llm plugin #5352

Open
mahimairaja wants to merge 2 commits into livekit:main from mahimairaja:feat/hf-llm-plugin

Conversation

@mahimairaja
Contributor

@mahimairaja mahimairaja commented Apr 6, 2026

What does this PR do?

  • Adds a new plugin to load any LLM from Hugging Face models
  • Adds TurboQuant optimization for local models

#5351

Copilot AI review requested due to automatic review settings April 6, 2026 16:17

Copilot AI left a comment


Pull request overview

Adds a new livekit-plugins-hf integration to run local HuggingFace LLM inference within the LiveKit Agents framework, with optional TurboQuant-GPU KV cache compression for faster/cheaper generation.

Changes:

  • Introduces a new HuggingFace LLM + LLMStream implementation with streaming generation and optional TurboQuant path.
  • Adds packaging metadata (pyproject.toml), versioning, and plugin registration boilerplate.
  • Adds a README documenting install and basic usage.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 8 comments.

File Description
livekit-plugins/livekit-plugins-hf/README.md Documents installation and usage examples for the new plugin and TurboQuant extra.
livekit-plugins/livekit-plugins-hf/pyproject.toml Defines the new Python package, dependencies, and optional TurboQuant dependency group.
livekit-plugins/livekit-plugins-hf/livekit/plugins/hf/__init__.py Exposes public API, registers the plugin with livekit.agents.Plugin.
livekit-plugins/livekit-plugins-hf/livekit/plugins/hf/llm.py Core implementation of local HF inference + streaming + TurboQuant generation and tool-call parsing.
livekit-plugins/livekit-plugins-hf/livekit/plugins/hf/log.py Adds a module logger.
livekit-plugins/livekit-plugins-hf/livekit/plugins/hf/version.py Adds plugin version constant for packaging/registration.


        await self._run_turboquant()
    else:
        await self._run_streamer()
except Exception as e:

Copilot AI Apr 6, 2026


LLMStream._run catches all exceptions and wraps them into APIConnectionError, which will also convert asyncio.CancelledError into a connection error and prevent normal stream cancellation. Special-case CancelledError to re-raise, and avoid blanket-wrapping non-APIError exceptions (or at least preserve the original message/exception type).

Suggested change
- except Exception as e:
+ except asyncio.CancelledError:
+     raise
+ except APIConnectionError:
+     raise
+ except OSError as e:

Comment on lines +235 to +240
try:
    template_kwargs["tools"] = self._tool_schemas
except Exception:
    logger.warning("model tokenizer does not support tools in chat template")

input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)

Copilot AI Apr 6, 2026


This try/except does not protect against tokenizers that don’t accept a tools kwarg: the TypeError will be raised by apply_chat_template(), not by assigning into template_kwargs. Wrap the apply_chat_template() call (or retry without tools) so models without tool-template support still work when tools are provided.

Suggested change
- try:
-     template_kwargs["tools"] = self._tool_schemas
- except Exception:
-     logger.warning("model tokenizer does not support tools in chat template")
- input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
+ template_kwargs["tools"] = self._tool_schemas
+ try:
+     input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
+ except TypeError:
+     if "tools" not in template_kwargs:
+         raise
+     logger.warning("model tokenizer does not support tools in chat template")
+     template_kwargs = {k: v for k, v in template_kwargs.items() if k != "tools"}
+     input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)

    stop_event.set()
    raise
finally:
    await gen_future

Copilot AI Apr 6, 2026


On cancellation, the finally block awaits gen_future directly; if the task is cancelled, this await can be cancelled too, leaving the background generate() thread running without being awaited/cleaned up. Consider shielding the await (or cancelling/handling the executor future) to ensure cleanup completes reliably.

Suggested change
- await gen_future
+ try:
+     await asyncio.shield(gen_future)
+ except asyncio.CancelledError:
+     stop_event.set()
+     await asyncio.shield(gen_future)
+     raise

Comment on lines +347 to +349
input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
input_ids = tokenizer(input_text, return_tensors="pt")["input_ids"].to(model.device)
prompt_tokens = input_ids.shape[1]

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TurboQuant path tokenizes only input_ids and drops attention_mask. Some HF models require an attention mask even for single prompts, and it’s needed if padding is ever introduced; include attention_mask from the tokenizer and pass it into the model forward step (at least for the first prompt pass).

Comment on lines +366 to +370
if opts.temperature > 0:
    logits = logits / opts.temperature
    probs = torch.softmax(logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1).squeeze(-1)
else:

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TurboQuant sampling applies temperature but ignores configured top_p and repetition_penalty, so outputs will differ from the standard streamer path for the same options. Consider implementing nucleus sampling and repetition penalty (or explicitly document that these options aren’t supported for TurboQuant mode).
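The missing nucleus-sampling step could look like the sketch below. For readability it operates on plain Python probability lists rather than torch tensors (the plugin's TurboQuant path would apply the same idea to the `probs` tensor before `torch.multinomial`); `nucleus_filter` is an illustrative helper, not part of the plugin, and `top_p` follows the usual convention of keeping the smallest highest-probability set whose cumulative mass reaches the threshold.

```python
def nucleus_filter(probs: list[float], top_p: float) -> list[float]:
    """Zero out tokens outside the smallest set of highest-probability
    tokens whose cumulative mass reaches top_p, then renormalize.

    Plain-Python sketch of nucleus (top-p) sampling.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: set[int] = set()
    cumulative = 0.0
    for i in order:
        kept.add(i)  # the top token is always kept
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]
```

Sampling would then draw from the renormalized distribution instead of the raw softmax output.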

Comment on lines +445 to +449
name = data.get("name", "")
arguments = data.get("arguments") or data.get("parameters") or {}
if isinstance(arguments, dict):
    arguments = json.dumps(arguments)

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_parse_tool_calls appends a FunctionToolCall even when the parsed JSON has no name (defaulting to empty string). Skip entries without a valid tool name to avoid emitting invalid tool calls downstream.
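A minimal sketch of the suggested guard, assuming the parse loop shown in the snippet; `parse_tool_call` and its `(name, arguments)` return shape are illustrative, not the plugin's actual API.

```python
import json


def parse_tool_call(data: dict):
    """Return (name, arguments-as-JSON) or None when the entry
    has no usable tool name, so invalid calls are never emitted."""
    name = data.get("name", "")
    if not isinstance(name, str) or not name.strip():
        return None  # skip entries without a valid tool name
    arguments = data.get("arguments") or data.get("parameters") or {}
    if isinstance(arguments, dict):
        arguments = json.dumps(arguments)
    return name, arguments
```

The caller would only append a FunctionToolCall when the helper returns a non-None result.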

Comment on lines +288 to +292
if token_text is sentinel:
    break
if token_text:
    full_text += token_text
    self._event_ch.send_nowait(

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Building full_text via repeated string concatenation in the token loop is O(n^2) and can be noticeably slower for long generations. Accumulate token strings in a list and join once at the end (and use the joined value for tool parsing / token counting).
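The accumulate-and-join pattern could look like this sketch; the channel/sentinel mechanics of the real stream are replaced by a plain iterator, and `collect_tokens` is an illustrative helper, not part of the plugin.

```python
def collect_tokens(token_iter, sentinel=None):
    """Accumulate streamed token strings in a list and join once
    at the end, instead of O(n^2) string concatenation in the loop."""
    parts: list[str] = []
    for token_text in token_iter:
        if token_text is sentinel:
            break
        if token_text:
            parts.append(token_text)  # amortized O(1) per token
    return "".join(parts)  # single O(n) join for tool parsing / token counting
```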

Comment on lines +342 to +347
try:
    template_kwargs["tools"] = self._tool_schemas
except Exception:
    logger.warning("model tokenizer does not support tools in chat template")

input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as above: adding tools to template_kwargs won’t throw, but apply_chat_template() will if the tokenizer doesn’t support a tools kwarg. Wrap the apply_chat_template() call (or retry without tools) to avoid failing tool-enabled chats on unsupported models.

Suggested change
- try:
-     template_kwargs["tools"] = self._tool_schemas
- except Exception:
-     logger.warning("model tokenizer does not support tools in chat template")
- input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
+ template_kwargs["tools"] = self._tool_schemas
+ try:
+     input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
+ except TypeError:
+     if "tools" not in template_kwargs:
+         raise
+     logger.warning("model tokenizer does not support tools in chat template")
+     template_kwargs = {k: v for k, v in template_kwargs.items() if k != "tools"}
+     input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)

Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 2 potential issues.

View 5 additional findings in Devin Review.


Comment on lines +234 to +238
if self._tool_schemas:
    try:
        template_kwargs["tools"] = self._tool_schemas
    except Exception:
        logger.warning("model tokenizer does not support tools in chat template")


πŸ”΄ try/except wraps a dict assignment that can never fail, leaving tool-template errors unhandled

The try/except on lines 234-238 wraps template_kwargs["tools"] = self._tool_schemas, a simple dict key assignment that can never raise an exception. The intent was clearly to catch errors from models whose chat template doesn't support the tools parameter, but the actual call that can fail β€” tokenizer.apply_chat_template(self._messages, **template_kwargs) on line 240 β€” is outside the try/except. As a result, if a user passes tools to a model whose template doesn't support them, the exception propagates unhandled (eventually wrapped as APIConnectionError by the outer handler in _run), and the warning on line 238 is never logged. The identical dead-code pattern exists in _run_turboquant at lines 341-345.

Prompt for agents
The try/except block at lines 234-238 (and the identical block at 341-345 in _run_turboquant) wraps a dict assignment that cannot fail. The intent is to gracefully handle models whose chat template does not support the tools parameter. The fix should move the try/except to wrap the tokenizer.apply_chat_template call instead. For example, in _run_streamer around line 240, and in _run_turboquant around line 347, the apply_chat_template call should be wrapped: first try calling it with tools in template_kwargs, and if that raises an exception (e.g. Jinja TemplateError), log the warning and retry apply_chat_template without the tools key. Apply this fix in both _run_streamer and _run_turboquant methods.
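The retry-without-tools flow the prompt describes can be sketched as below; `safe_apply_chat_template` is a hypothetical helper, and the broad `except Exception` reflects that a failing template may raise a TypeError or a Jinja TemplateError depending on the tokenizer.

```python
def safe_apply_chat_template(apply_chat_template, messages, template_kwargs,
                             warn=lambda msg: None):
    """Render the chat template with tools; if the template rejects
    them, warn and retry without the 'tools' key."""
    try:
        return apply_chat_template(messages, **template_kwargs)
    except Exception:
        if "tools" not in template_kwargs:
            raise  # failure was unrelated to tools support
        warn("model tokenizer does not support tools in chat template")
        retry_kwargs = {k: v for k, v in template_kwargs.items() if k != "tools"}
        return apply_chat_template(messages, **retry_kwargs)
```

The same helper could serve both _run_streamer and _run_turboquant, replacing the two dead try/except blocks.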

Comment on lines +364 to +371
# Sample next token
logits = output.logits[:, -1, :]
if opts.temperature > 0:
    logits = logits / opts.temperature
    probs = torch.softmax(logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1).squeeze(-1)
else:
    next_token = logits.argmax(dim=-1)


πŸ”΄ TurboQuant generation path ignores top_p and repetition_penalty settings

In _forward_step (lines 364-371), the manual token sampling only applies temperature scaling to logits but never applies top_p (nucleus sampling) filtering or repetition_penalty. In contrast, the standard _run_streamer path passes both top_p (line 266) and repetition_penalty (line 261) to HF's generate(), which handles them internally. This means users who enable turboquant get silently different sampling behavior β€” top_p and repetition_penalty parameters are accepted but have no effect, leading to lower-quality or unexpected generation output.

Prompt for agents
In _forward_step (around lines 364-371 of llm.py), the manual sampling logic only applies temperature. It needs to also implement top_p (nucleus sampling) and repetition_penalty to match the behavior of the standard _run_streamer path. For top_p: after applying temperature and computing softmax probabilities, sort tokens by probability descending, compute the cumulative sum, and zero out tokens whose cumulative probability exceeds opts.top_p before sampling with torch.multinomial. For repetition_penalty: before temperature scaling, apply the repetition penalty to logits of previously generated tokens (tokens in the input_ids / generated so far). This requires tracking previously generated token IDs across steps, which may mean passing an additional parameter to _forward_step or using a closure variable.
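The repetition-penalty half of that fix can be sketched in plain Python over a logits list (the plugin would do the same on torch tensors); the divide-positive/multiply-negative rule follows the convention used by HF's RepetitionPenaltyLogitsProcessor, and `apply_repetition_penalty` is an illustrative name.

```python
def apply_repetition_penalty(logits: list[float], seen_token_ids,
                             penalty: float) -> list[float]:
    """Discourage already-generated tokens before temperature scaling:
    positive logits are divided by the penalty, negative ones multiplied,
    so both move toward lower probability."""
    out = list(logits)
    for tok in set(seen_token_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out
```

Tracking `seen_token_ids` across steps is the extra state _forward_step would need, as the prompt notes.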

@theomonnom
Member

theomonnom commented Apr 6, 2026

Thanks! Can you address the Devin PR reviews?

Also do you think there is a way to avoid downloading the model at runtime? (vs the download-files CLI)

