Conversation
Pull request overview
Adds a new livekit-plugins-hf integration to run local HuggingFace LLM inference within the LiveKit Agents framework, with optional TurboQuant-GPU KV cache compression for faster/cheaper generation.
Changes:
- Introduces a new HuggingFace `LLM` + `LLMStream` implementation with streaming generation and an optional TurboQuant path.
- Adds packaging metadata (`pyproject.toml`), versioning, and plugin registration boilerplate.
- Adds a README documenting install and basic usage.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| livekit-plugins/livekit-plugins-hf/README.md | Documents installation and usage examples for the new plugin and TurboQuant extra. |
| livekit-plugins/livekit-plugins-hf/pyproject.toml | Defines the new Python package, dependencies, and optional TurboQuant dependency group. |
| `livekit-plugins/livekit-plugins-hf/livekit/plugins/hf/__init__.py` | Exposes public API, registers the plugin with livekit.agents.Plugin. |
| livekit-plugins/livekit-plugins-hf/livekit/plugins/hf/llm.py | Core implementation of local HF inference + streaming + TurboQuant generation and tool-call parsing. |
| livekit-plugins/livekit-plugins-hf/livekit/plugins/hf/log.py | Adds a module logger. |
| livekit-plugins/livekit-plugins-hf/livekit/plugins/hf/version.py | Adds plugin version constant for packaging/registration. |
```python
            await self._run_turboquant()
        else:
            await self._run_streamer()
    except Exception as e:
```
LLMStream._run catches all exceptions and wraps them into APIConnectionError, which will also convert asyncio.CancelledError into a connection error and prevent normal stream cancellation. Special-case CancelledError to re-raise, and avoid blanket-wrapping non-APIError exceptions (or at least preserve the original message/exception type).
Suggested change:
```diff
-except Exception as e:
+except asyncio.CancelledError:
+    raise
+except APIConnectionError:
+    raise
+except OSError as e:
```
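Expanding on the suggestion above, a minimal sketch of how the `_run` handler could look. The `use_turboquant` option name is an assumption about the surrounding code, and `APIConnectionError`/`APIError` are assumed to be importable from `livekit.agents` and to accept a message string:

```python
import asyncio

from livekit.agents import APIConnectionError, APIError


async def _run(self) -> None:
    try:
        if self._opts.use_turboquant:  # assumed option name
            await self._run_turboquant()
        else:
            await self._run_streamer()
    except asyncio.CancelledError:
        # normal stream cancellation: let it propagate untouched
        raise
    except APIError:
        # already a typed agents error: don't re-wrap it
        raise
    except Exception as e:
        # preserve the original message and chain the original exception
        raise APIConnectionError(f"local HF inference failed: {e}") from e
```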
```python
try:
    template_kwargs["tools"] = self._tool_schemas
except Exception:
    logger.warning("model tokenizer does not support tools in chat template")

input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
```
This try/except does not protect against tokenizers that don't accept a tools kwarg: the TypeError will be raised by apply_chat_template(), not by assigning into template_kwargs. Wrap the apply_chat_template() call (or retry without tools) so models without tool-template support still work when tools are provided.
Suggested change:
```diff
-try:
-    template_kwargs["tools"] = self._tool_schemas
-except Exception:
-    logger.warning("model tokenizer does not support tools in chat template")
-input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
+template_kwargs["tools"] = self._tool_schemas
+try:
+    input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
+except TypeError:
+    if "tools" not in template_kwargs:
+        raise
+    logger.warning("model tokenizer does not support tools in chat template")
+    template_kwargs = {k: v for k, v in template_kwargs.items() if k != "tools"}
+    input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
```
```python
    stop_event.set()
    raise
finally:
    await gen_future
```
On cancellation, the finally block awaits gen_future directly; if the task is cancelled, this await can be cancelled too, leaving the background generate() thread running without being awaited/cleaned up. Consider shielding the await (or cancelling/handling the executor future) to ensure cleanup completes reliably.
Suggested change:
```diff
-await gen_future
+try:
+    await asyncio.shield(gen_future)
+except asyncio.CancelledError:
+    stop_event.set()
+    await asyncio.shield(gen_future)
+    raise
```
```python
input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
input_ids = tokenizer(input_text, return_tensors="pt")["input_ids"].to(model.device)
prompt_tokens = input_ids.shape[1]
```
The TurboQuant path tokenizes only input_ids and drops attention_mask. Some HF models require an attention mask even for single prompts, and it's needed if padding is ever introduced; include attention_mask from the tokenizer and pass it into the model forward step (at least for the first prompt pass).
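A possible shape for the prompt pass, sketched under the assumption that the prompt is encoded once and fed to an initial forward call; `tokenizer`, `model`, and `input_text` are as in the quoted snippet, the rest of the names are illustrative:

```python
# Encode once, keeping both tensors the tokenizer returns.
encoded = tokenizer(input_text, return_tensors="pt").to(model.device)
input_ids = encoded["input_ids"]
attention_mask = encoded["attention_mask"]
prompt_tokens = input_ids.shape[1]

# First (prompt) pass: pass the mask explicitly so models that require it
# behave correctly, and so any future padding is handled.
output = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    use_cache=True,
)
```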
```python
if opts.temperature > 0:
    logits = logits / opts.temperature
    probs = torch.softmax(logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1).squeeze(-1)
else:
```
TurboQuant sampling applies temperature but ignores configured top_p and repetition_penalty, so outputs will differ from the standard streamer path for the same options. Consider implementing nucleus sampling and repetition penalty (or explicitly document that these options aren't supported for TurboQuant mode).
```python
name = data.get("name", "")
arguments = data.get("arguments") or data.get("parameters") or {}
if isinstance(arguments, dict):
    arguments = json.dumps(arguments)
```
_parse_tool_calls appends a FunctionToolCall even when the parsed JSON has no name (defaulting to empty string). Skip entries without a valid tool name to avoid emitting invalid tool calls downstream.
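One way to structure the guard is a small helper that _parse_tool_calls could call per parsed JSON object; the helper name is illustrative and not part of the PR, and callers would only append a FunctionToolCall when it returns a value:

```python
import json
import logging

logger = logging.getLogger(__name__)


def _extract_call(data: dict) -> tuple[str, str] | None:
    """Return (name, arguments_json) for a parsed tool-call dict, or None if unusable."""
    name = data.get("name", "")
    arguments = data.get("arguments") or data.get("parameters") or {}
    if isinstance(arguments, dict):
        arguments = json.dumps(arguments)

    if not name:
        # entries without a tool name cannot be dispatched; skip them instead
        # of emitting an invalid FunctionToolCall downstream
        logger.warning("skipping tool call without a name: %s", data)
        return None
    return name, arguments
```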
```python
if token_text is sentinel:
    break
if token_text:
    full_text += token_text
    self._event_ch.send_nowait(
```
Building full_text via repeated string concatenation in the token loop is O(n^2) and can be noticeably slower for long generations. Accumulate token strings in a list and join once at the end (and use the joined value for tool parsing / token counting).
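A sketch of the list-then-join pattern, assuming tokens arrive through an asyncio.Queue terminated by a sentinel object as the quoted loop suggests; chunk emission is elided:

```python
import asyncio

_SENTINEL = object()


async def _collect_tokens(queue: asyncio.Queue) -> str:
    """Accumulate streamed token strings in a list and join once at the end."""
    chunks: list[str] = []
    while True:
        token_text = await queue.get()
        if token_text is _SENTINEL:
            break
        if token_text:
            chunks.append(token_text)
            # ... emit the ChatChunk for token_text here, as the existing loop does ...

    # O(n) total work instead of O(n^2) repeated concatenation; reuse the joined
    # value for tool-call parsing and token counting.
    return "".join(chunks)
```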
```python
try:
    template_kwargs["tools"] = self._tool_schemas
except Exception:
    logger.warning("model tokenizer does not support tools in chat template")

input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
```
Same issue as above: adding tools to template_kwargs won't throw, but apply_chat_template() will if the tokenizer doesn't support a tools kwarg. Wrap the apply_chat_template() call (or retry without tools) to avoid failing tool-enabled chats on unsupported models.
Suggested change:
```diff
-try:
-    template_kwargs["tools"] = self._tool_schemas
-except Exception:
-    logger.warning("model tokenizer does not support tools in chat template")
-input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
+template_kwargs["tools"] = self._tool_schemas
+try:
+    input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
+except TypeError:
+    if "tools" not in template_kwargs:
+        raise
+    logger.warning("model tokenizer does not support tools in chat template")
+    template_kwargs = {k: v for k, v in template_kwargs.items() if k != "tools"}
+    input_text = tokenizer.apply_chat_template(self._messages, **template_kwargs)
```
```python
if self._tool_schemas:
    try:
        template_kwargs["tools"] = self._tool_schemas
    except Exception:
        logger.warning("model tokenizer does not support tools in chat template")
```
🔴 try/except wraps a dict assignment that can never fail, leaving tool-template errors unhandled
The try/except on lines 234-238 wraps template_kwargs["tools"] = self._tool_schemas, a simple dict key assignment that can never raise an exception. The intent was clearly to catch errors from models whose chat template doesn't support the tools parameter, but the actual call that can fail, tokenizer.apply_chat_template(self._messages, **template_kwargs) on line 240, is outside the try/except. As a result, if a user passes tools to a model whose template doesn't support them, the exception propagates unhandled (eventually wrapped as APIConnectionError by the outer handler in _run), and the warning on line 238 is never logged. The identical dead-code pattern exists in _run_turboquant at lines 341-345.
Prompt for agents
The try/except block at lines 234-238 (and the identical block at 341-345 in _run_turboquant) wraps a dict assignment that cannot fail. The intent is to gracefully handle models whose chat template does not support the tools parameter. The fix should move the try/except to wrap the tokenizer.apply_chat_template call instead. For example, in _run_streamer around line 240, and in _run_turboquant around line 347, the apply_chat_template call should be wrapped: first try calling it with tools in template_kwargs, and if that raises an exception (e.g. Jinja TemplateError), log the warning and retry apply_chat_template without the tools key. Apply this fix in both _run_streamer and _run_turboquant methods.
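Since the retry is needed in both _run_streamer and _run_turboquant, one option is to factor it into a small shared helper; a sketch, with the helper name being illustrative rather than part of the PR:

```python
def _apply_chat_template(self, tokenizer, template_kwargs: dict) -> str:
    """Apply the chat template, retrying without `tools` if the template rejects it."""
    try:
        return tokenizer.apply_chat_template(self._messages, **template_kwargs)
    except Exception:
        if "tools" not in template_kwargs:
            raise
        logger.warning("model tokenizer does not support tools in chat template")
        retry_kwargs = {k: v for k, v in template_kwargs.items() if k != "tools"}
        return tokenizer.apply_chat_template(self._messages, **retry_kwargs)
```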
```python
# Sample next token
logits = output.logits[:, -1, :]
if opts.temperature > 0:
    logits = logits / opts.temperature
    probs = torch.softmax(logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1).squeeze(-1)
else:
    next_token = logits.argmax(dim=-1)
```
🔴 TurboQuant generation path ignores top_p and repetition_penalty settings
In _forward_step (lines 364-371), the manual token sampling only applies temperature scaling to logits but never applies top_p (nucleus sampling) filtering or repetition_penalty. In contrast, the standard _run_streamer path passes both top_p (line 266) and repetition_penalty (line 261) to HF's generate(), which handles them internally. This means users who enable turboquant get silently different sampling behavior: top_p and repetition_penalty parameters are accepted but have no effect, leading to lower-quality or unexpected generation output.
Prompt for agents
In _forward_step (around lines 364-371 of llm.py), the manual sampling logic only applies temperature. It needs to also implement top_p (nucleus sampling) and repetition_penalty to match the behavior of the standard _run_streamer path. For top_p: after applying temperature and computing softmax probabilities, sort tokens by probability descending, compute the cumulative sum, and zero out tokens whose cumulative probability exceeds opts.top_p before sampling with torch.multinomial. For repetition_penalty: before temperature scaling, apply the repetition penalty to logits of previously generated tokens (tokens in the input_ids / generated so far). This requires tracking previously generated token IDs across steps, which may mean passing an additional parameter to _forward_step or using a closure variable.
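A hedged sketch of what the sampling step could look like, following the approach described above (repetition penalty in the style of HF's RepetitionPenaltyLogitsProcessor, then temperature, then nucleus filtering); the helper name and the exact tensors handed in are assumptions about the surrounding _forward_step code:

```python
import torch


def _sample_next_token(
    logits: torch.Tensor,         # [batch, vocab] logits at the last position
    generated_ids: torch.Tensor,  # [batch, seq] prompt + previously generated token ids
    temperature: float,
    top_p: float,
    repetition_penalty: float,
) -> torch.Tensor:
    # Repetition penalty: divide positive logits of already-seen tokens by the
    # penalty and multiply negative ones, as HF's processor does.
    if repetition_penalty != 1.0:
        seen = torch.gather(logits, 1, generated_ids)
        seen = torch.where(seen > 0, seen / repetition_penalty, seen * repetition_penalty)
        logits = logits.scatter(1, generated_ids, seen)

    if temperature <= 0:
        return logits.argmax(dim=-1)

    logits = logits / temperature
    probs = torch.softmax(logits, dim=-1)

    # Nucleus (top-p) filtering: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, zero out the rest, renormalize, then sample.
    if top_p < 1.0:
        sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        remove = cumulative - sorted_probs > top_p
        sorted_probs = sorted_probs.masked_fill(remove, 0.0)
        probs = torch.zeros_like(probs).scatter(1, sorted_idx, sorted_probs)
        probs = probs / probs.sum(dim=-1, keepdim=True)

    return torch.multinomial(probs, num_samples=1).squeeze(-1)
```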
Thanks! Can you address the Devin PR reviews? Also, do you think there is a way to avoid downloading the model at runtime? (vs the …
What does this PR do?
#5351