
feat: add experimental runtime LoRA delta injection (ENABLE_LORA_RUNTIME) #2046

Open
wonsup-shin wants to merge 1 commit into OpenNMT:master from wonsup-shin:feat/lora-runtime-delta

@wonsup-shin

This adds Model::apply_lora_delta(name, lora_A, lora_B, scale), which writes
W' = W + scale * (lora_B @ lora_A) in-place into an existing weight buffer.
The motivation is runtime adapter swapping without reloading the full model.
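
As a reference for the arithmetic only, here is a minimal pure-Python sketch of the same update. It assumes the common LoRA layout (W is out_dim x in_dim, lora_B is out_dim x rank, lora_A is rank x in_dim); the shapes the C++ method actually expects are not stated above, and the nested-list representation is purely illustrative.

```python
def apply_lora_delta(W, lora_A, lora_B, scale):
    """Update W in place: W[i][j] += scale * sum_k lora_B[i][k] * lora_A[k][j]."""
    rank = len(lora_A)
    out_dim, in_dim = len(W), len(W[0])
    # Assumed shapes: lora_B is (out_dim, rank), lora_A is (rank, in_dim).
    if len(lora_B) != out_dim or any(len(row) != rank for row in lora_B):
        raise ValueError("lora_B must have shape (out_dim, rank)")
    if any(len(row) != in_dim for row in lora_A):
        raise ValueError("lora_A must have shape (rank, in_dim)")
    for i in range(out_dim):
        for j in range(in_dim):
            delta = sum(lora_B[i][k] * lora_A[k][j] for k in range(rank))
            W[i][j] += scale * delta  # written back into the existing buffer


W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]         # 2 x 3 weight
lora_A = [[1.0, 2.0, 3.0]]    # rank 1: 1 x 3
lora_B = [[1.0], [2.0]]       # 2 x 1
apply_lora_delta(W, lora_A, lora_B, scale=0.5)
print(W)  # [[1.5, 1.0, 1.5], [1.0, 3.0, 3.0]]
```

No new weight buffer is allocated: the list object passed in as W is the one that ends up holding W'.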

The feature is disabled by default and requires -DENABLE_LORA_RUNTIME=ON.

Changes:

  • CMakeLists.txt: add ENABLE_LORA_RUNTIME option
  • include/ctranslate2/models/model.h: declare apply_lora_delta (ifdef-guarded)
  • src/models/model.cc: implement with float32/float16/bfloat16 dispatch
  • python/cpp/whisper.cc: expose on WhisperWrapper with inference-lock guard
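
The idea behind the inference-lock guard can be sketched as follows. This is an illustration of mutual exclusion between inference and weight mutation, not the code in python/cpp/whisper.cc; `GuardedModel`, `generate`, and `apply_delta` are hypothetical names, and the real binding may use a different lock type.

```python
import threading


class GuardedModel:
    """Hypothetical wrapper; not the WhisperWrapper API."""

    def __init__(self, weights):
        self._weights = weights          # dict: variable name -> list of floats
        self._lock = threading.Lock()    # shared by inference and weight updates

    def generate(self, name):
        # Stand-in for inference: reads the weights while holding the lock.
        with self._lock:
            return sum(self._weights[name])

    def apply_delta(self, name, delta):
        # Mutates the weights in place; cannot overlap with generate().
        with self._lock:
            w = self._weights[name]
            for i, d in enumerate(delta):
                w[i] += d


model = GuardedModel({"proj": [1.0, 2.0]})
model.apply_delta("proj", [0.5, 0.5])
result = model.generate("proj")
print(result)  # 4.0
```

Holding one lock on both paths means a forward pass never observes a half-updated weight matrix.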

Scope is intentionally narrow: this is a single-variable primitive only.
Adapter pool management and revert logic are the caller's responsibility.
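
One way a caller might handle revert on top of this primitive is to re-apply the active adapter with a negated scale before switching. The sketch below assumes a callable with the same argument order as apply_lora_delta; `AdapterSwitcher` and the recording lambda are hypothetical and stand in for a real model.

```python
class AdapterSwitcher:
    """Hypothetical caller-side helper; not part of this PR."""

    def __init__(self, apply_delta):
        self._apply = apply_delta   # callable(name, lora_A, lora_B, scale)
        self._active = None         # list of (name, lora_A, lora_B, scale)

    def activate(self, adapter):
        self.revert()               # undo whatever is currently applied
        for name, a, b, scale in adapter:
            self._apply(name, a, b, scale)
        self._active = adapter

    def revert(self):
        if self._active is None:
            return
        for name, a, b, scale in self._active:
            self._apply(name, a, b, -scale)   # subtract the same delta
        self._active = None


calls = []  # records what would be sent to the model
switcher = AdapterSwitcher(lambda name, a, b, scale: calls.append((name, scale)))
switcher.activate([("layer.weight", [[1.0]], [[1.0]], 2.0)])
switcher.revert()
print(calls)  # [('layer.weight', 2.0), ('layer.weight', -2.0)]
```

Subtracting the delta is not guaranteed to restore the original bits, especially in float16 or bfloat16, so rounding error can accumulate over many swaps. A caller that needs exact restoration would have to keep a copy of the original weights instead.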

…IME)

Adds Model::apply_lora_delta(name, lora_A, lora_B, scale) which writes
W' = W + scale * (B @ A) in-place into an existing StorageView buffer.

- Guarded by -DENABLE_LORA_RUNTIME=ON cmake option (off by default)
- float32/float16/bfloat16 supported; int8 and packed weights raise
  std::runtime_error
- Python binding exposed on WhisperWrapper with inference-lock guard
- Scope: single-variable primitive only; adapter pool management and
  revert logic are entirely the caller's responsibility
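
The dtype rule above can be mirrored by a caller-side pre-check. This is a sketch under assumptions: `check_lora_dtype`, the string dtype names, and the `is_packed` flag are hypothetical, and Python's RuntimeError is used here only as an analogue of std::runtime_error.

```python
SUPPORTED_DTYPES = {"float32", "float16", "bfloat16"}


def check_lora_dtype(name, dtype, is_packed=False):
    """Raise RuntimeError for weights the delta update does not support."""
    if is_packed:
        raise RuntimeError(f"{name}: packed weights are not supported")
    if dtype not in SUPPORTED_DTYPES:
        raise RuntimeError(f"{name}: dtype {dtype} is not supported")


check_lora_dtype("decoder/proj/weight", "float16")   # passes silently
try:
    check_lora_dtype("decoder/proj/weight", "int8")
except RuntimeError as err:
    message = str(err)
print(message)  # decoder/proj/weight: dtype int8 is not supported
```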