feat: add experimental runtime LoRA delta injection (ENABLE_LORA_RUNTIME) #2046
Open
wonsup-shin wants to merge 1 commit into
…IME)

Adds Model::apply_lora_delta(name, lora_A, lora_B, scale), which writes W' = W + scale * (B @ A) in-place into an existing StorageView buffer.

- Guarded by the -DENABLE_LORA_RUNTIME=ON cmake option (off by default)
- float32/float16/bfloat16 supported; int8 and packed weights raise std::runtime_error
- Python binding exposed on WhisperWrapper with inference-lock guard
- Scope: single-variable primitive only; adapter pool management and revert logic are entirely the caller's responsibility
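To make the update rule concrete, here is a minimal standalone sketch of an in-place `W += scale * (B @ A)` with a dtype guard. It is not the code from this PR: `Weight`, `DType`, and `apply_lora_delta_sketch` are hypothetical stand-ins, only float32 is modeled, and the shapes (`W` is rows x cols, `B` is rows x rank, `A` is rank x cols, all row-major) are an assumption based on the usual LoRA convention.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not CTranslate2 types.
enum class DType { Float32, Int8 };

struct Weight {
  DType dtype;
  std::size_t rows;         // output features
  std::size_t cols;         // input features
  std::vector<float> data;  // row-major, rows * cols elements
};

// In-place W += scale * (B @ A).
// Assumed layout: lora_b is rows x rank, lora_a is rank x cols, both row-major.
void apply_lora_delta_sketch(Weight& w,
                             const std::vector<float>& lora_a,
                             const std::vector<float>& lora_b,
                             std::size_t rank,
                             float scale) {
  // Mirror the described behavior: quantized weights are rejected.
  if (w.dtype != DType::Float32)
    throw std::runtime_error("LoRA delta: quantized weights are not supported");
  if (w.data.size() != w.rows * w.cols ||
      lora_a.size() != rank * w.cols ||
      lora_b.size() != w.rows * rank)
    throw std::invalid_argument("LoRA delta: shape mismatch");

  // Accumulate the rank-r product directly into W; no temporary for B @ A.
  for (std::size_t i = 0; i < w.rows; ++i) {
    for (std::size_t k = 0; k < rank; ++k) {
      const float s = scale * lora_b[i * rank + k];
      for (std::size_t j = 0; j < w.cols; ++j)
        w.data[i * w.cols + j] += s * lora_a[k * w.cols + j];
    }
  }
}
```

Accumulating row by row avoids materializing the full `B @ A` matrix, which matters when the weight is large and the rank is small. A real implementation would presumably dispatch on dtype and device instead of assuming a host-side float vector.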
This adds `Model::apply_lora_delta(name, lora_A, lora_B, scale)`, which writes `W' = W + scale * (lora_B @ lora_A)` in-place into an existing weight buffer.

The motivation is runtime adapter swapping without reloading the full model.

The feature is disabled by default and requires `-DENABLE_LORA_RUNTIME=ON`.

Changes:

- `CMakeLists.txt`: add `ENABLE_LORA_RUNTIME` option
- `include/ctranslate2/models/model.h`: declare `apply_lora_delta` (ifdef-guarded)
- `src/models/model.cc`: implement with float32/float16/bfloat16 dispatch
- `python/cpp/whisper.cc`: expose on `WhisperWrapper` with inference-lock guard

Scope is intentionally narrow: this is a single-variable primitive only.
Adapter pool management and revert logic are the caller's responsibility.
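Since revert logic is left to the caller, one possible caller-side pattern is to remember the active adapter and undo it by applying the same delta with the opposite sign before swapping in the next one. The sketch below is an illustration under stated assumptions, not part of this PR: `Adapter`, `add_delta`, and `AdapterSlot` are hypothetical names, and `add_delta` is a plain float32 stand-in for the primitive.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical adapter record; all names in this sketch are illustrative.
struct Adapter {
  std::vector<float> a;  // rank x cols, row-major
  std::vector<float> b;  // rows x rank, row-major
  std::size_t rank;
  float scale;
};

// Stand-in for the primitive: w += sign * scale * (B @ A), in place.
void add_delta(std::vector<float>& w, std::size_t rows, std::size_t cols,
               const Adapter& ad, float sign) {
  for (std::size_t i = 0; i < rows; ++i) {
    for (std::size_t k = 0; k < ad.rank; ++k) {
      const float s = sign * ad.scale * ad.b[i * ad.rank + k];
      for (std::size_t j = 0; j < cols; ++j)
        w[i * cols + j] += s * ad.a[k * cols + j];
    }
  }
}

// Tracks which adapter is currently merged into one weight buffer.
class AdapterSlot {
 public:
  AdapterSlot(std::vector<float>& weight, std::size_t rows, std::size_t cols)
      : weight_(weight), rows_(rows), cols_(cols) {}

  // Undo the active adapter (if any), then merge the next one.
  void swap_in(const Adapter& next) {
    revert();
    add_delta(weight_, rows_, cols_, next, 1.0f);
    active_ = next;
    has_active_ = true;
  }

  // Subtract the active delta; a no-op when nothing is merged.
  void revert() {
    if (!has_active_) return;
    add_delta(weight_, rows_, cols_, active_, -1.0f);
    has_active_ = false;
  }

  bool has_active() const { return has_active_; }

 private:
  std::vector<float>& weight_;  // caller-owned buffer; must outlive the slot
  std::size_t rows_;
  std::size_t cols_;
  Adapter active_{};
  bool has_active_ = false;
};
```

Reverting by subtraction is not guaranteed to be bit-exact in floating point, especially in float16 or bfloat16, so rounding error can accumulate over many swaps. A caller that needs exact restoration could keep a snapshot of the base weight instead, at the cost of extra memory.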