Problem Statement
Qwen3.5 inference already works well in qvac-fabric-llm.cpp (the Gated-DeltaNet op is in place as of the 8828 line), but LoRA fine-tuning is not yet supported for Qwen3.5: the fine-tuning path currently covers Qwen3, Gemma3, and BitNet only. We'd like to train lightweight LoRA adapters on the small dense Qwen3.5 variants (0.8B / 2B / 4B) for language and domain adaptation. The blocker is Qwen3.5's hybrid architecture (~75% linear-attention "Gated DeltaNet" layers mixed with full-attention layers): the linear-attention layers need custom backward-pass implementations that don't exist yet. Could you add Qwen3.5 to the supported fine-tuning architectures? Ideally this would also cover applying the resulting adapter at inference time. The standalone LoRA→GGUF adapter export is currently broken upstream for this architecture (ggml-org/llama.cpp#21125), which today forces a merge-and-reconvert workaround and rules out runtime adapter hot-swapping. Support for the small variants would make Qwen3.5 viable for on-device, fine-tuned use cases.
Proposed Solution
Two possible tracks, depending on appetite:
Track A: full on-device LoRA training (the proper fix). Implement the backward pass for the Gated-DeltaNet / linear-attention op in the fine-tuning engine. Even though LoRA only targets the standard projection modules (q/k/v/o_proj, gate/up/down_proj), the linear-attention layers make up ~75% of the stack and are interleaved with the full-attention layers, so gradients have to flow through the recurrent op to reach almost every adapter. A forward-only op isn't enough; the op's gradient is required. The chunked gated-delta-rule has known analytical gradients, and the flash-linear-attention (FLA) library is a reasonable reference implementation of the forward and backward passes to port.
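To make the Track A requirement concrete, here is a minimal sketch of a backward pass through the recurrent op. It assumes the commonly cited gated delta rule, `S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T` with `o_t = S_t q_t`, for a single head in plain sequential (not chunked) form. It is not the FLA kernel or anything from qvac-fabric-llm.cpp: all function names are hypothetical, gradients for `alpha` and `beta` are left out for brevity, and the result is only checked against finite differences.

```python
import random

def matvec(M, x):
    """M @ x for a list-of-rows matrix."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def matTvec(M, x):
    """M.T @ x."""
    return [sum(M[i][j] * x[i] for i in range(len(M))) for j in range(len(M[0]))]

def forward(q, k, v, alpha, beta):
    """Sequential gated delta rule (assumed form). Returns outputs and states S_0..S_T."""
    d_v, d_k = len(v[0]), len(k[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    states, outs = [S], []
    for t in range(len(q)):
        u = matvec(S, k[t])                                   # S_{t-1} k_t
        w = [v[t][i] - alpha[t] * u[i] for i in range(d_v)]   # v_t minus decayed read-out
        S = [[alpha[t] * S[i][j] + beta[t] * w[i] * k[t][j]
              for j in range(d_k)] for i in range(d_v)]       # S_t = alpha S_{t-1} + beta w k^T
        states.append(S)
        outs.append(matvec(S, q[t]))                          # o_t = S_t q_t
    return outs, states

def backward(q, k, v, alpha, beta, states, d_out):
    """Reverse-time pass: gradients of sum_t <d_out[t], o_t> w.r.t. q, k, v."""
    T, d_v, d_k = len(q), len(v[0]), len(k[0])
    dS = [[0.0] * d_k for _ in range(d_v)]                    # gradient w.r.t. the state
    dq, dk, dv = [None] * T, [None] * T, [None] * T
    for t in reversed(range(T)):
        S_prev, S_t = states[t], states[t + 1]
        dq[t] = matTvec(S_t, d_out[t])
        dS = [[dS[i][j] + d_out[t][i] * q[t][j] for j in range(d_k)] for i in range(d_v)]
        u = matvec(S_prev, k[t])
        w = [v[t][i] - alpha[t] * u[i] for i in range(d_v)]
        dw = [beta[t] * g for g in matvec(dS, k[t])]
        du = [-alpha[t] * g for g in dw]
        dk[t] = [beta[t] * a + b for a, b in zip(matTvec(dS, w), matTvec(S_prev, du))]
        dv[t] = dw
        # Carry the state gradient back to S_{t-1}.
        dS = [[alpha[t] * dS[i][j] + du[i] * k[t][j] for j in range(d_k)] for i in range(d_v)]
    return dq, dk, dv

def loss(q, k, v, alpha, beta, d_out):
    """Linear probe loss, so that dL/do_t equals d_out[t]."""
    outs, _ = forward(q, k, v, alpha, beta)
    return sum(g * o for g_t, o_t in zip(d_out, outs) for g, o in zip(g_t, o_t))

random.seed(0)
T, d_k, d_v = 4, 3, 2
rand = lambda n: [random.uniform(-1.0, 1.0) for _ in range(n)]
q, k = [rand(d_k) for _ in range(T)], [rand(d_k) for _ in range(T)]
v, d_out = [rand(d_v) for _ in range(T)], [rand(d_v) for _ in range(T)]
alpha = [random.uniform(0.8, 1.0) for _ in range(T)]
beta = [random.uniform(0.1, 0.9) for _ in range(T)]

_, states = forward(q, k, v, alpha, beta)
dq, dk, dv = backward(q, k, v, alpha, beta, states, d_out)

def numeric_grad(x, t, j, eps=1e-6):
    """Central finite difference of the loss w.r.t. x[t][j] (x is q, k, or v)."""
    old = x[t][j]
    x[t][j] = old + eps
    hi = loss(q, k, v, alpha, beta, d_out)
    x[t][j] = old - eps
    lo = loss(q, k, v, alpha, beta, d_out)
    x[t][j] = old
    return (hi - lo) / (2 * eps)

max_err = max(abs(numeric_grad(x, t, j) - g[t][j])
              for x, g in ((q, dq), (k, dk), (v, dv))
              for t in range(T) for j in range(len(x[t])))
print("max |analytic - numeric| =", max_err)
```

The gradient for an early token's `k` and `v` depends on the state gradient carried back from every later step, which is why a forward-only op cannot deliver gradients to adapters on the projections feeding these layers. A real implementation would presumably use the chunked form and also return gradients for the gate and beta inputs.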
Track B: interim that unblocks the workflow without new training kernels. Two smaller pieces:
(1) Officially support importing a merged, externally-fine-tuned Qwen3.5 model (train the LoRA off-device with PEFT/Unsloth, merge_and_unload() into a 16-bit base, convert with convert_hf_to_gguf.py). This already produces a working Qwen3.5 GGUF; it just needs to be a documented/supported path.
(2) Fix standalone LoRA→GGUF export so adapters can be loaded at runtime (enabling hot-swap instead of shipping a full merged model per adapter). The failure is in _reorder_v_heads / LoraTorchTensor.reshape (upstream ggml-org/llama.cpp#21125); a previously-proposed approach was to permute the LoRA B/A factors rather than reshape them (see the closed ggml-org/llama.cpp#21354).
Track B (1) is essentially free and would help immediately; Track B (2) restores runtime adapters; Track A is the full on-device capability.
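As a rough illustration of why permuting the factors can stand in for reordering a merged weight (Track B (2)), the sketch below uses tiny hand-written matrices. It is only the underlying linear-algebra identity, not the upstream converter code: the permutations, shapes, and helper names are made up for the example, and the actual head reordering in _reorder_v_heads may be more involved.

```python
def matmul(X, Y):
    """Plain list-of-rows matrix product X @ Y."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def permute_rows(M, perm):
    return [M[p] for p in perm]

def permute_cols(M, perm):
    return [[row[p] for p in perm] for row in M]

# Base weight W (out=4, in=3) and a rank-2 LoRA update B @ A (scaling omitted).
W = [[float(i * 3 + j) for j in range(3)] for i in range(4)]
B = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5]]   # (out, rank)
A = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.5]]                  # (rank, in)

row_perm = [2, 0, 3, 1]   # hypothetical reorder along the output axis
col_perm = [1, 2, 0]      # hypothetical reorder along the input axis

merged = add(W, matmul(B, A))

# Output-axis reorder: permuting rows of the merged weight equals
# permuting rows of W and rows of B, leaving A untouched.
rows_on_merged = permute_rows(merged, row_perm)
rows_on_factors = add(permute_rows(W, row_perm),
                      matmul(permute_rows(B, row_perm), A))

# Input-axis reorder: permuting columns of the merged weight equals
# permuting columns of W and columns of A, leaving B untouched.
cols_on_merged = permute_cols(merged, col_perm)
cols_on_factors = add(permute_cols(W, col_perm),
                      matmul(B, permute_cols(A, col_perm)))

print(rows_on_merged == rows_on_factors, cols_on_merged == cols_on_factors)
```

The point of the design is that the factors keep their low-rank shapes, (out, rank) and (rank, in), so the adapter can stay a standalone file. A reshape written for the full (out, in) weight has no direct equivalent on a factor that lacks one of those axes.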
Use Cases
LoRA fine-tuning for local dialects
Expected Impact
High - Improves common workflows
Alternatives Considered
No response
Constraints & Considerations
No response
Contribution
Additional Context
No response