[Feature]: Add LoRA fine-tuning support for the Qwen3.5 (Gated DeltaNet) model family #2319

@helloluis

Description

🎯 Problem Statement

Qwen3.5 inference already works well in qvac-fabric-llm.cpp (the Gated-DeltaNet op is in place as of the 8828 line), but LoRA fine-tuning is not yet supported for Qwen3.5 — the fine-tuning path currently covers Qwen3, Gemma3, and BitNet only. We'd like to train lightweight LoRA adapters on the small dense Qwen3.5 variants (0.8B / 2B / 4B) for language and domain adaptation, and the blocker is that Qwen3.5's hybrid architecture — ~75% linear-attention "Gated DeltaNet" layers mixed with full-attention layers — needs custom backward-pass implementations for those linear-attention layers that don't exist yet. Could you add Qwen3.5 to the supported fine-tuning architectures? Ideally this would also include being able to apply the resulting adapter at inference time, since the standalone LoRA→GGUF adapter export is currently broken upstream for this architecture (ggml-org/llama.cpp#21125), which today forces a merge-and-reconvert workaround and rules out runtime adapter hot-swapping. Support for the small variants would make Qwen3.5 viable for on-device, fine-tuned use cases.

💡 Proposed Solution

Two possible tracks, depending on appetite:

Track A: full on-device LoRA training (the proper fix). Implement the backward pass for the Gated-DeltaNet / linear-attention op in the fine-tuning engine. Note that even though LoRA only targets the standard projection modules (q/k/v/o_proj, gate/up/down_proj), the linear-attention layers make up ~75% of the stack and are interleaved with the full-attention layers, so gradients have to flow through the recurrent op to reach almost every adapter; a forward-only op isn't enough, the op's gradient is required. The chunked gated-delta-rule has known analytical gradients; the flash-linear-attention (FLA) library is a reasonable reference implementation for the fwd+bwd to port.

Track B — interim that unblocks the workflow without new training kernels. Two smaller pieces: (1) Officially support importing a merged, externally-fine-tuned Qwen3.5 model (train the LoRA off-device with PEFT/Unsloth, merge_and_unload() into a 16-bit base, convert with convert_hf_to_gguf.py) — this already produces a working Qwen3.5 GGUF, it just needs to be a documented/supported path. (2) Fix standalone LoRA→GGUF export so adapters can be loaded at runtime (enabling hot-swap instead of shipping a full merged model per adapter): the failure is in _reorder_v_heads / LoraTorchTensor.reshape (upstream ggml-org/llama.cpp#21125); a previously-proposed approach was to permute the LoRA B/A factors rather than reshape them (see the closed ggml-org/llama.cpp#21354).
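As a toy illustration of the "permute the factors rather than reshape" idea in (2): for a LoRA update delta_W = B @ A, reordering the output rows of delta_W is the same as reordering the rows of B, and reordering its input columns is the same as reordering the columns of A, so the reordering can be applied to the low-rank factors without materializing the full matrix. This is only a sketch of the algebra; the helper names and the permutations are hypothetical, and it does not reproduce what `_reorder_v_heads` actually does.

```python
# Toy LoRA update: delta_W = B @ A, with B of shape (out, r) and A of shape (r, in).
# All names and values here are hypothetical and only illustrate the algebra.

def matmul(X, Y):
    return [[sum(X[i][r] * Y[r][j] for r in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def permute_rows(M, perm):
    """Row i of the result is row perm[i] of M."""
    return [M[p] for p in perm]

def permute_cols(M, perm):
    """Column j of the result is column perm[j] of M."""
    return [[row[p] for p in perm] for row in M]

B = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0], [0.0, 3.0]]   # out=4, r=2
A = [[1.0, 2.0, 0.0], [0.0, -1.0, 1.0]]                 # r=2, in=3
delta_W = matmul(B, A)

row_perm = [2, 0, 3, 1]   # stand-in for a reordering of output rows / heads
col_perm = [1, 2, 0]      # stand-in for a reordering of input columns

# Permuting the merged update equals permuting only the matching factor.
assert permute_rows(delta_W, row_perm) == matmul(permute_rows(B, row_perm), A)
assert permute_cols(delta_W, col_perm) == matmul(B, permute_cols(A, col_perm))
```

The rank r dimension is never touched, which is why the adapter can stay in factored form after the reordering.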

Track B (1) is essentially free and would help immediately; Track B (2) restores runtime adapters; Track A is the full on-device capability.

📋 Use Cases

LoRA fine-tuning for local dialects

📊 Expected Impact

High - Improves common workflows

🔄 Alternatives Considered

No response

⚠️ Constraints & Considerations

No response

🀝 Contribution

  • I would be willing to submit a PR for this feature

📎 Additional Context

No response

Labels: enhancement (New feature or request)
