Skip to content

issue/401 refactor(rope): add scaling factory and TP-safe RoPE cache#402

Merged
wooway777 merged 1 commit into
InfiniTensor:mainfrom
rubik-hua:refactor_rope
Jun 3, 2026
Merged

issue/401 refactor(rope): add scaling factory and TP-safe RoPE cache#402
wooway777 merged 1 commit into
InfiniTensor:mainfrom
rubik-hua:refactor_rope

Conversation

@rubik-hua
Copy link
Copy Markdown

@rubik-hua rubik-hua commented May 28, 2026

  • Decouple scaling config instantiation from ModelConfig via factory and registry pattern.
  • Add thread-local RoPE cache with device-scoped keys to reduce VRAM usage and ensure TP safety.
  • Centralize rotary dimension calculation into ModelConfig.

ModelConfig 扩展(纯粹的数据承载)
ModelConfig 不再掺杂任何具体模型的业务判断,仅提供默认值和读写接口。RoPE::Algo 的差异由具体的模型构建入口(如 csrc/models/chatglm/chatglm_for_causal_lm.cpp)显式指定:
// model_config.hpp
class ModelConfig {
private:
infinicore::nn::RoPE::Algo rope_algo_ = infinicore::nn::RoPE::Algo::GPT_NEOX; // 默认值
public:
infinicore::nn::RoPE::Algo get_rope_algo() const { return rope_algo_; }

};
// csrc/models/chatglm/chatglm_for_causal_lm.cpp
std::shared_ptr create_chatglm_config(const json& hf_config) {
auto config = std::make_shared(hf_config);
// 只有 ChatGLM/GLM4 需要 GPT_J,在此处显式注入,不污染基类
config->set_rope_algo(infinicore::nn::RoPE::Algo::GPT_J);
return config;
}
工厂与注册表机制(字符串路由分发)
引入注册表模式,将 JSON 中的字符串(如 "longrope")映射到具体的对象构造逻辑,替代冗长的 if-else。
// rotary_embedding_factory.hpp
using ScalingCreator = std::function<std::shared_ptrinfinicore::nn::ScalingConfig(
const std::shared_ptrinfinilm::config::ModelConfig&)>;
std::unordered_map<std::string, ScalingCreator>& get_scaling_registry();
std::shared_ptrinfinicore::nn::RoPE make_rope(/* ... */);
工厂核心实现极简,仅负责组装与路由,不因为新增类型而修改:
// rotary_embedding_factory.cpp
std::shared_ptrinfinicore::nn::ScalingConfig
make_scaling_config(const std::shared_ptrinfinilm::config::ModelConfig& model_config) {
std::string scaling_type = model_config->get_orstd::string("rope_scaling_type", "default");

// 分发点:注册表路由,将字符串映射到具体的 Creator 函数
auto& registry = get_scaling_registry();
auto it = registry.find(scaling_type);
if (it != registry.end()) {
    return it->second(model_config); 
}
throw std::runtime_error("Unsupported rope_scaling_type: " + scaling_type);

}

需要与InfiniCore的下面PR一起合入,
InfiniTensor/InfiniCore#1181

重构后,新增的rope实现都集中在csrc/layers/rotary_embedding/rope_scaling_creators.cpp增加,其它地方无需修改,跟rotary_embedding.cpp和model_config.cpp解耦掉。

之前@pengcheng888 给的建议是把algo参数收编进model_config中,然后在xx_for_causal_lm.cpp 中写入,我实现了一版,但感觉特别别扭,我理解model_config还是纯粹一点好,能从json中读出来或者加工出来的。后来,我又改动了一下,还是直接放到运行时传参更加优雅吧。

重构后所有现有支持的模型已经跑通

image image image image image image image image image image image image image image image image image image

@rubik-hua rubik-hua requested a review from a team May 28, 2026 09:36
@rubik-hua
Copy link
Copy Markdown
Author

@wooway777 @pengcheng888 rope重构可以帮忙检视起来了,infinicore上也有一个pr

@wooway777
Copy link
Copy Markdown
Collaborator

@wooway777 @pengcheng888 rope重构可以帮忙检视起来了,infinicore上也有一个pr

谢谢老师,在看了

Copy link
Copy Markdown
Collaborator

@pengcheng888 pengcheng888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

请华老师再评估下,infinicore和infinilm这样改动后,如果后续添加其他类型的rope,能够hold住

Comment thread csrc/layers/rotary_embedding/rotary_embedding_factory.hpp
Comment thread csrc/models/llama_legacy/llama_config.hpp
Comment thread csrc/layers/rotary_embedding/rope_scaling_creators.cpp Outdated
Comment thread csrc/layers/rotary_embedding/rotary_embedding.cpp Outdated
@pengcheng888
Copy link
Copy Markdown
Collaborator

pengcheng888 commented Jun 1, 2026

algo参数收编进model_config中,然后在xx_for_causal_lm.cpp 中写入。

这个想法的初衷是为了删除glm4_attention.hpp/cpp文件; narrow移动到Infinicore后,使用using Glm4Attention = infinilm::layers::attention::Attention;即可。

vllm中也是类似的,在RoPE模块中实现narrow: 如下
query_rot = query[..., :rotary_dim]

https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/glm4.py中 Glm4Attention的forward中:q, k = self.rotary_emb(positions, q, k)。
narrow操作不会出现在Glm4Attention的层级中。

这样后续维护的话,维护一个layers::attention::Attention就可。(ps: 后续我们确是会对Attention类进行小调整)

这个改动是否修改,需要再商议,不在这个pr修改。

@rubik-hua
Copy link
Copy Markdown
Author

algo参数收编进model_config中,然后在xx_for_causal_lm.cpp 中写入。

这个想法的初衷是为了删除glm4_attention.hpp/cpp文件; narrow移动到Infinicore后,使用using Glm4Attention = infinilm::layers::attention::Attention;即可。

vllm中也是类似的,在RoPE模块中实现narrow: 如下 query_rot = query[..., :rotary_dim]

https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/glm4.py中 Glm4Attention的forward中:q, k = self.rotary_emb(positions, q, k)。 narrow操作不会出现在Glm4Attention的层级中。

这样后续维护的话,维护一个layers::attention::Attention就可。(ps: 后续我们确是会对Attention类进行小调整)

这个改动是否修改,需要再商议,不在这个pr修改。

对的,这样我就理解了,已经在model_config中提供了对algo的get和set方法,默认infinicore::nn::RoPE::Algo::GPT_NEOX。

所有检视意见修改完后我又重跑了一遍examples/test_infer.py,没问题。

Comment thread csrc/models/glm4/glm4_for_causal_lm.cpp
Comment thread csrc/layers/rotary_embedding/rope_scaling_creators.cpp
- Decouple scaling config instantiation from ModelConfig via factory
  and registry pattern.
- Add thread-local RoPE cache with device-scoped keys to reduce VRAM
  usage and ensure TP safety.
- Centralize rotary dimension calculation into ModelConfig.
@rubik-hua
Copy link
Copy Markdown
Author

infinilm和infinicore中的检视意见都已修改完毕,已有模型验证过了。

信号不好,上传图片太困难了

Total: 10 | OK: 10 | Failed: 0

Copy link
Copy Markdown
Collaborator

@pengcheng888 pengcheng888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已approve,但需等待Infinicore pr和终审

@wooway777 wooway777 merged commit 89c0a16 into InfiniTensor:main Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants