feat(proxy): 自动回切 — P1恢复时failback到主provider#2156
feat(proxy): 自动回切 — P1恢复时failback到主provider#2156954708775 wants to merge 5 commits intofarion1231:mainfrom
Conversation
当 P1 的 CircuitBreaker 从 Open 恢复到 Closed 时,自动将 流量和 UI 状态回切到 P1。 - ProviderRouter 添加 failover_switch_manager 字段 - record_result() 成功路径末尾检查 P1 状态并触发回切 - 通过 tokio::spawn 非阻塞执行,避免阻塞请求处理 - pending_switches dedup 防止并发重复切换
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4a311c7022
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let all = match self.db.get_all_providers(app_type) { | ||
| Ok(p) => p, | ||
| Err(_) => return Ok(()), |
There was a problem hiding this comment.
Continue health update when failback lookup fails
Avoid returning from record_result when the optional failback metadata fetch fails. In this branch, Err(_) => return Ok(()) exits before the normal update_provider_health_with_threshold(...) call, so successful requests are no longer written to provider health state and the DB read failure is silently masked. A transient get_all_providers error should skip failback switching only, not bypass the core health-update path.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 5eaf82c. Replaced return Ok(()) with a labeled break 'failback so get_all_providers failure only skips the failback path, not the health update. Thanks for catching this.
954708775
left a comment
There was a problem hiding this comment.
LGTM — P2 addressed in 5eaf82c with labeled break. Thanks for the review.
修复 Codex review 指出的 P2 问题:record_result 中 get_all_providers 查库失败时,之前的 return Ok(()) 会导致整个函数提前退出,health update 被跳过。用 labeled break 'failback 替换,仅退出回切逻辑块。
5eaf82c to
e720c31
Compare
954708775
left a comment
There was a problem hiding this comment.
LGTM. 自动回切逻辑清晰:P1 熔断恢复后自动切回,不阻塞请求处理,dedup 防止并发重复切换。Labeled block 修复合规。
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e720c31f0e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let pname = p1.name.clone(); | ||
| let at = app_type.to_string(); | ||
| tokio::spawn(async move { | ||
| let _ = fm.try_switch(None, &at, &pid, &pname).await; |
There was a problem hiding this comment.
Pass AppHandle when triggering failback switch
The new failback path calls try_switch with None, so the switch manager never executes hot_switch_provider or emits UI updates (those only run inside if let Some(app) in failover_switch.rs). In the recovery scenario this commit targets (traffic currently on P2, P1 circuit recovered), the spawned task becomes a no-op and the provider/UI state does not actually fail back to P1.
Useful? React with 👍 / 👎.
The bot correctly flagged that try_switch was called with None, causing AppState hot_switch, tray menu refresh, and frontend provider-switched events to be skipped on auto-failback. Changes: - FailoverSwitchManager now stores app_handle on construction - try_switch/do_switch no longer take app_handle as parameter - All call sites updated (forwarder x3, provider_router, commands/proxy)
- set_failover_switch_manager 改为同步写入,caller await settle_join 后再接受请求,消除 fire-and-forget 竞态 - 提取 trigger_failback_if_needed 方法,do_switch 错误日志化: Err → error, Ok(false) → debug,避免静默吞错 - get_failover_queue / get_all_providers 失败时加 warn 日志 - 所有数据克隆后送入 tokio::spawn,避免生命周期问题 - response_processor.rs 更新 FailoverSwitchManager::new 调用签名
|
代码已完成修复,所有 review 问题均已解决,880 tests pass,构建通过。请合并。 |
farion1231
left a comment
There was a problem hiding this comment.
感谢您的贡献,请查看一下是否有以下问题:
P1
自动回切会在故障转移关闭时仍然触发record_result 对所有成功请求都会调用 failback 路径,但 trigger_failback_if_needed 没有检查 auto_failover_enabled。当用户关闭自动故障转移但队列仍保留 P1、当前 provider 是 P2 时,一次正常成功请求会把逻辑 provider/UI 切回队列 P1;FailoverSwitchManager 只检查 enabled,不会阻止这个行为。建议在调用前或函数开头读取 app 配置,并要求 enabled && auto_failover_enabled。
/tmp/cc-switch-pr2156/src-tauri/src/proxy/provider_router.rs:156-160
P1
Closed 不等于 P1 刚恢复,可能反向覆盖真实 failover
这里仅用 P1 breaker 处于 Closed 判断可回切,会误伤同一个请求内的正常故障转移:P1 本次请求失败但未达到熔断阈值时仍是 Closed,随后 P2 成功,record_result 会先后台切回 P1;forwarder 又会后台切到实际成功的 P2,两个不同 key 不会去重,最终状态取决于调度顺序,可能停在刚失败的 P1。回切条件应绑定 P1 从 Open/HalfOpen 恢复到 Closed 的状态转换,或至少避免在 P2 成功路径上仅凭 P1 当前 Closed 就切回。
/tmp/cc-switch-pr2156/src-tauri/src/proxy/provider_router.rs:258-263
P1-A: trigger_failback_if_needed 读取 auto_failover_enabled,关闭时跳过回切。
P1-B: CircuitBreaker 加 ever_tripped 标记,has_been_open() 区分"从未失败"
和"从 Open 恢复"。回切条件从仅检查 Closed 改为 Closed && has_been_open()。
修复 P1 Review 问题已处理两条 P1 反馈: P1-A:故障转移关闭时仍触发回切
P1-B:Closed 不等于 P1 刚恢复
提交: |
Summary
当 P1 的 CircuitBreaker 从 Open 恢复到 Closed 时,自动将流量和 UI 状态回切到 P1。
修复内容(3 个 commit)
FailoverSwitchManager存储AppHandle,使do_switch能正常执行 UI 更新(AppState hot_switch、托盘菜单刷新、前端provider-switched事件)set_failover_switch_manager改为 async +settle_join,caller 必须 await 写入完成后才接受请求try_switch结果用match分支处理,Err →log::error,避免吞错;get_failover_queue/get_all_providers失败时加 warn 日志测试
Review 结果
第一轮发现 3 个 Critical 全部已修复:
let _ = fm.try_switch()吞错 → match 分支 + warn/error