fix(recovery): show backgrounded command when gateway restart fails (#2426)#2438
fix(recovery): show backgrounded command when gateway restart fails (#2426)#2438truffle-dev wants to merge 4 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds exported function Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/agent-runtime.ts (1)
117-123: Consider sharing launch-line assembly withbuildRecoveryScriptto avoid drift.You now have two places encoding gateway launch semantics. A small shared formatter would reduce future divergence risk between auto-recovery and manual fallback paths.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/agent-runtime.ts` around lines 117 - 123, buildManualRecoveryCommand duplicates the gateway launch-line logic found in buildRecoveryScript; extract the shared assembly into a small helper (e.g., buildGatewayLaunchLine or formatGatewayLaunch) that takes the AgentDefinition and port and returns the full nohup command string, reuse getGatewayCommand inside it, compute envPrefix and portFlag there, and update both buildManualRecoveryCommand and buildRecoveryScript to call this helper so the launch semantics are maintained in one place.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/lib/agent-runtime.ts`:
- Around line 117-123: buildManualRecoveryCommand duplicates the gateway
launch-line logic found in buildRecoveryScript; extract the shared assembly into
a small helper (e.g., buildGatewayLaunchLine or formatGatewayLaunch) that takes
the AgentDefinition and port and returns the full nohup command string, reuse
getGatewayCommand inside it, compute envPrefix and portFlag there, and update
both buildManualRecoveryCommand and buildRecoveryScript to call this helper so
the launch semantics are maintained in one place.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: daaa6409-3035-4980-a583-e19c87771787
📒 Files selected for processing (2)
src/lib/agent-runtime.test.tssrc/lib/agent-runtime.ts
8c8e550 to
13ea6b8
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/agent-runtime.ts`:
- Around line 130-136: buildManualRecoveryCommand can produce an invalid nohup
call when getGatewayCommand(agent) returns a whitespace-only string; mirror
buildRecoveryScript by trimming and falling back to the default gateway command.
Inside buildManualRecoveryCommand, call getGatewayCommand(agent), assign
gatewayCmd = gatewayCmdFromGetter?.trim() and if that results in an empty string
use the same fallback/default used by buildRecoveryScript (or
getGatewayCommand's fallback), then proceed to compute isHermes, envPrefix and
portFlag and build the final command string so whitespace-only values no longer
yield an empty nohup command.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 245753ba-4c78-40dc-876e-e557c4176feb
📒 Files selected for processing (3)
src/lib/agent-runtime.test.tssrc/lib/agent-runtime.tssrc/nemoclaw.ts
|
✨ Thanks for submitting this pull request that proposes a way to fix a bug where the automatic gateway recovery fails and provides a manual fallback command. Related open issues: |
c721f3f to
5f5ed72
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/agent-runtime.ts`:
- Around line 130-136: The fallback in buildManualRecoveryCommand currently
hardcodes "openclaw gateway run" when getGatewayCommand(agent) is empty, which
is invalid for non-OpenClaw agents; change the fallback to derive the default
gateway command from the agent binary similarly to buildRecoveryScript (i.e., if
agent is non-null use `${agent.name} gateway run` as the default command,
otherwise fall back to "openclaw gateway run"), keep the existing use of
getGatewayCommand(agent).trim(), and preserve the isHermes/envPrefix/portFlag
logic so the final returned command uses the derived default rather than the
hardcoded OpenClaw string.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1e467193-0061-4995-a893-3a83b67064f8
📒 Files selected for processing (3)
src/lib/agent-runtime.test.tssrc/lib/agent-runtime.tssrc/nemoclaw.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/nemoclaw.ts
…VIDIA#2426) When checkAndRecoverSandboxProcesses() cannot auto-recover the gateway, the fallback message told the user to run the raw gateway_command (e.g. `hermes gateway run`). That's a foreground debugging command: it dies on disconnect and lacks the --port flag plus the agent-specific env vars that nemoclaw-start.sh sets at boot — which matches the "impossible to restart" symptom in NVIDIA#2426. This keeps the auto-recovery path unchanged and fixes the manual fallback: print a single copy-pasteable command that actually persists. - `src/lib/agent-runtime.ts` adds `buildManualRecoveryCommand(agent, port)`. Returns the single-line equivalent of `buildRecoveryScript`'s launch line: `<env> nohup <gateway_command> --port <port> >/tmp/gateway.log 2>&1 &`. Hermes gets `HERMES_HOME=/sandbox/.hermes-data`, matching the existing recovery script. Null agent falls back to `nohup openclaw gateway run --port 18789 ...`. - `src/nemoclaw.ts` now prints the helper output instead of the raw `gateway_command`. Port resolves to `_recoveryAgent?.forwardPort ?? DASHBOARD_PORT`, matching the expression `recoverSandboxProcesses` already uses on the auto-recovery path. - `src/lib/agent-runtime.test.ts` adds six regression tests covering nohup, `&`, `--port`, `/tmp/gateway.log` redirect, `HERMES_HOME` on hermes, absence of `HERMES_HOME` on other agents, and the null-agent openclaw fallback. Fixes NVIDIA#2426 Signed-off-by: truffle (AI agent) <truffleagent@gmail.com>
…n port (NVIDIA#2426) Hermes reads its listen port from HERMES_HOME/config.yaml (platforms.api_server.extra.port: 18642, provisioned by agents/hermes/generate-config.ts). Start-up in agents/hermes/start.sh relies on socat to bridge 0.0.0.0:8642 to 127.0.0.1:18642; it runs `hermes gateway run` with no --port flag. The manual recovery command was passing `--port 8642`, which overrides config.yaml and binds hermes to 8642 directly, defeating the forwarder. Drop --port for hermes and mirror start.sh: set HERMES_HOME and let config.yaml drive the port. Non-hermes agents still get --port.
Mirror buildRecoveryScript's defensive trim+fallback so a whitespace-only gateway_command produces a usable nohup line instead of `nohup --port 19000 ...`. Adds a test covering the whitespace-input case.
Mirrors buildRecoveryScript's binary-path-derived fallback in buildManualRecoveryCommand. When gateway_command is blank, the helper now produces "<binary-name> gateway run" instead of hardcoding "openclaw gateway run", which would have been wrong for any non-OpenClaw agent without an explicit gateway_command. Updates the existing whitespace-only test (which claimed to mirror buildRecoveryScript but asserted the opposite) and adds two cases: agent with binary_path but undefined gateway_command, and agent with neither (the OpenClaw fallback path).
bc20bf6 to
4c50ed7
Compare
Cover both #2426 recovery gaps in the superseding PR: automatic Hermes recovery now launches through config-driven port selection, and the manual fallback message now prints a persistent nohup command instead of the foreground gateway command. The manual fallback approach is adapted from #2438 by @truffle-dev. Signed-off-by: Aaron Erickson <aerickson@nvidia.com> (cherry picked from commit 59ceb0f)
|
Closing as superseded by #2894. Credit to @truffle-dev for the manual fallback command approach from this PR; #2894 carries that work forward, updates it for the current Hermes home path (/sandbox/.hermes), and adds the remaining automatic Hermes recovery port fix. This is a procedural supersede because this PR is blocked on DCO and behind current main, not a rejection of the original fix direction. |
## Summary - Supersedes #2438 by carrying forward its manual fallback idea: when recovery fails, NemoClaw now prints a persistent `nohup ... &` command instead of the foreground-only gateway command. - Fixes Hermes automatic recovery command generation so `hermes gateway run` does not receive the external forward port. - Uses the current Hermes home path, `/sandbox/.hermes`, and keeps non-Hermes agents on `--port <port>`. - Adds deterministic unit coverage for both automatic Hermes recovery and manual fallback command generation. ## Credit The manual fallback command approach was originally proposed in #2438 by @truffle-dev. This PR folds that work forward because #2438 is blocked by DCO and far behind current `main`, while also fixing the Hermes home path and the remaining automatic-recovery port issue. ## Context Addresses #2426. Hermes startup defines `/sandbox/.hermes` as `HERMES_HOME`, configures the API server to listen on internal port `18642`, and exposes public port `8642` through the socat bridge. Passing `--port 8642` to Hermes bypasses that config-driven startup model. ## Verification - `npm run build:cli` - `npx vitest run src/lib/agent-runtime.test.ts` ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## AI Disclosure - [x] AI-assisted (tool: Codex) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added manual recovery commands that users can copy and paste when automatic gateway recovery fails * Improved recovery process for Hermes agents with proper environment configuration * Enhanced user guidance with explicit recovery instructions and fallback options * **Tests** * Extended test coverage for recovery scenarios including Hermes-specific recovery cases <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Summary
When automatic gateway recovery fails, the fallback "run manually" message tells the user to run the raw
gateway_commandfrom the agent manifest (hermes gateway run). That is the foreground debugging command: it dies on disconnect and lacks the--portflag plus the agent-specific env vars thatnemoclaw-start.shsets at boot. The user pastes it, the gateway comes up, then dies as soon as they exit the sandbox, which matches the "impossible to restart" symptom @oparoz reports in #2426.This PR keeps the auto-recovery path unchanged and fixes the manual fallback: print a single copy-pasteable command that actually persists.
Related Issue
Fixes #2426
Changes
src/lib/agent-runtime.ts: addbuildManualRecoveryCommand(agent, port). Returns the single-line equivalent ofbuildRecoveryScript's launch line:<env-prefix> nohup <gateway_command> --port <port> >/tmp/gateway.log 2>&1 &. Hermes getsHERMES_HOME=/sandbox/.hermes-data, same as the existing recovery script. Null agent falls back tonohup openclaw gateway run --port 18789 ….src/nemoclaw.ts:checkAndRecoverSandboxProcessesnow prints the new helper's output instead of the baregateway_command. Port resolves to_recoveryAgent?.forwardPort ?? DASHBOARD_PORT, matching the expressionrecoverSandboxProcessesalready uses on the auto-recovery path.src/lib/agent-runtime.test.ts: 6 regression tests coveringnohup,&,--port,/tmp/gateway.logredirect,HERMES_HOMEon hermes, absence ofHERMES_HOMEon other agents, and the null-agent openclaw fallback.Before / after
Before (from #2426 log):
After:
For the default OpenClaw path the message becomes:
Scope
Intentionally narrow. The auto-recovery path in
recoverSandboxProcesses/buildRecoveryScriptis untouched; this PR only fixes the text the user sees when auto-recovery fails. A separate, broader investigation of why auto-recovery fails in @oparoz's repro (the sandbox-side script returned non-zero) is a reasonable follow-up; this PR makes the manual fallback actually work in the meantime, which is the half of #2426 I can verify.Not related to #2050 (which adds a standalone
nemoclaw <name> recoverCLI command for a different gap).Verification
Stash-bisect against this branch:
TypeError: buildManualRecoveryCommand is not a function.Type of Change
Verification
npm run build:clipassesnpx vitest run src/lib/agent-runtime.test.ts: 11/11 pass (5 existing + 6 new)src/lib/agent-runtime.tschangeAI Disclosure
Summary by CodeRabbit
New Features
Tests
Bug Fixes