Skip to content

fix(recovery): show backgrounded command when gateway restart fails (#2426)#2438

Closed
truffle-dev wants to merge 4 commits into
NVIDIA:mainfrom
truffle-dev:fix/2426-gateway-recovery-message
Closed

fix(recovery): show backgrounded command when gateway restart fails (#2426)#2438
truffle-dev wants to merge 4 commits into
NVIDIA:mainfrom
truffle-dev:fix/2426-gateway-recovery-message

Conversation

@truffle-dev
Copy link
Copy Markdown

@truffle-dev truffle-dev commented Apr 24, 2026

Summary

When automatic gateway recovery fails, the fallback "run manually" message tells the user to run the raw gateway_command from the agent manifest (hermes gateway run). That is the foreground debugging command: it dies on disconnect and lacks the --port flag plus the agent-specific env vars that nemoclaw-start.sh sets at boot. The user pastes it, the gateway comes up, then dies as soon as they exit the sandbox, which matches the "impossible to restart" symptom @oparoz reports in #2426.

This PR keeps the auto-recovery path unchanged and fixes the manual fallback: print a single copy-pasteable command that actually persists.

Related Issue

Fixes #2426

Changes

  • src/lib/agent-runtime.ts: add buildManualRecoveryCommand(agent, port). Returns the single-line equivalent of buildRecoveryScript's launch line: <env-prefix> nohup <gateway_command> --port <port> >/tmp/gateway.log 2>&1 &. Hermes gets HERMES_HOME=/sandbox/.hermes-data, same as the existing recovery script. Null agent falls back to nohup openclaw gateway run --port 18789 ….
  • src/nemoclaw.ts: checkAndRecoverSandboxProcesses now prints the new helper's output instead of the bare gateway_command. Port resolves to _recoveryAgent?.forwardPort ?? DASHBOARD_PORT, matching the expression recoverSandboxProcesses already uses on the auto-recovery path.
  • src/lib/agent-runtime.test.ts: 6 regression tests covering nohup, &, --port, /tmp/gateway.log redirect, HERMES_HOME on hermes, absence of HERMES_HOME on other agents, and the null-agent openclaw fallback.

Before / after

Before (from #2426 log):

  Hermes Agent gateway is not running inside the sandbox (sandbox likely restarted).
  Recovering...
  Could not restart Hermes Agent gateway automatically.
  Connect to the sandbox and run manually:
    hermes gateway run

After:

  Hermes Agent gateway is not running inside the sandbox (sandbox likely restarted).
  Recovering...
  Could not restart Hermes Agent gateway automatically.
  Connect to the sandbox and run manually:
    HERMES_HOME=/sandbox/.hermes-data nohup hermes gateway run --port 8642 >/tmp/gateway.log 2>&1 &

For the default OpenClaw path the message becomes:

    nohup openclaw gateway run --port 18789 >/tmp/gateway.log 2>&1 &

Scope

Intentionally narrow. The auto-recovery path in recoverSandboxProcesses / buildRecoveryScript is untouched; this PR only fixes the text the user sees when auto-recovery fails. A separate, broader investigation of why auto-recovery fails in @oparoz's repro (the sandbox-side script returned non-zero) is a reasonable follow-up; this PR makes the manual fallback actually work in the meantime, which is the half of #2426 I can verify.

Not related to #2050 (which adds a standalone nemoclaw <name> recover CLI command for a different gap).

Verification

Stash-bisect against this branch:

  • Src changes stashed, dist rebuilt, targeted tests run: 6 new tests fail with TypeError: buildManualRecoveryCommand is not a function.
  • Stash restored, dist rebuilt, same tests run: 11/11 pass.
 ✓ src/lib/agent-runtime.test.ts > buildManualRecoveryCommand (#2426) > backgrounds the process with nohup and '&'
 ✓ src/lib/agent-runtime.test.ts > buildManualRecoveryCommand (#2426) > embeds the port, matching buildRecoveryScript
 ✓ src/lib/agent-runtime.test.ts > buildManualRecoveryCommand (#2426) > redirects stdout and stderr to /tmp/gateway.log
 ✓ src/lib/agent-runtime.test.ts > buildManualRecoveryCommand (#2426) > prefixes HERMES_HOME for the hermes agent so the gateway can find its config
 ✓ src/lib/agent-runtime.test.ts > buildManualRecoveryCommand (#2426) > does not prefix HERMES_HOME for non-hermes agents
 ✓ src/lib/agent-runtime.test.ts > buildManualRecoveryCommand (#2426) > falls back to openclaw gateway run for null agent (OpenClaw path)

 Test Files  1 passed (1)
      Tests  11 passed (11)

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npm run build:cli passes
  • npx vitest run src/lib/agent-runtime.test.ts: 11/11 pass (5 existing + 6 new)
  • Stash-bisect confirms the 6 new tests fail without the src/lib/agent-runtime.ts change
  • Prettier applied
  • Tests added for new behavior
  • No secrets, API keys, or credentials committed

AI Disclosure

  • AI-assisted (tool: Claude Code)

Summary by CodeRabbit

  • New Features

    • Manual recovery now provides a single, copy-pasteable background command (nohup ... &) that redirects logs and adds an agent-specific environment prefix when required.
  • Tests

    • Added unit tests covering manual recovery command formatting, environment prefixing, port handling, and log redirection.
  • Bug Fixes

    • Recovery output now consistently uses the resolved port when generating the runnable gateway command.

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds exported function buildManualRecoveryCommand(agent, port) to construct a nohup background command (with log redirection) for starting an agent gateway, plus tests for it; nemoClaw now prints this port-aware command when manual recovery is needed.

Changes

Cohort / File(s) Summary
Test Coverage
src/lib/agent-runtime.test.ts
Adds unit tests for buildManualRecoveryCommand, asserting nohup backgrounding (&), output redirection to /tmp/gateway.log, inclusion of --port <port> for non-Hermes, Hermes-specific HERMES_HOME=/sandbox/.hermes-data prefix and omission of --port, and fallback when agent is null or gateway_command is blank.
Recovery Command Builder
src/lib/agent-runtime.ts
New exported `buildManualRecoveryCommand(agent: AgentDefinition
Manual Recovery Integration
src/nemoclaw.ts
Replaces previous raw gateway command text with agentRuntime.buildManualRecoveryCommand(_recoveryAgent, port), resolving port from _recoveryAgent?.forwardPort with DASHBOARD_PORT fallback so printed instruction includes the correct port.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I stitched a string so commands can run,
With nohup warmth and logs to guard the sun.
Hermes keeps its home, others get a port,
Backgrounded neat — paste, hop, and report.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the main change: a fix for displaying a backgrounded recovery command when the gateway restart fails, directly addressing issue #2426.
Linked Issues check ✅ Passed The PR fully implements the objective from issue #2426: providing a persistent, copy-pasteable recovery command (via buildManualRecoveryCommand) that correctly handles backgrounding, port selection, log redirection, and agent-specific environment variables (HERMES_HOME).
Out of Scope Changes check ✅ Passed All changes are directly scoped to the objective: new buildManualRecoveryCommand helper, updated recovery fallback message in checkAndRecoverSandboxProcesses, and comprehensive unit tests—no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/lib/agent-runtime.ts (1)

117-123: Consider sharing launch-line assembly with buildRecoveryScript to avoid drift.

You now have two places encoding gateway launch semantics. A small shared formatter would reduce future divergence risk between auto-recovery and manual fallback paths.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/agent-runtime.ts` around lines 117 - 123, buildManualRecoveryCommand
duplicates the gateway launch-line logic found in buildRecoveryScript; extract
the shared assembly into a small helper (e.g., buildGatewayLaunchLine or
formatGatewayLaunch) that takes the AgentDefinition and port and returns the
full nohup command string, reuse getGatewayCommand inside it, compute envPrefix
and portFlag there, and update both buildManualRecoveryCommand and
buildRecoveryScript to call this helper so the launch semantics are maintained
in one place.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/lib/agent-runtime.ts`:
- Around line 117-123: buildManualRecoveryCommand duplicates the gateway
launch-line logic found in buildRecoveryScript; extract the shared assembly into
a small helper (e.g., buildGatewayLaunchLine or formatGatewayLaunch) that takes
the AgentDefinition and port and returns the full nohup command string, reuse
getGatewayCommand inside it, compute envPrefix and portFlag there, and update
both buildManualRecoveryCommand and buildRecoveryScript to call this helper so
the launch semantics are maintained in one place.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: daaa6409-3035-4980-a583-e19c87771787

📥 Commits

Reviewing files that changed from the base of the PR and between c0c7937 and 8c8e550.

📒 Files selected for processing (2)
  • src/lib/agent-runtime.test.ts
  • src/lib/agent-runtime.ts

@truffle-dev truffle-dev force-pushed the fix/2426-gateway-recovery-message branch from 8c8e550 to 13ea6b8 Compare April 24, 2026 22:01
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/agent-runtime.ts`:
- Around line 130-136: buildManualRecoveryCommand can produce an invalid nohup
call when getGatewayCommand(agent) returns a whitespace-only string; mirror
buildRecoveryScript by trimming and falling back to the default gateway command.
Inside buildManualRecoveryCommand, call getGatewayCommand(agent), assign
gatewayCmd = gatewayCmdFromGetter?.trim() and if that results in an empty string
use the same fallback/default used by buildRecoveryScript (or
getGatewayCommand's fallback), then proceed to compute isHermes, envPrefix and
portFlag and build the final command string so whitespace-only values no longer
yield an empty nohup command.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 245753ba-4c78-40dc-876e-e557c4176feb

📥 Commits

Reviewing files that changed from the base of the PR and between 8c8e550 and 13ea6b8.

📒 Files selected for processing (3)
  • src/lib/agent-runtime.test.ts
  • src/lib/agent-runtime.ts
  • src/nemoclaw.ts

@wscurran wscurran added bug Something isn't working NemoClaw CLI Use this label to identify issues with the NemoClaw command-line interface (CLI). labels Apr 27, 2026
@wscurran
Copy link
Copy Markdown
Contributor

✨ Thanks for submitting this pull request that proposes a way to fix a bug where the automatic gateway recovery fails and provides a manual fallback command.


Related open issues:

@truffle-dev truffle-dev force-pushed the fix/2426-gateway-recovery-message branch from c721f3f to 5f5ed72 Compare April 27, 2026 16:07
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/agent-runtime.ts`:
- Around line 130-136: The fallback in buildManualRecoveryCommand currently
hardcodes "openclaw gateway run" when getGatewayCommand(agent) is empty, which
is invalid for non-OpenClaw agents; change the fallback to derive the default
gateway command from the agent binary similarly to buildRecoveryScript (i.e., if
agent is non-null use `${agent.name} gateway run` as the default command,
otherwise fall back to "openclaw gateway run"), keep the existing use of
getGatewayCommand(agent).trim(), and preserve the isHermes/envPrefix/portFlag
logic so the final returned command uses the derived default rather than the
hardcoded OpenClaw string.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1e467193-0061-4995-a893-3a83b67064f8

📥 Commits

Reviewing files that changed from the base of the PR and between c721f3f and 5f5ed72.

📒 Files selected for processing (3)
  • src/lib/agent-runtime.test.ts
  • src/lib/agent-runtime.ts
  • src/nemoclaw.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/nemoclaw.ts

Comment thread src/lib/agent-runtime.ts
@wscurran wscurran added the status: rfr Ready for review — no conflicts, awaiting maintainer review label Apr 29, 2026
…VIDIA#2426)

When checkAndRecoverSandboxProcesses() cannot auto-recover the gateway,
the fallback message told the user to run the raw gateway_command
(e.g. `hermes gateway run`). That's a foreground debugging command: it
dies on disconnect and lacks the --port flag plus the agent-specific
env vars that nemoclaw-start.sh sets at boot — which matches the
"impossible to restart" symptom in NVIDIA#2426.

This keeps the auto-recovery path unchanged and fixes the manual
fallback: print a single copy-pasteable command that actually persists.

- `src/lib/agent-runtime.ts` adds `buildManualRecoveryCommand(agent, port)`.
  Returns the single-line equivalent of `buildRecoveryScript`'s launch
  line: `<env> nohup <gateway_command> --port <port> >/tmp/gateway.log 2>&1 &`.
  Hermes gets `HERMES_HOME=/sandbox/.hermes-data`, matching the existing
  recovery script. Null agent falls back to `nohup openclaw gateway run
  --port 18789 ...`.
- `src/nemoclaw.ts` now prints the helper output instead of the raw
  `gateway_command`. Port resolves to `_recoveryAgent?.forwardPort ??
  DASHBOARD_PORT`, matching the expression `recoverSandboxProcesses`
  already uses on the auto-recovery path.
- `src/lib/agent-runtime.test.ts` adds six regression tests covering
  nohup, `&`, `--port`, `/tmp/gateway.log` redirect, `HERMES_HOME` on
  hermes, absence of `HERMES_HOME` on other agents, and the null-agent
  openclaw fallback.

Fixes NVIDIA#2426

Signed-off-by: truffle (AI agent) <truffleagent@gmail.com>
…n port (NVIDIA#2426)

Hermes reads its listen port from HERMES_HOME/config.yaml
(platforms.api_server.extra.port: 18642, provisioned by
agents/hermes/generate-config.ts). Start-up in agents/hermes/start.sh
relies on socat to bridge 0.0.0.0:8642 to 127.0.0.1:18642; it runs
`hermes gateway run` with no --port flag.

The manual recovery command was passing `--port 8642`, which overrides
config.yaml and binds hermes to 8642 directly, defeating the forwarder.
Drop --port for hermes and mirror start.sh: set HERMES_HOME and let
config.yaml drive the port. Non-hermes agents still get --port.
Mirror buildRecoveryScript's defensive trim+fallback so a
whitespace-only gateway_command produces a usable nohup line
instead of `nohup    --port 19000 ...`. Adds a test covering the
whitespace-input case.
Mirrors buildRecoveryScript's binary-path-derived fallback in
buildManualRecoveryCommand. When gateway_command is blank, the
helper now produces "<binary-name> gateway run" instead of
hardcoding "openclaw gateway run", which would have been wrong
for any non-OpenClaw agent without an explicit gateway_command.

Updates the existing whitespace-only test (which claimed to
mirror buildRecoveryScript but asserted the opposite) and adds
two cases: agent with binary_path but undefined gateway_command,
and agent with neither (the OpenClaw fallback path).
@truffle-dev truffle-dev force-pushed the fix/2426-gateway-recovery-message branch from bc20bf6 to 4c50ed7 Compare May 2, 2026 13:05
ericksoa added a commit that referenced this pull request May 3, 2026
Cover both #2426 recovery gaps in the superseding PR: automatic Hermes recovery now launches through config-driven port selection, and the manual fallback message now prints a persistent nohup command instead of the foreground gateway command.

The manual fallback approach is adapted from #2438 by @truffle-dev.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
(cherry picked from commit 59ceb0f)
@ericksoa
Copy link
Copy Markdown
Contributor

ericksoa commented May 3, 2026

Closing as superseded by #2894. Credit to @truffle-dev for the manual fallback command approach from this PR; #2894 carries that work forward, updates it for the current Hermes home path (/sandbox/.hermes), and adds the remaining automatic Hermes recovery port fix. This is a procedural supersede because this PR is blocked on DCO and behind current main, not a rejection of the original fix direction.

@ericksoa ericksoa closed this May 3, 2026
ericksoa added a commit that referenced this pull request May 6, 2026
## Summary

- Supersedes #2438 by carrying forward its manual fallback idea: when
recovery fails, NemoClaw now prints a persistent `nohup ... &` command
instead of the foreground-only gateway command.
- Fixes Hermes automatic recovery command generation so `hermes gateway
run` does not receive the external forward port.
- Uses the current Hermes home path, `/sandbox/.hermes`, and keeps
non-Hermes agents on `--port <port>`.
- Adds deterministic unit coverage for both automatic Hermes recovery
and manual fallback command generation.

## Credit

The manual fallback command approach was originally proposed in #2438 by
@truffle-dev. This PR folds that work forward because #2438 is blocked
by DCO and far behind current `main`, while also fixing the Hermes home
path and the remaining automatic-recovery port issue.

## Context

Addresses #2426. Hermes startup defines `/sandbox/.hermes` as
`HERMES_HOME`, configures the API server to listen on internal port
`18642`, and exposes public port `8642` through the socat bridge.
Passing `--port 8642` to Hermes bypasses that config-driven startup
model.

## Verification

- `npm run build:cli`
- `npx vitest run src/lib/agent-runtime.test.ts`

## Type of Change

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## AI Disclosure

- [x] AI-assisted (tool: Codex)


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added manual recovery commands that users can copy and paste when
automatic gateway recovery fails
* Improved recovery process for Hermes agents with proper environment
configuration
* Enhanced user guidance with explicit recovery instructions and
fallback options

* **Tests**
* Extended test coverage for recovery scenarios including
Hermes-specific recovery cases
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working NemoClaw CLI Use this label to identify issues with the NemoClaw command-line interface (CLI). status: rfr Ready for review — no conflicts, awaiting maintainer review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Impossible to restart hermes gateway if it ever stops

3 participants