fix(recovery): show backgrounded command when gateway restart fails (#2426) by truffle-dev · Pull Request #2438 · NVIDIA/NemoClaw

truffle-dev · 2026-04-24T14:12:24Z

Summary

When automatic gateway recovery fails, the fallback "run manually" message tells the user to run the raw gateway_command from the agent manifest (hermes gateway run). That is the foreground debugging command: it dies on disconnect and lacks the --port flag plus the agent-specific env vars that nemoclaw-start.sh sets at boot. The user pastes it, the gateway comes up, then dies as soon as they exit the sandbox, which matches the "impossible to restart" symptom @oparoz reports in #2426.

This PR keeps the auto-recovery path unchanged and fixes the manual fallback: print a single copy-pasteable command that actually persists.

Related Issue

Fixes #2426

Changes

src/lib/agent-runtime.ts: add buildManualRecoveryCommand(agent, port). Returns the single-line equivalent of buildRecoveryScript's launch line: <env-prefix> nohup <gateway_command> --port <port> >/tmp/gateway.log 2>&1 &. Hermes gets HERMES_HOME=/sandbox/.hermes-data, same as the existing recovery script. Null agent falls back to nohup openclaw gateway run --port 18789 ….
src/nemoclaw.ts: checkAndRecoverSandboxProcesses now prints the new helper's output instead of the bare gateway_command. Port resolves to _recoveryAgent?.forwardPort ?? DASHBOARD_PORT, matching the expression recoverSandboxProcesses already uses on the auto-recovery path.
src/lib/agent-runtime.test.ts: 6 regression tests covering nohup, &, --port, /tmp/gateway.log redirect, HERMES_HOME on hermes, absence of HERMES_HOME on other agents, and the null-agent openclaw fallback.
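
The launch line described in the list above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `AgentSketch`, `DEFAULT_GATEWAY_COMMAND`, and `buildManualRecoveryCommandSketch` are hypothetical names, and the real `AgentDefinition` type likely carries more fields than the two assumed here.

```typescript
// Hypothetical, simplified stand-in for the project's AgentDefinition type.
interface AgentSketch {
  name: string;
  gateway_command?: string;
}

// Null-agent fallback named in the PR description.
const DEFAULT_GATEWAY_COMMAND = "openclaw gateway run";

// Sketch of the single-line manual recovery command as first proposed:
// <env-prefix> nohup <gateway_command> --port <port> >/tmp/gateway.log 2>&1 &
function buildManualRecoveryCommandSketch(agent: AgentSketch | null, port: number): string {
  const gatewayCommand = agent?.gateway_command ?? DEFAULT_GATEWAY_COMMAND;
  // Hermes gets HERMES_HOME so the gateway can find its config.
  const envPrefix = agent?.name === "hermes" ? "HERMES_HOME=/sandbox/.hermes-data " : "";
  return `${envPrefix}nohup ${gatewayCommand} --port ${port} >/tmp/gateway.log 2>&1 &`;
}

console.log(
  buildManualRecoveryCommandSketch({ name: "hermes", gateway_command: "hermes gateway run" }, 8642),
);
console.log(buildManualRecoveryCommandSketch(null, 18789));
```

With these inputs the sketch reproduces the two command lines quoted under "Before / after". Later commits on this branch change the Hermes case, so treat this as the initial shape only.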

Before / after

Before (from #2426 log):

  Hermes Agent gateway is not running inside the sandbox (sandbox likely restarted).
  Recovering...
  Could not restart Hermes Agent gateway automatically.
  Connect to the sandbox and run manually:
    hermes gateway run

After:

  Hermes Agent gateway is not running inside the sandbox (sandbox likely restarted).
  Recovering...
  Could not restart Hermes Agent gateway automatically.
  Connect to the sandbox and run manually:
    HERMES_HOME=/sandbox/.hermes-data nohup hermes gateway run --port 8642 >/tmp/gateway.log 2>&1 &

For the default OpenClaw path the message becomes:

    nohup openclaw gateway run --port 18789 >/tmp/gateway.log 2>&1 &

Scope

Intentionally narrow. The auto-recovery path in recoverSandboxProcesses / buildRecoveryScript is untouched; this PR only fixes the text the user sees when auto-recovery fails. A separate, broader investigation of why auto-recovery fails in @oparoz's repro (the sandbox-side script returned non-zero) is a reasonable follow-up; this PR makes the manual fallback actually work in the meantime, which is the half of #2426 I can verify.

Not related to #2050 (which adds a standalone nemoclaw <name> recover CLI command for a different gap).

Verification

Stash-bisect against this branch:

Src changes stashed, dist rebuilt, targeted tests run: 6 new tests fail with TypeError: buildManualRecoveryCommand is not a function.
Stash restored, dist rebuilt, same tests run: 11/11 pass.

 ✓ src/lib/agent-runtime.test.ts > buildManualRecoveryCommand (#2426) > backgrounds the process with nohup and '&'
 ✓ src/lib/agent-runtime.test.ts > buildManualRecoveryCommand (#2426) > embeds the port, matching buildRecoveryScript
 ✓ src/lib/agent-runtime.test.ts > buildManualRecoveryCommand (#2426) > redirects stdout and stderr to /tmp/gateway.log
 ✓ src/lib/agent-runtime.test.ts > buildManualRecoveryCommand (#2426) > prefixes HERMES_HOME for the hermes agent so the gateway can find its config
 ✓ src/lib/agent-runtime.test.ts > buildManualRecoveryCommand (#2426) > does not prefix HERMES_HOME for non-hermes agents
 ✓ src/lib/agent-runtime.test.ts > buildManualRecoveryCommand (#2426) > falls back to openclaw gateway run for null agent (OpenClaw path)

 Test Files  1 passed (1)
      Tests  11 passed (11)

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npm run build:cli passes
npx vitest run src/lib/agent-runtime.test.ts: 11/11 pass (5 existing + 6 new)
Stash-bisect confirms the 6 new tests fail without the src/lib/agent-runtime.ts change
Prettier applied
Tests added for new behavior
No secrets, API keys, or credentials committed

AI Disclosure

AI-assisted (tool: Claude Code)

Summary by CodeRabbit

New Features
- Manual recovery now provides a single, copy-pasteable background command (nohup ... &) that redirects logs and adds an agent-specific environment prefix when required.
Tests
- Added unit tests covering manual recovery command formatting, environment prefixing, port handling, and log redirection.
Bug Fixes
- Recovery output now consistently uses the resolved port when generating the runnable gateway command.

copy-pr-bot · 2026-04-24T14:12:28Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-04-24T14:12:37Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

Walkthrough

Adds exported function buildManualRecoveryCommand(agent, port) to construct a nohup background command (with log redirection) for starting an agent gateway, plus tests for it; NemoClaw now prints this port-aware command when manual recovery is needed.

Changes

Cohort / File(s)	Summary
Test Coverage `src/lib/agent-runtime.test.ts`	Adds unit tests for `buildManualRecoveryCommand`, asserting `nohup` backgrounding (`&`), output redirection to `/tmp/gateway.log`, inclusion of `--port <port>` for non-Hermes, Hermes-specific `HERMES_HOME=/sandbox/.hermes-data` prefix and omission of `--port`, and fallback when agent is `null` or `gateway_command` is blank.
Recovery Command Builder `src/lib/agent-runtime.ts`	New exported `buildManualRecoveryCommand(agent: AgentDefinition
Manual Recovery Integration `src/nemoclaw.ts`	Replaces previous raw gateway command text with `agentRuntime.buildManualRecoveryCommand(_recoveryAgent, port)`, resolving `port` from `_recoveryAgent?.forwardPort` with `DASHBOARD_PORT` fallback so printed instruction includes the correct port.
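
The port resolution in the last row can be illustrated with a short sketch. The value of `DASHBOARD_PORT` is an assumption inferred from the `--port 18789` OpenClaw fallback in the PR description, and `RecoveryAgentSketch` and `resolveRecoveryPort` are hypothetical names used only for this example.

```typescript
// Assumption: DASHBOARD_PORT matches the OpenClaw default shown in the PR description.
const DASHBOARD_PORT = 18789;

// Hypothetical stand-in for the agent object held in _recoveryAgent.
interface RecoveryAgentSketch {
  forwardPort?: number;
}

// Mirrors the expression `_recoveryAgent?.forwardPort ?? DASHBOARD_PORT`:
// optional chaining tolerates a null agent, and `??` falls back only when
// forwardPort is null or undefined.
function resolveRecoveryPort(recoveryAgent: RecoveryAgentSketch | null): number {
  return recoveryAgent?.forwardPort ?? DASHBOARD_PORT;
}

console.log(resolveRecoveryPort({ forwardPort: 8642 })); // 8642
console.log(resolveRecoveryPort(null)); // 18789
```

Using `??` rather than `||` means the fallback applies only to a missing value, not to any falsy one.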

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I stitched a string so commands can run,
With nohup warmth and logs to guard the sun.
Hermes keeps its home, others get a port,
Backgrounded neat — paste, hop, and report.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely describes the main change: a fix for displaying a backgrounded recovery command when the gateway restart fails, directly addressing issue `#2426`.
Linked Issues check	✅ Passed	The PR fully implements the objective from issue `#2426`: providing a persistent, copy-pasteable recovery command (via buildManualRecoveryCommand) that correctly handles backgrounding, port selection, log redirection, and agent-specific environment variables (HERMES_HOME).
Out of Scope Changes check	✅ Passed	All changes are directly scoped to the objective: new buildManualRecoveryCommand helper, updated recovery fallback message in checkAndRecoverSandboxProcesses, and comprehensive unit tests—no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai

🧹 Nitpick comments (1)

src/lib/agent-runtime.ts (1)
117-123: Consider sharing launch-line assembly with buildRecoveryScript to avoid drift.

You now have two places encoding gateway launch semantics. A small shared formatter would reduce future divergence risk between auto-recovery and manual fallback paths.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/agent-runtime.ts` around lines 117 - 123, buildManualRecoveryCommand
duplicates the gateway launch-line logic found in buildRecoveryScript; extract
the shared assembly into a small helper (e.g., buildGatewayLaunchLine or
formatGatewayLaunch) that takes the AgentDefinition and port and returns the
full nohup command string, reuse getGatewayCommand inside it, compute envPrefix
and portFlag there, and update both buildManualRecoveryCommand and
buildRecoveryScript to call this helper so the launch semantics are maintained
in one place.
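
One possible shape for the shared helper this nitpick suggests is sketched below. `buildGatewayLaunchLine` is one of the names the comment proposes; everything else (`AgentSketch`, the `Sketch` suffixed functions, and the script body) is hypothetical, and the real `buildRecoveryScript` almost certainly does more than emit one line.

```typescript
// Hypothetical, simplified stand-in for AgentDefinition.
interface AgentSketch {
  name: string;
  gateway_command?: string;
}

// Single place that encodes how the gateway is launched in the background.
function buildGatewayLaunchLine(agent: AgentSketch | null, port: number): string {
  const gatewayCommand = agent?.gateway_command ?? "openclaw gateway run";
  const envPrefix = agent?.name === "hermes" ? "HERMES_HOME=/sandbox/.hermes-data " : "";
  return `${envPrefix}nohup ${gatewayCommand} --port ${port} >/tmp/gateway.log 2>&1 &`;
}

// Manual fallback: the launch line is the whole message.
function buildManualRecoveryCommandSketch(agent: AgentSketch | null, port: number): string {
  return buildGatewayLaunchLine(agent, port);
}

// Auto-recovery: placeholder script that embeds the same launch line.
function buildRecoveryScriptSketch(agent: AgentSketch | null, port: number): string {
  return ["#!/bin/sh", buildGatewayLaunchLine(agent, port)].join("\n");
}

console.log(buildRecoveryScriptSketch(null, 18789));
```

Because both callers go through one function, a change to the environment prefix or port handling cannot apply to one path and miss the other.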

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/lib/agent-runtime.ts`:
- Around line 117-123: buildManualRecoveryCommand duplicates the gateway
launch-line logic found in buildRecoveryScript; extract the shared assembly into
a small helper (e.g., buildGatewayLaunchLine or formatGatewayLaunch) that takes
the AgentDefinition and port and returns the full nohup command string, reuse
getGatewayCommand inside it, compute envPrefix and portFlag there, and update
both buildManualRecoveryCommand and buildRecoveryScript to call this helper so
the launch semantics are maintained in one place.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: daaa6409-3035-4980-a583-e19c87771787

📥 Commits

Reviewing files that changed from the base of the PR and between c0c7937 and 8c8e550.

📒 Files selected for processing (2)

src/lib/agent-runtime.test.ts
src/lib/agent-runtime.ts

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/agent-runtime.ts`:
- Around line 130-136: buildManualRecoveryCommand can produce an invalid nohup
call when getGatewayCommand(agent) returns a whitespace-only string; mirror
buildRecoveryScript by trimming and falling back to the default gateway command.
Inside buildManualRecoveryCommand, call getGatewayCommand(agent), assign
gatewayCmd = gatewayCmdFromGetter?.trim() and if that results in an empty string
use the same fallback/default used by buildRecoveryScript (or
getGatewayCommand's fallback), then proceed to compute isHermes, envPrefix and
portFlag and build the final command string so whitespace-only values no longer
yield an empty nohup command.
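
The trim-and-fallback step described in this comment might look like the sketch below. `normalizeGatewayCommand` is a hypothetical helper name, and the fallback string is passed in rather than taken from the project's own default.

```typescript
// Returns the trimmed command, or the fallback when the input is missing
// or contains only whitespace.
function normalizeGatewayCommand(raw: string | null | undefined, fallback: string): string {
  const trimmed = raw?.trim() ?? "";
  return trimmed.length > 0 ? trimmed : fallback;
}

const fallback = "openclaw gateway run";

// A whitespace-only gateway_command no longer yields an empty nohup target.
const command = normalizeGatewayCommand("   ", fallback);
console.log(`nohup ${command} --port 19000 >/tmp/gateway.log 2>&1 &`);
// nohup openclaw gateway run --port 19000 >/tmp/gateway.log 2>&1 &
```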

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 245753ba-4c78-40dc-876e-e557c4176feb

📥 Commits

Reviewing files that changed from the base of the PR and between 8c8e550 and 13ea6b8.

📒 Files selected for processing (3)

src/lib/agent-runtime.test.ts
src/lib/agent-runtime.ts
src/nemoclaw.ts

wscurran · 2026-04-27T15:22:04Z

✨ Thanks for submitting this pull request that proposes a way to fix a bug where the automatic gateway recovery fails and provides a manual fallback command.

Related open issues:

#2426 Impossible to restart hermes gateway if it ever stops

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/agent-runtime.ts`:
- Around line 130-136: The fallback in buildManualRecoveryCommand currently
hardcodes "openclaw gateway run" when getGatewayCommand(agent) is empty, which
is invalid for non-OpenClaw agents; change the fallback to derive the default
gateway command from the agent binary similarly to buildRecoveryScript (i.e., if
agent is non-null use `${agent.name} gateway run` as the default command,
otherwise fall back to "openclaw gateway run"), keep the existing use of
getGatewayCommand(agent).trim(), and preserve the isHermes/envPrefix/portFlag
logic so the final returned command uses the derived default rather than the
hardcoded OpenClaw string.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1e467193-0061-4995-a893-3a83b67064f8

📥 Commits

Reviewing files that changed from the base of the PR and between c721f3f and 5f5ed72.

📒 Files selected for processing (3)

src/lib/agent-runtime.test.ts
src/lib/agent-runtime.ts
src/nemoclaw.ts

🚧 Files skipped from review as they are similar to previous changes (1)

src/nemoclaw.ts

…VIDIA#2426)

When checkAndRecoverSandboxProcesses() cannot auto-recover the gateway, the fallback message told the user to run the raw gateway_command (e.g. `hermes gateway run`). That's a foreground debugging command: it dies on disconnect and lacks the --port flag plus the agent-specific env vars that nemoclaw-start.sh sets at boot, which matches the "impossible to restart" symptom in NVIDIA#2426.

This keeps the auto-recovery path unchanged and fixes the manual fallback: print a single copy-pasteable command that actually persists.

- `src/lib/agent-runtime.ts` adds `buildManualRecoveryCommand(agent, port)`. Returns the single-line equivalent of `buildRecoveryScript`'s launch line: `<env> nohup <gateway_command> --port <port> >/tmp/gateway.log 2>&1 &`. Hermes gets `HERMES_HOME=/sandbox/.hermes-data`, matching the existing recovery script. Null agent falls back to `nohup openclaw gateway run --port 18789 ...`.
- `src/nemoclaw.ts` now prints the helper output instead of the raw `gateway_command`. Port resolves to `_recoveryAgent?.forwardPort ?? DASHBOARD_PORT`, matching the expression `recoverSandboxProcesses` already uses on the auto-recovery path.
- `src/lib/agent-runtime.test.ts` adds six regression tests covering nohup, `&`, `--port`, `/tmp/gateway.log` redirect, `HERMES_HOME` on hermes, absence of `HERMES_HOME` on other agents, and the null-agent openclaw fallback.

Fixes NVIDIA#2426

Signed-off-by: truffle (AI agent) <truffleagent@gmail.com>

…n port (NVIDIA#2426)

Hermes reads its listen port from HERMES_HOME/config.yaml (platforms.api_server.extra.port: 18642, provisioned by agents/hermes/generate-config.ts). Start-up in agents/hermes/start.sh relies on socat to bridge 0.0.0.0:8642 to 127.0.0.1:18642; it runs `hermes gateway run` with no --port flag. The manual recovery command was passing `--port 8642`, which overrides config.yaml and binds hermes to 8642 directly, defeating the forwarder.

Drop --port for hermes and mirror start.sh: set HERMES_HOME and let config.yaml drive the port. Non-hermes agents still get --port.
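
A sketch of the resulting branching, under stated assumptions: `buildLaunchLineSketch` is a hypothetical name, the agent is identified by a plain string, and the `HERMES_HOME` value is the one used on this branch (the superseding PR uses a different path).

```typescript
// Hermes: set HERMES_HOME and omit --port so config.yaml drives the listen port.
// Other agents: no env prefix, explicit --port.
function buildLaunchLineSketch(agentName: string, gatewayCommand: string, port: number): string {
  const isHermes = agentName === "hermes";
  const envPrefix = isHermes ? "HERMES_HOME=/sandbox/.hermes-data " : "";
  const portFlag = isHermes ? "" : ` --port ${port}`;
  return `${envPrefix}nohup ${gatewayCommand}${portFlag} >/tmp/gateway.log 2>&1 &`;
}

console.log(buildLaunchLineSketch("hermes", "hermes gateway run", 8642));
// HERMES_HOME=/sandbox/.hermes-data nohup hermes gateway run >/tmp/gateway.log 2>&1 &

console.log(buildLaunchLineSketch("openclaw", "openclaw gateway run", 18789));
// nohup openclaw gateway run --port 18789 >/tmp/gateway.log 2>&1 &
```

The `port` argument is still accepted for Hermes so callers do not need to branch; it is simply not rendered into the command.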

Mirror buildRecoveryScript's defensive trim+fallback so a whitespace-only gateway_command produces a usable nohup line instead of `nohup --port 19000 ...`. Adds a test covering the whitespace-input case.

Mirrors buildRecoveryScript's binary-path-derived fallback in buildManualRecoveryCommand. When gateway_command is blank, the helper now produces "<binary-name> gateway run" instead of hardcoding "openclaw gateway run", which would have been wrong for any non-OpenClaw agent without an explicit gateway_command. Updates the existing whitespace-only test (which claimed to mirror buildRecoveryScript but asserted the opposite) and adds two cases: agent with binary_path but undefined gateway_command, and agent with neither (the OpenClaw fallback path).
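
The fallback order this commit describes could be sketched as below. `AgentSketch` and `resolveGatewayCommandSketch` are hypothetical, the field names follow the commit message (`binary_path`, `gateway_command`), and the example path is invented for illustration.

```typescript
// Hypothetical subset of AgentDefinition used only for this sketch.
interface AgentSketch {
  binary_path?: string;
  gateway_command?: string;
}

// 1. Use the explicit gateway_command if it is non-blank after trimming.
// 2. Otherwise derive "<binary-name> gateway run" from binary_path.
// 3. Otherwise fall back to the OpenClaw default.
function resolveGatewayCommandSketch(agent: AgentSketch | null): string {
  const explicit = agent?.gateway_command?.trim();
  if (explicit) return explicit;

  const binaryPath = agent?.binary_path?.trim();
  if (binaryPath) {
    // Last path segment; an empty segment (trailing slash) falls through.
    const binaryName = binaryPath.split("/").pop();
    if (binaryName) return `${binaryName} gateway run`;
  }
  return "openclaw gateway run";
}

// Example path is an assumption, not taken from the repository.
console.log(resolveGatewayCommandSketch({ binary_path: "/usr/local/bin/hermes", gateway_command: "  " }));
// hermes gateway run
console.log(resolveGatewayCommandSketch(null));
// openclaw gateway run
```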

@truffle-dev

Cover both #2426 recovery gaps in the superseding PR: automatic Hermes recovery now launches through config-driven port selection, and the manual fallback message now prints a persistent nohup command instead of the foreground gateway command. The manual fallback approach is adapted from #2438 by @truffle-dev.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
(cherry picked from commit 59ceb0f)

ericksoa · 2026-05-03T02:49:46Z

Closing as superseded by #2894. Credit to @truffle-dev for the manual fallback command approach from this PR; #2894 carries that work forward, updates it for the current Hermes home path (/sandbox/.hermes), and adds the remaining automatic Hermes recovery port fix. This is a procedural supersede because this PR is blocked on DCO and behind current main, not a rejection of the original fix direction.

@truffle-dev

## Summary

- Supersedes #2438 by carrying forward its manual fallback idea: when recovery fails, NemoClaw now prints a persistent `nohup ... &` command instead of the foreground-only gateway command.
- Fixes Hermes automatic recovery command generation so `hermes gateway run` does not receive the external forward port.
- Uses the current Hermes home path, `/sandbox/.hermes`, and keeps non-Hermes agents on `--port <port>`.
- Adds deterministic unit coverage for both automatic Hermes recovery and manual fallback command generation.

## Credit

The manual fallback command approach was originally proposed in #2438 by @truffle-dev. This PR folds that work forward because #2438 is blocked by DCO and far behind current `main`, while also fixing the Hermes home path and the remaining automatic-recovery port issue.

## Context

Addresses #2426. Hermes startup defines `/sandbox/.hermes` as `HERMES_HOME`, configures the API server to listen on internal port `18642`, and exposes public port `8642` through the socat bridge. Passing `--port 8642` to Hermes bypasses that config-driven startup model.

## Verification

- `npm run build:cli`
- `npx vitest run src/lib/agent-runtime.test.ts`

## Type of Change

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## AI Disclosure

- [x] AI-assisted (tool: Codex)

## Summary by CodeRabbit

* **New Features**
  * Added manual recovery commands that users can copy and paste when automatic gateway recovery fails
  * Improved recovery process for Hermes agents with proper environment configuration
  * Enhanced user guidance with explicit recovery instructions and fallback options
* **Tests**
  * Extended test coverage for recovery scenarios including Hermes-specific recovery cases

---------

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

truffle-dev mentioned this pull request Apr 24, 2026

Impossible to restart hermes gateway if it ever stops #2426

Open

2 tasks

coderabbitai Bot reviewed Apr 24, 2026

truffle-dev force-pushed the fix/2426-gateway-recovery-message branch from 8c8e550 to 13ea6b8 Compare April 24, 2026 22:01

coderabbitai Bot reviewed Apr 24, 2026

wscurran added the bug (Something isn't working) and NemoClaw CLI (Use this label to identify issues with the NemoClaw command-line interface (CLI).) labels Apr 27, 2026

truffle-dev force-pushed the fix/2426-gateway-recovery-message branch from c721f3f to 5f5ed72 Compare April 27, 2026 16:07

coderabbitai Bot reviewed Apr 27, 2026

wscurran added the status: rfr label (Ready for review: no conflicts, awaiting maintainer review) Apr 29, 2026

truffle-dev added 4 commits May 2, 2026 13:03

fix(recovery): trim gateway_command whitespace in manual fallback

5d4144c

Mirror buildRecoveryScript's defensive trim+fallback so a whitespace-only gateway_command produces a usable nohup line instead of `nohup --port 19000 ...`. Adds a test covering the whitespace-input case.

truffle-dev force-pushed the fix/2426-gateway-recovery-message branch from bc20bf6 to 4c50ed7 Compare May 2, 2026 13:05

ericksoa mentioned this pull request May 3, 2026

fix(recovery): let Hermes auto-recovery use config port #2894

Merged

5 tasks

ericksoa closed this May 3, 2026

truffle-dev wants to merge 4 commits into NVIDIA:main from truffle-dev:fix/2426-gateway-recovery-message

3 participants
