
fix(recovery): let Hermes auto-recovery use config port #2894

Merged
ericksoa merged 6 commits into main from fix/2426-hermes-auto-recovery on May 6, 2026

@ericksoa ericksoa commented May 3, 2026

Summary

  • Supersedes #2438, "fix(recovery): show backgrounded command when gateway restart fails (#2426)", by carrying forward its manual fallback idea: when recovery fails, NemoClaw now prints a persistent nohup ... & command instead of the foreground-only gateway command.
  • Fixes Hermes automatic recovery command generation so hermes gateway run does not receive the external forward port.
  • Uses the current Hermes home path, /sandbox/.hermes, and keeps non-Hermes agents on --port <port>.
  • Adds deterministic unit coverage for both automatic Hermes recovery and manual fallback command generation.

Credit

The manual fallback command approach was originally proposed in #2438 by @truffle-dev. This PR folds that work forward because #2438 is blocked by DCO and far behind current main, while also fixing the Hermes home path and the remaining automatic-recovery port issue.

Context

Addresses #2426. Hermes startup defines /sandbox/.hermes as HERMES_HOME, configures the API server to listen on internal port 18642, and exposes public port 8642 through the socat bridge. Passing --port 8642 to Hermes bypasses that config-driven startup model.
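
The port decision described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the AgentLike shape, the name-based Hermes check, the portFlagFor helper, and the non-Hermes port 3000 are all assumptions made for the example.

```typescript
// Hypothetical minimal agent shape; the real manifest type is not shown on this page.
interface AgentLike {
  name: string;
}

// Returns the port argument to append to a gateway command.
// Hermes gets no flag because its own config selects the internal port
// (18642 per the PR description), while socat exposes 8642 externally.
function portFlagFor(agent: AgentLike | null, forwardPort: number): string {
  // Assumption: Hermes is identified by its agent name.
  if (agent?.name === "hermes") {
    return "";
  }
  return `--port ${forwardPort}`;
}

console.log(JSON.stringify(portFlagFor({ name: "hermes" }, 8642))); // prints ""
console.log(JSON.stringify(portFlagFor({ name: "example-agent" }, 3000))); // prints "--port 3000"
```

Keeping the decision in one small helper means automatic recovery and the manual fallback cannot drift apart on which agents receive the flag.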

Verification

  • npm run build:cli
  • npx vitest run src/lib/agent-runtime.test.ts

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

AI Disclosure

  • AI-assisted (tool: Codex)

Summary by CodeRabbit

  • New Features

    • Added manual recovery commands that users can copy and paste when automatic gateway recovery fails
    • Improved recovery process for Hermes agents with proper environment configuration
    • Enhanced user guidance with explicit recovery instructions and fallback options
  • Tests

    • Extended test coverage for recovery scenarios including Hermes-specific recovery cases

coderabbitai Bot commented May 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9bfc687d-a1a8-47e1-a2c3-b25a14fbe65f

📥 Commits

Reviewing files that changed from the base of the PR and between 01a923d and 80d8970.

📒 Files selected for processing (2)
  • src/lib/agent-runtime.test.ts
  • src/lib/sandbox-process-recovery-action.ts

📝 Walkthrough

The PR adds a buildManualRecoveryCommand() function to generate copy-paste recovery commands for gateway agents, with Hermes-specific handling to omit the --port flag and inject environment variables. The feature is integrated into error messaging in sandbox-process-recovery-action.ts to display manual recovery steps to users.

Changes

Manual Recovery Command Feature

| Layer / File(s) | Summary |
| --- | --- |
| Hermes Support Helpers: src/lib/agent-runtime.ts (lines 141–155) | Added hermesGatewayEnvPrefix() to construct Hermes environment variable prefix including HERMES_HOME and proxy settings, and hermesDecodeProxyRecoveryCommand() to start decode-proxy recovery. |
| Core Recovery Functions: src/lib/agent-runtime.ts (lines 211–275) | Updated buildRecoveryScript() to omit --port for Hermes agents and prefix with Hermes env variables; added new exported buildManualRecoveryCommand(agent, port) to generate single copy-paste recovery commands with identical Hermes-specific behavior. |
| User-Facing Integration: src/lib/sandbox-process-recovery-action.ts (lines 296–373) | Derived recoveryPort from agent's forwardPort and replaced gateway command output with calls to buildManualRecoveryCommand() to display explicit manual recovery instructions on gateway recovery failure. |
| Tests & Fixtures: src/lib/agent-runtime.test.ts (lines 6–336) | Added hermesAgent fixture and comprehensive test suite for buildManualRecoveryCommand() covering Hermes port omission, non-Hermes backgrounding with nohup, writable log selection, default command derivation, and null agent fallback. |
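
To make the shape of a copy-paste recovery command concrete, here is a rough sketch of how such a string might be assembled. It is not the buildManualRecoveryCommand() from this PR: the AgentSketch type, the sketchManualRecoveryCommand name, the /tmp/gateway.log path, and the name-derived default command are hypothetical, and the real Hermes prefix also carries proxy settings that are not shown on this page.

```typescript
// Hypothetical agent shape for illustration only.
interface AgentSketch {
  name: string;
  gateway_command?: string;
}

// Wraps a gateway command with nohup and & so it keeps running
// after the shell that launched it exits.
function sketchManualRecoveryCommand(
  agent: AgentSketch,
  port: number,
  logPath: string = "/tmp/gateway.log" // assumed writable log location
): string {
  const isHermes = agent.name === "hermes";
  // HERMES_HOME comes from the PR description; other env vars are omitted here.
  const envPrefix = isHermes ? "HERMES_HOME=/sandbox/.hermes " : "";
  // Assumed default: "<name> gateway run" when the manifest gives no command.
  const gatewayCmd = agent.gateway_command?.trim() || `${agent.name} gateway run`;
  // Hermes reads its port from config, so only other agents get --port.
  const portFlag = isHermes ? "" : ` --port ${port}`;
  return `${envPrefix}nohup ${gatewayCmd}${portFlag} > ${logPath} 2>&1 &`;
}

console.log(sketchManualRecoveryCommand({ name: "hermes" }, 8642));
console.log(sketchManualRecoveryCommand({ name: "example-agent" }, 3000));
```

Placing the environment assignment before nohup applies it to the launched process only, so the pasted line does not alter the user's shell environment.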

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A rabbit hops through recovery paths,
Where Hermes gates need no port math,
Commands copy-paste, fresh and neat,
Manual fixes—now fleet and fleet! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically identifies the main fix: preventing Hermes auto-recovery from using the external forward port and instead relying on config-driven internal port configuration. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

ericksoa added 2 commits May 2, 2026 19:48
Cover both #2426 recovery gaps in the superseding PR: automatic Hermes recovery now launches through config-driven port selection, and the manual fallback message now prints a persistent nohup command instead of the foreground gateway command.

The manual fallback approach is adapted from #2438 by @truffle-dev.

Signed-off-by: Aaron Erickson <[email protected]>
(cherry picked from commit 59ceb0f)

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/agent-runtime.ts`:
- Around line 247-255: The manual recovery command uses binaryName which assumes
the binary is on PATH; update buildManualRecoveryCommand so the fallback
defaultGatewayCommand uses the resolved binaryPath (not binaryName) when
agent?.gateway_command is blank, i.e. construct defaultGatewayCommand from
binaryPath (preserving any existing " gateway run" suffix behavior), so
gatewayCmd falls back to the actual resolved binary path and the printed command
is runnable when copy-pasted; adjust any spacing/quoting around envPrefix,
gatewayCmd and portFlag consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 701556bb-92ea-4169-aff1-7db9bc861adf

📥 Commits

Reviewing files that changed from the base of the PR and between f2a2170 and 12c67b5.

📒 Files selected for processing (3)
  • src/lib/agent-runtime.test.ts
  • src/lib/agent-runtime.ts
  • src/nemoclaw.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/agent-runtime.test.ts

Comment thread src/lib/agent-runtime.ts Outdated
Use resolved binary paths for manual fallback commands when an agent manifest does not provide gateway_command, matching the intent of the CodeRabbit review on #2894.

Also route Hermes recovery and manual fallback through the same decode proxy environment used by agents/hermes/start.sh so recovered gateways preserve credential placeholder handling.

Signed-off-by: Aaron Erickson <[email protected]>
@truffle-dev

Thanks for picking this up. The Hermes env-prefix layering plus the decision to omit --port is the right read of #2426. I had the manual-fallback shape but missed the distinction between the config-driven internal port and the external one (18642 internal, 8642 external via socat). The decode-proxy wait loop is a nice touch.

github-actions Bot commented May 3, 2026

Selective E2E Results — ✅ All requested jobs passed

Run: 25268465794
Branch: fix/2426-hermes-auto-recovery
Requested jobs: hermes-e2e,rebuild-hermes-e2e
Summary: 2 passed, 0 failed, 20 skipped

| Job | Result |
| --- | --- |
| cloud-e2e | ⏭️ skipped |
| cloud-inference-e2e | ⏭️ skipped |
| cloud-onboard-e2e | ⏭️ skipped |
| deployment-services-e2e | ⏭️ skipped |
| diagnostics-e2e | ⏭️ skipped |
| docs-validation-e2e | ⏭️ skipped |
| gpu-e2e | ⏭️ skipped |
| hermes-e2e | ✅ success |
| inference-routing-e2e | ⏭️ skipped |
| messaging-compatible-endpoint-e2e | ⏭️ skipped |
| messaging-providers-e2e | ⏭️ skipped |
| network-policy-e2e | ⏭️ skipped |
| overlayfs-autofix-e2e | ⏭️ skipped |
| rebuild-hermes-e2e | ✅ success |
| rebuild-openclaw-e2e | ⏭️ skipped |
| sandbox-operations-e2e | ⏭️ skipped |
| sandbox-survival-e2e | ⏭️ skipped |
| shields-config-e2e | ⏭️ skipped |
| skill-agent-e2e | ⏭️ skipped |
| snapshot-commands-e2e | ⏭️ skipped |
| token-rotation-e2e | ⏭️ skipped |
| upgrade-stale-sandbox-e2e | ⏭️ skipped |

github-actions Bot commented May 3, 2026

Selective E2E Results — ✅ All requested jobs passed

Run: 25268675592
Branch: fix/2426-hermes-auto-recovery
Requested jobs: issue-2478-crash-loop-recovery-e2e
Summary: 0 passed, 0 failed, 22 skipped

| Job | Result |
| --- | --- |
| cloud-e2e | ⏭️ skipped |
| cloud-inference-e2e | ⏭️ skipped |
| cloud-onboard-e2e | ⏭️ skipped |
| deployment-services-e2e | ⏭️ skipped |
| diagnostics-e2e | ⏭️ skipped |
| docs-validation-e2e | ⏭️ skipped |
| gpu-e2e | ⏭️ skipped |
| hermes-e2e | ⏭️ skipped |
| inference-routing-e2e | ⏭️ skipped |
| messaging-compatible-endpoint-e2e | ⏭️ skipped |
| messaging-providers-e2e | ⏭️ skipped |
| network-policy-e2e | ⏭️ skipped |
| overlayfs-autofix-e2e | ⏭️ skipped |
| rebuild-hermes-e2e | ⏭️ skipped |
| rebuild-openclaw-e2e | ⏭️ skipped |
| sandbox-operations-e2e | ⏭️ skipped |
| sandbox-survival-e2e | ⏭️ skipped |
| shields-config-e2e | ⏭️ skipped |
| skill-agent-e2e | ⏭️ skipped |
| snapshot-commands-e2e | ⏭️ skipped |
| token-rotation-e2e | ⏭️ skipped |
| upgrade-stale-sandbox-e2e | ⏭️ skipped |

@ericksoa ericksoa self-assigned this May 3, 2026
@ericksoa ericksoa added the bug (Something isn't working), Integration: Hermes, and v0.0.34 (Release target) labels May 3, 2026
@cv cv added the v0.0.35 (Release target) label and removed the v0.0.34 (Release target) label May 5, 2026
…recovery

Signed-off-by: Aaron Erickson <[email protected]>

# Conflicts:
#	src/nemoclaw.ts
@ericksoa ericksoa merged commit b3c947e into main May 6, 2026
18 checks passed

Labels

  • bug (Something isn't working)
  • Integration: Hermes
  • v0.0.35 (Release target)

4 participants