feat(academy): federated cross-HPC campaigns via HTTP exchange + multi-site dashboard #136

Open
JinchuLi2002 wants to merge 26 commits into argonne-lcf:dev-globus from JinchuLi2002:academy-dynamic-campaign

Summary

Adds federated cross-HPC ChemGraph Academy campaigns. Two agents on
different HPCs discover each other via deterministic UIDs over
Academy's HTTP exchange, exchange messages, and render in one merged
dashboard driven from the operator's Mac.

End-to-end verified on Aurora + Crux: federated-chat bounced a
counter 1→10 between sites, ~230 events streamed live, zero errors.

How it works

Before this PR: campaigns were single-machine. The daemon used
Redis (started by rank 0) as the exchange, peer registrations were
written to a shared-FS file the other ranks polled, and rank 0
dispatched the kickoff message inline at the end of startup. Every
piece of that — Redis subprocess, shared FS, inline kickoff — assumes
one allocation on one machine. To cross HPCs, all three had to change:
the exchange must be reachable from both sites' compute nodes (no
shared Redis), peer rendezvous must happen without a shared FS, and
kickoff must wait until every site has come up.

Identity (no shared FS): each rank computes peer UIDs locally as
uuid5(NS, "{run_id}/{agent_name}"). Both sites derive the same UID
for each agent without any network lookup. discover() is used as a
UID-keyed liveness probe — not for name resolution, since the hosted
exchange strips names from discovery responses.
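
A minimal sketch of the identity rule above, assuming a placeholder namespace (the PR's actual ``NS`` constant is not shown in this description, so the value below is hypothetical):

```python
import uuid

# Hypothetical namespace: stands in for the PR's NS constant, which is not shown here.
NS = uuid.uuid5(uuid.NAMESPACE_DNS, "chemgraph.example")


def deterministic_agent_uid(run_id: str, agent_name: str) -> uuid.UUID:
    """Derive the same UID on every site from (run_id, agent_name) alone."""
    return uuid.uuid5(NS, f"{run_id}/{agent_name}")


# Both sites compute the UID locally and agree without any network lookup.
aurora_view = deterministic_agent_uid("federated-demo-001", "agent-crux")
crux_view = deterministic_agent_uid("federated-demo-001", "agent-crux")
assert aurora_view == crux_view
```

Because uuid5 is a pure function of its inputs, a different run_id yields a different mailbox UID for the same agent name.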

Exchange: --exchange-type http targets Academy's hosted exchange
(Globus-auth'd) instead of a per-allocation Redis. Both sites' compute
nodes talk to the same public endpoint, so messages cross HPC
boundaries without any direct site-to-site network path.

Bootstrap: in single-machine campaigns rank 0 dispatches the
kickoff message inline. In federated mode every spawn-site passes
--no-bootstrap; the operator runs chemgraph academy bootstrap
once all sites have registered, which sends the kickoff to
initial_agent over the same HTTP exchange.

Dashboard: --system aurora,crux brings up one SSH ControlMaster + UAN
relay + rsync mirror per site, then serves a merged event view.
Per-site events.jsonl are interleaved by timestamp; per-site
status/placement are merged to top-level keys so existing
single-site renderers work unchanged.

See examples/academy/federated-chat/e2e_guide.md for the full
four-terminal walkthrough.

What's new

  • Cross-site identity via deterministic UIDs.
  • chemgraph academy bootstrap standalone subcommand.
  • --exchange-type http with proxy passthrough through
    mpiexec --genv so MPI ranks can reach the public exchange.
  • Multi-site dashboard (--system aurora,crux).
  • Self-cleaning UAN relay with SIGTERM handlers and per-UAN
    process scan (handles Aurora's uan-0001..0010 round-robin alias).
  • Operator-visible lifecycle prints in daemon + agent.
  • federated-chat packaged campaign + e2e guide.

Compatibility

Single-machine run-compute flow is byte-identical to pre-PR
behavior. Federated paths are gated on --agents / --no-bootstrap
/ --exchange-type http being explicitly set.

Test plan

  • End-to-end federated-chat Aurora ↔ Crux (counter 1→10, ~230 events)
  • +770 LoC tests covering exchange dispatch, bootstrap, compute
    launcher, dashboard, deterministic UID properties

JinchuLi2002 and others added 25 commits June 18, 2026 15:03
Mirrors the pattern already used on academy-synth-topology. Allows
local journals (e.g. symlinked from ~/.config/chemgraph-journals/)
to coexist in the repo without ever being staged.
Wire Academy's HTTP exchange (default URL: Academy-hosted
https://exchange.academy-agents.org/v1, Globus-Auth gated) as a
fourth exchange type alongside redis/local/hybrid. Validated
end-to-end on an Aurora compute node running example-002:
5 agents register against the hosted exchange, coordinator receives
bootstrap, LM traffic flows through the existing UAN relay. This is
the first time a ChemGraph Academy campaign has run on Aurora
without Redis as the messaging substrate, and the technical
groundwork for cross-HPC (e.g. Mac<->Aurora<->Polaris) campaigns.

Plumbing
- runtime/exchange.py: SUPPORTED_EXCHANGE_TYPES constant covers
  ('redis', 'local', 'hybrid', 'http') so CLI choices and dispatch
  table can't drift. New 'http' branch constructs HttpExchangeFactory
  with optional override URL. exchange_uses_redis() helper lets the
  launcher gate the rank-0 Redis subprocess without inlining the set.
- core/campaign.py: ChemGraphDaemonConfig.http_exchange_url field
  (None = use Academy-hosted default).
- runtime/registration.py: HttpAgentRegistration added to the
  _REGISTRATION_TYPES dispatch so per-rank registration files can
  round-trip through disk for the http exchange.
- runtime/daemon.py, runtime/mpi.py: matching --exchange-type
  choices, --http-exchange-url flag, observability snapshot.
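
A rough sketch of the constant-plus-helper pattern described for runtime/exchange.py. The set of Redis-backed types comes from the launcher note in this commit; raising on unknown types is an assumption, not confirmed behavior:

```python
# Single source of truth for CLI choices and the dispatch table.
SUPPORTED_EXCHANGE_TYPES = ("redis", "local", "hybrid", "http")

# Types that need the rank-0 redis-server subprocess (redis and hybrid per this commit).
_REDIS_BACKED = frozenset({"redis", "hybrid"})


def exchange_uses_redis(exchange_type: str) -> bool:
    """Tell the launcher whether to start Redis for this exchange type."""
    if exchange_type not in SUPPORTED_EXCHANGE_TYPES:
        # Assumed behavior: fail loudly instead of silently skipping Redis.
        raise ValueError(f"unsupported exchange type: {exchange_type!r}")
    return exchange_type in _REDIS_BACKED
```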

Aurora-specific compute_launcher.py fixes
- _prepare_environment: do NOT strip http_proxy/https_proxy from
  os.environ when exchange_type=='http'. Aurora's profile lists those
  in unset_env for the LM relay path (loopback 127.0.0.1) which is
  correct for redis runs but breaks http exchange. Without this fix
  the parent Python had no proxy vars so the --genv flags never got
  populated, and ranks couldn't reach the public internet.
- mpiexec cmd: append --genvall plus explicit --genv KEY=VAL pairs
  for proxy vars when exchange_type=='http'. PALS's documented
  --genvall default empirically did not forward our parent env;
  explicit per-var flags were required.
- run_allocation: skip rank-0 redis-server subprocess for any
  exchange that doesn't need Redis (was inline 'in {redis,hybrid}',
  now uses exchange_uses_redis helper).
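
The proxy passthrough can be sketched roughly as follows. ``proxy_genv_flags`` is a hypothetical helper name; only the ``--genvall`` and ``--genv KEY=VAL`` flag shapes and the two proxy variable names are taken from the notes above:

```python
# The two proxy variables named in this commit message.
PROXY_VARS = ("http_proxy", "https_proxy")


def proxy_genv_flags(exchange_type: str, env: dict) -> list:
    """Build the extra mpiexec arguments that forward proxy vars to ranks."""
    if exchange_type != "http":
        return []  # redis/local/hybrid runs keep the previous command line
    flags = ["--genvall"]
    for key in PROXY_VARS:
        value = env.get(key)
        if value:  # only forward variables that are actually set
            flags += ["--genv", f"{key}={value}"]
    return flags


env = {"http_proxy": "http://proxy.alcf.anl.gov:3128"}
print(proxy_genv_flags("http", env))
# ['--genvall', '--genv', 'http_proxy=http://proxy.alcf.anl.gov:3128']
```

This only works if the parent environment still holds the proxy variables, which is why ``_prepare_environment`` must not strip them for the http exchange.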

Tests (19 passing across the two suites)
- exchange dispatch parametrized over all four types
- SUPPORTED_EXCHANGE_TYPES integrity vs the dispatch table
- exchange_uses_redis answers pinned per type
- HttpExchangeFactory built with hosted default when url is None,
  with custom URL when provided
- HttpAgentRegistration round-trips through write/load
- run_allocation skips Redis subprocess for http exchange
- --http-exchange-url forwarded to daemon argv when set, omitted
  when None
- compute_launcher tests pass with the new env-prep signature

Operator prerequisites for --exchange-type http on Aurora
- Globus token cached at ~/local/share/academy/storage.db (run any
  HttpExchangeFactory() once interactively to log in via Globus).
- http_proxy / https_proxy set to the ALCF proxy
  (http://proxy.alcf.anl.gov:3128) before invoking
  'chemgraph academy run-compute'.
- ALCF_USER set to the *workspace* username (e.g. jinchu), which
  may differ from the SSH login (e.g. jinchuli).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Foundation for the federated ``spawn-site`` flow. The daemon can now
launch a named subset of a campaign's agents instead of the whole
roster, and rank 0 can skip its in-process bootstrap dispatch so
kickoff is deferred to a separate operator-driven step. Both
behaviors are opt-in; existing ``run-compute`` single-machine
campaigns are untouched.

core/campaign.py
- ``filter_agents(campaign, names)`` returns a new ``ChemGraphCampaign``
  with only the named agents, preserving order so MPI rank-to-agent
  mapping stays deterministic. Rejects empty selections, duplicate
  names, and names not declared on the campaign. Deliberately does
  NOT rewrite ``initial_agent`` -- in the federated flow that name
  may refer to an agent hosted on another site.
- ``ChemGraphDaemonConfig`` gains two fields with backward-compatible
  defaults: ``agents: tuple[str, ...] = ()`` (empty = launch every
  declared agent) and ``skip_bootstrap: bool = False``.
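
An illustrative sketch of the ``filter_agents`` contract, using stand-in dataclasses rather than the real ``ChemGraphCampaign`` type. It assumes "preserving order" means the campaign's declaration order:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Agent:  # stand-in for the real agent spec
    name: str


@dataclass(frozen=True)
class Campaign:  # stand-in for ChemGraphCampaign
    agents: tuple
    initial_agent: str


def filter_agents(campaign: Campaign, names: tuple) -> Campaign:
    """Return a copy of the campaign holding only the named agents."""
    if not names:
        raise ValueError("agent selection is empty")
    if len(set(names)) != len(names):
        raise ValueError("agent selection contains duplicate names")
    declared = {agent.name for agent in campaign.agents}
    unknown = [name for name in names if name not in declared]
    if unknown:
        raise ValueError(f"agents not declared on the campaign: {unknown}")
    wanted = set(names)
    # Keep declaration order so the rank-to-agent mapping stays deterministic.
    kept = tuple(agent for agent in campaign.agents if agent.name in wanted)
    # initial_agent is left untouched: it may name an agent on another site.
    return replace(campaign, agents=kept)
```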

runtime/daemon.py
- ``--agents <comma-list>`` CLI flag, parsed by ``_parse_agents_arg``
  (whitespace-trimmed, empty-segment-tolerant). When set, the daemon
  applies ``filter_agents`` BEFORE ``validate_campaign`` so the
  downstream ``selected_agent(campaign, rank)`` and
  ``wait_for_agent_statuses_finished(campaign=...)`` both see the
  local slice only.
- ``--no-bootstrap`` flag. Rank 0's bootstrap dispatch is now gated
  by ``not skip_bootstrap AND initial_agent in registrations``; the
  second clause naturally handles the case where ``initial_agent``
  lives on another site. The skipped path emits a new
  ``bootstrap_message_skipped`` system trace recording the reason
  (flag vs. non-local agent) so investigators can tell "deferred
  to operator" apart from "silently forgot".

Tests: 30/30 existing academy tests pass with the new defaults.
Focused tests for filter_agents + slicing arrive with the
``spawn-site`` CLI in the follow-up commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operator-facing piece of the federated flow: ``chemgraph academy
spawn-site`` launches one site of a multi-site campaign. Same
arguments as ``run-compute`` plus the slice selector ``--agents
worker-a,worker-b``; internal bootstrap is always skipped (the
operator triggers kickoff once every site is up, via the dedicated
``bootstrap`` subcommand that lands in a follow-up commit).

UX target (Aurora + Crux + Mac dashboard):
  # Aurora compute node
  chemgraph academy spawn-site -- \\
    --system aurora --campaign federated-demo.jsonc \\
    --agents coordinator-agent --exchange-type http

  # Crux compute node
  chemgraph academy spawn-site -- \\
    --system crux --campaign federated-demo.jsonc \\
    --agents worker-a,worker-b --exchange-type http

  # Mac (later, after both sides are up)
  chemgraph academy bootstrap -- ...

core/campaign.py
- ``parse_agents_selection(raw)`` promotes the comma-list parser to a
  public helper so launcher and daemon agree on whitespace / empty-
  segment handling. Duplicate detection lives in ``filter_agents``
  so the user-facing error appears in one place regardless of the
  input path.
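
A possible shape for the shared parser, as a sketch. Returning an empty tuple for ``None`` or ``""`` is an assumption chosen to match the ``agents: tuple[str, ...] = ()`` default:

```python
from typing import Optional


def parse_agents_selection(raw: Optional[str]) -> tuple:
    """Split a comma-list into trimmed names, tolerating empty segments."""
    if not raw:
        return ()  # assumed: None and "" both mean "no slice requested"
    # Duplicates are deliberately kept; filter_agents reports them.
    return tuple(part.strip() for part in raw.split(",") if part.strip())


print(parse_agents_selection(" worker-a, ,worker-b,"))  # ('worker-a', 'worker-b')
```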

runtime/compute_launcher.py
- ``--agents`` + ``--no-bootstrap`` flags. ``AllocationPlan`` gains
  matching ``agents: tuple[str, ...]`` and ``skip_bootstrap: bool``
  fields, both with backward-compatible defaults so the existing
  ``run-compute`` flow is unchanged.
- ``prepare_compute_launch`` derives ``agent_count`` from the slice
  length when ``--agents`` is given; refuses to mix a contradicting
  explicit ``--agent-count`` rather than silently picking one. The
  mpiexec ``-n`` value therefore always matches the daemon's
  post-filter agent count.
- ``run_allocation`` forwards ``--agents`` and ``--no-bootstrap``
  into the daemon argv only when set.

runtime/daemon.py
- Drops the private ``_parse_agents_arg`` helper in favor of the
  shared ``parse_agents_selection`` import.

cli/main.py
- ``academy spawn-site`` subcommand registered. Implementation is a
  thin shell over ``compute_main`` that prepends ``--no-bootstrap``
  if the operator didn't already include it -- ``spawn-site`` is
  semantically ``run-compute`` with bootstrap disabled and an agent
  slice required.

Tests (+11, 30 -> 41 in the two touched suites; 63/63 across full
academy sweep)
- parse_agents_selection: trimming, empty segments, None / "" input
- filter_agents: order preservation, unknown-name rejection,
  empty-selection rejection, duplicate-name rejection
- prepare_compute_launch: derives agent_count from --agents, rejects
  contradicting --agent-count
- run_allocation: --agents and --no-bootstrap are forwarded when
  set, omitted when default (so single-machine flow is byte-
  identical)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…egistration

Replace the shared-filesystem JSON file
(``<run_dir>/academy_registrations.json``) with exchange-mediated
discovery. The old mechanism required rank 0 to register every agent
on the campaign and write the resulting registrations to disk for the
other ranks to pick up. That works for a single allocation on a
single FS; it cannot span machines, which blocks federated
``spawn-site`` campaigns spread across Aurora + Crux + ...

New flow
- Each rank registers ONLY its own local agent via
  ``transport.register_agent(ChemGraphLogicalAgent, name=...)``.
- Each rank discovers cross-rank / cross-site peers by polling
  ``transport.discover()`` with a wall-clock timeout, filtering the
  returned ``AgentId`` tuple client-side by ``AgentId.name``.
- No rank-0 special role for registration. Convergence is per-site:
  each rank exits the discovery loop as soon as its own
  ``allowed_peers`` are all visible on the exchange, regardless of
  what other ranks / sites are doing.
- ``bootstrap_message_dispatched`` rule simplified to
  ``initial_agent == agent_spec.name`` (instead of "name in
  registrations dict"); semantically identical for single-machine
  runs, correct for federated runs.

runtime/registration.py: gutted and rewritten. Old surface area
(``load_academy_registrations``, ``write_academy_registrations``,
``wait_academy_registrations``, ``registration_payload``,
``academy_registration_path``, ``_REGISTRATION_TYPES``,
``_exchange_type_of``) deleted in favor of a single async helper
``discover_peer_agent_ids(transport, peer_names, *, agent_class,
timeout_s, poll_interval_s)``. Returns ``dict[name, AgentId]`` for
``Handle`` construction. Times out with a message listing the
missing peer names so operators can immediately tell which site
failed to register.

runtime/daemon.py: registration block + bootstrap dispatch reworked
to match the new flow. Code shrinks: the rank-0 / rank-N branch is
gone; the post-block "if rank == 0: reload registrations" hack is
gone; ``registrations`` dict and its key lookups replaced with
``registration`` (own) plus ``peer_agent_ids`` (discovered).

observability/run_artifacts.py: ``clear_run_outputs`` no longer
deletes the dead ``academy_registrations.json`` filename.

tests/test_academy_exchange_registration.py: file-based round-trip
tests removed (their target functions no longer exist). Replaced
with discovery-helper tests against a ``_FakeTransport`` whose
``discover()`` returns pre-configured rounds:
  * empty peer list short-circuits without any discover() calls
  * happy path returns name -> AgentId for requested peers only,
    even when discover() also returns other agents (cross-operator
    isolation depends on this filter)
  * waits across multiple polls for late peers (the federated
    convergence story)
  * times out with the missing peer names in the message
  * first-found-wins for a re-seen peer name across polls

Run-id name-prefixing remains an operator-runbook convention until
auto-prefixing lands; without it, two operators running concurrent
demos against the same hosted exchange would see each other's
agents in their ``discover()`` results.

Tests: 62/62 academy sweep (was 63; net -1 because the parametrized
file-round-trip test was 4 cases and the replacement is 4 helpers +
1 short-circuit case).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The federated piece the operator runs after every site is up. In a
single-machine campaign, rank 0 of the daemon dispatches the
``campaign`` -> ``initial_agent`` bootstrap message in-process; in a
federated campaign that's impossible because the agent that owns
``initial_agent`` may live on a different machine that hasn't even
come up yet. ``spawn-site`` already skips the inline dispatch
(``--no-bootstrap``). This commit adds the matching standalone
command that triggers the kickoff at the right moment from anywhere
with the cached Globus token.

UX:
  chemgraph academy bootstrap -- \\
    --campaign federated-demo.jsonc \\
    --exchange-type http

  # or override the recipient for partial re-runs / debugging
  chemgraph academy bootstrap -- \\
    --campaign federated-demo.jsonc \\
    --recipient worker-a \\
    --exchange-type http

runtime/bootstrap.py (new)
- ``parse_args``: --campaign (required), --recipient (defaults to
  campaign.initial_agent), --exchange-type (defaults to 'http' since
  that's the main use case), --http-exchange-url override, redis
  triple for the local-broker case, --discover-timeout-s.
- ``dispatch_bootstrap``: opens a user client on the configured
  exchange, discovers the recipient by name via the shared
  ``discover_peer_agent_ids`` helper, sends one
  ``receive_message`` action, closes the client (also on error so
  the aiohttp session backing the http transport doesn't leak).
- ``main``: returns exit code 2 with a stderr message when the
  recipient never appears on the exchange, so wrapping shell
  scripts can branch on "bootstrap didn't actually happen."
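
The close-on-error and exit-code behavior might look roughly like this. ``FakeClient`` and the ``peer_visible`` switch are stand-ins for the real exchange client and discovery step, not the PR's code:

```python
import asyncio
import sys


class FakeClient:
    """Stand-in for the exchange user client; records whether it was closed."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


async def dispatch_bootstrap(client, recipient, peer_visible):
    try:
        if not peer_visible:  # stand-in for a discovery timeout
            raise TimeoutError(f"recipient {recipient!r} never appeared on the exchange")
        return f"kickoff sent to {recipient}"
    finally:
        await client.close()  # runs on success and on error


def main(recipient, peer_visible, client=None):
    client = client or FakeClient()
    try:
        asyncio.run(dispatch_bootstrap(client, recipient, peer_visible))
    except TimeoutError as exc:
        print(f"bootstrap failed: {exc}", file=sys.stderr)
        return 2  # lets wrapping shell scripts detect a missed kickoff
    return 0
```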

cli/main.py
- ``academy bootstrap`` subparser + dispatch in ``_handle_academy``.
- Usage hint updated to include the new command.

Tests (6 new, 68/68 academy sweep)
- parse_args: --campaign required, exchange-type default, recipient
  override
- dispatch_bootstrap: happy-path discovery + handle action, sender /
  recipient / message_id consistency, campaign user_task embedded in
  the dispatched content, client closed on success
- dispatch_bootstrap: client closed on timeout too (no Handle
  construction attempted when discovery fails)
- main: returns 2 on TimeoutError and writes the missing recipient
  name to stderr

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The piece that lets one Mac terminal serve a federated campaign
running across multiple HPC sites. Operator runs ONE dashboard
command with ``--system aurora,crux``; the launcher spins up
per-site SSH tunnels + UAN relays + rsync mirrors, and the server
walks the merged subdir tree to render one timeline tagged by site.

runtime/dashboard_launcher.py
- ``--system`` now takes a comma-list ('aurora,crux'). Single-site
  invocations are unchanged (the value resolves to a 1-tuple and
  the rest of the launcher uses the same per-site setup helper).
- Per-site setup extracted into ``_setup_site`` returning a
  ``_SiteHandle`` carrying everything the cleanup ``finally`` block
  needs. ``main`` loops over the resolved tuple; failure on any site
  triggers teardown of the partially-set-up sites.
- Each site gets its own reverse-port (base + site_index) so two
  SSH ``-R`` tunnels don't collide on the Mac side.
- Multi-site mode rejects scalar overrides that can't sensibly
  apply to every site (--remote-host, --ssh-control-path,
  --relay-port, --lm-base-url, --local-run-dir). Operators encode
  site differences in the profile JSON instead.
- Single-site mirror layout unchanged
  (``<root>/<run_id>/``); multi-site mirrors under
  ``<root>/<run_id>/<system>/``.
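
The port and mirror-directory rules above can be sketched as below. ``site_layout`` and its dictionary keys are hypothetical names used only for illustration:

```python
from pathlib import Path


def site_layout(root, run_id, systems, base_reverse_port):
    """Map each site to its reverse-tunnel port and local mirror directory."""
    multi_site = len(systems) > 1
    layout = {}
    for index, system in enumerate(systems):
        mirror = Path(root) / run_id
        if multi_site:
            mirror = mirror / system  # <root>/<run_id>/<system>/
        layout[system] = {
            "reverse_port": base_reverse_port + index,  # base + site_index
            "mirror_dir": mirror,
        }
    return layout


layout = site_layout("mirrors", "federated-demo-001", ("aurora", "crux"), 9000)
print(layout["crux"]["reverse_port"])  # 9001
```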

dashboard/server.py
- ``_iter_site_dirs`` detects layout: if ``events.jsonl`` is at the
  top level it's legacy single-site; otherwise walk subdirs and
  treat each as a site if it has ``events.jsonl`` OR
  ``dashboard_metadata.json``. The metadata check covers the
  early-startup window where a site is up but no events have been
  written yet, so federated dashboards don't briefly look like
  "empty single-site".
- ``events_payload``: legacy shape preserved for single-site;
  federated merges sites in timestamp order with a ``site`` tag on
  each event so the UI can color/group per-site.
- ``status_payload``: legacy keys preserved for single-site;
  federated nests per-site status/placement/summary under
  ``sites: {<name>: ...}`` with a top-level ``updated`` reflecting
  the latest per-site update.
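
A simplified sketch of the layout detection and event merge described for dashboard/server.py. The ``timestamp`` field name and the return shapes are assumptions; the real payloads carry more than a flat event list:

```python
import json
import tempfile
from pathlib import Path

SITE_MARKERS = ("events.jsonl", "dashboard_metadata.json")


def iter_site_dirs(run_dir):
    """Return [(site_name, directory)]; site_name is None for legacy single-site."""
    if (run_dir / "events.jsonl").exists():
        return [(None, run_dir)]
    sites = [
        (child.name, child)
        for child in sorted(run_dir.iterdir())
        if child.is_dir() and any((child / m).exists() for m in SITE_MARKERS)
    ]
    return sites or [(None, run_dir)]  # empty dir falls back to single-site


def events_payload(run_dir):
    sites = iter_site_dirs(run_dir)
    federated = any(name is not None for name, _ in sites)
    events = []
    for name, directory in sites:
        path = directory / "events.jsonl"
        if not path.exists():  # metadata-only site: up, but no events yet
            continue
        for line in path.read_text().splitlines():
            if line.strip():
                event = json.loads(line)
                if federated:
                    event["site"] = name  # tag so the UI can group per site
                events.append(event)
    if federated:
        events.sort(key=lambda e: e["timestamp"])  # "timestamp" key is assumed
    return events


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for site, ts in (("aurora", 2.0), ("crux", 1.0)):
        (root / site).mkdir()
        (root / site / "events.jsonl").write_text(json.dumps({"timestamp": ts}) + "\n")
    merged = events_payload(root)

print([e["site"] for e in merged])  # ['crux', 'aurora']
```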

Tests (+10, 78/78 academy sweep)
- _iter_site_dirs: recognizes metadata-only sites; falls back to
  single-site for empty dirs
- events_payload: merges + tags by site; timestamp-sorted output
  even when sites are seeded reverse-order
- status_payload: nests under ``sites`` for federated, preserves
  legacy keys for single-site (regression guard against an
  accidental "make them uniform" refactor)
- _parse_systems_list: single name, comma-list with whitespace,
  rejects empty, rejects duplicates

Aurora ⇄ Crux demo runbook (operator runs once both sites have a
system profile in the repo):

  # Mac terminal A
  chemgraph academy dashboard -- federated-demo-001 \\
    --system aurora,crux --campaign federated-demo.jsonc

  # Aurora compute
  chemgraph academy spawn-site -- \\
    --system aurora --campaign federated-demo.jsonc \\
    --agents coordinator-agent --exchange-type http

  # Crux compute
  chemgraph academy spawn-site -- \\
    --system crux --campaign federated-demo.jsonc \\
    --agents worker-a --exchange-type http

  # Mac terminal B
  chemgraph academy bootstrap -- \\
    --campaign federated-demo.jsonc --exchange-type http

(Crux profile JSON is still TODO -- pre-requisite for the actual
demo, not for the dashboard code.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lidation

The non-code artifacts that turn the federation primitives shipped
in B.1-B.4c into an actually runnable Aurora ⇄ Crux demo.

runtime/profiles/crux.template.json (new)
- Mirrors polaris.template.json (/eagle paths) but with Crux-specific
  bits: separate venv directory name (academy-swarm-crux) so it
  doesn't collide with the existing x86_64 Polaris venv on the same
  /eagle workspace; -crux suffix on the relay host file so per-site
  relays in the multi-site dashboard don't fight over the same path.
- Registered in profiles/__init__.py BUILTIN_SYSTEM_PROFILES so
  ``chemgraph academy spawn-site --system crux`` and the multi-site
  dashboard launcher both recognize it.
- Same unset_env policy as Aurora/Polaris -- proxies stripped by
  default for the LM-relay path; the launcher's exchange_type=='http'
  branch already overrides this so http exchange works via the
  ALCF proxy (proxy reachability empirically verified on Crux
  compute today).

campaigns/federated-hello/ (new)
- Two agents (agent-aurora, agent-crux), each declaring the other as
  its only allowed peer. No MCP servers, no resources, no science
  tools -- the smallest possible end-to-end campaign that exercises
  cross-site discovery + cross-site send_message + LM-driven
  decision turns. ~$0.01-0.05 of GPT-5-mini calls per run.
- agent-aurora's mission: send ONE 'hello from aurora' to agent-crux,
  finish_turn, then finish_turn on every subsequent wakeup.
- agent-crux's mission: wait, reply ONCE, finish_turn. Strong
  anti-loop guidance in both missions + the prompt profile.
- prompt_profiles/default.json: tight system + protocol prompts that
  explicitly say "no science tools, only send_message and
  finish_turn." langchain_recursion_limit=32 since neither agent
  should ever loop more than a handful of rounds.
- lm_config.json: GPT-5-mini template (no temperature field, since
  reasoning models reject non-default values -- the launcher's
  auto-strip would handle it but cleaner to just omit).
- Registered under 'federated-hello' in CAMPAIGNS +
  CAMPAIGN_LAUNCH_DEFAULTS so ``--campaign federated-hello`` works
  as a packaged name (no rsync of the campaign dir required).

core/campaign.py: validate_campaign(*, federated=False)
- New keyword-only flag loosens two single-machine assumptions that
  break in federated spawn-site flows:
    * initial_agent may name an agent hosted on another site
    * each agent's allowed_peers may reference cross-site agents
  Both are looked up via the exchange at runtime, so the validator
  legitimately can't pre-check them in a federated slice.
- Intra-slice checks (duplicate names, self-peer, MCP server / tool
  / resource resolvability) still run. Self-peer in particular
  stays a hard error because it would loop messages regardless of
  how many sites the campaign spans.
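
As a sketch, the relaxation could be structured like this. The campaign is modeled as a plain dict for illustration, and the MCP server / tool / resource checks are omitted:

```python
def validate_campaign(campaign: dict, *, federated: bool = False) -> None:
    """Raise ValueError on an invalid campaign; cross-site refs pass when federated."""
    names = [agent["name"] for agent in campaign["agents"]]
    if len(set(names)) != len(names):
        raise ValueError("duplicate agent names")
    local = set(names)
    for agent in campaign["agents"]:
        for peer in agent.get("allowed_peers", ()):
            if peer == agent["name"]:
                # Stays a hard error in both modes: it would loop messages.
                raise ValueError(f"{agent['name']} lists itself as a peer")
            if not federated and peer not in local:
                raise ValueError(f"unknown peer {peer!r} for {agent['name']}")
    if not federated and campaign["initial_agent"] not in local:
        raise ValueError("initial_agent is not declared on the campaign")
```

In a federated slice the unresolved names are checked later, at runtime, when the daemon waits for those peers on the exchange.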

runtime/daemon.py
- Passes federated=bool(config.agents) to validate_campaign. The
  presence of an --agents slice is the canonical indicator of
  "I'm one site of a federated launch." Single-machine
  run-compute flows pass federated=False (the default), so prior
  behavior is byte-identical.

Tests (+2, 80/80 academy sweep green; was 78)
- validate_campaign federated=True accepts the cross-site peer
  reference in a federated-hello slice that strict validation
  rejects (regression guard for the relaxation).
- validate_campaign federated=True still rejects self-peer
  (regression guard against accidentally relaxing too much).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The minimum-scope UI work that makes the federation story visible
in the dashboard. Without it the merged event payload (B.4c.2)
landed on a UI that displayed everything as if it were a single
machine -- timeline still rendered, agent graph still rendered, but
operators / demo viewers had no visual cue that the campaign was
spanning multiple HPCs.

What the UI now shows in federated runs:
* Header bar: "Sites: aurora (1🤖 / 12📨) · crux (1🤖 / 8📨)" so
  the multi-site nature is immediately legible from the top of
  the dashboard.
* Agent-graph swimlanes labelled by site ("aurora", "crux")
  instead of by individual compute hostnames ("x4708...",
  "x1000...") -- same nodes, same edges, far clearer story.
* Message-flow detail panel: route is labelled "cross-site"
  (federated) or "cross-node" (single-machine) depending on
  context, with "From site" / "To site" rows showing aurora vs
  crux. The literal hostname is still available in each agent's
  detail panel.
* Cross-node-messages metric becomes meaningful as "messages that
  crossed the HPC boundary" in federated runs.

Single-site runs are visually byte-identical to before:
``snapshot.federated`` is false so ``agentGroup`` falls back to
``agentHost``, the sitesBadge stays hidden, route labels stay
"cross-node" / "same-node", detail rows stay "From host" /
"To host". Test suite (80/80) confirms server-side payload shape
is unchanged for single-site.

Implementation
- ``load()``: detect ``snapshot.sites`` (set by server-side
  ``_iter_site_dirs`` in B.4c.2), set ``snapshot.federated``,
  build a flat ``sitesByAgent`` index from
  ``sites[*].status.agents`` and ``sites[*].placement.agents``,
  backfilled from per-event ``site`` tags as authoritative.
- ``agentSite(agent)`` / ``agentGroup(agent)``: the single point
  where federated vs single-site rendering diverges. Every
  renderer that asks "what bucket does this agent belong to" now
  goes through ``agentGroup`` instead of ``agentHost``.
- ``renderSitesBadge()``: header-bar federation indicator with
  per-site agent counts and per-site event counts.
- Three message-route detail panels updated to label by group
  rather than hardcoding "host", and to show "cross-site" in
  federated mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For accounts whose SSH login differs from the workspace directory
name. ALCF_USER drives every path interpolation
(``/flare/${ALCF_PROJECT}/${ALCF_USER}/``) while ALCF_SSH_USER drives
only ``remote_host`` (``${ALCF_SSH_USER}@aurora.alcf.anl.gov``). The
two collided on a single env var until now, forcing operators to
choose: set ALCF_USER for paths and get the wrong SSH login (which
triggered an ALCF Cyber security challenge on Aurora), or set it
right for SSH and have all the run-dir / venv paths point at a
non-existent directory.

The relevant operator on this repo has SSH login ``jinchuli`` but
their Aurora/Crux/Polaris workspace lives under
``/{flare,eagle}/<proj>/jinchu/`` (no trailing 'i'), so the
ALCF_USER=jinchu setting was producing the right paths but the
wrong SSH user. Now they set ALCF_USER=jinchu for paths and
ALCF_SSH_USER=jinchuli for SSH and both work.

Default ALCF_SSH_USER to ALCF_USER when unset, so the majority of
users for whom the two are equal don't have to set both.

system.py
- New ``_expand_with(text, env)`` does ``os.path.expandvars``-style
  substitution against a caller-supplied env dict rather than the
  process environment, so the SSH-USER default doesn't leak into
  ``os.environ`` for subsequent callers.
- ``load_system_profile`` copies the environ, fills in the default,
  and substitutes through ``_expand_with``.
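
One way to get this behavior, sketched with ``string.Template`` as a stand-in for the real ``_expand_with`` (whose implementation is not shown here). ``profile_env`` is a hypothetical helper name:

```python
from string import Template


def expand_with(text: str, env: dict) -> str:
    """Substitute $VAR / ${VAR} from env, leaving unknown names untouched."""
    return Template(text).safe_substitute(env)


def profile_env(environ: dict) -> dict:
    """Copy the environment and default ALCF_SSH_USER to ALCF_USER."""
    env = dict(environ)  # copy, so the default never leaks into the caller's dict
    if "ALCF_USER" in env:
        env.setdefault("ALCF_SSH_USER", env["ALCF_USER"])
    return env


env = profile_env({"ALCF_USER": "jinchu", "ALCF_SSH_USER": "jinchuli"})
print(expand_with("${ALCF_SSH_USER}@aurora.alcf.anl.gov", env))
# jinchuli@aurora.alcf.anl.gov
```

One difference from ``os.path.expandvars`` worth noting: ``Template`` treats ``$$`` as an escaped dollar sign.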

profiles/{aurora,crux,polaris}.template.json
- ``remote_host`` now interpolates ``${ALCF_SSH_USER}``; every
  other field still uses ``${ALCF_USER}`` for the path component.

Tests: 80/80 academy sweep still green. Default-case behavior (both
env vars equal) is byte-identical to the prior single-var setup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ide; crux relay port 18187

Two operational fixes from the Aurora<->Crux federated demo.

compute_launcher.py
- Default startup_timeout_s 120s -> 600s. The realistic worst case
  for federated launches is one site's HPC queue wait + Python
  imports outlasting the other site's peer-discovery patience; 120s
  is clearly too short, while 600s absorbs debug-scaling / workq
  scheduling delays. Single-machine launches reach
  discover_peer_agent_ids in seconds so the new ceiling never
  matters for them.
- New --startup-timeout-s CLI flag so operators can extend the
  window further when they know a site will be slow.

profiles/crux.template.json
- Bump relay_port 18186 -> 18187 to dodge a leftover ssh -R
  reverse-tunnel that's still bound to 127.0.0.1:18186 on
  crux-uan-0001 from a prior failed dashboard launch. Follow-up
  cleanup: launcher should probe for a free port instead of
  insisting on the profile's hardcoded one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The daemon was opaque during its slow stretches -- import,
registration, peer discovery, runtime entry, and waiting for the
bootstrap message all happened silently from the operator terminal.
You could not distinguish "still importing" from "stuck on discovery"
from "alive and waiting for bootstrap" without tailing events.jsonl
or checking the dashboard.

Add landmark prints at each stage, all grep-able as ``[daemon]`` or
``[agent <name>]``:

daemon.py
- ``[daemon] rankN registered <name> on the exchange`` -- own-
  registration completed; next step is peer discovery
- ``[daemon] rankN discovering peers [...] (timeout 600s)...`` --
  entering the wait
- ``[daemon] rankN discovered N peer(s): [...]`` -- past discovery,
  about to enter Runtime
- ``[daemon] rankN agent <name> is now running inside Academy
  Runtime`` -- agent is alive and listening
- ``[daemon] rankN dispatched inline bootstrap to <initial>`` /
  ``... skipping inline bootstrap (federated mode); waiting for
  chemgraph academy bootstrap ...`` so the operator knows whether
  to fire the standalone bootstrap subcommand

core/agent.py
- ``[agent <name>] first message arrived from <sender> (kind=...):
  <tldr>`` on the FIRST inbox message. For the federated demo the
  recipient agents both print this -- agent-aurora when bootstrap
  lands, agent-crux when the hello arrives. Concrete "kickoff
  worked" signal without needing the dashboard.

All prints flush=True so they survive PALS/MPICH buffering when
mpiexec is forwarding many ranks stdout simultaneously.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dezvous

The B.4 federated demo kept timing out at discovery -- both sites
registered their agents, neither could find the other. Root cause
turned out to be that Academy's hosted HttpExchange strips
``AgentId.name`` from ``discover()`` responses: only ``uid`` and
``role`` round-trip. Our name-based filter ``if agent_id.name in
wanted`` was silently never matching across sites because every
discovered AgentId came back with name=None. (The original ChemGraph
test suite missed this because it used ``AgentId.new('worker-a')``
fakes that preserve the name -- the same fakes the real hosted
exchange does not.)

Replacement: deterministic UIDs.

registration.py
- ``deterministic_agent_uid(run_id, agent_name)`` derives a stable
  uuid5 from a fixed namespace + ``"{run_id}/{agent_name}"``. Same
  inputs on every site produce the same UID, so each rank
  constructs every peer's AgentId locally instead of needing
  ``discover()`` to echo the name back.
- ``deterministic_agent_id(run_id, agent_name)`` builds the full
  AgentId with the local name preserved (for trace events) and
  the deterministic UID.
- ``register_agent_with_uid(transport, agent_class, agent_id)``
  bypasses the SDK's ``register_agent`` (which always generates a
  random UID via ``AgentId.new``) and POSTs the pre-built
  deterministic AgentId directly to the same mailbox endpoint.
- ``wait_for_peers_alive(transport, peer_ids, ...)`` replaces
  ``discover_peer_agent_ids``. Matches on UID (preserved by
  discover()) instead of name (stripped). Times out with a
  message listing missing peer names+UIDs.
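
The UID derivation described above can be sketched with the standard
library alone. The namespace constant below is a placeholder (this
commit message does not state the real namespace value), so treat it
as an illustration of the scheme, not as the code in registration.py.

```python
import uuid

# Hypothetical fixed namespace, for illustration only; the real value
# is defined in registration.py and is not shown in this message.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "chemgraph.example")


def deterministic_agent_uid(run_id: str, agent_name: str) -> uuid.UUID:
    """Derive a stable uuid5 from a fixed namespace + "{run_id}/{agent_name}"."""
    return uuid.uuid5(EXAMPLE_NAMESPACE, f"{run_id}/{agent_name}")


# Two sites that agree on run_id and agent name compute the same UID
# locally, without any network lookup.
aurora_view = deterministic_agent_uid("demo-run", "agent-crux")
crux_view = deterministic_agent_uid("demo-run", "agent-crux")
```

Because uuid5 is a pure function of its inputs, changing either the
run-id or the agent name yields a different mailbox UID.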

daemon.py
- Imports + uses the new helpers. Each rank computes its own
  AgentId deterministically and registers with it, then computes
  every peer's AgentId locally and waits for the peer's mailbox
  to be visible on the exchange. No "discover by name" anywhere.
- Runtime is still handed a real HttpAgentRegistration wrapping
  the deterministic AgentId, so the agent runs unchanged.
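
A minimal sketch of the "wait until every peer mailbox is visible"
step, assuming a ``discover`` coroutine that returns the UIDs the
exchange currently knows about. The function name ``wait_for_uids``
and its parameters are hypothetical; the real ``wait_for_peers_alive``
takes a transport and AgentIds and may differ in detail.

```python
import asyncio
import time
import uuid
from typing import Awaitable, Callable, Dict, Iterable


async def wait_for_uids(
    discover: Callable[[], Awaitable[Iterable[uuid.UUID]]],
    wanted: Dict[uuid.UUID, str],
    timeout_s: float = 600.0,
    poll_s: float = 1.0,
) -> None:
    """Poll discover() until every wanted UID is visible, or time out."""
    if not wanted:
        return  # nothing to wait for
    deadline = time.monotonic() + timeout_s
    while True:
        seen = set(await discover())
        # Match on UID only: names are not assumed to round-trip.
        missing = {uid: name for uid, name in wanted.items() if uid not in seen}
        if not missing:
            return
        if time.monotonic() >= deadline:
            detail = ", ".join(f"{name} ({uid})" for uid, name in missing.items())
            raise TimeoutError(f"peers not visible after {timeout_s}s: {detail}")
        await asyncio.sleep(poll_s)
```

The ``wanted`` mapping keeps the local name next to each UID purely
so the timeout message can name the missing peers.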

bootstrap.py
- New ``--run-id`` required arg. The recipient's mailbox UID is
  derived from (run-id, recipient-name); operator must pass the
  same run-id they used for spawn-site or the bootstrap addresses
  a different mailbox than the daemons registered.
- Bumped ``--discover-timeout-s`` default 120s -> 600s to match
  spawn-site's startup_timeout_s.
- Uses ``deterministic_agent_id`` + ``wait_for_peers_alive``
  instead of name-based discovery.

Side effect: agent names are now campaign-scoped via the run-id.
Two operators running the SAME campaign with the SAME run-id will
collide on the mailbox UIDs and the second registration will fail
with "mailbox already exists" -- correct fail-fast behavior. The
old run-id-prefixing convention from the original docstring is now
load-bearing rather than advisory.

Tests (+5, 85/85 academy sweep green)
- deterministic_agent_uid: stable; differs by run_id; differs by
  agent_name
- deterministic_agent_id: name preserved locally
- wait_for_peers_alive: empty list short-circuits; succeeds when
  all UIDs present (with names stripped, mirroring the real
  exchange response); waits across polls for late peers; times
  out naming missing UIDs; ignores unrelated agents
- bootstrap: requires --run-id; defaults discover-timeout to 600s;
  sends to deterministic recipient AgentId; closes client on
  timeout; main() returns 2 with stderr message

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…event names

The skip-trace I added in 5549dbb (operator-visible daemon
lifecycle prints) writes a system trace with event name
bootstrap_message_skipped, but that name was never added to
the CampaignEvent.event Literal enum. The pydantic validator
rejected it, crashing the daemon RIGHT AFTER the
[daemon] ... is now running inside Academy Runtime print.

Cosmetic-but-fatal regression that the test suite missed because
no test exercises the skip-bootstrap code path through
append_system_trace -- the federated demo is the first place
this code path runs end to end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e context

handle.action reads the outbound exchange from a contextvar that
is only set when a client is active. Runtime sets it for daemon-
side code, but the standalone bootstrap command needs to set it
explicitly via client.register_handle(handle) -- otherwise
Handle.action raises ExchangeClientNotFoundError.

The first federated demo attempt failed here: discovery succeeded,
the message was built, the Handle was constructed -- and the
action call died because the Handle did not know which client to
route through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…le finds exchange

UserExchangeClient.__aenter__ is what sets the
academy.handle.exchange_context ContextVar that Handle.action reads
to find the outbound exchange. The prior register_handle-only fix
binds the handle for inbox routing but does NOT set the contextvar,
so the action call still raised ExchangeClientNotFoundError.

Restructure dispatch_bootstrap to run the whole send inside
async with client: -- exchange_context is set on enter, restored on
exit. The aiohttp session gets closed by __aexit__, so the explicit
client.close() became redundant.
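
The mechanism can be illustrated with a stand-in client. ``DemoClient``
and ``send_via_context`` below are hypothetical names, not Academy
APIs; the sketch only shows why a send that looks up a ContextVar
succeeds inside ``async with`` and fails outside it.

```python
import asyncio
from contextvars import ContextVar

# Stand-in for the exchange context variable (illustrative only).
_active_client = ContextVar("active_client")


class DemoClient:
    async def __aenter__(self):
        # Entering the context is what makes the client discoverable.
        self._token = _active_client.set(self)
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Restore the previous value on exit.
        _active_client.reset(self._token)


def send_via_context() -> str:
    """Mimic a send that locates its client through the ContextVar."""
    try:
        _active_client.get()
    except LookupError:
        return "no active client"
    return "sent"


async def demo() -> tuple:
    before = send_via_context()
    async with DemoClient():
        inside = send_via_context()
    after = send_via_context()
    return before, inside, after
```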

Test fixture _FakeClient is now an async-context-manager stand-in;
the two close-on-success / close-on-timeout assertions check
enter_count/exit_count instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…odels

Ported from the synth branch. GPT-5* and o1/o3/o4* reject any non-
default temperature with HTTP 400 'Unsupported value: temperature
does not support 0.0'. Both ChatOpenAI construction sites
(load_openai_model and agent/turn._custom_openai_compatible_kwargs)
now consult is_reasoning_model() and drop temperature + the other
sampling knobs when the model is one of those.

Same module-level is_reasoning_model() helper as on the synth
branch so a future merge stays mechanical.
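
A hedged sketch of what such a helper could look like. The prefix
pattern and the set of dropped keys are assumptions based on the model
families named above (only ``temperature`` is named explicitly), and
``strip_sampling_kwargs`` is a hypothetical name; the actual helper
may match models differently.

```python
import re

# Assumed pattern for the families named in this commit: GPT-5*, o1, o3, o4*.
_REASONING_RE = re.compile(r"^(gpt-5|o1|o3|o4)", re.IGNORECASE)

# Assumed sampling knobs to drop; only temperature is confirmed above.
_SAMPLING_KEYS = ("temperature", "top_p")


def is_reasoning_model(model_name: str) -> bool:
    """Return True when the model name starts with a known reasoning family."""
    return bool(_REASONING_RE.match(model_name.strip()))


def strip_sampling_kwargs(model_name: str, kwargs: dict) -> dict:
    """Return a copy of kwargs without sampling knobs for reasoning models."""
    if not is_reasoning_model(model_name):
        return dict(kwargs)
    return {k: v for k, v in kwargs.items() if k not in _SAMPLING_KEYS}
```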

This was the last bug between the federated-hello demo daemons
making their first LM call and completing the round trip. Both
sites successfully discovered each other, received the bootstrap
message, and entered their first reasoning round; the round
crashed at the LM call because the demo uses GPT-5-mini.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…paign

Federated-hello produced 2-3 events per site and the demo was over
before the dashboard could render anything. Federated-chat is a
back-and-forth counter game between agent-aurora and agent-crux:
each turn one agent increments a counter and sends to the peer,
until the counter hits 10. ~6 reasoning rounds per agent = ~40
events total in the merged dashboard timeline, plus visible
message-flow with cross-site Route labels.

Same two-agent shape as federated-hello so the same operator
runbook works -- only --campaign federated-chat changes.

Registered under the name 'federated-chat' with max_decisions=20 to
leave slack for retries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tus to top level

The server returns per-site state nested under snapshot.sites[<site>].status /
.placement in federated mode, but agents() reads snapshot.status?.agents, so
the dashboard rendered an empty graph + zeroed metrics for federated runs
even though events were streaming through correctly. Synthesize merged
top-level status/placement during load() so every existing single-site
reader (agents, metrics, workflow mode detection) works unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…he port

The relay had no signal handlers, so SIGTERM hit the default disposition
and bash kill(1) calls were silently ignored. The python kept running,
the port stayed bound, and the next launch failed with "Address already
in use" -- requiring a manual UAN sweep to recover. Install SIGTERM/SIGINT
handlers that close the listen socket and exit cleanly, with try/except
around accept() so the close-from-handler path returns instead of
tracebacking.
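
A minimal sketch of the handler pattern, assuming a plain blocking TCP
accept loop. All function names here are hypothetical and the relay's
actual structure may differ.

```python
import signal
import socket


def make_listener(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Create a listening TCP socket (port 0 lets the OS pick a free port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


def make_shutdown_handler(sock: socket.socket):
    """Build a signal handler that releases the port by closing the socket."""
    def _shutdown(signum, frame):
        sock.close()
    return _shutdown


def install_shutdown_handlers(sock: socket.socket) -> None:
    """Register the handler for SIGTERM and SIGINT (main thread only)."""
    handler = make_shutdown_handler(sock)
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, handler)


def serve(sock: socket.socket, handle) -> None:
    """Accept connections until the listen socket is closed."""
    while True:
        try:
            conn, _addr = sock.accept()
        except OSError:
            # The socket was closed (e.g. from the signal handler):
            # return cleanly instead of raising a traceback.
            return
        with conn:
            handle(conn)
```

A caller would run ``install_shutdown_handlers(sock)`` before
``serve(sock, handle)`` so that a kill frees the port for the next
launch.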

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…race and self-kill guard

Aurora's login alias round-robins across uan-0001..uan-0010, so a single
pid file on the shared FS was meaningless: the pid only exists on the
UAN that ran the relay, and the next launch usually lands on a different
UAN where that pid is either absent or belongs to someone else. As a
result every crashed launch left an orphan relay holding the port, and
manual ssh-into-each-UAN cleanup was the only recovery path.

Replace the single-pid bookkeeping with per-UAN cleanup that scans
ps for python processes whose argv contains the relay script path,
owned by $USER, excluding $$ and $PPID. The self-exclusion is load-
bearing: pgrep -f matched our own bash script (the relay script path
appears in our argv as well), so the previous attempt killed the
caller instead of the orphan.

Also drop set -e (pgrep returning 1 was triggering silent exits with
no log output) and add exec 2>&1 + set -x so the local relay log
contains a full trace when something fails -- previously failures
produced empty logs and "Local relay log:" with nothing after it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Conflicts (tests only, no src/ conflicts):
- tests/test_academy_dashboard.py
- tests/test_academy_exchange_registration.py

Both resolved by keeping dev-globus's `pytest.importorskip("academy")`
guard and dropping the academy.exchange.{local,hybrid,redis} imports
that this branch no longer uses (the deterministic-UID rendezvous
replaced them).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed-chat

federated-chat covers the same Aurora<->Crux federation smoke path
with more dashboard material to render, so the hello campaign is
dead weight. Remove the campaign files + registry entries, and
re-point the validate_campaign(federated=True) regression test
+ registration.py docstring example to federated-chat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eholder

Two related ergonomics fixes for spawn-site / run-compute launches:

* When --lm-user is omitted, fall back to $ARGO_USER from the env.
  HPC users already export ARGO_USER for the rest of the ChemGraph
  workflow, so requiring a duplicate --lm-user flag was busywork.
* _write_lm_config now refuses to ship lm_config.json with the
  template's literal "<argo-user>" placeholder. Argo would otherwise
  silently accept the launch and only reject at first LM call time,
  after the daemon + relay stack was already running -- expensive to
  debug. The hard error names the fix directly.
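
Both behaviors can be sketched as follows. ``resolve_lm_user`` and
``check_lm_config`` are hypothetical names, and scanning the serialized
config for the placeholder is an assumption about how the guard could
work, not a description of ``_write_lm_config`` itself.

```python
import json
import os
from typing import Mapping, Optional

PLACEHOLDER = "<argo-user>"


def resolve_lm_user(
    cli_value: Optional[str], env: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Prefer the explicit flag value; otherwise fall back to $ARGO_USER."""
    env = os.environ if env is None else env
    return cli_value or env.get("ARGO_USER")


def check_lm_config(config: dict) -> dict:
    """Refuse a config that still carries the template placeholder."""
    if PLACEHOLDER in json.dumps(config):
        raise ValueError(
            f"lm_config still contains {PLACEHOLDER!r}; "
            "pass --lm-user or export ARGO_USER before launching"
        )
    return config
```

Failing at write time keeps the error next to the launch command
instead of surfacing at the first LM call.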

Tests: stub HttpExchangeFactory in test_academy_exchange_registration
so the http-dispatch tests don't try to authenticate against the real
hosted exchange.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four-terminal walkthrough for the cross-HPC demo: dashboard on Mac
+ spawn-site on Aurora + spawn-site on Crux + bootstrap kickoff.
Mirrors the existing example-002 guide's shape, swaps in the federated
flow (deterministic peer UIDs, HTTP exchange, ALCF proxy passthrough,
Globus device-flow login). README links to the e2e guide and points at
the packaged campaign location.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JinchuLi2002 JinchuLi2002 reopened this Jun 22, 2026
@JinchuLi2002 JinchuLi2002 changed the base branch from main to dev-globus June 22, 2026 18:59
* dashboard_launcher.py: split `import os, shlex, ...` onto separate
  lines (E401).
* mpi.py: drop unused `write_json_atomic` import (F401), left over
  from the file-based registration scheme that was deleted in 52fa7b5.

Pre-existing ruff failures elsewhere in the repo (parsl_tools,
mace_mcp_parsl, etc.) are not from this PR and are left untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>