feat(academy): federated cross-HPC campaigns via HTTP exchange + multi-site dashboard by JinchuLi2002 · Pull Request #136 · argonne-lcf/ChemGraph

JinchuLi2002 · 2026-06-22T18:56:23Z

Summary

Adds federated cross-HPC ChemGraph Academy campaigns. Two agents on
different HPCs discover each other via deterministic UIDs over
Academy's HTTP exchange, exchange messages, and render in one merged
dashboard driven from the operator's Mac.

End-to-end verified on Aurora + Crux: federated-chat bounced a
counter 1→10 between sites, ~230 events streamed live, zero errors.

How it works

Before this PR: campaigns were single-machine. The daemon used
Redis (started by rank 0) as the exchange, peer registrations were
written to a shared-FS file the other ranks polled, and rank 0
dispatched the kickoff message inline at the end of startup. Every
piece of that — Redis subprocess, shared FS, inline kickoff — assumes
one allocation on one machine. To cross HPCs, all three had to change:
the exchange must be reachable from both sites' compute nodes (no
shared Redis), peer rendezvous must happen without a shared FS, and
kickoff must wait until every site has come up.

Identity (no shared FS): each rank computes peer UIDs locally as
uuid5(NS, "{run_id}/{agent_name}"). Both sites derive the same UID
for each agent without any network lookup. discover() is used as a
UID-keyed liveness probe — not for name resolution, since the hosted
exchange strips names from discovery responses.
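
A minimal sketch of the identity scheme described above, assuming a fixed namespace UUID shared by every site. The PR only names it "NS", so the namespace value below is a hypothetical placeholder, and the function body is an illustration rather than the repository's code.

```python
import uuid

# Hypothetical namespace: the PR text only calls it "NS". Any fixed UUID works
# as long as every site uses the same one.
NS = uuid.uuid5(uuid.NAMESPACE_DNS, "chemgraph-academy.example")


def deterministic_agent_uid(run_id: str, agent_name: str) -> uuid.UUID:
    """Derive a stable UID from (run_id, agent_name) with no network lookup."""
    return uuid.uuid5(NS, f"{run_id}/{agent_name}")


# Two sites computing the UID of the same peer independently.
aurora_view = deterministic_agent_uid("federated-demo-001", "agent-crux")
crux_view = deterministic_agent_uid("federated-demo-001", "agent-crux")
```

Because uuid5 is a pure hash of the namespace and the name string, both sites arrive at the same value, while a different run id or agent name yields a different mailbox UID.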

Exchange: --exchange-type http targets Academy's hosted exchange
(Globus-auth'd) instead of a per-allocation Redis. Both sites' compute
nodes talk to the same public endpoint, so messages cross HPC
boundaries without any direct site-to-site network path.

Bootstrap: in single-machine campaigns rank 0 dispatches the
kickoff message inline. In federated mode every spawn-site passes
--no-bootstrap; the operator runs chemgraph academy bootstrap
once all sites have registered, which sends the kickoff to
initial_agent over the same HTTP exchange.

Dashboard: --system aurora,crux brings up one SSH ControlMaster +
UAN relay + rsync mirror per site, then serves a merged event
view. Per-site events.jsonl are interleaved by timestamp; per-site
status/placement are merged to top-level keys so existing
single-site renderers work unchanged.
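
The timestamp interleaving can be sketched as follows. This is an illustration only: the `timestamp` key and the in-memory input shape are assumptions, and the real server reads each site's events.jsonl from the mirrored run directory.

```python
def merge_site_events(events_by_site: dict[str, list[dict]]) -> list[dict]:
    """Tag every event with its site, then interleave all sites by timestamp.

    Assumes each event carries a comparable "timestamp" value (hypothetical
    field name), e.g. a float or an ISO-8601 string.
    """
    tagged = [
        {**event, "site": site}
        for site, events in events_by_site.items()
        for event in events
    ]
    # sorted() is stable, so events with equal timestamps keep their site order.
    return sorted(tagged, key=lambda event: event["timestamp"])


merged = merge_site_events(
    {
        "aurora": [{"timestamp": 1.0, "kind": "send"}, {"timestamp": 3.0, "kind": "send"}],
        "crux": [{"timestamp": 2.0, "kind": "reply"}],
    }
)
```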

See examples/academy/federated-chat/e2e_guide.md for the full
four-terminal walkthrough.

What's new

Cross-site identity via deterministic UIDs.
chemgraph academy bootstrap standalone subcommand.
--exchange-type http with proxy passthrough through
mpiexec --genv so MPI ranks can reach the public exchange.
Multi-site dashboard (--system aurora,crux).
Self-cleaning UAN relay with SIGTERM handlers and per-UAN
process scan (handles Aurora's uan-0001..0010 round-robin alias).
Operator-visible lifecycle prints in daemon + agent.
federated-chat packaged campaign + e2e guide.

Compatibility

Single-machine run-compute flow is byte-identical to pre-PR
behavior. Federated paths are gated on --agents / --no-bootstrap
/ --exchange-type http being explicitly set.

Test plan

End-to-end federated-chat Aurora ↔ Crux (counter 1→10, ~230 events)
+770 LoC tests covering exchange dispatch, bootstrap, compute
launcher, dashboard, deterministic UID properties

Mirrors the pattern already used on academy-synth-topology. Allows local journals (e.g. symlinked from ~/.config/chemgraph-journals/) to coexist in the repo without ever being staged.

Wire Academy's HTTP exchange (default URL: Academy-hosted https://exchange.academy-agents.org/v1, Globus-Auth gated) as a fourth exchange type alongside redis/local/hybrid. Validated end-to-end on an Aurora compute node running example-002: 5 agents register against the hosted exchange, coordinator receives bootstrap, LM traffic flows through the existing UAN relay. This is the first time a ChemGraph Academy campaign has run on Aurora without Redis as the messaging substrate, and the technical groundwork for cross-HPC (e.g. Mac<->Aurora<->Polaris) campaigns. Plumbing - runtime/exchange.py: SUPPORTED_EXCHANGE_TYPES constant covers ('redis', 'local', 'hybrid', 'http') so CLI choices and dispatch table can't drift. New 'http' branch constructs HttpExchangeFactory with optional override URL. exchange_uses_redis() helper lets the launcher gate the rank-0 Redis subprocess without inlining the set. - core/campaign.py: ChemGraphDaemonConfig.http_exchange_url field (None = use Academy-hosted default). - runtime/registration.py: HttpAgentRegistration added to the _REGISTRATION_TYPES dispatch so per-rank registration files can round-trip through disk for the http exchange. - runtime/daemon.py, runtime/mpi.py: matching --exchange-type choices, --http-exchange-url flag, observability snapshot. Aurora-specific compute_launcher.py fixes - _prepare_environment: do NOT strip http_proxy/https_proxy from os.environ when exchange_type=='http'. Aurora's profile lists those in unset_env for the LM relay path (loopback 127.0.0.1) which is correct for redis runs but breaks http exchange. Without this fix the parent Python had no proxy vars so the --genv flags never got populated, and ranks couldn't reach the public internet. - mpiexec cmd: append --genvall plus explicit --genv KEY=VAL pairs for proxy vars when exchange_type=='http'. PALS's documented --genvall default empirically did not forward our parent env; explicit per-var flags were required. 
- run_allocation: skip rank-0 redis-server subprocess for any exchange that doesn't need Redis (was inline 'in {redis,hybrid}', now uses exchange_uses_redis helper). Tests (19 passing across the two suites) - exchange dispatch parametrized over all four types - SUPPORTED_EXCHANGE_TYPES integrity vs the dispatch table - exchange_uses_redis answers pinned per type - HttpExchangeFactory built with hosted default when url is None, with custom URL when provided - HttpAgentRegistration round-trips through write/load - run_allocation skips Redis subprocess for http exchange - --http-exchange-url forwarded to daemon argv when set, omitted when None - compute_launcher tests pass with the new env-prep signature Operator prerequisites for --exchange-type http on Aurora - Globus token cached at ~/local/share/academy/storage.db (run any HttpExchangeFactory() once interactively to log in via Globus). - http_proxy / https_proxy set to the ALCF proxy (http://proxy.alcf.anl.gov:3128) before invoking 'chemgraph academy run-compute'. - ALCF_USER set to the *workspace* username (e.g. jinchu), which may differ from the SSH login (e.g. jinchuli). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
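
A minimal sketch of the two launcher decisions described in this commit: whether rank 0 needs a Redis subprocess, and which proxy flags to append to the mpiexec command. The helper `proxy_genv_flags` and the `PROXY_VARS` tuple are hypothetical names for illustration. The `--genv KEY=VAL` form follows the commit text above, not an independently checked mpiexec reference.

```python
from typing import Mapping

SUPPORTED_EXCHANGE_TYPES = ("redis", "local", "hybrid", "http")
_REDIS_BACKED = frozenset({"redis", "hybrid"})
PROXY_VARS = ("http_proxy", "https_proxy")  # assumption: only these two are forwarded


def exchange_uses_redis(exchange_type: str) -> bool:
    """Return True when rank 0 must start a redis-server subprocess."""
    if exchange_type not in SUPPORTED_EXCHANGE_TYPES:
        raise ValueError(f"unsupported exchange type: {exchange_type!r}")
    return exchange_type in _REDIS_BACKED


def proxy_genv_flags(exchange_type: str, env: Mapping[str, str]) -> list[str]:
    """Build extra mpiexec flags so ranks inherit the proxy settings."""
    if exchange_type != "http":
        return []
    flags = ["--genvall"]
    for name in PROXY_VARS:
        value = env.get(name)
        if value:  # skip unset or empty proxy variables
            flags += ["--genv", f"{name}={value}"]
    return flags


flags = proxy_genv_flags("http", {"http_proxy": "http://proxy.alcf.anl.gov:3128"})
```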

Foundation for the federated ``spawn-site`` flow. The daemon can now launch a named subset of a campaign's agents instead of the whole roster, and rank 0 can skip its in-process bootstrap dispatch so kickoff is deferred to a separate operator-driven step. Both behaviors are opt-in; existing ``run-compute`` single-machine campaigns are untouched. core/campaign.py - ``filter_agents(campaign, names)`` returns a new ``ChemGraphCampaign`` with only the named agents, preserving order so MPI rank-to-agent mapping stays deterministic. Rejects empty selections, duplicate names, and names not declared on the campaign. Deliberately does NOT rewrite ``initial_agent`` -- in the federated flow that name may refer to an agent hosted on another site. - ``ChemGraphDaemonConfig`` gains two fields with backward-compatible defaults: ``agents: tuple[str, ...] = ()`` (empty = launch every declared agent) and ``skip_bootstrap: bool = False``. runtime/daemon.py - ``--agents <comma-list>`` CLI flag, parsed by ``_parse_agents_arg`` (whitespace-trimmed, empty-segment-tolerant). When set, the daemon applies ``filter_agents`` BEFORE ``validate_campaign`` so the downstream ``selected_agent(campaign, rank)`` and ``wait_for_agent_statuses_finished(campaign=...)`` both see the local slice only. - ``--no-bootstrap`` flag. Rank 0's bootstrap dispatch is now gated by ``not skip_bootstrap AND initial_agent in registrations``; the second clause naturally handles the case where ``initial_agent`` lives on another site. The skipped path emits a new ``bootstrap_message_skipped`` system trace recording the reason (flag vs. non-local agent) so investigators can tell "deferred to operator" apart from "silently forgot". Tests: 30/30 existing academy tests pass with the new defaults. Focused tests for filter_agents + slicing arrive with the ``spawn-site`` CLI in the follow-up commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
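
The slicing rules for ``filter_agents`` can be illustrated with a simplified stand-in for ``ChemGraphCampaign``. The `Campaign` dataclass below is hypothetical and holds only agent names, and the sketch assumes "preserving order" means the campaign's declaration order, which the commit text does not spell out.

```python
from dataclasses import dataclass, replace
from typing import Sequence


@dataclass(frozen=True)
class Campaign:
    """Hypothetical, simplified stand-in for ChemGraphCampaign."""
    agents: tuple[str, ...]
    initial_agent: str


def filter_agents(campaign: Campaign, names: Sequence[str]) -> Campaign:
    """Return a new campaign containing only the named agents."""
    if not names:
        raise ValueError("agent selection is empty")
    if len(set(names)) != len(names):
        raise ValueError(f"duplicate agent names in selection: {list(names)}")
    unknown = [name for name in names if name not in campaign.agents]
    if unknown:
        raise ValueError(f"agents not declared on the campaign: {unknown}")
    wanted = set(names)
    # Keep declaration order (assumption) so rank-to-agent mapping is stable.
    kept = tuple(agent for agent in campaign.agents if agent in wanted)
    # initial_agent is deliberately left untouched: it may live on another site.
    return replace(campaign, agents=kept)


full = Campaign(agents=("coordinator-agent", "worker-a", "worker-b"),
                initial_agent="coordinator-agent")
crux_slice = filter_agents(full, ["worker-b", "worker-a"])
```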

Operator-facing piece of the federated flow: ``chemgraph academy spawn-site`` launches one site of a multi-site campaign. Same arguments as ``run-compute`` plus the slice selector ``--agents worker-a,worker-b``; internal bootstrap is always skipped (the operator triggers kickoff once every site is up, via the dedicated ``bootstrap`` subcommand that lands in a follow-up commit).

UX target (Aurora + Crux + Mac dashboard):

    # Aurora compute node
    chemgraph academy spawn-site -- \
        --system aurora --campaign federated-demo.jsonc \
        --agents coordinator-agent --exchange-type http

    # Crux compute node
    chemgraph academy spawn-site -- \
        --system crux --campaign federated-demo.jsonc \
        --agents worker-a,worker-b --exchange-type http

    # Mac (later, after both sides are up)
    chemgraph academy bootstrap -- ...

core/campaign.py
- ``parse_agents_selection(raw)`` promotes the comma-list parser to a public helper so launcher and daemon agree on whitespace / empty-segment handling. Duplicate detection lives in ``filter_agents`` so the user-facing error appears in one place regardless of the input path.

runtime/compute_launcher.py
- ``--agents`` + ``--no-bootstrap`` flags. ``AllocationPlan`` gains matching ``agents: tuple[str, ...]`` and ``skip_bootstrap: bool`` fields, both with backward-compatible defaults so the existing ``run-compute`` flow is unchanged.
- ``prepare_compute_launch`` derives ``agent_count`` from the slice length when ``--agents`` is given; refuses to mix a contradicting explicit ``--agent-count`` rather than silently picking one. MPI ``-n`` therefore always matches the daemon's post-filter agent ordering.
- ``run_allocation`` forwards ``--agents`` and ``--no-bootstrap`` into the daemon argv only when set.

runtime/daemon.py
- Drops the private ``_parse_agents_arg`` helper in favor of the shared ``parse_agents_selection`` import.

cli/main.py
- ``academy spawn-site`` subcommand registered.
  Implementation is a thin shell over ``compute_main`` that prepends ``--no-bootstrap`` if the operator didn't already include it -- ``spawn-site`` is semantically ``run-compute`` with bootstrap disabled and an agent slice required.

Tests (+11, 30 -> 41 in the two touched suites; 63/63 across full academy sweep)
- parse_agents_selection: trimming, empty segments, None / "" input
- filter_agents: order preservation, unknown-name rejection, empty-selection rejection, duplicate-name rejection
- prepare_compute_launch: derives agent_count from --agents, rejects contradicting --agent-count
- run_allocation: --agents and --no-bootstrap are forwarded when set, omitted when default (so single-machine flow is byte-identical)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
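
A minimal sketch of the comma-list parsing and the agent-count rule described in this commit. ``parse_agents_selection`` is named in the commit, but its body here is an illustration. ``resolve_agent_count`` is a hypothetical helper standing in for the relevant part of ``prepare_compute_launch``.

```python
from typing import Optional


def parse_agents_selection(raw: Optional[str]) -> tuple[str, ...]:
    """Split a comma list, trimming whitespace and dropping empty segments."""
    if not raw:
        return ()
    return tuple(part.strip() for part in raw.split(",") if part.strip())


def resolve_agent_count(
    agents: tuple[str, ...],
    explicit_count: Optional[int],
    declared_count: int,
) -> int:
    """Pick the MPI rank count: the slice length wins, contradictions are errors."""
    if not agents:
        return explicit_count if explicit_count is not None else declared_count
    if explicit_count is not None and explicit_count != len(agents):
        raise ValueError(
            f"--agent-count {explicit_count} contradicts --agents ({len(agents)} names)"
        )
    return len(agents)


selection = parse_agents_selection(" worker-a, ,worker-b,")
count = resolve_agent_count(selection, None, declared_count=5)
```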

…egistration Replace the shared-filesystem JSON file (``<run_dir>/academy_registrations.json``) with exchange-mediated discovery. The old mechanism required rank 0 to register every agent on the campaign and write the resulting registrations to disk for the other ranks to pick up. That works for a single allocation on a single FS; it cannot span machines, which blocks federated ``spawn-site`` campaigns spread across Aurora + Crux + ... New flow - Each rank registers ONLY its own local agent via ``transport.register_agent(ChemGraphLogicalAgent, name=...)``. - Each rank discovers cross-rank / cross-site peers by polling ``transport.discover()`` with a wall-clock timeout, filtering the returned ``AgentId`` tuple client-side by ``AgentId.name``. - No rank-0 special role for registration. Convergence is per-site: each rank exits the discovery loop as soon as its own ``allowed_peers`` are all visible on the exchange, regardless of what other ranks / sites are doing. - ``bootstrap_message_dispatched`` rule simplified to ``initial_agent == agent_spec.name`` (instead of "name in registrations dict"); semantically identical for single-machine runs, correct for federated runs. runtime/registration.py: gutted and rewritten. Old surface area (``load_academy_registrations``, ``write_academy_registrations``, ``wait_academy_registrations``, ``registration_payload``, ``academy_registration_path``, ``_REGISTRATION_TYPES``, ``_exchange_type_of``) deleted in favor of a single async helper ``discover_peer_agent_ids(transport, peer_names, *, agent_class, timeout_s, poll_interval_s)``. Returns ``dict[name, AgentId]`` for ``Handle`` construction. Times out with a message listing the missing peer names so operators can immediately tell which site failed to register. runtime/daemon.py: registration block + bootstrap dispatch reworked to match the new flow. 
Code shrinks: the rank-0 / rank-N branch is gone; the post-block "if rank == 0: reload registrations" hack is gone; ``registrations`` dict and its key lookups replaced with ``registration`` (own) plus ``peer_agent_ids`` (discovered). observability/run_artifacts.py: ``clear_run_outputs`` no longer deletes the dead ``academy_registrations.json`` filename. tests/test_academy_exchange_registration.py: file-based round-trip tests removed (their target functions no longer exist). Replaced with discovery-helper tests against a ``_FakeTransport`` whose ``discover()`` returns pre-configured rounds: * empty peer list short-circuits without any discover() calls * happy path returns name -> AgentId for requested peers only, even when discover() also returns other agents (cross-operator isolation depends on this filter) * waits across multiple polls for late peers (the federated convergence story) * times out with the missing peer names in the message * first-found-wins for a re-seen peer name across polls Run-id name-prefixing remains an operator-runbook convention until auto-prefixing lands; without it, two operators running concurrent demos against the same hosted exchange would see each other's agents in their ``discover()`` results. Tests: 62/62 academy sweep (was 63; net -1 because the parametrized file-round-trip test was 4 cases and the replacement is 4 helpers + 1 short-circuit case). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The federated piece the operator runs after every site is up. In a single-machine campaign, rank 0 of the daemon dispatches the ``campaign`` -> ``initial_agent`` bootstrap message in-process; in a federated campaign that's impossible because the agent that owns ``initial_agent`` may live on a different machine that hasn't even come up yet. ``spawn-site`` already skips the inline dispatch (``--no-bootstrap``). This commit adds the matching standalone command that triggers the kickoff at the right moment from anywhere with the cached Globus token.

UX:

    chemgraph academy bootstrap -- \
        --campaign federated-demo.jsonc \
        --exchange-type http

    # or override the recipient for partial re-runs / debugging
    chemgraph academy bootstrap -- \
        --campaign federated-demo.jsonc \
        --recipient worker-a \
        --exchange-type http

runtime/bootstrap.py (new)
- ``parse_args``: --campaign (required), --recipient (defaults to campaign.initial_agent), --exchange-type (defaults to 'http' since that's the main use case), --http-exchange-url override, redis triple for the local-broker case, --discover-timeout-s.
- ``dispatch_bootstrap``: opens a user client on the configured exchange, discovers the recipient by name via the shared ``discover_peer_agent_ids`` helper, sends one ``receive_message`` action, closes the client (also on error so the aiohttp session backing the http transport doesn't leak).
- ``main``: returns exit code 2 with a stderr message when the recipient never appears on the exchange, so wrapping shell scripts can branch on "bootstrap didn't actually happen."

cli/main.py
- ``academy bootstrap`` subparser + dispatch in ``_handle_academy``.
- Usage hint updated to include the new command.
Tests (6 new, 68/68 academy sweep) - parse_args: --campaign required, exchange-type default, recipient override - dispatch_bootstrap: happy-path discovery + handle action, sender / recipient / message_id consistency, campaign user_task embedded in the dispatched content, client closed on success - dispatch_bootstrap: client closed on timeout too (no Handle construction attempted when discovery fails) - main: returns 2 on TimeoutError and writes the missing recipient name to stderr Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The piece that lets one Mac terminal serve a federated campaign running across multiple HPC sites. Operator runs ONE dashboard command with ``--system aurora,crux``; the launcher spins up per-site SSH tunnels + UAN relays + rsync mirrors, and the server walks the merged subdir tree to render one timeline tagged by site. runtime/dashboard_launcher.py - ``--system`` now takes a comma-list ('aurora,crux'). Single-site invocations are unchanged (the value resolves to a 1-tuple and the rest of the launcher uses the same per-site setup helper). - Per-site setup extracted into ``_setup_site`` returning a ``_SiteHandle`` carrying everything the cleanup finally needs. ``main`` loops over the resolved tuple; failure on any site triggers teardown of the partially-set-up sites. - Each site gets its own reverse-port (base + site_index) so two SSH ``-R`` tunnels don't collide on the Mac side. - Multi-site mode rejects scalar overrides that can't sensibly apply to every site (--remote-host, --ssh-control-path, --relay-port, --lm-base-url, --local-run-dir). Operators encode site differences in the profile JSON instead. - Single-site mirror layout unchanged (``<root>/<run_id>/``); multi-site mirrors under ``<root>/<run_id>/<system>/``. dashboard/server.py - ``_iter_site_dirs`` detects layout: if ``events.jsonl`` is at the top level it's legacy single-site; otherwise walk subdirs and treat each as a site if it has ``events.jsonl`` OR ``dashboard_metadata.json``. The metadata check covers the early-startup window where a site is up but no events have been written yet, so federated dashboards don't briefly look like "empty single-site". - ``events_payload``: legacy shape preserved for single-site; federated merges sites in timestamp order with a ``site`` tag on each event so the UI can color/group per-site. 
- ``status_payload``: legacy keys preserved for single-site; federated nests per-site status/placement/summary under ``sites: {<name>: ...}`` with a top-level ``updated`` reflecting the latest per-site update.

Tests (+10, 78/78 academy sweep)
- _iter_site_dirs: recognizes metadata-only sites; falls back to single-site for empty dirs
- events_payload: merges + tags by site; timestamp-sorted output even when sites are seeded reverse-order
- status_payload: nests under ``sites`` for federated, preserves legacy keys for single-site (regression guard against an accidental "make them uniform" refactor)
- _parse_systems_list: single name, comma-list with whitespace, rejects empty, rejects duplicates

Aurora ⇄ Crux demo runbook (operator runs once both sites have a system profile in the repo):

    # Mac terminal A
    chemgraph academy dashboard -- federated-demo-001 \
        --system aurora,crux --campaign federated-demo.jsonc

    # Aurora compute
    chemgraph academy spawn-site -- \
        --system aurora --campaign federated-demo.jsonc \
        --agents coordinator-agent --exchange-type http

    # Crux compute
    chemgraph academy spawn-site -- \
        --system crux --campaign federated-demo.jsonc \
        --agents worker-a --exchange-type http

    # Mac terminal B
    chemgraph academy bootstrap -- \
        --campaign federated-demo.jsonc --exchange-type http

(Crux profile JSON is still TODO -- prerequisite for the actual demo, not for the dashboard code.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
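
The layout detection described for ``_iter_site_dirs`` can be sketched as below. The return shape (a list of ``(site_name, path)`` pairs, with ``None`` as the site name for the legacy single-site layout) is an assumption made for illustration, not the server's actual signature.

```python
from pathlib import Path
from typing import Optional


def iter_site_dirs(run_dir: Path) -> list[tuple[Optional[str], Path]]:
    """Detect single-site vs federated mirror layout under run_dir."""
    # Legacy layout: events.jsonl sits directly in the run directory.
    if (run_dir / "events.jsonl").exists():
        return [(None, run_dir)]
    sites = []
    for child in sorted(run_dir.iterdir()):
        if not child.is_dir():
            continue
        has_events = (child / "events.jsonl").exists()
        # Metadata alone counts: a site may be up before it writes any event.
        has_metadata = (child / "dashboard_metadata.json").exists()
        if has_events or has_metadata:
            sites.append((child.name, child))
    # Nothing recognizable yet: fall back to treating it as single-site.
    return sites or [(None, run_dir)]
```

Checking for the metadata file as well as the events file is what keeps a freshly started federated run from briefly rendering as an empty single-site run.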

…lidation The non-code artifacts that turn the federation primitives shipped in B.1-B.4c into an actually runnable Aurora ⇄ Crux demo. runtime/profiles/crux.template.json (new) - Mirrors polaris.template.json (/eagle paths) but with Crux-specific bits: separate venv directory name (academy-swarm-crux) so it doesn't collide with the existing x86_64 Polaris venv on the same /eagle workspace; -crux suffix on the relay host file so per-site relays in the multi-site dashboard don't fight over the same path. - Registered in profiles/__init__.py BUILTIN_SYSTEM_PROFILES so ``chemgraph academy spawn-site --system crux`` and the multi-site dashboard launcher both recognize it. - Same unset_env policy as Aurora/Polaris -- proxies stripped by default for the LM-relay path; the launcher's exchange_type=='http' branch already overrides this so http exchange works via the ALCF proxy (proxy reachability empirically verified on Crux compute today). campaigns/federated-hello/ (new) - Two agents (agent-aurora, agent-crux), each declaring the other as its only allowed peer. No MCP servers, no resources, no science tools -- the smallest possible end-to-end campaign that exercises cross-site discovery + cross-site send_message + LM-driven decision turns. ~$0.01-0.05 of GPT-5-mini calls per run. - agent-aurora's mission: send ONE 'hello from aurora' to agent-crux, finish_turn, then finish_turn on every subsequent wakeup. - agent-crux's mission: wait, reply ONCE, finish_turn. Strong anti-loop guidance in both missions + the prompt profile. - prompt_profiles/default.json: tight system + protocol prompts that explicitly say "no science tools, only send_message and finish_turn." langchain_recursion_limit=32 since neither agent should ever loop more than a handful of rounds. - lm_config.json: GPT-5-mini template (no temperature field, since reasoning models reject non-default values -- the launcher's auto-strip would handle it but cleaner to just omit). 
- Registered under 'federated-hello' in CAMPAIGNS + CAMPAIGN_LAUNCH_DEFAULTS so ``--campaign federated-hello`` works as a packaged name (no rsync of the campaign dir required). core/campaign.py: validate_campaign(*, federated=False) - New keyword-only flag loosens two single-machine assumptions that break in federated spawn-site flows: * initial_agent may name an agent hosted on another site * each agent's allowed_peers may reference cross-site agents Both are looked up via the exchange at runtime, so the validator legitimately can't pre-check them in a federated slice. - Intra-slice checks (duplicate names, self-peer, MCP server / tool / resource resolvability) still run. Self-peer in particular stays a hard error because it would loop messages regardless of how many sites the campaign spans. runtime/daemon.py - Passes federated=bool(config.agents) to validate_campaign. The presence of an --agents slice is the canonical indicator of "I'm one site of a federated launch." Single-machine run-compute flows pass federated=False (the default), so prior behavior is byte-identical. Tests (+2, 80/80 academy sweep green; was 78) - validate_campaign federated=True accepts the cross-site peer reference in a federated-hello slice that strict validation rejects (regression guard for the relaxation). - validate_campaign federated=True still rejects self-peer (regression guard against accidentally relaxing too much). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
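
A minimal sketch of the relaxed validation, using a hypothetical ``AgentSpec`` dataclass and a flattened signature in place of the real campaign model. It covers only the name and peer checks. The MCP server, tool, and resource checks mentioned above are omitted.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical, simplified agent declaration."""
    name: str
    allowed_peers: tuple[str, ...] = ()


def validate_campaign(
    agents: Sequence[AgentSpec], initial_agent: str, *, federated: bool = False
) -> None:
    """Raise ValueError on an invalid campaign (or campaign slice)."""
    names = [agent.name for agent in agents]
    if len(set(names)) != len(names):
        raise ValueError("duplicate agent names")
    local = set(names)
    for agent in agents:
        # Self-peer is always an error: it would loop messages on any topology.
        if agent.name in agent.allowed_peers:
            raise ValueError(f"{agent.name} lists itself as a peer")
        if not federated:
            missing = [peer for peer in agent.allowed_peers if peer not in local]
            if missing:
                raise ValueError(f"{agent.name} references unknown peers: {missing}")
    # In a federated slice the initial agent may be hosted on another site.
    if not federated and initial_agent not in local:
        raise ValueError(f"initial_agent {initial_agent!r} is not declared")


aurora_slice = [AgentSpec("agent-aurora", allowed_peers=("agent-crux",))]
validate_campaign(aurora_slice, "agent-aurora", federated=True)  # accepted
```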

The minimum-scope UI work that makes the federation story visible in the dashboard. Without it the merged event payload (B.4c.2) landed on a UI that displayed everything as if it were a single machine -- timeline still rendered, agent graph still rendered, but operators / demo viewers had no visual cue that the campaign was spanning multiple HPCs. What the UI now shows in federated runs: * Header bar: "Sites: aurora (1🤖 / 12📨) · crux (1🤖 / 8📨)" so the multi-site nature is immediately legible from the top of the dashboard. * Agent-graph swimlanes labelled by site ("aurora", "crux") instead of by individual compute hostnames ("x4708...", "x1000...") -- same nodes, same edges, far clearer story. * Message-flow detail panel: route is labelled "cross-site" (federated) or "cross-node" (single-machine) depending on context, with "From site" / "To site" rows showing aurora vs crux. The literal hostname is still available in each agent's detail panel. * Cross-node-messages metric becomes meaningful as "messages that crossed the HPC boundary" in federated runs. Single-site runs are visually byte-identical to before: ``snapshot.federated`` is false so ``agentGroup`` falls back to ``agentHost``, the sitesBadge stays hidden, route labels stay "cross-node" / "same-node", detail rows stay "From host" / "To host". Test suite (80/80) confirms server-side payload shape is unchanged for single-site. Implementation - ``load()``: detect ``snapshot.sites`` (set by server-side ``_iter_site_dirs`` in B.4c.2), set ``snapshot.federated``, build a flat ``sitesByAgent`` index from ``sites[*].status.agents`` and ``sites[*].placement.agents``, backfilled from per-event ``site`` tags as authoritative. - ``agentSite(agent)`` / ``agentGroup(agent)``: the single point where federated vs single-site rendering diverges. Every renderer that asks "what bucket does this agent belong to" now goes through ``agentGroup`` instead of ``agentHost``. 
- ``renderSitesBadge()``: header-bar federation indicator with per-site agent counts and per-site event counts. - Three message-route detail panels updated to label by group rather than hardcoding "host", and to show "cross-site" in federated mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

For accounts whose SSH login differs from the workspace directory name. ALCF_USER drives every path interpolation (``/flare/${ALCF_PROJECT}/${ALCF_USER}/``) while ALCF_SSH_USER drives only ``remote_host`` (``${ALCF_SSH_USER}@aurora.alcf.anl.gov``). The two collided on a single env var until now, forcing operators to choose: set ALCF_USER for paths and get the wrong SSH login (which triggered an ALCF Cyber security challenge on Aurora), or set it right for SSH and have all the run-dir / venv paths point at a non-existent directory. The relevant operator on this repo has SSH login ``jinchuli`` but their Aurora/Crux/Polaris workspace lives under ``/{flare,eagle}/<proj>/jinchu/`` (no trailing 'i'), so the ALCF_USER=jinchu setting was producing the right paths but the wrong SSH user. Now they set ALCF_USER=jinchu for paths and ALCF_SSH_USER=jinchuli for SSH and both work. Default ALCF_SSH_USER to ALCF_USER when unset, so the majority of users for whom the two are equal don't have to set both. system.py - New ``_expand_with(text, env)`` does ``os.path.expandvars``-style substitution against a caller-supplied env dict rather than the process environment, so the SSH-USER default doesn't leak into ``os.environ`` for subsequent callers. - ``load_system_profile`` copies the environ, fills in the default, and substitutes through ``_expand_with``. profiles/{aurora,crux,polaris}.template.json - ``remote_host`` now interpolates ``${ALCF_SSH_USER}``; every other field still uses ``${ALCF_USER}`` for the path component. Tests: 80/80 academy sweep still green. Default-case behavior (both env vars equal) is byte-identical to the prior single-var setup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
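
The substitution approach can be sketched as follows, assuming ``$VAR`` / ``${VAR}`` syntax and that unknown variables are left in place, as ``os.path.expandvars`` does on POSIX. ``resolve_profile_env`` is a hypothetical helper name. The commit attributes this step to ``load_system_profile``.

```python
import re
from typing import Mapping

_VAR_PATTERN = re.compile(r"\$(\w+|\{[^}]*\})", re.ASCII)


def expand_with(text: str, env: Mapping[str, str]) -> str:
    """Expand $VAR and ${VAR} from env; unknown variables stay unchanged."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name.startswith("{"):
            name = name[1:-1]
        return env.get(name, match.group(0))
    return _VAR_PATTERN.sub(substitute, text)


def resolve_profile_env(environ: Mapping[str, str]) -> dict[str, str]:
    """Copy the environment and default ALCF_SSH_USER to ALCF_USER."""
    env = dict(environ)  # a copy, so the default never leaks into os.environ
    if "ALCF_USER" in env:
        env.setdefault("ALCF_SSH_USER", env["ALCF_USER"])
    return env


env = resolve_profile_env({"ALCF_USER": "jinchu", "ALCF_SSH_USER": "jinchuli",
                           "ALCF_PROJECT": "proj"})
remote_host = expand_with("${ALCF_SSH_USER}@aurora.alcf.anl.gov", env)
run_root = expand_with("/flare/${ALCF_PROJECT}/${ALCF_USER}/", env)
```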

…ide; crux relay port 18187

Two operational fixes from the Aurora<->Crux federated demo.

compute_launcher.py
- Default startup_timeout_s 120s -> 600s. The realistic worst case for federated launches is one site's HPC queue wait + Python imports exceeding the other site's peer-discovery patience; 120s is far too short for that. 600s comfortably absorbs debug-scaling / workq scheduling delays. Single-machine launches reach discover_peer_agent_ids in seconds so the new ceiling never matters for them.
- New --startup-timeout-s CLI flag so operators can extend the window further when they know a site will be slow.

profiles/crux.template.json
- Bump relay_port 18186 -> 18187 to dodge a leftover ssh -R reverse-tunnel that's still bound to 127.0.0.1:18186 on crux-uan-0001 from a prior failed dashboard launch. Follow-up cleanup: launcher should probe for a free port instead of insisting on the profile's hardcoded one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The daemon was opaque during its slow stretches -- import, registration, peer discovery, runtime entry, and waiting for the bootstrap message all happened silently as seen from the operator's terminal. You could not distinguish "still importing" from "stuck on discovery" from "alive and waiting for bootstrap" without tailing events.jsonl or checking the dashboard.

Add landmark prints, all grep-able as ``[daemon]`` or ``[agent <name>]``:

daemon.py
- ``[daemon] rankN registered <name> on the exchange`` -- own-registration completed; next step is peer discovery
- ``[daemon] rankN discovering peers [...] (timeout 600s)...`` -- entering the wait
- ``[daemon] rankN discovered N peer(s): [...]`` -- past discovery, about to enter Runtime
- ``[daemon] rankN agent <name> is now running inside Academy Runtime`` -- agent is alive and listening
- ``[daemon] rankN dispatched inline bootstrap to <initial>`` / ``... skipping inline bootstrap (federated mode); waiting for chemgraph academy bootstrap ...`` so the operator knows whether to fire the standalone bootstrap subcommand

core/agent.py
- ``[agent <name>] first message arrived from <sender> (kind=...): <tldr>`` on the FIRST inbox message. For the federated demo the recipient agents both print this -- agent-aurora when bootstrap lands, agent-crux when the hello arrives. Concrete "kickoff worked" signal without needing the dashboard.

All prints use flush=True so they survive PALS/MPICH buffering when mpiexec is forwarding many ranks' stdout simultaneously.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…dezvous

The B.4 federated demo kept timing out at discovery -- both sites registered their agents, neither could find the other. Root cause turned out to be that Academy's hosted HttpExchange strips ``AgentId.name`` from ``discover()`` responses: only ``uid`` and ``role`` round-trip. Our name-based filter ``if agent_id.name in wanted`` was silently never matching across sites because every discovered AgentId came back with name=None. (The original ChemGraph test suite missed this because it used ``AgentId.new('worker-a')`` fakes that preserve the name, which the real hosted exchange does not.)

Replacement: deterministic UIDs.

registration.py
- ``deterministic_agent_uid(run_id, agent_name)`` derives a stable uuid5 from a fixed namespace + ``"{run_id}/{agent_name}"``. Same inputs on every site produce the same UID, so each rank constructs every peer's AgentId locally instead of needing ``discover()`` to echo the name back.
- ``deterministic_agent_id(run_id, agent_name)`` builds the full AgentId with the local name preserved (for trace events) and the deterministic UID.
- ``register_agent_with_uid(transport, agent_class, agent_id)`` bypasses the SDK's ``register_agent`` (which always generates a random UID via ``AgentId.new``) and POSTs the pre-built deterministic AgentId directly to the same mailbox endpoint.
- ``wait_for_peers_alive(transport, peer_ids, ...)`` replaces ``discover_peer_agent_ids``. Matches on UID (preserved by discover()) instead of name (stripped). Times out with a message listing missing peer names+UIDs.

daemon.py
- Imports + uses the new helpers. Each rank computes its own AgentId deterministically and registers with it, then computes every peer's AgentId locally and waits for the peer's mailbox to be visible on the exchange. No "discover by name" anywhere.
- Runtime is still handed a real HttpAgentRegistration wrapping the deterministic AgentId, so the agent runs unchanged.

bootstrap.py
- New ``--run-id`` required arg.
The recipient's mailbox UID is derived from (run-id, recipient-name); operator must pass the same run-id they used for spawn-site or the bootstrap addresses a different mailbox than the daemons registered. - Bumped ``--discover-timeout-s`` default 120s -> 600s to match spawn-site's startup_timeout_s. - Uses ``deterministic_agent_id`` + ``wait_for_peers_alive`` instead of name-based discovery. Side effect: agent names are now campaign-scoped via the run-id. Two operators running the SAME campaign with the SAME run-id will collide on the mailbox UIDs and the second registration will fail with "mailbox already exists" -- correct fail-fast behavior. The old run-id-prefixing convention from the original docstring is now load-bearing rather than advisory. Tests (+5, 85/85 academy sweep green) - deterministic_agent_uid: stable; differs by run_id; differs by agent_name - deterministic_agent_id: name preserved locally - wait_for_peers_alive: empty list short-circuits; succeeds when all UIDs present (with names stripped, mirroring the real exchange response); waits across polls for late peers; times out naming missing UIDs; ignores unrelated agents - bootstrap: requires --run-id; defaults discover-timeout to 600s; sends to deterministic recipient AgentId; closes client on timeout; main() returns 2 with stderr message Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
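The rendezvous scheme in this commit can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not the PR's implementation: `EXAMPLE_NAMESPACE` is a made-up namespace value, and `list_live_uids` is a hypothetical callable standing in for a `discover()` call against the exchange. The function names mirror the ones in the commit message, but the signatures here are simplified.

```python
import time
import uuid
from typing import Callable, Dict, Iterable

# Assumption: any fixed namespace works as long as every site uses the same
# one. This value is invented for the sketch; it is not the one in the PR.
EXAMPLE_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def deterministic_agent_uid(run_id: str, agent_name: str) -> uuid.UUID:
    """Derive a stable UID from run id and agent name via uuid5."""
    return uuid.uuid5(EXAMPLE_NAMESPACE, f"{run_id}/{agent_name}")


def wait_for_peers_alive(
    list_live_uids: Callable[[], Iterable[uuid.UUID]],
    peers: Dict[str, uuid.UUID],
    timeout_s: float = 600.0,
    poll_s: float = 1.0,
) -> None:
    """Poll until every expected peer UID is visible, matching on UID only.

    Names are used solely for the timeout message; they are never compared
    against the exchange response, which does not carry them.
    """
    if not peers:
        return  # nothing to wait for
    deadline = time.monotonic() + timeout_s
    while True:
        live = set(list_live_uids())
        missing = {name: uid for name, uid in peers.items() if uid not in live}
        if not missing:
            return
        if time.monotonic() >= deadline:
            detail = ", ".join(f"{n} ({u})" for n, u in sorted(missing.items()))
            raise TimeoutError(f"peers not visible on the exchange: {detail}")
        time.sleep(poll_s)
```

Because both sites evaluate the same pure function on the same `(run_id, agent_name)` pair, neither needs the other's answer to know which mailbox to probe; the run id is therefore part of the address, which is why the bootstrap command must be given the same one.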

…event names

The skip-trace I added in 5549dbb (operator-visible daemon lifecycle prints) writes a system trace with event name bootstrap_message_skipped, but that name was never added to the CampaignEvent.event Literal enum. The pydantic validator rejected it, crashing the daemon RIGHT AFTER the ``[daemon] ... is now running inside Academy Runtime`` print.

Cosmetic-but-fatal regression that the test suite missed because no test exercises the skip-bootstrap code path through append_system_trace -- the federated demo is the first place this code path runs end to end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
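One way to catch this class of regression without running a full campaign is to check every event name the code can emit against the `Literal`. The sketch below uses only `typing` as a stand-in for the pydantic validation; apart from `bootstrap_message_skipped`, the event names and the `emitted_event_names` registry are invented for illustration and are not taken from the PR.

```python
from typing import List, Literal, get_args

# Assumption: the real CampaignEvent.event Literal has other members; only
# bootstrap_message_skipped is named in the commit message.
EventName = Literal[
    "agent_registered",
    "bootstrap_message_sent",
    "bootstrap_message_skipped",
]

ALLOWED_EVENTS = frozenset(get_args(EventName))


def validate_event_name(name: str) -> str:
    """Reject names missing from the Literal, as a validating model would."""
    if name not in ALLOWED_EVENTS:
        raise ValueError(f"unknown event name: {name!r}")
    return name


def emitted_event_names() -> List[str]:
    """Hypothetical registry of every event name the daemon code writes."""
    return [
        "agent_registered",
        "bootstrap_message_sent",
        "bootstrap_message_skipped",
    ]


def check_all_emitted_names_are_allowed() -> None:
    # Fails fast in a unit test instead of at runtime on an HPC allocation.
    for name in emitted_event_names():
        validate_event_name(name)
```

A check of this shape would have failed at the commit that introduced the new trace, since the name would be in the registry but absent from the `Literal`.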

…e context

handle.action reads the outbound exchange from a contextvar that is only set when a client is active. Runtime sets it for daemon-side code, but the standalone bootstrap command needs to set it explicitly via client.register_handle(handle) -- otherwise Handle.action raises ExchangeClientNotFoundError.

The first federated demo attempt failed here: discovery succeeded, the message was built, the Handle was constructed -- and the action call died because the Handle did not know which client to route through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…le finds exchange

UserExchangeClient.__aenter__ is what sets the academy.handle.exchange_context ContextVar that Handle.action reads to find the outbound exchange. The prior register_handle-only fix binds the handle for inbox routing but does NOT set the contextvar, so the action call still raised ExchangeClientNotFoundError.

Restructure dispatch_bootstrap to run the whole send inside ``async with client:`` -- exchange_context is set on enter, restored on exit. The aiohttp session gets closed by __aexit__, so the explicit client.close() became redundant.

Test fixture _FakeClient is now an async-context-manager stand-in; the two close-on-success / close-on-timeout assertions check enter_count/exit_count instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
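The mechanism behind this fix can be reproduced with a toy model built on `contextvars`. Everything below is a stand-in written for illustration: `FakeClient`, `FakeHandle`, and this module-level `exchange_context` only imitate the behaviour the commit message attributes to Academy's classes and are not its API.

```python
import asyncio
import contextvars

# Toy stand-in for the context variable that the handle consults.
exchange_context = contextvars.ContextVar("exchange_context", default=None)


class ExchangeClientNotFoundError(RuntimeError):
    """Raised when no client is active in the current context."""


class FakeClient:
    def __init__(self):
        self.sent = []
        self._token = None

    async def __aenter__(self):
        # Entering the client is what publishes it to the context variable.
        self._token = exchange_context.set(self)
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Restore the previous value on exit.
        exchange_context.reset(self._token)
        return False


class FakeHandle:
    def __init__(self, recipient):
        self.recipient = recipient

    async def action(self, name, payload):
        client = exchange_context.get()
        if client is None:
            raise ExchangeClientNotFoundError("no active exchange client")
        client.sent.append((self.recipient, name, payload))


async def dispatch_bootstrap(client, recipient, payload):
    # The whole send runs inside the async-with block, so the handle can
    # find the client through the context variable.
    async with client:
        await FakeHandle(recipient).action("receive", payload)
```

Calling `FakeHandle.action` outside the `async with` block raises `ExchangeClientNotFoundError` in this model, which is the failure the earlier `register_handle`-only attempt ran into.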

…odels

Ported from the synth branch. GPT-5* and o1/o3/o4* reject any non-default temperature with HTTP 400 'Unsupported value: temperature does not support 0.0'. Both ChatOpenAI construction sites (load_openai_model and agent/turn._custom_openai_compatible_kwargs) now consult is_reasoning_model() and drop temperature + the other sampling knobs when the model is one of those. Same module-level is_reasoning_model() helper as on the synth branch so a future merge stays mechanical.

This was the last bug between the federated-hello demo daemons making their first LM call and completing the round trip. Both sites successfully discovered each other, received the bootstrap message, and entered their first reasoning round; the round crashed at the LM call because the demo uses GPT-5-mini.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
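A guard of the kind described here might look like the following. It is a sketch under assumptions: the prefix rule in `is_reasoning_model` and the exact set of keys treated as "sampling knobs" are guesses, since the commit message names only `temperature`; `build_chat_kwargs` is a hypothetical helper, not one of the PR's construction sites.

```python
import re

# Assumption: a case-insensitive prefix match on the model name. The PR's
# actual is_reasoning_model() may use a different rule.
_REASONING_PATTERN = re.compile(r"^(gpt-5|o1|o3|o4)", re.IGNORECASE)

# Assumption: which parameters count as sampling knobs is not spelled out in
# the commit message; this tuple is illustrative.
_SAMPLING_KEYS = ("temperature", "top_p", "frequency_penalty", "presence_penalty")


def is_reasoning_model(model_name: str) -> bool:
    """Return True for model families that reject non-default temperature."""
    return bool(_REASONING_PATTERN.match(model_name.strip()))


def build_chat_kwargs(model_name: str, **kwargs) -> dict:
    """Assemble constructor kwargs, dropping sampling knobs when required."""
    out = {"model": model_name, **kwargs}
    if is_reasoning_model(model_name):
        for key in _SAMPLING_KEYS:
            out.pop(key, None)  # omit the key entirely rather than send 0.0
    return out
```

Dropping the key, instead of resetting it to some default, avoids having to know what each model family considers its default value.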

…paign

Federated-hello produced 2-3 events per site and the demo was over before the dashboard could render anything. Federated-chat is a back-and-forth counter game between agent-aurora and agent-crux: each turn one agent increments a counter and sends to the peer, until the counter hits 10. ~6 reasoning rounds per agent = ~40 events total in the merged dashboard timeline, plus visible message-flow with cross-site Route labels.

Same two-agent shape as federated-hello so the same operator runbook works -- only --campaign federated-chat changes. Registered under 'federated-chat' name with max_decisions=20 slack for retries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tus to top level

The server returns per-site state nested under snapshot.sites[<site>].status / .placement in federated mode, but agents() reads snapshot.status?.agents, so the dashboard rendered an empty graph + zeroed metrics for federated runs even though events were streaming through correctly.

Synthesize merged top-level status/placement during load() so every existing single-site reader (agents, metrics, workflow mode detection) works unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
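The merge step can be illustrated as a pure function over the snapshot. This is a Python rendering of the idea only; the dashboard's own `load()` is not shown in the PR text and is presumably written in the dashboard's front-end language. The assumed shapes, `status` as `{"agents": {name: {...}}}` and `placement` as `{name: ...}`, and the added `"site"` tag are guesses made for the sketch.

```python
from typing import Any, Dict


def merge_site_snapshot(snapshot: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the snapshot with merged top-level status/placement.

    Assumed shapes (not confirmed by the PR text):
      snapshot["sites"][site]["status"]    -> {"agents": {name: {...}}}
      snapshot["sites"][site]["placement"] -> {name: ...}
    """
    sites = snapshot.get("sites")
    if not sites:
        # Single-site snapshot: nothing to synthesize.
        return dict(snapshot)

    agents: Dict[str, Any] = {}
    placement: Dict[str, Any] = {}
    for site_name in sorted(sites):
        site = sites[site_name] or {}
        site_agents = (site.get("status") or {}).get("agents", {})
        for name, info in site_agents.items():
            # Tag each agent with its site so cross-site views can label it.
            agents[name] = {**info, "site": site_name}
        placement.update(site.get("placement") or {})

    merged = dict(snapshot)
    merged["status"] = {"agents": agents}
    merged["placement"] = placement
    return merged
```

Doing the merge once at load time keeps every downstream reader on the single-site code path, which matches the stated goal of leaving those readers unchanged.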

…he port

The relay had no signal handlers, so SIGTERM hit the default disposition and bash kill(1) calls were silently ignored. The python kept running, the port stayed bound, and the next launch failed with "Address already in use" -- requiring a manual UAN sweep to recover.

Install SIGTERM/SIGINT handlers that close the listen socket and exit cleanly, with try/except around accept() so the close-from-handler path returns instead of tracebacking.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
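The handler-plus-guarded-accept pattern described above can be sketched as follows. The function names are hypothetical and the relay's real accept loop is not shown in the PR text; the sketch only demonstrates closing the listen socket from a signal handler and treating the resulting `OSError` as a shutdown request.

```python
import signal
import socket


def make_shutdown_handler(listen_sock: socket.socket):
    """Build a signal handler that releases the port by closing the socket."""

    def _handler(signum, frame):
        # After close(), a pending or subsequent accept() raises OSError,
        # which the serve loop below interprets as "time to stop".
        listen_sock.close()

    return _handler


def install_handlers(listen_sock: socket.socket) -> None:
    """Register the handler for SIGTERM and SIGINT (main thread only)."""
    handler = make_shutdown_handler(listen_sock)
    signal.signal(signal.SIGTERM, handler)
    signal.signal(signal.SIGINT, handler)


def serve(listen_sock: socket.socket, handle_conn) -> None:
    """Accept connections until the listen socket is closed."""
    while True:
        try:
            conn, _addr = listen_sock.accept()
        except OSError:
            # Socket was closed from the handler: return without a traceback.
            return
        with conn:
            handle_conn(conn)
```

A caller would create and bind the socket, call `install_handlers(sock)`, then `serve(sock, ...)`; once `serve` returns, the process can exit normally and the port is free for the next launch.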

…race and self-kill guard

Aurora's login alias round-robins across uan-0001..uan-0010, so a single pid file on the shared FS was meaningless: the pid only exists on the UAN that ran the relay, and the next launch usually lands on a different UAN where that pid is either absent or belongs to someone else. As a result every crashed launch left an orphan relay holding the port, and manual ssh-into-each-UAN cleanup was the only recovery path.

Replace the single-pid bookkeeping with per-UAN cleanup that scans ps for python processes whose argv contains the relay script path, owned by $USER, excluding $$ and $PPID. The self-exclusion is load-bearing: pgrep -f matched our own bash script (the relay script path appears in our argv as well), so the previous attempt killed the caller instead of the orphan.

Also drop set -e (pgrep returning 1 was triggering silent exits with no log output) and add exec 2>&1 + set -x so the local relay log contains a full trace when something fails -- previously failures produced empty logs and "Local relay log:" with nothing after it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
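The selection rule for orphan relays can be written as a pure filter over process-listing lines. The PR's launcher is a bash script, so this Python version is an illustration of the logic only; it assumes input shaped like the output of `ps -eo pid,user,args`, and the function name and the example paths in the usage note are invented.

```python
from typing import Iterable, List, Set


def find_orphan_relays(
    ps_lines: Iterable[str],
    script_path: str,
    user: str,
    exclude_pids: Set[int],
) -> List[int]:
    """Return PIDs of python processes running script_path, owned by user.

    exclude_pids plays the role of $$ and $PPID: the caller's own shell also
    carries the script path in its argv and must never be selected.
    """
    orphans: List[int] = []
    for line in ps_lines:
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # header row or malformed line
        pid, owner, argv = int(parts[0]), parts[1], parts[2]
        if owner != user or pid in exclude_pids:
            continue
        # Require a python interpreter as the executable, so a bash wrapper
        # that merely mentions the script path is not matched.
        executable = argv.split(None, 1)[0].rsplit("/", 1)[-1]
        if "python" in executable and script_path in argv:
            orphans.append(pid)
    return orphans
```

With lines such as `101 alice python3 /opt/relay/relay.py` and `102 alice bash /opt/launch.sh /opt/relay/relay.py`, only PID 101 is returned; matching on the argv string alone would have selected the wrapper as well, which is the self-kill described in the commit message.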

Conflicts (tests only, no src/ conflicts):
- tests/test_academy_dashboard.py
- tests/test_academy_exchange_registration.py

Both resolved by keeping dev-globus's `pytest.importorskip("academy")` guard and dropping the academy.exchange.{local,hybrid,redis} imports that this branch no longer uses (the deterministic-UID rendezvous replaced them).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ed-chat

federated-chat covers the same Aurora<->Crux federation smoke path with more dashboard material to render, so the hello campaign is dead weight. Remove the campaign files + registry entries, and re-point the validate_campaign(federated=True) regression test + registration.py docstring example to federated-chat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…eholder

Two related ergonomics fixes for spawn-site / run-compute launches:

* When --lm-user is omitted, fall back to $ARGO_USER from the env. HPC users already export ARGO_USER for the rest of the ChemGraph workflow, so requiring a duplicate --lm-user flag was busywork.
* _write_lm_config now refuses to ship lm_config.json with the template's literal "<argo-user>" placeholder. Argo would otherwise silently accept the launch and only reject at first LM call time, after the daemon + relay stack was already running -- expensive to debug. The hard error names the fix directly.

Tests: stub HttpExchangeFactory in test_academy_exchange_registration so the http-dispatch tests don't try to authenticate against the real hosted exchange.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
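The two fixes above amount to a fallback lookup and a fail-fast check. The sketch below shows one possible shape; `resolve_lm_user` and `check_lm_config` are hypothetical names, the `"user"` key used in the usage note is a guess at the config layout, and only `--lm-user`, `ARGO_USER`, and the `<argo-user>` placeholder come from the commit message.

```python
import os
from typing import Mapping, Optional

PLACEHOLDER = "<argo-user>"


def resolve_lm_user(
    cli_value: Optional[str],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Prefer the --lm-user value; otherwise fall back to $ARGO_USER."""
    env = os.environ if env is None else env
    return cli_value or env.get("ARGO_USER") or None


def check_lm_config(config: Mapping[str, object]) -> None:
    """Refuse a config that still carries the template placeholder."""
    for key, value in config.items():
        if isinstance(value, str) and PLACEHOLDER in value:
            raise ValueError(
                f"lm_config field {key!r} still contains {PLACEHOLDER!r}; "
                "pass --lm-user or export ARGO_USER"
            )
```

For example, `check_lm_config({"user": resolve_lm_user(None) or PLACEHOLDER})` raises before anything is launched when neither source is set, so the failure surfaces at the operator's terminal and not at the first LM call.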

Four-terminal walkthrough for the cross-HPC demo: dashboard on Mac + spawn-site on Aurora + spawn-site on Crux + bootstrap kickoff. Mirrors the existing example-002 guide's shape, swaps in the federated flow (deterministic peer UIDs, HTTP exchange, ALCF proxy passthrough, Globus device-flow login). README links to the e2e guide and points at the packaged campaign location.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* dashboard_launcher.py: split `import os, shlex, ...` onto separate lines (E401).
* mpi.py: drop unused `write_json_atomic` import (F401), left over from the file-based registration scheme that was deleted in 52fa7b5.

Pre-existing ruff failures elsewhere in the repo (parsl_tools, mace_mcp_parsl, etc.) are not from this PR and untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

JinchuLi2002 and others added 25 commits June 18, 2026 15:03

chore(.gitignore): ignore *.private-local.* (local scratchpads)

103d716

Mirrors the pattern already used on academy-synth-topology. Allows local journals (e.g. symlinked from ~/.config/chemgraph-journals/) to coexist in the repo without ever being staged.

JinchuLi2002 closed this Jun 22, 2026

JinchuLi2002 reopened this Jun 22, 2026

JinchuLi2002 changed the base branch from main to dev-globus June 22, 2026 18:59

JinchuLi2002 wants to merge 26 commits into argonne-lcf:dev-globus from JinchuLi2002:academy-dynamic-campaign
