feat(academy): federated cross-HPC campaigns via HTTP exchange + multi-site dashboard #136
Open
JinchuLi2002 wants to merge 26 commits into
Mirrors the pattern already used on academy-synth-topology. Allows local journals (e.g. symlinked from ~/.config/chemgraph-journals/) to coexist in the repo without ever being staged.
Wire Academy's HTTP exchange (default URL: Academy-hosted
https://exchange.academy-agents.org/v1, Globus-Auth gated) as a fourth
exchange type alongside redis/local/hybrid. Validated end-to-end on an
Aurora compute node running example-002: 5 agents register against the
hosted exchange, coordinator receives bootstrap, LM traffic flows
through the existing UAN relay. This is the first time a ChemGraph
Academy campaign has run on Aurora without Redis as the messaging
substrate, and the technical groundwork for cross-HPC (e.g.
Mac<->Aurora<->Polaris) campaigns.
Plumbing
- runtime/exchange.py: SUPPORTED_EXCHANGE_TYPES constant covers
  ('redis', 'local', 'hybrid', 'http') so CLI choices and dispatch
  table can't drift. New 'http' branch constructs HttpExchangeFactory
  with optional override URL. exchange_uses_redis() helper lets the
  launcher gate the rank-0 Redis subprocess without inlining the set.
- core/campaign.py: ChemGraphDaemonConfig.http_exchange_url field
  (None = use Academy-hosted default).
- runtime/registration.py: HttpAgentRegistration added to the
  _REGISTRATION_TYPES dispatch so per-rank registration files can
  round-trip through disk for the http exchange.
- runtime/daemon.py, runtime/mpi.py: matching --exchange-type choices,
  --http-exchange-url flag, observability snapshot.
Aurora-specific compute_launcher.py fixes
- _prepare_environment: do NOT strip http_proxy/https_proxy from
  os.environ when exchange_type=='http'. Aurora's profile lists those
  in unset_env for the LM relay path (loopback 127.0.0.1) which is
  correct for redis runs but breaks http exchange. Without this fix
  the parent Python had no proxy vars so the --genv flags never got
  populated, and ranks couldn't reach the public internet.
- mpiexec cmd: append --genvall plus explicit --genv KEY=VAL pairs for
  proxy vars when exchange_type=='http'. PALS's documented --genvall
  default empirically did not forward our parent env; explicit per-var
  flags were required.
- run_allocation: skip rank-0 redis-server subprocess for any exchange
  that doesn't need Redis (was inline 'in {redis,hybrid}', now uses
  exchange_uses_redis helper).
Tests (19 passing across the two suites)
- exchange dispatch parametrized over all four types
- SUPPORTED_EXCHANGE_TYPES integrity vs the dispatch table
- exchange_uses_redis answers pinned per type
- HttpExchangeFactory built with hosted default when url is None, with
  custom URL when provided
- HttpAgentRegistration round-trips through write/load
- run_allocation skips Redis subprocess for http exchange
- --http-exchange-url forwarded to daemon argv when set, omitted when
  None
- compute_launcher tests pass with the new env-prep signature
Operator prerequisites for --exchange-type http on Aurora
- Globus token cached at ~/local/share/academy/storage.db (run any
  HttpExchangeFactory() once interactively to log in via Globus).
- http_proxy / https_proxy set to the ALCF proxy
  (http://proxy.alcf.anl.gov:3128) before invoking 'chemgraph academy
  run-compute'.
- ALCF_USER set to the *workspace* username (e.g. jinchu), which may
  differ from the SSH login (e.g. jinchuli).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
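The exchange-type gating and the explicit proxy forwarding described above can be sketched roughly as follows. This is a minimal illustration, not the code in runtime/exchange.py or compute_launcher.py: `proxy_genv_flags` and `_REDIS_BACKED` are hypothetical names, and the choice of exactly `http_proxy` / `https_proxy` as the forwarded variables is an assumption based on the prose.

```python
# Hypothetical sketch; the real runtime/exchange.py may be organized differently.
SUPPORTED_EXCHANGE_TYPES = ("redis", "local", "hybrid", "http")

# Exchange types that need the rank-0 redis-server subprocess.
_REDIS_BACKED = frozenset({"redis", "hybrid"})

# Assumption: these are the proxy variables the launcher forwards.
_PROXY_VARS = ("http_proxy", "https_proxy")


def exchange_uses_redis(exchange_type: str) -> bool:
    """Return True when the launcher must start Redis for this exchange."""
    if exchange_type not in SUPPORTED_EXCHANGE_TYPES:
        raise ValueError(f"unsupported exchange type: {exchange_type!r}")
    return exchange_type in _REDIS_BACKED


def proxy_genv_flags(env: dict, exchange_type: str) -> list:
    """Build --genvall plus explicit --genv KEY=VAL pairs (http exchange only)."""
    if exchange_type != "http":
        return []
    flags = ["--genvall"]
    for key in _PROXY_VARS:
        value = env.get(key)
        if value:
            flags += ["--genv", f"{key}={value}"]
    return flags
```

Keeping the Redis-backed set next to the supported-types tuple means a test can assert that every member of the former is also in the latter, which is the kind of drift check the test list mentions.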
Foundation for the federated ``spawn-site`` flow. The daemon can now
launch a named subset of a campaign's agents instead of the whole
roster, and rank 0 can skip its in-process bootstrap dispatch so
kickoff is deferred to a separate operator-driven step. Both behaviors
are opt-in; existing ``run-compute`` single-machine campaigns are
untouched.
core/campaign.py
- ``filter_agents(campaign, names)`` returns a new
  ``ChemGraphCampaign`` with only the named agents, preserving order
  so MPI rank-to-agent mapping stays deterministic. Rejects empty
  selections, duplicate names, and names not declared on the
  campaign. Deliberately does NOT rewrite ``initial_agent`` -- in the
  federated flow that name may refer to an agent hosted on another
  site.
- ``ChemGraphDaemonConfig`` gains two fields with backward-compatible
  defaults: ``agents: tuple[str, ...] = ()`` (empty = launch every
  declared agent) and ``skip_bootstrap: bool = False``.
runtime/daemon.py
- ``--agents <comma-list>`` CLI flag, parsed by ``_parse_agents_arg``
  (whitespace-trimmed, empty-segment-tolerant). When set, the daemon
  applies ``filter_agents`` BEFORE ``validate_campaign`` so the
  downstream ``selected_agent(campaign, rank)`` and
  ``wait_for_agent_statuses_finished(campaign=...)`` both see the
  local slice only.
- ``--no-bootstrap`` flag. Rank 0's bootstrap dispatch is now gated by
  ``not skip_bootstrap AND initial_agent in registrations``; the
  second clause naturally handles the case where ``initial_agent``
  lives on another site. The skipped path emits a new
  ``bootstrap_message_skipped`` system trace recording the reason
  (flag vs. non-local agent) so investigators can tell "deferred to
  operator" apart from "silently forgot".
Tests: 30/30 existing academy tests pass with the new defaults.
Focused tests for filter_agents + slicing arrive with the
``spawn-site`` CLI in the follow-up commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
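A minimal sketch of the slicing behavior described for ``filter_agents``, using stand-in dataclasses rather than the real ``ChemGraphCampaign`` model. It assumes "preserving order" means campaign declaration order (not the order the names were typed); the ``Agent`` and ``Campaign`` types here are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Agent:  # hypothetical stand-in for the real agent spec
    name: str


@dataclass(frozen=True)
class Campaign:  # hypothetical stand-in for ChemGraphCampaign
    initial_agent: str
    agents: tuple[Agent, ...]


def filter_agents(campaign: Campaign, names: list[str]) -> Campaign:
    """Return a copy of the campaign holding only the named agents."""
    if not names:
        raise ValueError("agent selection is empty")
    if len(set(names)) != len(names):
        raise ValueError(f"duplicate agent names in selection: {names}")
    declared = {agent.name for agent in campaign.agents}
    unknown = [name for name in names if name not in declared]
    if unknown:
        raise ValueError(f"agents not declared on the campaign: {unknown}")
    wanted = set(names)
    # Iterate the campaign, not the selection, so declaration order wins
    # (assumption) and the rank-to-agent mapping stays deterministic.
    kept = tuple(agent for agent in campaign.agents if agent.name in wanted)
    # initial_agent is left untouched: it may live on another site.
    return replace(campaign, agents=kept)
```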
Operator-facing piece of the federated flow: ``chemgraph academy
spawn-site`` launches one site of a multi-site campaign. Same
arguments as ``run-compute`` plus the slice selector ``--agents
worker-a,worker-b``; internal bootstrap is always skipped (the
operator triggers kickoff once every site is up, via the dedicated
``bootstrap`` subcommand that lands in a follow-up commit).
UX target (Aurora + Crux + Mac dashboard):
    # Aurora compute node
    chemgraph academy spawn-site -- \
        --system aurora --campaign federated-demo.jsonc \
        --agents coordinator-agent --exchange-type http
    # Crux compute node
    chemgraph academy spawn-site -- \
        --system crux --campaign federated-demo.jsonc \
        --agents worker-a,worker-b --exchange-type http
    # Mac (later, after both sides are up)
    chemgraph academy bootstrap -- ...
core/campaign.py
- ``parse_agents_selection(raw)`` promotes the comma-list parser to a
public helper so launcher and daemon agree on whitespace / empty-
segment handling. Duplicate detection lives in ``filter_agents``
so the user-facing error appears in one place regardless of the
input path.
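The parsing contract above (trim whitespace, tolerate empty segments, accept None or an empty string, leave duplicate detection to ``filter_agents``) could look roughly like this. It is a sketch of the described behavior, not the helper's actual body.

```python
from typing import Optional


def parse_agents_selection(raw: Optional[str]) -> tuple:
    """Split a comma-list into trimmed, non-empty agent names.

    Duplicates are intentionally kept here; per the commit message,
    duplicate detection is reported by filter_agents instead.
    """
    if not raw:
        return ()
    return tuple(part.strip() for part in raw.split(",") if part.strip())
```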
runtime/compute_launcher.py
- ``--agents`` + ``--no-bootstrap`` flags. ``AllocationPlan`` gains
matching ``agents: tuple[str, ...]`` and ``skip_bootstrap: bool``
fields, both with backward-compatible defaults so the existing
``run-compute`` flow is unchanged.
- ``prepare_compute_launch`` derives ``agent_count`` from the slice
length when ``--agents`` is given; refuses to mix a contradicting
explicit ``--agent-count`` rather than silently picking one. The
mpiexec ``-n`` value therefore always matches the daemon's
post-filter agent count.
- ``run_allocation`` forwards ``--agents`` and ``--no-bootstrap``
into the daemon argv only when set.
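The agent-count rule for ``prepare_compute_launch`` can be illustrated with a small helper. ``resolve_agent_count`` is a hypothetical name used only for this sketch; the real function does more than this.

```python
from typing import Optional, Sequence


def resolve_agent_count(agents: Sequence[str], agent_count: Optional[int]) -> Optional[int]:
    """Pick the MPI rank count for a launch.

    With an --agents slice, the slice length is authoritative and a
    contradicting explicit --agent-count is an error. Without a slice,
    the explicit value (possibly None) passes through unchanged.
    """
    if not agents:
        return agent_count
    if agent_count is not None and agent_count != len(agents):
        raise ValueError(
            f"--agent-count {agent_count} contradicts --agents "
            f"({len(agents)} agents selected)"
        )
    return len(agents)
```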
runtime/daemon.py
- Drops the private ``_parse_agents_arg`` helper in favor of the
shared ``parse_agents_selection`` import.
cli/main.py
- ``academy spawn-site`` subcommand registered. Implementation is a
thin shell over ``compute_main`` that prepends ``--no-bootstrap``
if the operator didn't already include it -- ``spawn-site`` is
semantically ``run-compute`` with bootstrap disabled and an agent
slice required.
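The "thin shell" idea is simple enough to show in a few lines. ``spawn_site_argv`` is a hypothetical helper name; the sketch only covers the prepend step and does not attempt to enforce that ``--agents`` is present.

```python
def spawn_site_argv(argv: list) -> list:
    """Return the argv handed to the compute launcher for spawn-site."""
    args = list(argv)  # never mutate the caller's list
    if "--no-bootstrap" not in args:
        args.insert(0, "--no-bootstrap")
    return args
```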
Tests (+11, 30 -> 41 in the two touched suites; 63/63 across full
academy sweep)
- parse_agents_selection: trimming, empty segments, None / "" input
- filter_agents: order preservation, unknown-name rejection,
empty-selection rejection, duplicate-name rejection
- prepare_compute_launch: derives agent_count from --agents, rejects
contradicting --agent-count
- run_allocation: --agents and --no-bootstrap are forwarded when
set, omitted when default (so single-machine flow is byte-
identical)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…egistration
Replace the shared-filesystem JSON file
(``<run_dir>/academy_registrations.json``) with exchange-mediated
discovery. The old mechanism required rank 0 to register every agent
on the campaign and write the resulting registrations to disk for the
other ranks to pick up. That works for a single allocation on a
single FS; it cannot span machines, which blocks federated
``spawn-site`` campaigns spread across Aurora + Crux + ...
New flow
- Each rank registers ONLY its own local agent via
``transport.register_agent(ChemGraphLogicalAgent, name=...)``.
- Each rank discovers cross-rank / cross-site peers by polling
``transport.discover()`` with a wall-clock timeout, filtering the
returned ``AgentId`` tuple client-side by ``AgentId.name``.
- No rank-0 special role for registration. Convergence is per-site:
each rank exits the discovery loop as soon as its own
``allowed_peers`` are all visible on the exchange, regardless of
what other ranks / sites are doing.
- ``bootstrap_message_dispatched`` rule simplified to
``initial_agent == agent_spec.name`` (instead of "name in
registrations dict"); semantically identical for single-machine
runs, correct for federated runs.
runtime/registration.py: gutted and rewritten. Old surface area
(``load_academy_registrations``, ``write_academy_registrations``,
``wait_academy_registrations``, ``registration_payload``,
``academy_registration_path``, ``_REGISTRATION_TYPES``,
``_exchange_type_of``) deleted in favor of a single async helper
``discover_peer_agent_ids(transport, peer_names, *, agent_class,
timeout_s, poll_interval_s)``. Returns ``dict[name, AgentId]`` for
``Handle`` construction. Times out with a message listing the
missing peer names so operators can immediately tell which site
failed to register.
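The poll-until-visible loop described above can be sketched against a fake transport. Everything here is a stand-in: ``FakeAgentId`` and ``FakeTransport`` are hypothetical, the real ``transport.discover()`` takes arguments this sketch omits, and ``poll_for_peers`` only mirrors the documented behavior of ``discover_peer_agent_ids`` (client-side name filter, first-found-wins, timeout message naming the missing peers).

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeAgentId:  # hypothetical stand-in for Academy's AgentId
    name: str
    uid: str


class FakeTransport:
    """Returns pre-configured discover() rounds; repeats the last one."""

    def __init__(self, rounds):
        self._rounds = list(rounds)
        self.calls = 0

    async def discover(self):
        self.calls += 1
        if len(self._rounds) > 1:
            return self._rounds.pop(0)
        return self._rounds[0] if self._rounds else ()


async def poll_for_peers(transport, peer_names, *, timeout_s, poll_interval_s):
    wanted = set(peer_names)
    found = {}
    if not wanted:
        return found  # short-circuit: no discover() call at all
    deadline = time.monotonic() + timeout_s
    while True:
        for agent_id in await transport.discover():
            # Client-side filter; first sighting of a name wins.
            if agent_id.name in wanted and agent_id.name not in found:
                found[agent_id.name] = agent_id
        missing = sorted(wanted - set(found))
        if not missing:
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError(f"peers never appeared on the exchange: {missing}")
        await asyncio.sleep(poll_interval_s)
```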
runtime/daemon.py: registration block + bootstrap dispatch reworked
to match the new flow. Code shrinks: the rank-0 / rank-N branch is
gone; the post-block "if rank == 0: reload registrations" hack is
gone; ``registrations`` dict and its key lookups replaced with
``registration`` (own) plus ``peer_agent_ids`` (discovered).
observability/run_artifacts.py: ``clear_run_outputs`` no longer
deletes the dead ``academy_registrations.json`` filename.
tests/test_academy_exchange_registration.py: file-based round-trip
tests removed (their target functions no longer exist). Replaced
with discovery-helper tests against a ``_FakeTransport`` whose
``discover()`` returns pre-configured rounds:
* empty peer list short-circuits without any discover() calls
* happy path returns name -> AgentId for requested peers only,
even when discover() also returns other agents (cross-operator
isolation depends on this filter)
* waits across multiple polls for late peers (the federated
convergence story)
* times out with the missing peer names in the message
* first-found-wins for a re-seen peer name across polls
Run-id name-prefixing remains an operator-runbook convention until
auto-prefixing lands; without it, two operators running concurrent
demos against the same hosted exchange would see each other's
agents in their ``discover()`` results.
Tests: 62/62 academy sweep (was 63; net -1 because the parametrized
file-round-trip test was 4 cases and the replacement is 4 helpers +
1 short-circuit case).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The federated piece the operator runs after every site is up. In a
single-machine campaign, rank 0 of the daemon dispatches the
``campaign`` -> ``initial_agent`` bootstrap message in-process; in a
federated campaign that's impossible because the agent that owns
``initial_agent`` may live on a different machine that hasn't even
come up yet. ``spawn-site`` already skips the inline dispatch
(``--no-bootstrap``). This commit adds the matching standalone
command that triggers the kickoff at the right moment from anywhere
with the cached Globus token.
UX:
    chemgraph academy bootstrap -- \
        --campaign federated-demo.jsonc \
        --exchange-type http
    # or override the recipient for partial re-runs / debugging
    chemgraph academy bootstrap -- \
        --campaign federated-demo.jsonc \
        --recipient worker-a \
        --exchange-type http
runtime/bootstrap.py (new)
- ``parse_args``: --campaign (required), --recipient (defaults to
campaign.initial_agent), --exchange-type (defaults to 'http' since
that's the main use case), --http-exchange-url override, redis
triple for the local-broker case, --discover-timeout-s.
- ``dispatch_bootstrap``: opens a user client on the configured
exchange, discovers the recipient by name via the shared
``discover_peer_agent_ids`` helper, sends one
``receive_message`` action, closes the client (also on error so
the aiohttp session backing the http transport doesn't leak).
- ``main``: returns exit code 2 with a stderr message when the
recipient never appears on the exchange, so wrapping shell
scripts can branch on "bootstrap didn't actually happen."
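The exit-code contract for ``main`` can be shown in isolation. This is a sketch under assumptions: the ``dispatch`` callable stands in for the real argument parsing plus ``dispatch_bootstrap`` call, and the message wording is invented for illustration.

```python
import sys
from typing import Callable, TextIO


def main(dispatch: Callable[[], None], stderr: TextIO = sys.stderr) -> int:
    """Run the bootstrap dispatch and map a discovery timeout to exit code 2."""
    try:
        dispatch()
    except TimeoutError as exc:
        # The exception message is expected to name the missing recipient.
        print(f"bootstrap not sent: {exc}", file=stderr)
        return 2
    return 0
```

A wrapping shell script can then test for exit status 2 specifically to tell "recipient never appeared" apart from other failures.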
cli/main.py
- ``academy bootstrap`` subparser + dispatch in ``_handle_academy``.
- Usage hint updated to include the new command.
Tests (6 new, 68/68 academy sweep)
- parse_args: --campaign required, exchange-type default, recipient
override
- dispatch_bootstrap: happy-path discovery + handle action, sender /
recipient / message_id consistency, campaign user_task embedded in
the dispatched content, client closed on success
- dispatch_bootstrap: client closed on timeout too (no Handle
construction attempted when discovery fails)
- main: returns 2 on TimeoutError and writes the missing recipient
name to stderr
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The piece that lets one Mac terminal serve a federated campaign
running across multiple HPC sites. Operator runs ONE dashboard
command with ``--system aurora,crux``; the launcher spins up
per-site SSH tunnels + UAN relays + rsync mirrors, and the server
walks the merged subdir tree to render one timeline tagged by site.
runtime/dashboard_launcher.py
- ``--system`` now takes a comma-list ('aurora,crux'). Single-site
invocations are unchanged (the value resolves to a 1-tuple and
the rest of the launcher uses the same per-site setup helper).
- Per-site setup extracted into ``_setup_site`` returning a
``_SiteHandle`` carrying everything the cleanup finally needs.
``main`` loops over the resolved tuple; failure on any site
triggers teardown of the partially-set-up sites.
- Each site gets its own reverse-port (base + site_index) so two
SSH ``-R`` tunnels don't collide on the Mac side.
- Multi-site mode rejects scalar overrides that can't sensibly
apply to every site (--remote-host, --ssh-control-path,
--relay-port, --lm-base-url, --local-run-dir). Operators encode
site differences in the profile JSON instead.
- Single-site mirror layout unchanged
(``<root>/<run_id>/``); multi-site mirrors under
``<root>/<run_id>/<system>/``.
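The three layout rules in this list (comma-list parsing, one reverse port per site, and the single-site versus multi-site mirror path) fit in a short sketch. The function names other than the behavior they illustrate are hypothetical, and the base port used in the usage note is only an example value.

```python
from pathlib import Path


def parse_systems_list(raw: str) -> tuple:
    """Split 'aurora, crux' into a tuple; reject empty and duplicate lists."""
    names = tuple(part.strip() for part in raw.split(",") if part.strip())
    if not names:
        raise ValueError("--system needs at least one site name")
    if len(set(names)) != len(names):
        raise ValueError(f"duplicate site names: {names}")
    return names


def site_reverse_port(base_port: int, site_index: int) -> int:
    """Each site gets base + index so SSH -R tunnels never share a port."""
    return base_port + site_index


def mirror_dir(root: str, run_id: str, systems: tuple, system: str) -> Path:
    """Single-site keeps <root>/<run_id>/; multi-site adds a <system>/ level."""
    run_root = Path(root) / run_id
    return run_root if len(systems) == 1 else run_root / system
```

For example, with a base port of 18186 and ``--system aurora,crux``, the sketch assigns 18186 to aurora and 18187 to crux, and mirrors each site under its own subdirectory of the run.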
dashboard/server.py
- ``_iter_site_dirs`` detects layout: if ``events.jsonl`` is at the
top level it's legacy single-site; otherwise walk subdirs and
treat each as a site if it has ``events.jsonl`` OR
``dashboard_metadata.json``. The metadata check covers the
early-startup window where a site is up but no events have been
written yet, so federated dashboards don't briefly look like
"empty single-site".
- ``events_payload``: legacy shape preserved for single-site;
federated merges sites in timestamp order with a ``site`` tag on
each event so the UI can color/group per-site.
- ``status_payload``: legacy keys preserved for single-site;
federated nests per-site status/placement/summary under
``sites: {<name>: ...}`` with a top-level ``updated`` reflecting
the latest per-site update.
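A rough sketch of the layout detection and the merged event stream, assuming each event is a JSON object per line with a sortable ``timestamp`` field (the field name is an assumption). The function names drop the leading underscore to signal that they are illustrations, not the server's code.

```python
import json
from pathlib import Path


def iter_site_dirs(run_dir: Path) -> list:
    """Return [(site_name_or_None, path)]; None marks the legacy layout."""
    if (run_dir / "events.jsonl").exists():
        return [(None, run_dir)]
    sites = []
    for child in sorted(p for p in run_dir.iterdir() if p.is_dir()):
        has_events = (child / "events.jsonl").exists()
        has_metadata = (child / "dashboard_metadata.json").exists()
        if has_events or has_metadata:  # metadata covers early startup
            sites.append((child.name, child))
    return sites or [(None, run_dir)]  # empty dir: fall back to single-site


def merged_events(run_dir: Path) -> list:
    """Load events; in federated layout, tag by site and sort by timestamp."""
    dirs = iter_site_dirs(run_dir)
    federated = any(site is not None for site, _ in dirs)
    events = []
    for site, path in dirs:
        log = path / "events.jsonl"
        if not log.exists():
            continue
        for line in log.read_text().splitlines():
            if not line.strip():
                continue
            event = json.loads(line)
            if site is not None:
                event["site"] = site
            events.append(event)
    if federated:
        events.sort(key=lambda event: event["timestamp"])
    return events
```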
Tests (+10, 78/78 academy sweep)
- _iter_site_dirs: recognizes metadata-only sites; falls back to
single-site for empty dirs
- events_payload: merges + tags by site; timestamp-sorted output
even when sites are seeded reverse-order
- status_payload: nests under ``sites`` for federated, preserves
legacy keys for single-site (regression guard against an
accidental "make them uniform" refactor)
- _parse_systems_list: single name, comma-list with whitespace,
rejects empty, rejects duplicates
Aurora ⇄ Crux demo runbook (operator runs once both sites have a
system profile in the repo):
    # Mac terminal A
    chemgraph academy dashboard -- federated-demo-001 \
        --system aurora,crux --campaign federated-demo.jsonc
    # Aurora compute
    chemgraph academy spawn-site -- \
        --system aurora --campaign federated-demo.jsonc \
        --agents coordinator-agent --exchange-type http
    # Crux compute
    chemgraph academy spawn-site -- \
        --system crux --campaign federated-demo.jsonc \
        --agents worker-a --exchange-type http
    # Mac terminal B
    chemgraph academy bootstrap -- \
        --campaign federated-demo.jsonc --exchange-type http
(Crux profile JSON is still TODO -- pre-requisite for the actual
demo, not for the dashboard code.)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lidation
The non-code artifacts that turn the federation primitives shipped
in B.1-B.4c into an actually runnable Aurora ⇄ Crux demo.
runtime/profiles/crux.template.json (new)
- Mirrors polaris.template.json (/eagle paths) but with Crux-specific
bits: separate venv directory name (academy-swarm-crux) so it
doesn't collide with the existing x86_64 Polaris venv on the same
/eagle workspace; -crux suffix on the relay host file so per-site
relays in the multi-site dashboard don't fight over the same path.
- Registered in profiles/__init__.py BUILTIN_SYSTEM_PROFILES so
``chemgraph academy spawn-site --system crux`` and the multi-site
dashboard launcher both recognize it.
- Same unset_env policy as Aurora/Polaris -- proxies stripped by
default for the LM-relay path; the launcher's exchange_type=='http'
branch already overrides this so http exchange works via the
ALCF proxy (proxy reachability empirically verified on Crux
compute today).
campaigns/federated-hello/ (new)
- Two agents (agent-aurora, agent-crux), each declaring the other as
its only allowed peer. No MCP servers, no resources, no science
tools -- the smallest possible end-to-end campaign that exercises
cross-site discovery + cross-site send_message + LM-driven
decision turns. ~$0.01-0.05 of GPT-5-mini calls per run.
- agent-aurora's mission: send ONE 'hello from aurora' to agent-crux,
finish_turn, then finish_turn on every subsequent wakeup.
- agent-crux's mission: wait, reply ONCE, finish_turn. Strong
anti-loop guidance in both missions + the prompt profile.
- prompt_profiles/default.json: tight system + protocol prompts that
explicitly say "no science tools, only send_message and
finish_turn." langchain_recursion_limit=32 since neither agent
should ever loop more than a handful of rounds.
- lm_config.json: GPT-5-mini template (no temperature field, since
reasoning models reject non-default values -- the launcher's
auto-strip would handle it but cleaner to just omit).
- Registered under 'federated-hello' in CAMPAIGNS +
CAMPAIGN_LAUNCH_DEFAULTS so ``--campaign federated-hello`` works
as a packaged name (no rsync of the campaign dir required).
core/campaign.py: validate_campaign(*, federated=False)
- New keyword-only flag loosens two single-machine assumptions that
break in federated spawn-site flows:
* initial_agent may name an agent hosted on another site
* each agent's allowed_peers may reference cross-site agents
Both are looked up via the exchange at runtime, so the validator
legitimately can't pre-check them in a federated slice.
- Intra-slice checks (duplicate names, self-peer, MCP server / tool
/ resource resolvability) still run. Self-peer in particular
stays a hard error because it would loop messages regardless of
how many sites the campaign spans.
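The relaxation can be sketched on a simplified model where a campaign slice is just a mapping from agent name to its allowed peers. This is an illustration under that assumption: the real ``validate_campaign`` takes a campaign object and also checks MCP servers, tools, and resources, none of which appear here.

```python
def validate_campaign(
    agents: dict[str, list[str]],
    initial_agent: str,
    *,
    federated: bool = False,
) -> None:
    """Validate a (possibly sliced) campaign; raise ValueError on problems.

    `agents` maps each locally declared agent name to its allowed peers.
    """
    for name, peers in agents.items():
        # Self-peer stays a hard error in both modes.
        if name in peers:
            raise ValueError(f"agent {name!r} lists itself as a peer")
        if not federated:
            unknown = [peer for peer in peers if peer not in agents]
            if unknown:
                raise ValueError(f"agent {name!r} has unknown peers: {unknown}")
    # In federated mode the initial agent may live on another site.
    if not federated and initial_agent not in agents:
        raise ValueError(f"initial_agent {initial_agent!r} is not declared")
```

Called on the Aurora slice of a two-agent campaign, strict mode rejects the reference to the remote peer while ``federated=True`` accepts it, which is the pair of regression guards the test list describes.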
runtime/daemon.py
- Passes federated=bool(config.agents) to validate_campaign. The
presence of an --agents slice is the canonical indicator of
"I'm one site of a federated launch." Single-machine
run-compute flows pass federated=False (the default), so prior
behavior is byte-identical.
Tests (+2, 80/80 academy sweep green; was 78)
- validate_campaign federated=True accepts the cross-site peer
reference in a federated-hello slice that strict validation
rejects (regression guard for the relaxation).
- validate_campaign federated=True still rejects self-peer
(regression guard against accidentally relaxing too much).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The minimum-scope UI work that makes the federation story visible
in the dashboard. Without it the merged event payload (B.4c.2)
landed on a UI that displayed everything as if it were a single
machine -- timeline still rendered, agent graph still rendered, but
operators / demo viewers had no visual cue that the campaign was
spanning multiple HPCs.
What the UI now shows in federated runs:
* Header bar: "Sites: aurora (1🤖 / 12📨) · crux (1🤖 / 8📨)" so
the multi-site nature is immediately legible from the top of
the dashboard.
* Agent-graph swimlanes labelled by site ("aurora", "crux")
instead of by individual compute hostnames ("x4708...",
"x1000...") -- same nodes, same edges, far clearer story.
* Message-flow detail panel: route is labelled "cross-site"
(federated) or "cross-node" (single-machine) depending on
context, with "From site" / "To site" rows showing aurora vs
crux. The literal hostname is still available in each agent's
detail panel.
* Cross-node-messages metric becomes meaningful as "messages that
crossed the HPC boundary" in federated runs.
Single-site runs are visually byte-identical to before:
``snapshot.federated`` is false so ``agentGroup`` falls back to
``agentHost``, the sitesBadge stays hidden, route labels stay
"cross-node" / "same-node", detail rows stay "From host" /
"To host". Test suite (80/80) confirms server-side payload shape
is unchanged for single-site.
Implementation
- ``load()``: detect ``snapshot.sites`` (set by server-side
``_iter_site_dirs`` in B.4c.2), set ``snapshot.federated``,
build a flat ``sitesByAgent`` index from
``sites[*].status.agents`` and ``sites[*].placement.agents``,
backfilled from per-event ``site`` tags as authoritative.
- ``agentSite(agent)`` / ``agentGroup(agent)``: the single point
where federated vs single-site rendering diverges. Every
renderer that asks "what bucket does this agent belong to" now
goes through ``agentGroup`` instead of ``agentHost``.
- ``renderSitesBadge()``: header-bar federation indicator with
per-site agent counts and per-site event counts.
- Three message-route detail panels updated to label by group
rather than hardcoding "host", and to show "cross-site" in
federated mode.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For accounts whose SSH login differs from the workspace directory
name. ALCF_USER drives every path interpolation
(``/flare/${ALCF_PROJECT}/${ALCF_USER}/``) while ALCF_SSH_USER drives
only ``remote_host`` (``${ALCF_SSH_USER}@aurora.alcf.anl.gov``). The
two collided on a single env var until now, forcing operators to
choose: set ALCF_USER for paths and get the wrong SSH login (which
triggered an ALCF Cyber security challenge on Aurora), or set it
right for SSH and have all the run-dir / venv paths point at a
non-existent directory.
The relevant operator on this repo has SSH login ``jinchuli`` but
their Aurora/Crux/Polaris workspace lives under
``/{flare,eagle}/<proj>/jinchu/`` (no trailing 'i'), so the
ALCF_USER=jinchu setting was producing the right paths but the
wrong SSH user. Now they set ALCF_USER=jinchu for paths and
ALCF_SSH_USER=jinchuli for SSH and both work.
Default ALCF_SSH_USER to ALCF_USER when unset, so the majority of
users for whom the two are equal don't have to set both.
system.py
- New ``_expand_with(text, env)`` does ``os.path.expandvars``-style
substitution against a caller-supplied env dict rather than the
process environment, so the SSH-USER default doesn't leak into
``os.environ`` for subsequent callers.
- ``load_system_profile`` copies the environ, fills in the default,
and substitutes through ``_expand_with``.
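One way to get ``expandvars``-style substitution against a private dict is ``string.Template.safe_substitute``, which leaves unknown ``${VAR}`` placeholders in place. This is a sketch, not necessarily how ``_expand_with`` is written; note that ``Template`` also treats ``$$`` as an escaped dollar sign, which ``os.path.expandvars`` does not. ``profile_env`` is a hypothetical helper name.

```python
from string import Template
from typing import Mapping


def expand_with(text: str, env: Mapping[str, str]) -> str:
    """Substitute $VAR / ${VAR} from `env`, leaving unknown names untouched."""
    return Template(text).safe_substitute(env)


def profile_env(environ: Mapping[str, str]) -> dict:
    """Copy the environment and default ALCF_SSH_USER to ALCF_USER.

    Working on a copy keeps the default out of os.environ.
    """
    env = dict(environ)
    if "ALCF_SSH_USER" not in env and "ALCF_USER" in env:
        env["ALCF_SSH_USER"] = env["ALCF_USER"]
    return env
```

A caller would typically pass ``os.environ`` into ``profile_env`` once per profile load and then run every profile string through ``expand_with``.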
profiles/{aurora,crux,polaris}.template.json
- ``remote_host`` now interpolates ``${ALCF_SSH_USER}``; every
other field still uses ``${ALCF_USER}`` for the path component.
Tests: 80/80 academy sweep still green. Default-case behavior (both
env vars equal) is byte-identical to the prior single-var setup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ide; crux relay port 18187
Two operational fixes from the Aurora<->Crux federated demo.
compute_launcher.py
- Default startup_timeout_s 120s -> 600s. The realistic worst case for
  federated launches is one site's HPC queue wait + Python imports
  outpacing the other site's peer-discovery patience; 120s is clearly
  too short. 600s comfortably absorbs debug-scaling / workq schedule
  delays. Single-machine launches reach discover_peer_agent_ids in
  seconds so the new ceiling never matters for them.
- New --startup-timeout-s CLI flag so operators can extend the window
  further when they know a site will be slow.
profiles/crux.template.json
- Bump relay_port 18186 -> 18187 to dodge a leftover ssh -R
  reverse-tunnel that's still bound to 127.0.0.1:18186 on
  crux-uan-0001 from a prior failed dashboard launch. Follow-up
  cleanup: launcher should probe for a free port instead of insisting
  on the profile's hardcoded one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The daemon was opaque during its slow stretches -- import,
registration, peer discovery, runtime entry, and waiting for the
bootstrap message all happened silently from the operator terminal.
You could not distinguish "still importing" from "stuck on discovery"
from "alive and waiting for bootstrap" without tailing events.jsonl or
checking the dashboard.
Add landmark prints, all grep-able as ``[daemon]`` or
``[agent <name>]``:
daemon.py
- ``[daemon] rankN registered <name> on the exchange`` --
  own-registration completed; next step is peer discovery
- ``[daemon] rankN discovering peers [...] (timeout 600s)...`` --
  entering the wait
- ``[daemon] rankN discovered N peer(s): [...]`` -- past discovery,
  about to enter Runtime
- ``[daemon] rankN agent <name> is now running inside Academy
  Runtime`` -- agent is alive and listening
- ``[daemon] rankN dispatched inline bootstrap to <initial>`` /
  ``... skipping inline bootstrap (federated mode); waiting for
  chemgraph academy bootstrap ...`` so the operator knows whether to
  fire the standalone bootstrap subcommand
core/agent.py
- ``[agent <name>] first message arrived from <sender> (kind=...):
  <tldr>`` on the FIRST inbox message. For the federated demo the
  recipient agents both print this -- agent-aurora when bootstrap
  lands, agent-crux when the hello arrives. Concrete "kickoff worked"
  signal without needing the dashboard.
All prints use flush=True so they survive PALS/MPICH buffering when
mpiexec is forwarding stdout from many ranks simultaneously.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dezvous
The B.4 federated demo kept timing out at discovery -- both sites
registered their agents, neither could find the other. Root cause
turned out to be that Academy's hosted HttpExchange strips
``AgentId.name`` from ``discover()`` responses: only ``uid`` and
``role`` round-trip. Our name-based filter ``if agent_id.name in
wanted`` was silently never matching across sites because every
discovered AgentId came back with name=None. (The original ChemGraph
test suite missed this because it used ``AgentId.new('worker-a')``
fakes that preserve the name, which the real hosted exchange does
not.)
Replacement: deterministic UIDs.
registration.py
- ``deterministic_agent_uid(run_id, agent_name)`` derives a stable
uuid5 from a fixed namespace + ``"{run_id}/{agent_name}"``. Same
inputs on every site produce the same UID, so each rank
constructs every peer's AgentId locally instead of needing
``discover()`` to echo the name back.
- ``deterministic_agent_id(run_id, agent_name)`` builds the full
AgentId with the local name preserved (for trace events) and
the deterministic UID.
- ``register_agent_with_uid(transport, agent_class, agent_id)``
bypasses the SDK's ``register_agent`` (which always generates a
random UID via ``AgentId.new``) and POSTs the pre-built
deterministic AgentId directly to the same mailbox endpoint.
- ``wait_for_peers_alive(transport, peer_ids, ...)`` replaces
``discover_peer_agent_ids``. Matches on UID (preserved by
discover()) instead of name (stripped). Times out with a
message listing missing peer names+UIDs.
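The deterministic-UID derivation is plain ``uuid.uuid5``. The sketch below shows the idea; the namespace value is an assumption made for illustration (any fixed UUID works as long as every site uses the same one), and the real helper's namespace is not stated in the commit message.

```python
import uuid

# Assumption: an arbitrary but fixed namespace shared by every site.
_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/chemgraph-academy")


def deterministic_agent_uid(run_id: str, agent_name: str) -> uuid.UUID:
    """Same (run_id, agent_name) on any machine yields the same UUID."""
    return uuid.uuid5(_NAMESPACE, f"{run_id}/{agent_name}")
```

Because the UID depends only on the two strings, a rank on Aurora can compute the mailbox UID of an agent on Crux without any round trip, and a different run-id moves every agent to a different mailbox.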
daemon.py
- Imports + uses the new helpers. Each rank computes its own
AgentId deterministically and registers with it, then computes
every peer's AgentId locally and waits for the peer's mailbox
to be visible on the exchange. No "discover by name" anywhere.
- Runtime is still handed a real HttpAgentRegistration wrapping
the deterministic AgentId, so the agent runs unchanged.
bootstrap.py
- New ``--run-id`` required arg. The recipient's mailbox UID is
derived from (run-id, recipient-name); operator must pass the
same run-id they used for spawn-site or the bootstrap addresses
a different mailbox than the daemons registered.
- Bumped ``--discover-timeout-s`` default 120s -> 600s to match
spawn-site's startup_timeout_s.
- Uses ``deterministic_agent_id`` + ``wait_for_peers_alive``
instead of name-based discovery.
Side effect: agent names are now campaign-scoped via the run-id.
Two operators running the SAME campaign with the SAME run-id will
collide on the mailbox UIDs and the second registration will fail
with "mailbox already exists" -- correct fail-fast behavior. The
old run-id-prefixing convention from the original docstring is now
load-bearing rather than advisory.
Tests (+5, 85/85 academy sweep green)
- deterministic_agent_uid: stable; differs by run_id; differs by
agent_name
- deterministic_agent_id: name preserved locally
- wait_for_peers_alive: empty list short-circuits; succeeds when
all UIDs present (with names stripped, mirroring the real
exchange response); waits across polls for late peers; times
out naming missing UIDs; ignores unrelated agents
- bootstrap: requires --run-id; defaults discover-timeout to 600s;
sends to deterministic recipient AgentId; closes client on
timeout; main() returns 2 with stderr message
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…event names
The skip-trace I added in 5549dbb (operator-visible daemon lifecycle
prints) writes a system trace with event name
bootstrap_message_skipped, but that name was never added to the
CampaignEvent.event Literal enum. The pydantic validator rejected it,
crashing the daemon RIGHT AFTER the [daemon] ... is now running inside
Academy Runtime print. Cosmetic-but-fatal regression that the test
suite missed because no test exercises the skip-bootstrap code path
through append_system_trace -- the federated demo is the first place
this code path runs end to end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e context
handle.action reads the outbound exchange from a contextvar that is
only set when a client is active. Runtime sets it for daemon-side
code, but the standalone bootstrap command needs to set it explicitly
via client.register_handle(handle) -- otherwise Handle.action raises
ExchangeClientNotFoundError. The first federated demo attempt failed
here: discovery succeeded, the message was built, the Handle was
constructed -- and the action call died because the Handle did not
know which client to route through.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…le finds exchange
UserExchangeClient.__aenter__ is what sets the
academy.handle.exchange_context ContextVar that Handle.action reads to
find the outbound exchange. The prior register_handle-only fix binds
the handle for inbox routing but does NOT set the contextvar, so the
action call still raised ExchangeClientNotFoundError.
Restructure dispatch_bootstrap to run the whole send inside
``async with client:`` -- exchange_context is set on enter, restored
on exit. The aiohttp session gets closed by __aexit__, so the explicit
client.close() became redundant.
Test fixture _FakeClient is now an async-context-manager stand-in; the
two close-on-success / close-on-timeout assertions check
enter_count/exit_count instead.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
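The failure mode and the fix can be reproduced with nothing but ``contextvars``. This sketch does not use Academy at all: ``_exchange_context``, ``FakeClient``, and ``handle_action`` are hypothetical stand-ins for the ContextVar, ``UserExchangeClient``, and ``Handle.action`` named above.

```python
import asyncio
import contextvars

# Stand-in for the ContextVar the handle consults to find its client.
_exchange_context = contextvars.ContextVar("exchange_context")


class FakeClient:
    """Async context manager that publishes itself via the ContextVar."""

    def __init__(self):
        self.sent = []
        self.enter_count = 0
        self.exit_count = 0

    async def __aenter__(self):
        self.enter_count += 1
        self._token = _exchange_context.set(self)
        return self

    async def __aexit__(self, exc_type, exc, tb):
        _exchange_context.reset(self._token)  # restore the previous value
        self.exit_count += 1
        return False  # never swallow exceptions


async def handle_action(message):
    """Mimics an action call that needs an active client in context."""
    try:
        client = _exchange_context.get()
    except LookupError:
        raise RuntimeError("no exchange client in context") from None
    client.sent.append(message)


async def dispatch_bootstrap(client, message):
    # The whole send runs inside the context manager, so the ContextVar
    # is set for the action call and restored afterwards, even on error.
    async with client:
        await handle_action(message)
```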
…odels Ported from the synth branch. GPT-5* and o1/o3/o4* reject any non- default temperature with HTTP 400 'Unsupported value: temperature does not support 0.0'. Both ChatOpenAI construction sites (load_openai_model and agent/turn._custom_openai_compatible_kwargs) now consult is_reasoning_model() and drop temperature + the other sampling knobs when the model is one of those. Same module-level is_reasoning_model() helper as on the synth branch so a future merge stays mechanical. This was the last bug between the federated-hello demo daemons making their first LM call and completing the round trip. Both sites successfully discovered each other, received the bootstrap message, and entered their first reasoning round; the round crashed at the LM call because the demo uses GPT-5-mini. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
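A rough sketch of the filtering this commit describes. Only the name `is_reasoning_model()` comes from the commit message; the prefix list, the set of sampling keys, and the `chat_kwargs` helper are assumptions for illustration and may not match the real module.

```python
# Assumed from the commit message: GPT-5* and o1/o3/o4* are affected.
REASONING_PREFIXES = ("gpt-5", "o1", "o3", "o4")

# Assumed meaning of "the other sampling knobs"; the real list may differ.
SAMPLING_KEYS = ("temperature", "top_p", "frequency_penalty", "presence_penalty")


def is_reasoning_model(model_name: str) -> bool:
    """Return True when the model name starts with a known reasoning prefix."""
    return model_name.lower().startswith(REASONING_PREFIXES)


def chat_kwargs(model_name: str, **kwargs):
    """Hypothetical helper: drop sampling knobs for reasoning models."""
    if is_reasoning_model(model_name):
        return {k: v for k, v in kwargs.items() if k not in SAMPLING_KEYS}
    return dict(kwargs)
```

A prefix match like this would miss provider-qualified names such as `openai/gpt-5-mini`, so a real implementation may need to normalize the name first.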
…paign Federated-hello produced 2-3 events per site and the demo was over before the dashboard could render anything. Federated-chat is a back-and-forth counter game between agent-aurora and agent-crux: each turn one agent increments a counter and sends to the peer, until the counter hits 10. ~6 reasoning rounds per agent = ~40 events total in the merged dashboard timeline, plus visible message-flow with cross-site Route labels. Same two-agent shape as federated-hello so the same operator runbook works -- only --campaign federated-chat changes. Registered under 'federated-chat' name with max_decisions=20 slack for retries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tus to top level The server returns per-site state nested under snapshot.sites[<site>].status / .placement in federated mode, but agents() reads snapshot.status?.agents, so the dashboard rendered an empty graph + zeroed metrics for federated runs even though events were streaming through correctly. Synthesize merged top-level status/placement during load() so every existing single-site reader (agents, metrics, workflow mode detection) works unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…he port The relay had no signal handlers, so SIGTERM hit the default disposition and bash kill(1) calls were silently ignored. The python kept running, the port stayed bound, and the next launch failed with "Address already in use" -- requiring a manual UAN sweep to recover. Install SIGTERM/SIGINT handlers that close the listen socket and exit cleanly, with try/except around accept() so the close-from-handler path returns instead of tracebacking. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
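The handler-plus-guarded-accept pattern described here might look roughly like the following. The function names and the `state` dict are hypothetical; the relay's real structure is not shown in this PR text.

```python
import signal
import socket


def make_shutdown_handler(listen_sock: socket.socket, state: dict):
    """Build a signal handler that marks shutdown and frees the port."""

    def handler(signum, frame):
        state["stopping"] = True
        listen_sock.close()  # releases the bound port

    return handler


def install_handlers(listen_sock: socket.socket, state: dict) -> None:
    """Register the handler for SIGTERM and SIGINT (main thread only)."""
    handler = make_shutdown_handler(listen_sock, state)
    signal.signal(signal.SIGTERM, handler)
    signal.signal(signal.SIGINT, handler)


def serve(listen_sock: socket.socket, state: dict) -> int:
    """Accept connections until shutdown; return how many were handled."""
    handled = 0
    while not state["stopping"]:
        try:
            conn, _addr = listen_sock.accept()
        except OSError:
            # accept() fails once the handler has closed the socket;
            # treat that as a clean exit, re-raise anything else.
            if state["stopping"]:
                break
            raise
        conn.close()
        handled += 1
    return handled
```

Setting the flag before closing the socket keeps the `except` branch from mistaking a deliberate shutdown for a genuine socket error.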
…race and self-kill guard Aurora's login alias round-robins across uan-0001..uan-0010, so a single pid file on the shared FS was meaningless: the pid only exists on the UAN that ran the relay, and the next launch usually lands on a different UAN where that pid is either absent or belongs to someone else. As a result every crashed launch left an orphan relay holding the port, and manual ssh-into-each-UAN cleanup was the only recovery path. Replace the single-pid bookkeeping with per-UAN cleanup that scans ps for python processes whose argv contains the relay script path, owned by $USER, excluding $$ and $PPID. The self-exclusion is load- bearing: pgrep -f matched our own bash script (the relay script path appears in our argv as well), so the previous attempt killed the caller instead of the orphan. Also drop set -e (pgrep returning 1 was triggering silent exits with no log output) and add exec 2>&1 + set -x so the local relay log contains a full trace when something fails -- previously failures produced empty logs and "Local relay log:" with nothing after it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Conflicts (tests only, no src/ conflicts):
- tests/test_academy_dashboard.py
- tests/test_academy_exchange_registration.py
Both resolved by keeping dev-globus's `pytest.importorskip("academy")`
guard and dropping the academy.exchange.{local,hybrid,redis} imports
that this branch no longer uses (the deterministic-UID rendezvous
replaced them).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed-chat federated-chat covers the same Aurora<->Crux federation smoke path with more dashboard material to render, so the hello campaign is dead weight. Remove the campaign files + registry entries, and re-point the validate_campaign(federated=True) regression test + registration.py docstring example to federated-chat. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eholder Two related ergonomics fixes for spawn-site / run-compute launches: * When --lm-user is omitted, fall back to $ARGO_USER from the env. HPC users already export ARGO_USER for the rest of the ChemGraph workflow, so requiring a duplicate --lm-user flag was busywork. * _write_lm_config now refuses to ship lm_config.json with the template's literal "<argo-user>" placeholder. Argo would otherwise silently accept the launch and only reject at first LM call time, after the daemon + relay stack was already running -- expensive to debug. The hard error names the fix directly. Tests: stub HttpExchangeFactory in test_academy_exchange_registration so the http-dispatch tests don't try to authenticate against the real hosted exchange. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
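The fallback and the placeholder guard could be sketched together as below. `resolve_lm_user` is a hypothetical helper that folds both checks into one function; in the PR the placeholder check lives in `_write_lm_config`, whose real signature is not shown here.

```python
import os

PLACEHOLDER = "<argo-user>"  # literal placeholder from the config template


def resolve_lm_user(cli_value, env=None):
    """Prefer --lm-user, fall back to $ARGO_USER, reject the placeholder."""
    env = os.environ if env is None else env
    user = cli_value or env.get("ARGO_USER")
    if not user or user == PLACEHOLDER:
        # Fail before the daemon and relay stack start, and name the fix.
        raise ValueError(
            "LM user is not set: pass --lm-user or export ARGO_USER"
        )
    return user
```

Failing at config-write time is cheaper than discovering the problem at the first LM call, which is the motivation given in the commit.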
Four-terminal walkthrough for the cross-HPC demo: dashboard on Mac + spawn-site on Aurora + spawn-site on Crux + bootstrap kickoff. Mirrors the existing example-002 guide's shape, swaps in the federated flow (deterministic peer UIDs, HTTP exchange, ALCF proxy passthrough, Globus device-flow login). README links to the e2e guide and points at the packaged campaign location. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* dashboard_launcher.py: split `import os, shlex, ...` onto separate lines (E401). * mpi.py: drop unused `write_json_atomic` import (F401), left over from the file-based registration scheme that was deleted in 52fa7b5. Pre-existing ruff failures elsewhere in the repo (parsl_tools, mace_mcp_parsl, etc.) are not from this PR and untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds federated cross-HPC ChemGraph Academy campaigns. Two agents on
different HPCs discover each other via deterministic UIDs over
Academy's HTTP exchange, exchange messages, and render in one merged
dashboard driven from the operator's Mac.
End-to-end verified on Aurora + Crux:
`federated-chat` bounced a counter 1→10 between sites, ~230 events streamed live, zero errors.
How it works
Before this PR: campaigns were single-machine. The daemon used
Redis (started by rank 0) as the exchange, peer registrations were
written to a shared-FS file the other ranks polled, and rank 0
dispatched the kickoff message inline at the end of startup. Every
piece of that — Redis subprocess, shared FS, inline kickoff — assumes
one allocation on one machine. To cross HPCs, all three had to change:
the exchange must be reachable from both sites' compute nodes (no
shared Redis), peer rendezvous must happen without a shared FS, and
kickoff must wait until every site has come up.
Identity (no shared FS): each rank computes peer UIDs locally as
`uuid5(NS, "{run_id}/{agent_name}")`. Both sites derive the same UID
for each agent without any network lookup. `discover()` is used as a
UID-keyed liveness probe, not for name resolution, since the hosted
exchange strips names from discovery responses.
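The identity scheme above can be sketched with the standard library alone. The namespace value is an assumption: the PR only refers to it as `NS`, and any fixed UUID works as long as every site uses the same one. The run id and agent names in the test values are likewise illustrative.

```python
import uuid

# Hypothetical namespace; the PR does not state the real NS value.
NS = uuid.uuid5(uuid.NAMESPACE_DNS, "chemgraph.example")


def peer_uid(run_id: str, agent_name: str) -> uuid.UUID:
    """Derive a deterministic UID from the run id and agent name."""
    return uuid.uuid5(NS, f"{run_id}/{agent_name}")
```

Because `uuid5` is a pure function of the namespace and the name, two sites that agree on `run_id` and the agent names compute identical UIDs without exchanging any data.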
Exchange: `--exchange-type http` targets Academy's hosted exchange
(Globus-auth'd) instead of a per-allocation Redis. Both sites' compute
nodes talk to the same public endpoint, so messages cross HPC
boundaries without any direct site-to-site network path.
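For compute nodes to reach that public endpoint, the PR forwards proxy variables to MPI ranks through `mpiexec --genv`. A minimal sketch of how such flags might be assembled follows; the variable list and the helper name are assumptions, and the `--genvall` plus `--genv KEY=VAL` form is taken from the PR's own commit description rather than verified against the launcher's documentation.

```python
# Assumed set of proxy variables; the PR only says "proxy vars".
PROXY_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY", "no_proxy")


def proxy_genv_flags(exchange_type: str, env: dict) -> list:
    """Return extra mpiexec flags that forward proxy settings to ranks."""
    if exchange_type != "http":
        return []  # other exchange types keep the pre-PR command line
    flags = ["--genvall"]
    for key in PROXY_VARS:
        value = env.get(key)
        if value:
            # One explicit flag per variable that is actually set.
            flags += ["--genv", f"{key}={value}"]
    return flags
```

Returning an empty list for non-HTTP exchanges mirrors the compatibility claim that single-machine runs are unchanged.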
Bootstrap: in single-machine campaigns rank 0 dispatches the
kickoff message inline. In federated mode every spawn-site passes
`--no-bootstrap`; the operator runs `chemgraph academy bootstrap` once all
sites have registered, which sends the kickoff to `initial_agent` over
the same HTTP exchange.
Dashboard: `--system aurora,crux` brings up one SSH ControlMaster view.
Per-site `events.jsonl` are interleaved by timestamp; per-site
`status`/`placement` are merged to top-level keys so existing
single-site renderers work unchanged.
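The two dashboard merges described above can be sketched in Python, although the dashboard itself may be written in another language. The `timestamp` field name, the added `site` tag, and the shape of `status.agents` as a name-keyed mapping are assumptions; `placement` would presumably follow the same pattern as `status`.

```python
import heapq


def merge_events(per_site_events: dict) -> list:
    """Interleave per-site event lists (each already sorted) by timestamp."""
    tagged = (
        [dict(event, site=site) for event in events]
        for site, events in per_site_events.items()
    )
    return list(heapq.merge(*tagged, key=lambda event: event["timestamp"]))


def merge_status(snapshot: dict) -> dict:
    """Lift snapshot['sites'][site]['status'] into a top-level status key."""
    merged_agents = {}
    for site, state in snapshot.get("sites", {}).items():
        agents = state.get("status", {}).get("agents", {})
        for name, info in agents.items():
            merged_agents[name] = dict(info, site=site)
    # Existing single-site readers keep looking at snapshot['status'].
    return dict(snapshot, status={"agents": merged_agents})
```

Synthesizing the top-level key once at load time means the single-site readers need no federated-specific branches.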
See `examples/academy/federated-chat/e2e_guide.md` for the full
four-terminal walkthrough.
What's new
- `chemgraph academy bootstrap` standalone subcommand.
- `--exchange-type http` with proxy passthrough through `mpiexec --genv` so MPI ranks can reach the public exchange.
- … (`--system aurora,crux`).
- … process scan (handles Aurora's uan-0001..0010 round-robin alias).
- `federated-chat` packaged campaign + e2e guide.
Compatibility
Single-machine `run-compute` flow is byte-identical to pre-PR
behavior. Federated paths are gated on `--agents` / `--no-bootstrap` /
`--exchange-type http` being explicitly set.
Test plan
- `federated-chat` Aurora ↔ Crux (counter 1→10, ~230 events)
- … launcher, dashboard, deterministic UID properties