26 commits
103d716
chore(.gitignore): ignore *.private-local.* (local scratchpads)
JinchuLi2002 Jun 18, 2026
2bffa2a
feat(academy/runtime): add http exchange for cross-HPC messaging
JinchuLi2002 Jun 19, 2026
d5dd340
feat(academy/core+runtime): agent subsetting + skip-bootstrap support
JinchuLi2002 Jun 19, 2026
fcb8de2
feat(academy): spawn-site CLI subcommand for federated launches
JinchuLi2002 Jun 19, 2026
da3ae6f
refactor(academy/runtime): exchange-based peer discovery, drop file r…
JinchuLi2002 Jun 19, 2026
f6e492b
feat(academy): bootstrap subcommand for federated campaign kickoff
JinchuLi2002 Jun 19, 2026
e1492c6
feat(academy/dashboard): multi-site launcher + merged event view
JinchuLi2002 Jun 19, 2026
3893825
feat(academy): crux profile + federated-hello campaign + federated va…
JinchuLi2002 Jun 19, 2026
fdc257f
feat(academy/dashboard): per-site rendering for federated runs
JinchuLi2002 Jun 19, 2026
5549dbb
fix(academy/runtime/profiles): separate ALCF_SSH_USER from ALCF_USER
JinchuLi2002 Jun 19, 2026
e162f13
fix(academy/runtime): startup-timeout-s default 600s + operator overr…
JinchuLi2002 Jun 19, 2026
90b636f
feat(academy): operator-visible daemon lifecycle prints
JinchuLi2002 Jun 19, 2026
52fa7b5
fix(academy/runtime): deterministic peer UIDs for hosted-exchange ren…
JinchuLi2002 Jun 19, 2026
cd353e7
fix(academy/observability): add bootstrap_message_skipped to allowed …
JinchuLi2002 Jun 19, 2026
1b5b9f8
fix(academy/bootstrap): register_handle before action to bind exchang…
JinchuLi2002 Jun 19, 2026
4a4fa3d
fix(academy/bootstrap): enter client as async context manager so Hand…
JinchuLi2002 Jun 19, 2026
f84b57f
fix(academy/models): strip temperature for GPT-5/o-series reasoning m…
JinchuLi2002 Jun 19, 2026
547e647
feat(academy/campaigns): federated-chat multi-turn cross-HPC demo cam…
JinchuLi2002 Jun 19, 2026
5856e7d
fix(academy/dashboard): render federated runs by merging per-site sta…
JinchuLi2002 Jun 19, 2026
9bb441f
fix(academy/relay): handle SIGTERM/SIGINT so kill actually releases t…
JinchuLi2002 Jun 19, 2026
767a97e
fix(academy/relay): reap stale relays per-UAN before binding, with xt…
JinchuLi2002 Jun 19, 2026
92db9b7
Merge origin/dev-globus into academy-dynamic-campaign
JinchuLi2002 Jun 19, 2026
b51f189
chore(academy/campaigns): drop federated-hello, superseded by federat…
JinchuLi2002 Jun 19, 2026
b3b34bf
fix(academy/compute): default --lm-user from $ARGO_USER + reject plac…
JinchuLi2002 Jun 22, 2026
6e0198b
docs(academy): federated-chat e2e guide + README
JinchuLi2002 Jun 22, 2026
1af22be
style(academy): fix ruff lint violations introduced by this branch
JinchuLi2002 Jun 22, 2026
4 changes: 4 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -64,3 +64,7 @@ vib*.traj

# Kubernetes secrets (keep secrets.yaml.template, ignore actual secrets)
k8s/secrets.yaml

# Local private notes / scratchpads — anything matching *.private-local.* stays untracked
*.private-local.*
*.private-local
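
The two added patterns are not redundant. As a rough illustration, Python's `fnmatch` behaves like gitignore globbing for slash-free patterns such as these (an approximation, not Git's own matcher), and it shows that `*.private-local.*` needs a trailing extension, so a bare `notes.private-local` is only caught by the second pattern. The file names are made up for the example.

```python
from fnmatch import fnmatchcase

# The two patterns added to .gitignore in this change.
PATTERNS = ["*.private-local.*", "*.private-local"]


def matching_patterns(filename: str) -> list:
    """Return the patterns that match a bare file name (no directories)."""
    return [pattern for pattern in PATTERNS if fnmatchcase(filename, pattern)]


if __name__ == "__main__":
    for name in ("notes.private-local.md", "notes.private-local", "notes.md"):
        print(name, "->", matching_patterns(name))
```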
25 changes: 25 additions & 0 deletions examples/academy/federated-chat/README.md
@@ -0,0 +1,25 @@
# Federated-Chat

This example runs two ChemGraph Academy logical agents on **different HPCs**
(Aurora and Crux). The agents discover each other through Academy's hosted
HTTP exchange and exchange messages across the HPC boundary:

```text
agent-aurora (running on Aurora)
agent-crux (running on Crux)
```

The agents play a counter-bouncing game: agent-aurora sends `counter=1`
to agent-crux, agent-crux replies `counter=2`, and so on until the
counter reaches 10. The campaign is tiny on purpose: it exercises the
federated stack (deterministic peer UIDs, HTTP exchange, multi-site
dashboard) without needing any science tools.

The campaign assets are packaged under:

```text
src/chemgraph/academy/campaigns/federated-chat/
```

See [`e2e_guide.md`](e2e_guide.md) for the full four-terminal walkthrough
(dashboard + Aurora compute + Crux compute + bootstrap kickoff).
179 changes: 179 additions & 0 deletions examples/academy/federated-chat/e2e_guide.md
@@ -0,0 +1,179 @@
# Federated-Chat E2E Guide

This guide runs the `federated-chat` ChemGraph Academy campaign across
**two HPCs simultaneously** (Aurora and Crux), with the dashboard on
your laptop merging both sites into one view.

The campaign is intentionally minimal: two agents bounce a counter back
and forth across the HPC boundary, each incrementing it, until it hits 10.
It exercises every part of the cross-HPC stack (deterministic peer
discovery, HTTP exchange, cross-site send_message, multi-site dashboard)
without needing any science tools.

```text
agent-aurora                      agent-crux
  ↓ counter=1     ──►             receive
  receive         ◄── counter=2     ↓
  ↓ counter=3     ──►             receive
  ...                             ...
  receive         ◄── counter=10    ↓
  finish_turn                     finish_turn
```
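
The exchange above can be replayed offline to check the message count. The following is a minimal sketch of the game's rule only (parse `counter=N`, reply `counter=N+1` while `N < 10`); the `Message` class and function names are hypothetical and do not come from ChemGraph or Academy.

```python
from dataclasses import dataclass
from typing import List, Optional

LIMIT = 10  # termination threshold stated in the campaign


@dataclass
class Message:
    sender: str
    recipient: str
    content: str


def reply_to(me: str, peer: str, received: str) -> Optional[Message]:
    """Apply the mission rule: reply counter=N+1 while N < LIMIT, else stop."""
    n = int(received.partition("=")[2])
    if n >= LIMIT:
        return None  # game over: send nothing, just finish_turn
    return Message(me, peer, f"counter={n + 1}")


def simulate() -> List[Message]:
    """Replay the whole game, starting from agent-aurora's bootstrap send."""
    log = [Message("agent-aurora", "agent-crux", "counter=1")]
    while True:
        last = log[-1]
        reply = reply_to(last.recipient, last.sender, last.content)
        if reply is None:
            return log
        log.append(reply)


if __name__ == "__main__":
    for message in simulate():
        print(f"{message.sender} -> {message.recipient}: {message.content}")
```

Under these rules the run produces ten one-way messages: agent-aurora sends the odd counters and agent-crux the even ones, so agent-crux sends the final `counter=10`.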

The walkthrough uses four terminals: dashboard (Mac), Aurora compute, Crux compute, and bootstrap (Mac).

## Configure Paths

Set these in every shell (Mac + both HPCs):

```bash
export ALCF_PROJECT=ChemGraph
export ALCF_USER=<shared-filesystem-user> # e.g. jinchu
export ALCF_SSH_USER=<ssh-login> # may differ, e.g. jinchuli
export ARGO_USER=<argo-user> # e.g. jinchu.li
export LOCAL_CHEMGRAPH=<local-chemgraph-checkout>
```

`ALCF_USER` is the shared-filesystem path component
(`/flare/$ALCF_PROJECT/$ALCF_USER` on Aurora, `/eagle/$ALCF_PROJECT/$ALCF_USER` on Crux).
`ALCF_SSH_USER` is the SSH login. The two may differ; the loader defaults
`ALCF_SSH_USER` to `ALCF_USER` if you don't set it.
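
The fallback can be pictured as below. This is only a sketch of the described behavior, not the loader's actual code: the function name is hypothetical, and treating an empty `ALCF_SSH_USER` the same as an unset one is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_ssh_user(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the SSH login, falling back to ALCF_USER when ALCF_SSH_USER is unset."""
    env = os.environ if env is None else env
    alcf_user = env.get("ALCF_USER", "")
    # Assumption: an empty ALCF_SSH_USER counts as "not set".
    ssh_user = env.get("ALCF_SSH_USER") or alcf_user
    if not ssh_user:
        raise ValueError("set ALCF_USER (and optionally ALCF_SSH_USER)")
    return ssh_user


if __name__ == "__main__":
    print(resolve_ssh_user({"ALCF_USER": "jinchu"}))
    print(resolve_ssh_user({"ALCF_USER": "jinchu", "ALCF_SSH_USER": "jinchuli"}))
```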

## One-Time Setup

You need the same setup as `example-002-mace-ensemble-screening` (sync
ChemGraph, install `[academy]` extra, build Redis once) on **both** Aurora
and Crux. There is one extra step: log in to Academy's hosted exchange from
each compute environment so that the Globus token is cached on both:

```bash
# On Aurora compute (inside an interactive allocation):
python -c "from academy.exchange.cloud import HttpExchangeFactory; HttpExchangeFactory()"
# Follow the device-flow URL printed in the terminal. Same on Crux.
```

The token is written to `~/.local/share/academy/storage.db` and is
shared across runs.

## Terminal 1: Dashboard (Mac)

```bash
cd "$LOCAL_CHEMGRAPH"

export RUN_ID=federated-chat-001

chemgraph academy dashboard -- "$RUN_ID" \
  --system aurora,crux \
  --campaign federated-chat \
  --reverse-port 18190 \
  --overwrite-run
```

This brings up:

- one SSH ControlMaster + UAN relay + rsync mirror **per site** (`aurora` and `crux`),
- a single merged dashboard server at `http://127.0.0.1:8765`.

Wait for both relays to print `... relay ready at ...` before continuing.

## Terminal 2: Aurora compute (inside Aurora PBS allocation)

```bash
module load frameworks
source /flare/$ALCF_PROJECT/$ALCF_USER/venvs/academy-swarm/bin/activate
export PATH=/flare/$ALCF_PROJECT/$ALCF_USER/bin:$PATH

# HTTP exchange must reach exchange.academy-agents.org through the ALCF proxy.
export http_proxy=http://proxy.alcf.anl.gov:3128
export https_proxy=http://proxy.alcf.anl.gov:3128
export HTTP_PROXY=$http_proxy
export HTTPS_PROXY=$https_proxy
export no_proxy=127.0.0.1,localhost,.alcf.anl.gov
export NO_PROXY=$no_proxy

# RUN_ID must match the value exported in Terminal 1.
export RUN_ID=federated-chat-001

chemgraph academy spawn-site -- \
  --system aurora \
  --run-id "$RUN_ID" \
  --campaign federated-chat \
  --agents agent-aurora \
  --exchange-type http
```

Look for the lifecycle landmarks:

```text
[daemon] rank0 registered 'agent-aurora' on the exchange (uid=...)
[daemon] rank0 waiting for peers ['agent-crux'] to come online (timeout 600s)...
[daemon] rank0 all 1 peer(s) are alive: ['agent-crux']
[daemon] rank0 agent 'agent-aurora' is now running inside Academy Runtime
[daemon] rank0 skipping inline bootstrap (federated mode); waiting for 'chemgraph academy bootstrap'...
```
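
If you capture the daemon output to a file, a small script can report how far a site got. The sketch below matches only the landmark lines quoted above; the stage names and the function are hypothetical, and the real log may contain other `[daemon]` lines that this ignores.

```python
import re
from typing import Iterable, Optional

# Ordered from earliest to latest, based on the landmark lines shown above.
LANDMARKS = [
    ("registered", re.compile(r"registered '[^']+' on the exchange")),
    ("waiting_for_peers", re.compile(r"waiting for peers")),
    ("peers_alive", re.compile(r"all \d+ peer\(s\) are alive")),
    ("running", re.compile(r"is now running inside Academy Runtime")),
    ("awaiting_bootstrap", re.compile(r"skipping inline bootstrap")),
]

SAMPLE_LOG = [
    "[daemon] rank0 registered 'agent-aurora' on the exchange (uid=...)",
    "[daemon] rank0 waiting for peers ['agent-crux'] to come online (timeout 600s)...",
    "[daemon] rank0 all 1 peer(s) are alive: ['agent-crux']",
    "[daemon] rank0 agent 'agent-aurora' is now running inside Academy Runtime",
    "[daemon] rank0 skipping inline bootstrap (federated mode); "
    "waiting for 'chemgraph academy bootstrap'...",
]


def furthest_landmark(lines: Iterable[str]) -> Optional[str]:
    """Return the name of the latest landmark seen, or None if none matched."""
    furthest = -1
    for line in lines:
        if not line.startswith("[daemon]"):
            continue
        for index, (_, pattern) in enumerate(LANDMARKS):
            if pattern.search(line):
                furthest = max(furthest, index)
    return LANDMARKS[furthest][0] if furthest >= 0 else None


if __name__ == "__main__":
    print(furthest_landmark(SAMPLE_LOG))
```

A site that stops at `waiting_for_peers` has registered itself but has not yet seen its peer, which is the situation the troubleshooting section describes.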

## Terminal 3: Crux compute (inside Crux PBS allocation)

```bash
source /eagle/$ALCF_PROJECT/$ALCF_USER/venvs/academy-swarm-crux/bin/activate
export PATH=/eagle/$ALCF_PROJECT/$ALCF_USER/bin:$PATH

export http_proxy=http://proxy.alcf.anl.gov:3128
export https_proxy=http://proxy.alcf.anl.gov:3128
export HTTP_PROXY=$http_proxy
export HTTPS_PROXY=$https_proxy
export no_proxy=127.0.0.1,localhost,.alcf.anl.gov
export NO_PROXY=$no_proxy

# RUN_ID must match the value exported in Terminal 1.
export RUN_ID=federated-chat-001

chemgraph academy spawn-site -- \
  --system crux \
  --run-id "$RUN_ID" \
  --campaign federated-chat \
  --agents agent-crux \
  --exchange-type http
```

You should see the same landmarks. Both daemons will block at
`waiting for 'chemgraph academy bootstrap'` once they've discovered each other.

## Terminal 4: Bootstrap kickoff (Mac, once both sites are waiting)

```bash
# RUN_ID must match the value exported in Terminal 1.
export RUN_ID=federated-chat-001

chemgraph academy bootstrap -- \
  --campaign federated-chat \
  --run-id "$RUN_ID" \
  --exchange-type http
```

On success, the command prints `ok: sent bootstrap to agent-aurora (message_id=...)`.

## What You Should See

- **Aurora terminal**: `[agent agent-aurora] first message arrived from
'campaign' ...`, then decisions firing, then `message_sent` to agent-crux.
- **Crux terminal**: `[agent agent-crux] first message arrived from
'agent-aurora' ...`, then back-and-forth.
- **Dashboard**: agent nodes appear in the graph, metrics tick up, the
cross-site message-flow edge between aurora and crux fills in, counter
messages climb in the activity log from 1 → 10.

## Troubleshooting

**Both sides stuck at `waiting for peers` past ~60s** → one site isn't
actually registered. Check each compute terminal for the `registered` line.
If one is missing, the daemon hit an exception before registration; scroll
up.

**`Address already in use` on relay startup** → a prior crashed launch
left an orphan. The new self-cleaning relay should handle it
automatically; if it doesn't, the local relay log under
`/tmp/chemgraph-academy-<run-id>-<site>-relay.log` will have a full
`set -x` trace showing exactly which step failed.

**Bootstrap times out** → both sites must already be at
`waiting for 'chemgraph academy bootstrap'`. If only one is up, bootstrap
can't find the recipient.

**Argo `<argo-user>` error** → you didn't export `ARGO_USER` before
launching spawn-site. The launcher refuses to ship a config with the
template placeholder; the error message names the fix.

**`Could not validate Globus token`** → the device-flow login expired.
Re-run the `python -c "from academy.exchange.cloud ..."` snippet from
the one-time setup section.
13 changes: 13 additions & 0 deletions src/chemgraph/academy/campaigns/__init__.py
@@ -6,13 +6,16 @@


EXAMPLE_002 = 'example-002-mace-ensemble-screening'
FEDERATED_CHAT = 'federated-chat'

CAMPAIGNS = {
    'mace-ensemble-screening-20': f'{EXAMPLE_002}/campaign.jsonc',
    'federated-chat': f'{FEDERATED_CHAT}/campaign.jsonc',
}

LM_CONFIG_TEMPLATES = {
    'argo-gpt54-mace-template': f'{EXAMPLE_002}/lm_config.json',
    'argo-gpt5mini-federated-chat': f'{FEDERATED_CHAT}/lm_config.json',
}


@@ -33,6 +36,16 @@ class CampaignLaunchDefaults:
        agents_per_node=1,
        max_decisions=24,
    ),
    # Multi-turn cross-HPC counter chat: 10 one-way messages (5 round
    # trips), so the dashboard has actual material to render. Aurora
    # runs 6 reasoning rounds (bootstrap + 5 replies received), Crux
    # runs 5. max_decisions=20 gives slack for retries.
    'federated-chat': CampaignLaunchDefaults(
        lm_config_template='argo-gpt5mini-federated-chat',
        agent_count=2,
        agents_per_node=1,
        max_decisions=20,
    ),
}
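
The decision budget in the comment can be checked with a little arithmetic. The sketch below is an illustration under assumptions: the dataclass layout is inferred from the four keyword arguments visible in the hunk (the real class body is not shown), `expected_rounds` is a hypothetical helper, and `max_decisions` is assumed to be a per-agent budget.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CampaignLaunchDefaults:
    # Field names taken from the keyword arguments in the diff; types assumed.
    lm_config_template: str
    agent_count: int
    agents_per_node: int
    max_decisions: int


FEDERATED_CHAT = CampaignLaunchDefaults(
    lm_config_template="argo-gpt5mini-federated-chat",
    agent_count=2,
    agents_per_node=1,
    max_decisions=20,
)


def expected_rounds(limit: int) -> Tuple[int, int]:
    """Reasoning rounds for (initiator, responder) when counting 1..limit.

    The initiator gets one bootstrap round plus one round per even counter it
    receives; the responder gets one round per odd counter it receives.
    """
    initiator = 1 + limit // 2
    responder = (limit + 1) // 2
    return initiator, responder


if __name__ == "__main__":
    initiator, responder = expected_rounds(10)
    slack = FEDERATED_CHAT.max_decisions - max(initiator, responder)
    print(initiator, responder, slack)
```

For a limit of 10 this gives 6 rounds for agent-aurora and 5 for agent-crux, leaving 14 decisions of slack for the busier agent.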


45 changes: 45 additions & 0 deletions src/chemgraph/academy/campaigns/federated-chat/campaign.jsonc
@@ -0,0 +1,45 @@
{
  // ---------------------------------------------------------------------
  // Federated-chat: a multi-turn cross-HPC conversation. Two agents
  // bounce a counter back and forth N times, each incrementing it,
  // until the counter hits a threshold and they finish_turn. Designed
  // for demos -- produces ~40 events on the dashboard (decisions,
  // send_message, message_received) so the federated UI (Sites
  // header, message-flow edge, cross-site Route label) has real
  // material to render.
  //
  // Run as:
  //   chemgraph academy dashboard -- federated-chat-XXX \
  //     --system aurora,crux --campaign federated-chat --overwrite-run
  //   chemgraph academy spawn-site -- --system aurora --run-id ... \
  //     --campaign federated-chat --agents agent-aurora --exchange-type http
  //   chemgraph academy spawn-site -- --system crux --run-id ... \
  //     --campaign federated-chat --agents agent-crux --exchange-type http
  //   chemgraph academy bootstrap -- --campaign federated-chat \
  //     --run-id ... --exchange-type http
  // ---------------------------------------------------------------------
  "run_id": "federated-chat",
  "user_task": "Federated counter chat: bounce an integer counter between agent-aurora and agent-crux, each incrementing it by 1, until it reaches 10. Then both finish_turn.",
  "prompt_profile": "prompt_profiles/default.json",
  "initial_agent": "agent-aurora",
  "resources": {},
  "mcp_servers": [],
  "agents": [
    {
      "name": "agent-aurora",
      "role": "FederatedCounterInitiator",
      "mission": "You are agent-aurora, running on the Aurora HPC. You are playing a counter-bouncing game with agent-crux across the HPC boundary. Rules: (1) On the bootstrap round, send EXACTLY ONE message to agent-crux with content 'counter=1' and tldr 'counter=1'. Set reply_requested=true. Then call finish_turn. (2) On every subsequent round where you receive a message from agent-crux containing 'counter=N', if N < 10 then send EXACTLY ONE reply to agent-crux with content 'counter=N+1' (you compute N+1 yourself, e.g. counter=3 if you received counter=2) and tldr 'counter=N+1', reply_requested=true, then finish_turn. If N >= 10, send NOTHING and just call finish_turn -- the game is over. (3) NEVER send more than one message per round. (4) NEVER initiate a new chain; only reply when a peer message arrives.",
      "allowed_peers": ["agent-crux"],
      "mcp_servers": [],
      "resources": []
    },
    {
      "name": "agent-crux",
      "role": "FederatedCounterResponder",
      "mission": "You are agent-crux, running on the Crux HPC. You are playing a counter-bouncing game with agent-aurora across the HPC boundary. Rules: (1) You NEVER initiate a message; you only ever reply. (2) On every round where you receive a message from agent-aurora containing 'counter=N', if N < 10 then send EXACTLY ONE reply to agent-aurora with content 'counter=N+1' (you compute N+1 yourself) and tldr 'counter=N+1', reply_requested=true, then finish_turn. If N >= 10, send NOTHING and just call finish_turn -- the game is over. (3) NEVER send more than one message per round.",
      "allowed_peers": ["agent-aurora"],
      "mcp_servers": [],
      "resources": []
    }
  ]
}
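
Because the file is JSONC, `json.loads` will not accept it directly. The following is a minimal sketch of one way to load and sanity-check such a file, not ChemGraph's own loader: `strip_line_comments` and `check_campaign` are hypothetical helpers, the stripper handles only `//` comments (no `/* */` blocks or trailing commas), and the sample is a trimmed-down copy of the campaign.

```python
import json
from typing import List


def strip_line_comments(text: str) -> str:
    """Remove // comments that sit outside JSON string literals."""
    out = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip to the end of the line, keeping the newline itself.
            while i < len(text) and text[i] != "\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def check_campaign(campaign: dict) -> List[str]:
    """Return a list of consistency problems (empty when the campaign is fine)."""
    names = {agent["name"] for agent in campaign["agents"]}
    problems = []
    if campaign["initial_agent"] not in names:
        problems.append(f"initial_agent {campaign['initial_agent']!r} is not defined")
    for agent in campaign["agents"]:
        for peer in agent["allowed_peers"]:
            if peer not in names:
                problems.append(f"{agent['name']}: unknown peer {peer!r}")
    return problems


SAMPLE = """
{
  // trimmed-down copy of the federated-chat campaign
  "initial_agent": "agent-aurora",
  "agents": [
    {"name": "agent-aurora", "allowed_peers": ["agent-crux"]},
    {"name": "agent-crux", "allowed_peers": ["agent-aurora"]}
  ]
}
"""

if __name__ == "__main__":
    print(check_campaign(json.loads(strip_line_comments(SAMPLE))))
```

Tracking string state matters because a naive "delete everything after `//`" would also truncate any URL stored in a string value.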
11 changes: 11 additions & 0 deletions src/chemgraph/academy/campaigns/federated-chat/lm_config.json
@@ -0,0 +1,11 @@
{
  "provider": "openai_compatible_tools",
  "base_url": "http://<uan-relay-host>:18186/argoapi/v1",
  "model": "GPT-5-mini",
  "api_key": "dummy",
  "user": "<argo-user>",
  "timeout_s": 180,
  "max_tokens": 4096,
  "max_retries": 3,
  "retry_delay_s": 2
}
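
This template still contains two angle-bracket placeholders, and the e2e guide says the launcher refuses to ship a config that keeps the `<argo-user>` one. A check of that kind could look like the sketch below. It is an illustration only: the helper names and the placeholder regex are assumptions, not the launcher's implementation.

```python
import re
from typing import Dict, List

# Assumption: placeholders look like <lowercase-words-with-dashes>.
PLACEHOLDER = re.compile(r"<[a-z][a-z0-9-]*>")

TEMPLATE = {
    "provider": "openai_compatible_tools",
    "base_url": "http://<uan-relay-host>:18186/argoapi/v1",
    "model": "GPT-5-mini",
    "api_key": "dummy",
    "user": "<argo-user>",
    "timeout_s": 180,
}


def unresolved_placeholders(config: dict) -> Dict[str, List[str]]:
    """Map each key to the placeholders still present in its string value."""
    found = {}
    for key, value in config.items():
        if isinstance(value, str):
            hits = PLACEHOLDER.findall(value)
            if hits:
                found[key] = hits
    return found


def fill_placeholders(config: dict, values: Dict[str, str]) -> dict:
    """Return a copy of config with known placeholders substituted."""
    def substitute(match: "re.Match") -> str:
        name = match.group(0)[1:-1]  # drop the angle brackets
        return values.get(name, match.group(0))

    return {
        key: PLACEHOLDER.sub(substitute, value) if isinstance(value, str) else value
        for key, value in config.items()
    }


if __name__ == "__main__":
    filled = fill_placeholders(TEMPLATE, {"argo-user": "jinchu.li"})
    print(unresolved_placeholders(filled))
```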
@@ -0,0 +1,12 @@
{
  "prompt_version": "federated-chat-v1",
  "prompt_style": "json_state",
  "system_prompt": "You are a persistent ChemGraph-style LM agent hosted inside an Academy daemon on HPC. You communicate with peers ONLY through send_message. This campaign has NO science tools; your only useful actions are send_message and finish_turn. Follow your mission literally; do not invent additional work. The campaign has a clear termination condition (counter reaches 10); when reached, call finish_turn and STOP.",
  "protocol_prompt": "Return one or more tool calls. If no action is useful, call finish_turn. Every send_message call must include tldr: one short line for the dashboard. Set reply_requested=true when the peer should answer, otherwise false. Keep arguments concise. Per your mission: send AT MOST ONE message per round. The counter you receive looks like 'counter=N'; parse N, compute N+1, send 'counter=N+1' as both content and tldr. When N>=10 the game is over -- send nothing, just finish_turn.",
  "langchain_recursion_limit": 32,
  "state_limits": {
    "received_messages_last_n": 8,
    "tool_results_last_n": 4,
    "actions_last_n": 8
  }
}
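
The `state_limits` keys read like sliding-window sizes for the state handed to the model each round. Assuming that interpretation (the diff does not show the code that consumes them), the windowing could be sketched as follows; `tail` and `build_state` are hypothetical names.

```python
from typing import Any, Dict, List, Sequence

STATE_LIMITS = {
    "received_messages_last_n": 8,
    "tool_results_last_n": 4,
    "actions_last_n": 8,
}


def tail(items: Sequence[Any], n: int) -> List[Any]:
    """Return the last n items; n <= 0 yields an empty list.

    The guard matters because items[-0:] would return the whole sequence.
    """
    return list(items[-n:]) if n > 0 else []


def build_state(
    received: Sequence[Any],
    tool_results: Sequence[Any],
    actions: Sequence[Any],
    limits: Dict[str, int] = STATE_LIMITS,
) -> Dict[str, List[Any]]:
    """Keep only the most recent entries of each history, per the limits."""
    return {
        "received_messages": tail(received, limits["received_messages_last_n"]),
        "tool_results": tail(tool_results, limits["tool_results_last_n"]),
        "actions": tail(actions, limits["actions_last_n"]),
    }


if __name__ == "__main__":
    history = [f"counter={n}" for n in range(1, 11)]
    print(build_state(history, [], [])["received_messages"])
```

In this campaign each agent receives at most six messages (the bootstrap plus five peer replies for agent-aurora), so a window of 8 would not drop anything under this reading.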
16 changes: 16 additions & 0 deletions src/chemgraph/academy/core/agent.py
Original file line number Diff line number Diff line change
@@ -90,10 +90,26 @@ async def agent_on_startup(self) -> None:
    @action
    async def receive_message(self, message: dict[str, Any]) -> None:
        validate_message(message)
        first_message = not self.received_message_history
        self.received_message_history.append(message)
        self._trace('message_received', message)
        if self._wake_event is not None:
            self._wake_event.set()
        if first_message:
            # Operator-visible lifecycle landmark: the FIRST message
            # to land on this agent (almost always the campaign
            # bootstrap for initial_agent, or a peer's reply on
            # everyone else) is the canonical "kickoff arrived"
            # signal. Use print so it surfaces on stdout regardless
            # of log level configuration on the rank.
            sender = message.get('sender', '?')
            kind = message.get('kind', '?')
            tldr = message.get('tldr') or message.get('content', '')[:60]
            print(
                f"[agent {self.spec.name}] first message arrived from "
                f"{sender!r} (kind={kind}): {tldr}",
                flush=True,
            )

    @action
    async def get_status(self) -> dict[str, Any]:
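
To see what the new landmark line looks like without running a daemon, the formatting logic can be lifted into a standalone function. This is a sketch that mirrors the three lookups and the f-string from the hunk above; the function name and the sample `kind` value are made up for illustration.

```python
from typing import Any, Dict


def first_message_banner(agent_name: str, message: Dict[str, Any]) -> str:
    """Build the 'first message arrived' line the same way the hunk does."""
    sender = message.get('sender', '?')
    kind = message.get('kind', '?')
    # Slicing binds tighter than `or`: an explicit tldr is used untruncated,
    # and only the content fallback is cut to 60 characters.
    tldr = message.get('tldr') or message.get('content', '')[:60]
    return (
        f"[agent {agent_name}] first message arrived from "
        f"{sender!r} (kind={kind}): {tldr}"
    )


if __name__ == "__main__":
    sample = {
        "sender": "agent-aurora",
        "kind": "peer",  # hypothetical value; real kinds are not shown in the diff
        "content": "counter=1",
        "tldr": "counter=1",
    }
    print(first_message_banner("agent-crux", sample))
```

Note that the fallback assumes `content`, when present, is a string; a message carrying `"content": None` and no `tldr` would raise a `TypeError` on the slice unless `validate_message` rules that out.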