argonne-lcf · JinchuLi2002 · Jun 18, 2026 · Jun 19, 2026 · Jun 19, 2026 · Jun 19, 2026
diff --git a/.gitignore b/.gitignore
@@ -64,3 +64,7 @@ vib*.traj
 
 # Kubernetes secrets (keep secrets.yaml.template, ignore actual secrets)
 k8s/secrets.yaml
+
+# Local private notes / scratchpads: anything matching *.private-local or *.private-local.* stays untracked
+*.private-local.*
+*.private-local
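Note on the two ignore patterns above: the first only matches names that carry a further suffix after `.private-local`, which is presumably why the second pattern is listed as well. A rough check in Python, using `fnmatch` as an approximation of gitignore globbing for plain basenames (the file names are made up for illustration):

```python
from fnmatch import fnmatchcase

# The two patterns added to .gitignore in this diff.
PATTERNS = ["*.private-local.*", "*.private-local"]


def is_ignored(basename: str) -> bool:
    """Approximate gitignore matching for a bare file name (no directories)."""
    return any(fnmatchcase(basename, pattern) for pattern in PATTERNS)


# Hypothetical file names, for illustration only.
for name in ["notes.private-local.md", "scratch.private-local", "README.md"]:
    print(f"{name}: {'ignored' if is_ignored(name) else 'tracked'}")
```

This ignores gitignore features such as directory anchoring and negation, so it is only a sanity check for simple file names.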
diff --git a/examples/academy/federated-chat/README.md b/examples/academy/federated-chat/README.md
@@ -0,0 +1,25 @@
+# Federated-Chat
+
+Two ChemGraph Academy logical agents running on **different HPCs**
+(Aurora and Crux), discovering each other through Academy's hosted
+HTTP exchange and exchanging messages across the HPC boundary:
+
+```text
+agent-aurora   (running on Aurora)
+agent-crux     (running on Crux)
+```
+
+The agents play a counter-bouncing game: agent-aurora sends `counter=1`
+to agent-crux, agent-crux replies `counter=2`, and so on until the
+counter reaches 10. The campaign is tiny on purpose: it exercises the
+federated stack (deterministic peer UIDs, HTTP exchange, multi-site
+dashboard) without needing any science tools.
+
+The campaign assets are packaged under:
+
+```text
+src/chemgraph/academy/campaigns/federated-chat/
+```
+
+See [`e2e_guide.md`](e2e_guide.md) for the full four-terminal walkthrough
+(dashboard + Aurora compute + Crux compute + bootstrap kickoff).
diff --git a/examples/academy/federated-chat/e2e_guide.md b/examples/academy/federated-chat/e2e_guide.md
@@ -0,0 +1,179 @@
+# Federated-Chat E2E Guide
+
+This guide runs the `federated-chat` ChemGraph Academy campaign across
+**two HPCs simultaneously** (Aurora and Crux), with the dashboard on
+your laptop merging both sites into one view.
+
+The campaign is intentionally minimal: two agents bounce a counter back
+and forth across the HPC boundary, each incrementing it, until it hits 10.
+It exercises every part of the cross-HPC stack (deterministic peer
+discovery, HTTP exchange, cross-site send_message, multi-site dashboard)
+without needing any science tools.
+
+```text
+agent-aurora           agent-crux
+   ↓ counter=1 ──►        receive
+   receive ◄── counter=2  ↓
+   ↓ counter=3 ──►        receive
+   ...                    ...
+   receive ◄── counter=10  ↓
+   finish_turn            finish_turn
+```
+
+Four terminals: dashboard (Mac), Aurora compute, Crux compute, bootstrap (Mac).
+
+## Configure Paths
+
+Set these in every shell (Mac + both HPCs):
+
+```bash
+export ALCF_PROJECT=ChemGraph
+export ALCF_USER=<shared-filesystem-user>      # e.g. jinchu
+export ALCF_SSH_USER=<ssh-login>               # may differ, e.g. jinchuli
+export ARGO_USER=<argo-user>                   # e.g. jinchu.li
+export LOCAL_CHEMGRAPH=<local-chemgraph-checkout>
+```
+
+`ALCF_USER` is the shared-filesystem path component (`/flare/$ALCF_PROJECT/$ALCF_USER`).
+`ALCF_SSH_USER` is the SSH login. They may differ; the loader defaults
+`ALCF_SSH_USER` to `ALCF_USER` if you don't set it.
+
+## One-Time Setup
+
+You need the same setup as `example-002-mace-ensemble-screening` (sync
+ChemGraph, install the `[academy]` extra, build Redis once) on **both**
+Aurora and Crux, plus one extra step: log in to Academy's hosted exchange
+so the Globus token is cached on both compute environments:
+
+```bash
+# On Aurora compute (inside an interactive allocation):
+python -c "from academy.exchange.cloud import HttpExchangeFactory; HttpExchangeFactory()"
+# Follow the device-flow URL printed in the terminal. Same on Crux.
+```
+
+The token is written to `~/.local/share/academy/storage.db` and is
+shared across runs.
+
+## Terminal 1: Dashboard (Mac)
+
+```bash
+cd "$LOCAL_CHEMGRAPH"
+
+export RUN_ID=federated-chat-001
+
+chemgraph academy dashboard -- "$RUN_ID" \
+  --system aurora,crux \
+  --campaign federated-chat \
+  --reverse-port 18190 \
+  --overwrite-run
+```
+
+This brings up:
+
+- one SSH ControlMaster + UAN relay + rsync mirror **per site** (`aurora` and `crux`),
+- a single merged dashboard server at `http://127.0.0.1:8765`.
+
+Wait for both relays to print `... relay ready at ...` before continuing.
+
+## Terminal 2: Aurora compute (inside Aurora PBS allocation)
+
+```bash
+module load frameworks
+source /flare/$ALCF_PROJECT/$ALCF_USER/venvs/academy-swarm/bin/activate
+export PATH=/flare/$ALCF_PROJECT/$ALCF_USER/bin:$PATH
+
+# HTTP exchange must reach exchange.academy-agents.org through the ALCF proxy.
+export http_proxy=http://proxy.alcf.anl.gov:3128
+export https_proxy=http://proxy.alcf.anl.gov:3128
+export HTTP_PROXY=$http_proxy
+export HTTPS_PROXY=$https_proxy
+export no_proxy=127.0.0.1,localhost,.alcf.anl.gov
+export NO_PROXY=$no_proxy
+
+chemgraph academy spawn-site -- \
+  --system aurora \
+  --run-id "$RUN_ID" \
+  --campaign federated-chat \
+  --agents agent-aurora \
+  --exchange-type http
+```
+
+Look for the lifecycle landmarks:
+
+```text
+[daemon] rank0 registered 'agent-aurora' on the exchange (uid=...)
+[daemon] rank0 waiting for peers ['agent-crux'] to come online (timeout 600s)...
+[daemon] rank0 all 1 peer(s) are alive: ['agent-crux']
+[daemon] rank0 agent 'agent-aurora' is now running inside Academy Runtime
+[daemon] rank0 skipping inline bootstrap (federated mode); waiting for 'chemgraph academy bootstrap'...
+```
+
+## Terminal 3: Crux compute (inside Crux PBS allocation)
+
+```bash
+source /eagle/$ALCF_PROJECT/$ALCF_USER/venvs/academy-swarm-crux/bin/activate
+export PATH=/eagle/$ALCF_PROJECT/$ALCF_USER/bin:$PATH
+
+export http_proxy=http://proxy.alcf.anl.gov:3128
+export https_proxy=http://proxy.alcf.anl.gov:3128
+export HTTP_PROXY=$http_proxy
+export HTTPS_PROXY=$https_proxy
+export no_proxy=127.0.0.1,localhost,.alcf.anl.gov
+export NO_PROXY=$no_proxy
+
+chemgraph academy spawn-site -- \
+  --system crux \
+  --run-id "$RUN_ID" \
+  --campaign federated-chat \
+  --agents agent-crux \
+  --exchange-type http
+```
+
+Expect the same landmarks as on Aurora, with the agent names swapped. Both
+daemons block at `waiting for 'chemgraph academy bootstrap'` once they've discovered each other.
+
+## Terminal 4: Bootstrap kickoff (Mac, once both sites are waiting)
+
+```bash
+chemgraph academy bootstrap -- \
+  --campaign federated-chat \
+  --run-id "$RUN_ID" \
+  --exchange-type http
+```
+
+On success, this prints `ok: sent bootstrap to agent-aurora (message_id=...)`.
+
+## What You Should See
+
+- **Aurora terminal**: `[agent agent-aurora] first message arrived from
+  'campaign' ...`, then decisions firing, then `message_sent` to agent-crux.
+- **Crux terminal**: `[agent agent-crux] first message arrived from
+  'agent-aurora' ...`, then back-and-forth.
+- **Dashboard**: agent nodes appear in the graph, metrics tick up, the
+  cross-site message-flow edge between aurora and crux fills in, counter
+  messages climb in the activity log from 1 → 10.
+
+## Troubleshooting
+
+**Both sides stuck at `waiting for peers` past ~60s** → one site isn't
+actually registered. Check each compute terminal for the `registered` line.
+If one is missing, the daemon hit an exception before registration; scroll
+up.
+
+**`Address already in use` on relay startup** → a prior crashed launch
+left an orphan. The new self-cleaning relay should handle it
+automatically; if it doesn't, the local relay log under
+`/tmp/chemgraph-academy-<run-id>-<site>-relay.log` will have a full
+`set -x` trace showing exactly which step failed.
+
+**Bootstrap times out** → both sites must already be at `waiting for
+'chemgraph academy bootstrap'`. If only one is up, bootstrap can't find
+the recipient.
+
+**Argo `<argo-user>` error** → you didn't export `ARGO_USER` before
+launching spawn-site. The launcher refuses to ship a config with the
+template placeholder; the error message names the fix.
+
+**`Could not validate Globus token`** → the device-flow login expired.
+Re-run the `python -c "from academy.exchange.cloud ..."` snippet from
+the one-time setup section.
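Note on the troubleshooting steps in the guide: several of them come down to finding the last lifecycle landmark a daemon printed. A minimal sketch of that check, assuming the spawn-site output has been captured to a string; the substrings are copied from the landmark lines quoted in the guide, while the labels and the function name are hypothetical:

```python
from typing import Optional

# Landmark substrings taken from the spawn-site output quoted in the guide,
# in the order they are expected to appear.
LANDMARKS = [
    ("registered", "registered '"),
    ("waiting_for_peers", "waiting for peers"),
    ("peers_alive", "peer(s) are alive"),
    ("running", "is now running inside Academy Runtime"),
    ("awaiting_bootstrap", "waiting for 'chemgraph academy bootstrap'"),
]


def furthest_landmark(log_text: str) -> Optional[str]:
    """Return the label of the last landmark reached in order, or None."""
    reached = None
    for label, needle in LANDMARKS:
        if needle not in log_text:
            break  # later landmarks cannot have been reached
        reached = label
    return reached


SAMPLE = """\
[daemon] rank0 registered 'agent-aurora' on the exchange (uid=...)
[daemon] rank0 waiting for peers ['agent-crux'] to come online (timeout 600s)...
"""
print(furthest_landmark(SAMPLE))
```

A result of `None` corresponds to the "daemon hit an exception before registration" case, and `waiting_for_peers` on both sites corresponds to the first troubleshooting entry.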
diff --git a/src/chemgraph/academy/campaigns/__init__.py b/src/chemgraph/academy/campaigns/__init__.py
@@ -6,13 +6,16 @@
 
 
 EXAMPLE_002 = 'example-002-mace-ensemble-screening'
+FEDERATED_CHAT = 'federated-chat'
 
 CAMPAIGNS = {
     'mace-ensemble-screening-20': f'{EXAMPLE_002}/campaign.jsonc',
+    'federated-chat': f'{FEDERATED_CHAT}/campaign.jsonc',
 }
 
 LM_CONFIG_TEMPLATES = {
     'argo-gpt54-mace-template': f'{EXAMPLE_002}/lm_config.json',
+    'argo-gpt5mini-federated-chat': f'{FEDERATED_CHAT}/lm_config.json',
 }
 
 
@@ -33,6 +36,16 @@ class CampaignLaunchDefaults:
         agents_per_node=1,
         max_decisions=24,
     ),
+    # Multi-turn cross-HPC counter chat: 10 one-way messages (5 round-trips)
+    # so the dashboard has actual material to render. agent-aurora runs
+    # ~6 reasoning rounds and agent-crux ~5 (send, receive, send, ...,
+    # reach 10, finish_turn). max_decisions=20 gives slack for retries.
+    'federated-chat': CampaignLaunchDefaults(
+        lm_config_template='argo-gpt5mini-federated-chat',
+        agent_count=2,
+        agents_per_node=1,
+        max_decisions=20,
+    ),
 }
 
 

diff --git a/src/chemgraph/academy/campaigns/federated-chat/campaign.jsonc b/src/chemgraph/academy/campaigns/federated-chat/campaign.jsonc
@@ -0,0 +1,45 @@
+{
+  // ---------------------------------------------------------------------
+  // Federated-chat: a multi-turn cross-HPC conversation. Two agents
+  // bounce a counter back and forth N times, each incrementing it,
+  // until the counter hits a threshold and they finish_turn. Designed
+  // for demos -- produces ~40 events on the dashboard (decisions,
+  // send_message, message_received) so the federated UI (Sites
+  // header, message-flow edge, cross-site Route label) has real
+  // material to render.
+  //
+  // Run as:
+  //   chemgraph academy dashboard -- federated-chat-XXX \
+  //     --system aurora,crux --campaign federated-chat --overwrite-run
+  //   chemgraph academy spawn-site -- --system aurora --run-id ... \
+  //     --campaign federated-chat --agents agent-aurora --exchange-type http
+  //   chemgraph academy spawn-site -- --system crux --run-id ... \
+  //     --campaign federated-chat --agents agent-crux --exchange-type http
+  //   chemgraph academy bootstrap -- --campaign federated-chat \
+  //     --run-id ... --exchange-type http
+  // ---------------------------------------------------------------------
+  "run_id": "federated-chat",
+  "user_task": "Federated counter chat: bounce an integer counter between agent-aurora and agent-crux, each incrementing it by 1, until it reaches 10. Then both finish_turn.",
+  "prompt_profile": "prompt_profiles/default.json",
+  "initial_agent": "agent-aurora",
+  "resources": {},
+  "mcp_servers": [],
+  "agents": [
+    {
+      "name": "agent-aurora",
+      "role": "FederatedCounterInitiator",
+      "mission": "You are agent-aurora, running on the Aurora HPC. You are playing a counter-bouncing game with agent-crux across the HPC boundary. Rules: (1) On the bootstrap round, send EXACTLY ONE message to agent-crux with content 'counter=1' and tldr 'counter=1'. Set reply_requested=true. Then call finish_turn. (2) On every subsequent round where you receive a message from agent-crux containing 'counter=N', if N < 10 then send EXACTLY ONE reply to agent-crux with content 'counter=N+1' (you compute N+1 yourself, e.g. counter=3 if you received counter=2) and tldr 'counter=N+1', reply_requested=true, then finish_turn. If N >= 10, send NOTHING and just call finish_turn -- the game is over. (3) NEVER send more than one message per round. (4) NEVER initiate a new chain; only reply when a peer message arrives.",
+      "allowed_peers": ["agent-crux"],
+      "mcp_servers": [],
+      "resources": []
+    },
+    {
+      "name": "agent-crux",
+      "role": "FederatedCounterResponder",
+      "mission": "You are agent-crux, running on the Crux HPC. You are playing a counter-bouncing game with agent-aurora across the HPC boundary. Rules: (1) You NEVER initiate a message; you only ever reply. (2) On every round where you receive a message from agent-aurora containing 'counter=N', if N < 10 then send EXACTLY ONE reply to agent-aurora with content 'counter=N+1' (you compute N+1 yourself) and tldr 'counter=N+1', reply_requested=true, then finish_turn. If N >= 10, send NOTHING and just call finish_turn -- the game is over. (3) NEVER send more than one message per round.",
+      "allowed_peers": ["agent-aurora"],
+      "mcp_servers": [],
+      "resources": []
+    }
+  ]
+}
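Note on the mission rules above: the expected message and round counts can be checked with a small simulation. This is a sketch with idealized agents that follow the rules exactly; it assumes one reasoning round per received message, which the diff does not confirm, and the function names are hypothetical.

```python
import re
from typing import Optional

LIMIT = 10  # termination threshold stated in user_task and both missions


def reply_for(content: str) -> Optional[str]:
    """Apply the mission rule: reply 'counter=N+1' if N < LIMIT, else nothing."""
    match = re.fullmatch(r"counter=(\d+)", content.strip())
    if match is None:
        raise ValueError(f"unexpected message content: {content!r}")
    n = int(match.group(1))
    return f"counter={n + 1}" if n < LIMIT else None


def simulate():
    """Play the game with ideal agents; return (messages, rounds per agent)."""
    rounds = {"agent-aurora": 1, "agent-crux": 0}  # aurora's bootstrap round
    messages = [("agent-aurora", "counter=1")]
    receiver = "agent-crux"
    while True:
        rounds[receiver] += 1  # the receiver wakes up and takes one round
        reply = reply_for(messages[-1][1])
        if reply is None:
            return messages, rounds  # N >= LIMIT: send nothing, game over
        messages.append((receiver, reply))
        receiver = "agent-aurora" if receiver == "agent-crux" else "agent-crux"


messages, rounds = simulate()
print(len(messages), rounds)
```

Under these assumptions the game produces 10 messages, with agent-crux sending the final `counter=10` and agent-aurora ending the game on receipt, so both agents stay well under `max_decisions=20`.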
diff --git a/src/chemgraph/academy/campaigns/federated-chat/lm_config.json b/src/chemgraph/academy/campaigns/federated-chat/lm_config.json
@@ -0,0 +1,11 @@
+{
+  "provider": "openai_compatible_tools",
+  "base_url": "http://<uan-relay-host>:18186/argoapi/v1",
+  "model": "GPT-5-mini",
+  "api_key": "dummy",
+  "user": "<argo-user>",
+  "timeout_s": 180,
+  "max_tokens": 4096,
+  "max_retries": 3,
+  "retry_delay_s": 2
+}
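Note on the template above: `base_url` and `user` still contain angle-bracket placeholders, and the e2e guide says the launcher refuses to ship a config with the template placeholder. The launcher's actual check is not part of this diff; the following is only a sketch of how such a check could look, with a hypothetical function name and regex:

```python
import json
import re

# Matches angle-bracket template placeholders such as <argo-user>.
PLACEHOLDER = re.compile(r"<[a-z0-9-]+>")


def unresolved_placeholders(config: dict) -> dict:
    """Map each top-level string field to the placeholders still present in it."""
    found = {}
    for key, value in config.items():
        if isinstance(value, str):
            hits = PLACEHOLDER.findall(value)
            if hits:
                found[key] = hits
    return found


# Subset of the lm_config.json template from this diff.
TEMPLATE = json.loads("""
{
  "provider": "openai_compatible_tools",
  "base_url": "http://<uan-relay-host>:18186/argoapi/v1",
  "model": "GPT-5-mini",
  "user": "<argo-user>"
}
""")

print(unresolved_placeholders(TEMPLATE))
```

An empty result would mean every placeholder has been substituted, for example after `ARGO_USER` has been exported and applied.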
diff --git a/src/chemgraph/academy/campaigns/federated-chat/prompt_profiles/default.json b/src/chemgraph/academy/campaigns/federated-chat/prompt_profiles/default.json
@@ -0,0 +1,12 @@
+{
+  "prompt_version": "federated-chat-v1",
+  "prompt_style": "json_state",
+  "system_prompt": "You are a persistent ChemGraph-style LM agent hosted inside an Academy daemon on HPC. You communicate with peers ONLY through send_message. This campaign has NO science tools; your only useful actions are send_message and finish_turn. Follow your mission literally; do not invent additional work. The campaign has a clear termination condition (counter reaches 10); when reached, call finish_turn and STOP.",
+  "protocol_prompt": "Return one or more tool calls. If no action is useful, call finish_turn. Every send_message call must include tldr: one short line for the dashboard. Set reply_requested=true when the peer should answer, otherwise false. Keep arguments concise. Per your mission: send AT MOST ONE message per round. The counter you receive looks like 'counter=N'; parse N, compute N+1, send 'counter=N+1' as both content and tldr. When N>=10 the game is over -- send nothing, just finish_turn.",
+  "langchain_recursion_limit": 32,
+  "state_limits": {
+    "received_messages_last_n": 8,
+    "tool_results_last_n": 4,
+    "actions_last_n": 8
+  }
+}
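Note on `state_limits`: the key names suggest that only the newest N entries of each history are passed into the prompt state. How ChemGraph applies these limits is not shown in this diff, so the following is a sketch under that assumption; the state keys (`received_messages`, `tool_results`, `actions`) and the function name are hypothetical, derived from the limit names.

```python
# Limits copied from prompt_profiles/default.json in this diff.
STATE_LIMITS = {
    "received_messages_last_n": 8,
    "tool_results_last_n": 4,
    "actions_last_n": 8,
}


def trim_state(state: dict, limits: dict) -> dict:
    """Keep only the newest N entries per history, where N is `<name>_last_n`."""
    trimmed = {}
    for name, history in state.items():
        limit = limits.get(f"{name}_last_n")
        if limit is None:
            trimmed[name] = list(history)  # no limit configured: keep all
        else:
            start = max(len(history) - limit, 0)
            trimmed[name] = list(history[start:])
    return trimmed


# Hypothetical histories: 12 received messages, 3 tool results, no actions.
state = {
    "received_messages": [f"counter={n}" for n in range(1, 13)],
    "tool_results": ["r1", "r2", "r3"],
    "actions": [],
}
print(trim_state(state, STATE_LIMITS)["received_messages"])
```

With a 10-message game and a window of 8 received messages, each agent would still see its whole inbox under this reading, since no single agent receives more than 5 peer messages.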
diff --git a/src/chemgraph/academy/core/agent.py b/src/chemgraph/academy/core/agent.py
@@ -90,10 +90,26 @@ async def agent_on_startup(self) -> None:
     @action
     async def receive_message(self, message: dict[str, Any]) -> None:
         validate_message(message)
+        first_message = not self.received_message_history
         self.received_message_history.append(message)
         self._trace('message_received', message)
         if self._wake_event is not None:
             self._wake_event.set()
+        if first_message:
+            # Operator-visible lifecycle landmark: the FIRST message
+            # to land on this agent (almost always the campaign
+            # bootstrap for initial_agent, or a peer's reply on
+            # everyone else) is the canonical "kickoff arrived"
+            # signal. Use print so it surfaces on stdout regardless
+            # of log level configuration on the rank.
+            sender = message.get('sender', '?')
+            kind = message.get('kind', '?')
+            tldr = message.get('tldr') or str(message.get('content') or '')[:60]
+            print(
+                f"[agent {self.spec.name}] first message arrived from "
+                f"{sender!r} (kind={kind}): {tldr}",
+                flush=True,
+            )
 
     @action
     async def get_status(self) -> dict[str, Any]:
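Note on the first-message landmark added to `receive_message`: the formatting is easier to unit-test when pulled out of the agent class. A standalone sketch, assuming messages are plain dicts with the `sender`, `kind`, `tldr`, and `content` keys used in the diff; the function name and the `kind` value in the example are hypothetical:

```python
def first_message_summary(agent_name: str, message: dict) -> str:
    """Build the one-line landmark for the first message an agent receives."""
    sender = message.get("sender", "?")
    kind = message.get("kind", "?")
    # Prefer the short tldr; otherwise fall back to the first 60 characters
    # of the content, tolerating a missing or None content field.
    tldr = message.get("tldr") or str(message.get("content") or "")[:60]
    return (
        f"[agent {agent_name}] first message arrived from "
        f"{sender!r} (kind={kind}): {tldr}"
    )


example = {"sender": "agent-aurora", "kind": "peer", "tldr": "counter=1"}
print(first_message_summary("agent-crux", example))
```

The output has the same shape as the lines the e2e guide tells operators to look for in the Aurora and Crux terminals.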