126 commits
071a251
Add pluggable execution backend for Parsl, EnsembleLauncher, and Glob…
tdpham2 May 4, 2026
4878740
Fix unreachable code in aurora_parsl and EnsembleLauncher shutdown state
tdpham2 May 4, 2026
a8feddb
Update Globus config
tdpham2 May 14, 2026
3144c6a
Add inline structure for file transferring between local and globus r…
tdpham2 May 14, 2026
e032fb8
Add async job tracking for Globus Compute MCP tools
tdpham2 May 14, 2026
ae7963d
Modified the EL backend implemenations, and added a EL backend test
harikrishna1410 May 21, 2026
1febc5a
Add CGFastMCP backend framework, EL client-only mode, and pickle fix
harikrishna1410 May 22, 2026
28432be
Merge pull request #127 from argonne-lcf/dev-globus_HT
tdpham2 Jun 1, 2026
d60a1e1
Fix PR #127 blockers: silent failure, decorator IndexError, hard EL i…
tdpham2 Jun 1, 2026
4387d13
Add JobTracker persistence and Globus task UUID round-trip
tdpham2 Jun 1, 2026
1b76032
Extend CGFastMCP: tracker kwargs, pre-submit hook, schema_fanout_tool
tdpham2 Jun 1, 2026
04bcc8a
Add remote_structure_directory schemas, GC executor recovery, XANES p…
tdpham2 Jun 1, 2026
aaef2a9
Add Globus Transfer manager and MCP file-staging tools
tdpham2 Jun 1, 2026
d2b3b2d
Reintegrate MACE MCP transport, persistence, and Globus Transfer on C…
tdpham2 Jun 1, 2026
b7ac17c
Add academy module: distributed multi-agent screening via Academy
tdpham2 Jun 1, 2026
b3cc242
Fix bugs in HPC execution layer
tdpham2 Jun 1, 2026
27fff5d
Migrate XANES and gRASPA MCP servers to CGFastMCP
tdpham2 Jun 1, 2026
d28274c
Add pluggable execution backend for Parsl, EnsembleLauncher, and Glob…
tdpham2 May 4, 2026
db2a41c
Fix unreachable code in aurora_parsl and EnsembleLauncher shutdown state
tdpham2 May 4, 2026
39f28a1
Update Globus config
tdpham2 May 14, 2026
b13fc53
Add inline structure for file transferring between local and globus r…
tdpham2 May 14, 2026
b605ec0
Add async job tracking for Globus Compute MCP tools
tdpham2 May 14, 2026
35a2d65
Modified the EL backend implemenations, and added a EL backend test
harikrishna1410 May 21, 2026
212a054
Add CGFastMCP backend framework, EL client-only mode, and pickle fix
harikrishna1410 May 22, 2026
6422b61
Fix PR #127 blockers: silent failure, decorator IndexError, hard EL i…
tdpham2 Jun 1, 2026
a8a3a87
Add JobTracker persistence and Globus task UUID round-trip
tdpham2 Jun 1, 2026
78c9c33
Extend CGFastMCP: tracker kwargs, pre-submit hook, schema_fanout_tool
tdpham2 Jun 1, 2026
2d6a283
Add remote_structure_directory schemas, GC executor recovery, XANES p…
tdpham2 Jun 1, 2026
f1863d3
Add Globus Transfer manager and MCP file-staging tools
tdpham2 Jun 1, 2026
890acaa
Reintegrate MACE MCP transport, persistence, and Globus Transfer on C…
tdpham2 Jun 1, 2026
ee3d727
Add academy module: distributed multi-agent screening via Academy
tdpham2 Jun 1, 2026
e3ac8b0
Fix bugs in HPC execution layer
tdpham2 Jun 1, 2026
2d25f89
Migrate XANES and gRASPA MCP servers to CGFastMCP
tdpham2 Jun 1, 2026
bc54083
Forward HPC env vars to MCP stdio subprocess + document EL config
tdpham2 Jun 2, 2026
e85c675
Silence LocalBackend worker stdout under stdio MCP transport
tdpham2 Jun 4, 2026
84c87dd
Add smoke and demo scripts for execution backends
tdpham2 Jun 4, 2026
a31c148
Add Crux (ALCF) support to Parsl and EnsembleLauncher backends
tdpham2 Jun 5, 2026
2dc37c8
Ensure MACE worker creates output directories
JinchuLi2002 Jun 9, 2026
dfa9950
Fix schema fanout batch return annotation
JinchuLi2002 Jun 9, 2026
aad6fae
Ignore local run and model artifacts
JinchuLi2002 Jun 9, 2026
9d1b20a
Cover schema fanout and MACE output path fixes
JinchuLi2002 Jun 9, 2026
acc8a3b
Create parent directory for generated coordinate files
JinchuLi2002 Jun 9, 2026
42b15a0
Read version from chemgraphagent distribution
JinchuLi2002 Jun 9, 2026
7880562
Parameterize Aurora Parsl worker setup
JinchuLi2002 Jun 9, 2026
591dba7
Clarify MACE model path schema descriptions
JinchuLi2002 Jun 9, 2026
534a292
Add HPC JSON inspection MCP tool
JinchuLi2002 Jun 9, 2026
bf433ae
Add Academy canonical event observability
JinchuLi2002 Jun 9, 2026
ccecdd8
feat(chemgraph): add local workflow tracing
JinchuLi2002 Jun 9, 2026
ab8c0ef
feat(chemgraph): support terminal tools in single-agent graph
JinchuLi2002 Jun 9, 2026
d6f22d2
refactor(academy): reshape persistent agent runtime
JinchuLi2002 Jun 9, 2026
778e94c
feat(academy): add MACE screening campaign example
JinchuLi2002 Jun 9, 2026
0b8197b
feat(cli): wire dashboard and Academy commands
JinchuLi2002 Jun 9, 2026
96241d7
feat(models): support OpenAI-compatible Argo user metadata
JinchuLi2002 Jun 9, 2026
f4992f6
chore(academy): add redis optional dependency
JinchuLi2002 Jun 9, 2026
d8a7868
refactor(academy): rename operator console dashboard launcher
JinchuLi2002 Jun 9, 2026
c89f45e
Merge remote-tracking branch 'origin/dev-globus' into merge-dev-globus
JinchuLi2002 Jun 9, 2026
527acb9
refactor(academy): merge peer request action into send_message
JinchuLi2002 Jun 9, 2026
457351e
refactor(mcp): move in-process FastMCP adapter out of Academy
JinchuLi2002 Jun 10, 2026
186d013
refactor(academy): remove unused run artifacts
JinchuLi2002 Jun 10, 2026
a431bf0
refactor(academy): extract dashboard launcher templates
JinchuLi2002 Jun 10, 2026
b71c441
refactor(academy): run wakeups through chemgraph turn primitive
JinchuLi2002 Jun 10, 2026
77425b2
refactor(academy): trim agent status snapshots
JinchuLi2002 Jun 10, 2026
8772c7c
refactor(academy): remove stale cleanup targets
JinchuLi2002 Jun 10, 2026
11f896d
refactor(agent): route single-agent workflows through run_turn
JinchuLi2002 Jun 10, 2026
5e64525
fix(academy): avoid stdlib shadowing in profiles
JinchuLi2002 Jun 10, 2026
23dab00
fix(dashboard): show daemon workflow turns
JinchuLi2002 Jun 10, 2026
c529682
refactor(models): share LLM endpoint settings
JinchuLi2002 Jun 10, 2026
f67b340
chore(academy): remove stale built-in listings
JinchuLi2002 Jun 10, 2026
2cee93a
chore(academy): remove console script entrypoints
JinchuLi2002 Jun 10, 2026
7e5f91b
refactor(academy): launch campaign MCP tools as servers
JinchuLi2002 Jun 10, 2026
de0a79d
refactor(academy): generalize exchange registration
JinchuLi2002 Jun 10, 2026
99f073d
fix(agent): emit llm decision events for tool calls
JinchuLi2002 Jun 10, 2026
83bcd7a
fix(mcp): isolate hpc backend workers
JinchuLi2002 Jun 10, 2026
4441d98
refactor(academy): move packaged campaigns out of examples
JinchuLi2002 Jun 10, 2026
f5d2f4d
feat(cli): --trace-dir option for traditional dashboard view
JinchuLi2002 Jun 10, 2026
41a3f22
docs(example-002): add sanitized e2e guide
JinchuLi2002 Jun 11, 2026
64c467c
refactor(agent): restore per-workflow graphs; move run_turn to agent/…
JinchuLi2002 Jun 11, 2026
b5183c4
chore(agent): drop unused turn re-exports from llm_agent
JinchuLi2002 Jun 11, 2026
f1593ab
revert(agent): restore llm_agent.py to pre-academy shape
JinchuLi2002 Jun 11, 2026
bcf072d
refactor(agent): extract dashboard event callbacks into agent/events.py
JinchuLi2002 Jun 11, 2026
789920e
feat(agent): add minimum on_event and terminal_tool_names hooks for a…
JinchuLi2002 Jun 11, 2026
60817d1
Fix Parsl pickling of MCP server callables launched via "python -m"
tdpham2 Jun 11, 2026
99ef6f3
Cleanly shut down Parsl DFK in ParslBackend.shutdown
tdpham2 Jun 11, 2026
9be7740
Make Parsl worker_init configurable across HPC system configs
tdpham2 Jun 11, 2026
396a528
Add Crux support and worker-env forwarding to Parsl agent demo
tdpham2 Jun 11, 2026
ff4aa14
Add Crux support to Parsl + EnsembleLauncher smoke harness
tdpham2 Jun 11, 2026
4bca7d9
fix(events): always emit llm_decision so dashboard renders single-LLM…
JinchuLi2002 Jun 11, 2026
cbe6094
fix(mcp): isolate MACE worker in a subprocess to dodge Parsl hang on …
JinchuLi2002 Jun 12, 2026
e9dd903
Pick a multi-node-safe MPI flavour per HPC system for EnsembleLauncher
tdpham2 Jun 12, 2026
92108b1
Serialize MACE model loads across both threads and processes
tdpham2 Jun 12, 2026
cb3e197
Revert "fix(mcp): isolate MACE worker in a subprocess to dodge Parsl …
JinchuLi2002 Jun 12, 2026
bef9d87
Revert "fix(mcp): isolate hpc backend workers"
JinchuLi2002 Jun 12, 2026
ca1019c
docs(example-002): switch MACE path to in-process run_ase
JinchuLi2002 Jun 12, 2026
8043fe9
fix static sync location
harikrishna1410 Jun 13, 2026
317204b
chore(campaigns): rename campaign.json to campaign.jsonc to match con…
JinchuLi2002 Jun 15, 2026
4e6556d
feat(academy): per-agent allowed_tools whitelist on top of mcp_servers
JinchuLi2002 Jun 15, 2026
25565d0
docs(example-002): document http(s)_proxy env vars for compute-node M…
JinchuLi2002 Jun 15, 2026
648b536
fix(tools): create output_results_file parent dir in run_ase_core
JinchuLi2002 Jun 15, 2026
04f1dbf
fix(academy): materialise shared_run resource directories at startup
JinchuLi2002 Jun 15, 2026
a2e27a9
added a one off logging to run_ase_core
harikrishna1410 Jun 15, 2026
f56713d
added ppn to task spec in demo el
harikrishna1410 Jun 15, 2026
aefed79
added try except block in demo chemistry
Jun 15, 2026
4622823
refactor(academy/dashboard): bundle UAN relay script into chemgraph
JinchuLi2002 Jun 16, 2026
d3f3d41
Merge pull request #131 from JinchuLi2002/dev-globus
tdpham2 Jun 16, 2026
64d087a
added -ppn and --ngpus_per_process to mcp demos
harikrishna1410 Jun 16, 2026
09f972b
added a counter in cg mcp to make task_ids unique
harikrishna1410 Jun 17, 2026
507d518
adding some temp cg config dor argo
Jun 17, 2026
c8bb170
Gate MACE inline-structure transport to no-shared-filesystem backends
tdpham2 Jun 17, 2026
4aead3e
Gate MACE inline-structure transport to no-shared-filesystem backends
tdpham2 Jun 17, 2026
1c05436
Drop full_output read-back from MACE worker
tdpham2 Jun 17, 2026
60b4225
Drop full_output read-back from MACE worker
tdpham2 Jun 17, 2026
debdab4
moved el orchestrator to a subprocess
harikrishna1410 Jun 17, 2026
b43d390
added logging in el backend
harikrishna1410 Jun 17, 2026
3feedcd
added better cleanup of el subprocess
Jun 17, 2026
a069520
Guard stdout during EnsembleLauncher teardown
tdpham2 Jun 18, 2026
f639aa4
Guard stdout during EnsembleLauncher teardown
tdpham2 Jun 18, 2026
855be82
Merge remote-tracking branch 'origin/dev-globus-hpc' into el_test
tdpham2 Jun 18, 2026
7ec8dec
Merge pull request #133 from argonne-lcf/el_test
tdpham2 Jun 18, 2026
6741d2e
Merge remote-tracking branch 'origin/dev-globus' into dev-globus-hpc
tdpham2 Jun 18, 2026
932e4c8
Revert hardcoded argo model/base_url in Parsl agent demo
tdpham2 Jun 18, 2026
f3d036b
Remove machine-specific Crux SystemConfig tests
tdpham2 Jun 18, 2026
3434915
Skip TestELBackend when ensemble_launcher is unavailable
tdpham2 Jun 18, 2026
3a541a4
Merge pull request #132 from argonne-lcf/dev-globus-hpc
tdpham2 Jun 18, 2026
b5cac8b
fix(academy): honour [academy] optional-dep contract via lazy re-exports
JinchuLi2002 Jun 18, 2026
d293cea
fix(agent): restore calculator-context wiring lost in revert f1593ab
JinchuLi2002 Jun 18, 2026
72c8f13
Merge pull request #134 from JinchuLi2002/pr120-fixes
tdpham2 Jun 19, 2026
2 changes: 2 additions & 0 deletions .gitignore
@@ -51,6 +51,8 @@ opencode.json
chemgraph_mcp_logs/
vllm/
logs/
runs/
**/*.model
error_log.txt
.env
test.csv
35 changes: 35 additions & 0 deletions examples/academy/example-002-mace-ensemble-screening/README.md
@@ -0,0 +1,35 @@
# Example 002: MACE Ensemble Screening

This example demonstrates five persistent ChemGraph Academy logical agents
running under MPI:

```text
coordinator-agent
structure-agent-a
structure-agent-b
mace-agent
assessment-agent
```

The coordinator delegates 20 SMILES candidates, structure agents generate XYZ
files, the MACE agent runs an ensemble energy screen, and the assessment agent
summarizes readiness/ranking evidence.

The campaign assets are packaged under:

```text
src/chemgraph/academy/campaigns/example-002-mace-ensemble-screening/
```

Run it by campaign name:

```bash
chemgraph academy run-compute \
--system aurora \
--run-id aurora-mace-ensemble-screening-001 \
--campaign mace-ensemble-screening-20 \
--lm-user <argo-user>
```

See `notes.md` for the high-level architecture notes and `e2e_guide.md` for the
end-to-end run guide.
316 changes: 316 additions & 0 deletions examples/academy/example-002-mace-ensemble-screening/e2e_guide.md
@@ -0,0 +1,316 @@
# Example 002 E2E Guide

This guide runs the `mace-ensemble-screening-20` ChemGraph Academy campaign on
Aurora or Polaris. The campaign starts five persistent logical agents under MPI:

```text
coordinator-agent
structure-agent-a
structure-agent-b
mace-agent
assessment-agent
```

The coordinator delegates 20 SMILES candidates, structure agents generate XYZ
files, the MACE agent runs an ensemble energy screen, and the assessment agent
summarizes readiness/ranking evidence.

## About The MACE Path

This example deliberately runs MACE through the general `run_ase` tool
(`chemgraph.mcp.mcp_tools`), which executes MACE in-process inside the MCP
server. It does **not** exercise `chemgraph.mcp.mace_mcp_hpc` or the
Parsl/EnsembleLauncher/Globus Compute backends — those are being reworked in
a separate PR. Once that lands and the WorkerLost subprocess fix is folded
back in, this example can be switched back to the HPC MACE path.

In-process MACE means each per-structure energy evaluation runs synchronously
in the mace-agent's MCP server process. A 20-structure screen completes in
a few minutes on CPU.

## Configure Paths

Set these values in each terminal before copying the commands below:

```bash
export ALCF_PROJECT=<project-name>
export ALCF_USER=<shared-filesystem-user>
export ALCF_LOGIN=<ssh-login>
export ARGO_USER=<argo-user>

export LOCAL_CHEMGRAPH=<local-chemgraph-checkout>
```

For Aurora:

```bash
export ALCF_SYSTEM=aurora
export ALCF_HOST=aurora.alcf.anl.gov
export REMOTE_ROOT=/flare/$ALCF_PROJECT/$ALCF_USER
```

For Polaris:

```bash
export ALCF_SYSTEM=polaris
export ALCF_HOST=polaris.alcf.anl.gov
export REMOTE_ROOT=/eagle/$ALCF_PROJECT/$ALCF_USER
```

`ALCF_USER` is the shared-filesystem path component. It may differ from the SSH
login and from the Argo user.
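
The per-system values above can also be derived in a script. The following is a minimal sketch, not part of ChemGraph: `remote_settings` and `SYSTEMS` are hypothetical names, and the host names and filesystem roots are simply the ones listed in this section.

```python
# Hypothetical helper, not a ChemGraph API: derives the per-system values
# shown above from the system name, project, and shared-filesystem user.
SYSTEMS = {
    "aurora": {"host": "aurora.alcf.anl.gov", "fs_root": "/flare"},
    "polaris": {"host": "polaris.alcf.anl.gov", "fs_root": "/eagle"},
}


def remote_settings(system: str, project: str, fs_user: str) -> dict:
    """Return ALCF_SYSTEM, ALCF_HOST, and REMOTE_ROOT for one system."""
    try:
        entry = SYSTEMS[system]
    except KeyError:
        raise ValueError(f"unsupported system: {system!r}") from None
    return {
        "ALCF_SYSTEM": system,
        "ALCF_HOST": entry["host"],
        "REMOTE_ROOT": f"{entry['fs_root']}/{project}/{fs_user}",
    }


if __name__ == "__main__":
    # Placeholder project and user; substitute real values.
    for key, value in remote_settings("aurora", "myproject", "alice").items():
        print(f"export {key}={value}")
```

Note that `fs_user` corresponds to `ALCF_USER`, the shared-filesystem path component, not the SSH login.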

## One-Time Setup

Sync ChemGraph:

```bash
cd "$LOCAL_CHEMGRAPH"

rsync -az --delete --delete-excluded \
--exclude '.git/' \
--exclude '__pycache__/' \
--exclude '.pytest_cache/' \
--exclude 'runs/' \
--exclude 'venvs/' \
--exclude '*.pyc' \
./ \
"$ALCF_LOGIN@$ALCF_HOST:$REMOTE_ROOT/ChemGraph/"
```

Install ChemGraph dependencies on the remote system:

```bash
ssh "$ALCF_LOGIN@$ALCF_HOST"
cd "$REMOTE_ROOT/ChemGraph"

# Aurora:
module load frameworks

# Polaris:
# module use /soft/modulefiles
# module load conda
# conda activate base

source "$REMOTE_ROOT/venvs/academy-swarm/bin/activate"
python -m pip install -e ".[academy]"
```

Verify the campaign is visible:

```bash
PYTHONDONTWRITEBYTECODE=1 PYTHONPATH=src \
python -m chemgraph.cli.main academy campaigns
```

Expected:

```text
mace-ensemble-screening-20
```

Verify Redis:

```bash
export PATH="$REMOTE_ROOT/tools/redis/bin:$PATH"
command -v redis-server
redis-server --version
```

If Redis is missing, build it once on a login/UAN node:

```bash
cd "$REMOTE_ROOT"
mkdir -p src tools
cd src
test -d redis || git clone --depth 1 https://github.com/redis/redis.git
cd redis
make -j4
make PREFIX="$REMOTE_ROOT/tools/redis" install
```

The `mace_mp` calculator downloads its foundation model on first use into
`~/.cache/mace`, so no manual MACE-model staging is needed for this example.
First-call download can take a minute; pre-warm it once on the compute node
to skip that wait at run time. The compute node only reaches external sites
through the ALCF outbound proxy, so set the proxy env vars first:

```bash
export http_proxy="http://proxy.alcf.anl.gov:3128"
export https_proxy="http://proxy.alcf.anl.gov:3128"
python -c "from mace.calculators import mace_mp; mace_mp(model='medium-mpa-0', device='cpu')"
```

## Start argo-shim

On the local machine:

```bash
CELS_USERNAME="$ARGO_USER" \
PYTHONPATH=<argo-shim-checkout> \
python -m argo_shim --no-auth --no-update-settings --port 18085
```

## Start Dashboard

Use a fresh run id:

```bash
cd "$LOCAL_CHEMGRAPH"

export RUN_ID="${ALCF_SYSTEM}-mace-ensemble-screening-001"

PYTHONPATH=src python -m chemgraph.cli.main academy dashboard -- \
--system "$ALCF_SYSTEM" \
--remote-host "$ALCF_LOGIN@$ALCF_HOST" \
--campaign mace-ensemble-screening-20 \
--lm-connect mac-argo-relay \
"$RUN_ID"
```

The dashboard command starts the local dashboard, an rsync mirror, an SSH
control connection, and a relay from compute nodes to local `argo-shim`.
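
Because each run needs a fresh run id, one option is to increment the numeric suffix past any ids already in use. This is a minimal sketch, assuming the `<system>-mace-ensemble-screening-NNN` naming pattern used in this guide; `next_run_id` is a hypothetical helper, not a ChemGraph command, and how the list of existing ids is collected is left to the caller.

```python
import re


def next_run_id(system: str, existing: list) -> str:
    """Return the next unused '<system>-mace-ensemble-screening-NNN' id."""
    prefix = f"{system}-mace-ensemble-screening-"
    pattern = re.compile(re.escape(prefix) + r"(\d+)$")
    # Collect the numeric suffixes of ids that belong to this system.
    used = [int(m.group(1)) for name in existing if (m := pattern.match(name))]
    return f"{prefix}{max(used, default=0) + 1:03d}"


if __name__ == "__main__":
    print(next_run_id("aurora", ["aurora-mace-ensemble-screening-001"]))
```

Ids that belong to another system or do not match the pattern are ignored, so the counter is kept per system.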

## Start The Campaign On Compute

Run inside an interactive allocation:

```bash
cd "$REMOTE_ROOT/ChemGraph"

# Aurora:
module load frameworks

# Polaris:
# module use /soft/modulefiles
# module load conda
# conda activate base

source "$REMOTE_ROOT/venvs/academy-swarm/bin/activate"

export RUN_ID="${ALCF_SYSTEM}-mace-ensemble-screening-001"

export NUMEXPR_MAX_THREADS=256
export NUMEXPR_NUM_THREADS=64
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1

# Aurora/Polaris compute nodes reach external sites (GitHub, S3) only
# through the ALCF outbound proxy. Without these, mace_mp(model="medium-mpa-0")
# hangs trying to fetch the foundation model on first use.
export http_proxy="http://proxy.alcf.anl.gov:3128"
export https_proxy="http://proxy.alcf.anl.gov:3128"
export no_proxy="localhost,127.0.0.1"

export PATH="$REMOTE_ROOT/bin:$REMOTE_ROOT/tools/redis/bin:$PATH"

chemgraph academy run-compute \
--system "$ALCF_SYSTEM" \
--run-id "$RUN_ID" \
--campaign mace-ensemble-screening-20 \
--lm-user "$ARGO_USER"
```

If the wrapper is installed but `chemgraph` is not on `PATH`, use:

```bash
chemgraph-academy-run \
--system "$ALCF_SYSTEM" \
--run-id "$RUN_ID" \
--campaign mace-ensemble-screening-20 \
--lm-user "$ARGO_USER"
```

## Reopen A Local Dashboard

Once the run has been synced locally:

```bash
cd "$LOCAL_CHEMGRAPH"

PYTHONPATH=src python -m chemgraph.cli.main academy dashboard -- \
--system "$ALCF_SYSTEM" \
--remote-host "$ALCF_LOGIN@$ALCF_HOST" \
--campaign mace-ensemble-screening-20 \
"$RUN_ID" \
--local
```

## Dashboard For Traditional ChemGraph Runs

The dashboard also renders single-agent ChemGraph runs that were not launched
through Academy. Pass `--trace-dir <path>` to `chemgraph run` to write the
events the dashboard needs (`events.jsonl`, `status.json`, `manifest.json`),
then point the dashboard at that directory.
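
Before serving a trace directory, it can help to confirm that the three files named above are present. A minimal sketch, assuming only those file names; `missing_trace_files` is a hypothetical helper and not part of the ChemGraph CLI.

```python
from pathlib import Path

# File names taken from the paragraph above.
REQUIRED_TRACE_FILES = ("events.jsonl", "status.json", "manifest.json")


def missing_trace_files(trace_dir: str) -> list:
    """Return the required trace files that are absent from trace_dir."""
    root = Path(trace_dir)
    return [name for name in REQUIRED_TRACE_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_trace_files("./run-001")
    print("trace directory looks complete" if not missing else f"missing: {missing}")
```

An empty result only shows that the files exist; it says nothing about whether their contents are valid.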

On-site at ANL, the simplest path is the built-in Argo support — no shim or
relay needed (set `ARGO_USER` once per shell, or in your shell profile):

```bash
export ARGO_USER=<argo-user>

chemgraph run \
-q "What is the SMILES for water" \
-m "argo:gpt-5.4" \
--trace-dir ./run-001
```

Then serve the trace directory:

```bash
chemgraph dashboard -- --run-dir ./run-001 --port 8765
# Open http://127.0.0.1:8765
```

The browser shows the same per-agent workflow inspector that Academy displays
for a logical-agent node (query → LLM call → tool calls → output), but at the
top level since the run only has one agent. Use a fresh `--trace-dir` per run
so multiple runs don't pile into one `events.jsonl`.

`--trace-dir` is currently only effective for the `single_agent` workflow.
Other workflows (`multi_agent`, `python_relp`, `graspa`, `rag_agent`,
`single_agent_xanes`, ...) run normally but don't yet emit dashboard events,
and the CLI prints a yellow warning for those.

If the browser shows "Waiting for ChemGraph workflow execution events" after a
run completed successfully, the checkout that ran the workflow is missing the
fix that emits `llm_decision` on every LLM call. Sync the latest ChemGraph and
clear stale bytecode in that checkout:

```bash
find src/chemgraph -name __pycache__ -type d -exec rm -rf {} +
```

## Troubleshooting

Check the relay from compute:

```bash
UAN_RELAY_HOST="$(tr -d '[:space:]' < "$REMOTE_ROOT/uan-relay-18186.host")"
curl --noproxy '*' -I "http://${UAN_RELAY_HOST}:18186/v1/models"
```

Expected:

```text
HTTP/1.1 200 OK
```
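
The same relay check can be scripted with the Python standard library. This is a sketch under the assumptions of the `curl` command above (host file location, port 18186, `/v1/models` path); `relay_models_url` and `relay_status` are hypothetical helper names.

```python
import urllib.request
from pathlib import Path


def relay_models_url(host_file: str, port: int = 18186) -> str:
    """Build the relay URL from the host file, removing all whitespace."""
    host = "".join(Path(host_file).read_text().split())
    return f"http://{host}:{port}/v1/models"


def relay_status(url: str, timeout: float = 10.0) -> int:
    """Send a HEAD request that bypasses any configured proxy."""
    # An empty ProxyHandler ignores http_proxy/https_proxy,
    # similar to curl --noproxy '*'.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, method="HEAD")
    with opener.open(request, timeout=timeout) as response:
        return response.status
```

A healthy relay should make `relay_status` return 200. Connection failures raise `urllib.error.URLError`, and non-2xx responses raise `urllib.error.HTTPError`.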

If the first model response is an Argo access-denied notice for `<argo-user>`,
the compute command was launched without `--lm-user "$ARGO_USER"`. Use a fresh
run id, or restart the dashboard with `--overwrite-run`, then rerun compute
with `--lm-user`.

If imports are slow or NumExpr complains, set:

```bash
export NUMEXPR_MAX_THREADS=256
export NUMEXPR_NUM_THREADS=64
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
```

If the first MACE energy evaluation is slow, the worker is likely paying the
one-time foundation-model download into `~/.cache/mace`. Pre-warm by running
the `mace_mp` snippet under "One-Time Setup" above on the compute node before
launching the campaign.
29 changes: 29 additions & 0 deletions examples/academy/example-002-mace-ensemble-screening/notes.md
@@ -0,0 +1,29 @@
# Notes

This root example directory is for user-facing explanation only. The CLI loads
the actual campaign from package data so installed ChemGraph environments can
run the same campaign without relying on a source checkout's root `examples/`
directory.

Packaged assets:

```text
src/chemgraph/academy/campaigns/example-002-mace-ensemble-screening/
campaign.jsonc
lm_config.json
prompt_profiles/
data/
models/
```

The campaign declares MCP server subprocesses for general ChemGraph tools, MACE
screening, and HPC utility inspection. The Academy runtime places one logical
agent per MPI rank, launches the declared MCP servers for each agent, and uses
Academy exchange handles for peer communication.

Each agent's `allowed_tools` field acts as a per-agent whitelist drawn from
the union of the tools its `mcp_servers` advertise. In this example the
structure agents see only `molecule_name_to_smiles` + `smiles_to_coordinate_file`,
and the mace-agent sees only `run_ase` + `extract_output_json` — even though
all four come from the same `general` MCP server. Omit `allowed_tools` (or set
it to `[]`) to expose every tool the connected servers advertise.
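
The whitelist semantics described above can be sketched as follows. This is an illustration only, not the Academy runtime's implementation: `visible_tools` is a hypothetical function, and silently ignoring whitelist entries that no server advertises is an assumption.

```python
from typing import List, Optional


def visible_tools(advertised: List[str], allowed_tools: Optional[List[str]] = None) -> List[str]:
    """Filter advertised tool names by an optional per-agent whitelist."""
    if not allowed_tools:  # omitted or [] exposes every advertised tool
        return list(advertised)
    allowed = set(allowed_tools)
    # Keep the advertised order; names that are not advertised are ignored.
    return [name for name in advertised if name in allowed]


if __name__ == "__main__":
    general = [
        "molecule_name_to_smiles",
        "smiles_to_coordinate_file",
        "run_ase",
        "extract_output_json",
    ]
    print(visible_tools(general, ["run_ase", "extract_output_json"]))
```

With the four tool names from this example, the mace-agent whitelist leaves only `run_ase` and `extract_output_json`, while an empty or omitted whitelist returns all four.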