License Notice
Copyright © 2025 Advanced Micro Devices, Inc. All rights reserved.
This project is licensed under the MIT License. You are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software, subject to the following conditions:
- The above copyright notice and this license notice must be included in all copies or substantial portions of the software.
- The software is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and non-infringement.
Third-party dependencies (ROCm libraries, vLLM, Triton, Magpie, etc.) are governed by their own respective licenses. End users are responsible for reviewing and complying with the licensing terms of any dependencies used in conjunction with this software.
AI Agent notice — This software orchestrates third-party AI coding agents to perform kernel optimization:
| Agent | Provider | What you need |
|---|---|---|
| Claude Code | Anthropic | An active Anthropic account and acceptance of Anthropic's usage policies |
| OpenAI Codex | OpenAI | An active OpenAI account and acceptance of OpenAI's terms of service |
| Cursor Agent | Cursor | An active Cursor subscription with agent mode enabled |

Each user must independently obtain their own credentials and comply with the respective provider's licensing and usage terms. This project does not include, bundle, or sublicense access to any AI model or API. Usage of these agents may incur costs billed directly by the provider.
An RL training environment that tasks an LLM agent with optimizing GPU kernels for AMD ROCm hardware. The agent receives a baseline kernel, a sandbox with relevant source code and documentation, and is scored on compilation, correctness, and runtime speedup.
baseline kernel → prompt constructor → LLM agent → grader (Magpie) → score + reinjection
- Benchmark — profile the model end-to-end to identify bottleneck kernels
- Identify — rank kernels by GPU time and select candidates
- Optimize — an LLM agent writes an optimized kernel in `output/<task_id>/solution.{py,hip}`
- Grade — Magpie checks compilation, correctness, and measures speedup
- Integrate — kernels exceeding the speedup threshold (>1.05×) are hot-patched into site-packages
- Benchmark (final) — re-run E2E benchmark with patches to measure real throughput improvement
- Score & Report — compute rewards, update leaderboard, generate report
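The phases above amount to a single loop. The sketch below is purely illustrative: the function names, arguments, and dict keys are hypothetical stand-ins, not the actual workload_optimizer.py API.

```python
# Minimal sketch of the pipeline loop. `optimize` and `benchmark` are
# hypothetical stand-ins for the agent + Magpie stages and the E2E benchmark.
SPEEDUP_THRESHOLD = 1.05  # only kernels beating this threshold get integrated

def run_pipeline(kernels, optimize, benchmark):
    baseline_tps = benchmark(patched=[])                     # 1. profile end-to-end
    ranked = sorted(kernels, key=lambda k: -k["gpu_time"])   # 2. rank by GPU time
    graded = [optimize(k) for k in ranked]                   # 3-4. agent writes, grader scores
    winners = [g for g in graded if g["speedup"] > SPEEDUP_THRESHOLD]  # 5. integrate
    final_tps = benchmark(patched=winners)                   # 6. re-benchmark with patches
    return {"winners": winners,                              # 7. score & report
            "baseline_tps": baseline_tps, "final_tps": final_tps}
```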
- OS: Linux (Ubuntu 22.04+ recommended)
- Python: 3.10+
- Node.js: 18+ (for agent CLIs)
- System packages: `git`, `curl`, `jq`
- GPU (optional): AMD Instinct GPU with ROCm 6.x+ (required for real kernel grading; not needed for CPU-only eval)
git clone <repo-url> Apex
cd Apex

Install whichever agent(s) you plan to use:
# Claude Code
npm install -g @anthropic-ai/claude-code
claude login
# OpenAI Codex
npm install -g @openai/codex
codex login
# Cursor Agent (standalone CLI)
npm install -g cursor-agent
cursor-agent login
# Cursor IDE (alternative — open Apex folder in Cursor; MCP servers auto-configure via .mcp.json)

bash setup.sh

This single command handles everything:
- CLI selection — choose Claude Code, Codex, Cursor Agent, or all
- Python venv — creates `.venv/` (or reuses an existing one)
- Python dependencies — installs numpy, PyYAML, pytest, MCP packages, SDKs, etc.
- PyTorch for ROCm — installs `torch` + `torchvision` from the ROCm 7.2 wheel index
- Triton — installs the Triton compiler
- ROCm source repos — clones AMD kernel source code into `tools/rocm/` (optional, for source-finder & RAG)
- Documentation — downloads AMD architecture PDFs for the RAG server (optional)
- MCP servers — installs and registers 5 MCP servers with the selected CLI(s)
- Magpie — clones and installs the kernel evaluation framework into `tools/magpie/`
- Skills — makes 13 domain-specific optimization skills discoverable by agents
- `.mcp.json` — generates workspace config so Cursor IDE auto-discovers MCP servers
bash setup.sh # Interactive (prompts for choices)
bash setup.sh --non-interactive # Auto-detect CLIs, accept all defaults
bash setup.sh --skip-downloads # Skip ROCm repo cloning + doc downloads
bash setup.sh --skip-tools # Skip MCP + Magpie installation
bash setup.sh --venv=/path/.venv # Use a specific venv path

source .venv/bin/activate
export MAGPIE_ROOT=$(pwd)/tools/magpie
# Interactive agent session
claude # or: codex / cursor-agent
# Automated pipeline
python3 workload_optimizer.py run \
-r ./results \
-b $MAGPIE_ROOT/examples/benchmarks/benchmark_vllm_gptoss_120b.yaml \
--kernel-types triton --top-k 10 \
--max-iterations 3 --max-turns 25 --leaderboard

Apex/
├── workload_optimizer.py # Main pipeline CLI
├── eval.py # Mini eval (CPU-only, no GPU required)
├── setup.sh # One-shot environment setup
├── mcp_config.json # MCP server configuration
├── .mcp.json # Auto-generated by setup.sh (MCP config for Cursor IDE)
│
├── agents/
│ └── backends.py # Claude Code SDK + Codex + Cursor Agent runner
│
├── pipeline/
│ ├── knowledge_base.py # Cross-kernel/cross-run learning store
│ ├── reflector.py # Agent self-reflection between iterations
│ ├── trajectory.py # Trajectory recording (file / CouchDB / S3)
│ ├── leaderboard.py # Leaderboard tracking (file / CouchDB)
│ ├── kernel_bottleneck.py # Profiling data parser, kernel classification
│ └── export_rl_dataset.py # RL/SFT dataset export from trajectories
│
├── prompts/
│ ├── models.py # Model registry (Qwen3.5, GPT-OSS, etc.)
│ ├── configs.py # 17 inference configurations
│ ├── kernel_prompt.py # Kernel-level prompt constructor
│ └── model_prompt.py # Model-level prompt constructor
│
├── graders/
│ ├── score.py # Scoring formula + Magpie helpers
│ ├── kernel_grader.py # Grades kernel tasks via Magpie
│ ├── model_grader.py # Grades E2E model throughput via Magpie
│ ├── ground_truth.py # ROCm kernel discovery + ground truth specs
│ ├── config_generator.py # Magpie config.yaml generation + validation
│ └── cache_manager.py # Cache isolation for reproducible grading
│
├── tools/
│ ├── setup_tools.sh # Installs Magpie, MCP servers, skills
│ ├── skills/ # 13 domain skills (SKILL.md files)
│ ├── mcps/ # MCP server source
│ └── jsons/ # ROCm metadata indexes
│
├── files/
│ ├── setup_files.sh # Clones ROCm repos and downloads docs
│ ├── hip_best_practices.md
│ └── triton_best_practices.md
│
├── tests/ # pytest suite
│
└── output/ # Agent solutions (git-ignored)
└── <task_id>/
├── solution.py / solution.hip
├── config.yaml
└── …
Run the entire optimization loop end-to-end:
source .venv/bin/activate
export MAGPIE_ROOT=$(pwd)/tools/magpie
RESULTS=./results
BENCH_CONFIG=$MAGPIE_ROOT/examples/benchmarks/benchmark_vllm_gptoss_120b.yaml
python3 workload_optimizer.py run \
-r $RESULTS \
-b $BENCH_CONFIG \
--kernel-types triton \
--top-k 10 \
--max-iterations 3 \
--max-turns 25 \
--leaderboard

# 1. Benchmark (or --skip-benchmark <path-to-existing-report.json>)
python3 workload_optimizer.py benchmark -r $RESULTS -b $BENCH_CONFIG
# 2. Identify top bottleneck kernels
python3 workload_optimizer.py identify -r $RESULTS --kernel-types triton --top-k 10
# 3. List identified kernels
python3 workload_optimizer.py list-kernels -r $RESULTS
# 4. Optimize all identified kernels
python3 workload_optimizer.py optimize -r $RESULTS --max-iterations 3 --max-turns 25
# 5. Integrate winners (auto-filters to >5% speedup)
python3 workload_optimizer.py integrate -r $RESULTS
# 6. Final E2E benchmark with optimized kernels
python3 workload_optimizer.py benchmark-final -r $RESULTS -b $BENCH_CONFIG
# 7. Score + trajectory + leaderboard
python3 workload_optimizer.py score -r $RESULTS --leaderboard
# 8. Generate report + replication guide
python3 workload_optimizer.py report -r $RESULTS -b $BENCH_CONFIG

Optimize a single kernel without running the full pipeline:
python3 workload_optimizer.py optimize-kernel \
-r ./results \
--kernel path/to/baseline_kernel.py \
--kernel-name rms_norm \
--kernel-type triton \
--agent-backend cursor \
--max-iterations 3 --max-turns 25

Correctness modes for standalone optimization:
# PyTorch reference (default) — validates against a PyTorch implementation
--correctness-mode pytorch
# Library test — runs the original library's unit test suite
--correctness-mode library_test
# Accordo — HSA-level validation for HIP/C++ kernels
--correctness-mode accordo

Launch the agent directly for exploratory optimization:
# Claude Code
cd Apex && claude
# OpenAI Codex
cd Apex && codex
# Cursor Agent (standalone CLI)
cd Apex && cursor-agent
# Cursor IDE (open Apex folder — MCP servers auto-configure via .mcp.json)
cursor .

These prompts are tested and work end-to-end. Open the Claude Code CLI from the Apex directory and paste a prompt:
Run the full optimization pipeline for Qwen3.5 27B with these settings:
triton kernels only, top 3 bottleneck kernels, 3 optimization iterations,
max 25 agent turns per iteration, claude agent backend, and leaderboard enabled.
Set HF_HOME=/mnt/dcgpuval/sirafati/hf before running.
Show the final score comparison and generate a report when done.
Run the full optimization pipeline for GPT OSS 20B with these settings:
triton kernels only, top 3 bottleneck kernels, 3 optimization iterations,
max 25 agent turns per iteration, claude agent backend, and leaderboard enabled.
Show the final score comparison and generate a report when done.
Optimize the rms_norm Triton kernel on MI355X.
Write the solution to output/ and grade it when done. Show the score breakdown.
The agent reads CLAUDE.md / AGENTS.md, discovers MCP tools, and translates the prompt into the correct workload_optimizer.py commands.
How to use:
# Open Claude Code from the Apex directory, then paste any prompt above:
cd Apex && claude

Equivalent direct commands (what the agent executes under the hood):
# Full pipeline for Qwen3.5 27B
python3 workload_optimizer.py run \
-r ./results_qwen35_27b \
-b $MAGPIE_ROOT/examples/benchmarks/benchmark_vllm_qwen35_27b.yaml \
--kernel-types triton --top-k 3 \
--max-iterations 3 --max-turns 25 \
--agent-backend claude --leaderboard
# Full pipeline for GPT-OSS-20B
python3 workload_optimizer.py run \
-r ./results_gptoss_20b \
-b $MAGPIE_ROOT/examples/benchmarks/benchmark_vllm_gptoss_20b.yaml \
--kernel-types triton --top-k 3 \
--max-iterations 3 --max-turns 25 \
--agent-backend claude --leaderboard
# Standalone kernel optimization
python3 workload_optimizer.py optimize-kernel \
-r ./results \
--kernel tools/rocm/aiter/aiter/ops/triton/normalization/rmsnorm.py \
--kernel-name rms_norm --kernel-type triton \
--correctness-mode pytorch \
--agent-backend claude \
--max-iterations 1 --max-turns 10

Exercises the full pipeline on a CPU-only task (naive Python RMSNorm → NumPy):
pip install -r requirements-eval.txt
python3 eval.py # Uses Claude API
python3 eval.py --dry-run # Skip API call, grade a trivial solution
python3 eval.py --model claude-opus-4-6 --max-turns 12

python3 prompts/kernel_prompt.py --list # List all kernel task IDs
python3 prompts/kernel_prompt.py --task-id llama-3-1-8b-instruct__rms_norm # Print a single prompt
python3 prompts/kernel_prompt.py --all > prompts.jsonl # Dump all as JSONL
python3 prompts/kernel_prompt.py --target gfx942 --list # Target a specific GPU

python3 graders/kernel_grader.py # Grade all kernel tasks in output/
python3 graders/model_grader.py # Grade model-level tasks

# Export trajectory data as RL training tasks
python3 workload_optimizer.py export-rl -r ./results --export-output-dir ./datasets
# Include SFT warm-start pairs
python3 workload_optimizer.py export-rl -r ./results --export-output-dir ./datasets --sft

pytest tests/ -v
pytest tests/test_prompts.py -v # Prompt tests only
pytest tests/test_graders.py -v # Grader tests only

--agent-backend claude # Use Claude Code (default)
--agent-backend codex # Use OpenAI Codex
--agent-backend cursor # Use Cursor Agent

The pipeline runs E2E benchmarks inside Docker containers. Override the vLLM image:
--docker-image vllm/vllm-openai-rocm:v0.19.0

Or set the environment variable:
export APEX_VLLM_ROCM_IMAGE=vllm/vllm-openai-rocm:v0.19.0

Cache E2E baseline results to skip the ~30-minute benchmark on repeat runs:
--benchmark-cache-hours 4

When using the run subcommand (full pipeline), optimize up to N kernels simultaneously (agent reasoning is API-bound; GPU grading is serialized):
python3 workload_optimizer.py run ... --parallel-kernels 2

Note: the standalone optimize subcommand processes kernels sequentially regardless of this flag.
Assign different models based on kernel difficulty:
--agent-model-simple claude-sonnet-4-20250514 \
--agent-model-complex claude-opus-4-6

AST-based detection penalizes solutions that fake benchmark results (sys.exit(), hardcoded PASS, fabricated timings). Configure the speedup cap:
--tampering-speedup-cap 1.0

score = compiled × 20 + correct × 100 + speedup_score(S)
Where S = baseline_time / optimized_time. Only compiled + correct solutions earn the speedup component.
- Compiled (+20 pts): solution imports and defines the expected function
- Correct (+100 pts): passes all unit tests against the baseline
- Speedup (piecewise):
  - S ≥ 1.0: `100 + (S − 1) × 200` pts (e.g. 1.2× → 140, 2× → 300, 3× → 500)
  - S < 1.0: `max(0, 100 × S − 50)` pts (regression penalty)
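The formula translates to a few lines of Python. This is a sketch of the scoring logic described above, not the exact graders/score.py implementation:

```python
def speedup_score(s: float) -> float:
    """Piecewise speedup component, where s = baseline_time / optimized_time."""
    if s >= 1.0:
        return 100 + (s - 1) * 200   # e.g. 1.2x -> 140, 2x -> 300, 3x -> 500
    return max(0.0, 100 * s - 50)    # regression penalty below 1.0x

def kernel_score(compiled: bool, correct: bool, s: float) -> float:
    score = 20 * compiled + 100 * correct
    if compiled and correct:         # only passing solutions earn the speedup component
        score += speedup_score(s)
    return score
```

A compiled, correct solution at parity (1.0×) scores 220, and one at 2× scores 420 under this formula.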
reward = 0.7 × normalized_kernel_score + 0.3 × (optimized_tps / baseline_tps − 1)
Kernel score is normalized to [0, 1] against a reference of 420 pts (compile + correct + 2× speedup). Model-level grading requires a full AMD GPU environment.
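A sketch of the reward under the same assumptions; clamping the normalized score to [0, 1] is an assumption about how the normalization is applied:

```python
REFERENCE_SCORE = 420.0  # compile (20) + correct (100) + 2x speedup (300)

def model_reward(kernel_score: float, optimized_tps: float, baseline_tps: float) -> float:
    # Normalize the kernel score against the 420-pt reference, clamped to [0, 1]
    normalized = max(0.0, min(kernel_score / REFERENCE_SCORE, 1.0))
    # Blend with the relative end-to-end throughput gain
    return 0.7 * normalized + 0.3 * (optimized_tps / baseline_tps - 1)
```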
Default target: AMD Instinct MI355X (gfx950 / CDNA4).
| --target | Hardware |
|---|---|
| gfx950 | AMD Instinct MI355X (CDNA4) — default |
| gfx942 | AMD Instinct MI300X (CDNA3) |
| gfx940 | AMD Instinct MI300A (CDNA3) |
| gfx90a | AMD Instinct MI250X (CDNA2) |
The GPU is auto-detected via rocm-smi if available; otherwise the pipeline falls back to gfx950.
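The detection amounts to something like the following. This is illustrative only: the product-name mapping and the parsing of `rocm-smi --showproductname` output are assumptions, not the pipeline's actual code.

```python
import shutil
import subprocess

# Hypothetical product-name -> gfx target mapping (not exhaustive)
PRODUCT_TO_ARCH = {
    "MI355X": "gfx950",
    "MI300X": "gfx942",
    "MI300A": "gfx940",
    "MI250X": "gfx90a",
}

def detect_target(default: str = "gfx950") -> str:
    """Return a gfx target, falling back to the default when rocm-smi is absent."""
    if shutil.which("rocm-smi") is None:
        return default
    try:
        out = subprocess.run(["rocm-smi", "--showproductname"],
                             capture_output=True, text=True, timeout=10).stdout
    except Exception:
        return default
    for product, arch in PRODUCT_TO_ARCH.items():
        if product in out:
            return arch
    return default
```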
When the pipeline integrates optimized kernels for the final E2E benchmark, it hot-patches installed Python modules in site-packages. All patches are restored after benchmarking.
Supported (hot-patching):
- aiter, vllm, sglang — Python/Triton `.py` kernels can be replaced in site-packages. Triton JIT re-compiles automatically on next invocation.
- aiter HIP — standalone `.so` files can be recompiled with `hipcc` and swapped.
Not supported (requires source rebuild):
- System C/C++ libraries — hipBLASLt, rocBLAS, composable_kernel (CK), MIOpen, rccl are system-level shared libraries that cannot be individually hot-patched.
- Monolithic `_C.so` — vLLM, sglang, and PyTorch HIP kernels compile into a single binary and cannot be individually replaced.
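For the supported `.py` case, hot-patching reduces to a backup-and-swap of the installed module file. A simplified sketch (the real pipeline tracks and restores patches per kernel; `hot_patch` is a hypothetical helper, not the actual API):

```python
import importlib.util
import shutil
from pathlib import Path

def hot_patch(module_name: str, solution: Path) -> Path:
    """Swap an installed .py kernel module for an optimized solution.

    Returns the backup path so the caller can restore the original
    (e.g. shutil.move(backup, target)) after the final benchmark.
    """
    target = Path(importlib.util.find_spec(module_name).origin)
    backup = target.with_suffix(".py.bak")
    shutil.copy2(target, backup)     # keep the original for rollback
    shutil.copy2(solution, target)   # install the optimized kernel
    return backup
```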
Apex ships with 5 MCP servers that give agents access to domain-specific tools:
| MCP | Key tools | Purpose |
|---|---|---|
| source-finder | find_kernel_source, classify_kernel | Search kernel implementations across ROCm repos |
| kernel-rag | search_kernel_optimization, get_optimization_playbook | Optimization patterns, snippets, domain analysis |
| gpu-info | get_gpu_info, get_arch_optimization_hints | MI355X / CDNA4 specs and optimization hints |
| fusion-advisor | detect_fusion_opportunities, generate_fused_kernel | Kernel fusion detection and code generation |
| magpie | analyze, compare, benchmark | Kernel correctness/performance evaluation |
MCP servers are auto-configured for:
- Claude Code — registered via `claude mcp add` during setup
- Codex — registered via `codex mcp add` during setup
- Cursor IDE — auto-discovered from `.mcp.json` in the project root (no manual registration needed)
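A generated `.mcp.json` looks roughly like this. The fragment is illustrative only: the server names shown are from the table above, but the script paths, filenames, and interpreter location are assumptions that in practice come from `setup.sh` and your venv:

```json
{
  "mcpServers": {
    "gpu-info": {
      "command": ".venv/bin/python",
      "args": ["tools/mcps/gpu_info_server.py"]
    },
    "source-finder": {
      "command": ".venv/bin/python",
      "args": ["tools/mcps/source_finder_server.py"]
    }
  }
}
```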
Apex uses a two-layer prompt system when agents optimize kernels:
System prompt (SYSTEM_PROMPT in workload_optimizer.py) defines the agent's role and constraints:
- GPU kernel engineer persona with AMD ROCm specialization
- MCP tool inventory and when to use each tool
- Skill paths (13 `SKILL.md` files the agent can read for domain knowledge)
- Mandatory compare-before-submit workflow via Magpie
- Speedup calibration guidance and anti-tampering rules
Task prompt (KERNEL_PROMPT_TEMPLATE in prompts/kernel_prompt.py) is built per-kernel with:
| Section | Content |
|---|---|
| Target hardware | GPU arch, wavefront size, MFMA units, LDS, HBM bandwidth |
| Task definition | Kernel type, model architecture, framework (vLLM/SGLang) |
| Source locations | Paths to baseline implementations in tools/rocm/ (aiter, CK, etc.) |
| MCP tools table | Available tools for source search, RAG, GPU info, fusion, Magpie |
| Skills table | 13 domain-specific optimization skills the agent can read |
| Instructions | Step-by-step: locate baseline, analyze bottlenecks, write solution.py |
| Optimization hints | Architecture-specific tips (e.g. MFMA usage, LDS tiling for CDNA4) |
When running via the full pipeline, the prompt is further enriched with:
- Baseline source code inlined as markdown
- Profiling data (GPU time %, bound type, bandwidth/compute utilization)
- Knowledge base insights from prior optimization runs
- Correctness reference (PyTorch reference code or library test commands)
Preview any task prompt:
python3 prompts/kernel_prompt.py --task-id llama-3-1-8b-instruct__rms_norm

These kernels have explicit library test commands in the ground truth registry, enabling --correctness-mode library_test for validation against aiter's own pytest suite:
| Kernel | Test command | Type |
|---|---|---|
| silu_mul | pytest aiter/op_tests/triton_tests/test_activation.py | memory-bound |
| gemm_bf16 | pytest aiter/op_tests/triton_tests/gemm/basic/test_gemm_a16w16.py | compute-bound |
| gemm_w8a8 | pytest aiter/op_tests/triton_tests/gemm/basic/test_gemm_a8w8.py | compute-bound |
| act_quant_fp8 | pytest aiter/op_tests/triton_tests/quant/test_quant.py | memory-bound |
| kv_cache_ops | pytest aiter/op_tests/triton_tests/fusions/test_fused_kv_cache.py | memory-bound |
| all_reduce | pytest aiter/op_tests/multigpu_tests/test_quick_all_reduce.py | comms |
21 models covering a range of architectures:
| Family | Models | Attention | MLP |
|---|---|---|---|
| Llama 3 | 1B, 8B, 70B (×2) | GQA | Dense |
| Mistral / Mixtral | 7B, 8×7B, 8×22B | GQA | Dense / MoE |
| Qwen 2.5 | 7B, 32B, 72B, Coder-32B | GQA | Dense |
| Gemma 2 | 9B, 27B | GQA | Dense |
| DeepSeek | R1 (671B), V3 (671B), R1-Distill-70B | MLA / GQA | MoE / Dense |
| Kimi | K2-Thinking | MLA | MoE |
| GPT OSS | 120B | GQA | MoE |
| Phi | 3.5-mini, phi-4 | GQA | Dense |
| Falcon | 7B | MQA | Dense |
12 kernel types are defined, applicable to models based on their architecture:
| Kernel | Framework | Notes |
|---|---|---|
| flash_attn_prefill | Triton | Flash attention for prompt (prefill) phase |
| paged_attn_decode | Triton | Paged attention for autoregressive decoding |
| mla_attn | Triton | Multi-Head Latent Attention (DeepSeek MLA) |
| fused_moe | Triton | Fused MoE gate + routing + expert GEMM |
| gemm_w8a8 | HIP | FP8 × FP8 GEMM for quantized linear layers |
| gemm_bf16 | HIP | BF16 GEMM for QKV/up/gate/down projections |
| rms_norm | Triton | Pre/post-attention and MLP normalization |
| rope_embedding | Triton | Rotary position embedding (Q and K) |
| kv_cache_ops | Triton | KV cache reshape, copy, and quantization |
| all_reduce | HIP | Tensor-parallel all-reduce (RCCL + fused kernels) |
| act_quant_fp8 | Triton | Dynamic per-token FP8 activation quantization |
| silu_mul | Triton | Fused SiLU × gate (SwiGLU) for MLP |
| Kernel | Speedup | Score | Settings | Notes |
|---|---|---|---|---|
| all_reduce | 36.35x | 7290 | 3 iter / 25 turns | HIP, library_test; multi-GPU |
| rms_norm | 1.05x | 674 | 1 iter / 10 turns | Triton, pytorch mode |
| fused_moe | 1.14x | 248 | 1 iter / 10 turns | Triton, pytorch mode |
| gemm_bf16 | 1.00x | 220 | 3 iter / 25 turns | Triton, library_test |
| silu_mul | 1.00x | 220 | 1 iter / 10 turns | Triton, library_test |
| act_quant_fp8 | 1.00x | 220 | 1 iter / 10 turns | Triton, library_test |
| kv_cache_ops | 1.00x | 220 | 1 iter / 10 turns | Triton, library_test |
7 kernels validated with correct optimizations. Top performers: all_reduce (36.35x), fused_moe (1.14x), rms_norm (1.05x).
# Install manually with the correct ROCm version
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/rocm7.2
# For older ROCm:
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2

# Verify Node.js 18+
node --version
# Claude Code
npm install -g @anthropic-ai/claude-code && claude login
# Codex
npm install -g @openai/codex && codex login
# Cursor Agent
npm install -g cursor-agent && cursor-agent login

Cursor auto-discovers MCP servers from .mcp.json. If MCPs aren't loading:
- Verify `.mcp.json` exists in the Apex root (generated by `setup.sh`)
- Restart Cursor or reload the window
- Check that the Python path in `.mcp.json` matches your venv
The pipeline auto-detects GPUs via rocm-smi. For CPU-only development:
# Use the mini eval (no GPU needed)
python3 eval.py --dry-run
# Use --skip-benchmark with pre-recorded profiling data
python3 workload_optimizer.py run -r ./results --skip-benchmark report.json ...