License Notice
Copyright © 2025 Advanced Micro Devices, Inc. All rights reserved.
This project is licensed under the MIT License. You are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software, subject to the following conditions:
- The above copyright notice and this license notice must be included in all copies or substantial portions of the software.
- The software is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and non-infringement.
Third-party dependencies (ROCm libraries, vLLM, Triton, Magpie, etc.) are governed by their own respective licenses. End users are responsible for reviewing and complying with the licensing terms of any dependencies used in conjunction with this software.
AI Agent notice — This software orchestrates third-party AI coding agents to perform kernel optimization:
| Agent | Provider | What you need |
|---|---|---|
| Claude Code | Anthropic | An active Anthropic account and acceptance of Anthropic's usage policies |
| OpenAI Codex | OpenAI | An active OpenAI account and acceptance of OpenAI's terms of service |
| Cursor Agent | Cursor | An active Cursor subscription with agent mode enabled |

Each user must independently obtain their own credentials and comply with the respective provider's licensing and usage terms. This project does not include, bundle, or sublicense access to any AI model or API. Usage of these agents may incur costs billed directly by the provider.
An RL training environment that tasks an LLM agent with optimizing GPU kernels for AMD ROCm hardware. The agent receives a baseline kernel, a sandbox with relevant source code and documentation, and is scored on compilation, correctness, and runtime speedup.
baseline kernel → prompt constructor → LLM agent → grader (Magpie) → score + reinjection
- Benchmark — profile the model end-to-end to identify bottleneck kernels
- Identify — rank kernels by GPU time and select candidates
- Optimize — an LLM agent writes an optimized kernel in `output/<task_id>/solution.{py,hip}`
- Grade — Magpie checks compilation, correctness, and measures speedup
- Integrate — kernels exceeding the speedup threshold (>1.05×) are hot-patched into site-packages
- Benchmark (final) — re-run E2E benchmark with patches to measure real throughput improvement
- Score & Report — compute rewards, update leaderboard, generate report
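The phases above amount to a single loop. The sketch below is purely illustrative: the function names, arguments, and dict keys are hypothetical stand-ins, not the actual workload_optimizer.py API.

```python
# Minimal sketch of the pipeline loop. `optimize` and `benchmark` are
# hypothetical stand-ins for the agent + Magpie stages and the E2E benchmark.
SPEEDUP_THRESHOLD = 1.05  # only kernels beating this threshold get integrated

def run_pipeline(kernels, optimize, benchmark):
    baseline_tps = benchmark(patched=[])                     # 1. profile end-to-end
    ranked = sorted(kernels, key=lambda k: -k["gpu_time"])   # 2. rank by GPU time
    graded = [optimize(k) for k in ranked]                   # 3-4. agent writes, grader scores
    winners = [g for g in graded if g["speedup"] > SPEEDUP_THRESHOLD]  # 5. integrate
    final_tps = benchmark(patched=winners)                   # 6. re-benchmark with patches
    return {"winners": winners,                              # 7. score & report
            "baseline_tps": baseline_tps, "final_tps": final_tps}
```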
- OS: Linux (Ubuntu 22.04+ recommended)
- Python: 3.10+
- Node.js: 18+ (for agent CLIs)
- System packages: `git`, `curl`, `jq`
- GPU (optional): AMD Instinct GPU with ROCm 6.x+ (required for real kernel grading; not needed for CPU-only eval)
git clone <repo-url> Apex
cd Apex

Install whichever agent(s) you plan to use:
# Claude Code
npm install -g @anthropic-ai/claude-code
claude login
# OpenAI Codex
npm install -g @openai/codex
codex login
# Cursor Agent (standalone CLI)
npm install -g cursor-agent
cursor-agent login
# Cursor IDE (alternative — open Apex folder in Cursor; MCP servers auto-configure via .mcp.json)

bash setup.sh

This single command handles everything:
- CLI selection — choose Claude Code, Codex, Cursor Agent, or all
- Python venv — creates `.venv/` (or reuses an existing one)
- Python dependencies — installs numpy, PyYAML, pytest, MCP packages, SDKs, etc.
- PyTorch for ROCm — installs `torch` + `torchvision` from the ROCm 7.2 wheel index
- Triton — installs the Triton compiler
- ROCm source repos — clones AMD kernel source code into `tools/rocm/` (optional, for source-finder & RAG)
- Documentation — downloads AMD architecture PDFs for the RAG server (optional)
- MCP servers — installs and registers 5 MCP servers with the selected CLI(s)
- Magpie — clones and installs the kernel evaluation framework into `tools/magpie/`
- Skills — makes 13 domain-specific optimization skills discoverable by agents
- `.mcp.json` — generates workspace config so Cursor IDE auto-discovers MCP servers
bash setup.sh # Interactive (prompts for choices)
bash setup.sh --non-interactive # Auto-detect CLIs, accept all defaults
bash setup.sh --skip-downloads # Skip ROCm repo cloning + doc downloads
bash setup.sh --skip-tools # Skip MCP + Magpie installation
bash setup.sh --venv=/path/.venv # Use a specific venv path

source .venv/bin/activate
export MAGPIE_ROOT=$(pwd)/tools/magpie
# Interactive agent session
claude # or: codex / cursor-agent
# Automated pipeline
python3 workload_optimizer.py run \
-r ./results \
-b $MAGPIE_ROOT/examples/benchmarks/benchmark_vllm_gptoss_120b.yaml \
--kernel-types triton --top-k 10 \
--max-iterations 3 --max-turns 25 --leaderboard

Apex/
├── workload_optimizer.py # Main pipeline CLI
├── eval.py # Mini eval (CPU-only, no GPU required)
├── setup.sh # One-shot environment setup
├── mcp_config.json # MCP server configuration
├── .mcp.json # Auto-generated by setup.sh (MCP config for Cursor IDE)
│
├── agents/
│ └── backends.py # Claude Code SDK + Codex + Cursor Agent runner
│
├── pipeline/
│ ├── knowledge_base.py # Cross-kernel/cross-run learning store
│ ├── reflector.py # Agent self-reflection between iterations
│ ├── trajectory.py # Trajectory recording (file / CouchDB / S3)
│ ├── leaderboard.py # Leaderboard tracking (file / CouchDB)
│ ├── kernel_bottleneck.py # Profiling data parser, kernel classification
│ └── export_rl_dataset.py # RL/SFT dataset export from trajectories
│
├── prompts/
│ ├── models.py # Model registry (Qwen3.5, GPT-OSS, etc.)
│ ├── configs.py # 17 inference configurations
│ ├── kernel_prompt.py # Kernel-level prompt constructor
│ └── model_prompt.py # Model-level prompt constructor
│
├── graders/
│ ├── score.py # Scoring formula + Magpie helpers
│ ├── kernel_grader.py # Grades kernel tasks via Magpie
│ ├── model_grader.py # Grades E2E model throughput via Magpie
│ ├── ground_truth.py # ROCm kernel discovery + ground truth specs
│ ├── config_generator.py # Magpie config.yaml generation + validation
│ └── cache_manager.py # Cache isolation for reproducible grading
│
├── tools/
│ ├── setup_tools.sh # Installs Magpie, MCP servers, skills
│ ├── skills/ # 13 domain skills (SKILL.md files)
│ ├── mcps/ # MCP server source
│ └── jsons/ # ROCm metadata indexes
│
├── files/
│ ├── setup_files.sh # Clones ROCm repos and downloads docs
│ ├── hip_best_practices.md
│ └── triton_best_practices.md
│
├── tests/ # pytest suite
│
└── output/ # Agent solutions (git-ignored)
└── <task_id>/
├── solution.py / solution.hip
├── config.yaml
└── …
Run the entire optimization loop end-to-end:
source .venv/bin/activate
export MAGPIE_ROOT=$(pwd)/tools/magpie
RESULTS=./results
BENCH_CONFIG=$MAGPIE_ROOT/examples/benchmarks/benchmark_vllm_gptoss_120b.yaml
python3 workload_optimizer.py run \
-r $RESULTS \
-b $BENCH_CONFIG \
--kernel-types triton \
--top-k 10 \
--max-iterations 3 \
--max-turns 25 \
--leaderboard

# 1. Benchmark (or --skip-benchmark <path-to-existing-report.json>)
python3 workload_optimizer.py benchmark -r $RESULTS -b $BENCH_CONFIG
# 2. Identify top bottleneck kernels
python3 workload_optimizer.py identify -r $RESULTS --kernel-types triton --top-k 10
# 3. List identified kernels
python3 workload_optimizer.py list-kernels -r $RESULTS
# 4. Optimize all identified kernels
python3 workload_optimizer.py optimize -r $RESULTS --max-iterations 3 --max-turns 25
# 5. Integrate winners (auto-filters to >5% speedup)
python3 workload_optimizer.py integrate -r $RESULTS
# 6. Final E2E benchmark with optimized kernels
python3 workload_optimizer.py benchmark-final -r $RESULTS -b $BENCH_CONFIG
# 7. Score + trajectory + leaderboard
python3 workload_optimizer.py score -r $RESULTS --leaderboard
# 8. Generate report + replication guide
python3 workload_optimizer.py report -r $RESULTS -b $BENCH_CONFIG

Optimize a single kernel without running the full pipeline:
python3 workload_optimizer.py optimize-kernel \
-r ./results \
--kernel path/to/baseline_kernel.py \
--kernel-name rms_norm \
--kernel-type triton \
--agent-backend cursor \
--max-iterations 3 --max-turns 25

Correctness modes for standalone optimization:
# PyTorch reference (default) — validates against a PyTorch implementation
--correctness-mode pytorch
# Library test — runs the original library's unit test suite
--correctness-mode library_test
# Accordo — HSA-level validation for HIP/C++ kernels
--correctness-mode accordo

Launch the agent directly for exploratory optimization:
# Claude Code
cd Apex && claude
# OpenAI Codex
cd Apex && codex
# Cursor Agent (standalone CLI)
cd Apex && cursor-agent
# Cursor IDE (open Apex folder — MCP servers auto-configure via .mcp.json)
cursor .

These prompts are tested and work end-to-end. Open the Claude Code CLI from the Apex directory and paste a prompt:
Run the full optimization pipeline for Qwen3.5 27B with these settings:
triton kernels only, top 3 bottleneck kernels, 3 optimization iterations,
max 25 agent turns per iteration, claude agent backend, and leaderboard enabled.
Set HF_HOME=/mnt/dcgpuval/sirafati/hf before running.
Show the final score comparison and generate a report when done.
Run the full optimization pipeline for GPT OSS 20B with these settings:
triton kernels only, top 3 bottleneck kernels, 3 optimization iterations,
max 25 agent turns per iteration, claude agent backend, and leaderboard enabled.
Show the final score comparison and generate a report when done.
Optimize the rms_norm Triton kernel on MI355X.
Write the solution to output/ and grade it when done. Show the score breakdown.
The agent reads CLAUDE.md / AGENTS.md, discovers MCP tools, and translates the prompt into the correct workload_optimizer.py commands.
How to use:
# Open Claude Code from the Apex directory, then paste any prompt above:
cd Apex && claude

Equivalent direct commands (what the agent executes under the hood):
# Full pipeline for Qwen3.5 27B
python3 workload_optimizer.py run \
-r ./results_qwen35_27b \
-b $MAGPIE_ROOT/examples/benchmarks/benchmark_vllm_qwen35_27b.yaml \
--kernel-types triton --top-k 3 \
--max-iterations 3 --max-turns 25 \
--agent-backend claude --leaderboard
# Full pipeline for GPT-OSS-20B
python3 workload_optimizer.py run \
-r ./results_gptoss_20b \
-b $MAGPIE_ROOT/examples/benchmarks/benchmark_vllm_gptoss_20b.yaml \
--kernel-types triton --top-k 3 \
--max-iterations 3 --max-turns 25 \
--agent-backend claude --leaderboard
# Standalone kernel optimization
python3 workload_optimizer.py optimize-kernel \
-r ./results \
--kernel tools/rocm/aiter/aiter/ops/triton/normalization/rmsnorm.py \
--kernel-name rms_norm --kernel-type triton \
--correctness-mode pytorch \
--agent-backend claude \
--max-iterations 1 --max-turns 10

Exercises the full pipeline on a CPU-only task (naive Python RMSNorm → NumPy):
pip install -r requirements-eval.txt
python3 eval.py # Uses Claude API
python3 eval.py --dry-run # Skip API call, grade a trivial solution
python3 eval.py --model claude-opus-4-6 --max-turns 12

python3 prompts/kernel_prompt.py --list # List all kernel task IDs
python3 prompts/kernel_prompt.py --task-id llama-3-1-8b-instruct__rms_norm # Print a single prompt
python3 prompts/kernel_prompt.py --all > prompts.jsonl # Dump all as JSONL
python3 prompts/kernel_prompt.py --target gfx942 --list # Target a specific GPU

python3 graders/kernel_grader.py # Grade all kernel tasks in output/
python3 graders/model_grader.py # Grade model-level tasks

# Export trajectory data as RL training tasks
python3 workload_optimizer.py export-rl -r ./results --export-output-dir ./datasets
# Include SFT warm-start pairs
python3 workload_optimizer.py export-rl -r ./results --export-output-dir ./datasets --sft

pytest tests/ -v
pytest tests/test_prompts.py -v # Prompt tests only
pytest tests/test_graders.py -v # Grader tests only

--agent-backend claude # Use Claude Code (default)
--agent-backend codex # Use OpenAI Codex
--agent-backend cursor # Use Cursor Agent

The pipeline runs E2E benchmarks inside Docker containers. Override the vLLM image:
--docker-image vllm/vllm-openai-rocm:v0.19.0

Or set the environment variable:
export APEX_VLLM_ROCM_IMAGE=vllm/vllm-openai-rocm:v0.19.0

Cache E2E baseline results to skip the ~30-minute benchmark on repeat runs:
--benchmark-cache-hours 4

When using the run subcommand (full pipeline), optimize up to N kernels simultaneously (agent reasoning is API-bound; GPU grading is serialized):
python3 workload_optimizer.py run ... --parallel-kernels 2

Note: the standalone optimize subcommand processes kernels sequentially regardless of this flag.
Assign different models based on kernel difficulty:
--agent-model-simple claude-sonnet-4-20250514 \
--agent-model-complex claude-opus-4-6

AST-based detection penalizes solutions that fake benchmark results (sys.exit(), hardcoded PASS, fabricated timings). Configure the speedup cap:
--tampering-speedup-cap 1.0

score = compiled × 20 + correct × 100 + speedup_score(S)
Where S = baseline_time / optimized_time. Only compiled + correct solutions earn the speedup component.
- Compiled (+20 pts): solution imports and defines the expected function
- Correct (+100 pts): passes all unit tests against the baseline
- Speedup (piecewise):
  - S ≥ 1.0: `100 + (S − 1) × 200` pts (e.g. 1.2× → 140, 2× → 300, 3× → 500)
  - S < 1.0: `max(0, 100 × S − 50)` pts (regression penalty)
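The formula translates to a few lines of Python. This is a sketch of the scoring logic described above, not the exact graders/score.py implementation:

```python
def speedup_score(s: float) -> float:
    """Piecewise speedup component, where s = baseline_time / optimized_time."""
    if s >= 1.0:
        return 100 + (s - 1) * 200   # e.g. 1.2x -> 140, 2x -> 300, 3x -> 500
    return max(0.0, 100 * s - 50)    # regression penalty below 1.0x

def kernel_score(compiled: bool, correct: bool, s: float) -> float:
    score = 20 * compiled + 100 * correct
    if compiled and correct:         # only passing solutions earn the speedup component
        score += speedup_score(s)
    return score
```

A compiled, correct solution at parity (1.0×) scores 220, and one at 2× scores 420 under this formula.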
reward = 0.7 × normalized_kernel_score + 0.3 × (optimized_tps / baseline_tps − 1)
Kernel score is normalized to [0, 1] against a reference of 420 pts (compile + correct + 2× speedup). Model-level grading requires a full AMD GPU environment.
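A sketch of the reward under the same assumptions; clamping the normalized score to [0, 1] is an assumption about how the normalization is applied:

```python
REFERENCE_SCORE = 420.0  # compile (20) + correct (100) + 2x speedup (300)

def model_reward(kernel_score: float, optimized_tps: float, baseline_tps: float) -> float:
    # Normalize the kernel score against the 420-pt reference, clamped to [0, 1]
    normalized = max(0.0, min(kernel_score / REFERENCE_SCORE, 1.0))
    # Blend with the relative end-to-end throughput gain
    return 0.7 * normalized + 0.3 * (optimized_tps / baseline_tps - 1)
```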
Default target: AMD Instinct MI355X (gfx950 / CDNA4).
| --target | Hardware |
|---|---|
| gfx950 | AMD Instinct MI355X (CDNA4) — default |
| gfx942 | AMD Instinct MI300X (CDNA3) |
| gfx940 | AMD Instinct MI300A (CDNA3) |
| gfx90a | AMD Instinct MI250X (CDNA2) |
The GPU is auto-detected via rocm-smi if available; otherwise the pipeline falls back to gfx950.
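The detection amounts to something like the following. This is illustrative only: the product-name mapping and the parsing of `rocm-smi --showproductname` output are assumptions, not the pipeline's actual code.

```python
import shutil
import subprocess

# Hypothetical product-name -> gfx target mapping (not exhaustive)
PRODUCT_TO_ARCH = {
    "MI355X": "gfx950",
    "MI300X": "gfx942",
    "MI300A": "gfx940",
    "MI250X": "gfx90a",
}

def detect_target(default: str = "gfx950") -> str:
    """Return a gfx target, falling back to the default when rocm-smi is absent."""
    if shutil.which("rocm-smi") is None:
        return default
    try:
        out = subprocess.run(["rocm-smi", "--showproductname"],
                             capture_output=True, text=True, timeout=10).stdout
    except Exception:
        return default
    for product, arch in PRODUCT_TO_ARCH.items():
        if product in out:
            return arch
    return default
```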
When the pipeline integrates optimized kernels for the final E2E benchmark, it hot-patches installed Python modules in site-packages. All patches are restored after benchmarking.
Supported (hot-patching):
- aiter, vllm, sglang — Python/Triton `.py` kernels can be replaced in site-packages. Triton JIT re-compiles automatically on next invocation.
- aiter HIP — standalone `.so` files can be recompiled with `hipcc` and swapped.
Not supported (requires source rebuild):
- System C/C++ libraries — hipBLASLt, rocBLAS, composable_kernel (CK), MIOpen, rccl are system-level shared libraries that cannot be individually hot-patched.
- Monolithic `_C.so` — vLLM, sglang, and PyTorch HIP kernels compile into a single binary and cannot be individually replaced.
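For the supported `.py` case, hot-patching reduces to a backup-and-swap of the installed module file. A simplified sketch (the real pipeline tracks and restores patches per kernel; `hot_patch` is a hypothetical helper, not the actual API):

```python
import importlib.util
import shutil
from pathlib import Path

def hot_patch(module_name: str, solution: Path) -> Path:
    """Swap an installed .py kernel module for an optimized solution.

    Returns the backup path so the caller can restore the original
    (e.g. shutil.move(backup, target)) after the final benchmark.
    """
    target = Path(importlib.util.find_spec(module_name).origin)
    backup = target.with_suffix(".py.bak")
    shutil.copy2(target, backup)     # keep the original for rollback
    shutil.copy2(solution, target)   # install the optimized kernel
    return backup
```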
Apex ships with 5 MCP servers that give agents access to domain-specific tools:
| MCP | Key tools | Purpose |
|---|---|---|
| source-finder | find_kernel_source, classify_kernel | Search kernel implementations across ROCm repos |
| kernel-rag | search_kernel_optimization, get_optimization_playbook | Optimization patterns, snippets, domain analysis |
| gpu-info | get_gpu_info, get_arch_optimization_hints | MI355X / CDNA4 specs and optimization hints |
| fusion-advisor | detect_fusion_opportunities, generate_fused_kernel | Kernel fusion detection and code generation |
| magpie | analyze, compare, benchmark | Kernel correctness/performance evaluation |
MCP servers are auto-configured for:
- Claude Code — registered via `claude mcp add` during setup
- Codex — registered via `codex mcp add` during setup
- Cursor IDE — auto-discovered from `.mcp.json` in the project root (no manual registration needed)
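A generated `.mcp.json` looks roughly like this. The fragment is illustrative only: the server names shown are from the table above, but the script paths, filenames, and interpreter location are assumptions that in practice come from `setup.sh` and your venv:

```json
{
  "mcpServers": {
    "gpu-info": {
      "command": ".venv/bin/python",
      "args": ["tools/mcps/gpu_info_server.py"]
    },
    "source-finder": {
      "command": ".venv/bin/python",
      "args": ["tools/mcps/source_finder_server.py"]
    }
  }
}
```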
Apex uses a two-layer prompt system when agents optimize kernels:
System prompt (SYSTEM_PROMPT in workload_optimizer.py) defines the agent's role and constraints:
- GPU kernel engineer persona with AMD ROCm specialization
- MCP tool inventory and when to use each tool
- Skill paths (13 `SKILL.md` files the agent can read for domain knowledge)
- Mandatory compare-before-submit workflow via Magpie
- Speedup calibration guidance and anti-tampering rules
Task prompt (KERNEL_PROMPT_TEMPLATE in prompts/kernel_prompt.py) is built per-kernel with:
| Section | Content |
|---|---|
| Target hardware | GPU arch, wavefront size, MFMA units, LDS, HBM bandwidth |
| Task definition | Kernel type, model architecture, framework (vLLM/SGLang) |
| Source locations | Paths to baseline implementations in tools/rocm/ (aiter, CK, etc.) |
| MCP tools table | Available tools for source search, RAG, GPU info, fusion, Magpie |
| Skills table | 13 domain-specific optimization skills the agent can read |
| Instructions | Step-by-step: locate baseline, analyze bottlenecks, write solution.py |
| Optimization hints | Architecture-specific tips (e.g. MFMA usage, LDS tiling for CDNA4) |
When running via the full pipeline, the prompt is further enriched with:
- Baseline source code inlined as markdown
- Profiling data (GPU time %, bound type, bandwidth/compute utilization)
- Knowledge base insights from prior optimization runs
- Correctness reference (PyTorch reference code or library test commands)
Preview any task prompt:
python3 prompts/kernel_prompt.py --task-id llama-3-1-8b-instruct__rms_norm

These kernels have explicit library test commands in the ground truth registry, enabling --correctness-mode library_test for validation against aiter's own pytest suite:
| Kernel | Test command | Type |
|---|---|---|
| silu_mul | pytest aiter/op_tests/triton_tests/test_activation.py | memory-bound |
| gemm_bf16 | pytest aiter/op_tests/triton_tests/gemm/basic/test_gemm_a16w16.py | compute-bound |
| gemm_w8a8 | pytest aiter/op_tests/triton_tests/gemm/basic/test_gemm_a8w8.py | compute-bound |
| act_quant_fp8 | pytest aiter/op_tests/triton_tests/quant/test_quant.py | memory-bound |
| kv_cache_ops | pytest aiter/op_tests/triton_tests/fusions/test_fused_kv_cache.py | memory-bound |
| all_reduce | pytest aiter/op_tests/multigpu_tests/test_quick_all_reduce.py | comms |
21 models covering a range of architectures:
| Family | Models | Attention | MLP |
|---|---|---|---|
| Llama 3 | 1B, 8B, 70B (×2) | GQA | Dense |
| Mistral / Mixtral | 7B, 8×7B, 8×22B | GQA | Dense / MoE |
| Qwen 2.5 | 7B, 32B, 72B, Coder-32B | GQA | Dense |
| Gemma 2 | 9B, 27B | GQA | Dense |
| DeepSeek | R1 (671B), V3 (671B), R1-Distill-70B | MLA / GQA | MoE / Dense |
| Kimi | K2-Thinking | MLA | MoE |
| GPT OSS | 120B | GQA | MoE |
| Phi | 3.5-mini, phi-4 | GQA | Dense |
| Falcon | 7B | MQA | Dense |
12 kernel types are defined, applicable to models based on their architecture:
| Kernel | Framework | Notes |
|---|---|---|
| flash_attn_prefill | Triton | Flash attention for prompt (prefill) phase |
| paged_attn_decode | Triton | Paged attention for autoregressive decoding |
| mla_attn | Triton | Multi-Head Latent Attention (DeepSeek MLA) |
| fused_moe | Triton | Fused MoE gate + routing + expert GEMM |
| gemm_w8a8 | HIP | FP8 × FP8 GEMM for quantized linear layers |
| gemm_bf16 | HIP | BF16 GEMM for QKV/up/gate/down projections |
| rms_norm | Triton | Pre/post-attention and MLP normalization |
| rope_embedding | Triton | Rotary position embedding (Q and K) |
| kv_cache_ops | Triton | KV cache reshape, copy, and quantization |
| all_reduce | HIP | Tensor-parallel all-reduce (RCCL + fused kernels) |
| act_quant_fp8 | Triton | Dynamic per-token FP8 activation quantization |
| silu_mul | Triton | Fused SiLU × gate (SwiGLU) for MLP |
| Kernel | Speedup | Score | Settings | Notes |
|---|---|---|---|---|
| all_reduce | 36.35x | 7290 | 3 iter / 25 turns | HIP, library_test; multi-GPU |
| rms_norm | 1.05x | 674 | 1 iter / 10 turns | Triton, pytorch mode |
| fused_moe | 1.14x | 248 | 1 iter / 10 turns | Triton, pytorch mode |
| gemm_bf16 | 1.00x | 220 | 3 iter / 25 turns | Triton, library_test |
| silu_mul | 1.00x | 220 | 1 iter / 10 turns | Triton, library_test |
| act_quant_fp8 | 1.00x | 220 | 1 iter / 10 turns | Triton, library_test |
| kv_cache_ops | 1.00x | 220 | 1 iter / 10 turns | Triton, library_test |
7 kernels validated with correct optimizations. Top performers: all_reduce (36.35x), fused_moe (1.14x), rms_norm (1.05x).
# Install manually with the correct ROCm version
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/rocm7.2
# For older ROCm:
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2

# Verify Node.js 18+
node --version
# Claude Code
npm install -g @anthropic-ai/claude-code && claude login
# Codex
npm install -g @openai/codex && codex login
# Cursor Agent
npm install -g cursor-agent && cursor-agent login

Cursor auto-discovers MCP servers from .mcp.json. If MCPs aren't loading:
- Verify `.mcp.json` exists in the Apex root (generated by `setup.sh`)
- Restart Cursor or reload the window
- Check that the Python path in `.mcp.json` matches your venv
The pipeline auto-detects GPUs via rocm-smi. For CPU-only development:
# Use the mini eval (no GPU needed)
python3 eval.py --dry-run
# Use --skip-benchmark with pre-recorded profiling data
python3 workload_optimizer.py run -r ./results --skip-benchmark report.json ...