Which benchmark should score your kernel agent, and how does each one actually measure? A curated, evidence-graded catalog of every benchmark relevant to LLM/agent GPU-kernel generation. Chinese version: README.zh.md
30-second orientation. This list contains two different kinds of things, and keeping them apart is the whole game:
- Agent benchmarks: harnesses designed to score LLM kernel generation (KernelBench and 27 related harnesses). These produce the headline numbers in papers.
- Task sources (a.k.a. substrates): the classic HPC/GPU suites that agents are evaluated on or asked to optimize (PolyBench, NPB, Rodinia, …). An agent paper saying "we optimize XSBench kernels" is using a task source, not a benchmark; its numbers are only comparable to another paper that shares both the task source and the measurement method.
Sibling list: awesome-kernel-agent (the agents themselves). Single source of truth: data/benchmarks.yaml + data/scorecard.yaml; every table below is generated — do not hand-edit.
151 benchmarks · 28 purpose-built agent benchmarks · 123 substrate / dataset / tooling entries across 13 families.
| Layer | Count |
|---|---|
| agent-benchmark | 28 |
| substrate-suite | 110 |
| dataset | 3 |
| tooling | 10 |
| Top hardware targets | Entries |
|---|---|
| NVIDIA | 122 |
| CPU | 67 |
| AMD | 45 |
| Intel-GPU | 13 |
| FPGA | 7 |
| Ascend-NPU | 6 |
| TPU | 3 |
| Cambricon | 2 |
- ① What should I evaluate my agent on?
- ② How does each benchmark actually measure? — the scorecard
- ③ Where do I get kernels/tasks to optimize?
- ④ Who can I compare against?
- Appendix A — motif coverage matrix · Appendix B — how this list is built · Glossary
Pick by what you need to prove:
- "My numbers are comparable to prior work" → KernelBench — the de-facto standard everyone reports. Know its limits (see scorecard) and never use it as sole evidence.
- "My speedups are not reward-hacking artifacts" → robust-kbench — anti-gaming filters, forward+backward, 1e-5 tolerances.
- "My kernels are correct on real production shapes" → BackendBench — PyTorch OpInfo edge cases + TorchBench production traces; ships kernels as a pip backend.
- "My kernels are fast in an absolute sense, not just vs eager" → SOL-ExecBench / CUDABench — ceiling-relative scores (caveat: both use datasheet peaks, which are 8–15% loose vs measured ceilings).
- "My agent's wins survive a real serving stack" → ISO-Bench (vLLM/SGLang merged-PR tasks) / FlashInfer-Bench.
- "My results are honest about compute budget" → GEAK — the only one decomposing sequential@k vs parallel@k.
- Triton specifically → TritonBench; ROCm/AMD → GEAK, AgentKernelArena, NPUEval; Ascend NPU → MultiKernelBench, AscendKernelBench, NPUKernelBench; CUDA correctness with hidden tests → ComputeEval.
The honest answer for a serious paper is a portfolio: one comparability anchor (KernelBench) + one hardened oracle (robust-kbench or BackendBench) + one absolute metric (SOL-ExecBench-style) + explicit budget reporting. No single existing benchmark covers all four — that gap is visible in the scorecard below.
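The datasheet-ceiling caveat can be made concrete with a small roofline calculation. This is a minimal sketch, not the scoring code of SOL-ExecBench or CUDABench: the function names, the device numbers, and the 10% gap (picked from inside the 8–15% range quoted above) are all illustrative assumptions.

```python
def attainable_gflops(peak_gflops, peak_gbps, intensity_flop_per_byte):
    """Classic roofline: the lower of the compute roof and the bandwidth roof."""
    return min(peak_gflops, peak_gbps * intensity_flop_per_byte)


def ceiling_relative_score(achieved_gflops, peak_gflops, peak_gbps, intensity):
    """Fraction of the attainable ceiling that a kernel reaches."""
    return achieved_gflops / attainable_gflops(peak_gflops, peak_gbps, intensity)


# Hypothetical device: every number here is made up for illustration.
DATASHEET = {"peak_gflops": 100_000.0, "peak_gbps": 2_000.0}
# Assumption: the measured ceilings sit 10% below the datasheet values.
MEASURED = {name: value * 0.9 for name, value in DATASHEET.items()}

achieved = 9_000.0  # GFLOP/s of some kernel, as a profiler might report it
intensity = 5.0     # FLOP per byte moved, so this kernel is bandwidth-bound

score_datasheet = ceiling_relative_score(achieved, intensity=intensity, **DATASHEET)
score_measured = ceiling_relative_score(achieved, intensity=intensity, **MEASURED)

print(f"vs datasheet ceiling: {score_datasheet:.2f}")
print(f"vs measured ceiling:  {score_measured:.2f}")
```

Under these assumptions the same kernel scores lower against the datasheet ceiling than against the measured one, so a datasheet-relative score is best read as a lower bound on how close the kernel is to what the hardware can actually deliver.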
Two benchmarks can give the same agent wildly different numbers, because the score is a product of five design choices: the task set, the correctness oracle, the timing methodology, the baseline, and the budget accounting. This scorecard grades each benchmark on the choices that separate trustworthy numbers from inflated ones — graded only from primary evidence (harness source code / paper), never from README claims.
| Benchmark | Year | Hardware | Oracle | Timing | Baseline | Budget | Pick this if… |
|---|---|---|---|---|---|---|---|
| BackendBench | 2025 | NVIDIA | ● | ○ | PyTorch op semantics (correctness-first; kernels run as a pip backend) | — | You want production-shape correctness and a path to actually shipping kernels as a backend. |
| ComputeEval | 2025 | NVIDIA | ● | ○ | n/a (correctness-first; optional speedup vs reference) | ◐ | You want CUDA correctness with anti-overfitting (hidden tests) from NVIDIA's own harness. |
| GEAK Benchmarks | 2025 | AMD | ◐ | ○ | expert reference kernels (TritonBench-revised + ROCm set) | ● | You target ROCm/Triton and want budget-decomposed (sequential vs parallel) reporting. |
| KernelBench | 2025 | NVIDIA | ○ | ○ | PyTorch eager (torch.compile default-mode as secondary) | — | You need comparability with the de-facto standard everyone reports — never as sole evidence. |
| KernelBench-v3 | 2025 | NVIDIA | ◐ | ◐ | PyTorch eager (reference.py) | — | You want a community-hardened KernelBench variant reported across several GPUs. |
| MultiKernelBench | 2025 | NVIDIA · Ascend-NPU · TPU | ○ | ○ | each platform's own PyTorch backend | ◐ | You need multi-platform coverage (CUDA / Ascend C / TPU Pallas) and accept per-platform-only numbers. |
| TritonBench | 2025 | NVIDIA · AMD | ◐ | ◐ | PyTorch/reference impl; descriptive GPU-efficiency vs A100 theoretical peak | — | You generate Triton specifically and want real-world (GitHub-sourced) operators. |
| robust-kbench | 2025 | NVIDIA | ● | ◐ | PyTorch eager (KernelBench-style) | — | You want to prove your speedups survive anti-gaming filters (the KernelBench exploit fixes). |
| CUDABench | 2026 | NVIDIA | ? | ◐ | attainable-GFLOPs roofline ceiling (datasheet peak) | ? | You want a roofline-relative Performance-Score with profiler-measured intensity (datasheet-ceiling caveat). |
| ISO-Bench | 2026 | NVIDIA | ◐ | ◐ | unoptimized baseline + human-merged-PR solution (dual reference) | ◐ | You want real serving-stack tasks (vLLM/SGLang merged PRs) with did-it-fix-the-bottleneck attribution. |
| SOL-ExecBench | 2026 | NVIDIA | ◐ | ● | agent-optimized PyTorch baseline + analytical speed-of-light ceiling (datasheet peak) | — | You want an absolute (ceiling-relative) score on production LLM kernels — datasheet-ceiling caveat applies. |
Grades (criteria in data/scorecard.yaml / SCHEMA.md): ● strong · ◐ partial · ○ weak · — none · ? unverified. Grades are assigned ONLY from primary evidence (harness source / paper) — no benchmark is graded from its README claims.
Not yet graded (17): C2HLSC, ParEval, AscendKernelBench (AKG-AGENT), CANN Bench, FlashInfer-Bench, HLS-Eval, NKIBench, NPUEval, QiMeng-TensorOp, QiMeng-Xpiler, TritonGym, AgentKernelArena, KernelBench-MUSA (MooreEval), KernelBenchX, KernelCraft, MSKernelBench, NPUKernelBench. PRs grading these against the criteria are the most valuable contribution this list can receive.
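The portfolio argument can be checked mechanically against the scorecard. The sketch below hard-codes the Oracle, Timing, and Budget grades transcribed from the table above rather than reading data/scorecard.yaml (whose schema is not shown here); the ordinal ranking of the grade symbols and the helper names are assumptions made for illustration.

```python
# Grades transcribed from the scorecard table: (oracle, timing, budget).
SCORECARD = {
    "BackendBench":     ("●", "○", "—"),
    "ComputeEval":      ("●", "○", "◐"),
    "GEAK Benchmarks":  ("◐", "○", "●"),
    "KernelBench":      ("○", "○", "—"),
    "KernelBench-v3":   ("◐", "◐", "—"),
    "MultiKernelBench": ("○", "○", "◐"),
    "TritonBench":      ("◐", "◐", "—"),
    "robust-kbench":    ("●", "◐", "—"),
    "CUDABench":        ("?", "◐", "?"),
    "ISO-Bench":        ("◐", "◐", "◐"),
    "SOL-ExecBench":    ("◐", "●", "—"),
}

AXES = ("oracle", "timing", "budget")
# Assumed ordering: strong > partial > weak > none. "?" (unverified) has no rank.
RANK = {"●": 3, "◐": 2, "○": 1, "—": 0}


def meets(name, axis, minimum):
    """True if the benchmark's grade on `axis` is at least `minimum`."""
    grade = SCORECARD[name][AXES.index(axis)]
    return RANK.get(grade, -1) >= RANK[minimum]  # unverified never qualifies


def pick(axis, minimum="●"):
    """Benchmarks whose grade on `axis` reaches `minimum`, sorted by name."""
    return sorted(name for name in SCORECARD if meets(name, axis, minimum))


strong_oracle = pick("oracle")
strong_timing = pick("timing")
strong_budget = pick("budget")
strong_everywhere = set(strong_oracle) & set(strong_timing) & set(strong_budget)

print("strong oracle:", strong_oracle)
print("strong timing:", strong_timing)
print("strong budget:", strong_budget)
print("strong on all three:", sorted(strong_everywhere))
```

With the grades as transcribed, the three "strong" sets do not intersect, which is the gap the portfolio recommendation is meant to cover.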
Full inventory of all purpose-built agent benchmarks (facets: abstraction, motifs, hardware, verification)
| Benchmark | Year | Abstraction | Motifs | Hardware | Verify | Runs on (substrate) |
|---|---|---|---|---|---|---|
| C2HLSC | 2024 | kernel | mixed | FPGA | ✅ | — |
| ParEval | 2024 | kernel | dense-LA · sparse-LA · structured-grid · graph-traversal · reduction-scan · n-body · mixed | NVIDIA · AMD · CPU | ✅ | — |
| AscendKernelBench (AKG-AGENT) | 2025 | operator | dense-LA · elementwise · reduction-scan · mixed | Ascend-NPU · NVIDIA · CPU | ✅ | — |
| BackendBench | 2025 | operator | dense-LA · elementwise · reduction-scan · mixed | NVIDIA | ✅ | — |
| CANN Bench | 2025 | operator | mixed | Ascend-NPU | ✅ | — |
| ComputeEval | 2025 | kernel | dense-LA · reduction-scan · elementwise · mixed | NVIDIA | ✅ | — |
| FlashInfer-Bench | 2025 | operator | attention · dense-LA · mixed | NVIDIA | ✅ | — |
| GEAK Benchmarks | 2025 | kernel | dense-LA · attention · reduction-scan · mixed | AMD | ✅ | — |
| HLS-Eval | 2025 | kernel | mixed | FPGA | ✅ | — |
| KernelBench | 2025 | operator | dense-LA · structured-grid · reduction-scan · elementwise · attention · mixed | NVIDIA | ✅ | — |
| KernelBench-v3 | 2025 | operator | dense-LA · attention · mixed | NVIDIA | ✅ | — |
| MultiKernelBench | 2025 | operator | dense-LA · attention · reduction-scan · elementwise · mixed | NVIDIA · Ascend-NPU · TPU | ✅ | — |
| NKIBench | 2025 | kernel | dense-LA · attention · mixed | Trainium | ✅ | — |
| NPUEval | 2025 | kernel | dense-LA · elementwise · reduction-scan · mixed | AMD | ✅ | — |
| QiMeng-TensorOp | 2025 | operator | dense-LA | RISC-V · ARM · NVIDIA | ◐ | — |
| QiMeng-Xpiler | 2025 | operator | dense-LA · mixed | Cambricon · NVIDIA · AMD · CPU | ✅ | — |
| TritonBench | 2025 | kernel | dense-LA · attention · reduction-scan · elementwise · mixed | NVIDIA · AMD | ✅ | — |
| TritonGym | 2025 | kernel | mixed | NVIDIA | ✅ | — |
| robust-kbench | 2025 | operator | dense-LA · attention · reduction-scan · mixed | NVIDIA | ✅ | — |
| AgentKernelArena | 2026 | kernel | mixed | AMD | ✅ | — |
| CUDABench | 2026 | operator | dense-LA · attention · reduction-scan · elementwise · mixed | NVIDIA | ✅ | — |
| ISO-Bench | 2026 | operator | mixed | NVIDIA | ✅ | — |
| KernelBench-MUSA (MooreEval) | 2026 | operator | dense-LA · attention · mixed | MooreThreads · NVIDIA | ✅ | — |
| KernelBenchX | 2026 | operator | dense-LA · attention · reduction-scan · elementwise · mixed | NVIDIA | ✅ | — |
| KernelCraft | 2026 | kernel | mixed | NVIDIA · AMD | ✅ | — |
| MSKernelBench | 2026 | operator | dense-LA · sparse-LA · attention · mixed | NVIDIA | ✅ | — |
| NPUKernelBench | 2026 | kernel | mixed | Ascend-NPU | ✅ | — |
| SOL-ExecBench | 2026 | operator | dense-LA · attention · mixed | NVIDIA | ✅ | — |
The classic suites below are task sources: curated kernels with reference implementations that agents optimize, translate, or get fine-tuned on. Groups are ordered by how often kernel-agent work reaches for them; the Verify column (✅ built-in correctness oracle) is what makes a suite resistant to wrong-but-fast kernels, and Ships (✅ re-runnable reference kernels) is what makes results auditable.
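Why the Verify column matters can be shown with a toy oracle. This is a minimal pure-Python sketch, not the harness of any listed suite: the softmax task, the cheating candidate, the trial count, and the tolerances are all illustrative assumptions.

```python
import math
import random


def reference_softmax(xs):
    """Numerically stable reference implementation."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cheating_softmax(xs):
    """'Fast' because it ignores its input and returns a uniform distribution."""
    return [1.0 / len(xs)] * len(xs)


def same_values(got, want, rel_tol=1e-5, abs_tol=1e-8):
    """Element-wise comparison within a tolerance (values are illustrative)."""
    return len(got) == len(want) and all(
        math.isclose(g, w, rel_tol=rel_tol, abs_tol=abs_tol)
        for g, w in zip(got, want)
    )


def weak_oracle(candidate, size=64):
    """Compares against the reference on a single all-zeros input."""
    xs = [0.0] * size
    return same_values(candidate(xs), reference_softmax(xs))


def strict_oracle(candidate, trials=5, size=64, seed=0):
    """Compares against the reference on several randomized inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.uniform(-4.0, 4.0) for _ in range(size)]
        if not same_values(candidate(xs), reference_softmax(xs)):
            return False
    return True


print("weak oracle accepts cheater:  ", weak_oracle(cheating_softmax))
print("strict oracle accepts cheater:", strict_oracle(cheating_softmax))
```

The cheating candidate passes the degenerate single-input check and fails the randomized one, which is the difference between a suite that merely runs a kernel and one that resists wrong-but-fast submissions.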
The kernels your agent must beat: cuDNN/cuBLAS, FlashAttention, Triton tutorials, Liger. Use as reference implementations and speedup denominators.
| Suite | Year · Org | Motifs | Languages / model | Verify | Ships | Used by (agents) |
|---|---|---|---|---|---|---|
| cuDNN / cuBLAS | 2014 · NVIDIA | dense-LA · attention · elementwise · mixed | CUDA | ◐ | ✅ | AI CUDA Engineer, KernelBench, TritonBench |
| DeepBench | 2016 · Baidu Research | dense-LA · elementwise | CUDA · HIP · ARM | ◐ | ◐ | — |
| DNNMark | 2017 · Northeastern (NUCAR) | dense-LA · elementwise | CUDA · HIP | ◐ | ◐ | — |
| PyTorch operator_benchmark | 2019 · Meta / PyTorch | dense-LA · elementwise · reduction-scan · mixed | Python · CUDA | ◐ | ◐ | — |
| NVBench | 2021 · NVIDIA | mixed | CUDA/C++ | — | — | — |
| Triton tutorial kernels | 2021 · OpenAI / triton-lang | dense-LA · attention · reduction-scan · elementwise | Triton | ✅ | ✅ | TritonBench, GEAK, KernelLLM, AutoTriton |
| xFormers benchmarks | 2021 · Meta AI (FAIR) | attention · dense-LA | CUDA · CUTLASS · Triton | ✅ | ✅ | — |
| FlashAttention benchmarks | 2022 · Dao-AILab | attention · reduction-scan | CUDA · CuTeDSL | ✅ | ✅ | — |
| Liger-Kernel benchmarks | 2024 · LinkedIn | reduction-scan · elementwise · attention | Triton | ✅ | ✅ | TritonBench |
Machine-generated baselines (TVM/Ansor, Hidet, CUTLASS profiler). Use when you want your agent compared against autotuned — not just eager — code.
| Suite | Year · Org | Motifs | Languages / model | Verify | Ships | Used by (agents) |
|---|---|---|---|---|---|---|
| CUTLASS Profiler | 2017 · NVIDIA | dense-LA | CUDA C++ · CuTe · Python DSL | ✅ | ✅ | KernelAgent, MANTIS, CUDABench |
| TVM / Ansor | 2018 · Apache TVM / OctoML | dense-LA · mixed | TVM TE/TIR | ✅ | ✅ | — |
| IREE benchmarks | 2020 · Google / OpenXLA | dense-LA · mixed | MLIR (Linalg) | ✅ | ◐ | — |
| MLIR microkernels | 2020 · LLVM / MLIR community | dense-LA · elementwise | MLIR | ◐ | ◐ | — |
| AKG (Auto Kernel Generator) | 2021 · Huawei / MindSpore | dense-LA · elementwise · reduction-scan · mixed | polyhedral DSL · Triton-Ascend | ✅ | ✅ | AKG kernel Agent |
| TenSet | 2021 · UC Berkeley / OctoML (Zheng) | dense-LA | TVM schedules | — | ◐ | — |
| Roller | 2022 · Microsoft Research Asia | dense-LA | TVM TE · CUDA | ✅ | ✅ | — |
| Hidet | 2023 · U. Toronto / CentML | dense-LA · attention | Python DSL · CUDA | ✅ | ✅ | — |
| Welder | 2023 · MSR Asia / Peking U. | dense-LA · mixed | TVM · CUTLASS · CUDA | ✅ | ✅ | — |
Where kernel wins become end-to-end wins: TTFT/TPOT/throughput harnesses (vLLM, SGLang). Use to show a kernel matters at the serving level.
| Suite | Year · Org | Motifs | Languages / model | Verify | Ships | Used by (agents) |
|---|---|---|---|---|---|---|
| LLMPerf | 2023 · Ray project (Anyscale) | attention | Python | ◐ | — | — |
| vLLM benchmarks | 2023 · vLLM project (UC Berkeley → community) | attention · dense-LA | Python | ◐ | — | FlashInfer-Bench |
| FlexAttention / attention-gym | 2024 · Meta / PyTorch | attention | Python · Triton | ✅ | ✅ | — |
| GenAI-Perf | 2024 · NVIDIA | attention | Python | — | — | — |
| SGLang benchmarks | 2024 · SGLang project | attention · dense-LA | Python | ◐ | — | Astra |
| AIPerf | 2025 · NVIDIA | attention | Python | — | — | — |
Model-level suites (MLPerf, TorchBench, TorchInductor). Use to bound end-to-end impact of kernel-level changes.
| Suite | Year · Org | Motifs | Languages / model | Verify | Ships | Used by (agents) |
|---|---|---|---|---|---|---|
| Fathom | 2016 · Harvard (VLSI-Arch) | dense-LA · attention · mixed | Python · TensorFlow | — | — | — |
| DAWNBench | 2017 · Stanford DAWN | dense-LA · mixed | Python | ✅ | — | — |
| AI-Benchmark | 2018 · ETH Zurich | dense-LA · mixed | Python · TensorFlow | ◐ | — | — |
| AIBench | 2018 · ICT CAS / BenchCouncil | dense-LA · attention · mixed | Python · C++ | ◐ | — | — |
| MLPerf Training | 2018 · MLCommons | dense-LA · attention · mixed | Python · PyTorch · TensorFlow | ✅ | — | — |
| MLPerf Inference | 2019 · MLCommons | dense-LA · attention · mixed | Python · C++ | ✅ | — | — |
| Megatron-LM benchmarks | 2019 · NVIDIA | dense-LA · attention · mixed | Python · CUDA | ◐ | ◐ | — |
| MLPerf Tiny | 2021 · MLCommons | dense-LA · elementwise · mixed | C · C++ | ✅ | ◐ | — |
| TorchBench | 2021 · Meta / PyTorch | dense-LA · attention · mixed | Python · CUDA | ✅ | — | — |
| NVIDIA Transformer Engine benchmarks | 2022 · NVIDIA | dense-LA · attention | Python · C++ · CUDA | ✅ | ✅ | — |
| TorchInductor / Dynamo suite | 2022 · Meta / PyTorch | dense-LA · attention · elementwise · mixed | Python · Triton | ✅ | ◐ | KernelBench-style agents (baseline) |
SpMV/SpMM/SDDMM kernels and the matrix collections (SuiteSparse, DLMC) they run on. Irregular memory patterns — a hard, under-benchmarked motif.
| Suite | Year · Org | Motifs | Languages / model | Verify | Ships | Used by (agents) |
|---|---|---|---|---|---|---|
| SuiteSparse Matrix Collection | 2011 · Texas A&M (Davis); orig. U. Florida | sparse-LA | Matrix Market | — | — | Auto-SpMV |
| cuSPARSE | 2014 · NVIDIA | sparse-LA | CUDA | ◐ | ✅ | — |
| TACO benchmarks | 2017 · MIT CSAIL (Kjolstad) | sparse-LA | C++ | ✅ | ✅ | — |
| ASpT | 2019 · Academic (Hong et al.) | sparse-LA | CUDA · OpenMP | ◐ | ✅ | — |
| DLMC (Deep Learning Matrix Collection) | 2020 · Google Research / Stanford | sparse-LA | Matrix Market | — | — | Sputnik, Magicube, SparseTIR |
| GE-SpMM / dgSPARSE | 2020 · Tsinghua (Huang et al.) | sparse-LA · graph-traversal | CUDA | ◐ | ✅ | — |
| Sputnik | 2020 · Google Research / Stanford | sparse-LA | CUDA/C++ | ✅ | ✅ | — |
| cuSPARSELt | 2020 · NVIDIA | sparse-LA | CUDA | ◐ | ✅ | — |
| vectorSparse | 2021 · Academic (Chen et al.) | sparse-LA | CUDA | ◐ | ✅ | — |
| Magicube | 2022 · ETH Zurich (Li et al.) | sparse-LA | CUDA | ◐ | ✅ | — |
| SparseTIR | 2023 · U. Washington (SAMPL) | sparse-LA | TVM TensorIR · CUDA | ✅ | ✅ | — |
| VENOM / Spatha | 2023 · Universidad de Malaga (Castro et al.) | sparse-LA | CUDA | ◐ | ✅ | — |
Structured-grid stencils, FFT/spectral kernels, and their DSLs (Halide, Devito). Classic memory-bound optimization targets with clean verification.
| Suite | Year · Org | Motifs | Languages / model | Verify | Ships | Used by (agents) |
|---|---|---|---|---|---|---|
| benchFFT (FFTW) | 2003 · MIT (Frigo, Johnson) | spectral | C | ✅ | ✅ | — |
| Stencil Probe | 2007 · UC Berkeley (Kamil et al.) | structured-grid | C | ◐ | ✅ | — |
| Pochoir | 2011 · MIT (Tang et al.) | structured-grid | C++ | ✅ | ✅ | — |
| Halide benchmarks | 2012 · MIT CSAIL | structured-grid · dense-LA | Halide DSL · CUDA | ✅ | ✅ | — |
| ExaStencils / ExaSlang | 2014 · U. Erlangen / Passau | structured-grid | ExaSlang · CUDA · OpenMP · MPI | ✅ | ✅ | — |
| PolyMage | 2015 · IISc Bangalore (Mullapudi et al.) | structured-grid | Python DSL · C | ✅ | ✅ | — |
| Devito benchmarks | 2016 · Imperial College London | structured-grid | Python DSL · C | ✅ | ✅ | — |
| StencilGen (Artemis) | 2018 · Ohio State (Rawat et al.) | structured-grid | CUDA | ✅ | ✅ | — |
| AN5D | 2020 · U. Tokyo / RIKEN (Matsumura et al.) | structured-grid | CUDA | ✅ | ✅ | — |
| heFFTe | 2020 · ICL U. Tennessee | spectral | C++ · CUDA · HIP | ✅ | ✅ | — |
BFS/PageRank/connected-components with strong optimized baselines (Gunrock, GAPBS). Use to test agents beyond dense regular loops.
| Suite | Year · Org | Motifs | Languages / model | Verify | Ships | Used by (agents) |
|---|---|---|---|---|---|---|
| LonestarGPU / Lonestar | 2012 · UT Austin (ISS) | graph-traversal · unstructured-grid | CUDA · C++ | ✅ | ✅ | — |
| Pannotia | 2013 · AMD Research + U. Virginia | graph-traversal | OpenCL · HIP · CUDA | ✅ | ✅ | — |
| GAP Benchmark Suite | 2015 · UC Berkeley (Beamer) | graph-traversal · sparse-LA | C++ · OpenMP | ✅ | ✅ | — |
| GraphBIG | 2015 · Georgia Tech + IBM | graph-traversal | C++ · CUDA | ✅ | ✅ | — |
| Gunrock | 2016 · UC Davis (Owens) | graph-traversal | CUDA/C++ | ✅ | ✅ | — |
| GARDENIA | 2017 · NUDT (Chen) | graph-traversal · sparse-LA | CUDA · OpenCL · OpenMP | ✅ | ✅ | — |
| GBBS | 2020 · MIT / CMU (Dhulipala et al.) | graph-traversal | C++ | ✅ | ✅ | — |
| Indigo3 | 2024 · Texas State (Burtscher) | graph-traversal | C · OpenMP · CUDA · HIP | ✅ | ✅ | — |
Regular affine loop kernels (GEMM-like, stencils) with trivially checkable outputs — the easiest substrate to verify, and the most-used in agent papers.
| Suite | Year · Org | Motifs | Languages / model | Verify | Ships | Used by (agents) |
|---|---|---|---|---|---|---|
| PolyBench/C | 2010 · OSU / CSU (Pouchet, Yuki) | dense-LA · structured-grid · mixed | C | ✅ | ✅ | Performance-Aligned LLMs, ComPilot, ACCeLLiuM |
| PolyBench/Fortran | 2012 · OSU (Pouchet, Narayan) | dense-LA · structured-grid | Fortran | ✅ | ✅ | — |
| PolyBench/GPU | 2012 · U. Delaware (Grauer-Gray, Cavazos) | dense-LA · structured-grid | CUDA · OpenCL · HMPP · OpenACC | ✅ | ✅ | MIREncoder |
| PolyBench-ACC | 2013 · Cavazos Lab, U. Delaware | dense-LA · structured-grid · mixed | CUDA · OpenCL · OpenACC · OpenMP · HMPP | ✅ | ✅ | CUDAnalyst, MEP |
| PolyBench-RAJA | 2016 · U. Delaware (Killian) | dense-LA · structured-grid | C++ (RAJA) | ✅ | ✅ | — |
| PolyBench-NN | 2019 · IIT Hyderabad (IITH-Compilers) | dense-LA · structured-grid | C | ✅ | ✅ | — |
| PolyBench/Python | 2021 · UDC-GAC (U. da Coruna) | dense-LA · structured-grid | Python · NumPy | ✅ | ✅ | — |
DOE/NASA mini-apps (NPB, XSBench, LULESH): small, science-representative, almost all ship a built-in verification figure-of-merit.
| Suite | Year · Org | Motifs | Languages / model | Verify | Ships | Used by (agents) |
|---|---|---|---|---|---|---|
| NPB (NAS Parallel Benchmarks) | 1991 · NASA Advanced Supercomputing | dense-LA · sparse-LA · spectral · structured-grid · mixed | Fortran · C | ✅ | ✅ | CUDAnalyst, AutoParLLM |
| miniMD | 2008 · Sandia (Mantevo) | n-body | C++ · Kokkos · OpenMP | ✅ | ✅ | — |
| LULESH | 2010 · LLNL | unstructured-grid · n-body | C++ · CUDA · RAJA · OpenMP | ✅ | ✅ | — |
| miniFE | 2011 · Sandia (Mantevo) | sparse-LA · unstructured-grid | C++ · CUDA · Kokkos · OpenMP | ✅ | ✅ | LLMPerf-Opt |
| CoMD | 2012 · ExMatEx / ECP-CoPA | n-body | C · CUDA · OpenCL · OpenMP | ✅ | ✅ | — |
| PENNANT | 2012 · LANL | unstructured-grid | C++ · CUDA · MPI | ✅ | ✅ | — |
| Nekbone | 2013 · Argonne (Nek5000 team) | dense-LA · structured-grid | Fortran · CUDA-Fortran · OpenACC | ✅ | ✅ | — |
| HPGMG | 2014 · LBNL | structured-grid | C · CUDA · OpenMP | ✅ | ✅ | — |
| RSBench | 2014 · Argonne (ANL-CESAR) | monte-carlo · mixed | C · CUDA · OpenCL · SYCL · OpenMP-target | ✅ | ✅ | LLMPerf-Opt |
| XSBench | 2014 · Argonne (ANL-CESAR) | monte-carlo · mixed | C · CUDA · OpenCL · SYCL · OpenMP-target | ✅ | ✅ | CUDAnalyst, LLMPerf-Opt, LASSI, OMPar, ParEval-Repo |
| miniAMR | 2014 · Sandia (Mantevo) | structured-grid | C · MPI · OpenMP | ✅ | ◐ | — |
| BabelStream | 2015 · U. Bristol (UoB-HPC) | structured-grid · data-movement | CUDA · HIP · SYCL · OpenCL · OpenMP · Kokkos · RAJA · TBB · Thrust | ✅ | ✅ | — |
| SimpleMOC | 2015 · Argonne (CESAR) + MIT | monte-carlo · dense-LA | C · CUDA · OpenCL · OpenMP-target | ✅ | ◐ | ParEval-Repo |
| Quicksilver | 2016 · LLNL | monte-carlo | C++ · CUDA · OpenMP | ✅ | ✅ | — |
| SU3_Bench | 2019 · LBNL / NERSC | dense-LA | C · CUDA · HIP · SYCL · OpenMP-target | ✅ | ✅ | LASSI, OMPar |
Rodinia, SHOC, Parboil, HeCBench: the broad-coverage GPU canon. HeCBench alone ships each kernel in 4+ programming models.
| Suite | Year · Org | Motifs | Languages / model | Verify | Ships | Used by (agents) |
|---|---|---|---|---|---|---|
| CUDA Samples | 2008 · NVIDIA | dense-LA · spectral · reduction-scan · mixed | CUDA | ◐ | ✅ | — |
| Mars | 2008 · HKUST / Microsoft | mapreduce | CUDA | ✅ | ◐ | — |
| Parboil | 2008 · UIUC (IMPACT) | dense-LA · sparse-LA · structured-grid · graph-traversal · n-body · mapreduce | C · CUDA · OpenCL · OpenMP | ✅ | ✅ | MIREncoder |
| GPGPU-Sim ISPASS-2009 | 2009 · UBC (Aamodt) | dense-LA · graph-traversal · mixed | CUDA | ✅ | ✅ | — |
| Rodinia | 2009 · U. Virginia (Skadron / LAVA) | dense-LA · structured-grid · unstructured-grid · graph-traversal · dynamic-programming · mixed | C · CUDA · OpenCL · OpenMP | ✅ | ✅ | AutoParLLM, MIREncoder |
| SHOC | 2010 · ORNL + U. Tennessee | dense-LA · sparse-LA · spectral · structured-grid · graph-traversal · reduction-scan · mixed | CUDA · OpenCL · MPI | ✅ | ✅ | MIREncoder |
| OpenDwarfs | 2012 · Virginia Tech (Synergy) | dense-LA · sparse-LA · spectral · n-body · structured-grid · unstructured-grid · graph-traversal · dynamic-programming · combinational-logic · branch-and-bound · graphical-models · finite-state-machine | OpenCL | ✅ | ✅ | — |
| Hetero-Mark | 2016 · Northeastern (NUCAR) | dense-LA · graph-traversal · mixed | CUDA · HIP · HC · OpenCL | ✅ | ✅ | — |
| Chai | 2017 · UIUC (IMPACT) + U. Cordoba | dense-LA · graph-traversal · mixed | CUDA · OpenCL | ✅ | ✅ | — |
| Tartan | 2018 · PNNL + ORNL | data-movement · mixed | CUDA · MPI | — | ◐ | — |
| Mirovia | 2019 · UT Austin / VMware | dense-LA · structured-grid · mixed | CUDA | ✅ | ✅ | — |
| Tango | 2019 · MoCA Lab (SJSU / UC Merced) | dense-LA · attention · mixed | CUDA · OpenCL | ✅ | ✅ | — |
| Altis | 2020 · UT Austin (SCEA) / VMware | dense-LA · structured-grid · mixed | CUDA | ✅ | ✅ | — |
| SYCL-Bench | 2020 · U. Salerno (unisa-hpc) | dense-LA · structured-grid · reduction-scan · mixed | SYCL | ◐ | ✅ | — |
| HeCBench | 2021 · ORNL (Jin, Vetter) | dense-LA · sparse-LA · structured-grid · monte-carlo · graph-traversal · mixed | CUDA · HIP · SYCL · OpenMP-target | ✅ | ✅ | LASSI, OMPar, LLMPerf-Opt, Bolet et al. |
STREAM, ERT, mixbench, gpu-burn: measure what the hardware can actually do. The calibration layer any ceiling-relative metric depends on.
| Suite | Year · Org | Motifs | Languages / model | Verify | Ships | Used by (agents) |
|---|---|---|---|---|---|---|
| STREAM | 1995 · John McCalpin (U. Virginia / TACC) | structured-grid · data-movement | C · Fortran | ✅ | ✅ | — |
| HPL (High Performance Linpack) | 2000 · Netlib / ICL U. Tennessee | dense-LA | C · MPI · BLAS | ✅ | ✅ | — |
| HPCC (HPC Challenge) | 2003 · ICL U. Tennessee | dense-LA · spectral · mapreduce · data-movement · mixed | C · MPI · BLAS | ✅ | ✅ | — |
| HPCG | 2013 · Sandia / ICL (Heroux, Dongarra) | sparse-LA · structured-grid | C++ · MPI · OpenMP | ✅ | ✅ | — |
| gpu-burn | 2013 · Ville Timonen | dense-LA | C++/CUDA | ✅ | ◐ | — |
| clpeak | 2014 · Krishnaraj Bhat | mixed | C++ · OpenCL | — | ◐ | — |
| Empirical Roofline Toolkit (ERT) | 2015 · LBNL CRD / Berkeley | mixed | C · MPI · OpenMP · CUDA | ◐ | ✅ | — |
| mixbench | 2015 · U. Athens (Konstantinidis) | mixed | CUDA · OpenCL · HIP · SYCL · OpenMP | ◐ | ✅ | — |
| gpumembench | 2016 · U. Athens (Konstantinidis) | data-movement | CUDA · OpenCL | ◐ | ✅ | — |
| HPL-MxP (HPL-AI) | 2019 · ICL U. Tennessee | dense-LA | C/C++ · CUDA · HIP · MPI | ✅ | ◐ | — |
| nvbandwidth | 2022 · NVIDIA | data-movement | C++/CUDA | ◐ | ✅ | — |
LLVM test-suite, SPEC ACCEL/OMP, MachSuite: codegen substrates with strict correctness baked in.
| Suite | Year · Org | Motifs | Languages / model | Verify | Ships | Used by (agents) |
|---|---|---|---|---|---|---|
| LLVM test-suite | 2004 · LLVM community | dense-LA · structured-grid · mixed | C · C++ · CUDA | ✅ | ✅ | — |
| CHStone | 2008 · Ritsumeikan U. | combinational-logic · dense-LA · mixed | C | ✅ | ✅ | — |
| SPEC OMP2012 | 2012 · SPEC/HPG | dense-LA · structured-grid · n-body · mixed | C · C++ · Fortran · OpenMP | ✅ | ✅ | — |
| MachSuite | 2014 · Harvard (Reagen et al.) | dense-LA · sparse-LA · spectral · graph-traversal · dynamic-programming · mixed | C | ✅ | ✅ | — |
| SPEC ACCEL | 2014 · SPEC/HPG | dense-LA · structured-grid · mixed | OpenCL · OpenACC · OpenMP-target | ✅ | ✅ | — |
| Rosetta (HLS) | 2018 · Cornell (Zhang et al.) | dense-LA · structured-grid · mixed | C++ HLS · OpenCL | ✅ | ✅ | — |
Ascend, Cambricon, Tenstorrent, IPU substrates — where vendor-kernel scarcity makes agents most valuable and baselines weakest.
| Suite | Year · Org | Motifs | Languages / model | Verify | Ships | Used by (agents) |
|---|---|---|---|---|---|---|
| Graphcore IPU benchmarks | 2020 · Graphcore + U. Bristol | dense-LA · attention · mixed | Poplar C++ · PopLibs | ◐ | ✅ | — |
| Cambricon mlu-ops | 2022 · Cambricon | dense-LA · elementwise · mixed | BANG C | ◐ | ✅ | — |
| Tenstorrent tt-metal | 2023 · Tenstorrent | dense-LA · attention · mixed | C++ (TT-Metalium) · Python (TT-NN) | ◐ | ✅ | — |
| Ascend cann-ops | 2024 · Huawei | dense-LA · elementwise · mixed | Ascend C | ✅ | ✅ | AscendOptimizer |
Which kernel-agent works already evaluate on each classic task source. Two agents are directly comparable only if they share a task source and a measurement method. Agents that ship re-runnable kernels (✅) are auditable; the rest are trust-me numbers.
| Substrate | Family | Used by (kernel agents) | Ships kernels |
|---|---|---|---|
| AKG (Auto Kernel Generator) | tensor-compiler | AKG kernel Agent | ✅ |
| Ascend cann-ops | emerging-accelerator | AscendOptimizer | ✅ |
| CUTLASS Profiler | tensor-compiler | KernelAgent, MANTIS, CUDABench | ✅ |
| DLMC (Deep Learning Matrix Collection) | sparse-la | Sputnik, Magicube, SparseTIR | — |
| HeCBench | classic-gpu-suite | LASSI, OMPar, LLMPerf-Opt, Bolet et al. | ✅ |
| Liger-Kernel benchmarks | dl-micro | TritonBench | ✅ |
| NPB (NAS Parallel Benchmarks) | hpc-mini-app | CUDAnalyst, AutoParLLM | ✅ |
| Parboil | classic-gpu-suite | MIREncoder | ✅ |
| PolyBench-ACC | polyhedral | CUDAnalyst, MEP | ✅ |
| PolyBench/C | polyhedral | Performance-Aligned LLMs, ComPilot, ACCeLLiuM | ✅ |
| PolyBench/GPU | polyhedral | MIREncoder | ✅ |
| RSBench | hpc-mini-app | LLMPerf-Opt | ✅ |
| Rodinia | classic-gpu-suite | AutoParLLM, MIREncoder | ✅ |
| SGLang benchmarks | serving-inference | Astra | — |
| SHOC | classic-gpu-suite | MIREncoder | ✅ |
| SU3_Bench | hpc-mini-app | LASSI, OMPar | ✅ |
| SimpleMOC | hpc-mini-app | ParEval-Repo | ◐ |
| SuiteSparse Matrix Collection | sparse-la | Auto-SpMV | — |
| TorchInductor / Dynamo suite | dl-system | KernelBench-style agents (baseline) | ◐ |
| Triton tutorial kernels | dl-micro | TritonBench, GEAK, KernelLLM, AutoTriton | ✅ |
| XSBench | hpc-mini-app | CUDAnalyst, LLMPerf-Opt, LASSI, OMPar, ParEval-Repo | ✅ |
| cuDNN / cuBLAS | dl-micro | AI CUDA Engineer, KernelBench, TritonBench | ✅ |
| miniFE | hpc-mini-app | LLMPerf-Opt | ✅ |
| vLLM benchmarks | serving-inference | FlashInfer-Bench | — |
The four most-used anchors — PolyBench · NPB · XSBench · HeCBench — are where a new HPC-kernel agent should report at least once.
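The comparability rule can be turned into a quick lookup. The sketch below uses a hand-copied subset of the table above (five task sources) and hypothetical helper names; it only finds agents that share a task source, which is necessary but not sufficient, since the measurement method must match as well.

```python
from itertools import combinations

# Subset of the "Used by (kernel agents)" column, copied from the table above.
USED_BY = {
    "PolyBench/C": ["Performance-Aligned LLMs", "ComPilot", "ACCeLLiuM"],
    "NPB": ["CUDAnalyst", "AutoParLLM"],
    "XSBench": ["CUDAnalyst", "LLMPerf-Opt", "LASSI", "OMPar", "ParEval-Repo"],
    "HeCBench": ["LASSI", "OMPar", "LLMPerf-Opt", "Bolet et al."],
    "Rodinia": ["AutoParLLM", "MIREncoder"],
}


def shared_sources(used_by):
    """Map each agent pair (sorted by name) to the task sources they share."""
    pairs = {}
    for source, agents in used_by.items():
        for a, b in combinations(sorted(agents), 2):
            pairs.setdefault((a, b), []).append(source)
    return pairs


pairs = shared_sources(USED_BY)

# Agent pairs that overlap on more than one task source in this subset.
for (a, b), sources in sorted(pairs.items()):
    if len(sources) > 1:
        print(f"{a} vs {b}: {', '.join(sources)}")
```

A pair that never appears in `pairs` has no common task source in this subset, so any speedup comparison between those two agents rests on different workloads.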
Motifs are the fundamental computation patterns (dense linear algebra, stencils, graph traversal, attention, …) that extend Berkeley's classic 13-motif taxonomy; the matrix below has 19 rows, counting the catch-all mixed. Rows = motifs, columns = task-source groups, cells = how many suites exercise that combination. Near-empty rows are coverage gaps: motifs almost no benchmark tests (and a caution against "our agent generalizes" claims). This matrix is the task-selection tool we use when designing balanced benchmark suites.
| Motif \ Family | DLOP | TC | SERV | DLSYS | SPARSE | STEN | GRAPH | POLY | HPC | GPU | NUM | COMP | ACCEL | Σ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dense-LA | 6 | 9 | 2 | 11 |  | 1 |  | 7 | 4 | 13 | 4 | 6 | 4 | 67 |
| sparse-LA |  |  |  |  | 12 |  | 2 |  | 2 | 4 | 1 | 1 |  | 22 |
| spectral |  |  |  |  |  | 2 |  |  | 1 | 3 | 1 | 1 |  | 8 |
| n-body |  |  |  |  |  |  |  |  | 3 | 2 |  | 1 |  | 6 |
| structured-grid |  |  |  |  |  | 8 |  | 7 | 5 | 8 | 2 | 4 |  | 34 |
| unstructured-grid |  |  |  |  |  |  | 1 |  | 3 | 2 |  |  |  | 6 |
| monte-carlo |  |  |  |  |  |  |  |  | 4 | 1 |  |  |  | 5 |
| graph-traversal |  |  |  |  | 1 |  | 8 |  |  | 8 |  | 1 |  | 18 |
| dynamic-programming |  |  |  |  |  |  |  |  |  | 2 |  | 1 |  | 3 |
| combinational-logic |  |  |  |  |  |  |  |  |  | 1 |  | 1 |  | 2 |
| branch-and-bound |  |  |  |  |  |  |  |  |  | 1 |  |  |  | 1 |
| graphical-models |  |  |  |  |  |  |  |  |  | 1 |  |  |  | 1 |
| finite-state-machine |  |  |  |  |  |  |  |  |  | 1 |  |  |  | 1 |
| mapreduce |  |  |  |  |  |  |  |  |  | 2 | 1 |  |  | 3 |
| attention | 5 | 1 | 6 | 8 |  |  |  |  |  | 1 |  |  | 2 | 23 |
| reduction-scan | 4 | 1 |  |  |  |  |  |  |  | 3 |  |  |  | 8 |
| elementwise | 6 | 2 |  | 2 |  |  |  |  |  |  |  |  | 2 | 12 |
| data-movement |  |  |  |  |  |  |  |  | 1 | 1 | 4 |  |  | 6 |
| mixed | 3 | 4 |  | 10 |  |  |  | 2 | 3 | 12 | 4 | 6 | 4 | 48 |
DLOP = DL operators & vendor baselines
TC = Tensor compilers & autotuners
SERV = LLM serving benchmarks
DLSYS = Whole-model DL benchmarks
SPARSE = Sparse linear algebra
STEN = Stencils, FFT & PDE
GRAPH = Graph analytics (irregular)
POLY = Dense loop nests (PolyBench lineage)
HPC = HPC proxy & mini-apps
GPU = Classic GPU suites (pre-DL canon)
NUM = Peak & roofline probes
COMP = Compiler & HLS test suites
ACCEL = NPU & emerging accelerators
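The coverage gaps can be read straight off the Σ column. The sketch below copies those row totals from the matrix and flags motifs exercised by only a handful of suites; the cutoff of three suites and the function name are assumptions for illustration, not a threshold the catalog defines.

```python
# Row totals (the Σ column) copied from the motif coverage matrix.
MOTIF_TOTALS = {
    "dense-LA": 67, "sparse-LA": 22, "spectral": 8, "n-body": 6,
    "structured-grid": 34, "unstructured-grid": 6, "monte-carlo": 5,
    "graph-traversal": 18, "dynamic-programming": 3, "combinational-logic": 2,
    "branch-and-bound": 1, "graphical-models": 1, "finite-state-machine": 1,
    "mapreduce": 3, "attention": 23, "reduction-scan": 8, "elementwise": 12,
    "data-movement": 6, "mixed": 48,
}


def coverage_gaps(totals, max_suites=3):
    """Motifs covered by at most `max_suites` suites, least-covered first."""
    return sorted((count, motif) for motif, count in totals.items()
                  if count <= max_suites)


gaps = coverage_gaps(MOTIF_TOTALS)
for count, motif in gaps:
    print(f"{motif}: {count} suite(s)")
```

An agent evaluated only on dense-LA and attention tasks has, by these totals, said nothing about the motifs this list prints.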
- Data layer: data/benchmarks.yaml (one entry per benchmark, multi-tag facets: layer, abstraction, motifs, bottleneck, hardware, verify, ships, status; full vocabularies in SCHEMA.md) + data/scorecard.yaml (methodology grades with evidence links). scripts/generate.py validates and renders everything; --check runs in CI.
- Inclusion scope: anything a kernel agent could plausibly be evaluated on, fine-tuned against, or asked to optimize/translate.
- Provenance & dedup: suites repackaging others (SPEC ACCEL ⊂ Parboil/Rodinia/NPB; KernelBench derivatives) are marked, not double-counted. A status field tracks liveness, because catalogs rot.
- Design lineage: Berkeley's 13 computational motifs, the Herten et al. HPC benchmark survey YAML-to-tables pattern, and the methodology-scorecard layer introduced here.
| Term | Plain meaning |
|---|---|
| kernel agent | An LLM-driven system that writes/optimizes GPU kernels, usually in a generate→profile→refine loop |
| substrate / task source | A classic benchmark suite used as raw material for agents — not itself designed to score them |
| correctness oracle | The machinery deciding "is this kernel's output right?" — its strictness decides whether cheating kernels score |
| motif | A fundamental computation pattern (dense LA, stencil, graph traversal, attention…); the coverage vocabulary |
| polyhedral (suite) | Kernels that are regular nested loops over arrays (GEMM-like) — easy to analyze, verify, and transform |
| proxy/mini-app | A few-thousand-line stand-in for a real scientific application (e.g. XSBench ≈ a reactor code's hot loop) |
| verify ✅ / ships ✅ | Suite has a built-in correctness check / ships re-runnable reference kernels (anti-gaming, auditability) |
| roofline / SOL | The hardware ceiling (compute or bandwidth) a kernel could at best reach; "speed-of-light" in NVIDIA jargon |
| pass@k / budget | Score given k attempts; without a declared budget, best-of-many and single-shot results are incomparable |
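To make the pass@k entry concrete, here is the widely used unbiased estimator from the code-generation literature: given n sampled attempts of which c are correct, it estimates the chance that at least one of k attempts succeeds. This is a sketch of that general formula; the catalog does not state which estimator each listed benchmark uses, so treat the mapping to any specific harness as an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k).

    n: total attempts sampled, c: attempts that were correct, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)


# Same agent, same 10 samples, 2 of them correct, two different budgets.
single_shot = pass_at_k(n=10, c=2, k=1)
best_of_five = pass_at_k(n=10, c=2, k=5)

print(f"pass@1 = {single_shot:.3f}")
print(f"pass@5 = {best_of_five:.3f}")
```

The two figures describe the same agent on the same samples, which is why a result reported without its k cannot be compared with one that declares it.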
- awesome-kernel-agent — the agents, DSLs, datasets, and tooling.
- Herten et al., An HPC Benchmark Survey and Taxonomy (IJHPCA 2026).
- Awesome-LLM4Kernel · awesome-LLM-driven-kernel-generation.
Edit data/benchmarks.yaml (new entries) or data/scorecard.yaml (methodology grades — the most valuable PR: read a harness's source, grade it against the criteria, cite the evidence). Then python3 scripts/generate.py and commit the regenerated files. See CONTRIBUTING.md.
If this catalog or its methodology scorecard is useful in your research, please cite:
@misc{you2026awesomekernelbenchmark,
  author    = {Lianzhong You},
  title     = {Awesome Kernel Benchmark: An Evidence-Graded Catalog of Benchmarks for LLM Kernel Agents},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/youyve/awesome-kernel-benchmark}
}

Catalog content: CC-BY-4.0 (attribution required) · code in scripts/: MIT · each listed benchmark retains its own license (per-entry license field).