youyve/awesome-kernel-benchmark

Awesome Kernel Benchmark

Which benchmark should score your kernel agent, and how does each one actually measure? A curated, evidence-graded catalog of every benchmark relevant to LLM/agent GPU-kernel generation. Chinese version: README.zh.md

30-second orientation. This list contains two different kinds of things, and keeping them apart is the whole game:

  1. Agent benchmarks: harnesses designed to score LLM kernel generation (KernelBench and 27 others). These produce the headline numbers in papers.
  2. Task sources (a.k.a. substrates) — the classic HPC/GPU suites that agents are evaluated on or asked to optimize (PolyBench, NPB, Rodinia, …). An agent paper saying "we optimize XSBench kernels" is using a task source, not a benchmark — its numbers are only comparable to another paper that shares both the task source and the measurement method.

Sibling list: awesome-kernel-agent (the agents themselves). Single source of truth: data/benchmarks.yaml + data/scorecard.yaml; every table below is generated — do not hand-edit.

151 benchmarks · 28 purpose-built agent benchmarks · 123 substrate / dataset / tooling entries across 13 families.

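| Layer | Count |
|---|---|
| agent-benchmark | 28 |
| substrate-suite | 110 |
| dataset | 3 |
| tooling | 10 |

| Top hardware targets | Entries |
|---|---|
| NVIDIA | 122 |
| CPU | 67 |
| AMD | 45 |
| Intel-GPU | 13 |
| FPGA | 7 |
| Ascend-NPU | 6 |
| TPU | 3 |
| Cambricon | 2 |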


① What should I evaluate my agent on?

Pick by what you need to prove:

  • "My numbers are comparable to prior work" → KernelBench: the de-facto standard everyone reports. Know its limits (see scorecard) and never use it as sole evidence.
  • "My speedups are not reward-hacking artifacts" → robust-kbench: anti-gaming filters, forward+backward, 1e-5 tolerances.
  • "My kernels are correct on real production shapes" → BackendBench: PyTorch OpInfo edge cases + TorchBench production traces; ships kernels as a pip backend.
  • "My kernels are fast in an absolute sense, not just vs eager" → SOL-ExecBench / CUDABench: ceiling-relative scores (caveat: both use datasheet peaks, which are 8–15% loose vs measured ceilings).
  • "My agent's wins survive a real serving stack" → ISO-Bench (vLLM/SGLang merged-PR tasks) / FlashInfer-Bench.
  • "My results are honest about compute budget" → GEAK: the only one decomposing sequential@k vs parallel@k.
  • Triton specifically → TritonBench; ROCm/AMD → GEAK, AgentKernelArena, NPUEval; Ascend NPU → MultiKernelBench, AscendKernelBench, NPUKernelBench; CUDA correctness with hidden tests → ComputeEval.

The honest answer for a serious paper is a portfolio: one comparability anchor (KernelBench) + one hardened oracle (robust-kbench or BackendBench) + one absolute metric (SOL-ExecBench-style) + explicit budget reporting. No single existing benchmark covers all four — that gap is visible in the scorecard below.

② How does each benchmark actually measure? — the methodology scorecard

Two benchmarks can give the same agent wildly different numbers, because the score is a product of five design choices: the task set, the correctness oracle, the timing methodology, the baseline, and the budget accounting. This scorecard grades each benchmark on the choices that separate trustworthy numbers from inflated ones — graded only from primary evidence (harness source code / paper), never from README claims.

Benchmark Year Hardware Oracle Timing Baseline Budget Pick this if…
BackendBench 2025 NVIDIA PyTorch op semantics (correctness-first; kernels run as a pip backend) You want production-shape correctness and a path to actually shipping kernels as a backend.
ComputeEval 2025 NVIDIA n/a (correctness-first; optional speedup vs reference) You want CUDA correctness with anti-overfitting (hidden tests) from NVIDIA's own harness.
GEAK Benchmarks 2025 AMD expert reference kernels (TritonBench-revised + ROCm set) You target ROCm/Triton and want budget-decomposed (sequential vs parallel) reporting.
KernelBench 2025 NVIDIA PyTorch eager (torch.compile default-mode as secondary) You need comparability with the de-facto standard everyone reports — never as sole evidence.
KernelBench-v3 2025 NVIDIA PyTorch eager (reference.py) You want a community-hardened KernelBench variant reported across several GPUs.
MultiKernelBench 2025 NVIDIA · Ascend-NPU · TPU each platform's own PyTorch backend You need multi-platform coverage (CUDA / Ascend C / TPU Pallas) and accept per-platform-only numbers.
TritonBench 2025 NVIDIA · AMD PyTorch/reference impl; descriptive GPU-efficiency vs A100 theoretical peak You generate Triton specifically and want real-world (GitHub-sourced) operators.
robust-kbench 2025 NVIDIA PyTorch eager (KernelBench-style) You want to prove your speedups survive anti-gaming filters (the KernelBench exploit fixes).
CUDABench 2026 NVIDIA ? attainable-GFLOPs roofline ceiling (datasheet peak) ? You want a roofline-relative Performance-Score with profiler-measured intensity (datasheet-ceiling caveat).
ISO-Bench 2026 NVIDIA unoptimized baseline + human-merged-PR solution (dual reference) You want real serving-stack tasks (vLLM/SGLang merged PRs) with did-it-fix-the-bottleneck attribution.
SOL-ExecBench 2026 NVIDIA agent-optimized PyTorch baseline + analytical speed-of-light ceiling (datasheet peak) You want an absolute (ceiling-relative) score on production LLM kernels — datasheet-ceiling caveat applies.

Grades (criteria in data/scorecard.yaml / SCHEMA.md): ● strong · ◐ partial · ○ weak · — none · ? unverified. Grades are assigned ONLY from primary evidence (harness source / paper) — no benchmark is graded from its README claims.

Not yet graded (17): C2HLSC, ParEval, AscendKernelBench (AKG-AGENT), CANN Bench, FlashInfer-Bench, HLS-Eval, NKIBench, NPUEval, QiMeng-TensorOp, QiMeng-Xpiler, TritonGym, AgentKernelArena, KernelBench-MUSA (MooreEval), KernelBenchX, KernelCraft, MSKernelBench, NPUKernelBench. PRs grading these against the criteria are the most valuable contribution this list can receive.
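
To make two of the five design choices concrete, here is a minimal sketch of a correctness oracle and a timing routine. It is not taken from any benchmark listed above: the softmax functions, the function names, and the default tolerances are illustrative assumptions, and it runs on the CPU with plain Python lists. On a GPU the timer would also need device synchronization, which is omitted here.

```python
import math
import random
import statistics
import time


def reference_softmax(xs):
    """Numerically stable reference implementation (plays the baseline)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def candidate_softmax(xs):
    """Stand-in for an agent-generated kernel (hypothetical)."""
    m = max(xs)
    log_total = math.log(sum(math.exp(x - m) for x in xs))
    return [math.exp(x - m - log_total) for x in xs]


def passes_oracle(candidate, reference, sizes=(1, 7, 128, 1000), trials=5,
                  rel_tol=1e-5, abs_tol=1e-5, seed=0):
    """Compare candidate and reference on several random inputs and sizes."""
    rng = random.Random(seed)
    for n in sizes:
        for _ in range(trials):
            xs = [rng.uniform(-10.0, 10.0) for _ in range(n)]
            got, want = candidate(xs), reference(xs)
            if len(got) != len(want):
                return False
            if not all(math.isclose(g, w, rel_tol=rel_tol, abs_tol=abs_tol)
                       for g, w in zip(got, want)):
                return False
    return True


def median_seconds(fn, arg, warmup=3, repeats=15):
    """Median wall-clock time after warmup runs, to damp outliers."""
    for _ in range(warmup):
        fn(arg)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def score(candidate, reference, arg):
    """Return (correct, speedup); speedup is only reported for correct kernels."""
    if not passes_oracle(candidate, reference):
        return False, None
    return True, median_seconds(reference, arg) / median_seconds(candidate, arg)
```

A constant-output "kernel" such as `lambda xs: [1.0 / len(xs)] * len(xs)` agrees with the reference on a one-element input but is rejected as soon as the oracle tries other sizes. That is why the number and variety of test inputs is part of what a scorecard has to grade.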

Full inventory of all purpose-built agent benchmarks (facets: abstraction, motifs, hardware, verification)
Benchmark Year Abstraction Motifs Hardware Verify Runs on (substrate)
C2HLSC 2024 kernel mixed FPGA
ParEval 2024 kernel dense-LA · sparse-LA · structured-grid · graph-traversal · reduction-scan · n-body · mixed NVIDIA · AMD · CPU
AscendKernelBench (AKG-AGENT) 2025 operator dense-LA · elementwise · reduction-scan · mixed Ascend-NPU · NVIDIA · CPU
BackendBench 2025 operator dense-LA · elementwise · reduction-scan · mixed NVIDIA
CANN Bench 2025 operator mixed Ascend-NPU
ComputeEval 2025 kernel dense-LA · reduction-scan · elementwise · mixed NVIDIA
FlashInfer-Bench 2025 operator attention · dense-LA · mixed NVIDIA
GEAK Benchmarks 2025 kernel dense-LA · attention · reduction-scan · mixed AMD
HLS-Eval 2025 kernel mixed FPGA
KernelBench 2025 operator dense-LA · structured-grid · reduction-scan · elementwise · attention · mixed NVIDIA
KernelBench-v3 2025 operator dense-LA · attention · mixed NVIDIA
MultiKernelBench 2025 operator dense-LA · attention · reduction-scan · elementwise · mixed NVIDIA · Ascend-NPU · TPU
NKIBench 2025 kernel dense-LA · attention · mixed Trainium
NPUEval 2025 kernel dense-LA · elementwise · reduction-scan · mixed AMD
QiMeng-TensorOp 2025 operator dense-LA RISC-V · ARM · NVIDIA
QiMeng-Xpiler 2025 operator dense-LA · mixed Cambricon · NVIDIA · AMD · CPU
TritonBench 2025 kernel dense-LA · attention · reduction-scan · elementwise · mixed NVIDIA · AMD
TritonGym 2025 kernel mixed NVIDIA
robust-kbench 2025 operator dense-LA · attention · reduction-scan · mixed NVIDIA
AgentKernelArena 2026 kernel mixed AMD
CUDABench 2026 operator dense-LA · attention · reduction-scan · elementwise · mixed NVIDIA
ISO-Bench 2026 operator mixed NVIDIA
KernelBench-MUSA (MooreEval) 2026 operator dense-LA · attention · mixed MooreThreads · NVIDIA
KernelBenchX 2026 operator dense-LA · attention · reduction-scan · elementwise · mixed NVIDIA
KernelCraft 2026 kernel mixed NVIDIA · AMD
MSKernelBench 2026 operator dense-LA · sparse-LA · attention · mixed NVIDIA
NPUKernelBench 2026 kernel mixed Ascend-NPU
SOL-ExecBench 2026 operator dense-LA · attention · mixed NVIDIA

③ Where do I get kernels/tasks to optimize?

The classic suites below are task sources: curated kernels with reference implementations that agents optimize, translate, or get fine-tuned on. Groups are ordered by how often kernel-agent work reaches for them; the Verify column (✅ built-in correctness oracle) is what makes a suite resistant to wrong-but-fast kernels, and Ships (✅ re-runnable reference kernels) is what makes results auditable.

DL operators & vendor baselines

The kernels your agent must beat: cuDNN/cuBLAS, FlashAttention, Triton tutorials, Liger. Use as reference implementations and speedup denominators.

Suite Year · Org Motifs Languages / model Verify Ships Used by (agents)
cuDNN / cuBLAS 2014 · NVIDIA dense-LA · attention · elementwise · mixed CUDA AI CUDA Engineer, KernelBench, TritonBench
DeepBench 2016 · Baidu Research dense-LA · elementwise CUDA · HIP · ARM
DNNMark 2017 · Northeastern (NUCAR) dense-LA · elementwise CUDA · HIP
PyTorch operator_benchmark 2019 · Meta / PyTorch dense-LA · elementwise · reduction-scan · mixed Python · CUDA
NVBench 2021 · NVIDIA mixed CUDA/C++
Triton tutorial kernels 2021 · OpenAI / triton-lang dense-LA · attention · reduction-scan · elementwise Triton TritonBench, GEAK, KernelLLM, AutoTriton
xFormers benchmarks 2021 · Meta AI (FAIR) attention · dense-LA CUDA · CUTLASS · Triton
FlashAttention benchmarks 2022 · Dao-AILab attention · reduction-scan CUDA · CuTeDSL
Liger-Kernel benchmarks 2024 · LinkedIn reduction-scan · elementwise · attention Triton TritonBench

Tensor compilers & autotuners

Machine-generated baselines (TVM/Ansor, Hidet, CUTLASS profiler). Use when you want your agent compared against autotuned — not just eager — code.

Suite Year · Org Motifs Languages / model Verify Ships Used by (agents)
CUTLASS Profiler 2017 · NVIDIA dense-LA CUDA C++ · CuTe · Python DSL KernelAgent, MANTIS, CUDABench
TVM / Ansor 2018 · Apache TVM / OctoML dense-LA · mixed TVM TE/TIR
IREE benchmarks 2020 · Google / OpenXLA dense-LA · mixed MLIR (Linalg)
MLIR microkernels 2020 · LLVM / MLIR community dense-LA · elementwise MLIR
AKG (Auto Kernel Generator) 2021 · Huawei / MindSpore dense-LA · elementwise · reduction-scan · mixed polyhedral DSL · Triton-Ascend AKG kernel Agent
TenSet 2021 · UC Berkeley / OctoML (Zheng) dense-LA TVM schedules
Roller 2022 · Microsoft Research Asia dense-LA TVM TE · CUDA
Hidet 2023 · U. Toronto / CentML dense-LA · attention Python DSL · CUDA
Welder 2023 · MSR Asia / Peking U. dense-LA · mixed TVM · CUTLASS · CUDA

LLM serving benchmarks

Where kernel wins become end-to-end wins: TTFT/TPOT/throughput harnesses (vLLM, SGLang). Use to show a kernel matters at the serving level.

Suite Year · Org Motifs Languages / model Verify Ships Used by (agents)
LLMPerf 2023 · Ray project (Anyscale) attention Python
vLLM benchmarks 2023 · vLLM project (UC Berkeley → community) attention · dense-LA Python FlashInfer-Bench
FlexAttention / attention-gym 2024 · Meta / PyTorch attention Python · Triton
GenAI-Perf 2024 · NVIDIA attention Python
SGLang benchmarks 2024 · SGLang project attention · dense-LA Python Astra
AIPerf 2025 · NVIDIA attention Python

Whole-model DL benchmarks

Model-level suites (MLPerf, TorchBench, TorchInductor). Use to bound end-to-end impact of kernel-level changes.

Suite Year · Org Motifs Languages / model Verify Ships Used by (agents)
Fathom 2016 · Harvard (VLSI-Arch) dense-LA · attention · mixed Python · TensorFlow
DAWNBench 2017 · Stanford DAWN dense-LA · mixed Python
AI-Benchmark 2018 · ETH Zurich dense-LA · mixed Python · TensorFlow
AIBench 2018 · ICT CAS / BenchCouncil dense-LA · attention · mixed Python · C++
MLPerf Training 2018 · MLCommons dense-LA · attention · mixed Python · PyTorch · TensorFlow
MLPerf Inference 2019 · MLCommons dense-LA · attention · mixed Python · C++
Megatron-LM benchmarks 2019 · NVIDIA dense-LA · attention · mixed Python · CUDA
MLPerf Tiny 2021 · MLCommons dense-LA · elementwise · mixed C · C++
TorchBench 2021 · Meta / PyTorch dense-LA · attention · mixed Python · CUDA
NVIDIA Transformer Engine benchmarks 2022 · NVIDIA dense-LA · attention Python · C++ · CUDA
TorchInductor / Dynamo suite 2022 · Meta / PyTorch dense-LA · attention · elementwise · mixed Python · Triton KernelBench-style agents (baseline)

Sparse linear algebra

SpMV/SpMM/SDDMM kernels and the matrix collections (SuiteSparse, DLMC) they run on. Irregular memory patterns — a hard, under-benchmarked motif.

Suite Year · Org Motifs Languages / model Verify Ships Used by (agents)
SuiteSparse Matrix Collection 2011 · Texas A&M (Davis); orig. U. Florida sparse-LA Matrix Market Auto-SpMV
cuSPARSE 2014 · NVIDIA sparse-LA CUDA
TACO benchmarks 2017 · MIT CSAIL (Kjolstad) sparse-LA C++
ASpT 2019 · Academic (Hong et al.) sparse-LA CUDA · OpenMP
DLMC (Deep Learning Matrix Collection) 2020 · Google Research / Stanford sparse-LA Matrix Market Sputnik, Magicube, SparseTIR
GE-SpMM / dgSPARSE 2020 · Tsinghua (Huang et al.) sparse-LA · graph-traversal CUDA
Sputnik 2020 · Google Research / Stanford sparse-LA CUDA/C++
cuSPARSELt 2020 · NVIDIA sparse-LA CUDA
vectorSparse 2021 · Academic (Chen et al.) sparse-LA CUDA
Magicube 2022 · ETH Zurich (Li et al.) sparse-LA CUDA
SparseTIR 2023 · U. Washington (SAMPL) sparse-LA TVM TensorIR · CUDA
VENOM / Spatha 2023 · Universidad de Malaga (Castro et al.) sparse-LA CUDA

Stencils, FFT & PDE

Structured-grid stencils, FFT/spectral kernels, and their DSLs (Halide, Devito). Classic memory-bound optimization targets with clean verification.

Suite Year · Org Motifs Languages / model Verify Ships Used by (agents)
benchFFT (FFTW) 2003 · MIT (Frigo, Johnson) spectral C
Stencil Probe 2007 · UC Berkeley (Kamil et al.) structured-grid C
Pochoir 2011 · MIT (Tang et al.) structured-grid C++
Halide benchmarks 2012 · MIT CSAIL structured-grid · dense-LA Halide DSL · CUDA
ExaStencils / ExaSlang 2014 · U. Erlangen / Passau structured-grid ExaSlang · CUDA · OpenMP · MPI
PolyMage 2015 · IISc Bangalore (Mullapudi et al.) structured-grid Python DSL · C
Devito benchmarks 2016 · Imperial College London structured-grid Python DSL · C
StencilGen (Artemis) 2018 · Ohio State (Rawat et al.) structured-grid CUDA
AN5D 2020 · U. Tokyo / RIKEN (Matsumura et al.) structured-grid CUDA
heFFTe 2020 · ICL U. Tennessee spectral C++ · CUDA · HIP

Graph analytics (irregular)

BFS/PageRank/connected-components with strong optimized baselines (Gunrock, GAPBS). Use to test agents beyond dense regular loops.

Suite Year · Org Motifs Languages / model Verify Ships Used by (agents)
LonestarGPU / Lonestar 2012 · UT Austin (ISS) graph-traversal · unstructured-grid CUDA · C++
Pannotia 2013 · AMD Research + U. Virginia graph-traversal OpenCL · HIP · CUDA
GAP Benchmark Suite 2015 · UC Berkeley (Beamer) graph-traversal · sparse-LA C++ · OpenMP
GraphBIG 2015 · Georgia Tech + IBM graph-traversal C++ · CUDA
Gunrock 2016 · UC Davis (Owens) graph-traversal CUDA/C++
GARDENIA 2017 · NUDT (Chen) graph-traversal · sparse-LA CUDA · OpenCL · OpenMP
GBBS 2020 · MIT / CMU (Dhulipala et al.) graph-traversal C++
Indigo3 2024 · Texas State (Burtscher) graph-traversal C · OpenMP · CUDA · HIP

Dense loop nests (PolyBench lineage)

Regular affine loop kernels (GEMM-like, stencils) with trivially checkable outputs — the easiest substrate to verify, and the most-used in agent papers.

Suite Year · Org Motifs Languages / model Verify Ships Used by (agents)
PolyBench/C 2010 · OSU / CSU (Pouchet, Yuki) dense-LA · structured-grid · mixed C Performance-Aligned LLMs, ComPilot, ACCeLLiuM
PolyBench/Fortran 2012 · OSU (Pouchet, Narayan) dense-LA · structured-grid Fortran
PolyBench/GPU 2012 · U. Delaware (Grauer-Gray, Cavazos) dense-LA · structured-grid CUDA · OpenCL · HMPP · OpenACC MIREncoder
PolyBench-ACC 2013 · Cavazos Lab, U. Delaware dense-LA · structured-grid · mixed CUDA · OpenCL · OpenACC · OpenMP · HMPP CUDAnalyst, MEP
PolyBench-RAJA 2016 · U. Delaware (Killian) dense-LA · structured-grid C++ (RAJA)
PolyBench-NN 2019 · IIT Hyderabad (IITH-Compilers) dense-LA · structured-grid C
PolyBench/Python 2021 · UDC-GAC (U. da Coruna) dense-LA · structured-grid Python · NumPy

HPC proxy & mini-apps

DOE/NASA mini-apps (NPB, XSBench, LULESH): small, science-representative, almost all ship a built-in verification figure-of-merit.

Suite Year · Org Motifs Languages / model Verify Ships Used by (agents)
NPB (NAS Parallel Benchmarks) 1991 · NASA Advanced Supercomputing dense-LA · sparse-LA · spectral · structured-grid · mixed Fortran · C CUDAnalyst, AutoParLLM
miniMD 2008 · Sandia (Mantevo) n-body C++ · Kokkos · OpenMP
LULESH 2010 · LLNL unstructured-grid · n-body C++ · CUDA · RAJA · OpenMP
miniFE 2011 · Sandia (Mantevo) sparse-LA · unstructured-grid C++ · CUDA · Kokkos · OpenMP LLMPerf-Opt
CoMD 2012 · ExMatEx / ECP-CoPA n-body C · CUDA · OpenCL · OpenMP
PENNANT 2012 · LANL unstructured-grid C++ · CUDA · MPI
Nekbone 2013 · Argonne (Nek5000 team) dense-LA · structured-grid Fortran · CUDA-Fortran · OpenACC
HPGMG 2014 · LBNL structured-grid C · CUDA · OpenMP
RSBench 2014 · Argonne (ANL-CESAR) monte-carlo · mixed C · CUDA · OpenCL · SYCL · OpenMP-target LLMPerf-Opt
XSBench 2014 · Argonne (ANL-CESAR) monte-carlo · mixed C · CUDA · OpenCL · SYCL · OpenMP-target CUDAnalyst, LLMPerf-Opt, LASSI, OMPar, ParEval-Repo
miniAMR 2014 · Sandia (Mantevo) structured-grid C · MPI · OpenMP
BabelStream 2015 · U. Bristol (UoB-HPC) structured-grid · data-movement CUDA · HIP · SYCL · OpenCL · OpenMP · Kokkos · RAJA · TBB · Thrust
SimpleMOC 2015 · Argonne (CESAR) + MIT monte-carlo · dense-LA C · CUDA · OpenCL · OpenMP-target ParEval-Repo
Quicksilver 2016 · LLNL monte-carlo C++ · CUDA · OpenMP
SU3_Bench 2019 · LBNL / NERSC dense-LA C · CUDA · HIP · SYCL · OpenMP-target LASSI, OMPar

Classic GPU suites (pre-DL canon)

Rodinia, SHOC, Parboil, HeCBench: the broad-coverage GPU canon. HeCBench alone provides the same kernel in 4+ programming models.

Suite Year · Org Motifs Languages / model Verify Ships Used by (agents)
CUDA Samples 2008 · NVIDIA dense-LA · spectral · reduction-scan · mixed CUDA
Mars 2008 · HKUST / Microsoft mapreduce CUDA
Parboil 2008 · UIUC (IMPACT) dense-LA · sparse-LA · structured-grid · graph-traversal · n-body · mapreduce C · CUDA · OpenCL · OpenMP MIREncoder
GPGPU-Sim ISPASS-2009 2009 · UBC (Aamodt) dense-LA · graph-traversal · mixed CUDA
Rodinia 2009 · U. Virginia (Skadron / LAVA) dense-LA · structured-grid · unstructured-grid · graph-traversal · dynamic-programming · mixed C · CUDA · OpenCL · OpenMP AutoParLLM, MIREncoder
SHOC 2010 · ORNL + U. Tennessee dense-LA · sparse-LA · spectral · structured-grid · graph-traversal · reduction-scan · mixed CUDA · OpenCL · MPI MIREncoder
OpenDwarfs 2012 · Virginia Tech (Synergy) dense-LA · sparse-LA · spectral · n-body · structured-grid · unstructured-grid · graph-traversal · dynamic-programming · combinational-logic · branch-and-bound · graphical-models · finite-state-machine OpenCL
Hetero-Mark 2016 · Northeastern (NUCAR) dense-LA · graph-traversal · mixed CUDA · HIP · HC · OpenCL
Chai 2017 · UIUC (IMPACT) + U. Cordoba dense-LA · graph-traversal · mixed CUDA · OpenCL
Tartan 2018 · PNNL + ORNL data-movement · mixed CUDA · MPI
Mirovia 2019 · UT Austin / VMware dense-LA · structured-grid · mixed CUDA
Tango 2019 · MoCA Lab (SJSU / UC Merced) dense-LA · attention · mixed CUDA · OpenCL
Altis 2020 · UT Austin (SCEA) / VMware dense-LA · structured-grid · mixed CUDA
SYCL-Bench 2020 · U. Salerno (unisa-hpc) dense-LA · structured-grid · reduction-scan · mixed SYCL
HeCBench 2021 · ORNL (Jin, Vetter) dense-LA · sparse-LA · structured-grid · monte-carlo · graph-traversal · mixed CUDA · HIP · SYCL · OpenMP-target LASSI, OMPar, LLMPerf-Opt, Bolet et al.

Peak & roofline probes

STREAM, ERT, mixbench, gpu-burn: measure what the hardware can actually do. The calibration layer any ceiling-relative metric depends on.

Suite Year · Org Motifs Languages / model Verify Ships Used by (agents)
STREAM 1995 · John McCalpin (U. Virginia / TACC) structured-grid · data-movement C · Fortran
HPL (High Performance Linpack) 2000 · Netlib / ICL U. Tennessee dense-LA C · MPI · BLAS
HPCC (HPC Challenge) 2003 · ICL U. Tennessee dense-LA · spectral · mapreduce · data-movement · mixed C · MPI · BLAS
HPCG 2013 · Sandia / ICL (Heroux, Dongarra) sparse-LA · structured-grid C++ · MPI · OpenMP
gpu-burn 2013 · Ville Timonen dense-LA C++/CUDA
clpeak 2014 · Krishnaraj Bhat mixed C++ · OpenCL
Empirical Roofline Toolkit (ERT) 2015 · LBNL CRD / Berkeley mixed C · MPI · OpenMP · CUDA
mixbench 2015 · U. Athens (Konstantinidis) mixed CUDA · OpenCL · HIP · SYCL · OpenMP
gpumembench 2016 · U. Athens (Konstantinidis) data-movement CUDA · OpenCL
HPL-MxP (HPL-AI) 2019 · ICL U. Tennessee dense-LA C/C++ · CUDA · HIP · MPI
nvbandwidth 2022 · NVIDIA data-movement C++/CUDA
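
As a rough illustration of why this calibration layer matters, the sketch below computes a roofline-style ceiling-relative score twice: once against a datasheet peak and once against a measured peak. All device numbers are hypothetical and the function names are ours; no listed benchmark is claimed to compute its score this way.

```python
def attainable_gflops(intensity_flop_per_byte, peak_gflops, peak_gbps):
    """Roofline ceiling: min(compute peak, arithmetic intensity x bandwidth)."""
    return min(peak_gflops, intensity_flop_per_byte * peak_gbps)


def ceiling_relative_score(achieved_gflops, intensity, peak_gflops, peak_gbps):
    """Fraction of the attainable ceiling that the kernel actually reached."""
    return achieved_gflops / attainable_gflops(intensity, peak_gflops, peak_gbps)


# Hypothetical device numbers, for illustration only.
DATASHEET = {"peak_gflops": 20000.0, "peak_gbps": 2000.0}
MEASURED = {"peak_gflops": 18000.0, "peak_gbps": 1800.0}  # 10% below datasheet

intensity = 2.0    # FLOP per byte: memory-bound under both ceilings
achieved = 3000.0  # GFLOP/s reached by the kernel

vs_datasheet = ceiling_relative_score(achieved, intensity, **DATASHEET)
vs_measured = ceiling_relative_score(achieved, intensity, **MEASURED)
print(f"vs datasheet ceiling: {vs_datasheet:.3f}")  # 0.750
print(f"vs measured ceiling:  {vs_measured:.3f}")   # 0.833
```

The same kernel scores 0.75 against the datasheet ceiling and about 0.83 against the measured one, so two papers using different ceilings are not reporting the same quantity.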

Compiler & HLS test suites

LLVM test-suite, SPEC ACCEL/OMP, MachSuite: codegen substrates with strict correctness baked in.

Suite Year · Org Motifs Languages / model Verify Ships Used by (agents)
LLVM test-suite 2004 · LLVM community dense-LA · structured-grid · mixed C · C++ · CUDA
CHStone 2008 · Ritsumeikan U. combinational-logic · dense-LA · mixed C
SPEC OMP2012 2012 · SPEC/HPG dense-LA · structured-grid · n-body · mixed C · C++ · Fortran · OpenMP
MachSuite 2014 · Harvard (Reagen et al.) dense-LA · sparse-LA · spectral · graph-traversal · dynamic-programming · mixed C
SPEC ACCEL 2014 · SPEC/HPG dense-LA · structured-grid · mixed OpenCL · OpenACC · OpenMP-target
Rosetta (HLS) 2018 · Cornell (Zhang et al.) dense-LA · structured-grid · mixed C++ HLS · OpenCL

NPU & emerging accelerators

Ascend, Cambricon, Tenstorrent, IPU substrates — where vendor-kernel scarcity makes agents most valuable and baselines weakest.

Suite Year · Org Motifs Languages / model Verify Ships Used by (agents)
Graphcore IPU benchmarks 2020 · Graphcore + U. Bristol dense-LA · attention · mixed Poplar C++ · PopLibs
Cambricon mlu-ops 2022 · Cambricon dense-LA · elementwise · mixed BANG C
Tenstorrent tt-metal 2023 · Tenstorrent dense-LA · attention · mixed C++ (TT-Metalium) · Python (TT-NN)
Ascend cann-ops 2024 · Huawei dense-LA · elementwise · mixed Ascend C AscendOptimizer

④ Who can I compare against?

Which kernel-agent works already evaluate on each classic task source. Two agents are directly comparable only if they share a task source and a measurement method. Agents that ship re-runnable kernels (✅) are auditable; the rest are trust-me numbers.

Substrate Family Used by (kernel agents) Ships kernels
AKG (Auto Kernel Generator) tensor-compiler AKG kernel Agent
Ascend cann-ops emerging-accelerator AscendOptimizer
CUTLASS Profiler tensor-compiler KernelAgent, MANTIS, CUDABench
DLMC (Deep Learning Matrix Collection) sparse-la Sputnik, Magicube, SparseTIR
HeCBench classic-gpu-suite LASSI, OMPar, LLMPerf-Opt, Bolet et al.
Liger-Kernel benchmarks dl-micro TritonBench
NPB (NAS Parallel Benchmarks) hpc-mini-app CUDAnalyst, AutoParLLM
Parboil classic-gpu-suite MIREncoder
PolyBench-ACC polyhedral CUDAnalyst, MEP
PolyBench/C polyhedral Performance-Aligned LLMs, ComPilot, ACCeLLiuM
PolyBench/GPU polyhedral MIREncoder
RSBench hpc-mini-app LLMPerf-Opt
Rodinia classic-gpu-suite AutoParLLM, MIREncoder
SGLang benchmarks serving-inference Astra
SHOC classic-gpu-suite MIREncoder
SU3_Bench hpc-mini-app LASSI, OMPar
SimpleMOC hpc-mini-app ParEval-Repo
SuiteSparse Matrix Collection sparse-la Auto-SpMV
TorchInductor / Dynamo suite dl-system KernelBench-style agents (baseline)
Triton tutorial kernels dl-micro TritonBench, GEAK, KernelLLM, AutoTriton
XSBench hpc-mini-app CUDAnalyst, LLMPerf-Opt, LASSI, OMPar, ParEval-Repo
cuDNN / cuBLAS dl-micro AI CUDA Engineer, KernelBench, TritonBench
miniFE hpc-mini-app LLMPerf-Opt
vLLM benchmarks serving-inference FlashInfer-Bench

The four most-used anchors — PolyBench · NPB · XSBench · HeCBench — are where a new HPC-kernel agent should report at least once.
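
The "shared task source" half of the comparability rule can be checked mechanically. The sketch below uses a few rows copied from the table above and lists which agent pairs overlap. Sharing a source is necessary but not sufficient: the measurement method still has to match, and that is not encoded here. The function name is our own.

```python
from itertools import combinations

# Subset of the "Used by (kernel agents)" column above.
USED_BY = {
    "XSBench": ["CUDAnalyst", "LLMPerf-Opt", "LASSI", "OMPar", "ParEval-Repo"],
    "HeCBench": ["LASSI", "OMPar", "LLMPerf-Opt", "Bolet et al."],
    "SU3_Bench": ["LASSI", "OMPar"],
    "NPB (NAS Parallel Benchmarks)": ["CUDAnalyst", "AutoParLLM"],
    "Rodinia": ["AutoParLLM", "MIREncoder"],
}


def shared_sources(used_by):
    """Map each unordered agent pair to the task sources both evaluate on."""
    pairs = {}
    for source, agents in used_by.items():
        for a, b in combinations(sorted(agents), 2):
            pairs.setdefault((a, b), []).append(source)
    return pairs


pairs = shared_sources(USED_BY)
# Print pairs with the most shared sources first.
for (a, b), sources in sorted(pairs.items(), key=lambda kv: -len(kv[1])):
    print(f"{a} / {b}: {', '.join(sources)}")
```

On this subset, LASSI and OMPar share three task sources (XSBench, HeCBench, SU3_Bench), while AutoParLLM and LASSI share none and so cannot be compared directly.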


Appendix A — motif coverage matrix

Motifs are fundamental computation patterns (dense linear algebra, stencils, graph traversal, attention, …), building on Berkeley's classic taxonomy of 13 computational motifs. Rows = motifs, columns = task-source groups, cells = how many suites exercise that combination. Sparse rows are coverage gaps: motifs almost no benchmark tests (and a caution against "our agent generalizes" claims). This matrix is the task-selection tool we use when designing balanced benchmark suites.

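| Motif \ Family | DLOP | TC | SERV | DLSYS | SPARSE | STEN | GRAPH | POLY | HPC | GPU | NUM | COMP | ACCEL | Σ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dense-LA | 6 | 9 | 2 | 11 | | 1 | | 7 | 4 | 13 | 4 | 6 | 4 | 67 |
| sparse-LA | | | | | 12 | | 2 | | 2 | 4 | 1 | 1 | | 22 |
| spectral | | | | | | 2 | | | 1 | 3 | 1 | 1 | | 8 |
| n-body | | | | | | | | | 3 | 2 | | 1 | | 6 |
| structured-grid | | | | | | 8 | | 7 | 5 | 8 | 2 | 4 | | 34 |
| unstructured-grid | | | | | | | 1 | | 3 | 2 | | | | 6 |
| monte-carlo | | | | | | | | | 4 | 1 | | | | 5 |
| graph-traversal | | | | | 1 | | 8 | | | 8 | | 1 | | 18 |
| dynamic-programming | | | | | | | | | | 2 | | 1 | | 3 |
| combinational-logic | | | | | | | | | | 1 | | 1 | | 2 |
| branch-and-bound | | | | | | | | | | 1 | | | | 1 |
| graphical-models | | | | | | | | | | 1 | | | | 1 |
| finite-state-machine | | | | | | | | | | 1 | | | | 1 |
| mapreduce | | | | | | | | | | 2 | 1 | | | 3 |
| attention | 5 | 1 | 6 | 8 | | | | | | 1 | | | 2 | 23 |
| reduction-scan | 4 | 1 | | | | | | | | 3 | | | | 8 |
| elementwise | 6 | 2 | | 2 | | | | | | | | | 2 | 12 |
| data-movement | | | | | | | | | 1 | 1 | 4 | | | 6 |
| mixed | 3 | 4 | | 10 | | | | 2 | 3 | 12 | 4 | 6 | 4 | 48 |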

DLOP = DL operators & vendor baselines
TC = Tensor compilers & autotuners
SERV = LLM serving benchmarks
DLSYS = Whole-model DL benchmarks
SPARSE = Sparse linear algebra
STEN = Stencils, FFT & PDE
GRAPH = Graph analytics (irregular)
POLY = Dense loop nests (PolyBench lineage)
HPC = HPC proxy & mini-apps
GPU = Classic GPU suites (pre-DL canon)
NUM = Peak & roofline probes
COMP = Compiler & HLS test suites
ACCEL = NPU & emerging accelerators
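
Each cell of the matrix is a count over tagged entries. The sketch below shows one way such a count could be computed, using the ten "Stencils, FFT & PDE" suites and their motif tags as listed in section ③. The tuple layout and function name are assumptions for illustration, not the actual logic of scripts/generate.py.

```python
from collections import Counter

# (family code, motif tags) for the ten "Stencils, FFT & PDE" suites.
ENTRIES = [
    ("STEN", ["spectral"]),                     # benchFFT (FFTW)
    ("STEN", ["structured-grid"]),              # Stencil Probe
    ("STEN", ["structured-grid"]),              # Pochoir
    ("STEN", ["structured-grid", "dense-LA"]),  # Halide benchmarks
    ("STEN", ["structured-grid"]),              # ExaStencils / ExaSlang
    ("STEN", ["structured-grid"]),              # PolyMage
    ("STEN", ["structured-grid"]),              # Devito benchmarks
    ("STEN", ["structured-grid"]),              # StencilGen (Artemis)
    ("STEN", ["structured-grid"]),              # AN5D
    ("STEN", ["spectral"]),                     # heFFTe
]


def coverage(entries):
    """Count, per (motif, family) cell, how many suites carry that motif tag."""
    cells = Counter()
    for family, motifs in entries:
        for motif in set(motifs):  # a suite counts once per motif
            cells[(motif, family)] += 1
    return cells


cells = coverage(ENTRIES)
for (motif, family), count in sorted(cells.items()):
    print(f"{motif:16s} {family}: {count}")
```

The result (structured-grid 8, spectral 2, dense-LA 1) matches the STEN column of the matrix. A motif with no tagged suite, such as attention in this family, simply has no cell.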

Appendix B — how this list is built

  • Data layer: data/benchmarks.yaml (one entry per benchmark, multi-tag facets: layer, abstraction, motifs, bottleneck, hardware, verify, ships, status — full vocabularies in SCHEMA.md) + data/scorecard.yaml (methodology grades with evidence links). scripts/generate.py validates and renders everything; --check runs in CI.
  • Inclusion scope: anything a kernel agent could plausibly be evaluated on, fine-tuned against, or asked to optimize/translate.
  • Provenance & dedup: suites repackaging others (SPEC ACCEL ⊂ Parboil/Rodinia/NPB; KernelBench derivatives) are marked, not double-counted. A status field tracks liveness — catalogs rot.
  • Design lineage: Berkeley's 13 computational motifs, the Herten et al. HPC benchmark survey YAML-to-tables pattern, and the methodology-scorecard layer introduced here.
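
For a feel of what validating an entry could involve, here is a hedged sketch. The facet names and the four layer values come from this README. Everything else is an assumption for illustration: the `name` key, the example values, the choice of list-valued facets, and the validation rules. The authoritative vocabularies are in SCHEMA.md and the real checks live in scripts/generate.py.

```python
REQUIRED_FACETS = ("layer", "abstraction", "motifs", "bottleneck",
                   "hardware", "verify", "ships", "status")
LAYERS = {"agent-benchmark", "substrate-suite", "dataset", "tooling"}


def validate_entry(entry):
    """Return a list of problems; an empty list means the entry looks well-formed."""
    problems = [f"missing facet: {f}" for f in REQUIRED_FACETS if f not in entry]
    if "layer" in entry and entry["layer"] not in LAYERS:
        problems.append(f"unknown layer: {entry['layer']!r}")
    # Assumption: multi-tag facets are stored as lists of tags.
    for multi in ("motifs", "hardware"):
        if multi in entry and not isinstance(entry[multi], list):
            problems.append(f"{multi} should be a list of tags")
    return problems


# Hypothetical entry; values other than layer and motifs are placeholders.
example = {
    "name": "XSBench",                # key name is an assumption
    "layer": "substrate-suite",
    "abstraction": "kernel",          # placeholder
    "motifs": ["monte-carlo", "mixed"],
    "bottleneck": "memory-latency",   # placeholder
    "hardware": ["NVIDIA", "CPU"],    # placeholder
    "verify": True,                   # placeholder
    "ships": True,                    # placeholder
    "status": "active",               # placeholder
}

print(validate_entry(example))  # []
```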

Glossary

| Term | Plain meaning |
|---|---|
| kernel agent | An LLM-driven system that writes/optimizes GPU kernels, usually in a generate→profile→refine loop |
| substrate / task source | A classic benchmark suite used as raw material for agents; not itself designed to score them |
| correctness oracle | The machinery deciding "is this kernel's output right?"; its strictness decides whether cheating kernels score |
| motif | A fundamental computation pattern (dense LA, stencil, graph traversal, attention…); the coverage vocabulary |
| polyhedral (suite) | Kernels that are regular nested loops over arrays (GEMM-like); easy to analyze, verify, and transform |
| proxy/mini-app | A few-thousand-line stand-in for a real scientific application (e.g. XSBench ≈ a reactor code's hot loop) |
| verify ✅ / ships ✅ | Suite has a built-in correctness check / ships re-runnable reference kernels (anti-gaming, auditability) |
| roofline / SOL | The hardware ceiling (compute or bandwidth) a kernel could at best reach; "speed-of-light" in NVIDIA jargon |
| pass@k / budget | Score given k attempts; without a declared budget, best-of-many and single-shot results are incomparable |
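
To see why an undeclared budget makes results incomparable, the sketch below implements the widely used unbiased pass@k estimator, 1 − C(n−c, k)/C(n, k), where n attempts were sampled and c of them passed. The benchmarks catalogued here may define their budget metrics differently; this shows only the generic estimator.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n sampled attempts of which c passed."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing attempt
    return 1.0 - comb(n - c, k) / comb(n, k)


# Same agent, same 10 attempts, 3 of them correct:
print(f"pass@1 = {pass_at_k(10, 3, 1):.3f}")  # 0.300
print(f"pass@5 = {pass_at_k(10, 3, 5):.3f}")  # 0.917
```

The same set of attempts yields 0.30 at k=1 and about 0.92 at k=5, so a headline number means little without its k.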

Contributing

Edit data/benchmarks.yaml (new entries) or data/scorecard.yaml (methodology grades, the most valuable PR: read a harness's source, grade it against the criteria, cite the evidence). Then run python3 scripts/generate.py and commit the regenerated files. See CONTRIBUTING.md.

Citation

If this catalog or its methodology scorecard is useful in your research, please cite:

@misc{you2026awesomekernelbenchmark,
  author       = {Lianzhong You},
  title        = {Awesome Kernel Benchmark: An Evidence-Graded Catalog of
                  Benchmarks for LLM Kernel Agents},
  year         = {2026},
  publisher    = {GitHub},
  url          = {https://github.com/youyve/awesome-kernel-benchmark}
}

License

Catalog content: CC-BY-4.0 (attribution required) · code in scripts/: MIT · each listed benchmark retains its own license (per-entry license field).
