Awesome Kernel Benchmark

Which benchmark should score your kernel agent, and how does each one actually measure? A curated, evidence-graded catalog of every benchmark relevant to LLM/agent GPU-kernel generation. Chinese version: README.zh.md

30-second orientation. This list contains two different kinds of things, and keeping them apart is the whole game:

Agent benchmarks: harnesses designed to score LLM kernel generation (KernelBench and the 27 other purpose-built benchmarks inventoried in section ②). These produce the headline numbers in papers.
Task sources (a.k.a. substrates): the classic HPC/GPU suites that agents are evaluated on or asked to optimize (PolyBench, NPB, Rodinia, …). An agent paper saying "we optimize XSBench kernels" is using a task source, not a benchmark; its numbers are comparable only to those of another paper that shares both the task source and the measurement method.

Sibling list: awesome-kernel-agent (the agents themselves). Single source of truth: data/benchmarks.yaml + data/scorecard.yaml; every table below is generated — do not hand-edit.

151 benchmarks · 28 purpose-built agent benchmarks · 123 substrate / dataset / tooling entries across 13 families.

Layer	Count
agent-benchmark	28
substrate-suite	110
dataset	3
tooling	10

Top hardware targets	Entries
NVIDIA	122
CPU	67
AMD	45
Intel-GPU	13
FPGA	7
Ascend-NPU	6
TPU	3
Cambricon	2

① What should I evaluate my agent on?

Pick by what you need to prove:

"My numbers are comparable to prior work" → KernelBench — the de-facto standard everyone reports. Know its limits (see scorecard) and never use it as sole evidence.
"My speedups are not reward-hacking artifacts" → robust-kbench — anti-gaming filters, forward+backward, 1e-5 tolerances.
"My kernels are correct on real production shapes" → BackendBench — PyTorch OpInfo edge cases + TorchBench production traces; ships kernels as a pip backend.
"My kernels are fast in an absolute sense, not just vs eager" → SOL-ExecBench / CUDABench — ceiling-relative scores (caveat: both use datasheet peaks, which are 8–15% loose vs measured ceilings).
"My agent's wins survive a real serving stack" → ISO-Bench (vLLM/SGLang merged-PR tasks) / FlashInfer-Bench.
"My results are honest about compute budget" → GEAK — the only one decomposing sequential@k vs parallel@k.
Triton specifically → TritonBench; ROCm/AMD → GEAK, AgentKernelArena, NPUEval; Ascend NPU → MultiKernelBench, AscendKernelBench, NPUKernelBench; CUDA correctness with hidden tests → ComputeEval.

The honest answer for a serious paper is a portfolio: one comparability anchor (KernelBench) + one hardened oracle (robust-kbench or BackendBench) + one absolute metric (SOL-ExecBench-style) + explicit budget reporting. No single existing benchmark covers all four — that gap is visible in the scorecard below.

② How does each benchmark actually measure? — the methodology scorecard

Two benchmarks can give the same agent wildly different numbers, because the score is a product of five design choices: the task set, the correctness oracle, the timing methodology, the baseline, and the budget accounting. This scorecard grades each benchmark on the choices that separate trustworthy numbers from inflated ones — graded only from primary evidence (harness source code / paper), never from README claims.
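To make the point about design choices concrete, here is a minimal sketch of how one set of measurements can yield different headline numbers depending on how correctness gates the score and how speedups are aggregated. The `TaskResult` fields, the function names, and the sample numbers are all hypothetical; no listed benchmark is claimed to compute its score this way.

```python
from dataclasses import dataclass
from statistics import geometric_mean


@dataclass
class TaskResult:
    """One task's outcome; all field names are hypothetical."""
    name: str
    correct: bool        # verdict of the correctness oracle
    baseline_ms: float   # measured runtime of the baseline
    kernel_ms: float     # measured runtime of the generated kernel


def gated_speedups(results):
    """Speedup per task; an incorrect kernel gets 0.0 (no credit)."""
    return [r.baseline_ms / r.kernel_ms if r.correct else 0.0 for r in results]


def fraction_faster(results, threshold=1.0):
    """Share of ALL tasks that are both correct and faster than threshold."""
    speedups = gated_speedups(results)
    return sum(s > threshold for s in speedups) / len(speedups)


def geomean_correct(results):
    """Geometric-mean speedup over the correct tasks only."""
    return geometric_mean([s for s in gated_speedups(results) if s > 0.0])


# Illustrative numbers only: the last kernel is very fast but wrong.
results = [
    TaskResult("gemm", True, 2.0, 1.0),
    TaskResult("softmax", True, 1.0, 2.0),
    TaskResult("conv", False, 4.0, 0.1),
]
print(gated_speedups(results))
print(fraction_faster(results), geomean_correct(results))
```

With a strict oracle the wrong-but-fast `conv` kernel earns nothing, so one task in three counts as a win and the geometric mean over correct tasks is about 1.0. A weaker oracle that accepted `conv` would credit a 40x speedup from the same measurements.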

Benchmark	Year	Hardware	Oracle	Timing	Baseline	Budget	Pick this if…
BackendBench	2025	NVIDIA	●	○	PyTorch op semantics (correctness-first; kernels run as a pip backend)	—	You want production-shape correctness and a path to actually shipping kernels as a backend.
ComputeEval	2025	NVIDIA	●	○	n/a (correctness-first; optional speedup vs reference)	◐	You want CUDA correctness with anti-overfitting (hidden tests) from NVIDIA's own harness.
GEAK Benchmarks	2025	AMD	◐	○	expert reference kernels (TritonBench-revised + ROCm set)	●	You target ROCm/Triton and want budget-decomposed (sequential vs parallel) reporting.
KernelBench	2025	NVIDIA	○	○	PyTorch eager (torch.compile default-mode as secondary)	—	You need comparability with the de-facto standard everyone reports — never as sole evidence.
KernelBench-v3	2025	NVIDIA	◐	◐	PyTorch eager (reference.py)	—	You want a community-hardened KernelBench variant reported across several GPUs.
MultiKernelBench	2025	NVIDIA · Ascend-NPU · TPU	○	○	each platform's own PyTorch backend	◐	You need multi-platform coverage (CUDA / Ascend C / TPU Pallas) and accept per-platform-only numbers.
TritonBench	2025	NVIDIA · AMD	◐	◐	PyTorch/reference impl; descriptive GPU-efficiency vs A100 theoretical peak	—	You generate Triton specifically and want real-world (GitHub-sourced) operators.
robust-kbench	2025	NVIDIA	●	◐	PyTorch eager (KernelBench-style)	—	You want to prove your speedups survive anti-gaming filters (the KernelBench exploit fixes).
CUDABench	2026	NVIDIA	?	◐	attainable-GFLOPs roofline ceiling (datasheet peak)	?	You want a roofline-relative Performance-Score with profiler-measured intensity (datasheet-ceiling caveat).
ISO-Bench	2026	NVIDIA	◐	◐	unoptimized baseline + human-merged-PR solution (dual reference)	◐	You want real serving-stack tasks (vLLM/SGLang merged PRs) with did-it-fix-the-bottleneck attribution.
SOL-ExecBench	2026	NVIDIA	◐	●	agent-optimized PyTorch baseline + analytical speed-of-light ceiling (datasheet peak)	—	You want an absolute (ceiling-relative) score on production LLM kernels — datasheet-ceiling caveat applies.

Grades (criteria in data/scorecard.yaml / SCHEMA.md): ● strong · ◐ partial · ○ weak · — none · ? unverified. Grades are assigned ONLY from primary evidence (harness source / paper) — no benchmark is graded from its README claims.

Not yet graded (17): C2HLSC, ParEval, AscendKernelBench (AKG-AGENT), CANN Bench, FlashInfer-Bench, HLS-Eval, NKIBench, NPUEval, QiMeng-TensorOp, QiMeng-Xpiler, TritonGym, AgentKernelArena, KernelBench-MUSA (MooreEval), KernelBenchX, KernelCraft, MSKernelBench, NPUKernelBench. PRs grading these against the criteria are the most valuable contribution this list can receive.

Full inventory of all purpose-built agent benchmarks (facets: abstraction, motifs, hardware, verification)

Benchmark	Year	Abstraction	Motifs	Hardware	Verify	Runs on (substrate)
C2HLSC	2024	kernel	mixed	FPGA	✅	—
ParEval	2024	kernel	dense-LA · sparse-LA · structured-grid · graph-traversal · reduction-scan · n-body · mixed	NVIDIA · AMD · CPU	✅	—
AscendKernelBench (AKG-AGENT)	2025	operator	dense-LA · elementwise · reduction-scan · mixed	Ascend-NPU · NVIDIA · CPU	✅	—
BackendBench	2025	operator	dense-LA · elementwise · reduction-scan · mixed	NVIDIA	✅	—
CANN Bench	2025	operator	mixed	Ascend-NPU	✅	—
ComputeEval	2025	kernel	dense-LA · reduction-scan · elementwise · mixed	NVIDIA	✅	—
FlashInfer-Bench	2025	operator	attention · dense-LA · mixed	NVIDIA	✅	—
GEAK Benchmarks	2025	kernel	dense-LA · attention · reduction-scan · mixed	AMD	✅	—
HLS-Eval	2025	kernel	mixed	FPGA	✅	—
KernelBench	2025	operator	dense-LA · structured-grid · reduction-scan · elementwise · attention · mixed	NVIDIA	✅	—
KernelBench-v3	2025	operator	dense-LA · attention · mixed	NVIDIA	✅	—
MultiKernelBench	2025	operator	dense-LA · attention · reduction-scan · elementwise · mixed	NVIDIA · Ascend-NPU · TPU	✅	—
NKIBench	2025	kernel	dense-LA · attention · mixed	Trainium	✅	—
NPUEval	2025	kernel	dense-LA · elementwise · reduction-scan · mixed	AMD	✅	—
QiMeng-TensorOp	2025	operator	dense-LA	RISC-V · ARM · NVIDIA	◐	—
QiMeng-Xpiler	2025	operator	dense-LA · mixed	Cambricon · NVIDIA · AMD · CPU	✅	—
TritonBench	2025	kernel	dense-LA · attention · reduction-scan · elementwise · mixed	NVIDIA · AMD	✅	—
TritonGym	2025	kernel	mixed	NVIDIA	✅	—
robust-kbench	2025	operator	dense-LA · attention · reduction-scan · mixed	NVIDIA	✅	—
AgentKernelArena	2026	kernel	mixed	AMD	✅	—
CUDABench	2026	operator	dense-LA · attention · reduction-scan · elementwise · mixed	NVIDIA	✅	—
ISO-Bench	2026	operator	mixed	NVIDIA	✅	—
KernelBench-MUSA (MooreEval)	2026	operator	dense-LA · attention · mixed	MooreThreads · NVIDIA	✅	—
KernelBenchX	2026	operator	dense-LA · attention · reduction-scan · elementwise · mixed	NVIDIA	✅	—
KernelCraft	2026	kernel	mixed	NVIDIA · AMD	✅	—
MSKernelBench	2026	operator	dense-LA · sparse-LA · attention · mixed	NVIDIA	✅	—
NPUKernelBench	2026	kernel	mixed	Ascend-NPU	✅	—
SOL-ExecBench	2026	operator	dense-LA · attention · mixed	NVIDIA	✅	—

③ Where do I get kernels/tasks to optimize?

The classic suites below are task sources: curated kernels with reference implementations that agents optimize, translate, or get fine-tuned on. Groups are ordered by how often kernel-agent work reaches for them; the Verify column (✅ built-in correctness oracle) is what makes a suite resistant to wrong-but-fast kernels, and Ships (✅ re-runnable reference kernels) is what makes results auditable.

DL operators & vendor baselines

The kernels your agent must beat: cuDNN/cuBLAS, FlashAttention, Triton tutorials, Liger. Use as reference implementations and speedup denominators.

Suite	Year · Org	Motifs	Languages / model	Verify	Ships	Used by (agents)
cuDNN / cuBLAS	2014 · NVIDIA	dense-LA · attention · elementwise · mixed	CUDA	◐	✅	AI CUDA Engineer, KernelBench, TritonBench
DeepBench	2016 · Baidu Research	dense-LA · elementwise	CUDA · HIP · ARM	◐	◐	—
DNNMark	2017 · Northeastern (NUCAR)	dense-LA · elementwise	CUDA · HIP	◐	◐	—
PyTorch operator_benchmark	2019 · Meta / PyTorch	dense-LA · elementwise · reduction-scan · mixed	Python · CUDA	◐	◐	—
NVBench	2021 · NVIDIA	mixed	CUDA/C++	—	—	—
Triton tutorial kernels	2021 · OpenAI / triton-lang	dense-LA · attention · reduction-scan · elementwise	Triton	✅	✅	TritonBench, GEAK, KernelLLM, AutoTriton
xFormers benchmarks	2021 · Meta AI (FAIR)	attention · dense-LA	CUDA · CUTLASS · Triton	✅	✅	—
FlashAttention benchmarks	2022 · Dao-AILab	attention · reduction-scan	CUDA · CuTeDSL	✅	✅	—
Liger-Kernel benchmarks	2024 · LinkedIn	reduction-scan · elementwise · attention	Triton	✅	✅	TritonBench

Tensor compilers & autotuners

Machine-generated baselines (TVM/Ansor, Hidet, CUTLASS profiler). Use when you want your agent compared against autotuned — not just eager — code.

Suite	Year · Org	Motifs	Languages / model	Verify	Ships	Used by (agents)
CUTLASS Profiler	2017 · NVIDIA	dense-LA	CUDA C++ · CuTe · Python DSL	✅	✅	KernelAgent, MANTIS, CUDABench
TVM / Ansor	2018 · Apache TVM / OctoML	dense-LA · mixed	TVM TE/TIR	✅	✅	—
IREE benchmarks	2020 · Google / OpenXLA	dense-LA · mixed	MLIR (Linalg)	✅	◐	—
MLIR microkernels	2020 · LLVM / MLIR community	dense-LA · elementwise	MLIR	◐	◐	—
AKG (Auto Kernel Generator)	2021 · Huawei / MindSpore	dense-LA · elementwise · reduction-scan · mixed	polyhedral DSL · Triton-Ascend	✅	✅	AKG kernel Agent
TenSet	2021 · UC Berkeley / OctoML (Zheng)	dense-LA	TVM schedules	—	◐	—
Roller	2022 · Microsoft Research Asia	dense-LA	TVM TE · CUDA	✅	✅	—
Hidet	2023 · U. Toronto / CentML	dense-LA · attention	Python DSL · CUDA	✅	✅	—
Welder	2023 · MSR Asia / Peking U.	dense-LA · mixed	TVM · CUTLASS · CUDA	✅	✅	—

LLM serving benchmarks

Where kernel wins become end-to-end wins: TTFT/TPOT/throughput harnesses (vLLM, SGLang). Use to show a kernel matters at the serving level.

Suite	Year · Org	Motifs	Languages / model	Verify	Ships	Used by (agents)
LLMPerf	2023 · Ray project (Anyscale)	attention	Python	◐	—	—
vLLM benchmarks	2023 · vLLM project (UC Berkeley → community)	attention · dense-LA	Python	◐	—	FlashInfer-Bench
FlexAttention / attention-gym	2024 · Meta / PyTorch	attention	Python · Triton	✅	✅	—
GenAI-Perf	2024 · NVIDIA	attention	Python	—	—	—
SGLang benchmarks	2024 · SGLang project	attention · dense-LA	Python	◐	—	Astra
AIPerf	2025 · NVIDIA	attention	Python	—	—	—

Whole-model DL benchmarks

Model-level suites (MLPerf, TorchBench, TorchInductor). Use to bound end-to-end impact of kernel-level changes.

Suite	Year · Org	Motifs	Languages / model	Verify	Ships	Used by (agents)
Fathom	2016 · Harvard (VLSI-Arch)	dense-LA · attention · mixed	Python · TensorFlow	—	—	—
DAWNBench	2017 · Stanford DAWN	dense-LA · mixed	Python	✅	—	—
AI-Benchmark	2018 · ETH Zurich	dense-LA · mixed	Python · TensorFlow	◐	—	—
AIBench	2018 · ICT CAS / BenchCouncil	dense-LA · attention · mixed	Python · C++	◐	—	—
MLPerf Training	2018 · MLCommons	dense-LA · attention · mixed	Python · PyTorch · TensorFlow	✅	—	—
MLPerf Inference	2019 · MLCommons	dense-LA · attention · mixed	Python · C++	✅	—	—
Megatron-LM benchmarks	2019 · NVIDIA	dense-LA · attention · mixed	Python · CUDA	◐	◐	—
MLPerf Tiny	2021 · MLCommons	dense-LA · elementwise · mixed	C · C++	✅	◐	—
TorchBench	2021 · Meta / PyTorch	dense-LA · attention · mixed	Python · CUDA	✅	—	—
NVIDIA Transformer Engine benchmarks	2022 · NVIDIA	dense-LA · attention	Python · C++ · CUDA	✅	✅	—
TorchInductor / Dynamo suite	2022 · Meta / PyTorch	dense-LA · attention · elementwise · mixed	Python · Triton	✅	◐	KernelBench-style agents (baseline)

Sparse linear algebra

SpMV/SpMM/SDDMM kernels and the matrix collections (SuiteSparse, DLMC) they run on. Irregular memory patterns — a hard, under-benchmarked motif.

Suite	Year · Org	Motifs	Languages / model	Verify	Ships	Used by (agents)
SuiteSparse Matrix Collection	2011 · Texas A&M (Davis); orig. U. Florida	sparse-LA	Matrix Market	—	—	Auto-SpMV
cuSPARSE	2014 · NVIDIA	sparse-LA	CUDA	◐	✅	—
TACO benchmarks	2017 · MIT CSAIL (Kjolstad)	sparse-LA	C++	✅	✅	—
ASpT	2019 · Academic (Hong et al.)	sparse-LA	CUDA · OpenMP	◐	✅	—
DLMC (Deep Learning Matrix Collection)	2020 · Google Research / Stanford	sparse-LA	Matrix Market	—	—	Sputnik, Magicube, SparseTIR
GE-SpMM / dgSPARSE	2020 · Tsinghua (Huang et al.)	sparse-LA · graph-traversal	CUDA	◐	✅	—
Sputnik	2020 · Google Research / Stanford	sparse-LA	CUDA/C++	✅	✅	—
cuSPARSELt	2020 · NVIDIA	sparse-LA	CUDA	◐	✅	—
vectorSparse	2021 · Academic (Chen et al.)	sparse-LA	CUDA	◐	✅	—
Magicube	2022 · ETH Zurich (Li et al.)	sparse-LA	CUDA	◐	✅	—
SparseTIR	2023 · U. Washington (SAMPL)	sparse-LA	TVM TensorIR · CUDA	✅	✅	—
VENOM / Spatha	2023 · Universidad de Malaga (Castro et al.)	sparse-LA	CUDA	◐	✅	—

Stencils, FFT & PDE

Structured-grid stencils, FFT/spectral kernels, and their DSLs (Halide, Devito). Classic memory-bound optimization targets with clean verification.

Suite	Year · Org	Motifs	Languages / model	Verify	Ships	Used by (agents)
benchFFT (FFTW)	2003 · MIT (Frigo, Johnson)	spectral	C	✅	✅	—
Stencil Probe	2007 · UC Berkeley (Kamil et al.)	structured-grid	C	◐	✅	—
Pochoir	2011 · MIT (Tang et al.)	structured-grid	C++	✅	✅	—
Halide benchmarks	2012 · MIT CSAIL	structured-grid · dense-LA	Halide DSL · CUDA	✅	✅	—
ExaStencils / ExaSlang	2014 · U. Erlangen / Passau	structured-grid	ExaSlang · CUDA · OpenMP · MPI	✅	✅	—
PolyMage	2015 · IISc Bangalore (Mullapudi et al.)	structured-grid	Python DSL · C	✅	✅	—
Devito benchmarks	2016 · Imperial College London	structured-grid	Python DSL · C	✅	✅	—
StencilGen (Artemis)	2018 · Ohio State (Rawat et al.)	structured-grid	CUDA	✅	✅	—
AN5D	2020 · U. Tokyo / RIKEN (Matsumura et al.)	structured-grid	CUDA	✅	✅	—
heFFTe	2020 · ICL U. Tennessee	spectral	C++ · CUDA · HIP	✅	✅	—

Graph analytics (irregular)

BFS/PageRank/connected-components with strong optimized baselines (Gunrock, GAPBS). Use to test agents beyond dense regular loops.

Suite	Year · Org	Motifs	Languages / model	Verify	Ships	Used by (agents)
LonestarGPU / Lonestar	2012 · UT Austin (ISS)	graph-traversal · unstructured-grid	CUDA · C++	✅	✅	—
Pannotia	2013 · AMD Research + U. Virginia	graph-traversal	OpenCL · HIP · CUDA	✅	✅	—
GAP Benchmark Suite	2015 · UC Berkeley (Beamer)	graph-traversal · sparse-LA	C++ · OpenMP	✅	✅	—
GraphBIG	2015 · Georgia Tech + IBM	graph-traversal	C++ · CUDA	✅	✅	—
Gunrock	2016 · UC Davis (Owens)	graph-traversal	CUDA/C++	✅	✅	—
GARDENIA	2017 · NUDT (Chen)	graph-traversal · sparse-LA	CUDA · OpenCL · OpenMP	✅	✅	—
GBBS	2020 · MIT / CMU (Dhulipala et al.)	graph-traversal	C++	✅	✅	—
Indigo3	2024 · Texas State (Burtscher)	graph-traversal	C · OpenMP · CUDA · HIP	✅	✅	—

Dense loop nests (PolyBench lineage)

Regular affine loop kernels (GEMM-like, stencils) with trivially checkable outputs — the easiest substrate to verify, and the most-used in agent papers.

Suite	Year · Org	Motifs	Languages / model	Verify	Ships	Used by (agents)
PolyBench/C	2010 · OSU / CSU (Pouchet, Yuki)	dense-LA · structured-grid · mixed	C	✅	✅	Performance-Aligned LLMs, ComPilot, ACCeLLiuM
PolyBench/Fortran	2012 · OSU (Pouchet, Narayan)	dense-LA · structured-grid	Fortran	✅	✅	—
PolyBench/GPU	2012 · U. Delaware (Grauer-Gray, Cavazos)	dense-LA · structured-grid	CUDA · OpenCL · HMPP · OpenACC	✅	✅	MIREncoder
PolyBench-ACC	2013 · Cavazos Lab, U. Delaware	dense-LA · structured-grid · mixed	CUDA · OpenCL · OpenACC · OpenMP · HMPP	✅	✅	CUDAnalyst, MEP
PolyBench-RAJA	2016 · U. Delaware (Killian)	dense-LA · structured-grid	C++ (RAJA)	✅	✅	—
PolyBench-NN	2019 · IIT Hyderabad (IITH-Compilers)	dense-LA · structured-grid	C	✅	✅	—
PolyBench/Python	2021 · UDC-GAC (U. da Coruna)	dense-LA · structured-grid	Python · NumPy	✅	✅	—

HPC proxy & mini-apps

DOE/NASA mini-apps (NPB, XSBench, LULESH): small, science-representative, almost all ship a built-in verification figure-of-merit.

Suite	Year · Org	Motifs	Languages / model	Verify	Ships	Used by (agents)
NPB (NAS Parallel Benchmarks)	1991 · NASA Advanced Supercomputing	dense-LA · sparse-LA · spectral · structured-grid · mixed	Fortran · C	✅	✅	CUDAnalyst, AutoParLLM
miniMD	2008 · Sandia (Mantevo)	n-body	C++ · Kokkos · OpenMP	✅	✅	—
LULESH	2010 · LLNL	unstructured-grid · n-body	C++ · CUDA · RAJA · OpenMP	✅	✅	—
miniFE	2011 · Sandia (Mantevo)	sparse-LA · unstructured-grid	C++ · CUDA · Kokkos · OpenMP	✅	✅	LLMPerf-Opt
CoMD	2012 · ExMatEx / ECP-CoPA	n-body	C · CUDA · OpenCL · OpenMP	✅	✅	—
PENNANT	2012 · LANL	unstructured-grid	C++ · CUDA · MPI	✅	✅	—
Nekbone	2013 · Argonne (Nek5000 team)	dense-LA · structured-grid	Fortran · CUDA-Fortran · OpenACC	✅	✅	—
HPGMG	2014 · LBNL	structured-grid	C · CUDA · OpenMP	✅	✅	—
RSBench	2014 · Argonne (ANL-CESAR)	monte-carlo · mixed	C · CUDA · OpenCL · SYCL · OpenMP-target	✅	✅	LLMPerf-Opt
XSBench	2014 · Argonne (ANL-CESAR)	monte-carlo · mixed	C · CUDA · OpenCL · SYCL · OpenMP-target	✅	✅	CUDAnalyst, LLMPerf-Opt, LASSI, OMPar, ParEval-Repo
miniAMR	2014 · Sandia (Mantevo)	structured-grid	C · MPI · OpenMP	✅	◐	—
BabelStream	2015 · U. Bristol (UoB-HPC)	structured-grid · data-movement	CUDA · HIP · SYCL · OpenCL · OpenMP · Kokkos · RAJA · TBB · Thrust	✅	✅	—
SimpleMOC	2015 · Argonne (CESAR) + MIT	monte-carlo · dense-LA	C · CUDA · OpenCL · OpenMP-target	✅	◐	ParEval-Repo
Quicksilver	2016 · LLNL	monte-carlo	C++ · CUDA · OpenMP	✅	✅	—
SU3_Bench	2019 · LBNL / NERSC	dense-LA	C · CUDA · HIP · SYCL · OpenMP-target	✅	✅	LASSI, OMPar

Classic GPU suites (pre-DL canon)

Rodinia, SHOC, Parboil, HeCBench: the broad-coverage GPU canon. HeCBench alone gives one kernel in 4+ programming models.

Suite	Year · Org	Motifs	Languages / model	Verify	Ships	Used by (agents)
CUDA Samples	2008 · NVIDIA	dense-LA · spectral · reduction-scan · mixed	CUDA	◐	✅	—
Mars	2008 · HKUST / Microsoft	mapreduce	CUDA	✅	◐	—
Parboil	2008 · UIUC (IMPACT)	dense-LA · sparse-LA · structured-grid · graph-traversal · n-body · mapreduce	C · CUDA · OpenCL · OpenMP	✅	✅	MIREncoder
GPGPU-Sim ISPASS-2009	2009 · UBC (Aamodt)	dense-LA · graph-traversal · mixed	CUDA	✅	✅	—
Rodinia	2009 · U. Virginia (Skadron / LAVA)	dense-LA · structured-grid · unstructured-grid · graph-traversal · dynamic-programming · mixed	C · CUDA · OpenCL · OpenMP	✅	✅	AutoParLLM, MIREncoder
SHOC	2010 · ORNL + U. Tennessee	dense-LA · sparse-LA · spectral · structured-grid · graph-traversal · reduction-scan · mixed	CUDA · OpenCL · MPI	✅	✅	MIREncoder
OpenDwarfs	2012 · Virginia Tech (Synergy)	dense-LA · sparse-LA · spectral · n-body · structured-grid · unstructured-grid · graph-traversal · dynamic-programming · combinational-logic · branch-and-bound · graphical-models · finite-state-machine	OpenCL	✅	✅	—
Hetero-Mark	2016 · Northeastern (NUCAR)	dense-LA · graph-traversal · mixed	CUDA · HIP · HC · OpenCL	✅	✅	—
Chai	2017 · UIUC (IMPACT) + U. Cordoba	dense-LA · graph-traversal · mixed	CUDA · OpenCL	✅	✅	—
Tartan	2018 · PNNL + ORNL	data-movement · mixed	CUDA · MPI	—	◐	—
Mirovia	2019 · UT Austin / VMware	dense-LA · structured-grid · mixed	CUDA	✅	✅	—
Tango	2019 · MoCA Lab (SJSU / UC Merced)	dense-LA · attention · mixed	CUDA · OpenCL	✅	✅	—
Altis	2020 · UT Austin (SCEA) / VMware	dense-LA · structured-grid · mixed	CUDA	✅	✅	—
SYCL-Bench	2020 · U. Salerno (unisa-hpc)	dense-LA · structured-grid · reduction-scan · mixed	SYCL	◐	✅	—
HeCBench	2021 · ORNL (Jin, Vetter)	dense-LA · sparse-LA · structured-grid · monte-carlo · graph-traversal · mixed	CUDA · HIP · SYCL · OpenMP-target	✅	✅	LASSI, OMPar, LLMPerf-Opt, Bolet et al.

Peak & roofline probes

STREAM, ERT, mixbench, gpu-burn: measure what the hardware can actually do. The calibration layer any ceiling-relative metric depends on.

Suite	Year · Org	Motifs	Languages / model	Verify	Ships	Used by (agents)
STREAM	1995 · John McCalpin (U. Virginia / TACC)	structured-grid · data-movement	C · Fortran	✅	✅	—
HPL (High Performance Linpack)	2000 · Netlib / ICL U. Tennessee	dense-LA	C · MPI · BLAS	✅	✅	—
HPCC (HPC Challenge)	2003 · ICL U. Tennessee	dense-LA · spectral · mapreduce · data-movement · mixed	C · MPI · BLAS	✅	✅	—
HPCG	2013 · Sandia / ICL (Heroux, Dongarra)	sparse-LA · structured-grid	C++ · MPI · OpenMP	✅	✅	—
gpu-burn	2013 · Ville Timonen	dense-LA	C++/CUDA	✅	◐	—
clpeak	2014 · Krishnaraj Bhat	mixed	C++ · OpenCL	—	◐	—
Empirical Roofline Toolkit (ERT)	2015 · LBNL CRD / Berkeley	mixed	C · MPI · OpenMP · CUDA	◐	✅	—
mixbench	2015 · U. Athens (Konstantinidis)	mixed	CUDA · OpenCL · HIP · SYCL · OpenMP	◐	✅	—
gpumembench	2016 · U. Athens (Konstantinidis)	data-movement	CUDA · OpenCL	◐	✅	—
HPL-MxP (HPL-AI)	2019 · ICL U. Tennessee	dense-LA	C/C++ · CUDA · HIP · MPI	✅	◐	—
nvbandwidth	2022 · NVIDIA	data-movement	C++/CUDA	◐	✅	—
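A ceiling-relative metric divides achieved performance by a roofline ceiling, so the choice between datasheet and measured peaks changes the score directly. The sketch below uses the textbook roofline formula; the hardware numbers are invented for illustration, and the 10% gap is an assumption standing in for the difference a probe such as those above would measure.

```python
def attainable_gflops(intensity, peak_gflops, peak_gbps):
    """Classic roofline: min(compute roof, bandwidth roof * intensity).

    intensity is in FLOP per byte moved and peak_gbps in GB/s, so the
    product intensity * peak_gbps is in GFLOP/s.
    """
    return min(peak_gflops, intensity * peak_gbps)


def ceiling_relative_score(achieved_gflops, intensity, peak_gflops, peak_gbps):
    """Fraction of the attainable ceiling that a kernel reaches."""
    return achieved_gflops / attainable_gflops(intensity, peak_gflops, peak_gbps)


# Hypothetical numbers chosen only for illustration (not a real GPU).
DATASHEET = {"peak_gflops": 10_000.0, "peak_gbps": 1_000.0}
# Assumption: measured ceilings sit 10% below the datasheet figures.
MEASURED = {key: 0.9 * value for key, value in DATASHEET.items()}

intensity = 2.0     # FLOP/byte: memory-bound under both ceilings
achieved = 1_500.0  # GFLOP/s actually delivered by the kernel

score_datasheet = ceiling_relative_score(achieved, intensity, **DATASHEET)
score_measured = ceiling_relative_score(achieved, intensity, **MEASURED)
print(f"{score_datasheet:.3f} vs {score_measured:.3f}")
```

The same kernel scores 0.75 against the datasheet ceiling and about 0.83 against the assumed measured one, which is why scores computed against different ceilings should not be compared directly.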

Compiler & HLS test suites

LLVM test-suite, SPEC ACCEL/OMP, MachSuite: codegen substrates with strict correctness baked in.

Suite	Year · Org	Motifs	Languages / model	Verify	Ships	Used by (agents)
LLVM test-suite	2004 · LLVM community	dense-LA · structured-grid · mixed	C · C++ · CUDA	✅	✅	—
CHStone	2008 · Ritsumeikan U.	combinational-logic · dense-LA · mixed	C	✅	✅	—
SPEC OMP2012	2012 · SPEC/HPG	dense-LA · structured-grid · n-body · mixed	C · C++ · Fortran · OpenMP	✅	✅	—
MachSuite	2014 · Harvard (Reagen et al.)	dense-LA · sparse-LA · spectral · graph-traversal · dynamic-programming · mixed	C	✅	✅	—
SPEC ACCEL	2014 · SPEC/HPG	dense-LA · structured-grid · mixed	OpenCL · OpenACC · OpenMP-target	✅	✅	—
Rosetta (HLS)	2018 · Cornell (Zhang et al.)	dense-LA · structured-grid · mixed	C++ HLS · OpenCL	✅	✅	—

NPU & emerging accelerators

Ascend, Cambricon, Tenstorrent, IPU substrates — where vendor-kernel scarcity makes agents most valuable and baselines weakest.

Suite	Year · Org	Motifs	Languages / model	Verify	Ships	Used by (agents)
Graphcore IPU benchmarks	2020 · Graphcore + U. Bristol	dense-LA · attention · mixed	Poplar C++ · PopLibs	◐	✅	—
Cambricon mlu-ops	2022 · Cambricon	dense-LA · elementwise · mixed	BANG C	◐	✅	—
Tenstorrent tt-metal	2023 · Tenstorrent	dense-LA · attention · mixed	C++ (TT-Metalium) · Python (TT-NN)	◐	✅	—
Ascend cann-ops	2024 · Huawei	dense-LA · elementwise · mixed	Ascend C	✅	✅	AscendOptimizer

④ Who can I compare against?

This table shows which kernel-agent works already evaluate on each classic task source. Two agents are directly comparable only if they share a task source and a measurement method. Results on task sources that ship re-runnable kernels (✅) are auditable; the rest are trust-me numbers.

Substrate	Family	Used by (kernel agents)	Ships kernels
AKG (Auto Kernel Generator)	tensor-compiler	AKG kernel Agent	✅
Ascend cann-ops	emerging-accelerator	AscendOptimizer	✅
CUTLASS Profiler	tensor-compiler	KernelAgent, MANTIS, CUDABench	✅
DLMC (Deep Learning Matrix Collection)	sparse-la	Sputnik, Magicube, SparseTIR	—
HeCBench	classic-gpu-suite	LASSI, OMPar, LLMPerf-Opt, Bolet et al.	✅
Liger-Kernel benchmarks	dl-micro	TritonBench	✅
NPB (NAS Parallel Benchmarks)	hpc-mini-app	CUDAnalyst, AutoParLLM	✅
Parboil	classic-gpu-suite	MIREncoder	✅
PolyBench-ACC	polyhedral	CUDAnalyst, MEP	✅
PolyBench/C	polyhedral	Performance-Aligned LLMs, ComPilot, ACCeLLiuM	✅
PolyBench/GPU	polyhedral	MIREncoder	✅
RSBench	hpc-mini-app	LLMPerf-Opt	✅
Rodinia	classic-gpu-suite	AutoParLLM, MIREncoder	✅
SGLang benchmarks	serving-inference	Astra	—
SHOC	classic-gpu-suite	MIREncoder	✅
SU3_Bench	hpc-mini-app	LASSI, OMPar	✅
SimpleMOC	hpc-mini-app	ParEval-Repo	◐
SuiteSparse Matrix Collection	sparse-la	Auto-SpMV	—
TorchInductor / Dynamo suite	dl-system	KernelBench-style agents (baseline)	◐
Triton tutorial kernels	dl-micro	TritonBench, GEAK, KernelLLM, AutoTriton	✅
XSBench	hpc-mini-app	CUDAnalyst, LLMPerf-Opt, LASSI, OMPar, ParEval-Repo	✅
cuDNN / cuBLAS	dl-micro	AI CUDA Engineer, KernelBench, TritonBench	✅
miniFE	hpc-mini-app	LLMPerf-Opt	✅
vLLM benchmarks	serving-inference	FlashInfer-Bench	—

The four most-used anchors — PolyBench · NPB · XSBench · HeCBench — are where a new HPC-kernel agent should report at least once.
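As a first comparability check, one can look up which task sources two works have in common. The sketch below copies four rows from the table above; the function name and the dictionary layout are assumptions for illustration, not part of the repository's scripts.

```python
# Excerpt of the table above: task source -> kernel-agent works using it.
USED_BY = {
    "HeCBench": {"LASSI", "OMPar", "LLMPerf-Opt", "Bolet et al."},
    "NPB": {"CUDAnalyst", "AutoParLLM"},
    "SU3_Bench": {"LASSI", "OMPar"},
    "XSBench": {"CUDAnalyst", "LLMPerf-Opt", "LASSI", "OMPar", "ParEval-Repo"},
}


def shared_sources(used_by, agent_a, agent_b):
    """Task sources on which both works report results, sorted by name."""
    return sorted(
        source
        for source, agents in used_by.items()
        if agent_a in agents and agent_b in agents
    )


print(shared_sources(USED_BY, "LASSI", "OMPar"))
print(shared_sources(USED_BY, "AutoParLLM", "LASSI"))
```

A non-empty result is necessary but not sufficient: the two works must also share a measurement method before their numbers can be placed side by side.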

Appendix A — motif coverage matrix

Motifs are fundamental computation patterns (dense linear algebra, stencils, graph traversal, attention, …). The vocabulary starts from Berkeley's classic 13-motif taxonomy and adds patterns such as attention, elementwise, and data-movement, giving the 19 rows below (including a catch-all mixed row). Rows = motifs, columns = task-source groups, cells = how many suites exercise that combination. Blank cells and near-empty rows are coverage gaps: motifs almost no benchmark tests (and a caution against "our agent generalizes" claims). This matrix is the task-selection tool we use when designing balanced benchmark suites.

Motif \ Family	DLOP	TC	SERV	DLSYS	SPARSE	STEN	GRAPH	POLY	HPC	GPU	NUM	COMP	ACCEL	Σ
dense-LA	6	9	2	11		1		7	4	13	4	6	4	67
sparse-LA					12		2		2	4	1	1		22
spectral						2			1	3	1	1		8
n-body									3	2		1		6
structured-grid						8		7	5	8	2	4		34
unstructured-grid							1		3	2				6
monte-carlo									4	1				5
graph-traversal					1		8			8		1		18
dynamic-programming										2		1		3
combinational-logic										1		1		2
branch-and-bound										1				1
graphical-models										1				1
finite-state-machine										1				1
mapreduce										2	1			3
attention	5	1	6	8						1			2	23
reduction-scan	4	1								3				8
elementwise	6	2		2									2	12
data-movement									1	1	4			6
mixed	3	4		10				2	3	12	4	6	4	48

Column key: DLOP = DL operators & vendor baselines · TC = Tensor compilers & autotuners · SERV = LLM serving benchmarks · DLSYS = Whole-model DL benchmarks · SPARSE = Sparse linear algebra · STEN = Stencils, FFT & PDE · GRAPH = Graph analytics (irregular) · POLY = Dense loop nests (PolyBench lineage) · HPC = HPC proxy & mini-apps · GPU = Classic GPU suites (pre-DL canon) · NUM = Peak & roofline probes · COMP = Compiler & HLS test suites · ACCEL = NPU & emerging accelerators
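The matrix above can be thought of as a simple tally over multi-tagged entries. The following sketch shows the counting rule on three rows copied from the tables in section ③; the dictionary layout and function names are assumptions and may not match the actual schema of data/benchmarks.yaml or the logic in scripts/generate.py.

```python
from collections import Counter

# Three rows copied from section ③; "family" uses the column
# abbreviations of the matrix. The dict layout is an assumption.
ENTRIES = [
    {"name": "STREAM", "family": "NUM",
     "motifs": ["structured-grid", "data-movement"]},
    {"name": "nvbandwidth", "family": "NUM", "motifs": ["data-movement"]},
    {"name": "XSBench", "family": "HPC", "motifs": ["monte-carlo", "mixed"]},
]


def coverage(entries):
    """Count suites per (motif, family) cell; a suite counts once per motif."""
    cells = Counter()
    for entry in entries:
        for motif in set(entry["motifs"]):
            cells[(motif, entry["family"])] += 1
    return cells


def row_total(cells, motif):
    """Row sum: suites exercising the motif across all families."""
    return sum(count for (m, _), count in cells.items() if m == motif)


cells = coverage(ENTRIES)
print(cells[("data-movement", "NUM")], row_total(cells, "data-movement"))
```

Because a suite carries several motif tags, it contributes to several rows, so row sums add up to more than the number of suites.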

Appendix B — how this list is built

Data layer: data/benchmarks.yaml (one entry per benchmark, multi-tag facets: layer, abstraction, motifs, bottleneck, hardware, verify, ships, status — full vocabularies in SCHEMA.md) + data/scorecard.yaml (methodology grades with evidence links). scripts/generate.py validates and renders everything; --check runs in CI.
Inclusion scope: anything a kernel agent could plausibly be evaluated on, fine-tuned against, or asked to optimize/translate.
Provenance & dedup: suites repackaging others (SPEC ACCEL ⊂ Parboil/Rodinia/NPB; KernelBench derivatives) are marked, not double-counted. A status field tracks liveness — catalogs rot.
Design lineage: Berkeley's 13 computational motifs, the Herten et al. HPC benchmark survey YAML-to-tables pattern, and the methodology-scorecard layer introduced here.
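For readers who want a feel for what validating an entry might involve, here is a minimal sketch. The four layer values come from the summary table near the top of this README; everything else is assumed for illustration, including the required-field list, the `yes` / `partial` / `no` grade encoding (stand-ins for the three symbols used in the Verify and Ships columns), and the example values. The authoritative vocabularies live in SCHEMA.md, and scripts/generate.py may work quite differently.

```python
# Layer values are the four listed in this README's summary table.
LAYERS = {"agent-benchmark", "substrate-suite", "dataset", "tooling"}
# Hypothetical encoding of the three Verify/Ships grade symbols.
GRADES = {"yes", "partial", "no"}
# Assumed subset of the facets named in the data-layer description.
REQUIRED = ("name", "layer", "motifs", "hardware", "verify", "ships", "status")


def validate_entry(entry):
    """Return a list of human-readable problems; empty means valid."""
    problems = [f"missing field: {f}" for f in REQUIRED if f not in entry]
    if "layer" in entry and entry["layer"] not in LAYERS:
        problems.append(f"unknown layer: {entry['layer']!r}")
    for field in ("verify", "ships"):
        if field in entry and entry[field] not in GRADES:
            problems.append(f"unknown {field} grade: {entry[field]!r}")
    if "motifs" in entry and not entry["motifs"]:
        problems.append("motifs must be a non-empty list")
    return problems


# Illustrative values only; not copied from data/benchmarks.yaml.
GOOD = {
    "name": "XSBench", "layer": "substrate-suite",
    "motifs": ["monte-carlo", "mixed"], "hardware": ["NVIDIA", "CPU"],
    "verify": "yes", "ships": "yes", "status": "active",
}
BAD = {"name": "mystery-suite", "layer": "benchmark", "motifs": []}

print(validate_entry(GOOD))
print(validate_entry(BAD))
```

Returning a list of problems rather than raising on the first one lets a CI check report every issue in a contributed entry at once.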

Glossary

Term	Plain meaning
kernel agent	An LLM-driven system that writes/optimizes GPU kernels, usually in a generate→profile→refine loop
substrate / task source	A classic benchmark suite used as raw material for agents — not itself designed to score them
correctness oracle	The machinery deciding "is this kernel's output right?" — its strictness decides whether cheating kernels score
motif	A fundamental computation pattern (dense LA, stencil, graph traversal, attention…); the coverage vocabulary
polyhedral (suite)	Kernels that are regular nested loops over arrays (GEMM-like) — easy to analyze, verify, and transform
proxy/mini-app	A few-thousand-line stand-in for a real scientific application (e.g. XSBench ≈ a reactor code's hot loop)
verify ✅ / ships ✅	Suite has a built-in correctness check / ships re-runnable reference kernels (anti-gaming, auditability)
roofline / SOL	The hardware ceiling (compute or bandwidth) a kernel could at best reach; "speed-of-light" in NVIDIA jargon
pass@k / budget	Score given k attempts; without a declared budget, best-of-many and single-shot results are incomparable
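To see why an undeclared budget makes results incomparable, consider the unbiased pass@k estimator that is common in code-generation evaluation. This is a sketch of that general estimator, not a description of how any benchmark listed here scores. It models k independent samples, so it is closer in spirit to a parallel budget than to a sequential refinement loop.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n sampled attempts, c of them passing.

    Probability that at least one of k attempts drawn without replacement
    from the n samples passes: 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Same hypothetical agent (2 passing kernels out of 10 samples),
# reported under two different budgets.
single_shot = pass_at_k(n=10, c=2, k=1)
best_of_five = pass_at_k(n=10, c=2, k=5)
print(f"pass@1 = {single_shot:.2f}, pass@5 = {best_of_five:.2f}")
```

The identical set of samples reads as roughly 0.20 at k = 1 and roughly 0.78 at k = 5, so a paper that omits k leaves its number uninterpretable.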

Related lists

awesome-kernel-agent — the agents, DSLs, datasets, and tooling.
Herten et al., An HPC Benchmark Survey and Taxonomy (IJHPCA 2026).
Awesome-LLM4Kernel · awesome-LLM-driven-kernel-generation.

Contributing

Edit data/benchmarks.yaml (new entries) or data/scorecard.yaml (methodology grades). Grading is the most valuable PR: read a harness's source, grade it against the criteria, and cite the evidence. Then run python3 scripts/generate.py and commit the regenerated files. See CONTRIBUTING.md.

Citation

If this catalog or its methodology scorecard is useful in your research, please cite:

@misc{you2026awesomekernelbenchmark,
  author       = {Lianzhong You},
  title        = {Awesome Kernel Benchmark: An Evidence-Graded Catalog of
                  Benchmarks for LLM Kernel Agents},
  year         = {2026},
  publisher    = {GitHub},
  url          = {https://github.com/youyve/awesome-kernel-benchmark}
}

License

Catalog content: CC-BY-4.0 (attribution required) · code in scripts/: MIT · each listed benchmark retains its own license (per-entry license field).
