AMD PACE is a high-performance LLM inference engine built from the ground up for AMD EPYC CPUs. It is a PyTorch C++ extension that combines custom AVX512 kernels, CPU-native KV cache management, and a production-ready serving stack to deliver maximum throughput on AMD server-class hardware.
🔥 Check out our blog post: AMD PACE: High-Performance Platform Aware Compute Engine — deep dive into the architecture, optimizations, and benchmarks showing 1.6x autoregressive and 3.2x speculative decoding throughput speedup over vLLM on 5th Gen AMD EPYC™ processors.
NOTE: AMD PACE is designed and tested for systems with AVX512 or higher support. On systems lacking AVX512, performance may degrade significantly due to fallback to slower reference implementations, or the library might not function as intended.
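Before installing, you can confirm that your CPU exposes AVX512 by reading the feature flags the Linux kernel reports. The helper below is a standalone illustration (not part of PACE) that checks for the baseline `avx512f` flag:

```python
def cpu_flags(cpuinfo_path="/proc/cpuinfo"):
    """Collect the CPU feature flags the Linux kernel reports."""
    flags = set()
    with open(cpuinfo_path) as f:
        for line in f:
            if line.startswith("flags"):
                flags.update(line.split(":", 1)[1].split())
    return flags


def has_avx512(flags=None):
    """True if the AVX512 foundation flag (avx512f) is present."""
    if flags is None:
        flags = cpu_flags()
    return "avx512f" in flags
```

On an AVX512-capable EPYC system `has_avx512()` returns True; otherwise PACE may fall back to slower reference paths as noted above.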
- **SlabPool Attention** — A CPU-native KV cache and attention backend engineered for AMD EPYC processors. SlabPool manages all sequences in a single pre-allocated BF16 tensor with O(1) slab allocation, L2-aware block sizing, and a unified attention dispatcher that selects the optimal kernel path per sequence — GQA-aware decode with online softmax, multi-token decode, or tiled prefill — all within one OMP dispatch. Handles offline and online inference, continuous batching, sliding window, and sink attention through a single entry point. More in SlabAttention.md.
- **Inference Server** — `pace-server` provides a full serving stack with a router/engine architecture, continuous batching, multi-instance NUMA-aware execution, and built-in metrics. The launcher automatically partitions CPU cores across engine instances and binds memory to the local NUMA node for optimal data locality. More in InferenceServer.md.
- **Paged Attention** — An implementation of a vLLM-style paged KV cache on CPU. Memory is allocated in fixed-size pages, eliminating fragmentation from variable-length sequences and enabling efficient memory sharing. Fully integrated with the PACE serving stack and all supported models.
- **Fused AVX512 Kernels** — PACE ships a suite of fused operators that eliminate intermediate memory traffic and keep data in registers/cache: fused Add+RMSNorm and Add+LayerNorm, fused RoPE, fused QKV projections, and a fused MLP kernel (via TPP/libXSMM). These are the default operators for all supported models.
- **Broad Model Support** — Llama (up to 3.3), Qwen2/2.5, Phi3/4, Gemma 3, GPT-J, OPT, and GPT-OSS, all running in BF16 with the same operator and backend framework. Adding a new architecture is a single-file effort. More in LLM.md.
- **Speculative Decoding (PARD)** — Built-in PARallel Draft Model Adaptation that runs a smaller draft model ahead of the target, verifying speculated tokens in a single parallel forward pass. PARD can deliver up to 5x throughput improvement over standard autoregressive decoding. More in SpeculativeDecoding.md.
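The O(1) slab allocation mentioned above can be pictured as a free list over a pre-allocated pool. The toy Python sketch below illustrates only the allocation idea; it is not PACE's actual SlabPool code, and the class and method names are invented for illustration:

```python
from collections import deque


class SlabFreeList:
    """Toy constant-time slab allocator over a fixed pool of slab slots.

    In the real backend the slots would index into one pre-allocated
    BF16 tensor; here they are just integers.
    """

    def __init__(self, num_slabs):
        self.free = deque(range(num_slabs))  # indices of unused slabs
        self.owner = {}                      # slab index -> sequence id

    def allocate(self, seq_id):
        if not self.free:
            raise MemoryError("slab pool exhausted")
        slab = self.free.popleft()  # O(1) pop from the free list
        self.owner[slab] = seq_id
        return slab

    def release(self, slab):
        del self.owner[slab]
        self.free.append(slab)      # O(1) return to the free list
```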
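The paged KV cache idea (fixed-size pages plus a per-sequence page table) can likewise be sketched as a mapping from a logical token position to a physical page and offset. This is a generic illustration of the vLLM-style scheme, not PACE's implementation, and the names are hypothetical:

```python
PAGE_SIZE = 16  # tokens per page; illustrative, not PACE's value


class PagedSeq:
    """Toy page table for one sequence in a paged KV cache."""

    def __init__(self):
        self.pages = []  # physical page ids, in logical order

    def slot(self, pos, free_pages):
        """Map logical token position -> (physical page, offset),
        allocating fixed-size pages on demand from a shared free list."""
        page_idx, offset = divmod(pos, PAGE_SIZE)
        while len(self.pages) <= page_idx:
            self.pages.append(free_pages.pop())  # grab any free page
        return self.pages[page_idx], offset
```

Because every allocation is a whole fixed-size page, variable-length sequences cannot fragment the pool.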
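For reference, the math computed by a fused Add+RMSNorm operator is a residual add followed by RMS normalization. The unfused NumPy sketch below shows the semantics only; PACE's AVX512 kernel produces the same result without writing the intermediate sum back to memory:

```python
import numpy as np


def add_rmsnorm(x, residual, weight, eps=1e-6):
    """Reference (unfused) semantics of a fused Add+RMSNorm kernel."""
    h = x + residual  # residual add; the fused kernel keeps this in registers
    rms = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)
    return (h / rms) * weight  # normalize, then scale by the learned weight
```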
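The verification step of speculative decoding can be illustrated with a toy greedy acceptance loop: drafted tokens are accepted until the first position where the target model disagrees, at which point the target's token is taken instead. This is illustrative only; PARD's actual scheme is described in SpeculativeDecoding.md:

```python
def verify_draft(draft_tokens, target_tokens):
    """Toy greedy verification for speculative decoding.

    draft_tokens: tokens proposed by the draft model.
    target_tokens: the token the target model emits at each drafted
    position, obtained from one parallel forward pass.
    Returns the accepted prefix, with the target's correction token
    substituted at the first mismatch.
    """
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            accepted.append(t)  # first mismatch: keep the target's token
            break
        accepted.append(d)      # match: drafted token accepted "for free"
    return accepted
```

When most drafted tokens are accepted, several output tokens are produced per target-model forward pass, which is the source of the speedup.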
- Installation
- Inference Server
- More about AMD PACE
- Models Supported
- Examples
- Performance Guide
- Benchmarks
- SlabPool Attention
- Contributing to AMD PACE
- Tests
- Known Limitations
- External Dependencies
- Resources
## Installation

To install AMD PACE, follow the instructions below:
NOTE: AMD PACE requires `gcc` >= 12 and `make` to be installed. On Ubuntu, they can be installed with:

```shell
sudo apt install build-essential gcc-12 g++-12
```
- We recommend using a miniforge environment for installing AMD PACE. Install miniforge from here. Once miniforge is installed, create an environment with Python 3.12 as follows:

  ```shell
  conda create -n pace-env-py3.12 python=3.12 -y
  conda activate pace-env-py3.12
  ```

  NOTE: AMD PACE is tested to work with Python 3.10 through 3.13. Python 3.12 is recommended for the best compatibility with dependencies.
- Install the required dependencies for AMD PACE as follows:

  ```shell
  pip install -r requirements.txt
  ```
- Build AMD PACE from source as follows:

  ```shell
  pip install -r build_requirements.txt [-v] .
  ```

  This will build AMD PACE and install it in the current environment. The `-v` option is optional and enables verbose output during the build.

  NOTE: This uses the modern way of building packages with pip; for more details, refer to PEP 517. `build_requirements.txt` should be passed in during installation to ensure that the build environment is set up correctly; refer to PEP 518 for more details.

  For developers who need to build AMD PACE frequently, using pip with `--no-build-isolation` is recommended to avoid the overhead of creating an isolated environment for each build, which speeds up the build significantly. Make sure all required build dependencies are installed in your environment before using this option:

  ```shell
  pip install --no-build-isolation [-v] .
  ```

  NOTE: Building AMD PACE, especially the oneDNN component, can require significant memory. If your system does not have enough RAM, the build process may fail or your machine may run out of memory.
## Models Supported

The following models are supported by AMD PACE:
## Examples

The `examples/` directory contains runnable scripts and notebooks to get started with PACE:
- PACE LLM Basic - basic offline generation example
- PACE LLM Streamer - streaming text generation example
- PACE GPT-OSS Chat Notebook - offline GPT-OSS chat workflow with chat templating and final-answer extraction
- PACE Sarvam Translate Quickstart - offline translation notebook for Sarvam Translate
- PACE Server Basic - minimal inference server example
- Server Playbook - end-to-end inference server notebook
- Speculative Server Playbook - speculative decoding server notebook
## Benchmarks

Benchmarks for AMD PACE are available in the `benchmarks` directory. The benchmarks include:
Logging verbosity is controlled via the `PACE_LOG_LEVEL` environment variable. The following levels are supported:
| Level | Environment Variable |
|---|---|
| Debug | export PACE_LOG_LEVEL=debug |
| Profile | export PACE_LOG_LEVEL=profile |
| Info | export PACE_LOG_LEVEL=info |
| Warning | export PACE_LOG_LEVEL=warning |
| Error | export PACE_LOG_LEVEL=error |
| None | export PACE_LOG_LEVEL=none |
NOTE: By default, the log level is set to info.
## Contributing to AMD PACE

We welcome contributions to AMD PACE! Please see docs/Contributing.md for guidelines on adding operators, creating core functions, code style, testing, and submitting PRs.
## Known Limitations

The following feature combinations are not yet supported in this release and may not work as expected:
| Feature | Unsupported Configuration | Supported Alternative |
|---|---|---|
| GPT-OSS | JIT attention backend, PAGED cache type | SLAB_POOL cache with SLAB attention backend |
| PARD speculative decoding (offline) | PAGED cache type | BMC or SLAB_POOL cache types |
| PARD speculative decoding (server) | PAGED or SLAB_POOL cache types | BMC cache type |
## External Dependencies

| Library | Version | Description |
|---|---|---|
| PyTorch | v2.9.0 | Core framework (also a runtime dependency) |
| oneDNN | v3.11 | JIT-compiled kernels for attention, norms, and linear ops |
| FBGEMM | v1.2.0 | Quantized EmbeddingBag kernels |
| libXSMM | c14cbc6 | Tensor Processing Primitives (TPP) for MLPs and linear ops |
| AOCL-DLP | 3256da1 | AMD AOCL Deep Learning Primitives for GEMMs |
Key Python dependencies:

| Package | Version |
|---|---|
| transformers | 4.55.2 |
| safetensors | >= 0.5.2 |
| huggingface-hub | 0.35.0 |
| fastapi | 0.115.12 |
| uvicorn | 0.34.2 |
| prometheus_client | 0.23.1 |
See requirements.txt for the full list of Python dependencies.