# AMD PACE v1.1.0
Blog post: AMD PACE: High-Performance Platform Aware Compute Engine — architecture deep dive, optimization techniques, and benchmarks on 5th Gen AMD EPYC™ processors.
## Performance Optimizations
- SlabPool Attention: A new CPU-native KV cache and unified AVX-512 attention backend. SlabPool manages all sequences in a single pre-allocated BF16 tensor with O(1) slab allocation, L2-aware auto-tuned block sizes, and a dispatcher that selects the optimal kernel per sequence — GQA-aware decode with online softmax, multi-token decode, or BRGeMM-tiled prefill.
- Paged Attention: vLLM-style paged KV cache on CPU with block-level memory management. Supports offline and server modes for all models where the PAGED cache type is supported (see Known Limitations).
- Fused AVX-512 Kernels: New suite of fused operators — fused Add+RMSNorm and Add+LayerNorm, fused RoPE, fused QKV projections, and fused MLP via libXSMM TPP — eliminating intermediate memory traffic and keeping data in registers/cache.
- AOCL-DLP Backend: AMD AOCL Deep Learning Primitives backend for BF16 linear operations with fused activations (ReLU, GELU, SiLU) and fused element-wise multiply.
- Server Decode Optimizations: Batched decode, vectorized block table population, simplified MaskCache with content-aware buffer reuse, and automatic NUMA-aware process affinity.
- BMC KV Cache Improvements: Optimized cache writes and added BRGeMM tiled prefill attention for the BMC backend.
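The "online softmax" mentioned for the GQA-aware decode kernel is a standard single-pass technique: the KV cache is scanned in blocks while a running maximum and running denominator are maintained, so the full score vector never has to be materialized. Below is a minimal NumPy sketch of the idea for a single decode query; the function name, block size, and layout are illustrative assumptions, not PACE APIs, and the real kernels operate on BF16 slabs with AVX-512.

```python
import numpy as np

def decode_attention_online(q, k_cache, v_cache, block=4):
    """Single-token decode attention over a KV cache, computed in one pass.

    q: (d,) query vector; k_cache, v_cache: (T, d) cached keys/values.
    Processes KV in blocks, carrying a running max `m`, a running softmax
    denominator `s`, and a running value accumulator `acc`.
    """
    d = q.shape[0]
    scale = 1.0 / np.sqrt(d)
    m = -np.inf               # running max of attention scores
    s = 0.0                   # running softmax denominator
    acc = np.zeros(d)         # running weighted sum of values
    for start in range(0, k_cache.shape[0], block):
        kb = k_cache[start:start + block]
        vb = v_cache[start:start + block]
        scores = kb @ q * scale
        m_new = max(m, scores.max())
        correction = np.exp(m - m_new)   # rescale previous accumulators
        p = np.exp(scores - m_new)
        s = s * correction + p.sum()
        acc = acc * correction + p @ vb
        m = m_new
    return acc / s
```

Because each block only rescales the accumulators by `exp(m - m_new)`, the result is bit-for-bit equivalent (up to floating-point rounding) to softmax over the whole cache, while touching each KV block exactly once.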
## Functionality
- New Models: Gemma 3 (causal and conditional generation) and GPT-OSS.
- Inference Server Enhancements: Multi-instance serving with automatic NUMA core partitioning, continuous prefill-first scheduler, batch decoding, and Prometheus metrics integration (TTFT, TPOT, request rates).
- Speculative Decoding (PARD): Extensible speculative decoding module with full serving integration, supporting batched multi-token verification.
- Pluggable Operator Framework: Python operator framework with 5 interchangeable backends (NATIVE, JIT, TPP, IMBPS, AOCLDLP) and automatic fallback. Operators can be configured per-type via OperatorConfig.
- Penalties & Generation Config: Support for repetition penalty, frequency penalty, temperature, top-k, top-p, and min-p sampling with configurable generation parameters.
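The sampling controls listed above compose in a standard way: penalties adjust logits, temperature rescales them, and top-k / min-p / top-p progressively prune the distribution before drawing a token. The sketch below illustrates that pipeline in NumPy; the function name, argument names, and default values are assumptions for illustration, not the PACE generation API.

```python
import numpy as np

def sample_next_token(logits, prev_tokens, rep_penalty=1.1,
                      temperature=0.8, top_k=50, top_p=0.9, min_p=0.05,
                      rng=None):
    """Illustrative sampling pipeline: repetition penalty -> temperature ->
    top-k -> min-p -> top-p (nucleus) -> categorical draw."""
    logits = logits.astype(np.float64).copy()
    # Repetition penalty: damp logits of tokens already generated.
    for t in set(prev_tokens):
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    logits /= temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-k: keep only the k most likely tokens.
    if top_k and top_k < len(probs):
        cutoff = np.sort(probs)[-top_k]
        probs[probs < cutoff] = 0.0
    # Min-p: drop tokens below a fraction of the top probability.
    probs[probs < min_p * probs.max()] = 0.0
    # Top-p (nucleus): smallest high-probability set covering top_p of the
    # remaining mass, taken in descending-probability order.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p * cum[-1]) + 1]
    masked = np.zeros_like(probs)
    masked[keep] = probs[keep]
    masked /= masked.sum()
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(masked), p=masked))
```

With `top_k=1` the pipeline degenerates to greedy decoding, which makes the filtering order easy to sanity-check.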
## Validation
- Upgraded Dependencies: PyTorch v2.9.0, oneDNN v3.11, GCC 14 build support.
- Correctness Tests: Model-level tests comparing PACE outputs against HuggingFace reference implementations.
- Documentation: Comprehensive overhaul of all docs — PerformanceGuide, Contributing, LLM, InferenceServer, PythonOps, SpeculativeDecoding — with server playbooks and example notebooks.
- Validated Models:
  - PARD models
    - amd/PARD-Llama-3.2-1B
    - amd/PARD-Qwen2.5-0.5B
    - amd/PARD-DeepSeek-R1-Distill-Qwen-1.5B
  - LLM models
    - meta-llama/Llama-3.1-8B
    - meta-llama/Llama-3.2-3B
    - Qwen/Qwen2-7B-Instruct
    - microsoft/phi-4
    - facebook/opt-6.7b
    - EleutherAI/gpt-j-6b
    - deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
    - google/gemma-3-4b-it (new)
    - google/gemma-3-12b-it (new)
    - openai/gpt-oss-20b (new)
## Known Limitations
| Feature | Unsupported Configuration | Supported Alternative |
|---|---|---|
| GPT-OSS | JIT attention backend, PAGED cache type | SLAB_POOL cache with SLAB attention backend |
| PARD speculative decoding (offline) | PAGED cache type | BMC or SLAB_POOL cache types |
| PARD speculative decoding (server) | PAGED or SLAB_POOL cache types | BMC cache type |