Releases: amd/AMD-PACE
AMD PACE v1.1.0
Blog post: AMD PACE: High-Performance Platform Aware Compute Engine — architecture deep dive, optimization techniques, and benchmarks on 5th Gen AMD EPYC™ processors.
Performance Optimizations
- SlabPool Attention: A new CPU-native KV cache and unified AVX-512 attention backend. SlabPool manages all sequences in a single pre-allocated BF16 tensor with O(1) slab allocation, L2-aware auto-tuned block sizes, and a dispatcher that selects the optimal kernel per sequence — GQA-aware decode with online softmax, multi-token decode, or BRGeMM-tiled prefill.
- Paged Attention: vLLM-style paged KV cache on CPU with block-level memory management. Supports offline and server modes for all models where the PAGED cache type is supported (see Known Limitations).
- Fused AVX-512 Kernels: New suite of fused operators — fused Add+RMSNorm and Add+LayerNorm, fused RoPE, fused QKV projections, and fused MLP via libXSMM TPP — eliminating intermediate memory traffic and keeping data in registers/cache.
- AOCL-DLP Backend: AMD AOCL Deep Learning Primitives backend for BF16 linear operations with fused activations (ReLU, GELU, SiLU) and fused element-wise multiply.
- Server Decode Optimizations: Batched decode, vectorized block table population, simplified MaskCache with content-aware buffer reuse, and automatic NUMA-aware process affinity.
- BMC KV Cache Improvements: Optimized cache writes and added BRGeMM tiled prefill attention for the BMC backend.
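The online-softmax decode path mentioned above can be illustrated independently of PACE. The sketch below (function and variable names are illustrative, not PACE's API) streams the KV cache block by block, keeping a running max, normalizer, and output accumulator, and rescaling them whenever a new block raises the max — the same trick that lets a kernel make a single pass over the cache:

```python
import numpy as np

def decode_attention_online_softmax(q, k_cache, v_cache, block=64):
    """Single-query decode attention computed block by block.

    Instead of materializing softmax over the whole KV cache, keep a
    running max (m), normalizer (l), and output accumulator (acc),
    rescaling the old state whenever a new block raises the max.
    """
    d = q.shape[0]
    scale = 1.0 / np.sqrt(d)
    m, l = -np.inf, 0.0
    acc = np.zeros(d, dtype=np.float64)
    for start in range(0, k_cache.shape[0], block):
        kb = k_cache[start:start + block]      # (b, d) key block
        vb = v_cache[start:start + block]      # (b, d) value block
        s = (kb @ q) * scale                   # scores for this block
        m_new = max(m, float(s.max()))
        corr = np.exp(m - m_new)               # rescale previous state
        p = np.exp(s - m_new)
        l = l * corr + p.sum()
        acc = acc * corr + p @ vb
        m = m_new
    return acc / l
```

The result matches full-softmax attention up to floating-point error, while only ever holding one block of scores at a time.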
Functionality
- New Models: Gemma 3 (causal and conditional generation) and GPT-OSS.
- Inference Server Enhancements: Multi-instance serving with automatic NUMA core partitioning, continuous prefill-first scheduler, batch decoding, and Prometheus metrics integration (TTFT, TPOT, request rates).
- Speculative Decoding (PARD): Extensible speculative decoding module with full serving integration, supporting batched multi-token verification.
- Pluggable Operator Framework: Python operator framework with 5 interchangeable backends (NATIVE, JIT, TPP, IMBPS, AOCLDLP) and automatic fallback. Operators can be configured per type via OperatorConfig.
- Penalties & Generation Config: Support for repetition penalty, frequency penalty, temperature, top-k, top-p, and min-p sampling with configurable generation parameters.
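The sampling controls listed above are standard techniques; a generic NumPy sketch of how they compose into a final distribution (illustrative, not PACE's implementation):

```python
import numpy as np

def sample_filter(logits, generated, repetition_penalty=1.0,
                  temperature=1.0, top_k=0, top_p=1.0, min_p=0.0):
    """Apply repetition penalty, temperature, top-k, top-p, and min-p
    to a logit vector and return the resulting probabilities."""
    logits = logits.astype(np.float64).copy()
    # Repetition penalty: push already-generated tokens away from selection.
    for t in set(generated):
        logits[t] = (logits[t] / repetition_penalty if logits[t] > 0
                     else logits[t] * repetition_penalty)
    logits /= max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    keep = np.ones_like(probs, dtype=bool)
    if top_k > 0:                       # keep only the k most likely tokens
        keep &= probs >= np.sort(probs)[-top_k]
    if top_p < 1.0:                     # nucleus: smallest set reaching top_p mass
        order = np.argsort(probs)[::-1]
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        nucleus = np.zeros_like(keep)
        nucleus[order[:cutoff]] = True
        keep &= nucleus
    if min_p > 0.0:                     # drop tokens far below the argmax
        keep &= probs >= min_p * probs.max()
    probs = np.where(keep, probs, 0.0)
    return probs / probs.sum()
```

Each filter can only shrink the candidate set, and the argmax always survives, so the renormalized distribution is always well defined.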
Validation
- Upgraded Dependencies: PyTorch v2.9.0, oneDNN v3.11, GCC 14 build support.
- Correctness Tests: Model correctness tests comparing PACE outputs against HuggingFace reference implementations.
- Documentation: Comprehensive overhaul of all docs — PerformanceGuide, Contributing, LLM, InferenceServer, PythonOps, SpeculativeDecoding — with server playbooks and example notebooks.
- Validated Models:
- PARD models
  - amd/PARD-Llama-3.2-1B
  - amd/PARD-Qwen2.5-0.5B
  - amd/PARD-DeepSeek-R1-Distill-Qwen-1.5B
- LLM models
  - meta-llama/Llama-3.1-8B
  - meta-llama/Llama-3.2-3B
  - Qwen/Qwen2-7B-Instruct
  - microsoft/phi-4
  - facebook/opt-6.7b
  - EleutherAI/gpt-j-6b
  - deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
  - google/gemma-3-4b-it (new)
  - google/gemma-3-12b-it (new)
  - openai/gpt-oss-20b (new)
Known Limitations
| Feature | Unsupported Configuration | Supported Alternative |
|---|---|---|
| GPT-OSS | JIT attention backend, PAGED cache type | SLAB_POOL cache with SLAB attention backend |
| PARD speculative decoding (offline) | PAGED cache type | BMC or SLAB_POOL cache types |
| PARD speculative decoding (server) | PAGED or SLAB_POOL cache types | BMC cache type |
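The PAGED cache type in the table refers to the vLLM-style block-level design listed under Performance Optimizations. A minimal sketch of the bookkeeping involved (class and method names are illustrative, not PACE's API): each sequence owns a block table mapping logical token positions to fixed-size physical blocks drawn from a free list, so growth is O(1) per block and finished sequences return memory without fragmentation.

```python
class PagedKVCache:
    """Block-level KV-cache bookkeeping in the style of paged attention."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block free list
        self.tables = {}                     # seq_id -> list of block ids
        self.lens = {}                       # seq_id -> tokens stored

    def append(self, seq_id, n_tokens=1):
        """Reserve slots for n_tokens new KV entries (O(1) per block)."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lens.get(seq_id, 0)
        for _ in range(n_tokens):
            if n % self.block_size == 0:     # current block full: grab one
                table.append(self.free.pop())
            n += 1
        self.lens[seq_id] = n

    def slot(self, seq_id, pos):
        """Physical slot index for a logical token position."""
        block = self.tables[seq_id][pos // self.block_size]
        return block * self.block_size + pos % self.block_size

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free list."""
        self.free.extend(self.tables.pop(seq_id))
        del self.lens[seq_id]
```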
AMD PACE v1.0
Performance Optimizations
- CPU-Optimized Kernels: Engineered for AMD Zen4 CPUs and beyond, utilizing CPU-friendly cache and kernel optimizations for significant performance gains in LLM workloads.
- Speculative Decoding: Features a built-in implementation of PARallel Draft Model Adaptation (PARD), delivering up to 5x throughput improvement over standard autoregressive methods.
- Data Parallelism: Achieves up to 25x end-to-end speedup by serving multiple requests concurrently across multiple model instances, maximizing hardware utilization.
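For intuition, the draft-and-verify loop behind speculative decoding can be sketched with toy deterministic "models" (plain functions here; PARD's actual draft models and batched verification are more involved). The draft proposes k tokens cheaply, the target checks them in one pass, and every emitted token is exactly what the target alone would have produced:

```python
def speculative_generate(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding with a toy draft/target pair."""
    seq = list(prompt)
    while len(seq) < len(prompt) + n_new:
        # Draft proposes k tokens autoregressively (the cheap model).
        ctx, proposal = list(seq), []
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies: accept while it agrees, then emit its own
        # token at the first mismatch (or one bonus token if all match).
        ctx = list(seq)
        for t in proposal:
            expected = target_next(ctx)
            if expected != t:
                ctx.append(expected)
                break
            ctx.append(t)
        else:
            ctx.append(target_next(ctx))
        seq = ctx
    return seq[:len(prompt) + n_new]
```

When the draft agrees often, several tokens land per verification pass, which is where the throughput gain comes from; a wrong draft still costs only one emitted token per pass.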
Functionality
- Model Support: Initial support for Large Language Models (LLMs) and INT8 DLRMv2.
- Inference Server: Includes a ready-to-use, high-performance inference server for easy deployment and testing.
- Logging Control: Easily configure log verbosity using the PACE_LOG_LEVEL environment variable for better debugging and profiling.
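In Python terms, such an environment-variable switch typically maps onto the standard logging levels. A generic sketch — the accepted values and the helper name here are assumptions for illustration, not PACE's documented behavior:

```python
import logging
import os

# PACE_LOG_LEVEL is the variable named in the release notes; mapping its
# value onto Python's standard logging levels is illustrative only.
_LEVELS = {"DEBUG": logging.DEBUG, "INFO": logging.INFO,
           "WARNING": logging.WARNING, "ERROR": logging.ERROR}

def configure_logging(default="INFO"):
    """Pick a log level from the environment, falling back to INFO."""
    name = os.environ.get("PACE_LOG_LEVEL", default).upper()
    level = _LEVELS.get(name, logging.INFO)
    logging.basicConfig(level=level)
    return level
```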
Validation
- Benchmarks: The benchmarks directory contains scripts for evaluating both performance and accuracy for supported LLMs.
- Testing Suite: A dedicated test suite is included to ensure correctness and stability.
- Validated Environment: Tested with Python 3.9+ (3.12 recommended) and gcc >= 12 to ensure compatibility and reliability.
- Validated Models:
- PARD models
  - amd/PARD-Llama-3.2-1B
  - amd/PARD-Qwen2.5-0.5B
  - amd/PARD-DeepSeek-R1-Distill-Qwen-1.5B
- LLM models
  - meta-llama/Llama-3.1-8B
  - meta-llama/Llama-3.2-3B
  - Qwen/Qwen2-7B-Instruct
  - microsoft/phi-4
  - facebook/opt-6.7b
  - EleutherAI/gpt-j-6b
  - deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
- Recsys model
- INT8 DLRM v2