Add vLLM optimization guide and skill #32
Changes from all commits
4672221
3dc0dc5
e49fc00
8bc119a
751a92d
8a32f4b
de30726
14d64c5
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,312 @@ | ||||||
| # vLLM on Intel Xeon Processors | ||||||
|
|
||||||
| This guide provides recommendations for running vLLM on Intel Xeon processors. | ||||||
|
|
||||||
| ## Upstream First | ||||||
|
|
||||||
| Intel invests significant effort in upstreaming code optimizations and documentation directly to the official vLLM repositories. Those upstream contributions form the foundation of Intel Xeon CPU performance in vLLM. This guide is only a small extension of that work: it collects practical deployment tips in one place. Always consult the official documentation first. | ||||||
|
|
||||||
| - [vLLM CPU installation guide](https://docs.vllm.ai/en/stable/getting_started/installation/cpu/) | ||||||
| - [vLLM SLM/LLM Recipes](https://recipes.vllm.ai/) | ||||||
| - [vLLM Benchmarking](https://docs.vllm.ai/en/stable/benchmarking/cli/) | ||||||
|
|
||||||
| ## Use with AI Coding Agents | ||||||
|
|
||||||
| This recipe ships a companion [Agent Skill](./skill/SKILL.md) (`vllm-xeon-cpu`) that lets AI coding agents — GitHub Copilot, Claude Code, and other `AGENTS.md`-aware tools — deploy, tune, validate, and benchmark vLLM on Intel Xeon CPUs on a customer's behalf. The skill is a self-contained, markdown-only payload under [`skill/`](./skill/) that you copy into your own workspace or user profile. | ||||||
|
|
||||||
| > **Folder name must match `name`.** When you install the skill, the destination folder **must** be named `vllm-xeon-cpu` (matching the `name:` field in the skill's frontmatter). Otherwise the agent will not discover it. | ||||||
|
|
||||||
| Install the skill once per workspace or user profile. Pick the install path for your agent runtime: | ||||||
|
|
||||||
| | Runtime | Install path | Notes | | ||||||
| | --- | --- | --- | | ||||||
| | GitHub Copilot (workspace) | `.github/skills/vllm-xeon-cpu/` | Shared with everyone working in the repo and with the Copilot coding agent on PRs / issues. | | ||||||
| | GitHub Copilot (personal) | `~/.copilot/skills/vllm-xeon-cpu/` | Available across all your workspaces; not shared. | | ||||||
| | Claude Code (workspace) | `.claude/skills/vllm-xeon-cpu/` | Shared via the repo. | | ||||||
|
|
||||||
| ### GitHub Copilot — Repo Workspace | ||||||
|
|
||||||
| ```bash | ||||||
| mkdir -p .github/skills/vllm-xeon-cpu | ||||||
| curl -L https://github.com/intel/optimization-zone/archive/refs/heads/main.tar.gz \ | ||||||
| | tar -xz --strip-components=4 -C .github/skills/vllm-xeon-cpu \ | ||||||
| optimization-zone-main/software/vllm/skill | ||||||
| ``` | ||||||
|
|
||||||
| ### GitHub Copilot — User profile | ||||||
|
|
||||||
| ```bash | ||||||
| mkdir -p ~/.copilot/skills/vllm-xeon-cpu | ||||||
| curl -L https://github.com/intel/optimization-zone/archive/refs/heads/main.tar.gz \ | ||||||
| | tar -xz --strip-components=4 -C ~/.copilot/skills/vllm-xeon-cpu \ | ||||||
| optimization-zone-main/software/vllm/skill | ||||||
| ``` | ||||||
|
|
||||||
| ### Claude Code — Repo Workspace | ||||||
|
|
||||||
| ```bash | ||||||
| mkdir -p .claude/skills/vllm-xeon-cpu | ||||||
| curl -L https://github.com/intel/optimization-zone/archive/refs/heads/main.tar.gz \ | ||||||
| | tar -xz --strip-components=4 -C .claude/skills/vllm-xeon-cpu \ | ||||||
| optimization-zone-main/software/vllm/skill | ||||||
| ``` | ||||||
|
|
||||||
| After installation, invoke the skill from chat with `/vllm-xeon-cpu`, or let the agent auto-load it when your request matches keywords such as "vLLM" or "Xeon". | ||||||
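Because discovery depends on the folder name matching the `name:` field, a quick check after copying can catch a mismatch early. The following is a minimal sketch, assuming `SKILL.md` carries a plain `name: <value>` line in its frontmatter; `check_skill_dir` is a hypothetical helper, not part of any agent runtime.

```bash
#!/usr/bin/env bash
# Sketch: verify that a skill folder's name matches the `name:` field in SKILL.md.
# check_skill_dir is a hypothetical helper introduced for illustration.
check_skill_dir() {
  local dir="${1%/}"                      # drop a trailing slash, if any
  local skill_file="${dir}/SKILL.md"
  [ -f "${skill_file}" ] || { echo "missing ${skill_file}" >&2; return 2; }

  # Take the first `name:` line, then strip the key, quotes, and trailing spaces.
  local name
  name="$(sed -n 's/^name:[[:space:]]*//p' "${skill_file}" \
            | head -n 1 | tr -d "\"'" | sed 's/[[:space:]]*$//')"

  if [ "${name}" = "$(basename "${dir}")" ]; then
    echo "ok: ${name}"
  else
    echo "mismatch: folder=$(basename "${dir}") name=${name}" >&2
    return 1
  fi
}
```

For example, `check_skill_dir .github/skills/vllm-xeon-cpu` should print `ok: vllm-xeon-cpu` when the install paths above were used unchanged.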
|
|
||||||
| ## Intel Xeon SLM/LLM Sizing Guidance | ||||||
|
|
||||||
| For SLM/LLM sizing guidance on Intel Xeon CPUs, see the Intel Xeon AI Performance Advisor tools and the Intel AI Software Catalog: | ||||||
|
|
||||||
| - [Cloud Intel Xeon AI Performance Advisor](https://xeonprocessoradvisor.intel.com/csp-ai-performance-advisor) | ||||||
| - [On-prem Intel Xeon AI Performance Advisor](https://xeonprocessoradvisor.intel.com/on-prem-ai-performance-advisor) | ||||||
| - [Intel AI Software Catalog - Model Guidance](https://swcatalog.intel.com/models) | ||||||
|
Collaborator
The first two require the end user to provide an email address. Let's move the third bullet to the top, since that link gives direct access to the content. |
||||||
|
|
||||||
| ## vLLM Requirements Guidance | ||||||
|
|
||||||
| | Item | Guidance | | ||||||
| | --- | --- | | ||||||
| | OS | Linux | | ||||||
| | Python | 3.10 through 3.13 | | ||||||
| | vLLM | `v0.17.0` or newer | | ||||||
| | Intel AMX Xeon CPU Flags | 4th Gen or newer with `amx_tile`, `amx_bf16`, and `amx_int8` for best BF16/INT8 performance | | ||||||
|
Collaborator
Suggested change
|
||||||
|
|
||||||
| ## Performance Guidance | ||||||
|
|
||||||
| | Setting | Guidance | Why it matters | | ||||||
| | --- | --- | --- | | ||||||
| | `--dtype=bfloat16` | Use `bfloat16` on Intel Xeon with Intel AMX | Selects BF16, the preferred CPU dtype, so that supported kernels can use Intel AMX acceleration | | ||||||
|
Collaborator
Does this select the preferred CPU dtype and also enable Intel AMX acceleration? |
||||||
| | `VLLM_CPU_KVCACHE_SPACE` | `20` to `40` GiB or larger | Larger values allow more concurrency and context, but must fit per NUMA node. | | ||||||
| | `VLLM_CPU_OMP_THREADS_BIND` | `auto` | Binds OpenMP worker threads to NUMA-local cores. Use ranges such as `0-31\|32-63` for manual control; `auto` is preferred. | | ||||||
| | `VLLM_CPU_NUM_OF_RESERVED_CPU` | `1` | Reserves one core for API serving, tokenization, networking, logging, and OS work. | | ||||||
|
Collaborator
Suggested change
|
||||||
| | `--tensor-parallel-size` | Use the default for a single NUMA node, or set it to the NUMA node count | Keeps model shards close to local memory; current vLLM CPU releases do not support a value of `6`. | | ||||||
|
Collaborator
Suggested change
|
||||||
| | `--max-num-batched-tokens` | Online: `2048`; offline: `4096` | Maximum number of batched tokens per iteration. Tune for prefill throughput and time to first token. | | ||||||
| | `--max-num-seqs` | Online: `128`; offline: `256` | Maximum number of sequences per iteration. Tune for decode throughput and inter-token latency. | | ||||||
| | `VLLM_CPU_SGL_KERNEL` | `0`, or try `1` for low-latency SLM serving | Experimental x86 small-batch kernels; requires AMX, BF16 weights, and compatible shapes. | | ||||||
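The table says that `VLLM_CPU_KVCACHE_SPACE` must fit per NUMA node. One way to sanity-check a candidate value is to compare it with the smallest per-node free memory reported by `numactl --hardware`. The sketch below assumes the usual `node N free: X MB` lines in that report; `min_numa_free_mb` and `kvcache_fits` are hypothetical helpers, and the check is only a lower bound because model weights need room on the same node.

```bash
#!/usr/bin/env bash
# Sketch: find the smallest per-node free memory (MB) in `numactl --hardware` output.
# Reads the report on stdin; prints nothing if no "node N free:" line is present.
min_numa_free_mb() {
  awk '/^node [0-9]+ free:/ {
         v = $4 + 0                       # fourth field is the free size in MB
         if (!seen || v < min) { min = v; seen = 1 }
       }
       END { if (seen) print min }'
}

# Succeeds if a KV cache of WANT_GIB fits in FREE_MB.
# Assumption: weights and runtime overhead are NOT accounted for here.
kvcache_fits() {
  local want_gib="$1" free_mb="$2"
  [ $(( want_gib * 1024 )) -le "${free_mb}" ]
}
```

A possible use is `kvcache_fits 20 "$(numactl --hardware | min_numa_free_mb)" && echo "20 GiB fits"` before exporting `VLLM_CPU_KVCACHE_SPACE=20`.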
|
|
||||||
|
Collaborator
Can you clarify whether the performance guidance settings above are environment variables, or flags/arguments for a benchmark? |
||||||
| ## Utility Tools | ||||||
|
|
||||||
| Install the small host tools used by the commands below: | ||||||
|
|
||||||
| ```bash | ||||||
| sudo apt-get update | ||||||
| sudo apt-get install -y --no-install-recommends curl git jq numactl htop python3-venv python3-full g++ python3-dev | ||||||
| ``` | ||||||
|
|
||||||
| ## Hardware Validation | ||||||
|
|
||||||
| Validate the CPU model, core count, thread count, NUMA topology, and important flags such as `avx512f`, `avx2`, `amx_tile`, `amx_bf16`, `amx_int8`, and `avx512_bf16`. | ||||||
|
|
||||||
| ```bash | ||||||
| lscpu | grep -E "Model name|Socket|Core|Thread|NUMA node|Flags" | ||||||
| lscpu | grep -o -w -E "avx512f|avx2|amx_(tile|bf16|int8)|avx512_bf16" | sort -u | ||||||
| numactl --hardware | ||||||
| ``` | ||||||
|
|
||||||
| ## Fast Path: Docker | ||||||
|
|
||||||
| <details> | ||||||
| <summary>(Optional) Install Docker on Ubuntu 24.04</summary> | ||||||
|
|
||||||
| ```bash | ||||||
| sudo apt-get update | ||||||
| sudo apt-get install -y docker.io | ||||||
| sudo systemctl enable --now docker | ||||||
| sudo usermod -aG docker $USER | ||||||
| newgrp docker # apply group without re-login | ||||||
| ``` | ||||||
|
|
||||||
| </details> | ||||||
|
|
||||||
| ```bash | ||||||
| export HF_TOKEN=your_hf_token_here # <<<=== Required for gated Hugging Face models and faster downloads. | ||||||
| export VLLM_VERSION=0.20.2 # <<<=== Check for newer releases and update this value. | ||||||
| docker pull vllm/vllm-openai-cpu:v${VLLM_VERSION}-x86_64 | ||||||
|
|
||||||
| docker run --rm \ | ||||||
| --name vllm-cpu \ | ||||||
| --security-opt seccomp=unconfined \ | ||||||
| --cap-add SYS_NICE \ | ||||||
| --shm-size=8g \ | ||||||
| -p 8000:8000 \ | ||||||
| -e HF_TOKEN="${HF_TOKEN}" \ | ||||||
| -e VLLM_CPU_KVCACHE_SPACE=20 \ | ||||||
| -e VLLM_CPU_OMP_THREADS_BIND=auto \ | ||||||
| -e VLLM_CPU_NUM_OF_RESERVED_CPU=1 \ | ||||||
| vllm/vllm-openai-cpu:v${VLLM_VERSION}-x86_64 \ | ||||||
| RedHatAI/Qwen3-4B-Instruct-2507-quantized.w8a8 \ | ||||||
| --dtype=bfloat16 \ | ||||||
| --max-num-batched-tokens 2048 \ | ||||||
| --max-num-seqs 128 | ||||||
| ``` | ||||||
|
|
||||||
| `SYS_NICE` and `seccomp=unconfined` allow vLLM's NUMA memory policy calls inside Docker. Without them, serving can still work, but NUMA placement may be weaker and logs can show `get_mempolicy: Operation not permitted`. | ||||||
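The server may take a while to download and load the model before it accepts requests, so scripted workflows usually need to wait for it. The sketch below is a generic retry loop; `wait_until` is a hypothetical helper, and the probe command is whatever you pass in.

```bash
#!/usr/bin/env bash
# Sketch: retry a command until it succeeds or the attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  local attempts="$1" delay="$2" i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    if "$@" >/dev/null 2>&1; then
      return 0                 # probe succeeded
    fi
    sleep "${delay}"           # back off before the next attempt
  done
  return 1                     # never succeeded
}
```

For the container above, `wait_until 60 5 curl -sf http://localhost:8000/v1/models` would poll the models endpoint for up to roughly five minutes; the attempt count and delay are illustrative values, not recommendations.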
|
|
||||||
| ## Validate the OpenAI-Compatible Endpoint | ||||||
|
|
||||||
| **Open a new terminal or use a remote system.** | ||||||
|
|
||||||
| ```bash | ||||||
| curl http://localhost:8000/v1/chat/completions \ | ||||||
| -H "Content-Type: application/json" \ | ||||||
| -d '{ | ||||||
| "model": "RedHatAI/Qwen3-4B-Instruct-2507-quantized.w8a8", | ||||||
| "messages": [{"role": "user", "content": "Give three CPU inference tuning tips."}], | ||||||
| "max_tokens": 128 | ||||||
| }' | ||||||
| ``` | ||||||
|
|
||||||
| ## Benchmarking Guidance | ||||||
|
|
||||||
| This section summarizes the official benchmarking and tuning guidance from the vLLM documentation, with a CPU focus. Always consult the [official benchmarking docs](https://docs.vllm.ai/en/latest/benchmarking/cli/) for the latest recommendations and tools. | ||||||
|
|
||||||
| Start the Docker container. If it is running in the foreground, open another terminal for these checks: | ||||||
|
Collaborator
Suggested change
|
||||||
|
|
||||||
| ```bash | ||||||
| # Docker path: the container started above is named vllm-cpu. | ||||||
|
Collaborator
Suggested change
|
||||||
| docker exec vllm-cpu vllm collect-env | ||||||
|
|
||||||
| curl -s http://localhost:8000/v1/models | jq . | ||||||
| SERVER_PID=$(pgrep -f 'vllm serve|api_server' | head -n 1) | ||||||
| numastat -p "${SERVER_PID}" | ||||||
| ``` | ||||||
|
|
||||||
| ### Benchmark | ||||||
|
|
||||||
|
Collaborator
Can you add a few lines on what `vllm bench serve` is before the Docker commands? Is it a subset of the Benchmark Suite you describe later? |
||||||
| `vllm bench serve` is vLLM's built-in benchmark client for online serving. It sends requests to an already running server and reports TTFT, TPOT, and throughput. Warm up with `--num-warmups` to avoid measuring JIT compilation overhead. | ||||||
|
|
||||||
| If you started the server with Docker (as shown above), run the benchmark **inside the container**: | ||||||
|
|
||||||
| ```bash | ||||||
| docker exec vllm-cpu vllm bench serve \ | ||||||
| --model RedHatAI/Qwen3-4B-Instruct-2507-quantized.w8a8 \ | ||||||
| --dataset-name random \ | ||||||
| --random-input-len 128 \ | ||||||
| --random-output-len 128 \ | ||||||
| --num-prompts 100 \ | ||||||
| --num-warmups 5 \ | ||||||
| --request-rate inf \ | ||||||
| --save-result \ | ||||||
| --result-dir ./bench-results \ | ||||||
| --percentile-metrics ttft,tpot,itl | ||||||
| ``` | ||||||
|
|
||||||
| If you installed vLLM natively from the CPU-specific wheel, run the benchmark directly on the host: | ||||||
|
|
||||||
| ```bash | ||||||
| vllm bench serve \ | ||||||
| --model RedHatAI/Qwen3-4B-Instruct-2507-quantized.w8a8 \ | ||||||
| --dataset-name random \ | ||||||
| --random-input-len 128 \ | ||||||
| --random-output-len 128 \ | ||||||
| --num-prompts 100 \ | ||||||
| --num-warmups 5 \ | ||||||
| --request-rate inf \ | ||||||
| --save-result \ | ||||||
| --result-dir ./bench-results \ | ||||||
| --percentile-metrics ttft,tpot,itl | ||||||
| ``` | ||||||
|
|
||||||
| > **Troubleshooting: "Failed to infer device type"** — This error means vLLM's platform detection cannot find the CPU backend. The most common cause is installing the generic (CUDA) wheel from PyPI via `pip install vllm` instead of the CPU-specific wheel. The CPU wheel includes `+cpu` in its version string (e.g., `0.20.2+cpu`), which the platform detector requires. Fix by reinstalling the CPU wheel directly: | ||||||
| > | ||||||
| > ```bash | ||||||
| > export VLLM_VERSION=0.20.2 | ||||||
| > pip install --force-reinstall --extra-index-url https://download.pytorch.org/whl/cpu \ | ||||||
| > "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl" | ||||||
| > ``` | ||||||
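Since the `+cpu` suffix is what distinguishes the CPU wheel, checking the installed version string is a cheap way to confirm which wheel is present before debugging further. A minimal sketch follows; `is_cpu_build` is a hypothetical helper that only inspects a string you pass in.

```bash
#!/usr/bin/env bash
# Sketch: succeed if a vLLM version string carries the +cpu local version tag.
is_cpu_build() {
  case "$1" in
    *+cpu*) return 0 ;;   # e.g. 0.20.2+cpu
    *)      return 1 ;;   # e.g. 0.20.2 (generic wheel)
  esac
}
```

One way to use it is `is_cpu_build "$(pip show vllm | sed -n 's/^Version: //p')" || echo "reinstall the CPU wheel"`.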
|
|
||||||
| ### Concurrency Sweep | ||||||
|
|
||||||
| With `--request-rate inf`, all prompts fire simultaneously so `--num-prompts` directly controls concurrency. Sweep to see how latency and throughput scale under increasing batch pressure: | ||||||
|
|
||||||
| ```bash | ||||||
| for N in 10 50 100 200 500; do | ||||||
| vllm bench serve \ | ||||||
| --model RedHatAI/Qwen3-4B-Instruct-2507-quantized.w8a8 \ | ||||||
| --dataset-name random \ | ||||||
| --random-input-len 128 \ | ||||||
| --random-output-len 128 \ | ||||||
| --num-prompts "${N}" \ | ||||||
| --num-warmups 5 \ | ||||||
| --request-rate inf \ | ||||||
| --save-result \ | ||||||
| --result-dir ./bench-results \ | ||||||
| --percentile-metrics ttft,tpot,itl | ||||||
| done | ||||||
| ``` | ||||||
|
|
||||||
| ### Testing & Tuning Methodology | ||||||
|
|
||||||
| - Test with different input/output lengths to understand how the model performs under different prompt and generation sizes. For example, try `--random-input-len` and `--random-output-len` values of `64`, `128`, `256`, and `512`. | ||||||
| - Test with different user concurrency levels using `--num-prompts` values of `10`, `50`, `100`, `200`, and `500` with `--request-rate inf`. | ||||||
| - Use one known-good model and change one knob at a time. Track TTFT, TPOT, output tokens per second, requests per second, peak RSS, NUMA locality, and OOM events. | ||||||
| - Vary only one of `VLLM_CPU_KVCACHE_SPACE`, `VLLM_CPU_OMP_THREADS_BIND`, `--max-num-batched-tokens`, `--max-num-seqs`, or `--block-size` per run. Compare results across runs using the saved JSON files in `./bench-results`. | ||||||
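The input/output length sweep from the first bullet can be scripted so that only the lengths change between runs. The sketch below prints the commands instead of running them, so they can be reviewed first; `print_length_sweep` is a hypothetical helper, and the per-length result directory under `./bench-results` is an assumption made to keep runs separate.

```bash
#!/usr/bin/env bash
# Sketch: print one `vllm bench serve` command per input/output length.
# Usage: print_length_sweep MODEL LEN [LEN...]
# Only the lengths (and the result sub-directory) differ between commands.
print_length_sweep() {
  local model="$1" len
  shift
  for len in "$@"; do
    printf 'vllm bench serve --model %s --dataset-name random --random-input-len %s --random-output-len %s --num-prompts 100 --num-warmups 5 --request-rate inf --save-result --result-dir ./bench-results/len-%s --percentile-metrics ttft,tpot,itl\n' \
      "${model}" "${len}" "${len}" "${len}"
  done
}
```

For example, `print_length_sweep RedHatAI/Qwen3-4B-Instruct-2507-quantized.w8a8 64 128 256 512` emits four commands; pipe the output to `bash` once it looks right.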
|
|
||||||
| ### Using the vLLM Benchmark Suite | ||||||
|
|
||||||
| The vLLM source tree includes a full performance benchmark harness at `.buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh`. This is the same script used in vLLM's CI to gate regressions. It reads a JSON test definition, generates concrete benchmark commands, and (optionally) executes them. | ||||||
|
|
||||||
| Prepare the environment by installing the CPU wheel and the reporting dependencies into a virtual environment: | ||||||
|
|
||||||
| ```bash | ||||||
| export HF_TOKEN=your_hf_token_here # <<<=== Required for gated Hugging Face models and faster downloads. | ||||||
| export VLLM_VERSION=0.20.2 | ||||||
| python3 -m venv ~/vllm-venv | ||||||
| source ~/vllm-venv/bin/activate | ||||||
| pip install --extra-index-url https://download.pytorch.org/whl/cpu \ | ||||||
| "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl" \ | ||||||
| tabulate pandas | ||||||
| ``` | ||||||
|
|
||||||
| Clone the source tree (or reuse the checkout from a source build): | ||||||
|
|
||||||
| ```bash | ||||||
| git clone https://github.com/vllm-project/vllm.git vllm_source | ||||||
| cd vllm_source | ||||||
| export VLLM_TARGET_DEVICE=cpu | ||||||
| ``` | ||||||
|
|
||||||
| Do a dry run first to inspect the generated commands without executing them: | ||||||
|
|
||||||
| ```bash | ||||||
| source ~/vllm-venv/bin/activate | ||||||
| HF_TOKEN="${HF_TOKEN}" \ | ||||||
| ON_CPU=1 \ | ||||||
| SERVING_JSON=serving-tests-cpu-text.json \ | ||||||
| DRY_RUN=1 \ | ||||||
| MODEL_FILTER=meta-llama/Llama-3.1-8B-Instruct \ | ||||||
| DTYPE_FILTER=bfloat16 \ | ||||||
| bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh | ||||||
| ``` | ||||||
| To execute the benchmark, run the same command without `DRY_RUN=1`: | ||||||
|
|
||||||
| ```bash | ||||||
| source ~/vllm-venv/bin/activate | ||||||
| HF_TOKEN="${HF_TOKEN}" \ | ||||||
| ON_CPU=1 \ | ||||||
| SERVING_JSON=serving-tests-cpu-text.json \ | ||||||
| MODEL_FILTER=meta-llama/Llama-3.1-8B-Instruct \ | ||||||
| DTYPE_FILTER=bfloat16 \ | ||||||
| bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh | ||||||
| ``` | ||||||
|
|
||||||
| Key environment variables: | ||||||
|
|
||||||
| | Variable | Purpose | | ||||||
| | -------- | ------- | | ||||||
| | `HF_TOKEN` | Hugging Face token — required by the script's `check_hf_token` gate | | ||||||
| | `ON_CPU` | Set to `1` to use CPU-specific test configs | | ||||||
| | `SERVING_JSON` | JSON file defining test matrix (e.g., `serving-tests-cpu-text.json`) | | ||||||
| | `DRY_RUN` | Set to `1` to generate commands without executing | | ||||||
| | `MODEL_FILTER` | Run only benchmarks matching this model ID | | ||||||
| | `DTYPE_FILTER` | Run only benchmarks matching this dtype (e.g., `bfloat16`) | | ||||||
|
|
||||||
| > **Note:** The `MODEL_FILTER` value must match an entry in the JSON test definition. If the model is not pre-curated in the CPU test JSON, you can add an entry or use the `vllm bench serve` approach above instead. | ||||||
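Before launching the suite, it can help to confirm that the model ID actually appears in the test definition. The sketch below does a literal search for the quoted model ID; `model_in_tests` is a hypothetical helper, and the file location must be supplied by you because this guide does not state where the JSON file lives in the source tree.

```bash
#!/usr/bin/env bash
# Sketch: succeed if a quoted model ID appears literally in a test-definition file.
# Usage: model_in_tests JSON_FILE MODEL_ID
model_in_tests() {
  local json_file="$1" model="$2"
  # -F: fixed string (no regex), -q: quiet, --: end of options
  grep -Fq -- "\"${model}\"" "${json_file}"
}
```

For example, `model_in_tests path/to/serving-tests-cpu-text.json meta-llama/Llama-3.1-8B-Instruct || echo "model not in test matrix"`, where `path/to/` is a placeholder for the actual directory.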
|
|
||||||
| ## References | ||||||
|
|
||||||
| - [vLLM CPU installation guide](https://docs.vllm.ai/en/stable/getting_started/installation/cpu/) | ||||||
| - [vLLM CPU-supported models](https://docs.vllm.ai/en/stable/models/hardware_supported_models/cpu/) | ||||||
| - [vLLM optimization and tuning guide](https://docs.vllm.ai/en/stable/configuration/optimization/) | ||||||
| - [vLLM bench serve CLI](https://docs.vllm.ai/en/latest/cli/bench/serve.html) | ||||||
| - [vLLM bench latency CLI](https://docs.vllm.ai/en/latest/cli/bench/latency.html) | ||||||
| - [vLLM Intel quantization support](https://docs.vllm.ai/en/stable/features/quantization/inc/) | ||||||
Hi, thanks for the contribution!
Consider adding a table of contents and breaking the document down into sections and subsections.
There are three broad topics being covered:
vLLM on Xeon guidance,
AI agent skill installation, and
the benchmarking workflow.
We could move "Use with AI Coding Agents" towards the end, so that we don't confuse or redirect someone who lands on the page looking for immediate tuning guidance.