Add vLLM optimization guide and skill by lucasmelogithub · Pull Request #32 · intel/optimization-zone

lucasmelogithub · 2026-05-29T15:05:31Z

This pull request adds comprehensive support and documentation for running vLLM (an efficient LLM inference engine) on Intel Xeon CPUs. It introduces a new vLLM section to the main documentation, provides a detailed user guide for vLLM CPU deployment and tuning, and supplies an AI agent skill for automating vLLM setup and benchmarking on Xeon systems.

The most important changes are:

Documentation and User Guidance:

Added a new entry for vLLM under the software section in the main README.md, making vLLM CPU guidance discoverable alongside other AI workloads.
Created software/vllm/README.md, a thorough guide covering installation, hardware validation, Docker-based deployment, performance tuning, benchmarking, and troubleshooting for vLLM on Intel Xeon CPUs. This guide includes practical tips, recommended settings, and links to upstream documentation.

AI Agent Skill for Automation:

Introduced software/vllm/skill/SKILL.md, an agent skill enabling AI coding agents (like GitHub Copilot and Claude Code) to automate vLLM deployment, tuning, hardware validation, and benchmarking on Xeon CPUs. The skill provides step-by-step procedures and best practices, and is designed for easy integration with agent workflows.

rsiyer-intel · 2026-05-29T23:52:58Z

+
+- [Cloud Intel Xeon AI Performance Advisor](https://xeonprocessoradvisor.intel.com/csp-ai-performance-advisor)
+- [On-prem Intel Xeon AI Performance Advisor](https://xeonprocessoradvisor.intel.com/on-prem-ai-performance-advisor)
+- [Intel AI Software Catalog - Model Guidance](https://swcatalog.intel.com/models)


The first two require the end user to provide an email address. Let's move the third bullet to the top, since that link gives direct access to content.

rsiyer-intel · 2026-05-29T23:54:04Z

@@ -0,0 +1,312 @@
+# vLLM on Intel Xeon Processors


Hi, thanks for the contribution!

Consider adding a table of contents and breaking the document down into sections and subsections.

Three broad topics are being covered:
vLLM on Xeon guidance,
AI agent skill installation, and
the benchmarking workflow.

We could move "Use with AI Coding Agents" toward the end, so that someone who lands on the page for immediate tuning guidance is not confused or redirected.

rsiyer-intel · 2026-05-29T23:56:10Z

+| `--max-num-batched-tokens` | Online: `2048`; offline: `4096` | Maximum number of batched tokens per iteration. Tune for prefill throughput and time to first token. |
+| `--max-num-seqs` | Online: `128`; offline: `256` | Maximum number of sequences per iteration. Tune for decode throughput and inter-token latency. |
+| `VLLM_CPU_SGL_KERNEL` | `0`, or try `1` for low-latency SLM serving | Experimental x86 small-batch kernels; requires AMX, BF16 weights, and compatible shapes. |
+


Can you clarify whether the performance settings above are environment variables, or flags/arguments for a benchmark command?
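For readers following this thread, the distinction the question asks about can be sketched as below. The setting names and values come from the table rows quoted in this review. The convention that `VLLM_`-prefixed names are environment variables and `--`-prefixed names are server command-line arguments reflects upstream vLLM as far as I know, and is not stated in the PR itself. The helper `split_settings` is hypothetical and exists only for illustration.

```python
def split_settings(settings):
    """Split tuning settings into environment variables and CLI arguments.

    Assumption: names starting with "--" are command-line arguments;
    everything else is treated as an environment variable.
    """
    env, argv = {}, []
    for name, value in settings.items():
        if name.startswith("--"):
            argv.append(f"{name}={value}")
        else:
            env[name] = str(value)
    return env, argv


# Values taken from the "online" recommendations quoted in the review.
settings = {
    "VLLM_CPU_KVCACHE_SPACE": 40,
    "VLLM_CPU_OMP_THREADS_BIND": "auto",
    "--dtype": "bfloat16",
    "--max-num-batched-tokens": 2048,
    "--max-num-seqs": 128,
}

env, argv = split_settings(settings)
print(env)   # variables to export before starting the server
print(argv)  # arguments to append to the server command
```

In this sketch the environment dictionary would be exported (or passed with `docker run -e`) before the server starts, while the argument list would be appended to the server command line.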

rsiyer-intel · 2026-05-30T00:00:40Z

+| OS | Linux |
+| Python | 3.10 through 3.13 |
+| vLLM | `v0.17.0` or newer |
+| Intel AMX Xeon CPU Flags | 4th Gen or newer with `amx_tile`, `amx_bf16`, and `amx_int8` for best BF16/INT8 performance |


Suggested change

| Intel AMX Xeon CPU Flags | 4th Gen or newer with `amx_tile`, `amx_bf16`, and `amx_int8` for best BF16/INT8 performance |

| Intel AMX related Xeon CPU Flags | 4th Gen Intel Xeon or newer with `amx_tile`, `amx_bf16`, and `amx_int8` for best BF16/INT8 performance |
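As a rough illustration of how the three flags in this row could be checked, the sketch below parses text in the format of Linux `/proc/cpuinfo`. The function name `missing_amx_flags` and the sample input are assumptions made for this example and are not part of the PR.

```python
REQUIRED_AMX_FLAGS = ("amx_tile", "amx_bf16", "amx_int8")


def missing_amx_flags(cpuinfo_text):
    """Return the required AMX flags absent from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            present = set(line.split(":", 1)[1].split())
            return [flag for flag in REQUIRED_AMX_FLAGS if flag not in present]
    # No 'flags' line at all: report every required flag as missing.
    return list(REQUIRED_AMX_FLAGS)


# Hypothetical cpuinfo excerpt for a CPU that lacks amx_int8.
sample = "processor\t: 0\nflags\t\t: fpu sse2 avx512f amx_bf16 amx_tile\n"
print(missing_amx_flags(sample))  # → ['amx_int8']
```

On a Linux host, the same function could be given the contents of `/proc/cpuinfo`; an empty result would mean all three flags are present.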

rsiyer-intel · 2026-05-30T00:04:46Z

+
+| Setting | Guidance | Why it matters |
+| --- | --- | --- |
+| `--dtype=bfloat16` | Use `bfloat16` on Intel Xeon with Intel AMX | Enables the preferred CPU dtype and AMX |


Does this enable the preferred CPU dtype and also enable Intel AMX acceleration?

rsiyer-intel · 2026-05-30T00:06:10Z

+| `--dtype=bfloat16` | Use `bfloat16` on Intel Xeon with Intel AMX | Enables the preferred CPU dtype and AMX |
+| `VLLM_CPU_KVCACHE_SPACE` | `20` to `40` GiB or larger | Larger values allow more concurrency and context, but must fit per NUMA node. |
+| `VLLM_CPU_OMP_THREADS_BIND` | `auto` | Binds OpenMP worker threads to NUMA-local cores. Use ranges such as `0-31\|32-63` for manual control, `auto` preferred. |
+| `VLLM_CPU_NUM_OF_RESERVED_CPU` | `1` | Sets a core for API serving, tokenization, networking, logging, and OS work. |


rsiyer-intel · 2026-05-30T00:08:49Z

+| `VLLM_CPU_KVCACHE_SPACE` | `20` to `40` GiB or larger | Larger values allow more concurrency and context, but must fit per NUMA node. |
+| `VLLM_CPU_OMP_THREADS_BIND` | `auto` | Binds OpenMP worker threads to NUMA-local cores. Use ranges such as `0-31\|32-63` for manual control, `auto` preferred. |
+| `VLLM_CPU_NUM_OF_RESERVED_CPU` | `1` | Sets a core for API serving, tokenization, networking, logging, and OS work. |
+| `--tensor-parallel-size` | Use default for single NUMA or set to NUMA node count | Keeps model shards close to local memory; current vLLM CPU releases do not support `6`. |


Suggested change

| `--tensor-parallel-size` | Use default for single NUMA or set to NUMA node count | Keeps model shards close to local memory; current vLLM CPU releases do not support `6`. |

| `--tensor-parallel-size` | Use the default for a single NUMA node, or set it to the NUMA node count | Keeps model shards close to local memory; current vLLM CPU releases do not support --tensor-parallel-size=6. |
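A minimal sketch of the rule in this row, assuming the NUMA node count is already known (for example from `lscpu`). The function name `suggest_tensor_parallel_size` is hypothetical, and the only unsupported value encoded here is the `6` mentioned in the table; what to use instead on a six-node system is not specified in the PR, so the sketch simply raises an error.

```python
UNSUPPORTED_TP_SIZES = {6}  # per the table row quoted in this review


def suggest_tensor_parallel_size(numa_nodes):
    """Return a --tensor-parallel-size value, or None to keep the default."""
    if numa_nodes < 1:
        raise ValueError("numa_nodes must be at least 1")
    if numa_nodes == 1:
        return None  # single NUMA node: leave the flag unset
    if numa_nodes in UNSUPPORTED_TP_SIZES:
        raise ValueError(
            f"--tensor-parallel-size={numa_nodes} is listed as unsupported"
        )
    return numa_nodes


print(suggest_tensor_parallel_size(1))  # → None
print(suggest_tensor_parallel_size(2))  # → 2
```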

rsiyer-intel · 2026-05-30T00:10:35Z

+
+This summarizes the official benchmarking and tuning guidance from the vLLM documentation, with a CPU focus. Always consult the [official benchmarking docs](https://docs.vllm.ai/en/latest/benchmarking/cli/) for the latest recommendations and tools.
+
+Start the Docker. If it is running in the foreground, open another terminal for these checks:


Suggested change

Start the Docker. If it is running in the foreground, open another terminal for these checks:

Start the Docker container. If it is running in the foreground, open another terminal for these checks:

rsiyer-intel · 2026-05-30T00:12:17Z

+Start the Docker. If it is running in the foreground, open another terminal for these checks:
+
+```bash
+# Docker path, because the server above is named vllm-cpu.


Suggested change

# Docker path, because the server above is named vllm-cpu.

# Docker path, because the container above is named vllm-cpu.

rsiyer-intel · 2026-05-30T00:36:48Z

+```
+
+### Benchmark
+


Can you add a few lines explaining what `vllm bench serve` is before the Docker commands? Is it a subset of the Benchmark Suite that appears later?
Also, the execution context changes frequently between the host/native install, the Docker container, and the virtualenv. Someone may end up running a host command in the container, or vice versa.
For example, the concurrency sweep is provided for the native environment.
Also, do the utility tools, hardware validation, and so on apply before running the vLLM benchmark suite as well, or only to `vllm bench serve`?
Once the document has a section/subsection hierarchy, this will be clearer.

lucasmelogithub added 8 commits May 5, 2026 13:31

Initial vLLM Recipe

4672221

Enahnced Readme

3dc0dc5

Enahnced Readme, TOC

e49fc00

Enahnced Readme, make it simpler.

8bc119a

Enhance docs

751a92d

Enhance Readme and update skill

8a32f4b

Fix title

de30726

Fix wording

14d64c5

rsiyer-intel reviewed May 29, 2026

rsiyer-intel reviewed May 30, 2026

rsiyer-intel requested changes May 30, 2026

lucasmelogithub wants to merge 8 commits into intel:main from lucasmelogithub:lmelo-vllm

2 participants

Suggested change (rsiyer-intel, on the `VLLM_CPU_NUM_OF_RESERVED_CPU` row)

| `VLLM_CPU_NUM_OF_RESERVED_CPU` | `1` | Sets a core for API serving, tokenization, networking, logging, and OS work. |

| `VLLM_CPU_NUM_OF_RESERVED_CPU` | `1` | Reserves one core for API serving, tokenization, networking, logging, and OS work. |
