diff --git a/README.md b/README.md
index bffdb79..fd0bd37 100644
--- a/README.md
+++ b/README.md
@@ -44,6 +44,7 @@ We aim to provide a dynamic resource where users can find the latest optimizatio
- [ResNet50 – Computer Vision](software/tensorflow/computer-vision-resnet50/README.md)
- [BERT – NLP](software/tensorflow/nlp-transformers-bert/README.md)
- [RGAT – Graph Neural Networks](software/tensorflow/graph-neural-networks-rgat/README.md)
+ - [vLLM](software/vllm/README.md)
- Workloads
- [Cassandra Stress](workloads/cassandra-stress/README.md)
- [HPC](workloads/hpc/README.md)
diff --git a/software/vllm/README.md b/software/vllm/README.md
new file mode 100644
index 0000000..9a473be
--- /dev/null
+++ b/software/vllm/README.md
@@ -0,0 +1,312 @@
+# vLLM on Intel Xeon Processors
+
+This guide provides recommendations for running vLLM on Intel Xeon processors.
+
+## Upstream First
+
+Intel invests significant effort in upstreaming code optimizations and documentation directly to the official vLLM repositories. Those upstream contributions form the foundation of Intel Xeon CPU performance in vLLM. This guide is a small extension of that work, collecting practical deployment tips in one place. Always consult the official documentation for the latest guidance.
+
+- [vLLM CPU installation guide](https://docs.vllm.ai/en/stable/getting_started/installation/cpu/)
+- [vLLM SLM/LLM Recipes](https://recipes.vllm.ai/)
+- [vLLM Benchmarking](https://docs.vllm.ai/en/stable/benchmarking/cli/)
+
+## Use with AI Coding Agents
+
+This recipe ships a companion [Agent Skill](./skill/SKILL.md) (`vllm-xeon-cpu`) that lets AI coding agents — GitHub Copilot, Claude Code, and other `AGENTS.md`-aware tools — deploy, tune, validate, and benchmark vLLM on Intel Xeon CPUs on a customer's behalf. The skill is a self-contained, markdown-only payload under [`skill/`](./skill/) that you copy into your own workspace or user profile.
+
+> **Folder name must match `name`.** When you install the skill, the destination folder **must** be named `vllm-xeon-cpu` (matching the `name:` field in the skill's frontmatter). Otherwise the agent will not discover it.
+
+Install the skill once per workspace or user profile. Pick the install path for your agent runtime:
+
+| Runtime | Install path | Notes |
+| --- | --- | --- |
+| GitHub Copilot (workspace) | `.github/skills/vllm-xeon-cpu/` | Shared with everyone working in the repo and with the Copilot coding agent on PRs / issues. |
+| GitHub Copilot (personal) | `~/.copilot/skills/vllm-xeon-cpu/` | Available across all your workspaces; not shared. |
+| Claude Code (workspace) | `.claude/skills/vllm-xeon-cpu/` | Shared via the repo. |
+
+### GitHub Copilot — Repo Workspace
+
+```bash
+mkdir -p .github/skills/vllm-xeon-cpu
+curl -L https://github.com/intel/optimization-zone/archive/refs/heads/main.tar.gz \
+ | tar -xz --strip-components=4 -C .github/skills/vllm-xeon-cpu \
+ optimization-zone-main/software/vllm/skill
+```
+
+### GitHub Copilot — User Profile
+
+```bash
+mkdir -p ~/.copilot/skills/vllm-xeon-cpu
+curl -L https://github.com/intel/optimization-zone/archive/refs/heads/main.tar.gz \
+ | tar -xz --strip-components=4 -C ~/.copilot/skills/vllm-xeon-cpu \
+ optimization-zone-main/software/vllm/skill
+```
+
+### Claude Code — Repo Workspace
+
+```bash
+mkdir -p .claude/skills/vllm-xeon-cpu
+curl -L https://github.com/intel/optimization-zone/archive/refs/heads/main.tar.gz \
+ | tar -xz --strip-components=4 -C .claude/skills/vllm-xeon-cpu \
+ optimization-zone-main/software/vllm/skill
+```
+
+After installation, invoke the skill from chat with `/vllm-xeon-cpu`, or let the agent auto-load it when your request matches keywords such as "vLLM" or "Xeon".
+
+## Intel Xeon SLM/LLM Sizing Guidance
+
+For guidance on sizing SLMs and LLMs on Intel Xeon CPUs, see the Intel Xeon AI Performance Advisor and the Intel AI Software Catalog:
+
+- [Cloud Intel Xeon AI Performance Advisor](https://xeonprocessoradvisor.intel.com/csp-ai-performance-advisor)
+- [On-prem Intel Xeon AI Performance Advisor](https://xeonprocessoradvisor.intel.com/on-prem-ai-performance-advisor)
+- [Intel AI Software Catalog - Model Guidance](https://swcatalog.intel.com/models)
+
+## vLLM Requirements Guidance
+
+| Item | Guidance |
+| --- | --- |
+| OS | Linux |
+| Python | 3.10 through 3.13 |
+| vLLM | `v0.17.0` or newer |
+| Intel AMX Xeon CPU Flags | 4th Gen or newer with `amx_tile`, `amx_bf16`, and `amx_int8` for best BF16/INT8 performance |
+
+## Performance Guidance
+
+| Setting | Guidance | Why it matters |
+| --- | --- | --- |
+| `--dtype=bfloat16` | Use `bfloat16` on Intel Xeon with Intel AMX | Enables the preferred CPU dtype and AMX |
+| `VLLM_CPU_KVCACHE_SPACE` | `20` to `40` GiB or larger | Larger values allow more concurrency and longer context, but the cache must fit in each NUMA node's local memory. |
+| `VLLM_CPU_OMP_THREADS_BIND` | `auto` | Binds OpenMP worker threads to NUMA-local cores. `auto` is preferred; for manual control, use ranges such as `0-31\|32-63`. |
+| `VLLM_CPU_NUM_OF_RESERVED_CPU` | `1` | Reserves a core for API serving, tokenization, networking, logging, and OS work. |
+| `--tensor-parallel-size` | Use default for single NUMA or set to NUMA node count | Keeps model shards close to local memory; current vLLM CPU releases do not support `6`. |
+| `--max-num-batched-tokens` | Online: `2048`; offline: `4096` | Maximum number of batched tokens per iteration. Tune for prefill throughput and time to first token. |
+| `--max-num-seqs` | Online: `128`; offline: `256` | Maximum number of sequences per iteration. Tune for decode throughput and inter-token latency. |
+| `VLLM_CPU_SGL_KERNEL` | `0`, or try `1` for low-latency SLM serving | Experimental x86 small-batch kernels; requires AMX, BF16 weights, and compatible shapes. |
+
+## Utility Tools
+
+Install the small host tools used by the commands below:
+
+```bash
+sudo apt-get update
+sudo apt-get install -y --no-install-recommends curl git jq numactl htop python3-venv python3-full g++ python3-dev
+```
+
+## Hardware Validation
+
+Validate the CPU model, core count, thread count, NUMA topology, and important flags such as `avx512f`, `avx2`, `amx_tile`, `amx_bf16`, `amx_int8`, and `avx512_bf16`.
+
+```bash
+lscpu | grep -E "Model name|Socket|Core|Thread|NUMA node|Flags"
+lscpu | grep -E "avx512f|avx2|amx_(tile|bf16|int8)|avx512_bf16"
+numactl --hardware
+```
+
+## Fast Path: Docker
+
+
+(Optional) Install Docker on Ubuntu 24.04
+
+```bash
+sudo apt-get update
+sudo apt-get install -y docker.io
+sudo systemctl enable --now docker
+sudo usermod -aG docker $USER
+newgrp docker # apply group without re-login
+```
+
+
+
+```bash
+export HF_TOKEN=your_hf_token_here # <<<=== Required for gated Hugging Face models and faster downloads.
+export VLLM_VERSION=0.20.2 # <<<=== Update this to the latest vLLM release before running.
+docker pull vllm/vllm-openai-cpu:v${VLLM_VERSION}-x86_64
+
+docker run --rm \
+ --name vllm-cpu \
+ --security-opt seccomp=unconfined \
+ --cap-add SYS_NICE \
+ --shm-size=8g \
+ -p 8000:8000 \
+ -e HF_TOKEN="${HF_TOKEN}" \
+ -e VLLM_CPU_KVCACHE_SPACE=20 \
+ -e VLLM_CPU_OMP_THREADS_BIND=auto \
+ -e VLLM_CPU_NUM_OF_RESERVED_CPU=1 \
+ vllm/vllm-openai-cpu:v${VLLM_VERSION}-x86_64 \
+ RedHatAI/Qwen3-4B-Instruct-2507-quantized.w8a8 \
+ --dtype=bfloat16 \
+ --max-num-batched-tokens 2048 \
+ --max-num-seqs 128
+```
+
+`SYS_NICE` and `seccomp=unconfined` allow vLLM's NUMA memory policy calls inside Docker. Without them, serving can still work, but NUMA placement may be weaker and logs can show `get_mempolicy: Operation not permitted`.
+
+## Validate the OpenAI-compatible endpoint
+
+**Open a new terminal or use a remote system.**
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+ -H "Content-Type: application/json" \
+ -d '{
+ "model": "RedHatAI/Qwen3-4B-Instruct-2507-quantized.w8a8",
+ "messages": [{"role": "user", "content": "Give three CPU inference tuning tips."}],
+ "max_tokens": 128
+ }'
+```
+
+## Benchmarking Guidance
+
+This section summarizes the official benchmarking and tuning guidance from the vLLM documentation, with a CPU focus. Always consult the [official benchmarking docs](https://docs.vllm.ai/en/latest/benchmarking/cli/) for the latest recommendations and tools.
+
+Start the Docker container. If it is running in the foreground, open another terminal for these checks:
+
+```bash
+# Docker path, because the server above is named vllm-cpu.
+docker exec vllm-cpu vllm collect-env
+
+curl -s http://localhost:8000/v1/models | jq .
+SERVER_PID=$(pgrep -f 'vllm serve|api_server' | head -n 1)
+numastat -p "${SERVER_PID}"
+```
+
+### Benchmark
+
+Use `vllm bench serve` to measure TTFT, TPOT, and throughput against the running server. Warm up with `--num-warmups` to avoid measuring JIT compilation overhead.
+
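+If you started the server with Docker (as shown above), run the benchmark **inside the container**. Results are written to `./bench-results` inside the container, so copy them out with `docker cp` before stopping a container started with `--rm`: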
+
+```bash
+docker exec vllm-cpu vllm bench serve \
+ --model RedHatAI/Qwen3-4B-Instruct-2507-quantized.w8a8 \
+ --dataset-name random \
+ --random-input-len 128 \
+ --random-output-len 128 \
+ --num-prompts 100 \
+ --num-warmups 5 \
+ --request-rate inf \
+ --save-result \
+ --result-dir ./bench-results \
+ --percentile-metrics ttft,tpot,itl
+```
+
+If you installed vLLM natively (using the CPU-specific wheel, not the generic `pip install vllm` wheel), run directly on the host:
+
+```bash
+vllm bench serve \
+ --model RedHatAI/Qwen3-4B-Instruct-2507-quantized.w8a8 \
+ --dataset-name random \
+ --random-input-len 128 \
+ --random-output-len 128 \
+ --num-prompts 100 \
+ --num-warmups 5 \
+ --request-rate inf \
+ --save-result \
+ --result-dir ./bench-results \
+ --percentile-metrics ttft,tpot,itl
+```
+
+> **Troubleshooting: "Failed to infer device type"** — This error means vLLM's platform detection cannot find the CPU backend. The most common cause is installing the generic (CUDA) wheel from PyPI via `pip install vllm` instead of the CPU-specific wheel. The CPU wheel includes `+cpu` in its version string (e.g., `0.20.2+cpu`), which the platform detector requires. Fix by reinstalling the CPU wheel directly:
+>
+> ```bash
+> export VLLM_VERSION=0.20.2
+> pip install --force-reinstall --extra-index-url https://download.pytorch.org/whl/cpu \
+> "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl"
+> ```
+
+### Concurrency Sweep
+
+With `--request-rate inf`, all prompts are sent at once, so `--num-prompts` directly controls the offered concurrency (the server still schedules at most `--max-num-seqs` sequences per iteration). Sweep to see how latency and throughput scale under increasing batch pressure:
+
+```bash
+for N in 10 50 100 200 500; do
+ vllm bench serve \
+ --model RedHatAI/Qwen3-4B-Instruct-2507-quantized.w8a8 \
+ --dataset-name random \
+ --random-input-len 128 \
+ --random-output-len 128 \
+ --num-prompts "${N}" \
+ --num-warmups 5 \
+ --request-rate inf \
+ --save-result \
+ --result-dir ./bench-results \
+ --percentile-metrics ttft,tpot,itl
+done
+```
+
+### Testing & Tuning Methodology
+
+- Test with different input/output lengths to understand how the model performs under different prompt and generation sizes. For example, try `--random-input-len` and `--random-output-len` values of `64`, `128`, `256`, and `512`.
+- Test with different user concurrency levels using `--num-prompts` values of `10`, `50`, `100`, `200`, and `500` with `--request-rate inf`.
+- Use one known-good model and change one knob at a time. Track TTFT, TPOT, output tokens per second, requests per second, peak RSS, NUMA locality, and OOM events.
+- Vary only one of `VLLM_CPU_KVCACHE_SPACE`, `VLLM_CPU_OMP_THREADS_BIND`, `--max-num-batched-tokens`, `--max-num-seqs`, or `--block-size` per run. Compare results across runs using the saved JSON files in `./bench-results`.
+
+### Using the vLLM Benchmark Suite
+
+The vLLM source tree includes a full performance benchmark harness at `.buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh`. This is the same script used in vLLM's CI to gate regressions. It reads a JSON test definition, generates concrete benchmark commands, and (optionally) executes them.
+
+Prepare a Python virtual environment with the CPU wheel and the Python packages the harness needs (`tabulate`, `pandas`):
+
+```bash
+export HF_TOKEN=your_hf_token_here # <<<=== Required for gated Hugging Face models and faster downloads.
+export VLLM_VERSION=0.20.2
+python3 -m venv ~/vllm-venv
+source ~/vllm-venv/bin/activate
+pip install --extra-index-url https://download.pytorch.org/whl/cpu \
+ "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl" \
+ tabulate pandas
+```
+
+Clone the source tree (or reuse the checkout from a source build):
+
+```bash
+git clone https://github.com/vllm-project/vllm.git vllm_source
+cd vllm_source
+export VLLM_TARGET_DEVICE=cpu
+```
+
+Run a dry-run first to inspect the generated commands without executing them:
+
+```bash
+source ~/vllm-venv/bin/activate
+HF_TOKEN="${HF_TOKEN}" \
+ON_CPU=1 \
+SERVING_JSON=serving-tests-cpu-text.json \
+DRY_RUN=1 \
+MODEL_FILTER=meta-llama/Llama-3.1-8B-Instruct \
+DTYPE_FILTER=bfloat16 \
+ bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
+```
+To execute the benchmark (remove `DRY_RUN=1`):
+
+```bash
+source ~/vllm-venv/bin/activate
+HF_TOKEN="${HF_TOKEN}" \
+ON_CPU=1 \
+SERVING_JSON=serving-tests-cpu-text.json \
+MODEL_FILTER=meta-llama/Llama-3.1-8B-Instruct \
+DTYPE_FILTER=bfloat16 \
+ bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
+```
+
+Key environment variables:
+
+| Variable | Purpose |
+| -------- | ------- |
+| `HF_TOKEN` | Hugging Face token — required by the script's `check_hf_token` gate |
+| `ON_CPU` | Set to `1` to use CPU-specific test configs |
+| `SERVING_JSON` | JSON file defining test matrix (e.g., `serving-tests-cpu-text.json`) |
+| `DRY_RUN` | Set to `1` to generate commands without executing |
+| `MODEL_FILTER` | Run only benchmarks matching this model ID |
+| `DTYPE_FILTER` | Run only benchmarks matching this dtype (e.g., `bfloat16`) |
+
+> **Note:** The `MODEL_FILTER` value must match an entry in the JSON test definition. If the model is not pre-curated in the CPU test JSON, you can add an entry or use the `vllm bench serve` approach above instead.
+
+## References
+
+- [vLLM CPU installation guide](https://docs.vllm.ai/en/stable/getting_started/installation/cpu/)
+- [vLLM CPU-supported models](https://docs.vllm.ai/en/stable/models/hardware_supported_models/cpu/)
+- [vLLM optimization and tuning guide](https://docs.vllm.ai/en/stable/configuration/optimization/)
+- [vLLM bench serve CLI](https://docs.vllm.ai/en/latest/cli/bench/serve.html)
+- [vLLM bench latency CLI](https://docs.vllm.ai/en/latest/cli/bench/latency.html)
+- [vLLM Intel quantization support](https://docs.vllm.ai/en/stable/features/quantization/inc/)
diff --git a/software/vllm/skill/SKILL.md b/software/vllm/skill/SKILL.md
new file mode 100644
index 0000000..a66278f
--- /dev/null
+++ b/software/vllm/skill/SKILL.md
@@ -0,0 +1,183 @@
+---
+name: vllm-xeon-cpu
+description: "Deploy, tune, validate, and benchmark vLLM on Intel Xeon CPUs (CPU-only inference, no GPU). USE FOR: serving and performance optimizing LLMs on Intel Xeon, vLLM CPU install, CPU inference tuning, AMX bfloat16 setup, NUMA pinning, VLLM_CPU_KVCACHE_SPACE, VLLM_CPU_OMP_THREADS_BIND, --dtype=bfloat16, vllm/vllm-openai-cpu Docker image, hardware validation for AMX (amx_tile, amx_bf16, amx_int8), KV cache sizing per NUMA node, --max-num-batched-tokens / --max-num-seqs tuning, vllm bench serve on CPU, TTFT/TPOT measurement. DO NOT USE FOR: GPU vLLM (use upstream vLLM docs), training, quantization tuning beyond INT8/AWQ pointers, model architecture selection (use Intel Xeon AI Performance Advisor), non-Xeon CPUs, vLLM source build deep-dives."
+---
+
+# vLLM on Intel Xeon CPUs
+
+- **Skill version**: 1.0
+- **Tested against vLLM**: `v0.20.2`
+- **Minimum vLLM**: `v0.17.0`
+
+## Upstream First
+
+Intel upstreams Xeon CPU optimizations directly to vLLM. This skill encodes deployment, tuning, validation, and a short benchmarking walkthrough — always consult upstream for the latest:
+
+- [vLLM CPU installation guide](https://docs.vllm.ai/en/stable/getting_started/installation/cpu/)
+- [vLLM SLM/LLM Recipes](https://recipes.vllm.ai/)
+- [vLLM optimization and tuning](https://docs.vllm.ai/en/stable/configuration/optimization/)
+- [vLLM bench serve CLI](https://docs.vllm.ai/en/latest/cli/bench/serve.html)
+
+## When to Use
+
+Invoke this skill when the user wants to:
+- Deploy or serve vLLM on an Intel Xeon CPU (no GPU).
+- Tune CPU-serving performance knobs (KV cache, OMP bind, batched tokens, num seqs).
+- Validate Xeon hardware (AMX flags, NUMA topology) before deploying.
+- Run a minimal CPU benchmark to measure TTFT / TPOT / throughput.
+
+**Do not use** for GPU vLLM, model training, deep quantization tuning, model selection (point users at the [Intel Xeon AI Performance Advisor](https://xeonprocessoradvisor.intel.com/csp-ai-performance-advisor)), or non-Xeon CPUs.
+
+## Prerequisites
+
+| Item | Requirement |
+| --- | --- |
+| OS | Linux |
+| Python | 3.10–3.13 (only if not using Docker) |
+| CPU | 4th Gen Intel Xeon or newer; must expose `amx_tile`, `amx_bf16`, `amx_int8` for best BF16/INT8 performance |
+| Tools | `curl`, `numactl`, `jq`, `g++`, `python3-dev` (`sudo apt-get install -y --no-install-recommends curl git jq numactl htop g++ python3-dev`). `g++` and Python headers are required by PyTorch inductor to JIT-compile CPU kernels. |
+| Docker | Recent Docker with `--cap-add SYS_NICE` and `--security-opt seccomp=unconfined` permitted |
+
+## Procedure 1 — Validate Hardware
+
+Goal: confirm Xeon generation, AMX support, and NUMA topology before deploying.
+
+1. Inspect CPU model, sockets, cores, threads, NUMA nodes:
+ ```bash
+ lscpu | grep -E "Model name|Socket|Core|Thread|NUMA node"
+ ```
+2. Check for AMX and AVX-512 flags:
+ ```bash
+ lscpu | grep -E "amx_(tile|bf16|int8)|avx512_bf16|avx512f|avx2"
+ ```
+ - **All of `amx_tile`, `amx_bf16`, `amx_int8` present** → proceed; BF16 AMX kernels will activate.
+ - **AMX missing** → warn the user. vLLM will still run, but inference throughput will be substantially lower. Recommend a 4th Gen Xeon (Sapphire Rapids) or newer.
+3. Inspect NUMA topology:
+ ```bash
+ numactl --hardware
+ ```
+ Record the NUMA node count `N` — it drives `--tensor-parallel-size` and KV cache sizing.
+
+## Procedure 2 — Deploy (Docker Fast Path)
+
+Goal: serve a model via the official `vllm/vllm-openai-cpu` image with Xeon-tuned env vars.
+
+0. (Optional) Install Docker if not present (Ubuntu 24.04):
+ ```bash
+ sudo apt-get update
+ sudo apt-get install -y docker.io
+ sudo systemctl enable --now docker
+ sudo usermod -aG docker $USER
+ newgrp docker # apply group without re-login
+ ```
+1. Pin a release tag (do not use `latest-x86_64` in production):
+ ```bash
+ export VLLM_VERSION=0.20.2 # update to the latest release that meets the minimum above
+ docker pull vllm/vllm-openai-cpu:v${VLLM_VERSION}-x86_64
+ ```
+2. Run the container with the required Xeon env vars and Docker capabilities:
+ ```bash
+ docker run --rm \
+ --name vllm-cpu \
+ --security-opt seccomp=unconfined \
+ --cap-add SYS_NICE \
+ --shm-size=8g \
+ -p 8000:8000 \
+ -e HF_TOKEN="${HF_TOKEN}" \
+ -e VLLM_CPU_KVCACHE_SPACE=40 \
+ -e VLLM_CPU_OMP_THREADS_BIND=auto \
+ -e VLLM_CPU_NUM_OF_RESERVED_CPU=1 \
+ vllm/vllm-openai-cpu:v${VLLM_VERSION}-x86_64 \
+ <model-id> \
+ --dtype=bfloat16 \
+ --max-num-batched-tokens 2048 \
+ --max-num-seqs 128
+ ```
+ - `SYS_NICE` + `seccomp=unconfined` enable vLLM's NUMA memory-policy calls. Without them serving still works but logs may show `get_mempolicy: Operation not permitted` and NUMA placement weakens.
+ - `VLLM_CPU_KVCACHE_SPACE` is in GiB **per NUMA node** — must fit in node-local memory.
+ - `VLLM_CPU_OMP_THREADS_BIND=auto` binds OpenMP workers to NUMA-local cores. For manual control use ranges like `0-31|32-63`.
+ - `VLLM_CPU_NUM_OF_RESERVED_CPU=1` keeps a core free for API serving, tokenization, networking, and OS work.
+3. Validate the OpenAI-compatible endpoint:
+ ```bash
+ curl http://localhost:8000/v1/chat/completions \
+ -H "Content-Type: application/json" \
+ -d '{
+ "model": "<model-id>",
+ "messages": [{"role": "user", "content": "Give three CPU inference tuning tips."}],
+ "max_tokens": 128
+ }'
+ ```
+
+## Procedure 3 — Tune Performance Knobs
+
+Goal: improve TTFT / TPOT / throughput methodically. **Change one knob per run** and compare.
+
+1. Pick the use case:
+ - **Online serving** → start `--max-num-batched-tokens 2048`, `--max-num-seqs 128`.
+ - **Offline batch** → start `--max-num-batched-tokens 4096`, `--max-num-seqs 256`.
+2. Set `--tensor-parallel-size`:
+ - Single NUMA node → leave default.
+ - Multi NUMA → set to the NUMA node count `N` from Procedure 1.
+ - **`--tensor-parallel-size=6` is currently unsupported on CPU; avoid it.**
+3. Size `VLLM_CPU_KVCACHE_SPACE` (GiB per NUMA node):
+ - Larger value → more concurrency / longer context, but must fit in node-local RAM.
+ - If the server OOMs or pages, halve and retry.
+4. Bind OpenMP threads with `VLLM_CPU_OMP_THREADS_BIND`:
+ - Prefer `auto`. Use manual ranges (`0-31|32-63`) only when `auto` mis-pins (verify with `numastat -p $(pgrep -f 'vllm serve|api_server' | head -n1)`).
+5. (Experimental) Low-latency small-batch serving:
+ - `VLLM_CPU_SGL_KERNEL=1` enables x86 small-batch kernels. Requires AMX, BF16 weights, and compatible shapes.
+6. Quantization (when functional quality is acceptable):
+ - Try INT8 or AWQ to reduce weight memory and memory-bandwidth pressure. Validate quality before promoting.
+
+Full knob reference: [tuning matrix](./references/tuning-matrix.md).
+
+## Procedure 4 — Benchmark (Minimal CPU Walkthrough)
+
+Goal: measure TTFT, TPOT, and throughput against the running server with a reproducible warm-up.
+
+1. Confirm the server is reachable and inspect environment:
+ ```bash
+ curl -s http://localhost:8000/v1/models | jq .
+ docker exec vllm-cpu vllm collect-env # or `vllm collect-env` for native installs
+ SERVER_PID=$(pgrep -f 'vllm serve|api_server' | head -n 1)
+ numastat -p "${SERVER_PID}" # verify NUMA locality
+ ```
+2. Run `vllm bench serve` with warm-ups (warm-ups avoid measuring JIT/compile overhead):
+ ```bash
+ vllm bench serve \
+ --model <model-id> \
+ --dataset-name random \
+ --random-input-len 128 \
+ --random-output-len 128 \
+ --num-prompts 100 \
+ --num-warmups 5 \
+ --request-rate inf \
+ --save-result \
+ --result-dir ./bench-results \
+ --percentile-metrics ttft,tpot,itl
+ ```
+3. Sweep methodically — **one variable per run**:
+ - Vary input/output lengths: `64`, `128`, `256`, `512`.
+ - Vary concurrency via `--num-prompts`: `10`, `100`, `1000`.
+ - Track TTFT, TPOT, output tokens/sec, requests/sec, peak RSS, NUMA locality, and any OOM events.
+ - When tuning, change only one of `VLLM_CPU_KVCACHE_SPACE`, `VLLM_CPU_OMP_THREADS_BIND`, `--max-num-batched-tokens`, `--max-num-seqs`, or `--block-size` per run. Compare results across runs using the saved JSON files in `./bench-results`.
+4. For the full CI-grade harness (`run-performance-benchmarks.sh` with `ON_CPU=1`, `SERVING_JSON`, `DRY_RUN`, `MODEL_FILTER`, `DTYPE_FILTER`), see the upstream [vLLM benchmarking docs](https://docs.vllm.ai/en/latest/benchmarking/cli/).
+
+## Output the Agent Should Produce
+
+After running these procedures, return to the user:
+- The hardware validation summary (Xeon generation, AMX flags present, NUMA node count).
+- The exact `docker run` command used, with values chosen for their hardware.
+- Any tuning recommendations with the **one knob changed** per recommendation and the expected metric impact.
+- Benchmark numbers (TTFT, TPOT, throughput) with the corresponding configuration.
+
+## References
+
+- [Tuning matrix](./references/tuning-matrix.md) — full env-var / CLI knob table with guard rails.
+- [vLLM CPU installation guide](https://docs.vllm.ai/en/stable/getting_started/installation/cpu/)
+- [vLLM CPU-supported models](https://docs.vllm.ai/en/stable/models/hardware_supported_models/cpu/)
+- [vLLM optimization and tuning guide](https://docs.vllm.ai/en/stable/configuration/optimization/)
+- [vLLM Intel quantization support](https://docs.vllm.ai/en/stable/features/quantization/inc/)
+- [Intel Xeon AI Performance Advisor (cloud)](https://xeonprocessoradvisor.intel.com/csp-ai-performance-advisor)
+- [Intel Xeon AI Performance Advisor (on-prem)](https://xeonprocessoradvisor.intel.com/on-prem-ai-performance-advisor)
+- [Intel AI Software Catalog — Model Guidance](https://swcatalog.intel.com/models)
diff --git a/software/vllm/skill/references/tuning-matrix.md b/software/vllm/skill/references/tuning-matrix.md
new file mode 100644
index 0000000..5467e32
--- /dev/null
+++ b/software/vllm/skill/references/tuning-matrix.md
@@ -0,0 +1,34 @@
+# vLLM Xeon CPU Tuning Matrix
+
+Full reference for environment variables and CLI flags relevant to vLLM CPU serving on Intel Xeon, with guard rails. Use alongside [SKILL.md](../SKILL.md) Procedure 3.
+
+## Environment Variables
+
+| Variable | Recommended | Guidance | Why it matters |
+| --- | --- | --- | --- |
+| `VLLM_CPU_KVCACHE_SPACE` | `40` (GiB) | Per **NUMA node**. Increase for more concurrency / longer context; must fit in node-local memory. Halve if the server OOMs or pages. | KV cache is the dominant CPU memory consumer; under-sizing throttles batching, over-sizing causes paging or OOM. |
+| `VLLM_CPU_OMP_THREADS_BIND` | `auto` | Binds OpenMP workers to NUMA-local cores. Manual ranges look like `0-31\|32-63` (one range per NUMA node). Verify with `numastat -p <pid>`. | Cross-NUMA memory traffic kills decode throughput. |
+| `VLLM_CPU_NUM_OF_RESERVED_CPU` | `1` | Reserves cores for the API server, tokenization, networking, logging, and OS work. Raise on noisy hosts. | Prevents OS / serving overhead from preempting OMP workers. |
+| `VLLM_CPU_SGL_KERNEL` | `0` (try `1` for low-latency SLM) | Experimental x86 small-batch kernels. Requires AMX, BF16 weights, and compatible shapes. | Can reduce latency for small-batch serving, but is shape-sensitive. |
+| `HF_TOKEN` | *(secret)* | Required for gated Hugging Face models. | Authentication. |
+
+## CLI Flags (`vllm serve` / Docker CMD)
+
+| Flag | Recommended | Guidance | Why it matters |
+| --- | --- | --- | --- |
+| `--dtype=bfloat16` | always on AMX-capable Xeon | Enables AMX BF16 kernels — the preferred CPU dtype. | Largest single performance lever on 4th Gen+ Xeon. |
+| `--tensor-parallel-size` | default for single NUMA; `N` for `N` NUMA nodes | Keeps shards local to NUMA memory. **`6` is currently unsupported on CPU.** | Wrong value forces cross-NUMA traffic or fails to start. |
+| `--max-num-batched-tokens` | `2048` online / `4096` offline | Cap on batched tokens per iteration. Higher → better prefill throughput, worse TTFT. | Tradeoff between TTFT and prefill throughput. |
+| `--max-num-seqs` | `128` online / `256` offline | Cap on concurrent sequences. Higher → better decode throughput, worse ITL. | Tradeoff between ITL and decode throughput. |
+| `--block-size` | leave default until baseline is recorded | Tune only after KV cache / OMP / batched-tokens are stable. | Interacts with KV cache layout; change last. |
+
+## Guard Rails
+
+- **One knob per run.** Vary only one of `VLLM_CPU_KVCACHE_SPACE`, `VLLM_CPU_OMP_THREADS_BIND`, `--max-num-batched-tokens`, `--max-num-seqs`, or `--block-size` between benchmark runs. Save results to JSON and compare.
+- **Per-NUMA fit.** `VLLM_CPU_KVCACHE_SPACE` is per NUMA node, so the total KV cache reservation is `value × NUMA_node_count`. Confirm that each node has enough free memory against the `numactl --hardware` output.
+- **NUMA locality check.** After starting the server: `numastat -p $(pgrep -f 'vllm serve|api_server' | head -n1)`. Memory should be concentrated on the expected node(s); large `other_node` numbers indicate mis-binding.
+- **Docker capabilities.** Without `--cap-add SYS_NICE` and `--security-opt seccomp=unconfined`, vLLM cannot set NUMA memory policy; you will see `get_mempolicy: Operation not permitted` in logs and weaker placement.
+- **Unsupported TP.** `--tensor-parallel-size=6` is currently unsupported on CPU. Use `2`, `4`, or `8` depending on socket / NUMA layout.
+- **Reserved cores.** With `VLLM_CPU_NUM_OF_RESERVED_CPU=1`, OMP workers will land on the remaining cores. If serving latency spikes under load, raise the reserved count before re-running benchmarks.
+- **AMX absent.** If `lscpu` does not list `amx_tile`, `amx_bf16`, and `amx_int8`, BF16 inference falls back to slower AVX-512 paths and throughput drops sharply. Warn the user and recommend a 4th Gen Xeon or newer rather than further tuning.
+- **Quantization order.** Validate functional quality with BF16 first; only then evaluate INT8 / AWQ to reduce weight memory and bandwidth.