Open
42 commits
6d138a4
[ROCm][CI] Add ROCm Docker Hub registry cache and weekly cleanup pipe…
AndreasKaratzas Mar 13, 2026
301adfa
[ROCm][CI] Add ROCm Docker Hub registry cache and weekly cleanup pipe…
AndreasKaratzas Mar 13, 2026
334f25e
[ROCm][CI] Add ROCm Docker Hub registry cache and weekly cleanup pipe…
AndreasKaratzas Mar 13, 2026
0101524
[ROCm][CI] Add ROCm Docker Hub registry cache and weekly cleanup pipe…
AndreasKaratzas Mar 14, 2026
586ba38
[ROCm][CI] Add ROCm Docker Hub registry cache and weekly cleanup pipe…
AndreasKaratzas Mar 14, 2026
0f3bc3b
Merge remote-tracking branch 'origin/main' into akaratza_optimize_doc…
AndreasKaratzas Mar 14, 2026
8728583
[ROCm][CI] Add ROCm Docker Hub registry cache and weekly cleanup pipe…
AndreasKaratzas Mar 14, 2026
e2d5bf7
[ROCm][CI] Add ROCm Docker Hub registry cache and weekly cleanup pipe…
AndreasKaratzas Mar 15, 2026
f841d98
[ROCm][CI] Add ROCm Docker Hub registry cache and weekly cleanup pipe…
AndreasKaratzas Mar 15, 2026
a4cf9c4
[ROCm][CI] Wire CI_BASE_IMAGE into bake targets and Jinja build steps
AndreasKaratzas Mar 26, 2026
36ba5ea
[ROCm][CI] Wire CI_BASE_IMAGE into bake targets and Jinja build steps
AndreasKaratzas Mar 26, 2026
d84d006
[ROCm][CI] Wire CI_BASE_IMAGE into bake targets and Jinja build steps
AndreasKaratzas Mar 26, 2026
e751468
[ROCm][CI] Wire CI_BASE_IMAGE into bake targets and Jinja build steps
AndreasKaratzas Mar 26, 2026
419f8e3
[ROCm][CI] Add ROCm wheel export and Buildkite artifact upload for CI…
AndreasKaratzas Mar 31, 2026
153946d
[ROCm][CI] Switch cache export to mode=min to avoid large layer timeout
AndreasKaratzas Mar 31, 2026
355f4e6
Enforce AMD ci_base readiness and publish arch wheel artifacts
AndreasKaratzas Apr 1, 2026
e228b76
[ROCm][CI] Add AMD legacy-mode switch and remove legacy wheel alias
AndreasKaratzas Apr 1, 2026
537116d
Rename to multi-arch
AndreasKaratzas Apr 1, 2026
8164877
[ROCm] Align ROCm wheel export target with CI bake settings
AndreasKaratzas Apr 1, 2026
39e7710
[ROCm][CI] Fix ROCm num workers
AndreasKaratzas Apr 1, 2026
f8893ca
Use Docker Hub-only ROCm cache and verify rebuilt image labels
AndreasKaratzas Apr 1, 2026
5eba8f5
Retry AMD jobs only on agent-side infrastructure failures
AndreasKaratzas Apr 1, 2026
e9bca8c
Simplify AMD pipeline template and rename ROCm bake script paths
AndreasKaratzas Apr 3, 2026
7943c8a
Simplify AMD pipeline template and rename ROCm bake script paths
AndreasKaratzas Apr 3, 2026
8e2c26e
Refine AMD ROCm CI builds, suites, and legacy runner validation
AndreasKaratzas Apr 3, 2026
9be2f95
Refine AMD ROCm CI builds, suites, and legacy runner validation
AndreasKaratzas Apr 3, 2026
ab83bd8
Merge remote-tracking branch 'origin/main' into akaratza_optimize_doc…
AndreasKaratzas Apr 3, 2026
69dd00e
Refine AMD ROCm CI build
AndreasKaratzas Apr 3, 2026
26a837f
Refine AMD ROCm CI build
AndreasKaratzas Apr 3, 2026
df824e1
[CI] Remapped test dependencies
AndreasKaratzas Apr 12, 2026
48e0ff5
[ROCm][CI] Improving docker layer caching
AndreasKaratzas Apr 12, 2026
5326c2b
[ROCm][CI] Cleanup unecessary weekly ci_base cron and remap correctly…
AndreasKaratzas Apr 13, 2026
4c12724
[ROCm][CI] Deleted redundant copy of bake file and synced the vllm te…
AndreasKaratzas Apr 13, 2026
6e0250a
Merge remote-tracking branch 'origin/main' into akaratza_optimize_doc…
AndreasKaratzas Apr 19, 2026
e6cf0a3
Included torch nightly tests
AndreasKaratzas Apr 19, 2026
8a9eead
Merge remote-tracking branch 'origin/main' into akaratza_optimize_doc…
AndreasKaratzas Apr 28, 2026
19b6f3c
Removed torch nightly initiative
AndreasKaratzas Apr 28, 2026
88607b8
Use ROCm test artifacts in AMD pipeline
AndreasKaratzas Apr 28, 2026
fd5b90d
Build ROCm test image with artifact package
AndreasKaratzas Apr 29, 2026
d9365ff
Use ci_base for ROCm artifact test warmup
AndreasKaratzas Apr 29, 2026
c7d3369
Preserve ROCm full-image fallback for artifact tests
AndreasKaratzas Apr 29, 2026
5635184
[ROCm] Route AMD artifact jobs through ci_base
AndreasKaratzas May 5, 2026
100 changes: 72 additions & 28 deletions buildkite/bootstrap-amd.sh
@@ -27,6 +27,32 @@ if [[ -z "${COV_ENABLED:-}" ]]; then
     COV_ENABLED=0
 fi

+# ---------------------------------------------------------------------------
+# Helper functions
+# ---------------------------------------------------------------------------
+fetch_origin_ref() {
+    local ref="$1"
+    git fetch --no-tags --depth=50 origin "${ref}:refs/remotes/origin/${ref}" >/dev/null 2>&1 || \
+        git fetch --no-tags origin "${ref}:refs/remotes/origin/${ref}" >/dev/null 2>&1
+}
+
+get_pr_labels() {
+    if [[ "${BUILDKITE_PULL_REQUEST:-false}" == "false" ]]; then
+        return 0
+    fi
+
+    curl -fsSL "https://api.github.com/repos/vllm-project/vllm/pulls/$BUILDKITE_PULL_REQUEST" 2>/dev/null | \
+        jq -r '.labels[].name' 2>/dev/null || true
+}
+
+join_file_diff() {
+    if [[ -z "${1:-}" ]]; then
+        return 0
+    fi
+
+    printf '%s\n' "$1" | tr -d '\r' | paste -sd'|' -
+}
+
 # ---------------------------------------------------------------------------
 # Git setup: ensure origin/main is available and compute merge base once.
 # On K8s (blobless clones with --filter=blob:none), origin/main may not be
@@ -35,9 +61,14 @@ fi
 # ---------------------------------------------------------------------------
 git config --global --add safe.directory "$(pwd)" 2>/dev/null || true

+if git rev-parse --is-shallow-repository 2>/dev/null | grep -q "true"; then
+    echo "Shallow repository detected, deepening history..."
+    git fetch --no-tags --deepen=50 origin >/dev/null 2>&1 || true
+fi
+
 if ! git rev-parse --verify origin/main >/dev/null 2>&1; then
     echo "origin/main not found, fetching..."
-    git fetch origin main --depth=1 2>/dev/null || git fetch origin main || true
+    fetch_origin_ref main || true
 fi

 if [[ -z "${MERGE_BASE_COMMIT:-}" ]]; then
@@ -49,15 +80,11 @@ if [[ -z "${MERGE_BASE_COMMIT:-}" ]]; then
     fi
 fi

-# ---------------------------------------------------------------------------
-# Helper functions
-# ---------------------------------------------------------------------------
-
 fail_fast() {
     DISABLE_LABEL="ci-no-fail-fast"
     # If BUILDKITE_PULL_REQUEST != "false", then we check the PR labels using curl and jq
     if [ "$BUILDKITE_PULL_REQUEST" != "false" ]; then
-        PR_LABELS=$(curl -s "https://api.github.com/repos/vllm-project/vllm/pulls/$BUILDKITE_PULL_REQUEST" | jq -r '.labels[].name')
+        PR_LABELS=$(get_pr_labels)
         if [[ $PR_LABELS == *"$DISABLE_LABEL"* ]]; then
             echo false
         else
@@ -72,7 +99,7 @@ check_run_all_label() {
     RUN_ALL_LABEL="ready-run-all-tests"
     # If BUILDKITE_PULL_REQUEST != "false", then we check the PR labels using curl and jq
     if [ "$BUILDKITE_PULL_REQUEST" != "false" ]; then
-        PR_LABELS=$(curl -s "https://api.github.com/repos/vllm-project/vllm/pulls/$BUILDKITE_PULL_REQUEST" | jq -r '.labels[].name')
+        PR_LABELS=$(get_pr_labels)
         if [[ $PR_LABELS == *"$RUN_ALL_LABEL"* ]]; then
             echo true
         else
@@ -107,16 +134,20 @@ upload_pipeline() {
     # Install minijinja
     ls .buildkite || buildkite-agent annotate --style error 'Please merge upstream main branch for buildkite CI'
     curl -sSfL https://github.com/mitsuhiko/minijinja/releases/download/2.3.1/minijinja-cli-installer.sh | sh
-    source "$HOME/.cargo/env"
+    TEMPLATE_PATH=".buildkite/test-template-amd.j2"
+    CARGO_ENV="${CARGO_HOME:-$HOME/.cargo}/env"
+    if [[ ! -f "$CARGO_ENV" ]]; then
+        echo "Error: Cargo env file not found at $CARGO_ENV"
+        exit 1
+    fi
+    # shellcheck disable=SC1090
+    source "$CARGO_ENV"

     if [[ $BUILDKITE_PIPELINE_SLUG == "fastcheck" ]]; then
         AMD_MIRROR_HW="amdtentative"
-        curl -o .buildkite/test-template.j2 \
-            "https://raw.githubusercontent.com/vllm-project/ci-infra/$VLLM_CI_BRANCH/buildkite/test-template-amd.j2?$(date +%s)"
-    else
-        curl -o .buildkite/test-template.j2 \
-            "https://raw.githubusercontent.com/vllm-project/ci-infra/$VLLM_CI_BRANCH/buildkite/test-template-amd.j2?$(date +%s)"
     fi
+    curl -fsSL -o "$TEMPLATE_PATH" \
+        "https://raw.githubusercontent.com/vllm-project/ci-infra/$VLLM_CI_BRANCH/buildkite/test-template-amd.j2?$(date +%s)"


     # (WIP) Use pipeline generator instead of jinja template
@@ -137,7 +168,7 @@ upload_pipeline() {
     (
         set -x
         # Output pipeline.yaml with all blank lines removed
-        minijinja-cli test-template.j2 test-amd.yaml \
+        minijinja-cli test-template-amd.j2 test-amd.yaml \
             -D branch="$BUILDKITE_BRANCH" \
             -D list_file_diff="$LIST_FILE_DIFF" \
             -D run_all="$RUN_ALL" \
@@ -160,9 +191,26 @@ upload_pipeline() {
 # ---------------------------------------------------------------------------
 # Compute file diff
 # ---------------------------------------------------------------------------
-file_diff=$(get_diff)
+if [[ $BUILDKITE_BRANCH == "main" ]] && ! git rev-parse --verify HEAD~1 >/dev/null 2>&1; then
+    echo "HEAD~1 not available on main, fetching one more commit..."
+    git fetch --no-tags --deepen=1 origin >/dev/null 2>&1 || true
+fi
+
+diff_unavailable=0
+if [[ $BUILDKITE_BRANCH == "main" ]] && ! git rev-parse --verify HEAD~1 >/dev/null 2>&1; then
+    echo "WARNING: Could not resolve HEAD~1 on main, falling back to run_all=1"
+    RUN_ALL=1
+    diff_unavailable=1
+fi
+
 if [[ $BUILDKITE_BRANCH == "main" ]]; then
-    file_diff=$(get_diff_main)
+    if [[ $diff_unavailable -eq 1 ]]; then
+        file_diff=""
+    else
+        file_diff=$(get_diff_main)
+    fi
+else
+    file_diff=$(get_diff)
 fi

 # ----------------------------------------------------------------------
@@ -183,13 +231,13 @@ if [[ "${DOCS_ONLY_DISABLE}" != "1" ]]; then
             docs_only=0
             break
         fi
-    done < <(printf '%s\n' "$file_diff" | tr ' ' '\n' | tr -d '\r')
+    done < <(printf '%s\n' "$file_diff" | tr -d '\r')

     if [[ "$docs_only" -eq 1 ]]; then
         buildkite-agent annotate ":memo: CI skipped — docs/Markdown/mkdocs-only changes detected

 \`\`\`
-$(printf '%s\n' "$file_diff" | tr ' ' '\n')
+$(printf '%s\n' "$file_diff" | tr -d '\r')
 \`\`\`" --style "info" || true
         echo "[docs-only] All changes are docs/**, *.md, or mkdocs.yaml. Exiting before pipeline upload."
         exit 0
@@ -206,25 +254,21 @@ patterns=(
"docker/Dockerfile.rocm_base"
"CMakeLists.txt"
"requirements/common.txt"
"requirements/cuda.txt"
"requirements/build.txt"
"requirements/test.txt"
"requirements/rocm.txt"
"requirements/rocm-build.txt"
"requirements/rocm-test.txt"
"requirements/build/rocm.txt"
"requirements/test/rocm.txt"
"setup.py"
"csrc/"
"cmake/"
)

ignore_patterns=(
"csrc/cpu"
"csrc/rocm"
"cmake/hipify.py"
"cmake/cpu_extension.cmake"
)

for file in $file_diff; do
while IFS= read -r file; do
[[ -z "$file" ]] && continue
# First check if file matches any pattern
matches_pattern=0
for pattern in "${patterns[@]}"; do
@@ -250,7 +294,7 @@ for file in $file_diff; do
             break
         fi
     fi
-done
+done < <(printf '%s\n' "$file_diff" | tr -d '\r')

 # Check for ready-run-all-tests label
 LABEL_RUN_ALL=$(check_run_all_label)
@@ -279,7 +323,7 @@ fi
 if [[ $RUN_ALL -eq 1 ]]; then
     LIST_FILE_DIFF="run_all"
 else
-    LIST_FILE_DIFF=$(echo "$file_diff" | tr ' ' '|')
+    LIST_FILE_DIFF=$(join_file_diff "$file_diff")
 fi

 upload_pipeline
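
Review note: the change from space-splitting to newline-delimited handling can be illustrated outside the shell. The following is a minimal Python sketch, not part of this PR, that approximates what the `join_file_diff` helper in the diff appears to do (strip carriage returns, then join the lines with `|`). The Python function name only mirrors the shell helper, and the sample paths are made up for illustration.

```python
def join_file_diff(file_diff: str) -> str:
    """Approximate the shell helper: drop CRs, then join lines with '|'."""
    if not file_diff:
        # The shell helper returns early and prints nothing for empty input.
        return ""
    cleaned = file_diff.replace("\r", "")
    return "|".join(cleaned.split("\n"))


if __name__ == "__main__":
    # Hypothetical changed-file list; note the path that contains a space.
    sample = "docs/my notes.md\r\nsetup.py"
    print(join_file_diff(sample))
```

Under this reading, a path containing a space stays intact, whereas the removed `tr ' ' '|'` call replaces every space, including spaces inside a path.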
38 changes: 31 additions & 7 deletions buildkite/pipeline_generator/buildkite_step.py
@@ -27,7 +27,9 @@ class BuildkiteCommandStep(BaseModel):
def to_yaml(self):
return {
"label": self.label,
"key": self.key,
"group": self.group,
"agents": self.agents,
"commands": self.commands,
"depends_on": self.depends_on,
"soft_fail": self.soft_fail,
@@ -267,16 +269,16 @@ def convert_group_step_to_buildkite_step(

         # Create AMD mirror step and its block step if specified/applicable
         if step.mirror and step.mirror.get("amd"):
-            amd_block_step = None
+            amd_step = _create_amd_mirror_step(step, step_commands, step.mirror["amd"])
+            # Block step depends on the shared AMD image build.
+            mirror_build_dep = amd_step.depends_on[0] if amd_step.depends_on else "image-build-amd"
             amd_block_step = BuildkiteBlockStep(
                 block=f"Run AMD: {step.label}",
-                depends_on=["image-build-amd"],
+                depends_on=[mirror_build_dep],
                 key=f"block-amd-{_generate_step_key(step.label)}",
             )
             amd_mirror_steps.append(amd_block_step)
-            amd_step = _create_amd_mirror_step(step, step_commands, step.mirror["amd"])
-            if amd_block_step:
-                amd_step.depends_on.extend([amd_block_step.key])
+            amd_step.depends_on.append(amd_block_step.key)
             amd_mirror_steps.append(amd_step)

         buildkite_group_steps.append(
@@ -304,6 +306,14 @@ def _step_should_run(step: Step, list_file_diff: List[str]) -> bool:
         return False
     global_config = get_global_config()
     if step.key and step.key.startswith("image-build"):
+        # The shared AMD image build stays on-demand for non-main branches,
+        # except on scheduled nightlies where it should run automatically.
+        if (
+            step.key == "image-build-amd"
+            and global_config["branch"] != "main"
+            and global_config["nightly"] != "1"
+        ):
+            return False
         return True
     if global_config["nightly"] == "1":
         return True
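
Review note: the gating added to `_step_should_run` can be read as a small truth table. The sketch below is a standalone illustration, not code from this PR; the function name `amd_image_build_should_run` is hypothetical, and it assumes the caller has already established that the step key starts with `image-build`.

```python
def amd_image_build_should_run(step_key: str, branch: str, nightly: str) -> bool:
    """Hypothetical standalone version of the image-build branch in the hunk above.

    Assumes step_key already starts with "image-build".
    """
    if step_key == "image-build-amd" and branch != "main" and nightly != "1":
        # Non-main, non-nightly builds leave the shared AMD image build on-demand.
        return False
    return True


if __name__ == "__main__":
    for branch, nightly in [("main", "0"), ("feature", "0"), ("feature", "1")]:
        print(branch, nightly, amd_image_build_should_run("image-build-amd", branch, nightly))
```

Other `image-build*` keys are unaffected by the new condition and still return `True`.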
@@ -377,16 +387,30 @@ def _create_amd_mirror_step(step: Step, original_commands: List[str], amd: Dict[
         DeviceType.AMD_MI355_8: AgentQueue.AMD_MI355_8,
     }

+    build_dep = "image-build-amd"
+
     amd_queue = amd_queue_map.get(amd_device)
     if not amd_queue:
         raise ValueError(f"Invalid AMD device: {amd_device}. Valid devices: {list(amd_queue_map.keys())}")

     return BuildkiteCommandStep(
         label=amd_label,
         commands=[amd_command_wrapped],
-        depends_on=["image-build-amd"],
+        depends_on=[build_dep],
         agents={"queue": amd_queue},
-        env={"DOCKER_BUILDKIT": "1", "VLLM_TEST_COMMANDS": amd_commands_str},
+        env={
+            "DOCKER_BUILDKIT": "1",
+            # Agent hooks read DOCKER_IMAGE_NAME before run-amd-test.py starts.
+            # Keep the hook warmup on ci_base; the runner uses the full image
+            # only if ci_base or artifact setup fails before tests begin.
+            "DOCKER_IMAGE_NAME": "rocm/vllm-dev:ci_base",
+            "VLLM_CI_BASE_IMAGE": "rocm/vllm-dev:ci_base",
+            "VLLM_CI_FALLBACK_IMAGE": "rocm/vllm-ci:$BUILDKITE_COMMIT",
+            "VLLM_CI_USE_ARTIFACTS": "1",
+            "VLLM_CI_ARTIFACT_GLOB": "artifacts/vllm-rocm-install/vllm-rocm-install.tar.gz",
+            "VLLM_CI_RESULTS_ROOT": "/home/buildkite-agent/huggingface/amd-ci-results",
+            "VLLM_TEST_COMMANDS": amd_commands_str,
+        },
         priority=200,
         soft_fail=False,
         retry=None,
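
Review note: the reordered wiring in `convert_group_step_to_buildkite_step` creates the mirror step first, points the block step at the mirror step's first dependency, and then appends the block step's key to the mirror step. The sketch below models that ordering with a stand-in dataclass; `MiniStep` and `wire_amd_mirror` are hypothetical names used only for illustration and are not the PR's `BuildkiteCommandStep` or `BuildkiteBlockStep` classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MiniStep:
    """Hypothetical stand-in for a Buildkite step: only a key and its dependencies."""
    key: str
    depends_on: List[str] = field(default_factory=list)


def wire_amd_mirror(mirror: MiniStep, label_key: str) -> MiniStep:
    """Create the block step for a mirror step and chain the mirror behind it."""
    # The block step waits on whatever build the mirror step waits on.
    build_dep = mirror.depends_on[0] if mirror.depends_on else "image-build-amd"
    block = MiniStep(key=f"block-amd-{label_key}", depends_on=[build_dep])
    # The mirror step then also waits for the manual block to be released.
    mirror.depends_on.append(block.key)
    return block


if __name__ == "__main__":
    mirror = MiniStep(key="amd-basic-tests", depends_on=["image-build-amd"])
    block = wire_amd_mirror(mirror, "basic-tests")
    print(block.depends_on, mirror.depends_on)
```

In this model the mirror step ends up depending on both the image build and its block step, which matches the `depends_on` values visible in the diff.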