[Releases] [ROCm] Enable Nightly Docker Image and Wheel Releases for ROCm #37283

Merged
tjtanaa merged 29 commits into vllm-project:main from tjtanaa:nightly-rocm on Mar 26, 2026

@tjtanaa tjtanaa commented Mar 17, 2026

Purpose

Address #36703

This PR enables nightly Docker image and wheel releases for ROCm.

This PR can still stand on its own as it will build the base image and populate the cache if the base docker image is not found in the cache.

User Experience Details

Docker

Following the CUDA release pipeline, we keep Docker images for only the latest 14 commits.
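
The retention rule above can be illustrated with a minimal Python sketch. It is not the pipeline's cleanup script: the function name `split_retained` and the assumption that per-commit tags arrive ordered newest first are hypothetical; only the limit of 14 comes from the description.

```python
def split_retained(tags_newest_first: list, keep: int = 14) -> tuple:
    """Split per-commit tags into (kept, pruned), keeping the newest `keep`.

    Assumes the caller has already sorted the tags newest first.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    return tags_newest_first[:keep], tags_newest_first[keep:]


if __name__ == "__main__":
    # Hypothetical tag names, used only to exercise the helper.
    tags = [f"nightly-commit{i:02d}" for i in range(16)]
    kept, pruned = split_retained(tags)
    print(len(kept), "kept;", len(pruned), "pruned")
```

Keeping the split pure (no registry calls) makes the rule easy to check before any tag is actually deleted.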

The docker image on dockerhub will be released with the following tag pattern:

  • Base Docker Image (for traceability):
    • vllm/vllm-openai-rocm:base-nightly
    • vllm/vllm-openai-rocm:base-nightly-<commit>
  • vLLM OpenAI Docker Image:
    • vllm/vllm-openai-rocm:nightly
    • vllm/vllm-openai-rocm:nightly-<commit>

Example log:

[2026-03-23T17:42:32Z] + docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:1227c9527d573e09-rocm-base vllm/vllm-openai-rocm:base-nightly
[2026-03-23T17:42:32Z] + docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:1227c9527d573e09-rocm-base vllm/vllm-openai-rocm:base-nightly-57e207873b521dcbaba50b37153b0dd0b5883636
[2026-03-23T17:42:32Z] + docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:57e207873b521dcbaba50b37153b0dd0b5883636-rocm vllm/vllm-openai-rocm:nightly
[2026-03-23T17:42:32Z] + docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:57e207873b521dcbaba50b37153b0dd0b5883636-rocm vllm/vllm-openai-rocm:nightly-57e207873b521dcbaba50b37153b0dd0b5883636
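
The tag pattern shown in the log can be sketched as a small Python helper. This is illustrative only and is not the script added by this PR; the function name `nightly_tags` is hypothetical, while the repository name and tag patterns are taken from the description above.

```python
REPO = "vllm/vllm-openai-rocm"  # Docker Hub repository named in the PR description


def nightly_tags(commit: str) -> dict:
    """Return the Docker Hub tags published for one nightly build of `commit`."""
    return {
        # Base image tags, kept for traceability.
        "base": [f"{REPO}:base-nightly", f"{REPO}:base-nightly-{commit}"],
        # vLLM OpenAI image tags.
        "vllm": [f"{REPO}:nightly", f"{REPO}:nightly-{commit}"],
    }


if __name__ == "__main__":
    commit = "57e207873b521dcbaba50b37153b0dd0b5883636"  # commit from the example log
    for kind, tags in nightly_tags(commit).items():
        for tag in tags:
            print(kind, tag)
```

Each build therefore yields one moving tag and one commit-pinned tag per image.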

Wheel

The ROCm wheels are located at the path:

  • https://wheels.vllm.ai/rocm/nightly/<variant> (currently https://wheels.vllm.ai/rocm/nightly/rocm700)
  • https://wheels.vllm.ai/rocm/36c72b2191380fa3809928f7b29880a499177457/rocm700/
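
As a sketch of how these two paths relate, the following Python helper builds either the nightly or the per-commit location. The function name `wheel_index_url` is hypothetical, and the trailing slash on both forms is an assumption based on the per-commit example above.

```python
from typing import Optional

BASE_URL = "https://wheels.vllm.ai/rocm"  # prefix taken from the paths above


def wheel_index_url(variant: str, commit: Optional[str] = None) -> str:
    """Build the wheel location for a ROCm variant such as "rocm700".

    With no commit, the moving "nightly" path is returned; otherwise the
    path pinned to that commit hash.
    """
    ref = commit if commit else "nightly"
    return f"{BASE_URL}/{ref}/{variant}/"


if __name__ == "__main__":
    print(wheel_index_url("rocm700"))
    print(wheel_index_url("rocm700", "36c72b2191380fa3809928f7b29880a499177457"))
```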

Test plan

Trigger the pipeline

Test results

Mock results https://buildkite.com/vllm/release-pipeline-shadow/builds/3264/steps/canvas

Enhancement (Future plan)

ci-infra PR vllm-project/ci-infra#297 is intended to ensure that the Dockerfile.rocm_base cache is always pre-populated.

Archive

Test Plan

Step 1: Trigger ci-infra PR vllm-project/ci-infra#297 to populate the cache

Step 2: Trigger this PR

  • Ensure that this pipeline reuses the cached Docker image and wheels from Step 1.

Test Result

Step 1: Trigger ci-infra PR vllm-project/ci-infra#297 to populate the cache

https://buildkite.com/vllm/amd-ci/builds/6593/steps/canvas?sid=019cfb35-7dd4-4c22-8543-1b403a11356e&tab=output

The generated base Docker image and wheels are:

[2026-03-17T09:56:01Z] ROCm Base Image Build/Reuse
[2026-03-17T09:56:01Z]   Cache Key: b58dc988fa0856d2-9d3bce57
[2026-03-17T09:56:01Z]   ECR Cache Tag: public.ecr.aws/q9t5s3a7/vllm-release-repo:b58dc988fa0856d2-9d3bce57-rocm-base
[2026-03-17T09:56:01Z]   ECR Commit Tag: public.ecr.aws/q9t5s3a7/vllm-release-repo:0e4701ff0f6802aef64d449f7e8ab3c6599ca6e3-b58dc988fa0856d2-9d3bce57-rocm-base

[2026-03-17T09:56:04Z] Tagged public.ecr.aws/q9t5s3a7/vllm-release-repo:b58dc988fa0856d2-9d3bce57-rocm-base as public.ecr.aws/q9t5s3a7/vllm-release-repo:0e4701ff0f6802aef64d449f7e8ab3c6599ca6e3-b58dc988fa0856d2-9d3bce57-rocm-base in ECR (no pull required)
[2026-03-17T09:56:04Z] Base image ready: public.ecr.aws/q9t5s3a7/vllm-release-repo:0e4701ff0f6802aef64d449f7e8ab3c6599ca6e3-b58dc988fa0856d2-9d3bce57-rocm-base

Step 2: Trigger this PR

The results are in https://buildkite.com/vllm/release-pipeline-shadow/builds/3262/steps/canvas

When building the vLLM Docker image, the cached base Docker image is downloaded and reused: https://buildkite.com/vllm/release-pipeline-shadow/builds/3262/steps/canvas?sid=019d0e39-a357-44d8-a6f8-43ff0e15b1ac&tab=output

[2026-03-21T05:59:30Z] Pulling base Docker image from ECR: public.ecr.aws/q9t5s3a7/vllm-release-repo:b58dc988fa0856d2-9d3bce57-rocm-base
[2026-03-21T05:59:30Z] b58dc988fa0856d2-9d3bce57-rocm-base: Pulling from q9t5s3a7/vllm-release-repo

Implementation Details

Challenges:

The ROCm releases depend on custom dependencies that are specified in Dockerfile.rocm_base. Building this Dockerfile takes more than 3 hours, even with sccache enabled. Most of the build time comes from torch and amd-aiter (if we choose to enable prebuilt again).

The current implementation reuses the caching logic from #32264 to cache the base Docker image and the dependency wheels (both created from Dockerfile.rocm_base).
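
The build-or-reuse behaviour described here can be sketched in Python. This is not the logic from #32264: the key derivation (a truncated SHA-256 of the Dockerfile text), the function names `cache_key` and `resolve_base_image`, and the in-memory set standing in for the registry are all assumptions made for illustration. Only the `-rocm-base` tag suffix comes from the logs above.

```python
import hashlib
from typing import Callable, Set, Tuple


def cache_key(dockerfile_text: str) -> str:
    """Hypothetical key: first 16 hex characters of the SHA-256 of the file."""
    return hashlib.sha256(dockerfile_text.encode("utf-8")).hexdigest()[:16]


def resolve_base_image(
    dockerfile_text: str,
    existing_tags: Set[str],
    build: Callable[[str], None],
) -> Tuple[str, bool]:
    """Return (tag, reused); call `build(tag)` only on a cache miss."""
    tag = f"{cache_key(dockerfile_text)}-rocm-base"
    if tag in existing_tags:
        return tag, True  # cache hit: skip the multi-hour build
    build(tag)  # cache miss: build, then record the tag as cached
    existing_tags.add(tag)
    return tag, False


if __name__ == "__main__":
    registry: Set[str] = set()  # stand-in for the image registry
    builds = []
    print(resolve_base_image("FROM scratch\n", registry, builds.append))
    print(resolve_base_image("FROM scratch\n", registry, builds.append))
    print("builds run:", len(builds))
```

Keying the cache on the Dockerfile contents means that any change to the file produces a new tag and forces a rebuild, while unchanged content is reused.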

ci-infra PR vllm-project/ci-infra#297 is intended to ensure that the Dockerfile.rocm_base cache is always pre-populated.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Hongxia Yang and others added 14 commits February 17, 2026 16:14
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com>
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
…se-repo' in registry with id 'q9t5s3a7' exceeds the maximum allowed number of tags per image which is '1000'

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
…se-repo' in registry with id 'q9t5s3a7' exceeds the maximum allowed number of tags per image which is '1000'

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
@tjtanaa tjtanaa self-assigned this Mar 17, 2026
@mergify mergify Bot added ci/build rocm Related to AMD ROCm labels Mar 17, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Mar 17, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces significant changes to the Buildkite release pipeline, primarily enabling nightly Docker image and wheel releases for ROCm. The changes involve a major refactoring of the ROCm build process, moving towards a more automated and cached approach using ECR for Docker images and S3 for wheels. Configuration is now centralized by extracting values directly from Dockerfile.rocm_base, which is a positive step for consistency. New scripts for ECR tag cleanup and Docker Hub nightly pushes have been added, enhancing pipeline management. However, there are several critical issues identified that prevent the intended functionality and require immediate attention.

Comment thread .buildkite/release-pipeline.yaml Outdated
Comment thread .buildkite/release-pipeline.yaml Outdated
Comment thread .buildkite/scripts/push-nightly-builds-rocm.sh
Comment thread .buildkite/release-pipeline.yaml Outdated
Comment thread .buildkite/release-pipeline.yaml
Comment thread .buildkite/release-pipeline.yaml
Comment thread .buildkite/scripts/cleanup-ecr-rocm-base-tags.sh Outdated

tjtanaa commented Mar 17, 2026

I have added comments to my own code for ease of review.

tjtanaa added 3 commits March 17, 2026 15:09
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
tjtanaa added 3 commits March 21, 2026 02:26
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
@tjtanaa tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 25, 2026

tjtanaa commented Mar 25, 2026

Conditionally approving, pending the input from the infra folks regarding the potential additional load on the system

The release pipeline uses CPU agents from AWS and runs only as part of the release pipeline, so it will not add load to AMD CI.

@tjtanaa tjtanaa enabled auto-merge (squash) March 26, 2026 15:52
@tjtanaa tjtanaa merged commit 60af7b9 into vllm-project:main Mar 26, 2026
13 of 14 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD Mar 26, 2026
RhizoNymph pushed a commit to RhizoNymph/vllm that referenced this pull request Mar 26, 2026
…ROCm (vllm-project#37283)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com>
nithinvc pushed a commit to nithinvc/vllm that referenced this pull request Mar 27, 2026
…ROCm (vllm-project#37283)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com>

Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
…ROCm (vllm-project#37283)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com>
puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request Apr 7, 2026
…ROCm (vllm-project#37283)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026
…ROCm (vllm-project#37283)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com>
mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026
…ROCm (vllm-project#37283)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com>
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
…ROCm (vllm-project#37283)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com>