[Releases] [ROCm] Enable Nightly Docker Image and Wheel Releases for ROCm #37283
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
…se-repo' in registry with id 'q9t5s3a7' exceeds the maximum allowed number of tags per image which is '1000' Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Code Review
This pull request introduces significant changes to the Buildkite release pipeline, primarily enabling nightly Docker image and wheel releases for ROCm. The changes involve a major refactoring of the ROCm build process, moving towards a more automated and cached approach using ECR for Docker images and S3 for wheels. Configuration is now centralized by extracting values directly from Dockerfile.rocm_base, which is a positive step for consistency. New scripts for ECR tag cleanup and Docker Hub nightly pushes have been added, enhancing pipeline management. However, there are several critical issues identified that prevent the intended functionality and require immediate attention.
I have added comments to my own code for ease of review.
The release pipeline uses CPU agents from AWS and only runs on |
…ROCm (vllm-project#37283) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com>
Purpose
Address #36703
This PR enables Nightly Docker Image and Wheel Releases for ROCm
This PR can still stand on its own as it will build the base image and populate the cache if the base docker image is not found in the cache.
User Experience Details
Docker
Following the CUDA release pipeline, we keep Docker images for only the latest 14 commits.
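The retention rule above can be illustrated with a minimal sketch. This is not the cleanup script added in this PR; the `NightlyImage` record, its fields, and `split_by_retention` are hypothetical names introduced only for illustration, and ordering by push time is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime

KEEP_LATEST = 14  # retention window stated in this PR


@dataclass(frozen=True)
class NightlyImage:
    """Hypothetical record describing one per-commit nightly image."""
    tag: str
    pushed_at: datetime


def split_by_retention(images, keep=KEEP_LATEST):
    """Return (kept, expired), each ordered newest first.

    Assumption: "latest" means most recently pushed.
    """
    ordered = sorted(images, key=lambda image: image.pushed_at, reverse=True)
    return ordered[:keep], ordered[keep:]
```

A cleanup job built on this idea would delete only the images in the `expired` list and leave the rest untouched.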
The Docker images on Docker Hub will be released with the following tag pattern:
vllm/vllm-openai-rocm:base-nightly
vllm/vllm-openai-rocm:base-nightly-<commit>
vllm/vllm-openai-rocm:nightly
vllm/vllm-openai-rocm:nightly-<commit>
Example log:
Wheel
The ROCm wheels are located at the path:
https://wheels.vllm.ai/rocm/nightly/<variant>
now
https://wheels.vllm.ai/rocm/nightly/rocm700
https://wheels.vllm.ai/rocm/36c72b2191380fa3809928f7b29880a499177457/rocm700/
Test plan
Trigger the pipeline
Test results
Mock results https://buildkite.com/vllm/release-pipeline-shadow/builds/3264/steps/canvas
Enhancement (Future plan)
Enhancement: ci-infra PR vllm-project/ci-infra#297 to ensure the Dockerfile.rocm_base cache is always pre-populated.
Archive
Test Plan
Step 1: Trigger ci-infra PR vllm-project/ci-infra#297 to populate the cache
Step 2: Trigger this PR
Test Result
Step 1: Trigger ci-infra PR vllm-project/ci-infra#297 to populate the cache
https://buildkite.com/vllm/amd-ci/builds/6593/steps/canvas?sid=019cfb35-7dd4-4c22-8543-1b403a11356e&tab=output
The generated base Docker image and wheels are
Step 2: Trigger this PR
The results are in https://buildkite.com/vllm/release-pipeline-shadow/builds/3262/steps/canvas
When building the vLLM Docker image, the cached base Docker image is downloaded and reused: https://buildkite.com/vllm/release-pipeline-shadow/builds/3262/steps/canvas?sid=019d0e39-a357-44d8-a6f8-43ff0e15b1ac&tab=output
Implementation Details
Challenges:
The ROCm releases depend on custom dependencies that are specified in Dockerfile.rocm_base. Building this Dockerfile takes more than 3 hours even with sccache enabled. The majority of the build time comes from torch and amd-aiter (if we choose to enable prebuilt again).
The current implementation reuses the caching logic from #32264 to cache the base Docker image and dependency wheels (created from Dockerfile.rocm_base).
ci-infra PR vllm-project/ci-infra#297 is intended to ensure that the Dockerfile.rocm_base cache is always pre-populated.
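The caching idea described above, reusing a prebuilt base image when one exists and building it only on a cache miss, can be sketched roughly as follows. The actual logic from #32264 is not reproduced here. Keying the cache on a SHA-256 hash of the Dockerfile contents is an assumption, and `base_image_cache_key`, `resolve_base_image`, the `rocm-base-` prefix, and the dict-based cache are hypothetical stand-ins for the registry lookup.

```python
import hashlib


def base_image_cache_key(dockerfile_text: str) -> str:
    """Derive a cache key from the Dockerfile contents (assumed scheme)."""
    digest = hashlib.sha256(dockerfile_text.encode("utf-8")).hexdigest()
    return f"rocm-base-{digest[:16]}"


def resolve_base_image(dockerfile_text, cache, build):
    """Return (image, cache_hit).

    `cache` is any dict-like store standing in for the image registry;
    `build` is a callable that performs the slow base-image build.
    """
    key = base_image_cache_key(dockerfile_text)
    if key in cache:
        return cache[key], True
    image = build(dockerfile_text)  # slow path: only taken on a cache miss
    cache[key] = image
    return image, False
```

With a content-derived key, any change to the Dockerfile produces a new key and forces one rebuild, while unchanged inputs keep hitting the cache and skip the multi-hour build.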