
Direct, immediate, and local fix to UpSampleNearest2D fwd to work around the UINT32_MAX limit of HIP runtime#3082

Open
glen-amd wants to merge 5 commits into release/2.10 from fix-to-rocm-2926

Conversation

@glen-amd

Please check the inline, detailed comments in the changed files.

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

@rocm-repo-management-api

rocm-repo-management-api bot commented Mar 18, 2026

Jenkins build for c1fbfd13502a1870937f7e8f4995ff2827a738ae commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Comment on lines +322 to +333
#if 0
// TODO: The change below needs to work with
// a grid-strided loop in `vectorized_elementwise_kernel`.
// Similar to: https://github.com/pytorch/pytorch/pull/169474
#ifdef USE_ROCM
// Clamp the grid to ensure total threads (grid * num_threads)
// does not exceed the uint32_t limit of the HSA AQL packet.
// Use 4294967295 (UINT32_MAX) as the ceiling.
int64_t max_safe_grid = 4294967295LL / num_threads();
grid = std::min(grid, max_safe_grid);
#endif
#endif
Collaborator


Why did you add an #if 0 block? For commentary?

Author

@glen-amd glen-amd Mar 18, 2026


  1. It's not required for the direct fix to the specific error we were working on, but it was identified as a potential (similar) issue while debugging that error.
  2. As the TODO comment indicates, to be fully functional this change would also need corresponding changes to the other function (`vectorized_elementwise_kernel`).
  3. So I commented the change out (using `#if 0`) for now.

Collaborator


Please remove it from the PR.
We can add it back when necessary.
Otherwise, looks good to me.

@rocm-repo-management-api

rocm-repo-management-api bot commented Mar 18, 2026

Jenkins build for 2a0a2b229e7e30577bedd61535f9d1343a667244 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@glen-amd glen-amd marked this pull request as ready for review March 24, 2026 11:15
@CSkmd

CSkmd commented Apr 6, 2026

PR Review Summary

Jira: ROCM-2926


Problem

HIP's HSA AQL dispatch packet stores grid_size_* as uint32_t, so gridDim.x * blockDim.x must be < 2^32 (UINT32_MAX = 4294967295). At large batch sizes (batch=16, resolution=1024×1024), the computed grid dimensions in upsample_nearest2d overflow this limit, resulting in hipErrorInvalidConfiguration. This is a regression between PyTorch 2.9 and 2.10 on ROCm 7.12.0 (tested on MI308X / gfx94X).
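The overflow described above can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes one thread per output element, 256 threads per block, and a hypothetical channel count of 64 (the channel count is not stated in the report); with the reported batch=16 and a 1024×1024 input upsampled 2× to 2048×2048, the total thread count lands exactly at 2^32, one past the AQL packet's limit.

```python
# Back-of-the-envelope check of the uint32 grid overflow described above.
# Hypothetical assumptions: one thread per output element, block = 256,
# channels = 64, output spatial size 2048x2048 (2x upsample of 1024x1024).
UINT32_MAX = 4294967295

n, c, h, w = 16, 64, 2048, 2048    # hypothetical output shape
numel = n * c * h * w              # 2**32 output elements
block = 256
grid = (numel + block - 1) // block
total_threads = grid * block       # what the AQL packet must encode
print(total_threads > UINT32_MAX)  # → True: 2**32 exceeds UINT32_MAX

# Clamping the grid, as in the fix, keeps the product representable:
max_safe_grid = UINT32_MAX // block
clamped = min(grid, max_safe_grid)
print(clamped * block <= UINT32_MAX)  # → True
```

With a clamped grid there are fewer threads than elements, which is why the fix must pair the clamp with a grid-stride loop in the kernel.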


Code Quality Assessment

| Aspect | Status |
| --- | --- |
| Root cause understanding | Correct and well-documented |
| Forward NHWC fix | Correct — grid clamped + kernel strided |
| Forward NCHW fix | Correct — grid clamped + kernel strided on X/Y |
| Backward pass | Not fixed (acknowledged in TODO comment) |
| CUDALoops.cuh `#if 0` block | Needs removal before merge |
| CI | Both builds failing |
| CUDA correctness | Stride loops are no-ops when grid covers full output — safe |
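The "grid clamped + kernel strided" pattern assessed above can be simulated on the host. This is a sketch of the general grid-stride idiom, not the PR's actual kernel: each simulated thread starts at its global index and advances by `grid * block`, so every element is visited exactly once even when the grid is clamped far below full coverage — and the loop body runs exactly once per thread when the grid does cover the output, which is why the stride loop is a no-op on the unclamped CUDA path.

```python
# Host-side simulation of a grid-stride loop (illustrative, not the
# PR's kernel code). Each "thread" strides by grid * block, so a
# clamped grid still covers every output element exactly once.
def run_kernel(numel, grid, block):
    visits = [0] * numel
    stride = grid * block
    for blk in range(grid):
        for thr in range(block):
            i = blk * block + thr      # global thread index
            while i < numel:           # grid-stride loop
                visits[i] += 1
                i += stride
    return visits

# Clamp the grid well below ceil(numel / block), as the ROCm fix does:
numel, block = 10_000, 64
assert all(v == 1 for v in run_kernel(numel, 7, block))
# Full-coverage grid: the stride loop body executes once per thread.
assert all(v == 1 for v in run_kernel(numel, (numel + block - 1) // block, block))
```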

Open Issues / Action Items

  1. #if 0 block in CUDALoops.cuh must be removed. Reviewer @jerrymannil has explicitly requested this. Dead code should not be included — it can be added back in a separate PR if/when the corresponding grid-stride loop in vectorized_elementwise_kernel is implemented.

  2. CI is failing (both runs). The failure reason needs to be investigated and resolved before the PR can merge.

  3. Backward pass not fixed. The PR includes a TODO comment noting the backward NCHW path has a similar overflow risk. The backward NHWC path has a TORCH_CHECK(grad_input.numel() < INT_MAX) that partially guards it. Acceptable gap for a targeted fix but should be tracked as a follow-up.

  4. No regression test added. A minimal test (e.g., F.interpolate on a large tensor that would have previously exceeded UINT32_MAX) would prevent future regressions.


Verdict

The fix is technically sound — the root cause analysis is correct, the grid-stride loop pattern is the right approach, and the CUDA/ROCm split is clean. However, the PR is not ready to merge because:

  • The #if 0 dead code in CUDALoops.cuh must be removed (per reviewer request)
  • CI is failing and needs investigation

@rocm-repo-management-api

rocm-repo-management-api bot commented Apr 7, 2026

Jenkins build for 7d3a91f696975197c73afe49349e2f980f553672 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Detected error during Pytorch building:

      |     ^
/var/lib/jenkins/pytorch/third_party/fbgemm/external//composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_gemm_multiple_d_xdl_cshuffle_tile_loop.hpp:65:5: warning: failed to meet occupancy target given by 'amdgpu-waves-per-eu' in '_ZN2ck16tensor_operation6device34kernel_grouped_gemm_multiple_d_xdlINS_34GridwiseGemmMultiD_xdl_cshuffle_v3INS_13tensor_layout4gemm8RowMajorENS5_11ColumnMajorENS_5TupleIJS6_S7_EEES6_DB8_SA_ffNS8_IJffEEEtNS0_12element_wise11PassThroughESD_NSC_16MultiplyMultiplyELNS1_18GemmSpecializationE0ELi256ELi256ELi224ELi128ELi16ELi16ELi16ELi16ELi8ELi7ENS_8SequenceIJLi8ELi32ELi1EEEENSG_IJLi1ELi0ELi2EEEESI_Li2ELi16ELi16ELb0ELi0ESH_SI_SI_Li2ELi16ELi16ELb0ELi0ELi2ELi1ENSG_IJLi1ELi64ELi1ELi4EEEENSG_IJLi8ELi8ELi1EEEELNS_26BlockGemmPipelineSchedulerE0ELNS_24BlockGemmPipelineVersionE2ESA_SA_SA_SA_Lb0EEENS1_25GroupedGemmKernelArgumentILi2EEELSF_0ESA_SA_SB_tS6_S7_S9_S6_Li128ENS_25OffsettedBlockToCTileMap2INS_39BlockToCTileMap_Grouped_M00_N0_M01AdaptILi8ELi256ELi224EEEEESS_SD_SD_SE_LSL_0ELSM_2EEEvPU3AS4KviT13_T14_T15_': desired occupancy was 2, final occupancy is 1 [-Wpass-failed]
2 warnings generated when compiling for gfx942.
[7246/8132] Linking CXX static library lib/libfbgemm_genai.a
[7247/8132] Building HIPCC object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/torch_hip_generated_ck_gemm_float.hip.o
FAILED: caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/torch_hip_generated_ck_gemm_float.hip.o /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/torch_hip_generated_ck_gemm_float.hip.o 
cd /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip && /opt/conda/envs/py_3.12/lib/python3.12/site-packages/cmake/data/bin/cmake -E make_directory /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/. && /opt/conda/envs/py_3.12/lib/python3.12/site-packages/cmake/data/bin/cmake -D verbose:BOOL=OFF -D build_configuration:STRING=RELEASE -D generated_file:STRING=/var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/./torch_hip_generated_ck_gemm_float.hip.o -P /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/torch_hip_generated_ck_gemm_float.hip.o.cmake
sccache: encountered fatal error
sccache: error: Failed to parse included file path
sccache: caused by: Failed to parse included file path
failed to execute:/opt/rocm/llvm/bin/clang++  --offload-arch=gfx90a --offload-arch=gfx908 --offload-arch=gfx942 -O3  -c -x hip /var/lib/jenkins/pytorch/aten/src/ATen/native/hip/ck_gemm_float.hip -o "/var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/./torch_hip_generated_ck_gemm_float.hip.o" --offload-compress -fclang-abi-compat=17 -DUSE_NCCL -DUSE_ROCM -D__HIP_PLATFORM_AMD__ -DUSE_FLASH_ATTENTION -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DUNFUSE_FMA -DUSE_MEM_EFF_ATTENTION -DUSE_C10D_NCCL -DTORCH_HIP_BUILD_MAIN_LIB -DROCM_VERSION=70201 -DTORCH_HIP_VERSION=702 -DUSE_LAYERNORM_FAST_RECIPROCAL -DONNX_ML=1 -DONNXIFI_ENABLE_EXT=1 -DONNX_NAMESPACE=onnx_torch -DIDEEP_USE_MKL -DHAVE_MMAP=1 -D_FILE_OFFSET_BITS=64 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_POSIX_FALLOCATE=1 -DUSE_EXTERNAL_MZCRC -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -D__HIP_PLATFORM_AMD__=1 -DUSE_PROF_API=1 -DAT_PER_OPERATOR_HEADERS -DUSE_DISTRIBUTED -DUSE_C10D_GLOO -DUSE_RPC -DUSE_TENSORPIPE -D__HIP_PLATFORM_AMD__ -DHIPBLASLT_USE_ROCROLLER -DFMT_HEADER_ONLY=1 -fPIC -D__HIP_PLATFORM_AMD__=1 -DCUDA_HAS_FP16=1 -DUSE_ROCM -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DTORCH_HIP_VERSION=702 -Wno-shift-count-negative -Wno-shift-count-overflow -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIPBLAS_V2 -DHIP_ENABLE_WARP_SYNC_BUILTINS -DHIPBLASLT_OUTER_VEC -DUSE_ROCM_CK_GEMM -fno-gpu-rdc -I/var/lib/jenkins/pytorch/build/aten/src -I/var/lib/jenkins/pytorch/aten/src -I/var/lib/jenkins/pytorch/build -I/var/lib/jenkins/pytorch -I/opt/rocm-7.2.1/include -I/var/lib/jenkins/pytorch/build/third_party/gloo -I/var/lib/jenkins/pytorch/cmake/../third_party/gloo -I/var/lib/jenkins/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -I/var/lib/jenkins/pytorch/cmake/../third_party/googletest/googlemock/include 
-I/var/lib/jenkins/pytorch/cmake/../third_party/googletest/googletest/include -I/var/lib/jenkins/pytorch/third_party/protobuf/src -I/opt/conda/envs/py_3.12/include -I/var/lib/jenkins/pytorch/third_party/XNNPACK/include -I/var/lib/jenkins/pytorch/third_party/ittapi/include -I/var/lib/jenkins/pytorch/cmake/../third_party/eigen -I/opt/rocm/include -I/opt/rocm-7.2.1/include -I/var/lib/jenkins/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -I/var/lib/jenkins/pytorch/third_party/ideep/include -I/var/lib/jenkins/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -I/opt/conda/envs/py_3.12/include -I/var/lib/jenkins/pytorch/nlohmann -I/var/lib/jenkins/pytorch/INTERFACE -I/var/lib/jenkins/pytorch/third_party/nlohmann/include -I/var/lib/jenkins/pytorch/moodycamel -I/var/lib/jenkins/pytorch/INTERFACE -I/var/lib/jenkins/pytorch/third_party/concurrentqueue -I/var/lib/jenkins/pytorch/aten/src/THH -I/var/lib/jenkins/pytorch/third_party/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/include -I/var/lib/jenkins/pytorch/third_party/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/common/include -I/var/lib/jenkins/pytorch/aten/src/ATen/hip -I/var/lib/jenkins/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include -I/var/lib/jenkins/pytorch/aten/src/ATen/../../../third_party/composable_kernel/library/include -I/var/lib/jenkins/pytorch/aten/src/ATen/../../../third_party/composable_kernel/example/ck_tile/01_fmha -I/var/lib/jenkins/pytorch/build/caffe2/aten/src/ATen/composable_kernel -I/var/lib/jenkins/pytorch/aten/src/ATen/../../../third_party/aiter/csrc/include -I/var/lib/jenkins/pytorch/third_party/fmt/include -I/var/lib/jenkins/pytorch/aten/src -I/var/lib/jenkins/pytorch/build/caffe2/aten/src -I/var/lib/jenkins/pytorch/build/aten/src -I/var/lib/jenkins/pytorch/aten/src -I/var/lib/jenkins/pytorch/aten/src/ATen/.. -I/var/lib/jenkins/pytorch/torch/include -I/opt/rocm-7.2.1/include -I/opt/rocm/include -I/var/lib/jenkins/pytorch/c10/hip/../.. 
-I/var/lib/jenkins/pytorch/build -I/var/lib/jenkins/pytorch/c10/../ -I/var/lib/jenkins/pytorch/build -I/var/lib/jenkins/pytorch/torch/csrc/api -I/var/lib/jenkins/pytorch/torch/csrc/api/include -I/var/lib/jenkins/pytorch/third_party/protobuf/src -I/opt/conda/envs/py_3.12/include -I/opt/rocm-7.2.1/include -I/opt/rocm/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include/hiprand -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include/rocrand -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm/include -I/opt/rocm/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm/include -I/var/lib/jenkins/pytorch/build/third_party/gloo/hip -I/var/lib/jenkins/pytorch/build/aten/src -I/var/lib/jenkins/pytorch/aten/src -I/var/lib/jenkins/pytorch/build -I/var/lib/jenkins/pytorch -I/opt/rocm-7.2.1/include -I/var/lib/jenkins/pytorch/build/third_party/gloo -I/var/lib/jenkins/pytorch/cmake/../third_party/gloo -I/var/lib/jenkins/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -I/var/lib/jenkins/pytorch/cmake/../third_party/googletest/googlemock/include -I/var/lib/jenkins/pytorch/cmake/../third_party/googletest/googletest/include -I/var/lib/jenkins/pytorch/third_party/protobuf/src -I/opt/conda/envs/py_3.12/include -I/var/lib/jenkins/pytorch/third_party/XNNPACK/include -I/var/lib/jenkins/pytorch/third_party/ittapi/include -I/var/lib/jenkins/pytorch/cmake/../third_party/eigen -I/opt/rocm/include -I/var/lib/jenkins/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -I/var/lib/jenkins/pytorch/third_party/ideep/include -I/var/lib/jenkins/pytorch/nlohmann -I/var/lib/jenkins/pytorch/INTERFACE 
-I/var/lib/jenkins/pytorch/third_party/nlohmann/include -I/var/lib/jenkins/pytorch/moodycamel -I/var/lib/jenkins/pytorch/third_party/concurrentqueue
