
Direct, immediate, and local fix to UpSampleNearest2D fwd to work around the UINT32_MAX limit of HIP runtime#3082

Open
glen-amd wants to merge 5 commits into release/2.10 from fix-to-rocm-2926

Conversation

@glen-amd

Please check the inline, detailed comments in the changed files.

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

@rocm-repo-management-api

rocm-repo-management-api bot commented Mar 18, 2026

Jenkins build for c1fbfd13502a1870937f7e8f4995ff2827a738ae commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Comment on lines +322 to +333
#if 0
// TODO: The change below needs to work with
// a grid-strided loop in `vectorized_elementwise_kernel`.
// Similar to: https://github.com/pytorch/pytorch/pull/169474
#ifdef USE_ROCM
// Clamp the grid to ensure total threads (grid * num_threads)
// does not exceed the uint32_t limit of the HSA AQL packet.
// Use 4294967295 (UINT32_MAX) as the ceiling.
int64_t max_safe_grid = 4294967295LL / num_threads();
grid = std::min(grid, max_safe_grid);
#endif
#endif
Collaborator


Why did you add an #if 0 block? For commentary?

Author

@glen-amd glen-amd Mar 18, 2026


  1. It's not required for the direct fix to the specific error we were working on, but it was identified as a potential (similar) issue while debugging that error.
  2. As the TODO comment indicates, to be fully functional this change would also need corresponding changes to the other function (`vectorized_elementwise_kernel`).
  3. So I commented the change out (using `#if 0`) for now.

Collaborator


Please remove it from the PR.
We can add it back when necessary.
Otherwise, looks good to me.

@rocm-repo-management-api

rocm-repo-management-api bot commented Mar 18, 2026

Jenkins build for 2a0a2b229e7e30577bedd61535f9d1343a667244 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@glen-amd glen-amd marked this pull request as ready for review March 24, 2026 11:15
@CSkmd

CSkmd commented Apr 6, 2026

PR Review Summary

Jira: ROCM-2926


Problem

HIP's HSA AQL dispatch packet stores grid_size_* as uint32_t, so gridDim.x * blockDim.x must be < 2^32 (UINT32_MAX = 4294967295). At large batch sizes (batch=16, resolution=1024×1024), the computed grid dimensions in upsample_nearest2d overflow this limit, resulting in hipErrorInvalidConfiguration. This is a regression between PyTorch 2.9 and 2.10 on ROCm 7.12.0 (tested on MI308X / gfx94X).
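The overflow described above can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes one thread per output element, 256 threads per block, and a hypothetical channel count of 64 (the channel count is not stated in the report); with the reported batch=16 and a 1024×1024 input upsampled 2× to 2048×2048, the total thread count lands exactly at 2^32, one past the AQL packet's limit.

```python
# Back-of-the-envelope check of the uint32 grid overflow described above.
# Hypothetical assumptions: one thread per output element, block = 256,
# channels = 64, output spatial size 2048x2048 (2x upsample of 1024x1024).
UINT32_MAX = 4294967295

n, c, h, w = 16, 64, 2048, 2048    # hypothetical output shape
numel = n * c * h * w              # 2**32 output elements
block = 256
grid = (numel + block - 1) // block
total_threads = grid * block       # what the AQL packet must encode
print(total_threads > UINT32_MAX)  # → True: 2**32 exceeds UINT32_MAX

# Clamping the grid, as in the fix, keeps the product representable:
max_safe_grid = UINT32_MAX // block
clamped = min(grid, max_safe_grid)
print(clamped * block <= UINT32_MAX)  # → True
```

With a clamped grid there are fewer threads than elements, which is why the fix must pair the clamp with a grid-stride loop in the kernel.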


Code Quality Assessment

| Aspect | Status |
| --- | --- |
| Root cause understanding | Correct and well-documented |
| Forward NHWC fix | Correct — grid clamped + kernel strided |
| Forward NCHW fix | Correct — grid clamped + kernel strided on X/Y |
| Backward pass | Not fixed (acknowledged in TODO comment) |
| CUDALoops.cuh `#if 0` block | Needs removal before merge |
| CI | Both builds failing |
| CUDA correctness | Stride loops are no-ops when grid covers full output — safe |
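The "grid clamped + kernel strided" pattern assessed above can be simulated on the host. This is a sketch of the general grid-stride idiom, not the PR's actual kernel: each simulated thread starts at its global index and advances by `grid * block`, so every element is visited exactly once even when the grid is clamped far below full coverage — and the loop body runs exactly once per thread when the grid does cover the output, which is why the stride loop is a no-op on the unclamped CUDA path.

```python
# Host-side simulation of a grid-stride loop (illustrative, not the
# PR's kernel code). Each "thread" strides by grid * block, so a
# clamped grid still covers every output element exactly once.
def run_kernel(numel, grid, block):
    visits = [0] * numel
    stride = grid * block
    for blk in range(grid):
        for thr in range(block):
            i = blk * block + thr      # global thread index
            while i < numel:           # grid-stride loop
                visits[i] += 1
                i += stride
    return visits

# Clamp the grid well below ceil(numel / block), as the ROCm fix does:
numel, block = 10_000, 64
assert all(v == 1 for v in run_kernel(numel, 7, block))
# Full-coverage grid: the stride loop body executes once per thread.
assert all(v == 1 for v in run_kernel(numel, (numel + block - 1) // block, block))
```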

Open Issues / Action Items

  1. #if 0 block in CUDALoops.cuh must be removed. Reviewer @jerrymannil has explicitly requested this. Dead code should not be included — it can be added back in a separate PR if/when the corresponding grid-stride loop in vectorized_elementwise_kernel is implemented.

  2. CI is failing (both runs). The failure reason needs to be investigated and resolved before the PR can merge.

  3. Backward pass not fixed. The PR includes a TODO comment noting the backward NCHW path has a similar overflow risk. The backward NHWC path has a TORCH_CHECK(grad_input.numel() < INT_MAX) that partially guards it. Acceptable gap for a targeted fix but should be tracked as a follow-up.

  4. No regression test added. A minimal test (e.g., F.interpolate on a large tensor that would have previously exceeded UINT32_MAX) would prevent future regressions.


Verdict

The fix is technically sound — the root cause analysis is correct, the grid-stride loop pattern is the right approach, and the CUDA/ROCm split is clean. However, the PR is not ready to merge because:

  • The #if 0 dead code in CUDALoops.cuh must be removed (per reviewer request)
  • CI is failing and needs investigation

@rocm-repo-management-api

rocm-repo-management-api bot commented Apr 7, 2026

Jenkins build for 7d3a91f696975197c73afe49349e2f980f553672 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Detected error during Pytorch building:

      |     ^
/var/lib/jenkins/pytorch/third_party/fbgemm/external//composable_kernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_gemm_multiple_d_xdl_cshuffle_tile_loop.hpp:65:5: warning: failed to meet occupancy target given by 'amdgpu-waves-per-eu' in '_ZN2ck16tensor_operation6device34kernel_grouped_gemm_multiple_d_xdlINS_34GridwiseGemmMultiD_xdl_cshuffle_v3INS_13tensor_layout4gemm8RowMajorENS5_11ColumnMajorENS_5TupleIJS6_S7_EEES6_DB8_SA_ffNS8_IJffEEEtNS0_12element_wise11PassThroughESD_NSC_16MultiplyMultiplyELNS1_18GemmSpecializationE0ELi256ELi256ELi224ELi128ELi16ELi16ELi16ELi16ELi8ELi7ENS_8SequenceIJLi8ELi32ELi1EEEENSG_IJLi1ELi0ELi2EEEESI_Li2ELi16ELi16ELb0ELi0ESH_SI_SI_Li2ELi16ELi16ELb0ELi0ELi2ELi1ENSG_IJLi1ELi64ELi1ELi4EEEENSG_IJLi8ELi8ELi1EEEELNS_26BlockGemmPipelineSchedulerE0ELNS_24BlockGemmPipelineVersionE2ESA_SA_SA_SA_Lb0EEENS1_25GroupedGemmKernelArgumentILi2EEELSF_0ESA_SA_SB_tS6_S7_S9_S6_Li128ENS_25OffsettedBlockToCTileMap2INS_39BlockToCTileMap_Grouped_M00_N0_M01AdaptILi8ELi256ELi224EEEEESS_SD_SD_SE_LSL_0ELSM_2EEEvPU3AS4KviT13_T14_T15_': desired occupancy was 2, final occupancy is 1 [-Wpass-failed]
2 warnings generated when compiling for gfx942.
[7246/8132] Linking CXX static library lib/libfbgemm_genai.a
[7247/8132] Building HIPCC object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/torch_hip_generated_ck_gemm_float.hip.o
FAILED: caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/torch_hip_generated_ck_gemm_float.hip.o /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/torch_hip_generated_ck_gemm_float.hip.o 
cd /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip && /opt/conda/envs/py_3.12/lib/python3.12/site-packages/cmake/data/bin/cmake -E make_directory /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/. && /opt/conda/envs/py_3.12/lib/python3.12/site-packages/cmake/data/bin/cmake -D verbose:BOOL=OFF -D build_configuration:STRING=RELEASE -D generated_file:STRING=/var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/./torch_hip_generated_ck_gemm_float.hip.o -P /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/torch_hip_generated_ck_gemm_float.hip.o.cmake
sccache: encountered fatal error
sccache: error: Failed to parse included file path
sccache: caused by: Failed to parse included file path
failed to execute:/opt/rocm/llvm/bin/clang++  --offload-arch=gfx90a --offload-arch=gfx908 --offload-arch=gfx942 -O3  -c -x hip /var/lib/jenkins/pytorch/aten/src/ATen/native/hip/ck_gemm_float.hip -o "/var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/./torch_hip_generated_ck_gemm_float.hip.o" --offload-compress -fclang-abi-compat=17 -DUSE_NCCL -DUSE_ROCM -D__HIP_PLATFORM_AMD__ -DUSE_FLASH_ATTENTION -DFLASHATTENTION_DISABLE_ALIBI -DFLASHATTENTION_DISABLE_SOFTCAP -DFLASH_NAMESPACE=pytorch_flash -DUNFUSE_FMA -DUSE_MEM_EFF_ATTENTION -DUSE_C10D_NCCL -DTORCH_HIP_BUILD_MAIN_LIB -DROCM_VERSION=70201 -DTORCH_HIP_VERSION=702 -DUSE_LAYERNORM_FAST_RECIPROCAL -DONNX_ML=1 -DONNXIFI_ENABLE_EXT=1 -DONNX_NAMESPACE=onnx_torch -DIDEEP_USE_MKL -DHAVE_MMAP=1 -D_FILE_OFFSET_BITS=64 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_POSIX_FALLOCATE=1 -DUSE_EXTERNAL_MZCRC -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -D__HIP_PLATFORM_AMD__=1 -DUSE_PROF_API=1 -DAT_PER_OPERATOR_HEADERS -DUSE_DISTRIBUTED -DUSE_C10D_GLOO -DUSE_RPC -DUSE_TENSORPIPE -D__HIP_PLATFORM_AMD__ -DHIPBLASLT_USE_ROCROLLER -DFMT_HEADER_ONLY=1 -fPIC -D__HIP_PLATFORM_AMD__=1 -DCUDA_HAS_FP16=1 -DUSE_ROCM -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DTORCH_HIP_VERSION=702 -Wno-shift-count-negative -Wno-shift-count-overflow -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIPBLAS_V2 -DHIP_ENABLE_WARP_SYNC_BUILTINS -DHIPBLASLT_OUTER_VEC -DUSE_ROCM_CK_GEMM -fno-gpu-rdc -I/var/lib/jenkins/pytorch/build/aten/src -I/var/lib/jenkins/pytorch/aten/src -I/var/lib/jenkins/pytorch/build -I/var/lib/jenkins/pytorch -I/opt/rocm-7.2.1/include -I/var/lib/jenkins/pytorch/build/third_party/gloo -I/var/lib/jenkins/pytorch/cmake/../third_party/gloo -I/var/lib/jenkins/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -I/var/lib/jenkins/pytorch/cmake/../third_party/googletest/googlemock/include 
-I/var/lib/jenkins/pytorch/cmake/../third_party/googletest/googletest/include -I/var/lib/jenkins/pytorch/third_party/protobuf/src -I/opt/conda/envs/py_3.12/include -I/var/lib/jenkins/pytorch/third_party/XNNPACK/include -I/var/lib/jenkins/pytorch/third_party/ittapi/include -I/var/lib/jenkins/pytorch/cmake/../third_party/eigen -I/opt/rocm/include -I/opt/rocm-7.2.1/include -I/var/lib/jenkins/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -I/var/lib/jenkins/pytorch/third_party/ideep/include -I/var/lib/jenkins/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -I/opt/conda/envs/py_3.12/include -I/var/lib/jenkins/pytorch/nlohmann -I/var/lib/jenkins/pytorch/INTERFACE -I/var/lib/jenkins/pytorch/third_party/nlohmann/include -I/var/lib/jenkins/pytorch/moodycamel -I/var/lib/jenkins/pytorch/INTERFACE -I/var/lib/jenkins/pytorch/third_party/concurrentqueue -I/var/lib/jenkins/pytorch/aten/src/THH -I/var/lib/jenkins/pytorch/third_party/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/include -I/var/lib/jenkins/pytorch/third_party/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/common/include -I/var/lib/jenkins/pytorch/aten/src/ATen/hip -I/var/lib/jenkins/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include -I/var/lib/jenkins/pytorch/aten/src/ATen/../../../third_party/composable_kernel/library/include -I/var/lib/jenkins/pytorch/aten/src/ATen/../../../third_party/composable_kernel/example/ck_tile/01_fmha -I/var/lib/jenkins/pytorch/build/caffe2/aten/src/ATen/composable_kernel -I/var/lib/jenkins/pytorch/aten/src/ATen/../../../third_party/aiter/csrc/include -I/var/lib/jenkins/pytorch/third_party/fmt/include -I/var/lib/jenkins/pytorch/aten/src -I/var/lib/jenkins/pytorch/build/caffe2/aten/src -I/var/lib/jenkins/pytorch/build/aten/src -I/var/lib/jenkins/pytorch/aten/src -I/var/lib/jenkins/pytorch/aten/src/ATen/.. -I/var/lib/jenkins/pytorch/torch/include -I/opt/rocm-7.2.1/include -I/opt/rocm/include -I/var/lib/jenkins/pytorch/c10/hip/../.. 
-I/var/lib/jenkins/pytorch/build -I/var/lib/jenkins/pytorch/c10/../ -I/var/lib/jenkins/pytorch/build -I/var/lib/jenkins/pytorch/torch/csrc/api -I/var/lib/jenkins/pytorch/torch/csrc/api/include -I/var/lib/jenkins/pytorch/third_party/protobuf/src -I/opt/conda/envs/py_3.12/include -I/opt/rocm-7.2.1/include -I/opt/rocm/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include/hiprand -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include/rocrand -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm/include -I/opt/rocm/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm-7.2.1/include -I/opt/rocm/include -I/var/lib/jenkins/pytorch/build/third_party/gloo/hip -I/var/lib/jenkins/pytorch/build/aten/src -I/var/lib/jenkins/pytorch/aten/src -I/var/lib/jenkins/pytorch/build -I/var/lib/jenkins/pytorch -I/opt/rocm-7.2.1/include -I/var/lib/jenkins/pytorch/build/third_party/gloo -I/var/lib/jenkins/pytorch/cmake/../third_party/gloo -I/var/lib/jenkins/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -I/var/lib/jenkins/pytorch/cmake/../third_party/googletest/googlemock/include -I/var/lib/jenkins/pytorch/cmake/../third_party/googletest/googletest/include -I/var/lib/jenkins/pytorch/third_party/protobuf/src -I/opt/conda/envs/py_3.12/include -I/var/lib/jenkins/pytorch/third_party/XNNPACK/include -I/var/lib/jenkins/pytorch/third_party/ittapi/include -I/var/lib/jenkins/pytorch/cmake/../third_party/eigen -I/opt/rocm/include -I/var/lib/jenkins/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -I/var/lib/jenkins/pytorch/third_party/ideep/include -I/var/lib/jenkins/pytorch/nlohmann -I/var/lib/jenkins/pytorch/INTERFACE 
-I/var/lib/jenkins/pytorch/third_party/nlohmann/include -I/var/lib/jenkins/pytorch/moodycamel -I/var/lib/jenkins/pytorch/third_party/concurrentqueue
