cuda: add simple cudaMalloc/cudaFree allocator as opt-in (workaround for #2038)#2044
Open
T0nd3 wants to merge 1 commit into
Open
cuda: add simple cudaMalloc/cudaFree allocator as opt-in (workaround for #2038)#2044T0nd3 wants to merge 1 commit into
T0nd3 wants to merge 1 commit into
Conversation
…MT#2038) Issue OpenNMT#2038 reports that `del model` deadlocks during cleanup on ROCm 7.2.1 + Windows + gfx1100. The trace points at the hipcub CachingDeviceAllocator's DeviceFree path — both the per-block hipEventRecord (recached branch) and the synchronous hipFree (non-recached branch) call into the ROCm runtime, and at least one of them hangs there indefinitely during a `del model`. The CudaAsyncAllocator already exists as an alternative, but it's disabled on Windows (OpenNMT#1072 comment) and a quick check on the local ROCm 7.2 wheels confirms hipMallocAsync still misbehaves there (invalid vector subscript during the first generate() call), so that's not a usable fallback. Add a third allocator option, "simple" / "none", that is a stateless cudaMalloc / cudaFree wrapper. It has no cache, no per-block ready events, and no per-block associated streams, so it can't trip any of the state-tracking code paths in CachingDeviceAllocator that the upstream runtime bug is sensitive to. The trade-off is that every allocation becomes a fresh cudaMalloc — fine for typical inference workloads that allocate once and reuse, more costly for workloads that allocate often. Selected via `CT2_CUDA_ALLOCATOR=simple` (or `none`). Default behaviour is unchanged. Verified locally on RX 7900 XTX (gfx1100, ROCm 7.2.0, Windows 11): - Whisper-medium load + inference + `del model` succeeds in both allocator modes (deadlock not reproducible on 7.2.0; reporter sees it on 7.2.1). - Existing flash-attention pytest suite (15 tests) passes under `CT2_CUDA_ALLOCATOR=simple` with identical results.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Addresses #2038 (
del modeldeadlocks on ROCm 7.2.1 + Windows + gfx1100 inside the hipcub CachingDeviceAllocator's free path).Adds a third option for `CT2_CUDA_ALLOCATOR`:
The default behaviour is unchanged. The `simple` allocator is an opt-in workaround for users hitting the deadlock until the ROCm runtime bug is fixed upstream.
Why not just hardcode the workaround on Windows-HIP?
I checked whether `hipMallocAsync` would work as a drop-in (it's currently disabled on Windows via `CT2_USE_ASYNC_ALLOC = !_WIN32` per #1072). On ROCm 7.2.0 + RX 7900 XTX + Windows 11, the first `generate()` call still crashes with `IndexError: invalid vector subscript` — so `hipMallocAsync` isn't usable as a Windows default yet. Reverted the experiment, kept the existing guard.
The hipcub deadlock itself isn't reproducible on ROCm 7.2.0 in my local setup (the reporter sees it on 7.2.1), so I can't fix the bug at its root in this repo. The opt-in `simple` allocator just routes around every code path the upstream bug is sensitive to: no cached blocks, no `hipEventRecord`, no per-block streams.
Trade-off
Test plan
cc @sssshhhhhh @jordimas
Closes #2038 once confirmed by the reporter.