OpenNMT · T0nd3 · May 11, 2026 · May 12, 2026
diff --git a/docs/building_rocm_windows.md b/docs/building_rocm_windows.md
@@ -0,0 +1,286 @@
+# Building with ROCm on Windows
+
+This guide describes how to build CTranslate2 with AMD GPU support (ROCm/HIP) on Windows. It was validated on the following system:
+
+| Component | Version |
+| --- | --- |
+| OS | Windows 11 (Build 26200) |
+| GPU | AMD Radeon RX 7900 XTX (gfx1100, RDNA 3) |
+| ROCm | 7.2.0 |
+| Python | 3.11 |
+
+## Supported GPUs
+
+ROCm on Windows supports AMD RDNA 2 and RDNA 3 GPUs (RX 6000 and RX 7000 series). The HIP architecture target for each GPU can be found in the [ROCm GPU compatibility matrix](https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html).
+
+Common targets:
+
+| GPU | HIP architecture |
+| --- | --- |
+| RX 6700, 6800, 6900 | `gfx1030` |
+| RX 7600, 7700 | `gfx1102` |
+| RX 7800, 7900 | `gfx1101` |
+| RX 7900 XTX, 7900 XT | `gfx1100` |
+
+## Prerequisites
+
+### 1. Visual Studio Build Tools 2022
+
+The MSVC compiler, CMake, and Ninja are all provided by the Visual Studio Build Tools workload. No separate CMake installation is needed.
+
+Download the bootstrapper from Microsoft and run a silent installation:
+
+```powershell
+Invoke-WebRequest -Uri "https://aka.ms/vs/17/release/vs_BuildTools.exe" -OutFile vs_BuildTools.exe
+
+.\vs_BuildTools.exe --quiet --wait --norestart `
+    --add Microsoft.VisualStudio.Workload.VCTools `
+    --add Microsoft.VisualStudio.Component.VC.CMake.Project `
+    --add Microsoft.VisualStudio.Component.Windows11SDK.22621 `
+    --includeRecommended
+```
+
+```{note}
+Pass all arguments as a single string. Passing them as a PowerShell array (`-ArgumentList @(...)`) causes the installer to exit with error code 87 (invalid parameter).
+```
+
+Verify the installation:
+
+```powershell
+& "C:\Program Files (x86)\Microsoft Visual Studio\Installer\vswhere.exe" -products * -format json
+```
+
+### 2. ROCm via Python wheels
+
+AMD distributes ROCm for Windows as Python wheels. This is the method used by the official CTranslate2 CI and does not require the AMD HIP SDK installer.
+
+```powershell
+pip install --no-cache-dir `
+    https://repo.radeon.com/rocm/windows/rocm-rel-7.2/rocm_sdk_core-7.2.0.dev0-py3-none-win_amd64.whl `
+    https://repo.radeon.com/rocm/windows/rocm-rel-7.2/rocm_sdk_devel-7.2.0.dev0-py3-none-win_amd64.whl `
+    https://repo.radeon.com/rocm/windows/rocm-rel-7.2/rocm_sdk_libraries_custom-7.2.0.dev0-py3-none-win_amd64.whl `
+    https://repo.radeon.com/rocm/windows/rocm-rel-7.2/rocm-7.2.0.dev0.tar.gz
+
+rocm-sdk init
+```
+
+The `rocm-sdk init` command extracts the development headers and compiler to a local directory. Retrieve the path for later use:
+
+```powershell
+$env:ROCM_PATH = python -c "from rocm_sdk._devel import get_devel_root; print(get_devel_root())"
+```
+
+This installs AMD Clang (the HIP compiler), HIP headers, `hipblas.dll`, `amdhip64_7.dll`, and the AMDGCN bitcode libraries needed for GPU kernel compilation.
+
+#### Why not the official AMD HIP SDK Installer?
+
+AMD also ships a standalone HIP SDK for Windows as an `.exe` installer (separate from the pip wheels). Don't use it for CTranslate2 — at the time of writing, the installer lags the pip wheels by one release and the runtime libraries are not ABI-compatible. Concretely:
+
+| Distribution | Latest version | Runtime DLL name |
+| --- | --- | --- |
+| Python wheels (`rocm-sdk-*`) | 7.2.0 | `hipblas.dll` |
+| AMD HIP SDK Installer (`.exe`) | 7.1.1 | `libhipblas.dll` |
+
+A CTranslate2 wheel built against the 7.2 Python wheels is linked against `hipblas.dll`. Installing the 7.1.1 SDK provides `libhipblas.dll` instead, and the dynamic loader fails with an unhelpful "module not found" error at `import ctranslate2`. See [issue #2016](https://github.com/OpenNMT/CTranslate2/issues/2016) for the full discussion.
+
+**Recommendation:** use the Python wheels shown above. They are self-contained, used by the official CI, and do not require any system-wide installation.
+
+If you can only use the AMD HIP SDK Installer (for example, because you are bundling CTranslate2 inside a PyInstaller `.exe` and want a single Windows-installer experience for end users), you have two options until AMD ships a 7.2 installer:
+
+1. Build a CTranslate2 wheel against ROCm 7.1.1 yourself. This requires using the matching 7.1 SDK wheels (or the 7.1 installer) when running the build, *not* the 7.2 wheels above.
+2. Wait for the upstream AMD HIP SDK Installer to be updated to 7.2.
+
+### 3. Intel oneAPI MKL and oneDNN
+
+CTranslate2 uses oneDNN for its CPU backend on Windows. oneDNN in turn requires Intel MKL for optimal performance.
+
+**Install Intel MKL (devel component only):**
+
+Download the offline installer (≈ 2.5 GB) and install only the `mkl.devel` component:
+
+```bat
+:: Extract the installer
+intel-oneapi-base-toolkit-2025.3.0.372_offline.exe -s -x -f oneapi_extracted
+
+:: Install only the MKL development files
+oneapi_extracted\bootstrapper.exe -s --action install ^
+    --components=intel.oneapi.win.mkl.devel ^
+    --eula=accept ^
+    -p=NEED_VS2017_INTEGRATION=0 ^
+    -p=NEED_VS2019_INTEGRATION=0
+```
+
+**Build oneDNN 3.10.2 from source:**
+
+```bat
+curl -L -O https://github.com/uxlfoundation/oneDNN/archive/refs/tags/v3.10.2.tar.gz
+```
+
+```python
+# Use Python to extract — tar.exe -C is unreliable on Windows
+import tarfile
+tarfile.open("v3.10.2.tar.gz", "r:gz").extractall(".")
+```
+
+Run the following in a **VS Developer Command Prompt** (to make MSVC available):
+
+```bat
+cd oneDNN-3.10.2
+cmake -DCMAKE_BUILD_TYPE=Release ^
+      -DONEDNN_LIBRARY_TYPE=STATIC ^
+      -DONEDNN_BUILD_EXAMPLES=OFF ^
+      -DONEDNN_BUILD_TESTS=OFF ^
+      -DONEDNN_ENABLE_WORKLOAD=INFERENCE ^
+      "-DONEDNN_ENABLE_PRIMITIVE=CONVOLUTION;REORDER" ^
+      -DONEDNN_BUILD_GRAPH=OFF .
+cmake --build . --config Release --target install --parallel
+```
+
+```{note}
+The install step writes to `C:\Program Files (x86)\oneDNN` and requires administrator privileges. Run the install step from an elevated prompt or via `Start-Process -Verb RunAs`.
+```
+
+## Clone the repository
+
+```bash
+git clone --recursive https://github.com/OpenNMT/CTranslate2.git
+cd CTranslate2
+```
+
+If you already cloned without `--recursive`, initialize the submodules explicitly:
+
+```bash
+git submodule update --init --recursive
+```
+
+The required submodules are `spdlog`, `cpu_features`, `cutlass`, `googletest`, `ruy`, `thrust`, and `cxxopts`. CMake will fail with `add_subdirectory: source directory does not exist` if any are missing.
+
+## Build the C++ library
+
+Create a build script (e.g. `build_rocm.bat`) with all required environment variables:
+
+```bat
+@echo off
+
+:: --- ROCm environment ---
+set ROCM_PATH=<output of: python -c "from rocm_sdk._devel import get_devel_root; print(get_devel_root())">
+set HIP_PLATFORM=amd
+set HIP_PATH=%ROCM_PATH%
+set HIP_DEVICE_LIB_PATH=%ROCM_PATH%/lib/llvm/amdgcn/bitcode
+set HIP_CLANG_ROOT=%ROCM_PATH%/lib/llvm
+
+:: Windows SDK resource compiler (required when using Clang as the C/C++ compiler)
+set PATH=C:\Program Files (x86)\Windows Kits\10\bin\10.0.26100.0\x64;%PATH%
+
+:: --- Paths (use forward slashes for CMake) ---
+set CMAKE_EXE=C:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/Common7/IDE/CommonExtensions/Microsoft/CMake/CMake/bin/cmake.exe
+set NINJA_EXE=C:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/Common7/IDE/CommonExtensions/Microsoft/CMake/Ninja/ninja.exe
+set RC_EXE=C:/Program Files (x86)/Windows Kits/10/bin/10.0.26100.0/x64/rc.exe
+set INSTALL_PREFIX=C:/path/to/ctranslate2-install
+
+"%CMAKE_EXE%" -GNinja ^
+    -DCMAKE_BUILD_TYPE=Release ^
+    -S . -B build ^
+    -DCMAKE_MAKE_PROGRAM="%NINJA_EXE%" ^
+    -DCMAKE_C_COMPILER="%ROCM_PATH%/lib/llvm/bin/clang.exe" ^
+    -DCMAKE_CXX_COMPILER="%ROCM_PATH%/lib/llvm/bin/clang++.exe" ^
+    -DCMAKE_RC_COMPILER="%RC_EXE%" ^
+    "-DCMAKE_CXX_FLAGS=-Wno-deprecated-literal-operator" ^
+    "-DCMAKE_HIP_FLAGS=-Wno-deprecated-literal-operator" ^
+    -DCMAKE_INSTALL_PREFIX="%INSTALL_PREFIX%" ^
+    "-DCMAKE_PREFIX_PATH=C:/Program Files (x86)/Intel/oneAPI/compiler/latest/lib;C:/Program Files (x86)/oneDNN" ^
+    -DBUILD_CLI=OFF ^
+    -DWITH_DNNL=ON ^
+    -DWITH_HIP=ON ^
+    "-DCMAKE_HIP_ARCHITECTURES=gfx1100"
+
+"%CMAKE_EXE%" --build build --config Release --parallel
+"%CMAKE_EXE%" --install build --config Release
+```
+
+Replace `gfx1100` with the architecture of your GPU (see the table above). To target multiple GPUs, separate the values with semicolons: `"gfx1100;gfx1101"`.
+
+```{important}
+**Use forward slashes in all CMake paths.** Backslashes are interpreted as escape sequences inside CMake cache strings. A path like `C:\Program Files` becomes invalid (`\P` is not a recognized escape). This applies to all `-D` arguments passed to CMake.
+```
+
+```{note}
+**CMake cannot locate Ninja or `rc.exe` automatically** when Clang is the compiler. Both must be specified explicitly:
+
+- `-DCMAKE_MAKE_PROGRAM` — full path to `ninja.exe` (bundled with VS Build Tools)
+- `-DCMAKE_RC_COMPILER` — full path to `rc.exe` from the Windows SDK
+
+Without `rc.exe`, CMake aborts with: `No CMAKE_RC_COMPILER could be found`.
+```
+
+```{note}
+**CMake cache must be cleared between configuration attempts.** If you change a compiler path or fix a path format error, delete the entire `build/` directory before re-running CMake. Stale cache entries (especially compiler paths) are not overwritten by new `-D` arguments.
+```
+
+A successful configuration ends with:
+
+```
+-- HIP Compiler: .../clang++.exe
+-- CMAKE_HIP_ARCHITECTURES: gfx1100
+-- Configuring done
+-- Generating done
+-- Build files have been written to: .../build
+```
+
+## Build the Python module
+
+Copy the required DLLs into the Python package directory before building:
+
+```powershell
+$install = "C:/path/to/ctranslate2-install"
+Copy-Item "$install/bin/ctranslate2.dll" python/ctranslate2/
+Copy-Item "C:/Program Files (x86)/Intel/oneAPI/2025.3/bin/libiomp5md.dll" python/ctranslate2/
+```
+
+Then install the Python package from a **VS Developer Command Prompt**:
+
+```bat
+set CTRANSLATE2_ROOT=C:/path/to/ctranslate2-install
+set CMAKE_BUILD_PARALLEL_LEVEL=8
+cd python
+pip install "pybind11==2.11.1" setuptools wheel
+pip install . --no-build-isolation
+```
+
+## Verify the installation
+
+When using the ROCm build, the AMD runtime DLLs (`amdhip64_7.dll`, `hipblas.dll`) are not on the system PATH by default. Add the ROCm binary directory before importing the module:
+
+```python
+import os
+from rocm_sdk._devel import get_devel_root
+
+os.add_dll_directory(os.path.join(str(get_devel_root()), "bin"))
+
+import ctranslate2
+
+print(ctranslate2.__version__)
+print(ctranslate2.get_supported_compute_types("cuda"))  # cuda = HIP device
+print(ctranslate2._ext.get_cuda_device_count())
+```
+
+Expected output (RX 7900 XTX):
+
+```
+4.7.1
+{'float32', 'float16', 'bfloat16', 'int8', 'int8_float16', 'int8_bfloat16', 'int8_float32'}
+1
+```
+
+```{note}
+CTranslate2 uses `Device::CUDA` as a unified enum for both CUDA and HIP backends. When using the Python API, specify `device="cuda"` to run inference on an AMD GPU.
+```
+
+## Known limitations
+
+The following features are currently not available when building with `-DWITH_HIP=ON`:
+
+- **Flash Attention** (`-DWITH_FLASH_ATTN=ON`) — mutually exclusive with `WITH_HIP`
+- **Tensor parallelism** (`-DWITH_TENSOR_PARALLEL=ON`) — mutually exclusive with `WITH_HIP`
+- **AWQ quantization** — GPU kernels are not yet ported to HIP; AWQ models fall back to CPU execution
+- **Asynchronous memory allocator** — disabled on Windows/HIP builds due to instability; synchronous allocation is used instead
diff --git a/docs/hardware_support.md b/docs/hardware_support.md
@@ -17,6 +17,18 @@ See the [environment variables](environment_variables.md) `CT2_USE_MKL` and `CT2
 
 ## GPU
 
+### NVIDIA
+
 * NVIDIA GPUs with a Compute Capability greater or equal to 3.5
 
 The driver requirement depends on the CUDA version. See the [CUDA Compatibility guide](https://docs.nvidia.com/deploy/cuda-compatibility/index.html) for more information.
+
+### AMD (ROCm)
+
+* AMD RDNA 2 and RDNA 3 GPUs (RX 6000 and RX 7000 series)
+
+Prebuilt Python wheels for ROCm are available on the [releases page](https://github.com/OpenNMT/CTranslate2/releases/). To build from source on Windows, see the {doc}`building_rocm_windows` guide.
+
+```{note}
+The ROCm backend is exposed through the same `device="cuda"` API as NVIDIA GPUs. CTranslate2 uses a unified device abstraction for both backends.
+```
diff --git a/docs/index.rst b/docs/index.rst
@@ -11,6 +11,7 @@ The documentation includes installation instructions, usage guides, and API refe
 
    quickstart.md
    installation.md
+   building_rocm_windows.md
 
 .. toctree::
    :caption: Tasks

diff --git a/docs/installation.md b/docs/installation.md
@@ -18,6 +18,8 @@ The Python wheels have the following requirements:
 The Linux and Windows Python wheels support GPU execution. Install [CUDA](https://developer.nvidia.com/cuda-toolkit) 12.x to use the GPU.
 
 If you plan to run models with convolutional layers (e.g. for speech recognition), you should also install [cuDNN 8](https://developer.nvidia.com/cudnn) for CUDA 12.x.
+
+If you have an AMD ROCm GPU, prebuilt Python wheels are available on the [releases page](https://github.com/OpenNMT/CTranslate2/releases/). To build from source on Windows with ROCm, see the {doc}`building_rocm_windows` guide.
 ```
 
 ```{note}
@@ -124,7 +126,7 @@ Some build options require additional dependencies. See their respective documen
 * `-DWITH_DNNL=ON` requires [oneDNN](https://github.com/oneapi-src/oneDNN) >= 3.0
 * `-DWITH_ACCELERATE=ON` requires [Accelerate](https://developer.apple.com/documentation/accelerate)
 * `-DWITH_OPENBLAS=ON` requires [OpenBLAS](https://github.com/xianyi/OpenBLAS)
-* `-DWITH_HIP=ON` requires [ROCm libraries](https://rocm.docs.amd.com/en/latest/reference/api-libraries.html)
+* `-DWITH_HIP=ON` requires [ROCm libraries](https://rocm.docs.amd.com/en/latest/reference/api-libraries.html). For a complete Windows build guide, see {doc}`building_rocm_windows`.
 
 Multiple backends can be enabled for a single build, for example: