[WIP] Make accelerate work end-to-end on AMD ROCm #4025
Draft
Abdennacer-Badaoui wants to merge 3 commits into huggingface:main from
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
f711c06 to 3e2b10e
Description
Validated accelerate on 8× MI300X (ROCm 7.1, PyTorch 2.8.0+rocm7.1.0) and patched the gaps. Changes are minimal and gated on ROCm where appropriate; CUDA paths are untouched.
**ROCm detection helpers** (`src/accelerate/utils/imports.py`, `__init__.py`)

- Added `is_rocm_available()` and `is_amdsmi_available()` along with their exports. These helpers are used across the changes below.

**NUMA affinity on ROCm** (`src/accelerate/utils/environment.py`)

- `override_numa_affinity` now branches on ROCm: reads sysfs (`/sys/devices/system/node/node{N}/cpulist`) to resolve a GPU's NUMA node.

**Notebook launcher under ROCm** (`src/accelerate/launchers.py`, `src/accelerate/test_utils/scripts/test_notebook.py`)

- `torch.cuda.is_available()` initializes the HIP runtime in the parent process on ROCm, which breaks fork-based subprocesses.

**DeepSpeed bf16 silently producing NaNs** (`src/accelerate/utils/dataclasses.py`)

- Injects `communication_data_type="fp32"` into the DeepSpeed config (logged at INFO level). User-defined values are respected (no override).

**FSDP2 + tied weights** (`src/accelerate/utils/fsdp_utils.py`)

- `state_dict()` handling for tied weights (`lm_head.weight` vs. deduped `embed_tokens.weight`).
- `fsdp2_load_full_state_dict`: consults `model._tied_weights_keys` on all ranks (keeps broadcast aligned); uses `strict=False` only when such skips occur.

**bitsandbytes (bnb) + tied weights** (`src/accelerate/utils/bnb.py`)

- `load_and_quantize_model` iterated over `named_parameters()` with `remove_duplicate=True`, skipping aliases of tied parameters: `transformer.word_embeddings.weight` was processed but `lm_head.weight` was skipped, making `keep_in_fp32_modules=["lm_head"]` ineffective. Now iterates with `remove_duplicate=False` so all aliases are visited.

**`test_dynamo` cross-device fix** (`tests/test_utils.py`)

- Uses `torch_device` to match inputs.

**Notes**
Some related PRs:
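For reviewers unfamiliar with how ROCm detection is usually done: ROCm builds of PyTorch report a HIP version via `torch.version.hip` (CUDA builds leave it `None`). A minimal sketch of helpers like the ones added here (not the PR's exact code):

```python
import importlib.util


def is_rocm_available() -> bool:
    """Best-effort check for a ROCm build of PyTorch (sketch, not the PR's code).

    ROCm wheels expose the HIP version via ``torch.version.hip``;
    CUDA wheels leave it as ``None``.
    """
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return getattr(torch.version, "hip", None) is not None


def is_amdsmi_available() -> bool:
    """Check whether the ``amdsmi`` (AMD System Management Interface) bindings are importable."""
    return importlib.util.find_spec("amdsmi") is not None
```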
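The sysfs-based NUMA resolution mentioned above can be sketched as follows. The node index would come from the GPU's PCI device, and the `cpulist` file holds comma-separated ranges such as `0-23,48-71`. The helper names below (`parse_cpulist`, `set_affinity_for_node`) are illustrative, not from the PR:

```python
import os


def parse_cpulist(cpulist: str) -> set:
    """Parse a sysfs cpulist string such as '0-23,48-71' into a set of CPU ids."""
    cpus = set()
    for chunk in cpulist.strip().split(","):
        if not chunk:
            continue
        if "-" in chunk:
            lo, hi = chunk.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(chunk))
    return cpus


def numa_node_cpus(node: int) -> set:
    """Read the CPUs belonging to a NUMA node from sysfs (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        return parse_cpulist(f.read())


def set_affinity_for_node(node: int) -> None:
    """Pin the current process to the CPUs of the given NUMA node (hypothetical helper)."""
    os.sched_setaffinity(0, numa_node_cpus(node))
```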
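On the notebook-launcher issue: once the parent process has initialized the HIP runtime (e.g. via `torch.cuda.is_available()`), forked children inherit broken runtime state. One fork-safe way to probe availability, sketched here as an illustration rather than the PR's actual approach, is to do the probe in a throwaway subprocess so the parent never touches the runtime:

```python
import subprocess
import sys


def gpu_available_fork_safe() -> bool:
    """Probe accelerator availability in a short-lived subprocess.

    The parent process never initializes the HIP/CUDA runtime, so it stays
    safe to fork afterwards. Sketch only; not the PR's exact fix. Returns
    False if torch is missing or reports no device.
    """
    code = "import torch; raise SystemExit(0 if torch.cuda.is_available() else 1)"
    result = subprocess.run([sys.executable, "-c", code], capture_output=True)
    return result.returncode == 0
```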
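The DeepSpeed change amounts to a conditional default: force fp32 gradient communication on ROCm bf16 runs unless the user already set a value. A sketch over a plain config dict (function and logger names are illustrative, not the PR's):

```python
import logging

logger = logging.getLogger(__name__)


def maybe_force_fp32_comms(ds_config: dict, rocm: bool) -> dict:
    """On ROCm, default DeepSpeed bf16 runs to fp32 gradient communication.

    A user-supplied ``communication_data_type`` is always respected; the
    override only fills in a missing value (sketch of the PR's behavior).
    """
    bf16_enabled = ds_config.get("bf16", {}).get("enabled", False)
    if rocm and bf16_enabled and "communication_data_type" not in ds_config:
        ds_config["communication_data_type"] = "fp32"
        logger.info("ROCm + bf16: defaulting communication_data_type to fp32 to avoid NaNs")
    return ds_config
```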
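The FSDP2 tied-weights skip boils down to partitioning the full state dict by `_tied_weights_keys` identically on every rank, so the broadcasts stay aligned, and relaxing `strict` only when something was actually skipped. A torch-free sketch of that logic (function name is illustrative):

```python
def split_tied_keys(full_state_dict: dict, tied_weights_keys):
    """Partition a full state dict into keys to load and tied aliases to skip.

    Every rank must compute the same partition so subsequent broadcasts stay
    aligned; ``strict=False`` is only warranted when ``skipped`` is non-empty
    (sketch of the behavior described in the PR, not its code).
    """
    tied = set(tied_weights_keys or [])
    to_load = {k: v for k, v in full_state_dict.items() if k not in tied}
    skipped = sorted(k for k in full_state_dict if k in tied)
    strict = len(skipped) == 0
    return to_load, skipped, strict
```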