This PR adds Moore Threads (Mthreads) GPU support to nvitop through mtml, enabling basic GPU monitoring on platforms where mtml is available. We developed a wrapper layer around mthreads-ml-py that exposes NVML-style methods, which keeps the changes to this project small.
The implementation is designed to be non-intrusive and fully backward compatible with existing NVML-based workflows.
Motivation and Context
nvitop currently relies on NVIDIA NVML, which makes it unusable on systems equipped with MTGPU devices.
In such environments, users lack a lightweight, top-like GPU monitoring tool.
This PR aims to:
Extend nvitop to support MTGPU-based platforms
Preserve existing behavior on NVIDIA GPUs
Minimize impact on the current code structure
Design & Implementation
Introduced a new backend based on mtml, parallel to the existing NVML backend
Runtime detection is used to select the appropriate backend:
nvml → NVIDIA GPUs
mtml → MTGPU devices
Implemented a compatibility layer to map MTGPU APIs to nvitop's internal data structures
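The runtime detection described above can be sketched roughly as follows. This is an illustration and not the code in this PR: the `detect_backend` helper and the `CANDIDATES` table are hypothetical names, and `pymtml` is assumed to be the module name that mthreads-ml-py installs. A real check would also need to initialize the library to confirm that a driver is actually present, which this sketch does not do.

```python
import importlib
from typing import Callable, Optional, Sequence, Tuple

# (backend label, importable module name).
# 'pynvml' is the module shipped by nvidia-ml-py; 'pymtml' is an assumption
# about what mthreads-ml-py installs and may need adjusting.
CANDIDATES: Sequence[Tuple[str, str]] = (
    ('nvml', 'pynvml'),
    ('mtml', 'pymtml'),
)


def detect_backend(
    candidates: Sequence[Tuple[str, str]] = CANDIDATES,
    importer: Callable[[str], object] = importlib.import_module,
) -> Optional[str]:
    """Return the label of the first backend whose module imports, else None."""
    for label, module_name in candidates:
        try:
            importer(module_name)
        except ImportError:
            continue  # binding not installed; try the next vendor
        return label
    return None
```

The `importer` parameter exists only so the probing order can be exercised without either vendor library installed.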
Currently Supported Features (MTGPU)
Driver Version
GPU device enumeration
Total / used memory reporting
Basic utilization metrics
Power usage
Not Yet Supported
MIG-related features
Process enumeration and utilization
CUDA driver version information
Persistence Mode
Bus-Id information
Advanced performance counters (not available in mtml)
Testing
Tested on:
MTGPU platform with mtml
Manual test cases include:
nvitop startup and refresh
MTGPU information
Memory usage display
Mixed error handling when NVML is not present
Basic API test
from nvitop import Device

count = Device.count()
print(f'There are {count} MUSA devices')
devices = Device.all()
for device in devices:
    processes = device.processes()
    sorted_pids = sorted(processes)
    print(device)
    print(f' - Fan speed: {device.fan_speed()}%')
    print(f' - Temperature: {device.temperature()}C')
    print(f' - GPU utilization: {device.gpu_utilization()}%')
    print(f' - Total memory: {device.memory_total_human()}')
    print(f' - Used memory: {device.memory_used_human()}')
    print(f' - Free memory: {device.memory_free_human()}')
    print(f' - Processes ({len(processes)}): {sorted_pids}')
    for pid in sorted_pids:
        print(f' - {processes[pid]}')
    print('-' * 120)
There are 8 MUSA devices
PhysicalDevice(index=0, name='MTT S5000', total_memory=80.00GiB)
- Fan speed: 0%
- Temperature: 52C
- GPU utilization: 0%
- Total memory: 80.00GiB
- Used memory: 78.88GiB
- Free memory: 1148MiB
- Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=1, name='MTT S5000', total_memory=80.00GiB)
- Fan speed: 0%
- Temperature: 67C
- GPU utilization: 99%
- Total memory: 80.00GiB
- Used memory: 73.63GiB
- Free memory: 6519MiB
- Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=2, name='MTT S5000', total_memory=80.00GiB)
- Fan speed: 0%
- Temperature: 67C
- GPU utilization: 99%
- Total memory: 80.00GiB
- Used memory: 71.03GiB
- Free memory: 9187MiB
- Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=3, name='MTT S5000', total_memory=80.00GiB)
- Fan speed: 0%
- Temperature: 59C
- GPU utilization: 59%
- Total memory: 80.00GiB
- Used memory: 78.23GiB
- Free memory: 1810MiB
- Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=4, name='MTT S5000', total_memory=80.00GiB)
- Fan speed: 0%
- Temperature: 77C
- GPU utilization: 99%
- Total memory: 80.00GiB
- Used memory: 73.39GiB
- Free memory: 6765MiB
- Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=5, name='MTT S5000', total_memory=80.00GiB)
- Fan speed: 0%
- Temperature: 69C
- GPU utilization: 99%
- Total memory: 80.00GiB
- Used memory: 72.68GiB
- Free memory: 7497MiB
- Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=6, name='MTT S5000', total_memory=80.00GiB)
- Fan speed: 0%
- Temperature: 78C
- GPU utilization: 99%
- Total memory: 80.00GiB
- Used memory: 75.62GiB
- Free memory: 4480MiB
- Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=7, name='MTT S5000', total_memory=80.00GiB)
- Fan speed: 0%
- Temperature: 63C
- GPU utilization: 99%
- Total memory: 80.00GiB
- Used memory: 72.48GiB
- Free memory: 7702MiB
- Processes (0): []
------------------------------------------------------------------------------------------------------------------------
Hi @gingerXue, thanks for the work — and for picking the right structural approach. mthreads-ml-py is officially published by Moore Threads Corporation on PyPI, wraps libmtml.so via ctypes directly, and is declared as an optional dep — that clears the bars I just laid out in #211 and #210. The remaining concerns are different.
1. Support matrix and ABI compatibility (main blocker)
The hard question for any non-NVIDIA backend is: which combinations of wrapper version × driver version × runtime version do we promise to keep working? For NVIDIA we rely on nvidia-ml-py tracking NVML's documented stable ABI. For Moore Threads, that mapping isn't documented anywhere I can find — mthreads-ml-py's release history jumps 0.0.1 → 2.2.0 → ... → 2.2.11 with no SemVer line for the Python API, which strongly suggests the wrapper version mirrors the underlying MTML library (or MUSA) version, not the Python binding's own compatibility. There's no published mapping of mthreads-ml-py ↔ minimum MT driver ↔ MTML ABI ↔ MUSA runtime, and no documented MTML ABI compatibility window across MT driver major versions.
Before this can merge we need: a documented support matrix from Moore Threads (ideally surfaced in the mthreads-ml-py README), a stated MTML ABI compatibility window across driver majors, and a pinned/bounded mthreads-ml-py version range in pyproject.toml reflecting what nvitop will actually support.
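For illustration, a bounded version range could also be checked at runtime along these lines. The bounds below are hypothetical placeholders, since no support matrix has been published; the helper names are likewise made up for this sketch, and the parser only handles plain dotted releases such as `2.2.11`, not pre-release or local version suffixes.

```python
from importlib import metadata
from typing import Optional, Tuple

# Hypothetical bounds for illustration only; the real window would have to
# come from a support matrix published by the vendor.
MIN_SUPPORTED = (2, 2, 0)  # inclusive
MAX_SUPPORTED = (2, 3, 0)  # exclusive


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a plain dotted release such as '2.2.11' into (2, 2, 11)."""
    return tuple(int(part) for part in text.split('.'))


def in_supported_window(version: str) -> bool:
    """Compare as integer tuples so that 2.2.11 sorts after 2.2.9."""
    return MIN_SUPPORTED <= parse_version(version) < MAX_SUPPORTED


def installed_binding_supported(dist: str = 'mthreads-ml-py') -> Optional[bool]:
    """Return None if the distribution is not installed, else whether it is in range."""
    try:
        return in_supported_window(metadata.version(dist))
    except metadata.PackageNotFoundError:
        return None
```

Comparing integer tuples rather than strings matters here because the release history includes 2.2.11, which a string comparison would sort before 2.2.9.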
2. nvitop/api/libnvml.py should stay NVML-only
The +33 / -9 delta inside libnvml.py adds runtime backend selection to the NVML wrapper module itself. That conflates two vendor backends in one file and locks future non-NVIDIA backends (#210 MetaX, AMD in #211) into the same dispatch shape. Backend dispatch belongs in a separate layer (the nvitop.api.backends abstraction discussed in #211), not inside libnvml.py — design-blocked on the same question, which I'd rather settle once.
3. Missing process enumeration is a backend-completeness blocker
You note the MTGPU backend currently does not support process enumeration, MIG, CUDA driver version, persistence mode, or Bus-Id. Process enumeration is a core nvitop feature — the entire bottom half of the TUI is process-centric, and Device.processes() is one of the most-used API entry points. Shipping an MTGPU backend without it would mislead users expecting feature parity. If MTML doesn't expose process info at all, that's an upstream gap to file with Moore Threads; if it does, please wire it up before merge.
To summarize: the structural choices here are right, but this is upstream-blocked on a documented mthreads-ml-py support matrix and MTML ABI compatibility statement, design-blocked on the backend-abstraction question in #211, and functionally blocked on process enumeration. Happy to revisit once those are in place.
Runtime Environment
Ubuntu 22.04.4 LTS
xterm-256color
3.10.12
N/A
2.2.0
nvitop version or commit: 1.6.2.dev4+g31792dd
mthreads-ml-py version: 2.2.0
C.UTF-8