support mthreads gpu monitoring #198

Draft
gingerXue wants to merge 1 commit into XuehaiPan:main from gingerXue:feat/mtgpu-support

@gingerXue

Issue Type

  • Improvement/feature implementation

Runtime Environment

  • Operating system and version: Ubuntu 22.04.4 LTS
  • Terminal emulator and version: xterm-256color
  • Python version: 3.10.12
  • NVML version (driver version): N/A
  • MTML version: 2.2.0
  • nvitop version or commit: 1.6.2.dev4+g31792dd
  • mthreads-ml-py version: 2.2.0
  • Locale: C.UTF-8

Description

This PR adds Mthreads GPU (mtml) support to nvitop, enabling basic GPU monitoring on platforms where mtml is available. It adds a wrapper layer around mthreads-ml-py that exposes NVML-style methods, which keeps the changes to the rest of the project small.

The implementation is designed to be non-intrusive and fully backward compatible with existing NVML-based workflows.
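
To illustrate the wrapper idea, here is a minimal sketch, not the code in this PR. `FakeVendorLib` and its method names are hypothetical stand-ins for the vendor binding; only the NVML-style names on the left of the mapping (`nvmlDeviceGetCount`, `nvmlDeviceGetName`) are real NVML function names, and their arguments are simplified here.

```python
class FakeVendorLib:
    """Hypothetical stand-in for a vendor binding; not the real mthreads-ml-py API."""

    def device_count(self):
        return 2

    def device_name(self, index):
        return f'Fake GPU {index}'


class NvmlStyleAdapter:
    """Expose NVML-style function names on top of a differently named vendor API."""

    # NVML-style name -> vendor method name (the right-hand side is hypothetical).
    _NAME_MAP = {
        'nvmlDeviceGetCount': 'device_count',
        'nvmlDeviceGetName': 'device_name',
    }

    def __init__(self, vendor):
        self._vendor = vendor

    def __getattr__(self, name):
        # Only invoked when normal attribute lookup fails.
        try:
            target = self._NAME_MAP[name]
        except KeyError:
            raise AttributeError(f'{name} is not mapped for this backend') from None
        return getattr(self._vendor, target)


adapter = NvmlStyleAdapter(FakeVendorLib())
print(adapter.nvmlDeviceGetCount())
print(adapter.nvmlDeviceGetName(0))
```

Callers written against NVML-style names keep working, and an unmapped name fails with an ordinary `AttributeError` instead of silently returning wrong data.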


Motivation and Context

nvitop currently relies on NVIDIA NVML, which makes it unusable on systems equipped with MTGPU devices.
In such environments, users lack a lightweight, top-like GPU monitoring tool.

This PR aims to:

  • Extend nvitop to support MTGPU-based platforms
  • Preserve existing behavior on NVIDIA GPUs
  • Minimize impact on the current code structure

Design & Implementation

  • Introduced a new backend based on mtml, parallel to the existing NVML backend
  • Runtime detection is used to select the appropriate backend:
    • nvml → NVIDIA GPUs
    • mtml → MTGPU devices
  • Implemented a compatibility layer to map MTGPU APIs to nvitop's internal data structures
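
The runtime detection step can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the module names `pynvml` and `pymtml` in `BACKEND_MODULES` are assumed import names for the two bindings, and a real check would also need to confirm that the shared library loads and initializes.

```python
from __future__ import annotations

import importlib.util

# Backend name -> top-level module to probe (module names are assumptions).
BACKEND_MODULES = {
    'nvml': 'pynvml',
    'mtml': 'pymtml',
}


def detect_backend(candidates: dict[str, str] = BACKEND_MODULES) -> str | None:
    """Return the first backend whose Python binding is importable, else None."""
    for backend, module in candidates.items():
        # find_spec() returns None for a missing top-level module
        # without actually importing it.
        if importlib.util.find_spec(module) is not None:
            return backend
    return None


print(detect_backend())
```

The dictionary order sets the priority, so NVML is probed first and existing NVIDIA setups see no change.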

Currently Supported Features (MTGPU)

  • Driver Version
  • GPU device enumeration
  • Total / used memory reporting
  • Basic utilization metrics
  • Power usage

Not Yet Supported

  • MIG-related features
  • Process enumeration and utilization
  • CUDA driver version information
  • Persistence Mode
  • Bus-Id information
  • Advanced performance counters (not available in mtml)
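
One way unsupported queries could degrade gracefully is to fall back to a placeholder value instead of raising. The sketch below is illustrative only: `NotSupportedError`, `query`, `power_usage`, and `bus_id` are hypothetical names, and the returned numbers are made up.

```python
class NotSupportedError(Exception):
    """Hypothetical error raised when a backend lacks a given query."""


NA = 'N/A'


def query(func, *args, default=NA):
    """Call a backend query and return `default` if the backend lacks it."""
    try:
        return func(*args)
    except NotSupportedError:
        return default


def power_usage(index):
    # Supported query: returns a made-up reading in milliwatts.
    return 150_000


def bus_id(index):
    # Unsupported query: the backend reports that it cannot answer.
    raise NotSupportedError('bus id is not exposed by this backend')


print(query(power_usage, 0))
print(query(bus_id, 0))
```

Only the "not supported" case is swallowed; any other exception still propagates, so real failures stay visible.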

Testing

Tested on:

  • MTGPU platform with mtml

Manual test cases include:

  • nvitop startup and refresh
  • MTGPU information
  • Memory usage display
  • Mixed error handling when NVML is not present

Basic API test:

from nvitop import Device

count = Device.count()
print(f'There are {count} MUSA devices')
devices = Device.all()

for device in devices:
    processes = device.processes()
    sorted_pids = sorted(processes)
    
    print(device)
    print(f'  - Fan speed:       {device.fan_speed()}%')
    print(f'  - Temperature:     {device.temperature()}C')
    print(f'  - GPU utilization: {device.gpu_utilization()}%')
    print(f'  - Total memory:    {device.memory_total_human()}')
    print(f'  - Used memory:     {device.memory_used_human()}')
    print(f'  - Free memory:     {device.memory_free_human()}')
    print(f'  - Processes ({len(processes)}): {sorted_pids}')
    for pid in sorted_pids:
        print(f'    - {processes[pid]}')
    print('-' * 120)

Output:

There are 8 MUSA devices
PhysicalDevice(index=0, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     52C
  - GPU utilization: 0%
  - Total memory:    80.00GiB
  - Used memory:     78.88GiB
  - Free memory:     1148MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=1, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     67C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     73.63GiB
  - Free memory:     6519MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=2, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     67C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     71.03GiB
  - Free memory:     9187MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=3, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     59C
  - GPU utilization: 59%
  - Total memory:    80.00GiB
  - Used memory:     78.23GiB
  - Free memory:     1810MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=4, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     77C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     73.39GiB
  - Free memory:     6765MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=5, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     69C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     72.68GiB
  - Free memory:     7497MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=6, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     78C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     75.62GiB
  - Free memory:     4480MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=7, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     63C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     72.48GiB
  - Free memory:     7702MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------

Future Work

  • Extend MTGPU metrics as mtml evolves
  • Add automated tests for backend selection
  • Improve feature parity where possible

Images / Videos

image

@XuehaiPan

Owner

Hi @gingerXue, thanks for the work — and for picking the right structural approach. mthreads-ml-py is officially published by Moore Threads Corporation on PyPI, wraps libmtml.so via ctypes directly, and is declared as an optional dep — that clears the bars I just laid out in #211 and #210. The remaining concerns are different.

1. Support matrix and ABI compatibility (main blocker)

The hard question for any non-NVIDIA backend is: which combinations of wrapper version × driver version × runtime version do we promise to keep working? For NVIDIA we rely on nvidia-ml-py tracking NVML's documented stable ABI. For Moore Threads, that mapping isn't documented anywhere I can find — mthreads-ml-py's release history jumps 0.0.12.2.0 → ... → 2.2.11 with no SemVer line for the Python API, which strongly suggests the wrapper version mirrors the underlying MTML library (or MUSA) version, not the Python binding's own compatibility. There's no published mapping of mthreads-ml-py ↔ minimum MT driver ↔ MTML ABI ↔ MUSA runtime, and no documented MTML ABI compatibility window across MT driver major versions.

Before this can merge we need: a documented support matrix from Moore Threads (ideally surfaced in the mthreads-ml-py README), a stated MTML ABI compatibility window across driver majors, and a pinned/bounded mthreads-ml-py version range in pyproject.toml reflecting what nvitop will actually support.
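
For illustration, a bounded version range could also be checked at import time along these lines. The bounds `MINIMUM` and `UPPER` are placeholders chosen only for the example, since no support matrix has been published, and `parse_release` is a deliberately simplified hypothetical helper rather than a full version parser.

```python
def parse_release(version):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in version.split('.'):
        if not piece.isdigit():
            break  # stop at suffixes such as 'dev4+g31792dd'
        parts.append(int(piece))
    return tuple(parts)


def is_supported(version, minimum, upper):
    """True if minimum <= version < upper, comparing numeric components."""
    return minimum <= parse_release(version) < upper


# Placeholder bounds, NOT a statement of what nvitop or Moore Threads supports.
MINIMUM = (2, 2, 0)
UPPER = (3,)

# In real code the installed version could come from
# importlib.metadata.version('mthreads-ml-py').
for candidate in ('2.2.0', '2.2.11', '3.0.0'):
    print(candidate, is_supported(candidate, MINIMUM, UPPER))
```

A runtime check like this complements the pin in pyproject.toml, because it also catches environments where the dependency resolver was bypassed.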

2. nvitop/api/libnvml.py should stay NVML-only

The +33 / -9 delta inside libnvml.py adds runtime backend selection to the NVML wrapper module itself. That conflates two vendor backends in one file and locks future non-NVIDIA backends (#210 MetaX, AMD in #211) into the same dispatch shape. Backend dispatch belongs in a separate layer (the nvitop.api.backends abstraction discussed in #211), not inside libnvml.py — design-blocked on the same question, which I'd rather settle once.

3. Missing process enumeration is a backend-completeness blocker

You note the MTGPU backend currently does not support process enumeration, MIG, CUDA driver version, persistence mode, or Bus-Id. Process enumeration is a core nvitop feature — the entire bottom half of the TUI is process-centric, and Device.processes() is one of the most-used API entry points. Shipping an MTGPU backend without it would mislead users expecting feature parity. If MTML doesn't expose process info at all, that's an upstream gap to file with Moore Threads; if it does, please wire it up before merge.


To summarize: the structural choices here are right, but this is upstream-blocked on a documented mthreads-ml-py support matrix and MTML ABI compatibility statement, design-blocked on the backend-abstraction question in #211, and functionally blocked on process enumeration. Happy to revisit once those are in place.

@XuehaiPan XuehaiPan self-assigned this May 6, 2026
@XuehaiPan XuehaiPan marked this pull request as draft May 6, 2026 08:29