support mthreads gpu monitoring by gingerXue · Pull Request #198 · XuehaiPan/nvitop

gingerXue · 2025-12-19T08:18:45Z

Issue Type

Improvement/feature implementation

Runtime Environment

Operating system and version: Ubuntu 22.04.4 LTS
Terminal emulator and version: xterm-256color
Python version: 3.10.12
NVML version (driver version): N/A
MTML version: 2.2.0
nvitop version or commit: 1.6.2.dev4+g31792dd
mthreads-ml-py version: 2.2.0
Locale: C.UTF-8

Description

This PR adds Mthreads GPU (mtml) support to nvitop, enabling basic GPU monitoring on platforms where mtml is available. We developed a wrapper layer around mthreads-ml-py that exposes NVML-style methods, which keeps the changes to this project small.

The implementation is designed to be non-intrusive and fully backward compatible with existing NVML-based workflows.

Motivation and Context

nvitop currently relies on NVIDIA NVML, which makes it unusable on systems equipped with MTGPU devices.
In such environments, users lack a lightweight, top-like GPU monitoring tool.

This PR aims to:

Extend nvitop to support MTGPU-based platforms
Preserve existing behavior on NVIDIA GPUs
Minimize impact on the current code structure

Design & Implementation

Introduced a new backend based on mtml, parallel to the existing NVML backend
Runtime detection is used to select the appropriate backend:
- nvml → NVIDIA GPUs
- mtml → MTGPU devices
Implemented a compatibility layer to map MTGPU APIs to nvitop's internal data structures
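
As a rough illustration of the runtime detection described above, the sketch below picks the first backend whose Python binding can be imported. It is not the code in this PR: the function and constant names are hypothetical, `pynvml` is the module installed by nvidia-ml-py, and `pymtml` is an assumed module name for mthreads-ml-py that should be checked against the installed package.

```python
import importlib.util

# Candidate backends in priority order: (backend name, importable module name).
# "pynvml" is the module shipped by nvidia-ml-py; "pymtml" is an ASSUMED module
# name for mthreads-ml-py, used here only for illustration.
CANDIDATES = (("nvml", "pynvml"), ("mtml", "pymtml"))


def detect_backend(candidates=CANDIDATES):
    """Return the name of the first backend whose binding is importable, else None."""
    for name, module in candidates:
        # find_spec() returns None for a missing top-level module without importing it.
        if importlib.util.find_spec(module) is not None:
            return name
    return None


if __name__ == "__main__":
    print(f"Selected backend: {detect_backend()}")
```

An importable binding does not guarantee that the vendor's shared library loads or that a device is present, so a real implementation would likely also need to attempt initialization and fall through to the next candidate on failure.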

Currently Supported Features (MTGPU)

Driver Version
GPU device enumeration
Total / used memory reporting
Basic utilization metrics
Power usage

Not Yet Supported

MIG-related features
Process enumeration and utilization
CUDA driver version information
Persistence Mode
Bus-Id information
Advanced performance counters (not available in mtml)

Testing

Tested on:

MTGPU platform with mtml

Manual test cases include:

nvitop startup and refresh
MTGPU information
Memory usage display
Mixed error handling when NVML is not present

Basic API test

from nvitop import Device

count = Device.count()
print(f'There are {count} MUSA devices')
devices = Device.all()

for device in devices:
    processes = device.processes()
    sorted_pids = sorted(processes)
    
    print(device)
    print(f'  - Fan speed:       {device.fan_speed()}%')
    print(f'  - Temperature:     {device.temperature()}C')
    print(f'  - GPU utilization: {device.gpu_utilization()}%')
    print(f'  - Total memory:    {device.memory_total_human()}')
    print(f'  - Used memory:     {device.memory_used_human()}')
    print(f'  - Free memory:     {device.memory_free_human()}')
    print(f'  - Processes ({len(processes)}): {sorted_pids}')
    for pid in sorted_pids:
        print(f'    - {processes[pid]}')
    print('-' * 120)

There are 8 MUSA devices
PhysicalDevice(index=0, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     52C
  - GPU utilization: 0%
  - Total memory:    80.00GiB
  - Used memory:     78.88GiB
  - Free memory:     1148MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=1, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     67C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     73.63GiB
  - Free memory:     6519MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=2, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     67C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     71.03GiB
  - Free memory:     9187MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=3, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     59C
  - GPU utilization: 59%
  - Total memory:    80.00GiB
  - Used memory:     78.23GiB
  - Free memory:     1810MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=4, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     77C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     73.39GiB
  - Free memory:     6765MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=5, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     69C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     72.68GiB
  - Free memory:     7497MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=6, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     78C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     75.62GiB
  - Free memory:     4480MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=7, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     63C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     72.48GiB
  - Free memory:     7702MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------

Future Work

Extend MTGPU metrics as mtml evolves
Add automated tests for backend selection
Improve feature parity where possible

XuehaiPan · 2026-05-06T08:29:34Z

Hi @gingerXue, thanks for the work — and for picking the right structural approach. mthreads-ml-py is officially published by Moore Threads Corporation on PyPI, wraps libmtml.so via ctypes directly, and is declared as an optional dep — that clears the bars I just laid out in #211 and #210. The remaining concerns are different.

1. Support matrix and ABI compatibility (main blocker)

The hard question for any non-NVIDIA backend is: which combinations of wrapper version × driver version × runtime version do we promise to keep working? For NVIDIA we rely on nvidia-ml-py tracking NVML's documented stable ABI. For Moore Threads, that mapping isn't documented anywhere I can find — mthreads-ml-py's release history jumps 0.0.1 → 2.2.0 → ... → 2.2.11 with no SemVer line for the Python API, which strongly suggests the wrapper version mirrors the underlying MTML library (or MUSA) version, not the Python binding's own compatibility. There's no published mapping of mthreads-ml-py ↔ minimum MT driver ↔ MTML ABI ↔ MUSA runtime, and no documented MTML ABI compatibility window across MT driver major versions.

Before this can merge we need: a documented support matrix from Moore Threads (ideally surfaced in the mthreads-ml-py README), a stated MTML ABI compatibility window across driver majors, and a pinned/bounded mthreads-ml-py version range in pyproject.toml reflecting what nvitop will actually support.
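
To make the "pinned/bounded version range" concrete, here is a minimal sketch of a runtime guard that checks the installed mthreads-ml-py distribution against a half-open range. The bounds `2.2.0` and `2.3.0` and all function names are hypothetical placeholders, not a range anyone has committed to. The parser only handles plain dotted releases.

```python
from importlib import metadata


def parse_version(text):
    """Parse a plain dotted release such as '2.2.11' into a tuple of ints.

    Pre-release or dev suffixes (e.g. '1.6.2.dev4') are not handled and raise ValueError.
    """
    return tuple(int(part) for part in text.split("."))


def in_supported_range(version, lower, upper):
    """Return True if lower <= version < upper for plain dotted releases."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


def installed_binding_supported(dist="mthreads-ml-py", lower="2.2.0", upper="2.3.0"):
    """Check the installed distribution against HYPOTHETICAL bounds."""
    try:
        version = metadata.version(dist)
    except metadata.PackageNotFoundError:
        # The optional dependency is not installed at all.
        return False
    return in_supported_range(version, lower, upper)


if __name__ == "__main__":
    print(installed_binding_supported())
```

The same range would normally be declared once in the optional-dependency specifier in pyproject.toml. A runtime check like this only helps produce a clearer error when an environment drifts outside it.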

2. `nvitop/api/libnvml.py` should stay NVML-only

The +33 / -9 delta inside libnvml.py adds runtime backend selection to the NVML wrapper module itself. That conflates two vendor backends in one file and locks future non-NVIDIA backends (#210 MetaX, AMD in #211) into the same dispatch shape. Backend dispatch belongs in a separate layer (the nvitop.api.backends abstraction discussed in #211), not inside libnvml.py — design-blocked on the same question, which I'd rather settle once.

3. Missing process enumeration is a backend-completeness blocker

You note the MTGPU backend currently does not support process enumeration, MIG, CUDA driver version, persistence mode, or Bus-Id. Process enumeration is a core nvitop feature — the entire bottom half of the TUI is process-centric, and Device.processes() is one of the most-used API entry points. Shipping an MTGPU backend without it would mislead users expecting feature parity. If MTML doesn't expose process info at all, that's an upstream gap to file with Moore Threads; if it does, please wire it up before merge.

To summarize: the structural choices here are right, but this is upstream-blocked on a documented mthreads-ml-py support matrix and MTML ABI compatibility statement, design-blocked on the backend-abstraction question in #211, and functionally blocked on process enumeration. Happy to revisit once those are in place.

support mthreads-ml-py

a1c3f09

gingerXue force-pushed the feat/mtgpu-support branch from 272c161 to a1c3f09 Compare December 19, 2025 09:08

gingerXue closed this Dec 22, 2025

gingerXue reopened this Dec 22, 2025

This was referenced May 6, 2026

[RFC] AMD GPU Support #211

Open

Add mx-smi backend support for MetaX GPUs #210

Draft

XuehaiPan self-assigned this May 6, 2026

XuehaiPan marked this pull request as draft May 6, 2026 08:29

gingerXue wants to merge 1 commit into XuehaiPan:main from gingerXue:feat/mtgpu-support

3 participants
