Add mx-smi backend support for MetaX GPUs #210
Agent-Logs-Url: https://github.com/mhson-kyle/nvitop/sessions/9e63a25a-5033-4588-bfdd-3fb0d64c9d9f Co-authored-by: mhson-kyle <72399227+mhson-kyle@users.noreply.github.com>
Agent-Logs-Url: https://github.com/mhson-kyle/nvitop/sessions/e5fd1e19-5d52-4ab0-ac60-5b545ffb9632 Co-authored-by: mhson-kyle <72399227+mhson-kyle@users.noreply.github.com>
Pull request overview
Adds a new mx-smi-based GPU query backend so nvitop can run on MetaX GPU systems where NVIDIA NVML is unavailable, and updates the TUI header to display MetaX-appropriate version labels.
Changes:
- Introduce `nvitop/api/libmxsmi.py` to query/parse MetaX GPU + process info via `mx-smi` with caching.
- Update `nvitop/api/device.py` to select/fallback between NVML and `mx-smi` and expose the active backend.
- Update the TUI device panel header to show `KMD Version`/`MACA Version` when the `mx-smi` backend is active.
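
The caching mentioned in the first bullet can be pictured as a short-lived cache around whatever function produces the `mx-smi` text. The following is only a minimal sketch under that assumption: `CachedCommand`, its `ttl` parameter, and the injected `run` callable are hypothetical names, not the code in `libmxsmi.py`, and no `mx-smi` flags or output format are assumed.

```python
import time
from typing import Callable, Optional


class CachedCommand:
    """Cache the text output of an expensive command for `ttl` seconds.

    Hypothetical helper for illustration; not part of nvitop.
    """

    def __init__(
        self,
        run: Callable[[], str],
        ttl: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._run = run          # callable that actually invokes the tool
        self._ttl = ttl          # how long a snapshot stays fresh
        self._clock = clock      # injectable clock, handy for testing
        self._value: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached output, refreshing it once it has expired."""
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = self._run()
            self._expires_at = now + self._ttl
        return self._value
```

A caller would pass a function that runs the CLI (for example via `subprocess.run`) as `run`, so that several property reads within one refresh interval share a single process spawn. This sketch is not thread-safe; a real backend would likely guard `get()` with a lock.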
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| nvitop/tui/screens/main/panels/device.py | Updates displayed version labels based on active backend (nvml vs mx-smi). |
| nvitop/api/libmxsmi.py | New backend implementation that runs mx-smi and parses device/process snapshots with caching. |
| nvitop/api/device.py | Backend detection, fallback logic, and MetaX-backed implementations for key device properties/process enumeration. |
| `nvitop/api/__init__.py` | Exposes libmxsmi from the API package. |
| `nvitop/__init__.py` | Exposes libmxsmi at the top-level nvitop namespace. |

    @classmethod
    def is_available(cls) -> bool:
        """Test whether there are any devices and the NVML library is successfully loaded."""

The is_available() docstring is now misleading: it says availability requires NVML to be loaded, but the implementation can return availability based on the mx-smi backend as well (and catches libmxsmi.MxSmiError). Please update the docstring to describe backend-agnostic availability, or rename/split the check if it is intended to remain NVML-specific.
Suggested change:

    - """Test whether there are any devices and the NVML library is successfully loaded."""
    + """Test whether any GPU devices are available via the active backend."""

    driver without reloading the kernel module.
    """
    return libnvml.nvmlQuery('nvmlSystemGetDriverVersion')
    if _should_use_mxsmi_backend():
        return libmxsmi.driver_version()
    try:

Device.driver_version() can now return the MetaX KMD version when the active backend is mx-smi, but the docstring and command-line equivalent are NVIDIA/NVML-specific. Please update the documentation to clarify that the returned value (and its meaning) is backend-dependent.

    If RM detects a driver/library version mismatch, usually after an upgrade for NVIDIA
    driver without reloading the kernel module.
    """
    cuda_driver_version = libnvml.nvmlQuery('nvmlSystemGetCudaDriverVersion')
    if _should_use_mxsmi_backend():
        return libmxsmi.maca_version()

Device.cuda_driver_version() returns the MetaX MACA runtime version when using the mx-smi backend, but the docstring currently describes the NVIDIA CUDA driver version semantics. Please update the docstring to reflect backend-dependent meaning and naming (e.g., MACA vs CUDA driver).

    memory_utilization=device.memory_utilization,
    temperature=device.temperature,
    power_usage=device.power_usage,
    power_limit=device.power_limit,
    fan_speed=device.fan_speed,

In the mx-smi snapshot merge, memory_utilization and fan_speed are copied from the per-sample DeviceInfo, but _parse_summary_output() never populates these fields. As a result, Device.utilization_rates().memory and Device.fan_speed() will always be NA for MetaX devices. Either extend the parser to extract these metrics from mx-smi output (if available) or drop these fields to avoid advertising unsupported data.
Suggested change:

    - memory_utilization=device.memory_utilization,
    - temperature=device.temperature,
    - power_usage=device.power_usage,
    - power_limit=device.power_limit,
    - fan_speed=device.fan_speed,
    + temperature=device.temperature,
    + power_usage=device.power_usage,
    + power_limit=device.power_limit,

    suppress_logs = libmxsmi.is_forced() or shutil.which('mx-smi') is not None
    logger_disabled = libnvml.LOGGER.disabled
    if suppress_logs:
        libnvml.LOGGER.disabled = True
    try:

_nvml_probe() temporarily toggles libnvml.LOGGER.disabled without any synchronization. If multiple threads enter this context concurrently, the last exit can restore the logger to the wrong prior state (e.g., leave it permanently disabled). Consider protecting this with a lock and/or a reference counter so nested/concurrent probes restore the original state correctly.
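
One way to read the lock-plus-reference-counter suggestion is sketched below. It is an illustration only: `LoggerSuppressor` is a hypothetical helper, not code from this PR, and it works on any standard `logging.Logger` rather than on `libnvml.LOGGER` specifically. The first entrant saves the original `disabled` flag and the last one to leave restores it.

```python
import contextlib
import logging
import threading
from typing import Iterator


class LoggerSuppressor:
    """Disable a logger for nested or concurrent scopes (hypothetical helper)."""

    def __init__(self, logger: logging.Logger) -> None:
        self._logger = logger
        self._lock = threading.Lock()
        self._depth = 0        # number of active suppression scopes
        self._saved = False    # original `disabled` flag, saved by the first scope

    @contextlib.contextmanager
    def suppressed(self) -> Iterator[None]:
        with self._lock:
            if self._depth == 0:
                # Only the outermost scope records the original state.
                self._saved = self._logger.disabled
                self._logger.disabled = True
            self._depth += 1
        try:
            yield
        finally:
            with self._lock:
                self._depth -= 1
                if self._depth == 0:
                    # Only the last scope to exit restores the original state.
                    self._logger.disabled = self._saved
```

Because the saved state is written once, on entry of the outermost scope, an overlapping exit from another thread cannot overwrite it with an already-disabled value.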
Hi @mhson-kyle, thanks for the work, the testing notes (real C500 ×8 hardware, [...]
1. CLI parsing is not an acceptable backend shape
The only acceptable shape for a vendor backend is direct loading of the vendor's shared library via [...]
2. The Python wrapper must be [...]
@mhson-kyle MetaX has a reference implementation based on pynvml: https://github.com/MetaX-MACA/pymxsml. You can use that.
Issue Type
Improvement/feature implementation
Runtime Environment
Operating system and version: AlmaLinux 9.7
Terminal emulator and version: screen / remote shell
Python version: 3.9.25
NVML version (driver version): N/A for MetaX; mx-smi KMD driver 2.16.0, MACA runtime 3.0.0.8
nvitop version or commit: 1.6.3.dev11+ga306d69 / a306d69
nvidia-ml-py version: 13.595.45
Locale: en_US.UTF-8
Description
This adds support for MetaX GPUs through mx-smi, allowing nvitop to run on systems where NVIDIA NVML is unavailable but MetaX devices are present.
The change introduces an mx-smi backend that parses MetaX GPU inventory, utilization, memory, temperature, power, driver/runtime versions, and process information. The existing Device API now falls back to mx-smi when NVML is unavailable, and the backend can also be forced with:
NVITOP_GPU_BACKEND=mx-smi
The TUI header was also updated to show MetaX-specific version labels, using KMD and MACA versions instead of NVIDIA driver/CUDA labels when the active backend is mx-smi.
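
The fallback and override behaviour described above could be expressed roughly as follows. This is a sketch under assumptions, not the PR's implementation: `choose_backend` and its injected `nvml_available`, `environ`, and `which` parameters are hypothetical, and the handling of unrecognised `NVITOP_GPU_BACKEND` values (treated here as "no override") is a guess.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def choose_backend(
    nvml_available: Callable[[], bool],
    environ: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return 'nvml' or 'mx-smi' (hypothetical selection helper)."""
    env = os.environ if environ is None else environ
    forced = env.get('NVITOP_GPU_BACKEND', '').strip().lower()
    if forced == 'mx-smi':
        return 'mx-smi'       # explicit override wins
    if nvml_available():
        return 'nvml'         # prefer NVML whenever it loads
    if which('mx-smi') is not None:
        return 'mx-smi'       # fall back when the MetaX CLI is on PATH
    return 'nvml'             # assumed default so NVML errors surface as before
```

Injecting the probes as callables keeps the decision testable on machines that have neither an NVIDIA driver nor `mx-smi` installed.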
Motivation and Context
nvitop currently assumes NVIDIA/NVML availability. On MetaX GPU servers, nvidia-smi/NVML is not available, while GPU information is exposed through mx-smi.
This allows users on MetaX systems to use the same nvitop interface for monitoring GPU status and GPU processes.
Testing
Tested on a MetaX C500 server with 8 GPUs and /usr/bin/mx-smi available.
Checks run:
/usr/bin/python3.9 -m py_compile nvitop/api/libmxsmi.py nvitop/api/device.py nvitop/api/__init__.py nvitop/__init__.py nvitop/tui/screens/main/panels/device.py
API smoke test verified:
Also tested:
CUDA_VISIBLE_DEVICES=1,0
to verify MetaX device filtering/order handling.
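
For reference, the ordering semantics being checked can be sketched for the integer-index case only. `visible_indices` is a hypothetical helper, not nvitop's parser; it follows the CUDA rule that an invalid index hides every entry after it, and it treats a duplicate like an invalid entry, which is a simplification. UUID-style entries are not handled.

```python
from typing import List


def visible_indices(value: str, device_count: int) -> List[int]:
    """Map a CUDA_VISIBLE_DEVICES-style string to ordered physical indices."""
    result: List[int] = []
    for token in value.split(','):
        token = token.strip()
        if not token.isdecimal():
            break                      # non-integer entry: stop here
        index = int(token)
        if index >= device_count or index in result:
            break                      # out of range or repeated: stop here
        result.append(index)
    return result
```

With 8 devices, `visible_indices('1,0', 8)` yields `[1, 0]`, i.e. physical device 1 is enumerated first.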
TUI smoke test:
nvitop --once
confirmed all 8 MetaX C500 devices render correctly.
Exporter smoke test also passed after installing nvitop-exporter.