Add mx-smi backend support for MetaX GPUs#210

Draft
mhson-kyle wants to merge 4 commits into XuehaiPan:main from mhson-kyle:main

@mhson-kyle

Issue Type
Improvement/feature implementation
Runtime Environment
Operating system and version: AlmaLinux 9.7
Terminal emulator and version: screen / remote shell
Python version: 3.9.25
NVML version (driver version): N/A for MetaX; mx-smi KMD driver 2.16.0, MACA runtime 3.0.0.8
nvitop version or commit: 1.6.3.dev11+ga306d69 / a306d69
python-ml-py version: nvidia-ml-py 13.595.45
Locale: en_US.UTF-8
Description
This adds support for MetaX GPUs through mx-smi, allowing nvitop to run on systems where NVIDIA NVML is unavailable but MetaX devices are present.

The change introduces an mx-smi backend that parses MetaX GPU inventory, utilization, memory, temperature, power, driver/runtime versions, and process information. The existing Device API now falls back to mx-smi when NVML is unavailable, and the backend can also be forced with:

NVITOP_GPU_BACKEND=mx-smi
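
The selection order described above (explicit override first, then NVML, then mx-smi as a fallback) could be sketched roughly as follows. This is an illustration only: `select_backend` and its parameters are hypothetical names, not the helpers used in this PR, and it assumes a forced backend wins even when NVML is usable.

```python
import os
import shutil
from typing import Mapping, Optional

ENV_VAR = 'NVITOP_GPU_BACKEND'  # variable name taken from the description above


def select_backend(
    nvml_available: bool,
    environ: Optional[Mapping[str, str]] = None,
    mxsmi_path: Optional[str] = None,
) -> Optional[str]:
    """Return 'nvml', 'mx-smi', or None when no backend is usable (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    if mxsmi_path is None:
        # Look the binary up on PATH; an empty string means "treat as missing".
        mxsmi_path = shutil.which('mx-smi')

    forced = environ.get(ENV_VAR, '').strip().lower()
    if forced == 'mx-smi':
        return 'mx-smi'  # assumption: an explicit override always wins
    if nvml_available:
        return 'nvml'
    if mxsmi_path:
        return 'mx-smi'  # fallback when NVML is unavailable
    return None
```

Passing `environ` and `mxsmi_path` explicitly keeps the decision testable without touching the real environment.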

The TUI header was also updated to show MetaX-specific version labels, using KMD and MACA versions instead of NVIDIA driver/CUDA labels when the active backend is mx-smi.

Motivation and Context

nvitop currently assumes NVIDIA/NVML availability. On MetaX GPU servers, nvidia-smi/NVML is not available, while GPU information is exposed through mx-smi.

This allows users on MetaX systems to use the same nvitop interface for monitoring GPU status and GPU processes.

Testing

Tested on a MetaX C500 server with 8 GPUs and /usr/bin/mx-smi available.

Checks run:

/usr/bin/python3.9 -m py_compile nvitop/api/libmxsmi.py nvitop/api/device.py nvitop/api/__init__.py nvitop/__init__.py nvitop/tui/screens/main/panels/device.py

API smoke test verified:

  • backend detection returns mx-smi
  • device count returns 8
  • driver version returns 2.16.0
  • MACA runtime version returns 3.0.0.8
  • device memory snapshot works

Also tested:

CUDA_VISIBLE_DEVICES=1,0

to verify MetaX device filtering/order handling.
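
For readers unfamiliar with what the filtering/order check covers, here is a simplified sketch of mapping CUDA_VISIBLE_DEVICES onto physical indices. It is not nvitop's actual parser: `visible_indices` is a hypothetical helper that handles integer indices only (no UUIDs) and simply stops at the first entry it cannot use.

```python
from typing import List, Optional


def visible_indices(device_count: int, cuda_visible_devices: Optional[str]) -> List[int]:
    """Map a CUDA_VISIBLE_DEVICES value to physical indices, keeping the listed order."""
    if cuda_visible_devices is None:
        return list(range(device_count))  # variable unset: all devices are visible

    result: List[int] = []
    for token in cuda_visible_devices.split(','):
        token = token.strip()
        if not token.isdecimal():
            break  # simplified rule: stop at the first non-integer entry
        index = int(token)
        if index >= device_count or index in result:
            break  # simplified rule: stop at an out-of-range or repeated index
        result.append(index)
    return result
```

With 8 devices and `CUDA_VISIBLE_DEVICES=1,0`, this yields physical devices 1 and 0 in that order, so the first visible device is physical GPU 1.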

TUI smoke test:

nvitop --once

confirmed all 8 MetaX C500 devices render correctly.

Exporter smoke test also passed after installing nvitop-exporter.

Copilot AI review requested due to automatic review settings April 29, 2026 13:32

Copilot AI left a comment

Pull request overview

Adds a new mx-smi-based GPU query backend so nvitop can run on MetaX GPU systems where NVIDIA NVML is unavailable, and updates the TUI header to display MetaX-appropriate version labels.

Changes:

  • Introduce nvitop/api/libmxsmi.py to query/parse MetaX GPU + process info via mx-smi with caching.
  • Update nvitop/api/device.py to select/fallback between NVML and mx-smi and expose the active backend.
  • Update the TUI device panel header to show KMD Version / MACA Version when the mx-smi backend is active.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Summary per file:

  • nvitop/tui/screens/main/panels/device.py: Updates displayed version labels based on active backend (nvml vs mx-smi).
  • nvitop/api/libmxsmi.py: New backend implementation that runs mx-smi and parses device/process snapshots with caching.
  • nvitop/api/device.py: Backend detection, fallback logic, and MetaX-backed implementations for key device properties/process enumeration.
  • nvitop/api/__init__.py: Exposes libmxsmi from the API package.
  • nvitop/__init__.py: Exposes libmxsmi at the top-level nvitop namespace.

Comment thread nvitop/api/device.py

@classmethod
def is_available(cls) -> bool:
    """Test whether there are any devices and the NVML library is successfully loaded."""

Copilot AI Apr 29, 2026

The is_available() docstring is now misleading: it says availability requires NVML to be loaded, but the implementation can return availability based on the mx-smi backend as well (and catches libmxsmi.MxSmiError). Please update the docstring to describe backend-agnostic availability, or rename/split the check if it is intended to remain NVML-specific.

Suggested change
- """Test whether there are any devices and the NVML library is successfully loaded."""
+ """Test whether any GPU devices are available via the active backend."""

Comment thread nvitop/api/device.py
Comment on lines 413 to +417
  driver without reloading the kernel module.
  """
- return libnvml.nvmlQuery('nvmlSystemGetDriverVersion')
+ if _should_use_mxsmi_backend():
+     return libmxsmi.driver_version()
+ try:

Copilot AI Apr 29, 2026

Device.driver_version() can now return the MetaX KMD version when the active backend is mx-smi, but the docstring and command-line equivalent are NVIDIA/NVML-specific. Please update the documentation to clarify that the returned value (and its meaning) is backend-dependent.

Comment thread nvitop/api/device.py
Comment on lines 443 to +447
  If RM detects a driver/library version mismatch, usually after an upgrade for NVIDIA
  driver without reloading the kernel module.
  """
- cuda_driver_version = libnvml.nvmlQuery('nvmlSystemGetCudaDriverVersion')
+ if _should_use_mxsmi_backend():
+     return libmxsmi.maca_version()

Copilot AI Apr 29, 2026

Device.cuda_driver_version() returns the MetaX MACA runtime version when using the mx-smi backend, but the docstring currently describes the NVIDIA CUDA driver version semantics. Please update the docstring to reflect backend-dependent meaning and naming (e.g., MACA vs CUDA driver).

Comment thread nvitop/api/libmxsmi.py
Comment on lines +290 to +294
+ memory_utilization=device.memory_utilization,
+ temperature=device.temperature,
+ power_usage=device.power_usage,
+ power_limit=device.power_limit,
+ fan_speed=device.fan_speed,

Copilot AI Apr 29, 2026

In the mx-smi snapshot merge, memory_utilization and fan_speed are copied from the per-sample DeviceInfo, but _parse_summary_output() never populates these fields. As a result, Device.utilization_rates().memory and Device.fan_speed() will always be NA for MetaX devices. Either extend the parser to extract these metrics from mx-smi output (if available) or drop these fields to avoid advertising unsupported data.

Suggested change
- memory_utilization=device.memory_utilization,
- temperature=device.temperature,
- power_usage=device.power_usage,
- power_limit=device.power_limit,
- fan_speed=device.fan_speed,
+ temperature=device.temperature,
+ power_usage=device.power_usage,
+ power_limit=device.power_limit,

Comment thread nvitop/api/device.py
Comment on lines +266 to +270
+ suppress_logs = libmxsmi.is_forced() or shutil.which('mx-smi') is not None
+ logger_disabled = libnvml.LOGGER.disabled
+ if suppress_logs:
+     libnvml.LOGGER.disabled = True
+ try:

Copilot AI Apr 29, 2026

_nvml_probe() temporarily toggles libnvml.LOGGER.disabled without any synchronization. If multiple threads enter this context concurrently, the last exit can restore the logger to the wrong prior state (e.g., leave it permanently disabled). Consider protecting this with a lock and/or a reference counter so nested/concurrent probes restore the original state correctly.

@XuehaiPan

Owner

Hi @mhson-kyle, thanks for the work — the testing notes (real C500 ×8 hardware, CUDA_VISIBLE_DEVICES filtering, exporter smoke test) are very welcome. A few concerns before this can move forward.

1. CLI parsing is not an acceptable backend shape

libmxsmi.py parses the text output of /usr/bin/mx-smi via subprocess. We will not accept backends that parse the text output of vendor CLI tools — that approach is:

  • Buggy — CLI output formats are unstable across vendor releases, drivers, and locales, and silent format changes become silent monitor breakage.
  • Slow — spawning a subprocess and parsing text on every refresh defeats the point of a real-time monitor.
  • Unsafe — many users run nvitop under sudo to see all GPU processes; spawning arbitrary vendor binaries from that context is a privilege/attack-surface concern.

The only acceptable shape for a vendor backend is direct loading of the vendor's shared library via ctypes.CDLL, the same way pynvml wraps libnvidia-ml.so. MetaX's SDK ships libmxml.so (their NVML equivalent), so the right path here is to bind that directly. (Same policy I just laid out in #211 for AMD.)
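
As a rough illustration of the loading step only, a ctypes-based wrapper typically starts like the sketch below. `load_shared_library` is a hypothetical helper, and no library functions are called because the symbol names and signatures of libmxml.so are not described in this thread. Per point 2 below, such code would live in a separate wrapper package rather than in nvitop itself.

```python
import ctypes
from typing import Optional, Sequence


def load_shared_library(candidates: Sequence[str]) -> Optional[ctypes.CDLL]:
    """Return the first candidate library that loads, or None if none can be loaded."""
    for name in candidates:
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue  # not installed or not loadable; try the next candidate
    return None


# 'libmxml.so' is the library name mentioned above. Everything past loading
# (function names, argument types, return codes) is deliberately left out.
lib = load_shared_library(['libmxml.so'])
backend_available = lib is not None
```

On a machine without the MetaX SDK, `lib` is simply `None` and the backend is reported as unavailable, with no subprocess involved.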

2. The Python wrapper must be pip-installable from PyPI

Even with a CDLL binding, nvitop needs to be able to declare the MetaX wrapper as an optional dependency in pyproject.toml — the same way nvidia-ml-py is declared today for the NVIDIA backend. I checked PyPI and couldn't find a MetaX wrapper under the obvious names (mx-smi, mxsmi, metax-smi, metax-ml-py, mxml, libmxml). If MetaX hasn't published one yet, the upstream ask would be for them to do so — ideally an officially maintained wrapper from MetaX, with a documented support matrix between the wrapper version and the MetaX driver / MACA runtime versions it targets.

We won't take a hard-coded in-tree ctypes.CDLL("libmxml.so") binding either: it would couple nvitop to one specific MetaX ABI version with no upgrade path. The wrapper has to live in its own PyPI-distributed package owned by MetaX (or an officially blessed maintainer).
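
On the consuming side, an optional dependency is usually guarded with an import check along these lines. This is a generic sketch: `import_optional` is a hypothetical helper, and `'metax_wrapper_placeholder'` is a made-up module name, since no published MetaX wrapper name is confirmed here.

```python
import importlib
from types import ModuleType
from typing import Optional


def import_optional(module_name: str) -> Optional[ModuleType]:
    """Import a module if it is installed; return None instead of raising otherwise."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# Placeholder name for illustration only; not a real package.
wrapper = import_optional('metax_wrapper_placeholder')
metax_backend_enabled = wrapper is not None
```

The backend then stays disabled unless the user has installed the corresponding extra.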

3. Modifications to nvitop/api/device.py are gated on the #211 outcome

The +228 / -21 delta in device.py adds backend dispatch into the most central file in the repo. The same architectural question applies here as in #211 — i.e. whether non-NVIDIA backends should hook into Device.* directly or live behind a separate API surface (or, ideally, a proper nvitop.api.backends abstraction). I'd rather settle that once across all the in-flight non-NVIDIA proposals (#198, #210, #211) than answer it three different ways. Once a CDLL-based MetaX binding exists, we can re-open this discussion against whatever abstraction we land on.


To summarize: this is upstream-blocked on an officially maintained MetaX Python wrapper published to PyPI and design-blocked on the backend-abstraction question in #211. I'm not closing this — it's useful as a concrete reference for what a MetaX backend would need to cover — but it shouldn't merge in its current shape. Happy to revisit once those pieces are in place.

@XuehaiPan XuehaiPan marked this pull request as draft May 6, 2026 08:24
@wnark

wnark commented May 13, 2026

@mhson-kyle MetaX has a reference implementation based on pynvml at https://github.com/MetaX-MACA/pymxsml that you could use.
