Add mx-smi backend support for MetaX GPUs by mhson-kyle · Pull Request #210 · XuehaiPan/nvitop

mhson-kyle · 2026-04-29T13:32:03Z

Issue Type
Improvement/feature implementation
Runtime Environment
Operating system and version: AlmaLinux 9.7
Terminal emulator and version: screen / remote shell
Python version: 3.9.25
NVML version (driver version): N/A for MetaX; mx-smi KMD driver 2.16.0, MACA runtime 3.0.0.8
nvitop version or commit: 1.6.3.dev11+ga306d69 / a306d69
nvidia-ml-py version: 13.595.45
Locale: en_US.UTF-8
Description
This adds support for MetaX GPUs through mx-smi, allowing nvitop to run on systems where NVIDIA NVML is unavailable but MetaX devices are present.

The change introduces an mx-smi backend that parses MetaX GPU inventory, utilization, memory, temperature, power, driver/runtime versions, and process information. The existing Device API now falls back to mx-smi when NVML is unavailable, and the backend can also be forced with:

NVITOP_GPU_BACKEND=mx-smi
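
A minimal sketch of the selection order described above, assuming a forced value in NVITOP_GPU_BACKEND wins, NVML is preferred when it loads, and mx-smi is the fallback when the binary is on PATH. The function name `select_backend` and its parameters are hypothetical and are not taken from the PR's code.

```python
import os
import shutil


def select_backend(nvml_available, env=None, which=shutil.which):
    """Return the backend name to use: 'nvml' or 'mx-smi' (hypothetical helper)."""
    env = os.environ if env is None else env
    # 1. An explicit override always wins.
    forced = env.get('NVITOP_GPU_BACKEND', '').strip().lower()
    if forced:
        return forced
    # 2. Prefer NVML whenever it loaded successfully.
    if nvml_available:
        return 'nvml'
    # 3. Otherwise fall back to mx-smi if the binary can be found.
    if which('mx-smi') is not None:
        return 'mx-smi'
    # 4. Nothing usable: keep the default so NVML errors surface as before.
    return 'nvml'
```

Passing `env` and `which` as parameters keeps the decision testable without touching the real environment.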

The TUI header was also updated to show MetaX-specific version labels, using KMD and MACA versions instead of NVIDIA driver/CUDA labels when the active backend is mx-smi.
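
The label switch can be pictured as a small lookup keyed on the active backend. This is only an illustration: `version_labels` is a hypothetical name, and the NVIDIA label strings are assumed rather than copied from the panel code.

```python
def version_labels(backend):
    """Return (driver label, runtime label) for the header (hypothetical helper)."""
    if backend == 'mx-smi':
        # MetaX: kernel-mode driver (KMD) and MACA runtime versions.
        return ('KMD Version', 'MACA Version')
    # Assumed NVIDIA labels for the default NVML backend.
    return ('Driver Version', 'CUDA Driver Version')
```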

Motivation and Context

nvitop currently assumes NVIDIA/NVML availability. On MetaX GPU servers, nvidia-smi/NVML is not available, while GPU information is exposed through mx-smi.

This allows users on MetaX systems to use the same nvitop interface for monitoring GPU status and GPU processes.

Testing

Tested on a MetaX C500 server with 8 GPUs and /usr/bin/mx-smi available.

Checks run:

/usr/bin/python3.9 -m py_compile nvitop/api/libmxsmi.py nvitop/api/device.py nvitop/api/__init__.py nvitop/__init__.py nvitop/tui/screens/main/panels/device.py

API smoke test verified:

backend detection returns mx-smi
device count returns 8
driver version returns 2.16.0
MACA runtime version returns 3.0.0.8
device memory snapshot works

Also tested:

CUDA_VISIBLE_DEVICES=1,0

to verify MetaX device filtering/order handling.
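
For reference, a simplified sketch of the ordering semantics this check exercises. It handles integer indices only (no UUIDs) and stops at the first invalid entry; `visible_indices` is a hypothetical helper, not the PR's implementation.

```python
def visible_indices(spec, device_count):
    """Map a CUDA_VISIBLE_DEVICES-style string to an ordered list of physical indices."""
    if spec is None:
        # Variable unset: every device is visible in physical order.
        return list(range(device_count))
    result = []
    for token in spec.split(','):
        token = token.strip()
        if not token.isdigit():
            break  # simplified: stop at the first non-integer entry
        index = int(token)
        if index >= device_count:
            break  # out-of-range entry hides itself and everything after it
        result.append(index)
    return result
```

With 8 devices, `visible_indices('1,0', 8)` gives `[1, 0]`, so logical device 0 maps to physical device 1.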

TUI smoke test:

nvitop --once

confirmed all 8 MetaX C500 devices render correctly.

Exporter smoke test also passed after installing nvitop-exporter.


Copilot

Pull request overview

Adds a new mx-smi-based GPU query backend so nvitop can run on MetaX GPU systems where NVIDIA NVML is unavailable, and updates the TUI header to display MetaX-appropriate version labels.

Changes:

Introduce nvitop/api/libmxsmi.py to query/parse MetaX GPU + process info via mx-smi with caching.
Update nvitop/api/device.py to select/fallback between NVML and mx-smi and expose the active backend.
Update the TUI device panel header to show KMD Version / MACA Version when the mx-smi backend is active.
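
As an illustration of the caching mentioned in the first item, here is a minimal time-to-live cache sketch. The class name `TTLCache` and its interface are assumptions for this example; the PR's actual cache in libmxsmi.py may be structured differently.

```python
import time


class TTLCache:
    """Cache one loader result and refresh it once it is older than `ttl` seconds."""

    def __init__(self, ttl, loader, clock=time.monotonic):
        self._ttl = ttl
        self._loader = loader  # e.g. a function that runs the CLI and parses its output
        self._clock = clock    # injectable so tests do not need to sleep
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # First call or stale entry: reload and restart the TTL window.
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value
```

A slowly changing query such as the device inventory could use a long TTL while utilization samples use a short one.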

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.


File	Description
nvitop/tui/screens/main/panels/device.py	Updates displayed version labels based on active backend (`nvml` vs `mx-smi`).
nvitop/api/libmxsmi.py	New backend implementation that runs `mx-smi` and parses device/process snapshots with caching.
nvitop/api/device.py	Backend detection, fallback logic, and MetaX-backed implementations for key device properties/process enumeration.
nvitop/api/__init__.py	Exposes `libmxsmi` from the API package.
nvitop/__init__.py	Exposes `libmxsmi` at the top-level `nvitop` namespace.


Copilot · 2026-04-29T13:38:53Z


    @classmethod
    def is_available(cls) -> bool:
        """Test whether there are any devices and the NVML library is successfully loaded."""


The is_available() docstring is now misleading: it says availability requires NVML to be loaded, but the implementation can return availability based on the mx-smi backend as well (and catches libmxsmi.MxSmiError). Please update the docstring to describe backend-agnostic availability, or rename/split the check if it is intended to remain NVML-specific.

Suggested change

-        """Test whether there are any devices and the NVML library is successfully loaded."""
+        """Test whether any GPU devices are available via the active backend."""

Copilot · 2026-04-29T13:38:53Z

                driver without reloading the kernel module.
        """
-        return libnvml.nvmlQuery('nvmlSystemGetDriverVersion')
+        if _should_use_mxsmi_backend():
+            return libmxsmi.driver_version()
+        try:


Device.driver_version() can now return the MetaX KMD version when the active backend is mx-smi, but the docstring and command-line equivalent are NVIDIA/NVML-specific. Please update the documentation to clarify that the returned value (and its meaning) is backend-dependent.

Copilot · 2026-04-29T13:38:54Z

                If RM detects a driver/library version mismatch, usually after an upgrade for NVIDIA
                driver without reloading the kernel module.
        """
-        cuda_driver_version = libnvml.nvmlQuery('nvmlSystemGetCudaDriverVersion')
+        if _should_use_mxsmi_backend():
+            return libmxsmi.maca_version()


Device.cuda_driver_version() returns the MetaX MACA runtime version when using the mx-smi backend, but the docstring currently describes the NVIDIA CUDA driver version semantics. Please update the docstring to reflect backend-dependent meaning and naming (e.g., MACA vs CUDA driver).

Copilot · 2026-04-29T13:38:54Z

+            memory_utilization=device.memory_utilization,
+            temperature=device.temperature,
+            power_usage=device.power_usage,
+            power_limit=device.power_limit,
+            fan_speed=device.fan_speed,


In the mx-smi snapshot merge, memory_utilization and fan_speed are copied from the per-sample DeviceInfo, but _parse_summary_output() never populates these fields. As a result, Device.utilization_rates().memory and Device.fan_speed() will always be NA for MetaX devices. Either extend the parser to extract these metrics from mx-smi output (if available) or drop these fields to avoid advertising unsupported data.

Suggested change

-            memory_utilization=device.memory_utilization,
-            temperature=device.temperature,
-            power_usage=device.power_usage,
-            power_limit=device.power_limit,
-            fan_speed=device.fan_speed,
+            temperature=device.temperature,
+            power_usage=device.power_usage,
+            power_limit=device.power_limit,
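
One way to catch this class of problem is a check that flags fields no parsed sample ever fills. The sketch below uses a stand-in `DeviceInfo` dataclass and a plain `'N/A'` string as the not-available marker; both are assumptions for illustration and do not mirror the PR's real types.

```python
from dataclasses import dataclass, fields

NA = 'N/A'  # stand-in for the library's not-available sentinel


@dataclass
class DeviceInfo:
    # Hypothetical per-sample record; every metric defaults to "not available".
    temperature: object = NA
    power_usage: object = NA
    power_limit: object = NA
    memory_utilization: object = NA
    fan_speed: object = NA


def unpopulated_fields(samples):
    """Return the names of fields that are N/A in every sample."""
    return [
        field.name
        for field in fields(DeviceInfo)
        if all(getattr(sample, field.name) == NA for sample in samples)
    ]
```

Running such a check against parser output from real hardware would show which fields are safe to expose.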

Copilot · 2026-04-29T13:38:54Z

+    suppress_logs = libmxsmi.is_forced() or shutil.which('mx-smi') is not None
+    logger_disabled = libnvml.LOGGER.disabled
+    if suppress_logs:
+        libnvml.LOGGER.disabled = True
+    try:


_nvml_probe() temporarily toggles libnvml.LOGGER.disabled without any synchronization. If multiple threads enter this context concurrently, the last exit can restore the logger to the wrong prior state (e.g., leave it permanently disabled). Consider protecting this with a lock and/or a reference counter so nested/concurrent probes restore the original state correctly.
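
A minimal sketch of the lock plus reference counter idea, assuming a single module-level logger is being silenced. The name `suppress_logger` is hypothetical; only the outermost entry records the prior state and only the outermost exit restores it.

```python
import contextlib
import logging
import threading

_lock = threading.Lock()
_depth = 0               # number of active suppress_logger() contexts
_saved_disabled = False  # logger state seen by the outermost context


@contextlib.contextmanager
def suppress_logger(logger):
    """Disable `logger` for the duration of the block, safely under nesting or threads."""
    global _depth, _saved_disabled
    with _lock:
        if _depth == 0:
            _saved_disabled = logger.disabled
            logger.disabled = True
        _depth += 1
    try:
        yield
    finally:
        with _lock:
            _depth -= 1
            if _depth == 0:
                # Only the last context out restores the original state.
                logger.disabled = _saved_disabled
```

Because the saved state is captured once, an inner or concurrent exit cannot overwrite it with `True`.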

XuehaiPan · 2026-05-06T08:23:45Z

Hi @mhson-kyle, thanks for the work — the testing notes (real C500 ×8 hardware, CUDA_VISIBLE_DEVICES filtering, exporter smoke test) are very welcome. A few concerns before this can move forward.

1. CLI parsing is not an acceptable backend shape

libmxsmi.py parses the text output of /usr/bin/mx-smi via subprocess. We will not accept backends that parse the text output of vendor CLI tools — that approach is:

Buggy — CLI output formats are unstable across vendor releases, drivers, and locales, and silent format changes become silent monitor breakage.
Slow — spawning a subprocess and parsing text on every refresh defeats the point of a real-time monitor.
Unsafe — many users run nvitop under sudo to see all GPU processes; spawning arbitrary vendor binaries from that context is a privilege/attack-surface concern.

The only acceptable shape for a vendor backend is direct loading of the vendor's shared library via ctypes.CDLL, the same way pynvml wraps libnvidia-ml.so. MetaX's SDK ships libmxml.so (their NVML equivalent), so the right path here is to bind that directly. (Same policy I just laid out in #211 for AMD.)
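
To make the contrast with subprocess parsing concrete, this is roughly what the first step of a ctypes-based binding looks like: try to load the shared library and degrade gracefully if it is absent. The helper name is hypothetical, and no libmxml.so symbols are called here because their names and signatures are not given in this thread.

```python
import ctypes


def load_vendor_library(candidates):
    """Return a ctypes.CDLL handle for the first loadable name, or None."""
    for name in candidates:
        try:
            return ctypes.CDLL(name)
        except OSError:
            # Library not installed or not loadable on this system; try the next name.
            continue
    return None


# Example: probe for the MetaX library named in the comment above.
handle = load_vendor_library(['libmxml.so'])
backend_usable = handle is not None
```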

2. The Python wrapper must be `pip`-installable from PyPI

Even with a CDLL binding, nvitop needs to be able to declare the MetaX wrapper as an optional dependency in pyproject.toml — the same way nvidia-ml-py is declared today for the NVIDIA backend. I checked PyPI and couldn't find a MetaX wrapper under the obvious names (mx-smi, mxsmi, metax-smi, metax-ml-py, mxml, libmxml). If MetaX hasn't published one yet, the upstream ask would be for them to do so — ideally an officially maintained wrapper from MetaX, with a documented support matrix between the wrapper version and the MetaX driver / MACA runtime versions it targets.

We won't take a hard-coded in-tree ctypes.CDLL("libmxml.so") binding either: it would couple nvitop to one specific MetaX ABI version with no upgrade path. The wrapper has to live in its own PyPI-distributed package owned by MetaX (or an officially blessed maintainer).
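
On the consuming side, an optional dependency is typically probed at import time rather than required. A small sketch, with `optional_module_available` as a hypothetical helper; the module name a future MetaX wrapper would use is unknown, so none is assumed here.

```python
import importlib.util


def optional_module_available(module_name):
    """Return True if `module_name` can be imported, without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised for a missing parent package or a module with a broken spec.
        return False
```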

3. Modifications to `nvitop/api/device.py` are gated on the #211 outcome

The +228 / -21 delta in device.py adds backend dispatch into the most central file in the repo. The same architectural question applies here as in #211 — i.e. whether non-NVIDIA backends should hook into Device.* directly or live behind a separate API surface (or, ideally, a proper nvitop.api.backends abstraction). I'd rather settle that once across all the in-flight non-NVIDIA proposals (#198, #210, #211) than answer it three different ways. Once a CDLL-based MetaX binding exists, we can re-open this discussion against whatever abstraction we land on.
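
For discussion purposes only, one possible shape for such an abstraction is a small interface that each vendor backend implements, plus a selector. Every name below (`Backend`, `StaticBackend`, `first_available`) is hypothetical; nothing here reflects a decision made in #211.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical minimal interface a vendor backend would implement."""

    name = 'abstract'

    @abstractmethod
    def device_count(self):
        """Number of devices visible to this backend."""

    @abstractmethod
    def driver_version(self):
        """Vendor driver version string."""

    def is_available(self):
        return self.device_count() > 0


class StaticBackend(Backend):
    """Test double with fixed answers, standing in for a real vendor binding."""

    def __init__(self, name, count, version):
        self.name = name
        self._count = count
        self._version = version

    def device_count(self):
        return self._count

    def driver_version(self):
        return self._version


def first_available(backends):
    """Return the first backend reporting at least one device, else None."""
    for backend in backends:
        if backend.is_available():
            return backend
    return None
```

With this shape, Device.* would delegate to whichever backend the selector returns instead of branching on the vendor inline.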

To summarize: this is upstream-blocked on an officially maintained MetaX Python wrapper published to PyPI and design-blocked on the backend-abstraction question in #211. I'm not closing this — it's useful as a concrete reference for what a MetaX backend would need to cover — but it shouldn't merge in its current shape. Happy to revisit once those pieces are in place.

wnark · 2026-05-13T03:41:44Z

@mhson-kyle metax has a reference implementation based on pynvml: https://github.com/MetaX-MACA/pymxsml. You can use that.

mhson-kyle and others added 4 commits April 29, 2026 20:13

Add mx-smi MetaX GPU backend

a306d69

libmxsmi: cache mx-smi -L inventory separately with 60s TTL

dd9aeb7

Agent-Logs-Url: https://github.com/mhson-kyle/nvitop/sessions/9e63a25a-5033-4588-bfdd-3fb0d64c9d9f Co-authored-by: mhson-kyle <72399227+mhson-kyle@users.noreply.github.com>

device: replace is_available() in _nvml_probe() with shutil.which check

7336642

Agent-Logs-Url: https://github.com/mhson-kyle/nvitop/sessions/e5fd1e19-5d52-4ab0-ac60-5b545ffb9632 Co-authored-by: mhson-kyle <72399227+mhson-kyle@users.noreply.github.com>

Merge pull request #1 from mhson-kyle/metax-mx-smi-support

ee8b997

Add mx-smi backend support for MetaX GPUs

Copilot AI review requested due to automatic review settings April 29, 2026 13:32

Copilot started reviewing on behalf of mhson-kyle April 29, 2026 13:32

Copilot AI reviewed Apr 29, 2026

XuehaiPan mentioned this pull request May 6, 2026

[RFC] AMD GPU Support #211

Open

XuehaiPan marked this pull request as draft May 6, 2026 08:24

XuehaiPan mentioned this pull request May 6, 2026

support mthreads gpu monitoring #198

Draft

mhson-kyle wants to merge 4 commits into XuehaiPan:main from mhson-kyle:main