Add SM and Tensor-core activity plot series (NVML GPM) #483

Open: rocker-zhang wants to merge 1 commit into Syllo:master from rocker-zhang:feat-gpm-utilization

@rocker-zhang

Adds opt-in chart series for SM activity and Tensor-core activity (plus SM occupancy and DRAM bandwidth %), addressing the request in #163.

The data comes from the NVML GPM API (nvmlGpmSampleGet / nvmlGpmMetricsGet), which lives inside libnvidia-ml and is dlsym'd like the rest of the NVIDIA backend — so there is no new dependency (no DCGM daemon, no CUDA). Two samples are differenced each refresh; NVML returns 0–100 directly, so the new series reuse the existing percentage plot path (as the power% / clock% series already do).
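
The two-sample pattern described above can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `gpm_sample_fn`, `gpm_metric_fn`, `struct gpm_state`, `clamp_percent` and `gpm_refresh_percent` are hypothetical names, and the callbacks stand in for nvmlGpmSampleGet / nvmlGpmMetricsGet so that the sketch compiles without libnvidia-ml. It also assumes the previous refresh's sample is kept and differenced against the new one, which the description does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the NVML GPM calls; a real backend would
 * wrap nvmlGpmSampleGet / nvmlGpmMetricsGet behind these signatures. */
typedef bool (*gpm_sample_fn)(void *ctx, unsigned long long *sample_out);
typedef double (*gpm_metric_fn)(unsigned long long older,
                                unsigned long long newer);

struct gpm_state {
  unsigned long long prev; /* sample stored on the previous refresh */
  bool has_prev;           /* false until one sample has been stored */
};

/* Clamp a 0..100 reading to an integer percentage for the plot path. */
static unsigned clamp_percent(double value) {
  if (!(value > 0.0)) /* also catches NaN */
    return 0;
  if (value > 100.0)
    return 100;
  return (unsigned)(value + 0.5);
}

/* Take one sample and difference it against the stored one (assumption:
 * the older sample comes from the previous refresh). Returns true and
 * writes *percent_out only once two samples are available. */
static bool gpm_refresh_percent(struct gpm_state *st, gpm_sample_fn sample,
                                gpm_metric_fn metric, void *ctx,
                                unsigned *percent_out) {
  assert(st != NULL && sample != NULL && metric != NULL && percent_out != NULL);
  unsigned long long cur;
  if (!sample(ctx, &cur)) {
    st->has_prev = false; /* never difference across a failed sample */
    return false;
  }
  bool ready = st->has_prev;
  if (ready)
    *percent_out = clamp_percent(metric(st->prev, cur));
  st->prev = cur;
  st->has_prev = true;
  return ready;
}

/* Demo callbacks: a fake counter where 1000 ticks per refresh == 100 %. */
static bool demo_sample(void *ctx, unsigned long long *sample_out) {
  unsigned long long *counter = ctx;
  *sample_out = *counter;
  return true;
}

static double demo_metric(unsigned long long older, unsigned long long newer) {
  return (double)(newer - older) / 10.0;
}
```

With the demo callbacks, the first refresh yields no value (only one sample exists), and later refreshes yield a clamped percentage that can be handed to a percentage plot path unchanged.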

  • Opt-in from the F2 chart menu; the backend samples GPM only while a series is enabled, so an idle nvtop never arms the shared perfmon counters.
  • GPM support is detected from the actual nvmlGpmSampleGet return code; on GPUs that don't support it the series are simply absent (no error, no clutter).
  • SM and Tensor readings matched dcgmi (DCGM_FI_PROF_SM_ACTIVE / DCGM_FI_PROF_PIPE_TENSOR_ACTIVE) to 3 decimal places under a cuBLAS GEMM on a Blackwell board.
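
The gating and detection behaviour in the bullets above could look something like the sketch below. It is only an illustration under stated assumptions: every name in it (`enum gpm_support`, `struct gpm_gate`, the `GPM_SERIES_*` bits, and the helper functions) is hypothetical, and the return code passed to `gpm_record_probe` merely stands in for the result of an nvmlGpmSampleGet call, with 0 meaning success as NVML_SUCCESS does. When the probe call itself is made is left to the caller, since the description does not say.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit flags for the four opt-in series named in the PR. */
enum gpm_series {
  GPM_SERIES_SM_ACTIVE = 1u << 0,
  GPM_SERIES_TENSOR_ACTIVE = 1u << 1,
  GPM_SERIES_SM_OCCUPANCY = 1u << 2,
  GPM_SERIES_DRAM_BW = 1u << 3
};

enum gpm_support { GPM_UNKNOWN, GPM_SUPPORTED, GPM_UNSUPPORTED };

struct gpm_gate {
  enum gpm_support support; /* decided from a real sample return code */
  unsigned enabled_series;  /* bitmask of series currently shown */
};

/* Record the outcome of a sample attempt; rc == 0 is treated as success
 * (assumption mirroring NVML_SUCCESS == 0). */
static void gpm_record_probe(struct gpm_gate *g, int rc) {
  assert(g != NULL);
  g->support = (rc == 0) ? GPM_SUPPORTED : GPM_UNSUPPORTED;
}

/* Series appear in the chart menu only when the GPU supports GPM. */
static bool gpm_series_offered(const struct gpm_gate *g) {
  return g->support == GPM_SUPPORTED;
}

/* Toggle one series on or off, as the chart menu would. */
static void gpm_set_series(struct gpm_gate *g, unsigned bit, bool on) {
  assert(g != NULL);
  if (on)
    g->enabled_series |= bit;
  else
    g->enabled_series &= ~bit;
}

/* Sample only while at least one series is enabled on a supported GPU,
 * so an idle monitor leaves the shared counters alone. */
static bool gpm_should_sample(const struct gpm_gate *g) {
  return g->support == GPM_SUPPORTED && g->enabled_series != 0;
}
```

Keeping the "supported" and "enabled" conditions in a single predicate means the refresh loop has one place to check before touching the hardware counters.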

Note: GPM reads the same hardware perfmon counters DCGM uses, so on a node already running a DCGM exporter the two can perturb each other's readings — which is why sampling is gated to "only while the series is displayed".

Adds opt-in chart series for SM activity and Tensor-core activity (plus SM
occupancy and DRAM bandwidth %), addressing the request in issue Syllo#163.

The data comes from the NVML GPM API (nvmlGpmSampleGet / nvmlGpmMetricsGet),
which lives inside libnvidia-ml and is dlsym'd like the rest of the NVIDIA
backend, so there is no new dependency (no DCGM daemon, no CUDA). Two samples
are differenced each refresh; NVML returns 0..100 directly, so the new series
reuse the existing percentage plot path (as power% / clock% already do).

- Opt-in from the F2 chart menu; the backend samples GPM only while a series is
  enabled, so an idle nvtop never arms the shared perfmon counters.
- GPM support is detected from the actual nvmlGpmSampleGet return; on GPUs that
  do not support it the series are simply absent.