hw: systemd exporter service + driver-probe-error dmesg check #57
Merged
ADI Binding Audit Report
Undocumented bindings
Unit Test Results: 3 files, 3 suites, 8s. Results for commit 2dbfe69.
Kuiper64 Test Results: 625 tests, 601 ✅, 2s. Results for commit 2dbfe69.
Hardware Test Results: 6 files, 6 suites, 9m 59s. Results for commit 2dbfe69.
tfcollins added a commit that referenced this pull request Apr 21, 2026
… gap)

``adi.ad9081(uri=...)`` (and ``adi.ad9371``, ``adi.adrv9009``, etc.) hardcodes the IIO device names it expects; for AD9081 that's ``axi-ad9081-rx-hpc`` + ``axi-ad9081-tx-hpc``. Several of the designs in this suite expose the buffered frontend as the generic TPL core (``ad_ip_jesd204_tpl_adc`` / ``ad_ip_jesd204_tpl_dac``) instead, so instantiating a pyadi-iio wrapper fails with ``AttributeError: 'NoneType' object has no attribute 'channels'`` and the capture never runs. Hit by every leg of the last hw run on PR #57 (mini2/bq, direct + coord).

Swap the helper to raw libiio:
- Signature now takes ``(ctx, device_candidates, …)`` instead of a pyadi-iio instance. ``device_candidates`` is a str or tuple of candidate IIO device names; the first one present on the context wins. No pyadi-iio class lookup, no device-name assumptions beyond what each test already encodes in its ``found``-set assertion.
- Enables every scan-capable, non-output channel; refills the buffer once; reads each channel as ``int16`` (correct for every AXI ADC frontend in this suite); then runs the same non-zero + non-latched checks as before.

Per-test call-sites now pass the same candidate tuple each test already uses in its IIO-device-present assertion:
- AD9081 XSA / system: ``("axi-ad9081-rx-hpc", "ad_ip_jesd204_tpl_adc")``
- ADRV9009: ``("axi-adrv9009-rx-hpc", "axi-adrv9009-rx-obs-hpc")``
- AD9371 / ZC706: ``("axi-ad9371-rx-hpc", "axi-ad9371-rx-obs-hpc")``
- FMCDAQ3 / VCU118: ``("axi-ad9680-core-lpc", "axi-ad9680-hpc", "axi-ad9680-rx-hpc")``
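The non-zero + non-latched checks described above can be sketched in plain Python. This is a minimal illustration, not the helper from ``test/hw/hw_helpers.py``: the names ``decode_int16`` and ``check_capture`` are hypothetical, little-endian sample order is an assumption, and the raw bytes would come from the libiio channel read in the real test.

```python
import statistics
import struct


def decode_int16(raw: bytes) -> list:
    """Interpret raw channel bytes as signed 16-bit samples.

    Little-endian byte order is an assumption of this sketch.
    """
    count = len(raw) // 2
    return list(struct.unpack(f"<{count}h", raw[: count * 2]))


def check_capture(channels: dict) -> None:
    """Raise AssertionError unless at least one channel carries live data.

    ``channels`` maps a channel id to its decoded samples.
    """
    assert channels, "no channels captured"

    # Non-zero check: at least one sample on at least one channel is != 0.
    any_nonzero = any(
        any(sample != 0 for sample in samples) for samples in channels.values()
    )
    assert any_nonzero, "every channel read back all zeros"

    # Non-latched check: at least one channel moves by >= 1 LSB (std dev).
    any_moving = any(
        len(samples) > 1 and statistics.pstdev(samples) >= 1.0
        for samples in channels.values()
    )
    assert any_moving, "every channel is latched at a constant value"
```

A channel stuck at a constant non-zero code passes the first check but fails the second, which is the case a stalled data path tends to produce.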
Replaces the ``nohup labgrid-exporter … & disown`` pattern used on each hw-node runner (bq, mini2, nuc) with a proper systemd template unit. ``labgrid-exporter@<place>`` auto-restarts on failure, survives reboots, and makes ``systemctl restart`` / ``journalctl -u`` the one-line ops commands after a YAML edit.

Files:
- ``scripts/labgrid-exporter/labgrid-exporter@.service``: template unit. ``%i`` is the place name; per-instance env (``LG_EXPORTER_BIN``, ``LG_COORDINATOR``, ``LG_EXPORTER_NAME``, ``LG_EXPORTER_YAML``, ``PATH``) lives in ``/etc/default/labgrid-exporter-<place>``. ``User=`` is baked in at install time via a sed placeholder.
- ``scripts/labgrid-exporter/install.sh``: root installer. Takes ``<instance-name> <yaml-path>`` + ``--coordinator``, ``--user``, ``--bin``, ``--ser2net-path``, ``--no-start``. Writes both files, reloads systemd, enables + starts. Idempotent.
- ``doc/source/developer/hardware_ci.rst``: new "Exporter systemd service" section with install recipe + all flags + day-to-day commands, plus a cross-link from step 1 of the "Adding a new hardware node" walkthrough.
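To make the per-instance env file concrete, here is a small Python sketch of what its contents could look like. It is an illustration only: the installer in this PR is a shell script, the function names ``env_file_path`` and ``render_env_file`` are hypothetical, and every value passed in the example is made up. Only the five variable names and the ``/etc/default/labgrid-exporter-<place>`` location come from the commit message.

```python
def env_file_path(place: str) -> str:
    """Location of the per-instance env file for a given place name."""
    return f"/etc/default/labgrid-exporter-{place}"


def render_env_file(exporter_bin, coordinator, name, yaml_path, path) -> str:
    """Return env-file text: one KEY=value pair per line."""
    pairs = [
        ("LG_EXPORTER_BIN", exporter_bin),
        ("LG_COORDINATOR", coordinator),
        ("LG_EXPORTER_NAME", name),
        ("LG_EXPORTER_YAML", yaml_path),
        ("PATH", path),
    ]
    return "".join(f"{key}={value}\n" for key, value in pairs)


# Hypothetical values for a place called "nuc".
example_text = render_env_file(
    exporter_bin="/opt/venv/bin/labgrid-exporter",
    coordinator="coordinator.example:20408",
    name="nuc",
    yaml_path="/etc/labgrid/nuc.yaml",
    path="/opt/venv/bin:/usr/bin:/bin",
)
```

Keeping the per-place settings in a separate file is what lets one template unit serve every place name.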
``assert_no_kernel_faults`` only catches panic/oops/BUG/SError. Drivers can fail to probe silently (bad DT overlay apply, phandle mismatch, missing regulator) and the suite would reach the IIO-device assertion with a confusing "not found" message instead of the probe-error root cause. Similarly, ``Link status: DATA`` only says JESD trained at link-up; DMA / TPL / clock-path can silently stop delivering samples and the test wouldn't notice.

Add two helpers in ``test/hw/hw_helpers.py`` and wire them into every hw test that has IIO verification:
1. ``assert_no_probe_errors(dmesg_txt)``: scans for ``probe of <dev> failed with error``, ``Error applying overlay`` / ``failed to apply overlay``, ``Error resolving``. Reuses ``_DMESG_BENIGN_SUBSTRINGS`` (now also allowlisting the stock-Kuiper ZCU102 / ZynqMP watchdog, DisplayPort, and Ceva AHCI probes that fire on every boot regardless of overlay).
2. ``assert_rx_capture_valid(ctx, device_candidates, …)``: uses raw libiio (works with any buffered AXI ADC regardless of whether a pyadi-iio wrapper exists for its device name), enables every non-output scan channel, refills a one-shot buffer, and asserts at least one channel is non-zero + at least one channel's ``|std|`` >= 1 LSB. Device selection tries an ordered candidate list, then falls back to any ``axi-*`` / ``cf-*`` / TPL frontend on the context. ``TimeoutError`` from the refill path gets remapped to a clear ``AssertionError`` pointing at the stalled DMA.

Wired into:
- ``test_ad9081_zcu102_xsa_hw.py`` → ``axi-ad9081-rx-hpc`` / ``ad_ip_jesd204_tpl_adc``.
- ``test_ad9081_zcu102_system_hw.py`` → same.
- ``test_adrv9009_zcu102_hw.py`` → ``axi-adrv9009-rx-hpc``. Also restructured to use ``board`` fixture (from the ``target``-swap fix) so the VCU118-style teardown power-off runs.
- ``test_fmcdaq3_vcu118_hw.py`` → ``axi-ad9680-core-lpc`` / ``axi-ad9680-hpc``.
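The probe-error scan can be sketched as a pattern match with an allowlist. This is a simplified stand-in rather than the code in ``test/hw/hw_helpers.py``: the names ``find_probe_errors`` and ``assert_no_probe_errors_sketch`` are hypothetical, the benign list holds a single entry taken from the watchdog line quoted in this PR, and treating ``error -517`` (the kernel's ``EPROBE_DEFER`` value) as benign is an assumption about how deferrals are filtered.

```python
import re

# Patterns named in the commit message.
_PROBE_ERROR_PATTERNS = (
    re.compile(r"probe of \S+ failed with error"),
    re.compile(r"Error applying overlay"),
    re.compile(r"failed to apply overlay"),
    re.compile(r"Error resolving"),
)

# Illustrative allowlist; the real _DMESG_BENIGN_SUBSTRINGS is longer.
_BENIGN_SUBSTRINGS = ("ffcb0000.watchdog",)

# -EPROBE_DEFER is errno 517 in the kernel; such probes are retried later.
_DEFERRED = re.compile(r"failed with error -517\b")


def find_probe_errors(dmesg_txt: str) -> list:
    """Return the dmesg lines that look like real probe failures."""
    hits = []
    for line in dmesg_txt.splitlines():
        if not any(p.search(line) for p in _PROBE_ERROR_PATTERNS):
            continue
        if any(benign in line for benign in _BENIGN_SUBSTRINGS):
            continue
        if _DEFERRED.search(line):
            continue
        hits.append(line.strip())
    return hits


def assert_no_probe_errors_sketch(dmesg_txt: str) -> None:
    """Fail with the offending lines so the root cause is in the report."""
    hits = find_probe_errors(dmesg_txt)
    assert not hits, "driver probe errors in dmesg:\n" + "\n".join(hits)
```

Putting the matched lines into the assertion message means a failed run reports the probe error itself instead of a later "device not found".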
Four findings surfaced via this check, all fixed:
- ``assert_no_probe_errors`` tripped on stock Kuiper ZynqMP boot noise (``cdns-wdt: probe of ffcb0000.watchdog failed with error -2``, DisplayPort + Ceva AHCI) → three specific device-node addresses added to the benign list so a real watchdog/display/sata regression elsewhere still trips.
- ``adi.ad9081(uri=…)`` fails with ``'NoneType' object has no attribute 'channels'`` when the design exposes the TPL core rather than ``axi-ad9081-rx-hpc`` → moved to raw libiio.
- Fallback initially picked the control-plane device (``ad9528-1``) which isn't AXI-DMA-backed → narrowed fallback to ``axi-*`` / ``cf-*`` / TPL only.
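The device-selection order (candidates first, then a narrowed fallback) could look roughly like the sketch below. It works on a plain list of device names instead of a libiio context, the function name ``pick_rx_device`` is hypothetical, and using ``ad_ip_jesd204_tpl_adc`` as the only TPL fallback pattern is an assumption. It also does not check that the chosen device has input scan channels, which the real helper would need to do.

```python
from fnmatch import fnmatchcase

# Fallback is limited to DMA-backed frontends; control-plane devices such
# as "ad9528-1" must never match.
_FALLBACK_GLOBS = ("axi-*", "cf-*", "ad_ip_jesd204_tpl_adc")


def pick_rx_device(present, candidates):
    """Return the device name to capture from, or None if nothing fits.

    ``present`` is the ordered list of device names on the context;
    ``candidates`` is a single name or an ordered tuple of names.
    """
    if isinstance(candidates, str):
        candidates = (candidates,)

    # 1. Ordered candidates: the first one present on the context wins.
    for name in candidates:
        if name in present:
            return name

    # 2. Narrowed fallback: first present device matching a frontend glob.
    for name in present:
        if any(fnmatchcase(name, glob) for glob in _FALLBACK_GLOBS):
            return name

    return None
```

With only ``ad9528-1`` on the context the function returns ``None`` rather than selecting a device that has no DMA-backed buffer.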
Align ``ADRV937xBuilder``'s output for ``zc706+adrv9371`` with the
working SD-card devicetree shipped by Kuiper so the AD9528,
axi-adxcvr, axi-jesd204, and AD9371 drivers all probe and reach
JESD ``opt_post_running_stage`` cleanly on real hardware. Full
state progression verified on bq via hw CI:
before                                             → after
---------------------------------------------------------------------------------------------------------
RESET Failed                                       → reset succeeds
Requesting device clock 122.88 MHz failed got 0    → AD9528 drives dev_clk (channel@13 / FMC_CLK)
jesd204 link_pre_setup -ENODEV                     → opt_post_running_stage cycling clean on all 3 links
axi-adxcvr / axi-jesd204-rx deferred probe pending → both probe (``AXI-JESD204-RX (1.07.a) at 0x44AA0000...``)
cf_axi_adc probe stuck                             → probed ADC AD9371 MASTER
AD9371 ARM uninitialized                           → Firmware 5.2.2 API 1.5.2.3566 initialized
JESD clocks 245.76 / 122.88 mismatch               → 122.88 / 122.88 match measured
Changes:
- ``adidt/xsa/builders/adrv937x.py``:
  - Emit AD9528 ``channel@{1,3,12,13}`` with correct divider +
    signal-source + extended-name (DEV_CLK, FMC_CLK, DEV_SYSREF,
    FMC_SYSREF).
  - Wire AD9528 ``reset-gpios = <&gpio0 113 0>``.
  - Mark AD9528 as ``jesd204-device`` / ``#jesd204-cells = <2>``
    / ``jesd204-sysref-provider`` /
    ``adi,jesd204-max-sysref-frequency-hz``, required for the
    Mykonos driver's ``opt_post_running_stage`` callback to find
    the sysref provider in the graph.
  - Correct AD9371 GPIO pins: ``trx_reset_gpio`` 130 → 106,
    ``trx_sysref_req_gpio`` 136 → 112, new ``ad9528_reset_gpio``
    = 113.
  - Add AD9528 as link-2 input on the AD9371's
    ``jesd204-inputs``.
  - Add the three ``adi,{sys,out}-clk-select`` +
    ``adi,use-lpm-enable`` props on ``axi-adxcvr``; without
    these the platform driver defers probe indefinitely.
  - Drop emission of ``_DEFAULT_MYKONOS_PROFILE_PROPS`` as a
    builder-level constant. Move the profile values into
    ``adidt/xsa/profiles/adrv937x_zc706.json`` as a
    ``trx_profile_props`` list, keeping the builder default
    empty. The profile has to match the HDL's compile-time
    ``TX_JESD_*`` / ``RX_JESD_*`` knobs (see
    ``analogdevicesinc/hdl/projects/adrv937x/zc706/README``),
    which is board-build-specific.
- ``adidt/xsa/profiles/adrv937x_zc706.json``:
  - ``trx_reset_gpio`` → 106, ``trx_sysref_req_gpio`` → 112,
    new ``ad9528_reset_gpio`` → 113.
  - ``misc_clk_hz`` 245_760_000 → 122_880_000 (physical FMC
    clock on this board; the wrong 245.76 MHz declared rate
    triggered the JESD link-clock measured/reported mismatch
    that kept the link disabled).
  - New ``trx_profile_props`` list carrying 51 Mykonos
    ``adi,{rx,obs,tx,sniffer}-profile-*`` + ``adi,clocks-*``
    lines copied verbatim from bq's working SD-card DT.
- ``adidt/xsa/profiles.py``: accept ``ad9528_reset_gpio`` +
  ``ad9528_jesd204_max_sysref_hz`` in the ``adrv9009_board``
  schema allowlist so JSON loads don't reject the new keys.
- ``adidt/devices/clocks/ad952x.py``: optional
  ``jesd204_sysref_provider`` + ``jesd204_max_sysref_hz`` fields
  on ``AD9528_1`` that ``extra_dt_lines`` emits when set. Opt-in
  so the ADRV9009/ZCU102 path is unaffected.
- ``adidt/eval/adrv937x_fmc.py``: System-API side GPIO sync
  (130/136 → 106/112) to keep the dts-parity test green.
- ``test/devices/fixtures/adrv9371_zc706_xsa_reference.dts``:
  regenerated from the new builder output.
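The opt-in behaviour of the two new AD9528 fields can be illustrated with a small dataclass. This is a sketch, not the ``AD9528_1`` class from ``adidt/devices/clocks/ad952x.py``: the class name ``Ad9528Sketch`` is hypothetical, the exact lines the real ``extra_dt_lines`` emits are assumed from the property names listed above, and the sysref frequency in the usage note is a made-up value.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Ad9528Sketch:
    """Minimal model of a clock chip with opt-in JESD204 sysref properties."""

    jesd204_sysref_provider: bool = False
    jesd204_max_sysref_hz: Optional[int] = None

    def extra_dt_lines(self) -> List[str]:
        """Emit devicetree property lines only for fields that were set.

        With both fields left at their defaults nothing is emitted, so
        existing boards keep producing the same output.
        """
        lines = []
        if self.jesd204_sysref_provider:
            lines.append("jesd204-sysref-provider;")
        if self.jesd204_max_sysref_hz is not None:
            lines.append(
                "adi,jesd204-max-sysref-frequency-hz = "
                f"<{self.jesd204_max_sysref_hz}>;"
            )
        return lines
```

For example, ``Ad9528Sketch(jesd204_sysref_provider=True, jesd204_max_sysref_hz=2000000).extra_dt_lines()`` yields two property lines, while ``Ad9528Sketch().extra_dt_lines()`` yields an empty list.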
Still-open blocker (documented in the hw test's TODO):
ad9371 spi1.1: ILAS mismatch: c7f8
ILAS {lanes per converter, scrambling, octets per frame,
frames per multiframe, number of converters,
sample resolution, control bits per sample} did not match
Our ``axi-jesd204-{tx,rx}`` overlays already emit the exact
framing (M=4 L=4 F=2 / M=4 L=2 F=4, Np=16, CS=2) that the HDL
README documents as default, and the same builder path works on
ADRV9009/ZCU102 — so the remaining gap is that the
``trx_profile_props`` shipped in the profile JSON (copied from
the SD-card DT as a starting point) implies a different JESD
framing than the XSA's HDL bitstream compiles in. Pairing the
profile to the HDL is the single-file drop-in fix that closes
this out; flagged in the TODO with concrete next steps
(iio-oscilloscope regeneration vs ``trx_profile_props`` JSON
override).
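As a sanity check on the framing quoted above, the octets-per-frame value F follows from the other JESD204 link parameters as F = (M × S × Np) / (8 × L). The sketch below evaluates that for both overlay configurations. The function name ``octets_per_frame`` is hypothetical, and S = 1 sample per converter per frame is an assumption, since the commit message does not state S.

```python
def octets_per_frame(m: int, l: int, np_bits: int, s: int = 1) -> int:
    """JESD204 octets per frame per lane: F = (M * S * Np) / (8 * L).

    m       -- number of converters (M)
    l       -- number of lanes (L)
    np_bits -- total bits per sample (Np, written N' in the standard)
    s       -- samples per converter per frame (S); 1 is assumed here
    """
    total_bits = m * s * np_bits
    if total_bits % (8 * l) != 0:
        raise ValueError("parameters do not pack into whole octets per lane")
    return total_bits // (8 * l)


# The two framings named in the commit message, both with Np = 16.
f_four_lanes = octets_per_frame(m=4, l=4, np_bits=16)  # M=4 L=4
f_two_lanes = octets_per_frame(m=4, l=2, np_bits=16)   # M=4 L=2
```

Both results agree with the F values in the message (2 and 4), so the overlay framing is internally consistent, which fits the commit's conclusion that the mismatch comes from the profile side.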
Unit tests: 449 passed, 13 skipped, 4 xfailed.
4783354 to 2dbfe69
Summary
Follow-ups to PR #56 in three independent pieces on top of the merged re-arch:
1. Systemd template unit + installer for ``labgrid-exporter`` (``scripts/labgrid-exporter/``). Replaces the current ``nohup labgrid-exporter … & disown`` pattern on each hw exporter host with a proper systemd template (``labgrid-exporter@<place>``). After install: ``sudo systemctl restart labgrid-exporter@<place>`` on yaml change, ``journalctl -u labgrid-exporter@<place> -f`` for live logs, ``Restart=on-failure`` brings it back after a crash, and the exporter auto-starts after a reboot.
2. ``hardware_ci.rst``: new "Exporter systemd service" section covering install, installed file layout, every install flag, and day-to-day commands. Cross-linked from the "Adding a new hardware node" recipe and two troubleshooting entries (which now use ``systemctl restart labgrid-exporter@<place>`` as the canonical recovery action).
3. Driver-probe-error check in hw tests: new ``assert_no_probe_errors(dmesg_txt)`` helper in ``test/hw/hw_helpers.py`` that catches canonical probe failures (``probe of <dev> failed with error``, overlay-apply errors, phandle-resolution errors) while honouring the existing benign allowlist + ``-EPROBE_DEFER`` retries. Wired into every existing ``assert_no_kernel_faults`` site. Also restructures ``test_fmcdaq3_vcu118_hw.py`` to use the ``board`` fixture (the ``target`` fixture it used before skipped the teardown power-off, leaving VCU118 energised between runs) and swaps the bespoke ``grep`` snippet for ``collect_dmesg`` + the two assertions, matching the structure of the other four hw tests.
Test plan
- ``python3 -m py_compile`` on every edited file
- ``pytest --collect-only`` sees every hw test without import errors
- ``assert_no_probe_errors`` with synthetic dmesg lines (positive + ``-EPROBE_DEFER`` benign + clean)
- ``sudo systemctl restart labgrid-exporter@nuc`` + verify ``labgrid-client places`` still lists ``nuc``