ci(release): fix OIDC trusted publishing, add Slack alert, allow manual re-runs #2275
The `Create Release Pull Request or Publish to npm` step had `continue-on-error: true`, which masks publish failures and lets the workflow complete green. In release run 25827737203 the npm publish for `@hyperdx/cli@0.4.1` failed with `ENEEDAUTH` (a separate npm trusted-publisher config typo, fixed in the npm UI), but because `changeset publish` exits non-zero, `changesets/action` aborts before its `createRelease` loop — so neither the `@hyperdx/cli@0.4.1` nor the four private-package GitHub Releases (api/app/common-utils/otel-collector @ 2.25.0/0.19.0) ever got created, and we only noticed because they were absent from the Releases page. Removing the flag means the next failure surfaces immediately instead of silently shipping a half-release. The subsequent jobs that should still run on a no-op release (`check_version`, Docker builds, downstream notifications) are already gated on `check_changesets.outputs.changeset_outputs_hasChangesets == 'false'`, not on the step succeeding, so dropping `continue-on-error` does not change their behavior on changeset-PR commits.
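A minimal sketch of how the publish step and a downstream gate might look once the flag is gone. The step name, `yarn release`, and the output name come from this PR; the `changesets/action@v1` tag, the step `id`, the `GITHUB_TOKEN` entry, and the runner label are assumptions for illustration, not the actual `release.yml`.

```yaml
jobs:
  check_changesets:
    runs-on: ubuntu-latest            # assumed runner label
    outputs:
      # Re-export the action's hasChangesets output under the name the
      # downstream jobs are gated on.
      changeset_outputs_hasChangesets: ${{ steps.changesets.outputs.hasChangesets }}
    steps:
      - name: Create Release Pull Request or Publish to npm
        id: changesets                # assumed step id
        uses: changesets/action@v1    # assumed version tag
        with:
          publish: yarn release
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        # No continue-on-error here: a non-zero exit from the publish
        # command now fails the step, the job, and the workflow run.

  check_version:
    needs: check_changesets
    # Gated on the output value, not on the step's success, so removing
    # continue-on-error does not change when this job runs on no-op releases.
    if: needs.check_changesets.outputs.changeset_outputs_hasChangesets == 'false'
    runs-on: ubuntu-latest
    steps:
      - run: echo "no pending changesets, checking versions"
```

The gate depends only on the exported output, so when the publish step fails, `check_version` is skipped because its `needs` dependency failed rather than because the gate changed.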
The `check_changesets` job runs on Node 22.21.1, which ships npm 10.9.4. Native client-side OIDC trusted publishing for `npm publish` only landed in npm 11.5.1. npm 10.x has no OIDC code path, so when `changesets/action` hands off to `npm publish`, it falls back to looking for `NODE_AUTH_TOKEN` (which is deliberately not set, to avoid conflicting with OIDC), then exits `ENEEDAUTH`. The April 23 release that originally introduced OIDC (commit 3ae1b85) only appeared to work because `@hyperdx/cli@0.4.0` had been published manually out of band: the workflow's own npm publish E404'd, then a manual npm publish landed, and the next workflow run on the OIDC commit saw "already published" and silently exited zero. Today's 0.4.1 release had no out-of-band publish to fall back on, so the underlying `ENEEDAUTH` surfaced. Installing `npm@latest` brings in npm >= 11.5.1, which can perform the OIDC handshake using the `ACTIONS_ID_TOKEN_REQUEST_URL` / `ACTIONS_ID_TOKEN_REQUEST_TOKEN` variables that GHA already injects when `permissions: id-token: write` is set on the workflow.
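The pieces described above can be sketched together as follows. The `id-token: write` permission, the Node version, and the upgrade step come from this PR; the other permissions, the action version tags, and the version-check step are assumptions added for illustration.

```yaml
permissions:
  contents: write        # assumed: lets the release action push tags
  pull-requests: write   # assumed: lets it open the release PR
  id-token: write        # makes GHA inject ACTIONS_ID_TOKEN_REQUEST_URL / _TOKEN

jobs:
  check_changesets:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4        # assumed version tag
      - uses: actions/setup-node@v4      # assumed version tag
        with:
          node-version: 22.21.1          # ships npm 10.9.4
      - name: Upgrade npm for OIDC trusted publishing
        run: npm install -g npm@latest
      # Hypothetical guard, not part of this PR: fail fast if the installed
      # npm is older than the first release with an OIDC code path.
      - name: Verify npm supports OIDC publishing
        run: |
          required="11.5.1"
          actual="$(npm --version)"
          lowest="$(printf '%s\n%s\n' "$required" "$actual" | sort -V | head -n1)"
          if [ "$lowest" != "$required" ]; then
            echo "npm $actual is older than $required; OIDC publish would fail" >&2
            exit 1
          fi
```

The guard relies on `sort -V` from GNU coreutils, which should be available on Ubuntu-based runners. `NODE_AUTH_TOKEN` is intentionally absent from the sketch, matching the description above.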
E2E Test Results: ✅ All tests passed • 177 passed • 3 skipped • 1160s
Tests ran across 4 shards in parallel.
Needed so we can manually re-run the release workflow against a clean main (e.g. after this PR merges, to retry the cli@0.4.1 publish + the five missing 2.25.0 / 0.19.0 GH Releases). `gh run rerun` uses the original commit's workflow file and so cannot test fixes added later.
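As a sketch, adding the manual trigger amounts to one extra key under `on:`. The existing `push` trigger and its branch filter are assumptions here; only `workflow_dispatch:` is stated in this PR.

```yaml
on:
  push:
    branches: [main]     # assumed existing trigger
  # Enables the "Run workflow" button and `gh workflow run release.yml --ref main`.
  # A dispatched run uses the workflow file at the chosen ref, unlike
  # `gh run rerun`, which replays the original commit's workflow file.
  workflow_dispatch:
```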
Mirrors the slack-notify-failure pattern already present in release-nightly.yml: a job gated on `if: failure() && always()` that waits on every major job, lists the failed ones via listJobsForWorkflowRun, and posts a danger-colored Slack message to SLACK_WEBHOOK_URL_ENG_NOTIFS with the commit SHA + author. Combined with the continue-on-error removal in this PR, the next release.yml failure surfaces on Slack instead of going green-and-silent.
Drops the actions/github-script step that listed failed job names and the Checkout step. Keeps the needs: list (still required so the job evaluates after the workflow's terminal jobs) and switches the gate to a bare `if: failure()` matching the pattern used by the Pull Upstream workflow. Adds a Run-URL field so the alert links straight back to the failing workflow run.
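A rough sketch of the simplified notifier described above. The job name, the `needs:` requirement, the bare `if: failure()` gate, the secret name, and the Run-URL field come from this PR; the exact `needs:` entries, the use of `curl` and `jq`, the payload layout, and the `github.actor` author field are assumptions for illustration.

```yaml
jobs:
  slack-notify-failure:
    # Must list the workflow's terminal jobs so this job evaluates after them.
    # Only these three job names appear in this PR; the real list may be longer.
    needs: [check_changesets, check_version, release-cli]
    if: failure()
    runs-on: ubuntu-latest
    steps:
      - name: Post failure to Slack
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL_ENG_NOTIFS }}
          RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
          COMMIT_SHA: ${{ github.sha }}
          COMMIT_AUTHOR: ${{ github.actor }}   # assumed source for the author
        run: |
          # Build the JSON with jq so values are escaped correctly.
          payload="$(jq -n \
            --arg run "$RUN_URL" --arg sha "$COMMIT_SHA" --arg author "$COMMIT_AUTHOR" \
            '{attachments: [{color: "danger", title: "Release workflow failed",
               fields: [{title: "Commit", value: $sha, short: true},
                        {title: "Author", value: $author, short: true},
                        {title: "Run URL", value: $run, short: false}]}]}')"
          curl -sS --fail -X POST -H 'Content-Type: application/json' \
            --data "$payload" "$SLACK_WEBHOOK_URL"
```

Because `failure()` is a status-check function, the job still evaluates when one of its `needs:` entries failed instead of being skipped along with it.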
🔴 Tier 4 (Critical): Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.
Why this tier:
Additional context: agent branch (
Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
Stats
Adds a Get failed steps script that calls listJobsForWorkflowRun, walks each failed job's steps, and emits a 'Job → Step' list. The fallback '(unknown step)' covers job-level failures where no individual step conclusion is 'failure' (runner timeout, infra cancellation, etc.) so the alert never lands with an empty Failed step(s) field. For today's outage the alert would have read: Check Changesets → Create Release Pull Request or Publish to npm
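One way such a script could be written, as a hedged sketch rather than the PR's actual code. `listJobsForWorkflowRun`, the 'Job → Step' format, and the '(unknown step)' fallback come from the description above; the `actions/github-script@v7` tag, the step `id`, the output name `failed_steps`, and filtering jobs on `conclusion === 'failure'` are assumptions.

```yaml
      - name: Get failed steps
        id: failed                         # assumed step id
        uses: actions/github-script@v7     # assumed version tag
        with:
          script: |
            // Fetch every job in the current run (paginated).
            const jobs = await github.paginate(
              github.rest.actions.listJobsForWorkflowRun,
              { owner: context.repo.owner, repo: context.repo.repo, run_id: context.runId }
            );
            const lines = [];
            for (const job of jobs.filter((j) => j.conclusion === 'failure')) {
              const failedSteps = (job.steps || []).filter((s) => s.conclusion === 'failure');
              if (failedSteps.length === 0) {
                // Job-level failure with no failed step recorded.
                lines.push(`${job.name} → (unknown step)`);
              } else {
                for (const step of failedSteps) {
                  lines.push(`${job.name} → ${step.name}`);
                }
              }
            }
            // Hypothetical output name, read by the Slack step that follows.
            core.setOutput('failed_steps', lines.join('\n'));
```

A later step could read the list as `${{ steps.failed.outputs.failed_steps }}`. The job token likely needs `actions: read` permission for the API call to succeed.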
PR Review
Overall the OIDC fix (npm upgrade +
Deep Review
🟡 P2 (recommended)
🔵 P3 nitpicks (5)
Reviewers (4): correctness, reliability, maintainability, security.
Testing gaps:
# LTS ships npm 10.x, which has no OIDC code path and falls back to
# NODE_AUTH_TOKEN, failing ENEEDAUTH.
- name: Upgrade npm for OIDC trusted publishing
  run: npm install -g npm@latest
should be fine since we only need it for OIDC
Summary
Four ci-only changes to `release.yml` that together unblock npm OIDC trusted publishing, route future failures to Slack, and give us a manual retry path:

1. Upgrade npm to `latest` in `check_changesets` so `npm publish` can perform the OIDC handshake. Runner ships Node 22 + npm 10.9.4 by default, but client-side OIDC trusted publishing landed in npm 11.5.1 (July 2025). Without this upgrade `npm publish` has no OIDC code path, falls back to looking for a `NODE_AUTH_TOKEN` (deliberately not set, to avoid conflicting with OIDC), and exits `ENEEDAUTH`.
2. Remove `continue-on-error: true` from the `Create Release Pull Request or Publish to npm` step so a future publish failure fails the workflow run instead of leaving it green with no releases shipped.
3. Add a `slack-notify-failure` job mirroring the pattern in `release-nightly.yml`: gated on `if: failure() && always()`, waits on every major job, lists the failed ones via `listJobsForWorkflowRun`, and posts a danger-colored Slack message to `SLACK_WEBHOOK_URL_ENG_NOTIFS` with the commit SHA + author. Combined with (2), the next release.yml failure surfaces on Slack instead of going silent.
4. Add a `workflow_dispatch:` trigger so we can manually re-run the workflow against an arbitrary `main` SHA (`gh workflow run release.yml --ref main` or the UI "Run workflow" button). Needed for the recovery path below: `gh run rerun <run-id>` re-uses the original commit's workflow file, so it can't test fixes added in later commits.

Why we didn't catch this on the April-23 OIDC switch

The April-23 release that introduced OIDC (commit 3ae1b85d8) only appeared to work: OIDC never actually fired. Sequence:

1. The workflow's own publish attempted `@hyperdx/cli@0.4.0` and failed with `E404 Not Found - PUT /@hyperdx%2fcli`. Same `continue-on-error: true` mask: workflow green, no GH Releases created.
2. `@hyperdx/cli@0.4.0` was published out of band (manual `npm publish`).
3. The next workflow run on the OIDC commit ran `changeset publish`, which saw the package was already on npm and reported it as already published.
4. `changeset publish` then exited zero, `changesets/action` proceeded with its `createRelease` loop, and the five `@hyperdx/{api,app,cli,common-utils,otel-collector}` GH Releases got created. OIDC was never exercised; the publish was a no-op.

Today's 2.25.0 release (run 25827737203) had no out-of-band publish to fall back on, so the underlying `ENEEDAUTH` surfaced as a real failure for `@hyperdx/cli@0.4.1`. `changesets/action` aborts inside `runPublish` (the `execWithOutput` call sits above the `createGithubReleases` loop, and it re-throws on non-zero exit), so neither the cli nor the four private-package GH Releases (`@hyperdx/api@2.25.0`, `@hyperdx/app@2.25.0`, `@hyperdx/common-utils@0.19.0`, `@hyperdx/otel-collector@2.25.0`) ever got created. The lone `cli-v0.4.1` entry on the Releases page came from the separate `release-cli` job using `softprops/action-gh-release`, not from changesets.

The npm trusted-publisher config also had a `release.yaml` → `release.yml` filename typo (fixed in the npm UI). Real bug, but independent: when the npm client never sends an OIDC token in the first place, registry-side trusted-publisher matching is moot.

Recovery for the stuck 2.25.0 / cli@0.4.1 cycle
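After merging this PR, trigger `release.yml` manually against `main`, for example with `gh workflow run release.yml --ref main` or the UI "Run workflow" button.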
The run will:
1. `check_changesets`: install npm `latest`, see no changesets, `yarn release` → genuine OIDC `npm publish` of `@hyperdx/cli@0.4.1` (with provenance) → push the five 2.25.0 / 0.19.0 git tags → `changesets/action` creates all five GH Releases.
2. `check_version`: `manifest inspect` finds the existing 2.25.0 image tag → `should_release=false` → all Docker build / publish / downstream-notify jobs short-circuit.
3. `release-cli`: `gh release view cli-v0.4.1` succeeds → `exists=true` → compile + create-release steps are skipped.
4. `slack-notify-failure`: only runs if any of the above fail; silent on success.
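The two idempotency gates in this run plan could look roughly like the sketch below. The output names `should_release` and `exists`, the `cli-v0.4.1` tag, and the use of `manifest inspect` and `gh release view` come from the description; the image reference, step ids, use of `docker manifest inspect` specifically, and the hard-coded tag are hypothetical placeholders.

```yaml
jobs:
  check_version:
    runs-on: ubuntu-latest
    outputs:
      should_release: ${{ steps.probe.outputs.should_release }}
    steps:
      - name: Probe for an existing image tag
        id: probe                              # assumed step id
        env:
          IMAGE: example/hyperdx:2.25.0        # hypothetical image reference
        run: |
          # An existing manifest means this version was already published.
          if docker manifest inspect "$IMAGE" >/dev/null 2>&1; then
            echo "should_release=false" >> "$GITHUB_OUTPUT"
          else
            echo "should_release=true" >> "$GITHUB_OUTPUT"
          fi

  release-cli:
    runs-on: ubuntu-latest
    steps:
      - name: Check whether the CLI release already exists
        id: check                              # assumed step id
        env:
          GH_TOKEN: ${{ github.token }}
          GH_REPO: ${{ github.repository }}
          TAG: cli-v0.4.1                      # hard-coded here for illustration
        run: |
          if gh release view "$TAG" >/dev/null 2>&1; then
            echo "exists=true" >> "$GITHUB_OUTPUT"
          else
            echo "exists=false" >> "$GITHUB_OUTPUT"
          fi
      - name: Compile and create release
        if: steps.check.outputs.exists != 'true'
        run: echo "build and publish the CLI binaries here"
```

Both gates treat "already exists" as success, which is what makes the manual re-run safe to repeat.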