Skip to content

ci(release): fix OIDC trusted publishing, add Slack alert, allow manual re-runs#2275

Merged
kodiakhq[bot] merged 7 commits into
mainfrom
claude/release-fail-loudly-on-publish-error
May 14, 2026
Merged

ci(release): fix OIDC trusted publishing, add Slack alert, allow manual re-runs#2275
kodiakhq[bot] merged 7 commits into
mainfrom
claude/release-fail-loudly-on-publish-error

Conversation

@wrn14897
Copy link
Copy Markdown
Member

@wrn14897 wrn14897 commented May 14, 2026

Summary

Four ci-only changes to release.yml that together unblock npm OIDC trusted publishing, route future failures to Slack, and give us a manual retry path:

  1. Upgrade npm to latest in check_changesets so npm publish can perform the OIDC handshake. Runner ships Node 22 + npm 10.9.4 by default, but client-side OIDC trusted publishing landed in npm 11.5.1 (July 2025). Without this upgrade npm publish has no OIDC code path, falls back to looking for a NODE_AUTH_TOKEN (deliberately not set, to avoid conflicting with OIDC), and exits ENEEDAUTH.
  2. Drop continue-on-error: true from the Create Release Pull Request or Publish to npm step so a future publish failure fails the workflow run instead of leaving it green with no releases shipped.
  3. Add slack-notify-failure job mirroring the pattern in release-nightly.yml: gated on if: failure() && always(), waits on every major job, lists the failed ones via listJobsForWorkflowRun, and posts a danger-colored Slack message to SLACK_WEBHOOK_URL_ENG_NOTIFS with the commit SHA + author. Combined with (2), the next release.yml failure surfaces on Slack instead of going silent.
  4. Add workflow_dispatch: trigger so we can manually re-run the workflow against an arbitrary main SHA (gh workflow run release.yml --ref main or the UI "Run workflow" button). Needed for the recovery path below — gh run rerun <run-id> re-uses the original commit's workflow file, so it can't test fixes added in later commits.

Why we didn't catch this on the April-23 OIDC switch

The April-23 release that introduced OIDC (commit 3ae1b85d8) only appeared to work — OIDC never actually fired. Sequence:

  1. Run 24862921978 (2026-04-23, on the release-PR commit before the OIDC switch) tried to publish @hyperdx/cli@0.4.0 and failed with E404 Not Found - PUT /@hyperdx%2fcli — same continue-on-error: true mask, workflow green, no GH Releases created.
  2. Between 22:55 and 23:30 UTC, @hyperdx/cli@0.4.0 was published out of band (manual npm publish).
  3. Run 24864061403, kicked off by the OIDC commit itself, re-ran changeset publish which saw the package was already on npm and emitted:
    🦋  warn @hyperdx/cli is not being published because version 0.4.0 is already published on npm
    🦋  warn No unpublished projects to publish
    
    changeset publish then exited zero, changesets/action proceeded with its createRelease loop, and the five @hyperdx/{api,app,cli,common-utils,otel-collector} GH Releases got created. OIDC was never exercised — the publish was a no-op.

Today's 2.25.0 release (run 25827737203) had no out-of-band publish to fall back on, so the underlying ENEEDAUTH surfaced as a real failure for @hyperdx/cli@0.4.1. changesets/action aborts inside runPublish (the execWithOutput call sits above the createGithubReleases loop, and it re-throws on non-zero exit), so neither the cli nor the four private-package GH Releases (@hyperdx/api@2.25.0, @hyperdx/app@2.25.0, @hyperdx/common-utils@0.19.0, @hyperdx/otel-collector@2.25.0) ever got created. The lone cli-v0.4.1 entry on the Releases page came from the separate release-cli job using softprops/action-gh-release — not from changesets.

The npm trusted-publisher config also had a release.yamlrelease.yml filename typo (fixed in the npm UI). Real bug, but independent — when the npm client never sends an OIDC token in the first place, registry-side trusted-publisher matching is moot.

Recovery for the stuck 2.25.0 / cli@0.4.1 cycle

After merging this PR:

gh workflow run release.yml --ref main

The run will:

  • check_changesets: install npm latest, see no changesets, yarn release → genuine OIDC npm publish of @hyperdx/cli@0.4.1 (with provenance) → push the five 2.25.0 / 0.19.0 git tags → changesets/action creates all five GH Releases.
  • check_version: manifest inspect finds the existing 2.25.0 image tag → should_release=false → all Docker build / publish / downstream-notify jobs short-circuit.
  • release-cli: gh release view cli-v0.4.1 succeeds → exists=true → compile + create-release steps are skipped.
  • slack-notify-failure: only runs if any of the above fail — silent on success.

The `Create Release Pull Request or Publish to npm` step had
`continue-on-error: true`, which masks publish failures and lets the
workflow complete green. In release run 25827737203 the npm publish
for `@hyperdx/cli@0.4.1` failed with `ENEEDAUTH` (a separate npm
trusted-publisher config typo, fixed in the npm UI), but because
`changeset publish` exits non-zero, `changesets/action` aborts before
its `createRelease` loop — so neither the `@hyperdx/cli@0.4.1` nor the
four private-package GitHub Releases (api/app/common-utils/otel-collector
@ 2.25.0/0.19.0) ever got created, and we only noticed because they
were absent from the Releases page. Removing the flag means the next
failure surfaces immediately instead of silently shipping a half-release.

The subsequent jobs that should still run on a no-op release
(`check_version`, Docker builds, downstream notifications) are already
gated on `check_changesets.outputs.changeset_outputs_hasChangesets ==
'false'`, not on the step succeeding, so dropping `continue-on-error`
does not change their behavior on changeset-PR commits.
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 14, 2026

⚠️ No Changeset found

Latest commit: dff29a8

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@vercel
Copy link
Copy Markdown

vercel Bot commented May 14, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
hyperdx-oss Ready Ready Preview, Comment May 14, 2026 5:19pm

Request Review

The check_changesets job runs on Node 22.21.1, which ships npm 10.9.4.
Native client-side OIDC trusted publishing for `npm publish` only
landed in npm 11.5.1 — npm 10.x has no OIDC code path, so when
changesets/action hands off to npm publish, it falls back to looking
for NODE_AUTH_TOKEN (which is deliberately not set, to avoid
conflicting with OIDC), then exits ENEEDAUTH.

The April 23 release that originally introduced OIDC (commit
3ae1b85) only appeared to work because @hyperdx/cli@0.4.0 had
been published manually out of band — the workflow's own npm
publish E404'd, then a manual npm publish landed, and the next
workflow run on the OIDC commit saw "already published" and
silently exited zero. Today's 0.4.1 release had no out-of-band
publish to fall back on, so the underlying ENEEDAUTH surfaced.

Pinning to `npm@latest` installs >= 11.5.1, which can perform
the OIDC handshake using ACTIONS_ID_TOKEN_REQUEST_URL /
ACTIONS_ID_TOKEN_REQUEST_TOKEN that GHA already injects when
`permissions: id-token: write` is set on the workflow.
@wrn14897 wrn14897 changed the title ci(release): drop continue-on-error from changesets publish step ci(release): fix OIDC trusted publishing + fail loudly on publish errors May 14, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

E2E Test Results

All tests passed • 177 passed • 3 skipped • 1160s

Status Count
✅ Passed 177
❌ Failed 0
⚠️ Flaky 4
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

View full report →

Needed so we can manually re-run the release workflow against a
clean main (e.g. after this PR merges, to retry the cli@0.4.1
publish + the five missing 2.25.0 / 0.19.0 GH Releases). gh run
rerun uses the original commit's workflow file and so cannot test
fixes added later.
@wrn14897 wrn14897 changed the title ci(release): fix OIDC trusted publishing + fail loudly on publish errors ci(release): fix OIDC trusted publishing, fail loudly, allow manual re-runs May 14, 2026
Mirrors the slack-notify-failure pattern already present in
release-nightly.yml: a job gated on `if: failure() && always()`
that waits on every major job, lists the failed ones via
listJobsForWorkflowRun, and posts a danger-colored Slack message to
SLACK_WEBHOOK_URL_ENG_NOTIFS with the commit SHA + author.

Combined with the continue-on-error removal in this PR, the next
release.yml failure surfaces on Slack instead of going green-and-silent.
@wrn14897 wrn14897 changed the title ci(release): fix OIDC trusted publishing, fail loudly, allow manual re-runs ci(release): fix OIDC trusted publishing, add Slack alert, allow manual re-runs May 14, 2026
Drops the actions/github-script step that listed failed job names
and the Checkout step. Keeps the needs: list (still required so the
job evaluates after the workflow's terminal jobs) and switches the
gate to a bare `if: failure()` matching the pattern used by the
Pull Upstream workflow. Adds a Run-URL field so the alert links
straight back to the failing workflow run.
@wrn14897 wrn14897 marked this pull request as ready for review May 14, 2026 00:59
@github-actions github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label May 14, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

  • Critical-path files (1):
    • .github/workflows/release.yml
  • All files are docs / images / lock files

Additional context: agent branch (claude/release-fail-loudly-on-publish-error)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats
  • Production files changed: 0
  • Production lines changed: 0
  • Branch: claude/release-fail-loudly-on-publish-error
  • Author: wrn14897

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

Adds a Get failed steps script that calls listJobsForWorkflowRun, walks
each failed job's steps, and emits a 'Job → Step' list. The fallback
'(unknown step)' covers job-level failures where no individual step
conclusion is 'failure' (runner timeout, infra cancellation, etc.) so
the alert never lands with an empty Failed step(s) field.

For today's outage the alert would have read:
  Check Changesets → Create Release Pull Request or Publish to npm
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

PR Review

  • ⚠️ npm install -g npm@latest is unpinned → pin to a known-good major (e.g., npm@11) so a future broken npm release can't break the release workflow. This step runs on every release; an unpinned latest reintroduces the same class of silent breakage this PR is fixing.
  • ⚠️ if: failure() diverges from the referenced nightly pattern (if: failure() && always()) at .github/workflows/release.yml:712 → functionally close, but worth aligning for consistency since the PR description claims it mirrors release-nightly.yml.
  • ℹ️ slack-notify-failure needs: list omits check_version and the build-* jobs at .github/workflows/release.yml:699-710. Failures there will still bubble up transitively through publish-*, so notifications will fire — but listing the build jobs directly would make failures of build-* (when the corresponding publish-* is skipped) clearer in the alert and avoids reliance on transitive failure() semantics.
  • ℹ️ Verify the SLACK_WEBHOOK_URL_ENG_NOTIFS secret is configured for this workflow's environment (already used by release-nightly.yml, so likely fine — just worth confirming before merge).

Overall the OIDC fix (npm upgrade + continue-on-error removal) is the right diagnosis and matches the failure mode described. No blocking issues.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 14, 2026

Deep Review

🟡 P2 — recommended

  • .github/workflows/release.yml:712if: failure() deviates from the sibling pattern if: failure() && always() used in release-nightly.yml:349; with this job's nine-entry needs: list including jobs that routinely skip on the release-PR path (e.g. notify_helm_charts gated on hasChangesets, notify_ch with if: false), the unguarded failure() is more fragile under skip/cancel interactions than the established convention.
    • Fix: Change to if: failure() && always() to match the sibling job and guarantee the alert fires regardless of which needs: job triggered the failure.
    • correctness, reliability, maintainability
  • .github/workflows/release.yml:33-34npm install -g npm@latest floats unpinned in a job that holds id-token: write, contents: write, and packages: write; future npm releases (a regression or compromised release like the recent chalk/debug/ua-parser-js incidents) will silently flow in and either break OIDC publishing or run a hostile postinstall before the OIDC token is minted.
    • Fix: Pin to a known-good minimum like npm install -g npm@^11.5.1 (matching the version cited in the inline comment) and bump explicitly via PR.
    • reliability, security, maintainability
  • .github/workflows/release.yml:714-736 — the Get failed steps github-script has no error handling around listJobsForWorkflowRun; if the API call fails (rate limit, transient 5xx) the step throws, failed_steps is never set, the unguarded Slack step is skipped by default, and the workflow that this PR is meant to alert on goes silent.
    • Fix: Wrap the script in try/catch that falls back to a sentinel value, or add continue-on-error: true to this step plus if: always() to the Slack step so the alert fires even when step enumeration fails.
    • reliability
  • .github/workflows/release.yml:750${{ steps.get_failed_steps.outputs.failed_steps }} is interpolated directly into a JSON string inside custom_payload; safe today because job/step names are workflow-controlled and lack "/\, but the GitHub-recommended hardening pattern is to pass the value via env: and reference it as $FAILED_STEPS so a future name change cannot break the payload.
    • Fix: Move the value to an env: block on the Slack step and reference it as a shell variable in custom_payload, or JSON-escape it in the github-script before core.setOutput.
    • correctness, security
  • .github/workflows/release.yml:698-773 vs .github/workflows/release-nightly.yml:340-402 — the two slack-notify-failure jobs drift on if: condition, output/step naming (failed_steps vs failed_jobs), fields: list, message text, attachment titles, and the missing status === 'completed' filter; a reader six months from now will have to diff the files to figure out which is canonical.
    • Fix: Pick one canonical shape (the new step-level variant is strictly more informative) and align both files; backport the Run link and step-level enumeration to nightly.
    • maintainability
🔵 P3 nitpicks (5)
  • .github/workflows/release.yml:47if: always() on the changesets step is now inert with continue-on-error: true removed; the only prior steps are trivially infallible config commands plus yarn install and make ci-build, and always() only serves to mask a real prior failure under a confusing secondary changesets error.
    • Fix: Drop if: always() so an earlier-step failure is reported as the root cause.
    • correctness, maintainability
  • .github/workflows/release.yml:726-734 — the filter job.conclusion === 'failure' excludes cancelled, timed_out, and action_required jobs; a timeout (real failure mode on Large-Runner-* builds) would alert with failed_steps = unknown.
    • Fix: Broaden to ['failure', 'cancelled', 'timed_out'].includes(job.conclusion) and apply the same to the step-level filter.
  • .github/workflows/release.yml:699-710 and :779-791 — the nine-job needs: list is duplicated between slack-notify-failure and otel-cicd-action; new publish jobs must be added in two places, a real drift hazard.
    • Fix: Make otel-cicd-action.needs: depend only on slack-notify-failure (which transitively waits on the publish jobs), keeping if: always() so the trace export still runs on success.
    • maintainability
  • .github/workflows/release.yml:699-710needs: omits check_version and the build-*/merge-* matrix jobs; failure() does walk the ancestor graph so behavior is currently correct, but the explicit dependency list is incomplete relative to what the listed publish jobs depend on.
    • Fix: Either add the intermediate jobs to needs: for explicitness, or add a comment noting that transitive failure() is intentional.
  • .github/workflows/release.yml:5workflow_dispatch: accepts no inputs: and has no branch guard; an operator can dispatch from a feature branch and publish Docker images built from non-main code (changesets itself is gated, but the check_version/build/publish path is not).
    • Fix: Restrict dispatch to main via a job-level if: github.ref == 'refs/heads/main' or add an inputs: confirmation field.

Reviewers (4): correctness, reliability, maintainability, security.

Testing gaps:

  • No exercise of the failure path — force a failure on check_changesets from a throwaway branch to confirm the Slack webhook actually fires and the JSON payload is well-formed.
  • No assertion that npm --version is >= 11.5.1 after the global install, so a silent regression in npm@latest would only surface as an ENEEDAUTH failure during publish.
  • No verification that the github-script step handles a listJobsForWorkflowRun API failure without dropping the Slack alert.

# LTS ships npm 10.x, which has no OIDC code path and falls back to
# NODE_AUTH_TOKEN, failing ENEEDAUTH.
- name: Upgrade npm for OIDC trusted publishing
run: npm install -g npm@latest
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we pin this?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be fine since we only need it for OIDC

@kodiakhq kodiakhq Bot merged commit 9ce6d72 into main May 14, 2026
19 checks passed
@kodiakhq kodiakhq Bot deleted the claude/release-fail-loudly-on-publish-error branch May 14, 2026 17:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automerge review/tier-4 Critical — deep review + domain expert sign-off

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants