Skip to content

feat(api): redactSecrets util for LLM input from observability data#2188

Merged
kodiakhq[bot] merged 7 commits into
mainfrom
alex/HDX-3992-redact-secrets
May 7, 2026
Merged

feat(api): redactSecrets util for LLM input from observability data#2188
kodiakhq[bot] merged 7 commits into
mainfrom
alex/HDX-3992-redact-secrets

Conversation

@alex-fedotyev
Copy link
Copy Markdown
Contributor

Summary

Adds a reusable best-effort secret redactor at packages/api/src/utils/redactSecrets.ts. Internal-only utility; no consumer in this PR. The next AI-summarize PR (HDX-3992 split) imports it; future LLM endpoints that ingest observability data should also.

The file header codifies the design rule for HyperDX AI endpoints:

Any LLM input derived from observability data passes through redactSecrets before leaving the API process. User-authored prose (the chart-builder assistant where the user types their own question) does NOT, because redacting the user's own input would strip exactly what they meant to ask.

Patterns covered

name shape
pem -----BEGIN ... PRIVATE KEY----- blocks (RSA, EC, DSA, OPENSSH, PKCS#8)
basic-auth-url https://user:pass@host
key-value password=..., api_key=..., token=..., etc.
json-quoted {"password":"..."} and similar
http-header X-Api-Key:, X-Auth-Token:, Api-Key:
bearer Authorization: Bearer ...
basic Authorization: Basic ...
jwt eyJ... three-segment base64url
aws-access-key AKIA[16], ASIA[16]
slack-token xox[a-z]-...
github-token ghp_, gho_, ghu_, ghs_, ghr_

Known gaps (deferred to follow-ups)

  • URL percent-encoded values (beyond what query-string parsing already catches)
  • Vendor-specific tokens not listed above (Stripe, Twilio, Datadog)
  • Generic high-entropy hex blobs (too many false positives without surrounding context)

Why this is its own PR

Splits cleanly from the larger AI summarize work (HDX-3992 / #2108). Lands as a small, isolated, test-heavy change so review is fast and the util is in place before downstream consumers arrive.

Tests

38 cases in packages/api/src/utils/__tests__/redactSecrets.test.ts covering: each pattern with a positive case, a "looks similar but isn't" negative, and at least one multi-secret payload. Pattern-coverage assertion exposes the registry shape so future additions get a compile-time signal.

yarn jest src/utils/__tests__/redactSecrets.test.ts
# Test Suites: 1 passed, 1 total
# Tests:       38 passed, 38 total

All neighboring api utils tests still pass (8 suites, 61 tests).

No user-facing change

The util is internal API code with no production consumer in this PR. No changeset.

References

Adds a reusable best-effort secret redactor with conservative
allowlist patterns covering: PEM blocks, basic-auth URLs, key=value
pairs, JSON-shaped secrets, HTTP secret headers, Bearer/Basic auth
values, JWTs, AWS access keys, Slack tokens, and GitHub token shapes.

Codifies the design rule for HyperDX AI endpoints in the file header:
LLM input derived from observability data passes through redactSecrets;
user-authored prose does not.

Internal-only; no consumer in this commit. Imported by the upcoming
/ai/summarize endpoint and any future LLM endpoints that ingest
observability data.

Refs HDX-3992.
@vercel
Copy link
Copy Markdown

vercel Bot commented May 4, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
hyperdx-oss Ready Ready Preview, Comment May 7, 2026 1:59pm

Request Review

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 4, 2026

⚠️ No Changeset found

Latest commit: 757c19b

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@github-actions github-actions Bot added the review/tier-2 Low risk — AI review + quick human skim label May 4, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

🔵 Tier 2 — Low Risk

Small, isolated change with no API route or data model modifications.

Why this tier:

  • Standard feature/fix — introduces new logic or modifies core functionality

Review process: AI review + quick human skim (target: 5–15 min). Reviewer validates AI assessment and checks for domain-specific concerns.
SLA: Resolve within 4 business hours.

Stats
  • Production files changed: 1
  • Production lines changed: 144 (+ 359 in test files, excluded from tier calculation)
  • Branch: alex/HDX-3992-redact-secrets
  • Author: alex-fedotyev

To override this classification, remove the review/tier-2 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

PR Review

Solid, well-scoped utility with strong test coverage and good documentation of design intent. A couple of items worth considering before merge:

  • ⚠️ Basic/Bearer patterns false-positive on English prose (redactSecrets.ts:75,81) — /Basic\s+[A-Za-z0-9+/=]+/gi and /Bearer\s+[A-Za-z0-9._~+/=-]+/gi are case-insensitive, so observability log lines containing natural English like "basic understanding", "basic auth flow disabled", or "bearer of bad news" will be redacted to Basic [REDACTED] / Bearer [REDACTED]. → Anchor with a header context (e.g. require Authorization\s*:\s* before the scheme) or drop the i flag and require capitalized Basic/Bearer after : or whitespace at line start. Add a negative test for a plain-prose case to lock the behavior in.

  • ⚠️ SECRET_KEY_TOKENS includes bare auth — combined with the case-insensitive key-value pattern, this matches auth=admin (good) but is a very broad token that risks redacting innocuous keys like auth=cookie in user-controlled query strings. → Consider dropping bare auth (keeping authorization) since the value is mostly redundant with token/api_key once those are covered, OR add a test that pins down the intended boundary.

  • ℹ️ http-header pattern doesn't include Authorization despite that being the most common secret header — and the bearer/basic patterns above only fire on the value, not the full header. If a payload contains Authorization: SomeCustomScheme abc123 (neither Bearer nor Basic), nothing redacts it. → Consider adding authorization to the http-header regex's key alternation as a catch-all.

  • ✅ PEM bounded-quantifier ReDoS mitigation, JWT word-boundary anchor, and basic-auth URL @-in-password handling are all correct and tested. Pattern-coverage assertion via REDACTION_PATTERN_NAMES is a nice forward-compat signal.

  • ✅ Adheres to repo conventions: util lives in packages/api/src/utils/, tests in adjacent __tests__/, file under 200 lines, no changeset (internal-only utility, correctly noted in PR body).

None of the above are merge-blockers given the file-header explicitly frames this as best-effort and acknowledges that downstream LLM context degradation from over-redaction is the lesser harm. But the Basic/Bearer prose false-positive is the one I'd most want addressed since observability log bodies often do contain English prose.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

E2E Test Results

All tests passed • 164 passed • 3 skipped • 1182s

Status Count
✅ Passed 164
❌ Failed 0
⚠️ Flaky 5
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

View full report →

Address review comments on #2188:

- basic-auth-url now handles "@" in passwords. Previous regex stopped
  at the first "@", leaving any password tail before the host visible.
  New regex greedily consumes the password and backtracks to the last
  "@" before the host; host is captured and preserved in the
  replacement. New test: a password containing "@" must be fully
  redacted, with the host intact.

- key-value pattern now matches shell-style quoted values:
  PASSWORD="hunter2 with spaces" and API_KEY='abc 123' are redacted.
  Previously the unquoted character class stopped at the leading
  quote, so neither pattern fired. Two new tests cover both quote
  styles.

- pem pattern is bounded by {0,16000}? on the lazy match so an
  unmatched BEGIN does not scan an unbounded amount of trailing
  input. Real PEM blocks are well under 16KB; the API caps the whole
  request body at 50KB. New test asserts unchanged output and
  sub-500ms wall-clock on a 50KB unmatched-BEGIN payload.

- Header "Known gaps" comment now mentions raw "@" in basic-auth
  usernames (ambiguous to parse without percent-encoding).

44 tests pass; eight new cases for the items above. No changes to the
public surface. Refs HDX-3992.
@alex-fedotyev
Copy link
Copy Markdown
Contributor Author

Thanks for the review. Pushed fixes in 9753dc1.

  • ⚠️ basic-auth-url with @ in password — fixed. The pattern now greedily consumes the password and backtracks to the last @ before the host; the host is captured and kept in the replacement. The previous test was too weak (it asserted .toContain('[REDACTED]:[REDACTED]') and .not.toContain('svc:p@ss'), both of which passed even though ss was leaking). Replaced with a stricter assertion: a password containing @ must be fully gone and the host must be intact. Good catch.

  • ⚠️ key-value misses quoted values — fixed. The value group now alternates over "…", '…', and the unquoted class, so PASSWORD="hunter2" and API_KEY='abc 123' both get redacted. Two new tests.

  • ℹ️ PEM lazy match unbounded — fixed. The [\s\S]*? is now bounded by {0,16000}?. Real PEM blocks are well under 16KB; the API also caps the whole request body at 50KB so this is purely defensive. New test asserts unchanged output and sub-500ms wall-clock on a 50KB unmatched-BEGIN payload.

  • Also added a note to the "Known gaps" header about raw @ in basic-auth usernames (ambiguous without percent-encoding).

44 tests pass; eight new cases.

@github-actions github-actions Bot added review/tier-3 Standard — full human review required and removed review/tier-2 Low risk — AI review + quick human skim labels May 4, 2026
The previous review-fix commit pushed prod lines from 139 to 153, just
over the Tier 2 threshold (< 150 prod lines). Compressing the verbose
comments on PEM, basic-auth-url, and key-value patterns brings prod
back to 144. No behavior change.
@github-actions github-actions Bot added review/tier-2 Low risk — AI review + quick human skim and removed review/tier-3 Standard — full human review required labels May 4, 2026
@alex-fedotyev alex-fedotyev marked this pull request as draft May 5, 2026 03:13
The PR body has always declared this PR as having no user-facing
change (internal-only utility, no consumer in this PR). The
changeset was added in error and would surface a stray "feat(api)"
line in the next release notes for code that no production caller
reaches yet. Drop it; the consumer's PR (#2206) carries the
changeset that ships the user-facing behavior.
alex-fedotyev added a commit that referenced this pull request May 6, 2026
Backend endpoint for natural-language summaries of logs/traces and
patterns. Subject-prompt registry keyed by `kind`, hardcoded tone
modifiers (default | noir | attenborough | shakespeare), and a 30
req/min per-user rate limit. User content is wrapped in <data> tags
so the model can separate data from instructions; secrets are
redacted via the utility from #2188.

Initial release covers `event` and `pattern`. The `alert` kind,
conversation history (`messages` array), and trace-context
enrichment land in follow-up PRs as their UI consumers ship.
@fleon fleon self-requested a review May 7, 2026 13:55
@fleon
Copy link
Copy Markdown
Contributor

fleon commented May 7, 2026

LGTM for the sole purpose of omitting secrets before they are sent to the LLM. However, we should note that this should not be used anywhere else as the current implementation is fairly naive and wouldn't cover many kinds of secrets.

@kodiakhq kodiakhq Bot merged commit 4e34645 into main May 7, 2026
19 checks passed
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 7, 2026

Deep Review

🔴 P0/P1 -- must fix

  • packages/api/src/utils/redactSecrets.ts:69 -- The bearer value class [A-Za-z0-9._~+/=-] omits _, so a real JWT bearer token (base64url uses _) terminates at the first underscore and the post-_ tail (signature, trailing payload bytes) is left in plaintext after Bearer [REDACTED].
    • Fix: Add _ to the char class so it matches the alphabet already used by the jwt pattern at line 82, e.g. /Bearer\s+[A-Za-z0-9._~+/=_-]+/gi.
    • adversarial
  • packages/api/src/utils/redactSecrets.ts:63 -- The basic-auth-url scheme allowlist (https?|ftp|ssh) misses database/queue connection strings (postgres://, postgresql://, mongodb://, mongodb+srv://, mysql://, redis://, amqp://, kafka://) which routinely appear with embedded credentials in observability logs and are not in the file's "Known gaps" list.
    • Fix: Extend the scheme group to include postgres(?:ql)?|mysql|mariadb|mongodb(?:\+srv)?|redis|rediss|amqp|amqps|kafka|clickhouse and add regression tests for postgres:// and mongodb+srv://.
    • security
  • packages/api/src/utils/redactSecrets.ts:49 -- No pattern matches OpenAI (sk-…) or Anthropic (sk-ant-…) API key shapes, despite this redactor specifically fronting an LLM-provider call — a leaked LLM-vendor key would be exfiltrated to the very provider that issued it.
    • Fix: Add /\bsk-(?:ant-)?[A-Za-z0-9_-]{20,}\b/g -> [REDACTED_LLM_KEY] and document the threat-model rationale alongside the existing patterns block.
    • security

🟡 P2 -- recommended

  • packages/api/src/utils/redactSecrets.ts:69 -- The bearer regex is case-insensitive with no \b prefix, so prose like a bearer of bad news or lumbearer abc matches bearer of / bearer abc and rewrites the next word to Bearer [REDACTED], destroying legitimate log content (and force-uppercasing lowercase bearer).
    • Fix: Either drop the i flag and add \b anchors, or require an Authorization\s*: lead-in (e.g. /(?:^|[\s,])Bearer\s+[A-Za-z0-9._~+/=_-]{8,}/).
    • correctness, security, adversarial
  • packages/api/src/utils/redactSecrets.ts:75 -- The basic regex is case-insensitive with no \b and the value class [A-Za-z0-9+/=]+ is a strict superset of [A-Za-z], so common log prose like Basic auth required or Basic Authentication is required is mangled into Basic [REDACTED] required.
    • Fix: Anchor with \b on both sides and require base64-shaped values (minimum length plus presence of + / or =-padding) so English words don't satisfy the value class.
    • correctness, security, adversarial
  • packages/api/src/utils/redactSecrets.ts:119 -- The json-quoted value class [^"]* stops at the first literal quote regardless of escape, so {"password":"abc\"def","user":"alice"} matches "password":"abc\" and leaks the post-escape suffix def","user":... into the output as a stranded JSON fragment.
    • Fix: Change the value class to (?:[^"\\]|\\.)* so backslash-escaped quotes are absorbed by the match.
    • adversarial
  • packages/api/src/utils/redactSecrets.ts:55 -- The PEM body bound {0,16000}? was sized for ReDoS safety but silently fails to redact PRIVATE KEY blocks larger than ~16 KB (encrypted PKCS#8 bundles, multi-key OpenSSH, RSA-8192 with metadata) — the engine never finds the END within bounds and the full key body passes through.
    • Fix: Either widen the bound (e.g. {0,131072}?), document the limit and add an explicit "oversized PEM is intentionally not redacted" test, or pair the regex with a line-counter pass for oversized blocks.
    • adversarial
  • packages/api/src/utils/redactSecrets.ts:63 -- The basic-auth-url greedy password class [^/\s]+ permits @ and is followed by literal @ plus host, producing O(n²) backtracking on adversarial inputs like https://a: + b@.repeat(N) + end.
    • Fix: Either drop @ from the password class (accepting that @-containing passwords stay unredacted) or bound it (e.g. {1,256}); add a regression test that runs in <100 ms on b@.repeat(4000).
    • performance
  • packages/api/src/utils/redactSecrets.ts:49 -- Vendor-token shapes the file lists as deferred — Stripe (sk_live_, rk_live_, pk_live_, whsec_), Google (AIza…), Twilio (AC…/SK…), Datadog — are routinely emitted by payment, maps, SMS, and APM integrations and have low-false-positive shapes that are cheap to add.
    • Fix: Add \b(?:sk|rk|pk)_live_[A-Za-z0-9]{16,}\b, \bwhsec_[A-Za-z0-9]{20,}\b, \bAIza[A-Za-z0-9_-]{35}\b, \bAC[a-f0-9]{32}\b, \bSK[a-f0-9]{32}\b.
    • security
  • packages/api/src/utils/redactSecrets.ts:36 -- SECRET_KEY_TOKENS does not include session-cookie names (session, sid, jsessionid, connect.sid, phpsessid), so a Cookie: or Set-Cookie: line carrying an opaque session token slips through (only the JWT case is rescued via the jwt pattern).
    • Fix: Add session-cookie names to SECRET_KEY_TOKENS and add a Cookie/Set-Cookie pattern that redacts the entire cookie value when a session-shaped name is present.
    • security
  • packages/api/src/utils/redactSecrets.ts:88 -- Once the AKIA… ID is matched, the paired AWS secret access key (40-char base64-ish) has no distinct prefix and only gets redacted when it lands on a SECRET_KEY_TOKENS key-name; camel-case forms like secretAccessKey may slip through.
    • Fix: Add tests for AWS_SECRET_ACCESS_KEY=… and secretAccessKey: …, and consider redacting any 40-char base64 value that follows an AKIA-redacted token within N characters on the same line.
    • security
  • packages/api/src/utils/__tests__/redactSecrets.test.ts:340 -- The pattern coverage test uses expect.arrayContaining([...]), a one-way assertion that misses additions, duplicates, and reordering, even though the source comment at line 44 declares pattern order load-bearing.
    • Fix: Replace with expect(REDACTION_PATTERN_NAMES).toEqual([...]) against the exact ordered list, or generate the expected list from a per-pattern fixture map so adding a pattern without coverage breaks the test.
    • testing, kieran-typescript
  • packages/api/src/utils/__tests__/redactSecrets.test.ts:81 -- Several JSON-quoted and HTTP-header tests assert toContain('[REDACTED]') instead of exact output, so a regex regression that drops surrounding quotes or a colon would still pass.
    • Fix: Use expect(out).toBe(exact) for these single-input deterministic transforms; reserve toContain for embedded-in-context cases.
    • testing
  • packages/api/src/utils/__tests__/redactSecrets.test.ts:181 -- The PEM ReDoS guard uses expect(elapsed).toBeLessThan(500), a brittle wall-clock threshold that flakes on contended CI while a healthy regex stays well under Jest's default timeout.
    • Fix: Either compare elapsed time against a same-runner baseline call, or expand the input to ~1 MB so a genuinely catastrophic regex blows the 5-second Jest timeout while a healthy one passes comfortably.
    • testing
  • packages/api/src/utils/__tests__/redactSecrets.test.ts:1 -- No test asserts redactSecrets(redactSecrets(x)) === redactSecrets(x). The current patterns are idempotent only by happy coincidence (redacted markers contain [/] which every value class excludes), and one char-class change would silently break it.
    • Fix: Add an idempotence assertion across the multi-secret payload fixture and at least one positive case per pattern.
    • testing
  • packages/api/src/utils/__tests__/redactSecrets.test.ts:108 -- No fixture exercises bearer or Basic as English words mid-prose (a bearer of bad news, Basic auth required), so the false-positive findings above are not pinned and cannot regress noisily.
    • Fix: Add at least one negative-case test per pattern asserting the prose passes through unchanged.
    • testing
  • packages/api/src/utils/__tests__/redactSecrets.test.ts:314 -- The multi-secret payload covers basic-auth-url, bearer, jwt, aws, github, slack, and key-value but omits pem, json-quoted, and http-header, so a regression in those three patterns is invisible to the cross-pattern integration test.
    • Fix: Extend the multi-secret fixture to include a JSON-shaped secret, an X-Api-Key: header line, and a small PEM block.
    • testing
  • packages/api/src/utils/__tests__/redactSecrets.test.ts:1 -- The pattern-order invariant documented at redactSecrets.ts:44 (high-confidence shapes run before the key-value catch-all so password=eyJ… preserves the JWT shape) is enforced only by comment.
    • Fix: Add a regression test that pins the result of password=eyJa.b.c so any future reorder breaks loudly.
    • testing
  • packages/api/src/utils/redactSecrets.ts:119 -- json-quoted only matches double-quoted string values and silently leaks "token": 12345, "password": null, and "api_key": true from JSON payloads.
    • Fix: Either widen the value match to (?:"[^"]*"|null|true|false|-?\d+(?:\.\d+)?) or document the gap explicitly in the file's "Known gaps" block.
    • correctness
🔵 P3 nitpicks (12)
  • packages/api/src/utils/redactSecrets.ts:69 -- The bearer value class includes . and /, so Bearer abc123. Also note… redacts the trailing sentence period and Bearer abc/v1/data continue swallows the URL path.
    • Fix: Tighten the trailing boundary or add a \b so trailing punctuation is preserved.
    • adversarial
  • packages/api/src/utils/redactSecrets.ts:88 -- Inserting a default-ignorable code point (U+200B ZWSP, U+200C ZWNJ, U+2060, U+FEFF) inside an AKIA…, gh*_…, xox*-…, eyJ…, or PEM body breaks the strict-ASCII char classes; bypass requires adversarial control of input but yields full secret disclosure.
    • Fix: NFKC-normalize and strip default-ignorable code points before matching, or document that input is assumed ASCII-clean.
    • adversarial
  • packages/api/src/utils/redactSecrets.ts:49 -- When a partial redaction strands a secret fragment (the JWT tail after the bearer-_ halt, the post-escape suffix from the JSON-quoted bug), no later pattern in the array can recognize the orphan.
    • Fix: Run the pattern loop a second time after primary redaction, or compose patterns into one alternation so partial matches are finalized in a single sweep.
    • adversarial
  • packages/api/src/utils/redactSecrets.ts:134 -- 11 sequential String.prototype.replace passes scan the full input each call with no length cap; a 1–5 MB log batch on a hot LLM path serializes ~100–300 ms of event-loop time per call.
    • Fix: Cap input length before redaction (truncate or sample beyond N KB), gate work behind a single indexOf prefix sniff, or compose the patterns into one alternation.
    • performance, adversarial
  • packages/api/src/utils/redactSecrets.ts:130 -- http-header replacement '$1: [REDACTED]' always emits : (colon + space), so Api-Key:abc123 becomes Api-Key: [REDACTED] and non-secret formatting is silently rewritten.
    • Fix: Capture and reuse the original separator, e.g. /(\b…)(\s*:\s*)([^\s,;]+)/ with '$1$2[REDACTED]'.
    • correctness
  • packages/api/src/utils/redactSecrets.ts:16 -- The 12-line pattern enumeration in the header comment duplicates the inline name: fields below, creating comment-drift risk for every pattern addition.
    • Fix: Drop the enumeration; the inline name: strings are self-documenting. Keep the "Design rule" and "Known gaps" paragraphs.
    • maintainability
  • packages/api/src/utils/redactSecrets.ts:127 -- The http-header regex hand-rolls a key vocabulary that overlaps but does not align with SECRET_KEY_TOKENS, so a contributor adding private_key or client_secret to one will reasonably assume the other follows.
    • Fix: Add a one-line comment explaining why the header vocabulary is hand-curated, or rename SECRET_KEY_TOKENS to SECRET_KV_KEYS to scope it to key-value/JSON.
    • maintainability
  • packages/api/src/utils/redactSecrets.ts:144 -- REDACTION_PATTERN_NAMES is exported as mutable string[] despite the comment declaring it test-only; any consumer can .push/.splice it and corrupt module state.
    • Fix: Type as readonly string[], mark @internal via JSDoc, or relocate to a sibling redactSecrets.test-utils.ts.
    • kieran-typescript, maintainability
  • packages/api/src/utils/redactSecrets.ts:49 -- const PATTERNS: RedactionPattern[] widens each name from its string-literal type to string, throwing away the union 'pem' | 'basic-auth-url' | … that callers and tests could otherwise rely on.
    • Fix: Use as const satisfies readonly RedactionPattern[] (or declare a RedactionName literal-union and type name against it).
    • kieran-typescript
  • packages/api/src/utils/redactSecrets.ts:134 -- redactSecrets(input: string) calls String.prototype.replace directly; passing undefined or a non-string through any (realistic when span attributes are non-string shapes) throws TypeError.
    • Fix: Coerce at the boundary with String(input) or guard with if (typeof input !== 'string') return '', depending on desired semantics.
    • maintainability
  • packages/api/src/utils/__tests__/redactSecrets.test.ts:194 -- Tests asserting not.toContain('hunter2') catch over-leak (whole secret remaining) but not under-redaction (e.g. an off-by-one regex bug leaving unter2); paired exact-shape assertions would catch both.
    • Fix: Combine each not.toContain with at least one toBe/toEqual exact-shape assertion on the canonical fixture.
    • testing
  • packages/api/src/utils/__tests__/redactSecrets.test.ts:1 -- No CRLF, multi-line JSON value, length-1 input, "input is exactly the secret", or large-non-PEM-input fixtures, and no generative property test for a regex-heavy module.
    • Fix: Add a CRLF case, an "input equals secret" case per pattern, and a small fast-check property test for redactSecrets(redactSecrets(x)) === redactSecrets(x) and "no original secret survives".
    • testing

Reviewers (10): correctness, security, adversarial, testing, maintainability, project-standards, performance, kieran-typescript, agent-native, learnings-researcher.

Testing gaps:

  • No idempotence assertion — current implementation is idempotent only by happy coincidence.
  • No prose-negative tests for bearer / Basic mid-sentence usage.
  • pem, json-quoted, and http-header are missing from the multi-secret integration fixture.
  • Pattern-ordering invariant lives only in comments, not tests.
  • ReDoS guard threshold is wall-clock (500 ms), not invariant.
  • No URL-percent-encoded basic-auth, oversized-PEM, JSON-non-string-value, CRLF, length-1, or generative coverage.
  • No same-runner baseline / large-input perf test for the non-PEM patterns (e.g. basic-auth-url quadratic-backtracking guard).

@alex-fedotyev
Copy link
Copy Markdown
Contributor Author

Agreed on the scope. The deep-review that fired ~20 minutes after merge confirmed the same point with specifics: the bearer value class omits _ (so a real JWT terminates at the first underscore and the signature tail leaks), the basic-auth-url scheme allowlist misses postgres://, mongodb(+srv)://, mysql://, redis://, etc., and there's no shape for OpenAI / Anthropic API keys (sk-..., sk-ant-...) which is exactly the class of secret most likely to be exfiltrated to the very provider that issued them.

I'll open a follow-up PR for those P0/P1 items and tighten the file's docstring to "LLM-input only; not a general-purpose redactor."

alex-fedotyev added a commit that referenced this pull request May 8, 2026
Deep-review on the merged #2188 surfaced three security/correctness
gaps in the redactor; this PR addresses each.

- `bearer` value class now includes `_`, so a JWT bearer token with
  underscores in its signature ("eyJ...AbC_DeF...") no longer
  terminates at the first underscore and leaks the trailing bytes
  past the [REDACTED] marker. base64url uses both "_" and "-"; the
  alphabet is now consistent with the `jwt` pattern.
- `basic-auth-url` scheme allowlist now covers the database/queue
  connection strings most likely to land in observability payloads
  with embedded credentials: postgres(ql), mysql, mariadb,
  mongodb(+srv), redis(s), amqp(s), kafka, clickhouse. Previously
  these slipped through entirely while http(s)/ftp/ssh were
  redacted.
- New `llm-vendor-key` pattern catches OpenAI ("sk-...") and
  Anthropic ("sk-ant-...") API keys. This redactor specifically
  fronts an LLM-provider call, so a vendor-shape key must not leak
  to the very provider that issued it. Floors at 20 chars after
  the prefix to avoid catching English fragments like "sk-ip" or
  "sk-line".

Docstring now scopes the redactor explicitly to LLM input
("LLM-input only; not a general-purpose secret redactor"), per
fleon's caveat on #2188 and the discovered gaps. Pattern coverage
test, multi-secret integration test, and per-pattern regression
tests cover each new shape plus the JWT-with-underscore regression.

53/53 unit tests pass. No OpenAPI / changeset surface change.
kodiakhq Bot pushed a commit that referenced this pull request May 11, 2026
## Summary

Follow-up to #2188 addressing three P0/P1 findings from the deep-review that fired ~20 minutes after merge.

- **`bearer` value class now includes `_`.** A JWT bearer token with underscores in the signature (base64url uses `_`) terminated at the first underscore, leaking the post-`_` tail past the `[REDACTED]` marker. The class is now `[A-Za-z0-9._~+/=_-]`, consistent with the `jwt` pattern.
- **`basic-auth-url` scheme allowlist extended.** The original allowlist `(https?|ftp|ssh)` missed every database/queue connection string most likely to appear in observability payloads with embedded credentials. Now covers `postgres(ql)`, `mysql`, `mariadb`, `mongodb(+srv)`, `redis(s)`, `amqp(s)`, `kafka`, `clickhouse`.
- **New `llm-vendor-key` pattern.** This redactor specifically fronts an LLM-provider call, so a leaked OpenAI / Anthropic key would be exfiltrated to the very provider that issued it. Pattern is `\bsk-(?:ant-)?[A-Za-z0-9_-]{20,}\b` so it catches OpenAI's 48+ char and Anthropic's longer formats while avoiding English fragments like `sk-ip` / `sk-line`.

Docstring now scopes the redactor explicitly to LLM input ("**LLM-input only.** Do not use as a general-purpose secret redactor"), aligning with [fleon's caveat](#2188 (comment)) on the original PR.

The deferred P2/P3 items from the same deep-review (false-positive on prose like "a bearer of bad news", JSON-quoted escaped quotes, oversized PEM, vendor tokens for Stripe/Twilio/Datadog/GCP, ReDoS bounds on `basic-auth-url`, idempotence assertion, exact-shape vs `toContain`) are intentionally out of scope here. Several reshape behavior in ways that need a fresh look; tracked in #2237.

## Test plan

- [x] `yarn jest src/utils/__tests__/redactSecrets.test.ts` (53/53 passing, including the new JWT-with-underscore regression, postgres / mongodb+srv / mysql / redis / amqp / kafka / clickhouse cases, OpenAI / Anthropic / free-floating `sk-` cases, and the multi-secret integration test extended with an LLM key)
- [x] `yarn workspace @hyperdx/api ci:lint` clean (eslint + tsc + spectral)
- [x] `prose-lint --base origin/main --staged` clean
- [x] Tier predictor reports Tier 2 (1 prod file, 54 prod lines)
@github-actions github-actions Bot mentioned this pull request May 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automerge review/tier-2 Low risk — AI review + quick human skim

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants