feat(telemetry): propagate W3C traceparent + first-party-scoped session id header (#4384)#4390
feat(telemetry): propagate W3C traceparent + first-party-scoped session id header (#4384)#4390doudouOUC wants to merge 14 commits into
Conversation
📋 Review SummaryThis PR implements W3C traceparent propagation for outbound LLM requests by wiring 🔍 General FeedbackPositive aspects:
Architectural decisions:
Concerns:
🎯 Specific Feedback🟢 Medium
🔵 Low
✅ Highlights
|
There was a problem hiding this comment.
Pull request overview
This PR wires OpenTelemetry’s undici/fetch instrumentation into the core telemetry SDK so outbound LLM HTTP calls made via globalThis.fetch (undici) propagate W3C trace context (traceparent) and produce client HTTP spans, while avoiding tracing OTLP exporter uploads to prevent feedback loops. It also adds a design doc and developer documentation describing the outbound trace propagation behavior.
Changes:
- Register
@opentelemetry/instrumentation-undicialongside existingHttpInstrumentationininitializeTelemetry(). - Add
ignoreRequestHooklogic to skip OTLP exporter endpoints and prevent infinite self-tracing loops. - Add unit tests validating both instrumentations are registered and that OTLP endpoints are ignored correctly.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| packages/core/src/telemetry/sdk.ts | Adds UndiciInstrumentation and OTLP endpoint ignore logic to propagate trace context on fetch/undici calls. |
| packages/core/src/telemetry/sdk.test.ts | Adds tests covering instrumentation registration and ignoreRequestHook endpoint matching. |
| packages/core/package.json | Adds @opentelemetry/[email protected] dependency. |
| package-lock.json | Updates lockfile to include the new OTel undici instrumentation dependency and related dependency graph changes. |
| docs/developers/development/telemetry.md | Documents automatic traceparent propagation on outbound LLM fetch requests and the feedback-loop avoidance approach. |
| docs/design/telemetry-outbound-propagation-design.md | Adds the design document covering outbound propagation (part 1 + planned part 2). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Review feedback on #4390: 1. CI was failing on npm ci because the lockfile was generated with npm 11 locally (it sprinkles `peer: true` annotations npm 10 reads differently and rejects). Regenerated with npm 10 (matching CI's Node 22.x default), so the diff vs main is now 18 lines (the actual instrumentation-undici entry) instead of 105 lines of npm-version drift noise. 2. (Copilot inline at sdk.ts:330) `otlpUrlPrefixes` was derived from raw Config strings, so a settings.json `"otlpEndpoint": "\"http://...\""` (quoted) or trailing `#fragment` would silently miss the prefix match and reintroduce the feedback loop the hook exists to prevent. Replaced the regex-based suffix trim with a WHATWG URL parser: - strips ?query, #fragment, trailing slash - trims symmetric ASCII quotes a user may have placed in settings.json - falls back to safe suffix trimming if URL parsing fails (misconfigured endpoint still gets SOME protection) 3. (CodeQL inline) Replaced the `/\?.*$/` regex in ignoreRequestHook with `indexOf('?')`/`indexOf('#')` slicing for ReDoS hygiene. The regex was linear in practice but flagged as polynomial — using indexOf removes the ambiguity and is arguably simpler. Added 3 tests in sdk.test.ts covering the new normalizations (#fragment on incoming path, quoted endpoint, #fragment on configured endpoint). 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
Review feedback on #4390: 1. CI was failing on npm ci because the lockfile was generated with npm 11 locally (it sprinkles `peer: true` annotations npm 10 reads differently and rejects). Regenerated with npm 10 (matching CI's Node 22.x default), so the diff vs main is now 18 lines (the actual instrumentation-undici entry) instead of 105 lines of npm-version drift noise. 2. (Copilot inline at sdk.ts:330) `otlpUrlPrefixes` was derived from raw Config strings, so a settings.json `"otlpEndpoint": "\"http://...\""` (quoted) or trailing `#fragment` would silently miss the prefix match and reintroduce the feedback loop the hook exists to prevent. Replaced the regex-based suffix trim with a WHATWG URL parser: - strips ?query, #fragment, trailing slash - trims symmetric ASCII quotes a user may have placed in settings.json - falls back to safe suffix trimming if URL parsing fails (misconfigured endpoint still gets SOME protection) 3. (CodeQL inline) Replaced the `/\?.*$/` regex in ignoreRequestHook with `indexOf('?')`/`indexOf('#')` slicing for ReDoS hygiene. The regex was linear in practice but flagged as polynomial — using indexOf removes the ambiguity and is arguably simpler. Added 3 tests in sdk.test.ts covering the new normalizations (#fragment on incoming path, quoted endpoint, #fragment on configured endpoint). 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
…uests Part 2 of #4384. Stacks on top of PR #4390 (traceparent via undici). Adds a product-namespaced HTTP header X-Qwen-Code-Session-Id to every outbound LLM request when telemetry is enabled, so server-side ingestion can correlate observed requests with qwen-code session metric/log records. Pattern matched from claude-code (X-Claude-Code-Session-Id, verified at src/services/api/client.ts:108 in their open-source repo). Critical design decision (design doc section 4.3): the OpenAI / Anthropic providers use a per-request fetch wrapper rather than the SDK defaultHeaders option, because content-generator SDK clients are constructed once and NOT recreated on /clear-triggered session resets (Config.resetSession updates this.sessionId but the contentGenerator keeps using the stale header value). Reading config.getSessionId() from inside the wrapper at request time gives the live value. Gemini provider uses static httpOptions.headers — @google/genai HttpOptions interface does not expose a fetch hook (only headers, baseUrl, apiVersion, timeout, extraParams). This is a known limitation: after session reset, Gemini X-Qwen-Code-Session-Id stays stale until the contentGenerator is recreated. Documented in telemetry.md and the design doc section 8.6; spans/logs continue to carry the live session id for trace/log correlation. Lazy-invalidate fix is a follow-up sub-issue. Header is omitted when telemetry is disabled OR when getSessionId returns an empty string (some HTTP middleware rejects empty header values). Integration sites: - packages/core/src/core/openaiContentGenerator/provider/default.ts (base class — automatically covered by deepseek/minimax/mistral/ modelscope/openrouter subclasses; openrouter calls super.buildHeaders) - packages/core/src/core/openaiContentGenerator/provider/dashscope.ts (overrides buildClient — must be touched separately; QwenContentGenerator inherits via this provider) - packages/core/src/core/anthropicContentGenerator/anthropicContentGenerator.ts - packages/core/src/core/geminiContentGenerator/index.ts (factory function, not the GeminiContentGenerator class — no signature change) End-to-end verification (local HTTP server in tmux): PASS: traceparent + X-Qwen-Code-Session-Id on every LLM request PASS: session id refreshes after simulated /clear (staleness regression guarded by llm-correlation-fetch.test.ts) PASS: OTLP upload traffic not traced (no feedback loop — PR A ignoreRequestHook working) Robot generated with Qwen Code https://github.com/QwenLM/qwen-code
…ry safety Adopts 7 review findings from wenshao on #4390 (+ duplicates from now-closed #4393). Critical bugs first, polish second. CRITICAL: 1. tsc TS2322 — wrapper return type incompatible with Anthropic SDK Fetch. `typeof fetch` (Node WHATWG, 2 overloads) is not structurally assignable to Anthropic's narrower `Fetch = (input: RequestInfo, init?) => ...`, even though they're call-compatible at runtime. Make wrapper generic `<TFetch extends FetchLikeLoose>` so callers preserve their exact fetch signature; cast the Anthropic call site through `unknown` with a comment explaining why. 2. tsc TS2352 / TS2493 — `baseFetch.mock.calls[0]![1] as RequestInit` was out-of-bounds when wrapped was called with no init arg. Replaced with a `makeFetchMock()` helper returning typed accessors. 3. normalizeOtlpPrefix catch fallback was DANGEROUS — a config of `"http"` produced prefix `"http"` which `startsWith`-matched every outbound HTTP request → silently disabled ALL instrumentation (no client spans, no correlation header — defeats the entire feature). Fixed: catch returns undefined + diag.warn. Misconfigured endpoint loses its feedback-loop guard (acceptable) instead of disabling all guards (catastrophic). 4. `url.startsWith(prefix)` matching was NOT boundary-safe — port collision (`:4318` matches `:43180`), hostname suffix collision (`otlp.example.com` matches `otlp.example.com.evil.net`), path-segment collision (`/v1` matches `/v1foo/x`). Replaced with origin-equality + path-prefix + boundary-char check (next char must be `/`, `?`, `#`, or end-of-string). 5. HttpInstrumentation also lacked the OTLP feedback-loop guard. The OTLP HTTP exporter (`@opentelemetry/exporter-trace-otlp-http`) uses node:http (patched by HttpInstrumentation, NOT undici). Without this, every OTLP upload batch creates a parasitic client span → feedback loop. Added `ignoreOutgoingRequestHook` that reuses the same `matchesOtlpPrefix` / `stripPathSuffix` helpers as the undici instrumentation. SAFETY: 6. Request input + undefined init dropped the Request's own headers (Authorization etc.) because `new Headers(undefined)` → `{...init, headers}` replaced them with just our session header. Fix: when input is a Request and init.headers is unset, seed from input.headers before adding ours. 7. Wrapped fetch had no try/catch — a throwing Config getter or Headers constructor would propagate as TypeError and break the LLM request path. Wrapped header construction in try/catch; on failure, fall through to baseFetch with original init (no header) + diag.warn. Telemetry must never break the model call. COVERAGE: - 3 new sdk.test.ts boundary tests (port/host/path) - 1 new sdk.test.ts normalizeOtlpPrefix catch-branch coverage - 1 new sdk.test.ts HttpInstrumentation OTLP guard test - 1 new sdk.test.ts proxy-mode wrapped-fetch test (default.test.ts) - 1 new anthropic test asserting wrapped fetch installed on Anthropic SDK - 2 new llm-correlation-fetch.test.ts (Request-headers preservation + try/catch fall-through) All 668 tests pass (1 pre-existing Anthropic User-Agent failure on main is unrelated). tsc clean. Declined: #10 DRY-refactor of baseFetch extraction across 3 sites — the duplication was pre-existing (default/dashscope buildClient was already near-identical), refactoring is a separate cleanup PR not gated by this feature. Will reply on the thread. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
wenshao
left a comment
There was a problem hiding this comment.
All 4 findings from the previous review round are resolved in 00434502a:
- Critical:
normalizeOtlpPrefixcatch block now returnsundefined+diag.warn— no more dangerous"http"fallback - Suggestion:
startsWithreplaced withmatchesOtlpPrefix(origin-exact + path-boundary check) - Suggestion:
HttpInstrumentationnow hasignoreOutgoingRequestHookfor OTLP feedback-loop guard - Suggestion: Comprehensive test coverage added (boundary tests, Anthropic fetch wrapper test, catch-branch test)
tsc clean (0 errors), eslint clean, 215/215 tests pass locally. LGTM ✅
Downgraded from Approve to Comment: CI still running.
— qwen-latest-series-invite-beta-v34 via Qwen Code /review
Part 1 of #4384 (sub-issue of #3731 P3 deeper observability). Today qwen-code's only OTel instrumentation is `HttpInstrumentation`, which only patches Node's `http`/`https` modules. The `openai` and `@google/genai` SDKs use `globalThis.fetch` (undici), so outbound LLM requests carry no `traceparent` header and trace context dies at the qwen-code process boundary. Adds `@opentelemetry/[email protected]` (peer-compatible with the installed `@opentelemetry/[email protected]`) and wires it into `initializeTelemetry()` next to the existing `HttpInstrumentation`. Default propagator (W3C tracecontext + baggage) remains unchanged — no explicit `textMapPropagator` needed. `ignoreRequestHook` skips OTLP exporter endpoints to avoid the classic feedback loop (OTel SDK uses fetch to upload OTLP data; without the hook each upload would create a span that gets uploaded, infinitely). Configured `otlpEndpoint` / per-signal endpoints are stripped of trailing slash and query string for robust prefix matching against undici's `request.origin + request.path`. Outbound LLM calls now also produce a client-side HTTP span (separating network TTFB / transfer time from the existing `api.generateContent` total-duration span). Design doc: docs/design/telemetry-outbound-propagation-design.md (Part A — traceparent; Part B — session id header — lands in a follow-up PR per the design's split rationale.) 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
Review feedback on #4390: 1. CI was failing on npm ci because the lockfile was generated with npm 11 locally (it sprinkles `peer: true` annotations npm 10 reads differently and rejects). Regenerated with npm 10 (matching CI's Node 22.x default), so the diff vs main is now 18 lines (the actual instrumentation-undici entry) instead of 105 lines of npm-version drift noise. 2. (Copilot inline at sdk.ts:330) `otlpUrlPrefixes` was derived from raw Config strings, so a settings.json `"otlpEndpoint": "\"http://...\""` (quoted) or trailing `#fragment` would silently miss the prefix match and reintroduce the feedback loop the hook exists to prevent. Replaced the regex-based suffix trim with a WHATWG URL parser: - strips ?query, #fragment, trailing slash - trims symmetric ASCII quotes a user may have placed in settings.json - falls back to safe suffix trimming if URL parsing fails (misconfigured endpoint still gets SOME protection) 3. (CodeQL inline) Replaced the `/\?.*$/` regex in ignoreRequestHook with `indexOf('?')`/`indexOf('#')` slicing for ReDoS hygiene. The regex was linear in practice but flagged as polynomial — using indexOf removes the ambiguity and is arguably simpler. Added 3 tests in sdk.test.ts covering the new normalizations (#fragment on incoming path, quoted endpoint, #fragment on configured endpoint). 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
…uests Part 2 of #4384. Stacks on top of PR #4390 (traceparent via undici). Adds a product-namespaced HTTP header X-Qwen-Code-Session-Id to every outbound LLM request when telemetry is enabled, so server-side ingestion can correlate observed requests with qwen-code session metric/log records. Pattern matched from claude-code (X-Claude-Code-Session-Id, verified at src/services/api/client.ts:108 in their open-source repo). Critical design decision (design doc section 4.3): the OpenAI / Anthropic providers use a per-request fetch wrapper rather than the SDK defaultHeaders option, because content-generator SDK clients are constructed once and NOT recreated on /clear-triggered session resets (Config.resetSession updates this.sessionId but the contentGenerator keeps using the stale header value). Reading config.getSessionId() from inside the wrapper at request time gives the live value. Gemini provider uses static httpOptions.headers — @google/genai HttpOptions interface does not expose a fetch hook (only headers, baseUrl, apiVersion, timeout, extraParams). This is a known limitation: after session reset, Gemini X-Qwen-Code-Session-Id stays stale until the contentGenerator is recreated. Documented in telemetry.md and the design doc section 8.6; spans/logs continue to carry the live session id for trace/log correlation. Lazy-invalidate fix is a follow-up sub-issue. Header is omitted when telemetry is disabled OR when getSessionId returns an empty string (some HTTP middleware rejects empty header values). Integration sites: - packages/core/src/core/openaiContentGenerator/provider/default.ts (base class — automatically covered by deepseek/minimax/mistral/ modelscope/openrouter subclasses; openrouter calls super.buildHeaders) - packages/core/src/core/openaiContentGenerator/provider/dashscope.ts (overrides buildClient — must be touched separately; QwenContentGenerator inherits via this provider) - packages/core/src/core/anthropicContentGenerator/anthropicContentGenerator.ts - packages/core/src/core/geminiContentGenerator/index.ts (factory function, not the GeminiContentGenerator class — no signature change) End-to-end verification (local HTTP server in tmux): PASS: traceparent + X-Qwen-Code-Session-Id on every LLM request PASS: session id refreshes after simulated /clear (staleness regression guarded by llm-correlation-fetch.test.ts) PASS: OTLP upload traffic not traced (no feedback loop — PR A ignoreRequestHook working) Robot generated with Qwen Code https://github.com/QwenLM/qwen-code
…ry safety Adopts 7 review findings from wenshao on #4390 (+ duplicates from now-closed #4393). Critical bugs first, polish second. CRITICAL: 1. tsc TS2322 — wrapper return type incompatible with Anthropic SDK Fetch. `typeof fetch` (Node WHATWG, 2 overloads) is not structurally assignable to Anthropic's narrower `Fetch = (input: RequestInfo, init?) => ...`, even though they're call-compatible at runtime. Make wrapper generic `<TFetch extends FetchLikeLoose>` so callers preserve their exact fetch signature; cast the Anthropic call site through `unknown` with a comment explaining why. 2. tsc TS2352 / TS2493 — `baseFetch.mock.calls[0]![1] as RequestInit` was out-of-bounds when wrapped was called with no init arg. Replaced with a `makeFetchMock()` helper returning typed accessors. 3. normalizeOtlpPrefix catch fallback was DANGEROUS — a config of `"http"` produced prefix `"http"` which `startsWith`-matched every outbound HTTP request → silently disabled ALL instrumentation (no client spans, no correlation header — defeats the entire feature). Fixed: catch returns undefined + diag.warn. Misconfigured endpoint loses its feedback-loop guard (acceptable) instead of disabling all guards (catastrophic). 4. `url.startsWith(prefix)` matching was NOT boundary-safe — port collision (`:4318` matches `:43180`), hostname suffix collision (`otlp.example.com` matches `otlp.example.com.evil.net`), path-segment collision (`/v1` matches `/v1foo/x`). Replaced with origin-equality + path-prefix + boundary-char check (next char must be `/`, `?`, `#`, or end-of-string). 5. HttpInstrumentation also lacked the OTLP feedback-loop guard. The OTLP HTTP exporter (`@opentelemetry/exporter-trace-otlp-http`) uses node:http (patched by HttpInstrumentation, NOT undici). Without this, every OTLP upload batch creates a parasitic client span → feedback loop. Added `ignoreOutgoingRequestHook` that reuses the same `matchesOtlpPrefix` / `stripPathSuffix` helpers as the undici instrumentation. SAFETY: 6. Request input + undefined init dropped the Request's own headers (Authorization etc.) because `new Headers(undefined)` → `{...init, headers}` replaced them with just our session header. Fix: when input is a Request and init.headers is unset, seed from input.headers before adding ours. 7. Wrapped fetch had no try/catch — a throwing Config getter or Headers constructor would propagate as TypeError and break the LLM request path. Wrapped header construction in try/catch; on failure, fall through to baseFetch with original init (no header) + diag.warn. Telemetry must never break the model call. COVERAGE: - 3 new sdk.test.ts boundary tests (port/host/path) - 1 new sdk.test.ts normalizeOtlpPrefix catch-branch coverage - 1 new sdk.test.ts HttpInstrumentation OTLP guard test - 1 new sdk.test.ts proxy-mode wrapped-fetch test (default.test.ts) - 1 new anthropic test asserting wrapped fetch installed on Anthropic SDK - 2 new llm-correlation-fetch.test.ts (Request-headers preservation + try/catch fall-through) All 668 tests pass (1 pre-existing Anthropic User-Agent failure on main is unrelated). tsc clean. Declined: #10 DRY-refactor of baseFetch extraction across 3 sites — the duplication was pre-existing (default/dashscope buildClient was already near-identical), refactoring is a separate cleanup PR not gated by this feature. Will reply on the thread. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
wenshao
left a comment
There was a problem hiding this comment.
No review findings. Downgraded from Approve to Comment: CI failing: Test (macos-latest, Node 22.x), Test (ubuntu-latest, Node 22.x), Test (windows-latest, Node 22.x). — gpt-5.5 via Qwen Code /review
0043450 to
a1a8a5a
Compare
Local maintainer validation —
|
| Stage | Command | Result |
|---|---|---|
| Install | npm ci |
✅ exit 0 |
| Build | npm run build |
✅ exit 0 (15 unrelated curly warnings in vscode-ide-companion) |
| Typecheck | npm run typecheck |
✅ exit 0 across all 5 workspaces |
| Lint | npm run lint |
✅ exit 0 |
| PR-touched core tests (6 files: sdk, llm-correlation-fetch, default/dashscope providers, gemini/anthropic generators) | npx vitest run ... |
✅ 214 / 215 (1 pre-existing UA env failure — see Triage) |
Full packages/core suite |
cd packages/core && npx vitest run |
|
Full packages/cli suite |
cd packages/cli && npx vitest run |
✅ 6779 passed / 9 skipped / 0 failed across 377 files |
| End-to-end against built artifact | custom Node harness driving dist/.../llm-correlation-fetch.js |
✅ 14 happy-path assertions pass; |
🚨 PR-caused regression
src/core/contentGenerator.test.ts — 2 of 4 tests fail with:
TypeError: config.getTelemetryEnabled is not a function
❯ staticCorrelationHeaders src/telemetry/llm-correlation-fetch.ts:121:15
❯ createGeminiContentGenerator src/core/geminiContentGenerator/index.ts:48:30
❯ Module.createContentGenerator src/core/contentGenerator.ts:370:21
❯ src/core/contentGenerator.test.ts:31:23
Verified PR-caused (not pre-existing or environmental):
- ✅ Reproduces on PR head
a1a8a5ac—2 failed | 2 passed (4) - ✅ Passes on
main(qwen-code-x3at16f0fde19) —4 passed (4) - ✅ Same failure surface on GitHub Actions Ubuntu (
Test (ubuntu-latest, Node 22.x)—2 failed | 9261 passed)
Root cause (diagnosed by e2e harness item [8]):
staticCorrelationHeaders(config) at llm-correlation-fetch.ts:119-127 calls config.getTelemetryEnabled() synchronously without defensive wrapping:
export function staticCorrelationHeaders(config: Config): Record<string, string> {
if (!config.getTelemetryEnabled()) return {}; // ← throws on legacy mocks
...
}But the sibling wrapFetchWithCorrelation at llm-correlation-fetch.ts:73-100 does wrap the same call in try/catch, with this comment (line 63):
Telemetry must never break the LLM request path — if Config getters throw, header construction fails, etc., we still want the model call to proceed.
The contentGenerator.test.ts mock (line 22 / 60) predates this PR and doesn't stub getTelemetryEnabled. The Gemini factory function (geminiContentGenerator/index.ts:48) now calls staticCorrelationHeaders on every construction, exposing the missing method.
Required follow-up — pick one:
- Defensive fix in
staticCorrelationHeaders(consistent with the wrap function's stated contract — recommended):export function staticCorrelationHeaders(config: Config): Record<string, string> { try { if (!config.getTelemetryEnabled()) return {}; const sid = config.getSessionId(); if (!sid) return {}; return { [SESSION_ID_HEADER]: sid }; } catch { return {}; // telemetry must never break the LLM request path } }
- Fix the test mocks in
contentGenerator.test.ts(addgetTelemetryEnabled: () => falseto both mockConfig blocks). Mechanical, but leaves other downstream Config-mock consumers vulnerable to the same trap.
Triage of the other 20 core failures (NOT caused by this PR)
| File | Fails | Cause | Verified |
|---|---|---|---|
src/skills/skill-manager.test.ts |
2 | .qwen path fixture bug |
Reproduces on PR bases of #4314 / #4345 / #4354 / #4386 — every .qwen/tmp/ worktree |
src/core/anthropicContentGenerator/anthropicContentGenerator.test.ts |
1 | Claude Code UA injection (claude-cli/1.2.3 (external, cli) overrides QwenCode/1.2.3) |
Reproduces on qwen-code-x3 main and every prior PR validation |
src/utils/gitDiff.test.ts |
12 | 5 s testTimeout under parallel load (core + cli running together saturates IO/CPU) |
Same pattern as PR 4345 validation — passes in isolation |
src/utils/filesearch/crawler.test.ts |
5 | Same — git ls-files / ripgrep subprocesses time out under parallel load |
Same pattern as PR 4345 — passes in isolation |
End-to-end proof against built packages/core/dist/.../llm-correlation-fetch.js
Drove the compiled wrapFetchWithCorrelation + staticCorrelationHeaders and asserted:
[0] SESSION_ID_HEADER = "X-Qwen-Code-Session-Id" (matches PR spec)
[1] wrapFetchWithCorrelation: telemetry ON adds session header
✓ request carried X-Qwen-Code-Session-Id: sess-abc
✓ Authorization header preserved
[2] wrapFetchWithCorrelation: telemetry OFF skips header
✓ no X-Qwen-Code-Session-Id header sent when telemetry off
[3] CRITICAL: sessionId is read at request time (not captured at construction)
✓ first=sess-initial, second=sess-after-clear — staleness regression NOT present
[4] Empty session id → header omitted (no empty value)
[5] Request input: Authorization preserved when init.headers undefined
[6] Defensive: throwing Config.getTelemetryEnabled does not break request (wrap path)
[7] staticCorrelationHeaders contract (Gemini fallback path) — happy path holds
[8] staticCorrelationHeaders defensiveness ⚠ NOT defensive — reproduces Ubuntu CI fail
[3] is the critical design property the PR description calls out — confirmed: wrapFetchWithCorrelation reads config.getSessionId() at request time, so after a simulated /clear (mutating sessionId on the same Config object), subsequent requests carry the new session id without the SDK being rebuilt.
[6] confirms the wrap path correctly absorbs a throwing Config.getTelemetryEnabled and still sends the request — exactly what the [8] gap should mirror.
Scope / risk
- Diff: +1709 / −4 across 16 files (2 docs + 4 new/edited core source + 7 core tests + 1 new test file + 1 dep added).
- New dep:
@opentelemetry/[email protected](peer-compatible with the existing@opentelemetry/[email protected]per the PR description). - The fetch-wrapper-vs-defaultHeaders choice is sound and well-defended in the PR body —
/clearstaleness is a real concern and this PR is the right shape to address it. Gemini fallback tohttpOptions.headersis documented and a follow-up sub-issue is tracked. - OTLP feedback-loop avoidance via
ignoreRequestHook(with WHATWGURLnormalization addressing the Copilot-flagged quoted-endpoint case) is well-handled.
Reviewer recommendation
Hold for a one-block fix. The header-propagation core mechanics are correct and the staleness regression is closed. The only blocking issue is the contentGenerator.test.ts breakage, which has a 1-block fix (either in staticCorrelationHeaders or in the test mocks). Once that lands:
- Local full core suite goes from
22 failed→20 failed(all 20 are the same environmental issues observed on every recent PR validation; they don't affect CI on clean Ubuntu/macOS/Windows runners) - Ubuntu CI should turn green (it only failed on the 2 PR-caused tests)
- macOS CI was still running at validation time; should also pass once the fix lands
The defensive try/catch fix in staticCorrelationHeaders is recommended because it aligns with the stated "telemetry must never break the LLM request path" contract that already governs wrapFetchWithCorrelation, and it future-proofs against the same trap recurring in other Config-mock-using tests.
— Maintainer local validation, run on a1a8a5ac from upstream pull/4390/head.
wenshao
left a comment
There was a problem hiding this comment.
— qwen-latest-series-invite-beta-v36 via Qwen Code /review
wenshao
left a comment
There was a problem hiding this comment.
No review findings. Downgraded from Approve to Comment: CI failing: Test (windows-latest, Node 22.x), Test (ubuntu-latest, Node 22.x), Test (macos-latest, Node 22.x). — gpt-5.5 via Qwen Code /review
…ndici Switch from exact pin `0.14.0` to `^0.14.0` for consistency with the rest of the `@opentelemetry/*` deps in this block (all carated). For 0.x semver, npm treats `^0.14.0` as `>=0.14.0 <0.15.0`, so patch updates within the 0.14.x line — which are tied to the same `@opentelemetry/[email protected]` peer — flow in via `npm update` without requiring a manual package.json edit. A bump across the 0.x minor (e.g. 0.15.x) would shift the instrumentation peer compatibility and still requires explicit attention, which the caret correctly blocks. Per review feedback on #4390 (wenshao). 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
PR 4390 本地 tmux 验证报告PR: #4390 feat(telemetry): propagate W3C traceparent + X-Qwen-Code-Session-Id to outbound LLM requests 1. 总体结论
合并建议:📋 待修一个小问题再合(详见 §5 "建议动作")。整体方向、测试覆盖、E2E 行为都过关;缺的只是 2. 验证方法在 tmux 会话
关键方法学:所有 "test-core 失败" 都在独立 worktree 3. PR 自身行为验证3.1 PR 新增/修改的 9 个测试文件(隔离运行)3.2 E2E 端到端行为验证(驱动已构建产物)脚本 亮点:[3] 是 PR 设计文档 §4.3 的核心 — 之所以不能用 4. 失败逐项核对(基线 vs PR)测试运行 PR 时报 23 个失败,逐项在 baseline (v0.16.0) 上做同样隔离测试,分类如下: 4.1
|
| 验证 | 结果 |
|---|---|
| baseline (v0.16.0) 跑同一子集 | crawler 44/44 ✓,gitDiff 59/59 ✓ |
| PR worktree 隔离跑同一子集 | crawler 44/44 ✓,gitDiff 59/59 ✓ |
| PR worktree 全套并发跑 | crawler 5 / gitDiff 12 timeout |
根因:这些测试用了真实 git fork + 文件 IO,5s timeout 在多 worker 并发争抢时偶发超时;与 PR telemetry 改动无关。
建议:可考虑给 crawler / gitDiff 加 testTimeout: 15000,但是独立工程问题。
5. 建议动作
5.1 合并前(阻断性 1 项)
- 修
staticCorrelationHeaders防御性:按 §4.1 推荐方案给函数加 try/catch +diag.warn,保持与wrapFetchWithCorrelation的契约对称。改动约 3-5 行 + 加 1 个单测覆盖(构造一个无getTelemetryEnabled的 Config 传入,断言返回{})。
5.2 合并后(跟踪性 3 项,可另开 issue)
- anthropic
treats unset baseURLUA 测试基线失败(与 revert(core): undo x-api-key + Authorization double-emit (#4342) — regresses IdeaLab-style proxies #4385 revert 可能有关) -
skill-manager.test.tsmock 对.qwen路径过度敏感 -
crawler/gitDiff测试 timeout 容易在并发下 flake
5.3 不影响合并、值得点赞
llm-correlation-fetch.ts行 33-36 的FetchLikeLoose+ 行 64 的<TFetch extends FetchLikeLoose>泛型设计,把 openai / anthropic / @google/genai 三个 SDK 各异的 Fetch 类型差异完美隔离- 设计文档 §8.6 明确写了 Gemini staleness 的已知短板和未来子任务,技术诚实
- OTLP 反馈环 boundary-safe 匹配(
origin精确比 +pathname路径边界)超出常规防护
6. 复现指引
# 进入 tmux 会话
tmux attach -t pr4390
# 关键命令
tmux capture-pane -t pr4390:e2e -p -S -200 # E2E 输出
tmux capture-pane -t pr4390:pr-tests -p # PR 自测试输出
tmux capture-pane -t pr4390:baseline -p -S -100 # baseline 对照输出
# 重跑 E2E 脚本
node /tmp/pr4390-e2e-correlation.mjs报告由 Claude Opus 4.7 在本地 tmux 上完整验证,作为维护者 merge 决策参考。
LaZzyMan
left a comment
There was a problem hiding this comment.
Review
I need to escalate this from my earlier read. After tracing the actual data path, I no longer think this is a "missing opt-out knob" concern — I think this PR is mislabeled and the labeling is what hides the real issue.
The reframing
This PR is filed as feat(telemetry):, the code lives under packages/core/src/telemetry/, and the behavior is gated on config.getTelemetryEnabled(). Those three signals strongly imply "this is about the user's observability data going to the user's collector." It isn't.
The headers this PR adds — X-Qwen-Code-Session-Id and traceparent — never reach a telemetry backend. They are written into the LLM API request itself, which goes to DashScope / OpenAI / Anthropic / Gemini / OpenRouter / MiniMax / ModelScope / Mistral. The OTel SDK and the OTLP exporter never see these headers; the user's own collector receives the same trace data with or without this change.
So a user who reads telemetry.enabled: true reasonably believes they're authorizing data flow into their own observability infrastructure. What they actually authorize, after this PR lands, is broadcasting a stable client-side identifier to every third-party LLM provider they have configured. Those are two different data-egress decisions with two different recipients, and bundling them under one switch is the problem.
Two distinct issues fall out of this:
1. X-Qwen-Code-Session-Id exfiltrates a stable cross-request fingerprint to every LLM provider (severity: high · confidence: very high)
Today, an LLM provider sees the prompt, the model name, and the auth header — no persistent client identifier. With this PR, every outbound request additionally carries a stable session-scoped UUID that doesn't change until /clear. Any of the configured providers can now join all requests from one qwen-code session into a behavioral profile, correlate that profile across the lifetime of the session, and potentially join across providers if two providers ever cooperate. None of this is necessary for the API calls to succeed — it's a wire-level identification signal that benefits the LLM provider, not the qwen-code user.
The reference to X-Claude-Code-Session-Id in the design doesn't transfer. claude-code is Anthropic's first-party client sending an identifier to Anthropic's first-party backend — one vendor, one direction. qwen-code is an open-source client that connects to many third-party providers; broadcasting the same stable identifier to all of them is qualitatively different.
2. traceparent carries the same fingerprint via a different channel (severity: medium · confidence: very high)
The trace id is derived as sha256(sessionId).slice(0, 32), so the traceparent header is effectively a one-way hash of the same stable identifier going to the same recipients. The W3C semantics give it cover, but the cross-vendor trace-continuation use case (LLM provider also reporting into the user's collector, e.g. ARMS+DashScope enterprise integration) is niche. For most users, putting traceparent on the wire delivers no value to them and the same fingerprinting capability to providers.
What I'm asking for
@wenshao — recommending re-evaluation in light of the reframing above. The substantive question is a policy one and I don't think it should be decided inside this PR's diff. Three constraints I'd ask be considered:
- Should an open-source qwen-code default to sending any stable client identifier to third-party LLM providers? My read is no — defaults should leak nothing beyond what the API call itself requires.
- If a session-correlation header is genuinely valuable for support / debugging, the right shape is a per-request random UUID (analogous to claude-code's separate
x-client-request-id), not a session-scoped fingerprint. traceparentpropagation to upstream LLM providers should be a separate opt-in from telemetry going to the user's own collector — they serve different recipients.
I'd suggest holding this PR pending that policy discussion. Tagging scope/data-privacy and status/on-hold.
Verdict
REQUEST_CHANGES — this is mislabeled as a telemetry feature. Its actual effect is sending stable client-identifying headers to third-party LLM providers on every API call, gated under a switch the user reasonably reads as "my own observability." That's a data-egress decision that needs explicit, separate consent. Recommend not merging in the current shape until the policy is settled.
wenshao
left a comment
There was a problem hiding this comment.
Down to 1 inline suggestion from this round — all prior R1–R4 findings remain resolved. LGTM once CI turns green.
— qwen-latest-series-invite-beta-v34 via Qwen Code /review
…ation The comment described the header-seeding logic as merging "BOTH the init.headers AND the Request's own headers", but the two branches are mutually exclusive — `new Headers(init?.headers)` runs unconditionally (empty Headers when init.headers is undefined), and the Request-headers copy only runs when init.headers is undefined. So in practice it's either-or, not BOTH. Reworded to match the actual logic per #4390 review feedback (wenshao). Behavior unchanged. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
|
Question on the
For this kind of convenience header, the project already has a |
…cope Two issues found by wenshao reviewing the R3 boundary-safety fixes on PR #4390. 1. **`req.host` may already include `:port`** (sdk.ts ignoreOutgoingRequestHook): When `req.hostname` is absent and `req.host` is the fallback, the value may already be `"collector:4318"`. Naively appending `:${req.port}` produced `"http://collector:4318:4318"` → `new URL()` rejects → catch returns false → silent guard bypass for that request. Currently unreachable because `@opentelemetry/otlp-exporter-base` always sets `hostname` from WHATWG URL parsing, but the fallback exists in the code and must be correct — a future OTLP transport that emits `host` without `hostname` would silently trigger the feedback loop. Strip the port when falling back; bracketed IPv6 literals like `"[::1]:443"` keep their bracketed host intact. 2. **Undici scope honesty** (telemetry.md): Previous docs framed the propagation as "outbound LLM requests", but `UndiciInstrumentation` actually patches `globalThis.fetch` for the whole process — `WebFetch`, MCP clients, IDE extension calls all get spans + `traceparent` injection too. Added a "Scope: all fetch() calls, not just LLM" subsection covering: (a) trace ID leakage to third-party URLs (the user-supplied destinations of `WebFetch` see our trace ID; not secret per W3C but worth knowing); (b) non-LLM span volume inflating OTLP batches with a workaround tip. Per-destination scoping toggle deferred as a follow-up — out of scope for this PR. Added regression test for the host:port-fallback path. Test exercises the previously broken combination (hostname absent, host carries port) through the existing test harness. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
… default Address LaZzyMan's REQUEST_CHANGES review of PR #4390. The original design injected `X-Qwen-Code-Session-Id` on every outbound LLM request gated only by `telemetry.enabled`. Review caught that this broadcasts a stable cross-request client identifier to every configured third-party provider (OpenAI, Anthropic, OpenRouter, MiniMax, ModelScope, Mistral, vanilla Gemini, ...), which the claude-code precedent does NOT justify — claude-code is a first-party Anthropic→Anthropic flow; qwen-code is an open-source CLI connecting to many providers. Fix: add a host allowlist with a deliberately narrow default. The header is now only attached to destinations whose hostname matches: dashscope.aliyuncs.com dashscope-intl.aliyuncs.com *.dashscope.aliyuncs.com *.dashscope-intl.aliyuncs.com *.alibaba-inc.com *.aliyun-inc.com This is exactly the set where the LLM provider, the upstream telemetry backend (ARMS Tracing), and qwen-code itself are the same legal entity — mirroring the first-party claude-code pattern and preserving the real product value (server-side trace stitching against DashScope) without exposing the session id to third parties. Operators with broader correlation requirements override via: "telemetry": { "sessionIdHeaderHosts": ["*"] // restore broadcast "sessionIdHeaderHosts": [] // fully disable "sessionIdHeaderHosts": ["api.example.com", "*.foo"] // custom allowlist } Implementation: - NEW `telemetry/trusted-llm-hosts.ts`: `DEFAULT_SESSION_ID_HEADER_HOSTS` + `matchesTrustedHost(hostname, patterns)` + `extractRequestHost(input)`. Pattern syntax is intentionally tiny (bare hostname OR `*.suffix`, dot-anchored to reject `evil-alibaba-inc.com` style attacks). Unit-tested in dedicated test file including TLD/sub-domain attack vectors. - `wrapFetchWithCorrelation` (openai + anthropic providers): resolves the allowlist at wrap time (Config snapshot), inspects each request's destination URL inside `correlationFetch`, falls through to baseFetch for non-trusted destinations. Wildcard escape hatch via `["*"]`. - `staticCorrelationHeaders` (Gemini factory): now takes an optional `destinationUrl` and applies the same host gate. The Gemini SDK default endpoint `generativelanguage.googleapis.com` is NOT on the default allowlist, so vanilla Gemini calls receive no header — matching the "first-party only" scope. Operators who put the Gemini SDK on a DashScope-compatible endpoint via `baseUrl` get the header naturally. - `Config.getTelemetrySessionIdHeaderHosts()` getter + `TelemetrySettings.sessionIdHeaderHosts` interface field + JSON schema entry in `settingsSchema.ts`. Wired through `resolveTelemetrySettings`. - Defensive optional-chaining + try/catch on the Config getter call at wrap time so partial test mocks (or pre-getter Config implementations) fall back to the default allowlist rather than crashing buildClient. Tests: 12 new cases covering host match/skip on default allowlist, sub-domain handling, TLD-suffix attack rejection, `["*"]` broadcast override, `[]` full-disable, custom operator allowlist, unparseable destination (fail closed), and the three Gemini factory paths (googleapis.com default → omit; DashScope `baseUrl` → inject; custom allowlist → inject). Docs updated in `docs/developers/development/telemetry.md` Session correlation header section, including override examples and the new Gemini host-gate semantics. Closes the LaZzyMan REQUEST_CHANGES blocker. The cross-vendor fingerprint-broadcast failure mode is now opt-in rather than default, restoring the first-party-only semantics that make the claude-code precedent applicable. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
|
@LaZzyMan — thank you for the deep review. The reframing is correct and the strongest part of your critique is this:
That's accurate. The fix in What changedThe Requests to OpenAI / Anthropic / OpenRouter / MiniMax / ModelScope / Mistral / DeepSeek / vanilla Gemini receive no header under the default config. The "broadcast stable identifier to every configured third-party provider" failure mode is now opt-in via Operator override knobs in "telemetry": {
"sessionIdHeaderHosts": ["*"] // restore broadcast
"sessionIdHeaderHosts": [] // fully disable
"sessionIdHeaderHosts": ["api.mycompany.com",
"*.gateway.mycompany.internal"] // custom allowlist
}Mapping to your three constraints
Addressed. The default scope contains no third-party LLM providers — every host on the allowlist is an Alibaba-affiliated endpoint where the LLM provider, the upstream telemetry backend (ARMS Tracing), and the qwen-code distribution are the same legal entity. Third-party calls go out unchanged from how they did before this PR.
Acknowledged —
I'm asking for relief on this one within this PR scope. Implementation summary
Could you take another pass? Happy to iterate further if the default allowlist needs adjustment or if you want to push harder on the |
…docs Self-review pass on commit 1c8528a (host-scoped session-id header): 1. **Vertex AI destination guessing** (geminiContentGenerator/index.ts) `@google/genai` routes to `{region}-aiplatform.googleapis.com` (not `generativelanguage.googleapis.com`) when `vertexai: true` and no `baseUrl`. The previous "guess generativelanguage" default would have mis-bucketed Vertex traffic under any operator-supplied allowlist that covered the public Gemini endpoint but not the Vertex one. Today invisible (both off the default allowlist), but a latent gotcha for operators tuning `telemetry.sessionIdHeaderHosts`. Fix: pass `undefined` when `config.baseUrl` is unset (fail-closed — no header). Operators who want correlation against Google endpoints must set `baseUrl` explicitly, which is also the SDK's input for destination resolution. 2. **`["*"]` broadcast escape hatch tolerates whitespace** (llm-correlation-fetch.ts) `[" * "]` (a settings.json hand-edit with a stray space) previously silently fell back to "no host matches" — the opposite of operator intent. Now `.trim()` before comparing, so common whitespace mistakes still trigger broadcast. 3. **Doc note on wrap-time allowlist snapshot** (llm-correlation-fetch.ts JSDoc) The session id is read live per-request, but `trustedHosts` is snapshotted once at `wrapFetchWithCorrelation` call time. Spell this out in the JSDoc so a future maintainer doesn't read the live `getSessionId()` and assume the allowlist is the same shape. 4. **Defensive test coverage** (trusted-llm-hosts.test.ts, llm-correlation-fetch.test.ts) Added: extractRequestHost with explicit port / userinfo / query / fragment / IPv6 bracket form. Whitespace `[" * "]` broadcast test. IPv6 case documents the "bracketed → never matches" behavior is intentional fail-closed for the named-host allowlist scope. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
Lint check `Check settings schema is up-to-date` failed because the checked-in `packages/vscode-ide-companion/schemas/settings.schema.json` wasn't regenerated after adding `telemetry.sessionIdHeaderHosts` to `settingsSchema.ts` in commit 1c8528a. Regenerated via `npm run generate:settings-schema`. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
…t-allowlist scoping Adds a "修订历史" header table at the top and a new §11 "R3 修订 — Host-Allowlist Scoping for X-Qwen-Code-Session-Id" capturing what changed after LaZzyMan's REQUEST_CHANGES review, why, and how. Inline pointers added at §3.1, §3.2, §4.3, §4.4, §9 (claude-code comparison table) to point readers at §11 — original prose preserved as a record of the decision path rather than rewritten in place. Concretely §11 covers: - The three-step LazzyMan critique and why R1's "broadcast to all providers" was structurally wrong for an open-source multi-provider CLI - The default allowlist (`DEFAULT_SESSION_ID_HEADER_HOSTS`) and its semantic alignment with the DashScope provider detector - Pattern grammar (bare hostname / `*.suffix` dot-anchored), the TLD-suffix attack vectors it rejects, why no regex / port-aware globbing - `wrapFetchWithCorrelation` host gate, wrap-time vs request-time semantics, `[" * "]` whitespace tolerance - `staticCorrelationHeaders` `destinationUrl` parameter, Gemini factory's fail-closed treatment of unset `baseUrl` (avoids the Vertex vs `generativelanguage.googleapis.com` ambiguity) - All R3 file changes mapped to the original §5 file-change list - Mapping of LazzyMan's three concerns to R3's responses - §10 future-work additions: `traceparent` per-destination toggle, `X-Qwen-Code-Request-Id`, IPv6 allowlist syntax No code changes; documentation only. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
wenshao
left a comment
There was a problem hiding this comment.
No review findings. Downgraded from Approve to Comment: CI still running. — gpt-5.5 via Qwen Code /review
Three issues found by wenshao reviewing the R3 host-allowlist scoping.
1. **[Critical] `broadcastAll` outside safety try/catch**
(llm-correlation-fetch.ts wrapFetchWithCorrelation)
The try/catch only fires when `getTelemetrySessionIdHeaderHosts()`
throws. If it returns a malformed value — a bare string (settings.json
typo `"sessionIdHeaderHosts": "host"` instead of `["host"]`), an array
containing `null`/`undefined`/number entries, or whitespace-padded
entries — `.some((p) => p.trim() === '*')` throws TypeError at
buildClient time, bricking the LLM session before the first prompt.
`staticCorrelationHeaders` already handled this via its end-to-end
try/catch but the sister helper diverged. Settings loader does no
runtime schema validation so this is reachable via a single typo.
Fix: normalize the allowlist at wrap time:
1. catch a throwing getter (existing)
2. reject non-array → default allowlist (NEW — bare string typo)
3. filter out non-string elements (NEW — [null, ...] typo)
4. trim every surviving entry uniformly (NEW — see #2 below)
Then `trustedHosts.includes('*')` instead of `.some((p) => p.trim() === '*')`,
since patterns are already pre-trimmed.
2. **Trim asymmetry between `*` detection and host-pattern match**
(llm-correlation-fetch.ts)
`[" * "]` was tolerated (trimmed before `===` compare) but
`[" dashscope.aliyuncs.com "]` silently never matched. The
normalization above fixes this by trimming uniformly upstream.
3. **Proxy fetch test: only negative assertions**
(openaiContentGenerator/provider/default.test.ts)
The test asserted `callArg.fetch !== proxyFetch` and `!== globalThis.fetch`
but both passed for ANY wrapper, including a buggy one that
accidentally wraps globalThis.fetch instead of proxyFetch. Added a
positive assertion: call the wrapped fetch and verify proxyFetch was
the delegation target.
Tests: 4 new cases — whitespace-padded host pattern, bare-string
malformed config (both wrapper and static), null/number-containing
array malformed config (both wrapper and static), positive proxy fetch
delegation. All pass; pre-existing Anthropic User-Agent failure
unrelated.
🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
wenshao
left a comment
There was a problem hiding this comment.
No high-confidence issues at commit 7a1b4f8d (round 7). Build passes, 9270/9272 tests pass (2 pre-existing unrelated), tsc/eslint clean. All 7 low-confidence suggestions are incremental quality improvements — see terminal output for details. — qwen3.7-max via Qwen Code /review
Closes #4384. Sub-issue of #3731 (P3 deeper observability).
This PR was originally split into two stacked PRs (this one for traceparent + #4393 for session id); merged into a single PR per reviewer request. #4393 is now closed.
Summary
Two HTTP-header propagations with different default scopes, both gated on
telemetry.enabled. See the "Scope" sections below —traceparentis broader (allfetch()calls),X-Qwen-Code-Session-Idis narrower (first-party allowlist):traceparentvia@opentelemetry/instrumentation-undici— qwen-code traces continue across the process boundary into upstream model services that read W3C tracecontext (DashScope/ARMS Tracing being the headline use case). Also produces a free client-side HTTP span per LLM call (separates network TTFB from upstream model processing).X-Qwen-Code-Session-Idvia per-request fetch wrapper — server-side ingestion can correlate observed LLM requests with qwen-code session metric/log records without parsing trace context. Pattern matched from claude-code (X-Claude-Code-Session-Id, verified atsrc/services/api/client.ts:108in their open-source repo).Scope:
X-Qwen-Code-Session-Idis first-party only (LaZzyMan review fix)After LaZzyMan's REQUEST_CHANGES review, the session id header is restricted to a first-party host allowlist by default:
Requests to third-party LLM providers (OpenAI, Anthropic, OpenRouter, MiniMax, ModelScope, Mistral, DeepSeek, vanilla Gemini, ...) do NOT receive this header by default. This restores the validity of the claude-code precedent (first-party CLI → first-party backend, same legal entity) while preserving the actual product value (ARMS Tracing server-side stitching against DashScope).
Operators with broader correlation requirements override via
telemetry.sessionIdHeaderHosts:Scope:
traceparentpropagates to allfetch()callsUndiciInstrumentationpatchesglobalThis.fetchprocess-wide (not LLM-only), sotraceparentreaches every outbound fetch — includingWebFetch, MCP clients, IDE-extension calls. Value carried issha256(sessionId).slice(0, 32)(one-way hash, not the session id itself), andtraceparentis W3C-standard not-secret-by-spec. Documented honestly intelemetry.mdunder "Scope: all fetch() calls, not just LLM"; per-destination scoping toggle for trace propagation specifically is tracked as a follow-up sub-issue.Critical design: fetch wrapper, not
defaultHeadersFor OpenAI / Anthropic the session-id header goes through a per-request fetch wrapper, not the SDK's
defaultHeadersoption. Why: content-generator SDK clients are constructed ONCE and NOT recreated on/clear-triggered session resets —Config.resetSession()updatesthis.sessionIdbut the contentGenerator keeps using the stale header value. Readingconfig.getSessionId()from inside the wrapper at request time gives the live value. Verified by a dedicated staleness regression test inllm-correlation-fetch.test.ts.Gemini falls back to static
httpOptions.headersbecause@google/genai'sHttpOptionsinterface does NOT expose afetchhook (onlyheaders/baseUrl/apiVersion/timeout/extraParams). The host gate is applied once at SDK construction againstconfig.baseUrl ?? 'https://generativelanguage.googleapis.com'. Vanilla Gemini → no header (googleapis.com not on default allowlist). Operators who put the Gemini SDK on a DashScope-compatiblebaseUrlget the header naturally. Documented limitation: after/clear, Gemini's wire header stays stale until the contentGenerator is recreated (spans/logs still carry the live session id). Lazy-invalidate fix tracked as a follow-up sub-issue. Full discussion in design doc §8.6.Feedback-loop avoidance
OTel SDK uses fetch to upload OTLP data. Without protection, instrumenting fetch would trace those uploads, which would themselves be uploaded ad infinitum.
ignoreRequestHookmatches configuredotlpEndpoint/ per-signal endpoints. After R1-R3 review iterations, URL normalization is done via WHATWGURLparser (handles surrounding quotes /#fragment/ trailing slash / default-port stripping / etc.). Boundary-safe matching prevents port/host/path collisions. BothHttpInstrumentation(OTLP HTTP exporter uses node:http) andUndiciInstrumentation(LLM SDK fetch) carry the guard.Dependency
Adds
@opentelemetry/instrumentation-undici@^0.14.0(peer-compatible with installed@opentelemetry/[email protected]— 0.14.x is the highest contrib version that declares^0.203.0; newer 0.15+ require 0.204+).Integration sites
packages/core/src/core/openaiContentGenerator/provider/default.ts— base; 5 sibling providers inheritpackages/core/src/core/openaiContentGenerator/provider/dashscope.ts—override buildClient; QwenContentGenerator inherits via thispackages/core/src/core/anthropicContentGenerator/anthropicContentGenerator.ts— analogouspackages/core/src/core/geminiContentGenerator/index.ts— factory function with destination-aware static-header injectionAudited but no change needed:
qwenContentGenerator.ts(extends OpenAI base — auto-covered),loggingContentGenerator.ts(wrapper, no SDK construction).New files
packages/core/src/telemetry/trusted-llm-hosts.ts—DEFAULT_SESSION_ID_HEADER_HOSTS+matchesTrustedHost()+extractRequestHost()packages/core/src/telemetry/trusted-llm-hosts.test.ts— pattern-match correctness incl. TLD-suffix attack vectorsSettings
telemetry.sessionIdHeaderHosts: string[]— destination allowlist forX-Qwen-Code-Session-Id["*"]restores broadcast[]fully disables the headerTest plan
["*"]broadcast override,[]disable, custom allowlist, unparseable URL fail-closed),staticCorrelationHeadersdestination-aware variantDesign doc
docs/design/telemetry-outbound-propagation-design.md(in-tree).Operator-facing docs
docs/developers/development/telemetry.md— new "Trace context propagation" + "Session correlation header" sections cover the default scope, override syntax, Gemini limitation, and operator notes.🤖 Generated with Qwen Code