Skip to content

feat(ARSN-572): Propagate W3C trace context to MongoDB metadata writes#2611

Open
delthas wants to merge 1 commit into
development/8.3from
improvement/ARSN-572/trace-context
Open

feat(ARSN-572): Propagate W3C trace context to MongoDB metadata writes#2611
delthas wants to merge 1 commit into
development/8.3from
improvement/ARSN-572/trace-context

Conversation

@delthas
Copy link
Copy Markdown
Contributor

@delthas delthas commented Apr 13, 2026

Context

Cloudserver recently gained OpenTelemetry tracing (CLDSRV-884, cloudserver PR scality/cloudserver#6140). That gives us a single end-to-end trace per S3 request spanning cloudserver → vault → hdproxy → hyperiod.

But many state mutations in the S3 stack are written to MongoDB and consumed asynchronously by worker services via the oplog:

  • Backbeat — replication, lifecycle expiration/transition, garbage collection
  • Sorbet — cold-storage restore completion
  • Other oplog consumers

Today the oplog entry carries no trace context, so async work performed hours or days after the S3 request cannot be tied back to the request that caused it. A user's POST …?restore produces a trace that ends at the metadata write; the actual cold-tier retrieval by Sorbet is invisible from the original trace.

Solution

Stamp a W3C trace context (traceparent/tracestate) on every mutating metadata write, auto-derived from the currently-active OTEL context.

The field lives on ObjectMDData next to originOp:

traceContext?: {
    traceparent?: string;
    tracestate?: string;
};

A new helper stampActiveTraceContext(data) in lib/storage/metadata/captureTraceContext.ts:

  • reads the active OTEL context via @opentelemetry/api
  • serializes it via propagation.inject()
  • writes the result to data.traceContext, or clears the field if no span is active (so a stale value loaded from storage cannot leak forward into a new oplog entry)

It is called from inside MongoUtils.serialize(), so every metadata write that goes through serialize automatically gets the stamp (3 sites in MongoClientInterface: putObject, putObjectVerCase4, repair). For the 2 paths that bypass serialize because they rewrite a doc previously read from the DB with already-escaped tags (internalPutObject, internalDeleteObject), stampActiveTraceContext() is called explicitly on the ObjectMD.getValue() result.

Consumers (backbeat, sorbet — out of scope for this PR) will later:

  1. Read value.traceContext.traceparent from the oplog entry
  2. propagation.extract() it to reconstruct a parent context
  3. Wrap per-entry processing in context.with(extractedCtx, ...)

This closes the loop on end-to-end tracing across the async boundary.

Design decisions

1. Auto-inject at the arsenal write boundary (vs. param plumbing).
The alternative was to thread a new traceContext parameter through the ~15 cloudserver call sites that already set originOp. Rejected because:

  • @opentelemetry/context-async-hooks (loaded by the OTEL SDK at cloudserver startup) already propagates the request context through arsenal's async call chain for free. Re-reading it at the mongo write is idiomatic OTEL.
  • Zero cloudserver surface change — we can iterate on arsenal independently.
  • Parallels existing behavior where arsenal reads originOp out of params and attaches it to the ObjectMD.
  • Avoids replicating the maintenance cost that makes originOp easy to forget on new API endpoints today.

2. Only @opentelemetry/api as a runtime dep (~10 KB, no SDK).
The api package is specifically designed to be safe for libraries to depend on. When no SDK is registered (today's state for anyone not opting into cloudserver OTEL), context.active() returns a no-op context, trace.getSpan() returns undefined, and stampActiveTraceContext() clears the field and returns. Zero runtime cost when OTEL is off.

3. Optional field, no schema migration.
traceContext is absent by default — consistent with other optional ObjectMD fields like legalHold, retentionDate. No default-value change in the ObjectMDData construction. Existing consumers that don't understand the field simply ignore it; fully backward compatible.

4. Always reflects the current write — never carries a stale context forward.
If no span is active when stampActiveTraceContext() runs, it clears traceContext (rather than leaving whatever was loaded from storage). Same when propagation.inject() produces no traceparent. This prevents both edge-case OTEL output and the more practical hazard where a doc read for a delete/repair carries a traceContext from its original write that does not belong on the new oplog entry.

5. Single helper, single contract.
The trace-stamping API on arsenal is one function: stampActiveTraceContext(data). There is intentionally no public setter on ObjectMD — the contract is "trace context is stamped by the metadata backend at write time," and exposing a model-level setter would invite callers in cloudserver/backbeat to bypass that contract (issue raised in review). Internally this also keeps the stamp/clear semantics in exactly one place.

Interaction with async flows (cold restore as example)

Cold storage restore exercises both a sync and an async metadata write:

  1. ObjectRestore:Post (sync, user-initiated via POST /bucket/key?restore) — cloudserver sets originOp; arsenal now also stamps traceContext pointing to the user's request trace. ✅
  2. ObjectRestore:Completed (async, Sorbet-triggered via cloudserver's /_/backbeat/* route) — behavior depends on Sorbet:
    • If Sorbet forwards traceparent on its HTTP call (ideally extracted from step 1's stored traceContext): arsenal picks it up via context.with(remoteContext, ...) in cloudserver's server.js, and the Completed write ties back to the original user restore. ✅
    • If Sorbet calls without traceparent (today's state): safely no-ops — Completed write has no traceContext, no regression. ✅

Same pattern applies to lifecycle expiration/transition, CRR replication, and GC.

Verification

Unit tests

  • tests/unit/storage/metadata/captureTraceContext.spec.js — covers stampActiveTraceContext: no-active-span clears (and is a no-op when there was nothing to clear), traceparent-only write, traceparent + tracestate write, defensive handling of a misbehaving propagator that injects no traceparent, overwrite of a stale value.
  • tests/unit/storage/metadata/mongoclient/{putObject,delObject,repair}.spec.js — sinon-stubbed tests verify the field lands on the document handed to findOneAndReplace / bulkWrite for each of the 4 write paths the cloudserver hot path exercises, and that it is cleared rather than carried forward when no span is active.

All 449 tests in tests/unit/models/ and tests/unit/storage/metadata/ pass.

End-to-end (follow-up on a test cluster)

  1. Bump cloudserver's arsenal pin to a build of this branch; roll the cloudserver deployment on an Artesca cluster with ENABLE_OTEL=true
  2. Make a PutObject; note the trace ID in Jaeger
  3. Inspect the MongoDB oplog: o.value.traceContext.traceparent should be present, with characters 3-35 matching the Jaeger trace ID
  4. Negative test with ENABLE_OTEL=false → field should be absent

Issue: ARSN-572

@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Apr 13, 2026

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Apr 13, 2026

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from d42f30b to 01693cd Compare April 13, 2026 15:38
Comment thread lib/models/ObjectMD.ts Outdated
Comment thread lib/storage/metadata/mongoclient/MongoClientInterface.ts Outdated
@scality scality deleted a comment from bert-e Apr 13, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.54%. Comparing base (1acb14f) to head (d9198ea).

Additional details and impacted files
@@                 Coverage Diff                 @@
##           development/8.3    #2611      +/-   ##
===================================================
+ Coverage            73.48%   73.54%   +0.05%     
===================================================
  Files                  222      223       +1     
  Lines                18183    18205      +22     
  Branches              3762     3789      +27     
===================================================
+ Hits                 13362    13389      +27     
+ Misses                4816     4811       -5     
  Partials                 5        5              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from 01693cd to ca456c9 Compare April 13, 2026 15:56
Comment thread lib/storage/metadata/captureTraceContext.ts
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from ca456c9 to a30bce8 Compare April 14, 2026 07:19
Comment thread lib/storage/metadata/captureTraceContext.ts
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from a30bce8 to 39233c8 Compare April 14, 2026 07:23
Comment thread lib/storage/metadata/captureTraceContext.ts
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from 39233c8 to a1fe7ca Compare May 5, 2026 15:58
@delthas delthas marked this pull request as ready for review May 5, 2026 16:02
@scality scality deleted a comment from bert-e May 5, 2026
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from a1fe7ca to eb5a535 Compare May 5, 2026 16:04
@delthas delthas requested review from a team, DarkIsDude and benzekrimaha May 5, 2026 16:08
Comment thread lib/models/ObjectMD.ts Outdated
Comment thread lib/models/ObjectMD.ts Outdated
Comment thread lib/storage/metadata/mongoclient/MongoClientInterface.ts Outdated
Comment thread lib/storage/metadata/mongoclient/MongoClientInterface.ts Outdated
Comment thread lib/storage/metadata/mongoclient/MongoClientInterface.ts
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from eb5a535 to 7c5ecce Compare May 6, 2026 14:25
Comment thread lib/storage/metadata/mongoclient/MongoClientInterface.ts Outdated
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from 7c5ecce to 28bca79 Compare May 6, 2026 14:43
@delthas delthas requested a review from DarkIsDude May 6, 2026 14:49
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from 28bca79 to 8619151 Compare May 7, 2026 09:06
@delthas delthas requested review from SylvainSenechal and removed request for benzekrimaha May 12, 2026 08:36
@scality scality deleted a comment from claude Bot May 13, 2026
@scality scality deleted a comment from claude Bot May 13, 2026
@scality scality deleted a comment from claude Bot May 13, 2026
@scality scality deleted a comment from claude Bot May 13, 2026
@scality scality deleted a comment from claude Bot May 13, 2026
@scality scality deleted a comment from claude Bot May 13, 2026
@scality scality deleted a comment from claude Bot May 13, 2026
@scality scality deleted a comment from claude Bot May 13, 2026
Comment thread lib/models/ObjectMD.ts Outdated
Comment thread lib/storage/metadata/mongoclient/MongoClientInterface.ts Outdated
Comment thread lib/storage/metadata/captureTraceContext.ts Outdated
Comment thread lib/storage/metadata/mongoclient/MongoClientInterface.ts
@scality scality deleted a comment from bert-e May 18, 2026
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch 2 times, most recently from 04348d1 to 972bdac Compare May 19, 2026 18:13
@scality scality deleted a comment from claude Bot May 19, 2026
Comment thread lib/models/ObjectMD.ts
@scality scality deleted a comment from claude Bot May 19, 2026
Add a traceContext field to ObjectMD carrying the currently-active
OTEL trace context (W3C traceparent/tracestate). Inject it
automatically at the three metadata write chokepoints where originOp
is set today: internalPutObject, repair, and internalDeleteObject.

The value ends up in the MongoDB oplog; downstream consumers
(backbeat, sorbet) can extract it to continue the trace across the
async boundary, closing the loop on end-to-end tracing for flows
that cross the oplog.

When OTEL is not active on the caller (no SDK, or request outside
a traced context), captureCurrentTraceContext returns undefined,
setTraceContext no-ops, and the field stays absent — zero cost.

Only adds @opentelemetry/api as a runtime dependency (the API-only
surface package, becomes a no-op when no SDK is registered).

Issue: ARSN-572
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from 972bdac to d9198ea Compare May 20, 2026 09:05
@claude
Copy link
Copy Markdown

claude Bot commented May 20, 2026

LGTM

All five metadata write paths are correctly covered: three via serialize (putObject, putObjectVerCase4, repair) and two via explicit stampActiveTraceContext calls (internalPutObject, internalDeleteObject). Stale context clearing is properly handled, the optional field preserves backward compatibility without a schema migration, and test coverage is thorough across all paths.

Review by Claude Code

@scality scality deleted a comment from claude Bot May 20, 2026
@delthas delthas requested a review from francoisferrand May 20, 2026 09:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants