BB-764: Add OpenTelemetry tracing to backbeat replication pipeline #2733
delthas wants to merge 3 commits into development/9.3
Conversation
Hello delthas,
My role is to assist you with the merge of this pull request.
Status report is not available.
Waiting for integration branch creation to be requested by the user. To request integration branches, please comment on this pull request with the following command:
Codecov Report
❌ Your patch check has failed because the patch coverage (68.53%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
... and 3 files with indirect coverage changes
@@ Coverage Diff @@
## development/9.3 #2733 +/- ##
===================================================
+ Coverage 74.50% 74.62% +0.11%
===================================================
Files 200 203 +3
Lines 13610 13783 +173
===================================================
+ Hits 10140 10285 +145
- Misses 3460 3488 +28
Partials 10 10
Flags with carried forward coverage won't be shown.
```
const tracer = trace.getTracer('backbeat');
const span = tracer.startSpan(operationName, {
    kind: 1, // SpanKind.CONSUMER
}, parentCtx);
```
kind: 1 is SpanKind.SERVER, not SpanKind.CONSUMER. The OTEL enum values are: INTERNAL=0, SERVER=1, CLIENT=2, PRODUCER=3, CONSUMER=4. This will cause consumer spans to show up as SERVER spans in Jaeger. Import and use the constant to avoid magic numbers. — Claude Code
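A minimal sketch of the mix-up, mirroring the enum values documented by `@opentelemetry/api` (the object below is an illustration only; real code should import the `SpanKind` constant rather than redefining it):

```javascript
// The SpanKind enum as documented by @opentelemetry/api.
// In real code: const { SpanKind } = require('@opentelemetry/api');
// then tracer.startSpan(name, { kind: SpanKind.CONSUMER }).
const SpanKind = {
    INTERNAL: 0,
    SERVER: 1,
    CLIENT: 2,
    PRODUCER: 3,
    CONSUMER: 4,
};

// The bug in the diff: `kind: 1` labels the span as SERVER, not CONSUMER.
console.log(SpanKind.SERVER);   // 1
console.log(SpanKind.CONSUMER); // 4
```

Using the named constant also survives any future renumbering of the enum.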
```
    'service.namespace': process.env.OTEL_SERVICE_NAMESPACE || 'scality',
}),
traceExporter,
sampler: new TraceIdRatioBasedSampler(samplingRatio),
```
TraceIdRatioBasedSampler ignores the parent's sampling decision. If cloudserver samples a trace (sampled flag=1 in traceparent) but backbeat's ratio-based sampler decides to drop it, the trace will have a gap — the S3 side shows a span but backbeat's consumer span is missing. Wrap with ParentBasedSampler so that incoming sampled traces are always recorded: sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(samplingRatio) }). — Claude Code
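The decision logic behind that recommendation can be sketched in plain JS (no OTEL dependency; names here are illustrative, not the real sampler internals):

```javascript
// Sketch of ParentBasedSampler's decision rule: defer to the remote parent's
// sampled flag when one exists, and only apply the ratio-based root sampler
// to root spans. `randomDraw` stands in for the trace-id-based threshold
// check that TraceIdRatioBasedSampler actually performs.
function shouldSample({ parentSampledFlag, traceIdRatio, randomDraw }) {
    if (parentSampledFlag !== undefined) {
        // A remote parent (flag from the traceparent header) decided:
        // honor it, so traces started by cloudserver have no gaps in backbeat.
        return parentSampledFlag === 1;
    }
    // Root span: apply the locally configured sampling ratio.
    return randomDraw < traceIdRatio;
}

// Incoming sampled trace is recorded regardless of the local ratio:
shouldSample({ parentSampledFlag: 1, traceIdRatio: 0.01, randomDraw: 0.9 }); // true
// Unsampled parent stays unsampled even at ratio 1.0:
shouldSample({ parentSampledFlag: 0, traceIdRatio: 1.0, randomDraw: 0.0 }); // false
```

A bare `TraceIdRatioBasedSampler` skips the first branch entirely, which is exactly how the cross-service gap appears.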
```
otelContext.with(ctx, () => {
    this._queueProcessor(entry, (err, completionArgs) => {
        if (err) span.recordException(err);
        span.end();
```
recordException only adds an event to the span — it does not set the span status. Without calling span.setStatus({ code: SpanStatusCode.ERROR }), failed spans will still appear as OK in Jaeger. Add span.setStatus({ code: 2 }) (SpanStatusCode.ERROR) alongside recordException. — Claude Code
Review by Claude Code
Force-pushed 9d08f7b to 2f7afb0
```
"@opentelemetry/instrumentation-http": "^0.55.0",
"@opentelemetry/resources": "^1.30.1",
"@opentelemetry/sdk-node": "^0.55.0",
"@opentelemetry/sdk-trace-base": "^2.0.1",
```
@opentelemetry/sdk-trace-base is pinned at ^2.0.1 (resolves to 2.6.1), but @opentelemetry/sdk-node@0.55.0 depends on sdk-trace-base@1.28.0. The TraceIdRatioBasedSampler imported in lib/otel.js comes from v2, while the NodeSDK internals use v1. This cross-major-version mismatch can cause silent sampling failures at runtime. Pin to ^1.28.0 to match the version used by sdk-node.
— Claude Code
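Besides repinning the direct dependency, one way to guarantee a single `sdk-trace-base` copy across the tree is a Yarn `resolutions` entry (this is a sketch assuming Yarn classic is the package manager here; the version is the one the review recommends):

```json
{
  "dependencies": {
    "@opentelemetry/sdk-node": "^0.55.0",
    "@opentelemetry/sdk-trace-base": "^1.28.0"
  },
  "resolutions": {
    "@opentelemetry/sdk-trace-base": "^1.28.0"
  }
}
```

After editing, `yarn install` followed by `yarn why @opentelemetry/sdk-trace-base` should report only one resolved version.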
```
otelContext.with(ctx, () => {
    this._queueProcessor(entry, (err, completionArgs) => {
        if (err) span.recordException(err);
        span.end();
```
recordException adds an event to the span but does not mark the span as failed. Without setStatus, error spans will appear as successful in Jaeger/Grafana Tempo. Add span.setStatus({ code: 2, message: err.message }) alongside recordException.
```suggestion
if (err) {
    span.recordException(err);
    span.setStatus({ code: 2, message: err.message });
}
span.end();
```
— Claude Code
```
const parentCtx = contextFromKafkaHeaders(kafkaEntry.headers);
const tracer = trace.getTracer('backbeat');
const span = tracer.startSpan(operationName, {
    kind: 1, // SpanKind.CONSUMER
```
Magic number 1 for SpanKind.CONSUMER. Since @opentelemetry/api is already imported in this file, use the named constant for readability.
```suggestion
    kind: 4, // SpanKind.CONSUMER (from @opentelemetry/api)
```
Actually, SpanKind should be imported from @opentelemetry/api: `const { SpanKind } = require('@opentelemetry/api')`, then use `kind: SpanKind.CONSUMER`.
— Claude Code
Review by Claude Code
Force-pushed 2f7afb0 to d562a0a
```
otelContext.with(ctx, () => {
    this._queueProcessor(entry, (err, completionArgs) => {
        if (err) span.recordException(err);
```
recordException adds an exception event but does not set the span status to ERROR. Without span.setStatus({ code: SpanStatusCode.ERROR }), error traces won't appear as failures in Jaeger/tracing UIs.
```suggestion
if (err) {
    const { SpanStatusCode } = require('@opentelemetry/api');
    span.recordException(err);
    span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
}
```
```
"@opentelemetry/instrumentation-http": "^0.55.0",
"@opentelemetry/resources": "^1.30.1",
"@opentelemetry/sdk-node": "^0.55.0",
"@opentelemetry/sdk-trace-base": "^2.0.1",
```
@opentelemetry/sdk-trace-base is pinned at ^2.0.1 (resolves to 2.6.1), but @opentelemetry/sdk-node@0.55.0 depends on sdk-trace-base@1.28.0. This installs two major versions side-by-side. The TraceIdRatioBasedSampler imported from v2 is passed into NodeSDK which internally uses v1 — if the Sampler interface diverges between majors, this will break silently. Pin to ^1.28.0 to match what sdk-node expects, or upgrade sdk-node to a version that depends on sdk-trace-base v2.
— Claude Code
```
const parentCtx = contextFromKafkaHeaders(kafkaEntry.headers);
const tracer = trace.getTracer('backbeat');
const span = tracer.startSpan(operationName, {
    kind: 1, // SpanKind.CONSUMER
```
Use SpanKind.CONSUMER from @opentelemetry/api instead of the magic number 1. The numeric value is an internal detail that could change across versions.

```suggestion
const span = tracer.startSpan(operationName, {
    kind: trace.SpanKind ? trace.SpanKind.CONSUMER : 1, // SpanKind.CONSUMER
}, parentCtx);
```
Actually, the cleaner fix is to import SpanKind at the top of the file: `const { context, propagation, trace, SpanKind } = require('@opentelemetry/api');` then use `kind: SpanKind.CONSUMER`.
— Claude Code
Review by Claude Code
```
"dependencies": {
    "@opentelemetry/api": "^1.9.0",
    "@opentelemetry/auto-instrumentations-node": "^0.50.2",
    "@opentelemetry/context-async-hooks": "^1.28.0",
```
@opentelemetry/auto-instrumentations-node@0.50.2 is ~5 minor versions behind @opentelemetry/sdk-node@0.55.0. This installs duplicate versions of core OTEL packages (e.g. @opentelemetry/instrumentation 0.53.0 and 0.55.0, @opentelemetry/api-logs 0.53.0 and 0.55.0). Two copies of the instrumentation core can cause hooks registered by one version not to be visible to the other.
Bump auto-instrumentations-node to ^0.55.0 (or whichever minor matches sdk-node) so all OTEL packages resolve to a single version.
— Claude Code
```
"@opentelemetry/api": "^1.9.0",
"@opentelemetry/auto-instrumentations-node": "^0.50.2",
"@opentelemetry/context-async-hooks": "^1.28.0",
"@opentelemetry/exporter-trace-otlp-http": "^0.55.0",
```
auto-instrumentations-node bundles instrumentations for ~35 libraries (cassandra, pg, mysql, graphql, grpc, etc.). Only HTTP, AWS SDK, and ioredis are enabled here. Using the three individual packages directly would cut ~1200 lines from yarn.lock and avoid pulling unused transitive deps into the image.
— Claude Code
```
const { ctx, span } = startSpanFromKafkaEntry(entry, `${topic}.process`);
span.setAttribute('messaging.kafka.topic', topic);
span.setAttribute('messaging.kafka.partition', partition);
```
Consider adding messaging.kafka.consumer_group — it helps distinguish spans from different consumer groups in Jaeger (e.g. replication vs lifecycle processors running in the same cluster).
```suggestion
span.setAttribute('messaging.kafka.topic', topic);
span.setAttribute('messaging.kafka.partition', partition);
span.setAttribute('messaging.kafka.consumer_group', this._groupId);
```
Review by Claude Code
Force-pushed d562a0a to 51a9f61
```
"homepage": "https://github.com/scality/backbeat#readme",
"dependencies": {
    "@opentelemetry/api": "^1.9.0",
    "@opentelemetry/auto-instrumentations-node": "^0.50.2",
```
OTEL package version skew: auto-instrumentations-node@^0.50.2 is from the 0.53 OTEL release line, while sdk-node, instrumentation-http, and exporter-trace-otlp-http are all ^0.55. The yarn.lock confirms duplicates — e.g. @opentelemetry/instrumentation installs as both 0.53.0 and 0.55.0, and instrumentation-http as both 0.53.0 and 0.55.0. Because the SDK and the auto-instrumentation plugins end up referencing different copies of the base @opentelemetry/instrumentation module, auto-instrumentation can silently fail at runtime.
Bump auto-instrumentations-node to a version compatible with the 0.55 ecosystem, and consider dropping instrumentation-http, instrumentation-aws-sdk, and context-async-hooks as direct deps — they are pulled transitively from auto-instrumentations-node and sdk-node.
— Claude Code
Add OTEL SDK setup and trace context propagation through backbeat's Kafka-based replication pipeline, linking async replication work back to the original S3 request trace in Jaeger.

- Add lib/otel.js mirroring cloudserver's SDK setup pattern (gated by ENABLE_OTEL env var, configurable sampling/exporter)
- Add lib/tracing/kafkaTraceContext.js with helpers to extract traceparent from oplog entries and Kafka headers
- Extend BackbeatProducer.produce() to pass message headers (7th arg)
- Extend QueuePopulatorExtension.publish() to accept headers
- Extract traceContext from oplog entries in ReplicationQueuePopulator and forward as Kafka message headers
- Wrap BackbeatConsumer._processTask() in OTEL consumer spans linked to the original trace via Kafka headers
- Bootstrap OTEL SDK in queuePopulator and replication processor entry points
- Add unit tests for kafkaTraceContext helpers

Issue: BB-764
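The extraction-plus-reparse flow the commit describes can be sketched in plain JS (hypothetical helper bodies; the real implementations live in lib/tracing/kafkaTraceContext.js — this only illustrates the W3C traceparent format):

```javascript
// Pull traceparent/tracestate off a parsed oplog entry, if stamped.
function traceHeadersFromEntry(entry) {
    const tc = entry && entry.value && entry.value.traceContext;
    if (!tc || !tc.traceparent) {
        return null; // no trace context on this oplog entry
    }
    const headers = { traceparent: tc.traceparent };
    if (tc.tracestate) {
        headers.tracestate = tc.tracestate;
    }
    return headers;
}

// W3C traceparent: version-traceId-spanId-flags, all lowercase hex.
function parseTraceparent(header) {
    const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/
        .exec(header);
    if (!m) return null;
    return { version: m[1], traceId: m[2], spanId: m[3], sampled: m[4] === '01' };
}

const entry = {
    value: {
        traceContext: {
            traceparent: '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
        },
    },
};
const headers = traceHeadersFromEntry(entry);
const parsed = parseTraceparent(headers.traceparent);
// parsed.traceId === '4bf92f3577b34da6a3ce929d0e0e4736', parsed.sampled === true
```

In the real code the consumer side would hand the reconstructed context to the OTEL propagation API rather than parsing the header by hand.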
Force-pushed 51a9f61 to b9d3528
```
        span.end();
        done(err, completionArgs, finishProcessingTask);
    });
});
```
If _queueProcessor throws synchronously, span.end() is never called and the span leaks. Wrap the call in try/catch to ensure cleanup:

```suggestion
otelContext.with(ctx, () => {
    try {
        this._queueProcessor(entry, (err, completionArgs) => {
            if (err) {
                span.recordException(err);
                span.setStatus({ code: SpanStatusCode.ERROR });
            }
            span.end();
            done(err, completionArgs, finishProcessingTask);
        });
    } catch (err) {
        span.recordException(err);
        span.setStatus({ code: SpanStatusCode.ERROR });
        span.end();
        throw err;
    }
});
```
— Claude Code
```
**Status**: done on branch `improvement/BB-764/otel-replication-tracing`
(PR [scality/backbeat#2733](https://github.com/scality/backbeat/pull/2733)).

### Trust boundary (Part B)
```
This references a local Claude Code plan file path (~/.claude/plans/whimsical-pondering-swan.md) that won't exist on other developers' machines. Consider removing this line or replacing it with a link to the Jira ticket or another shared resource.
— Claude Code
Well-designed PR. The OTEL integration follows standard patterns (no-op API when disabled, ParentBasedSampler, trust boundary header stripping), and the link-vs-parent distinction for fan-out consumers is a smart choice to keep Jaeger traces manageable.
Part B — trust boundary:
- Add lib/tracing/healthPaths.js (isHealthPath helper)
- lib/otel.js wires HttpInstrumentation:
- ignoreIncomingRequestHook drops health/OPTIONS spans
- requestHook strips traceparent/tracestate on outbound calls to
untrusted hosts, keeps the client span (prevents leaks to
AWS/Azure/GCP/remote-Artesca replication destinations)
- buildTrustedHosts(config) derives allowlist from replication source,
Kafka brokers, MongoDB, server.host, Redis; refuse-by-default
Part D — extension to remaining pods:
- Bootstrap OTEL in 6 entry points: notification queue processor,
lifecycle conductor/bucket/object processors, gc, replication-status
- Lifecycle conductor wraps processBuckets in a root
lifecycle.conductor.scan span (kind=INTERNAL); bucket-task messages
now carry traceparent so bucket-processor consumers become children
- LifecycleTask produces object-tasks and transition-tasks with
link-headers (not traceparent) — fan-out break prevents 1M-span
traces from breaking the Jaeger UI
- BackbeatConsumer auto-detects link-headers and uses
startLinkedSpanFromKafkaEntry (new OTEL Link instead of parent-child)
- ReplicationAPI.sendDataMoverAction attaches link-headers when used
by lifecycle transitions (only caller)
- KafkaNotificationDestination.send strips any trace headers before
producing to the external customer Kafka destination
- New helpers in lib/tracing/kafkaTraceContext.js:
linkContextFromKafkaHeaders, startLinkedSpanFromKafkaEntry,
linkHeadersFromCurrentContext, traceHeadersFromCurrentContext
Tests: 35 unit tests covering healthPaths, buildTrustedHosts
(including Config-hosts-subset + bootstrapList-exclusion), and the new
kafkaTraceContext helpers.
OTEL.md updated to reflect completion.
Issue: BB-764
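The link-vs-parent dispatch described above can be sketched in plain JS. Note the header key `linkparent` is an assumed name for illustration; the commit only says BackbeatConsumer "auto-detects link-headers", without showing the key:

```javascript
// Decide the span strategy from Kafka message headers:
// - link-headers  => new root span that only *links* to the origin trace,
//   so 1M-object lifecycle fan-outs don't become one giant Jaeger trace
// - traceparent   => normal child span of the producing trace
// - neither       => plain root span
function spanStrategyFromHeaders(headers) {
    if (headers && headers.linkparent) {
        return { mode: 'link', origin: headers.linkparent };
    }
    if (headers && headers.traceparent) {
        return { mode: 'child', origin: headers.traceparent };
    }
    return { mode: 'root', origin: null };
}

spanStrategyFromHeaders({ traceparent: '00-aaaa' }).mode; // 'child'
spanStrategyFromHeaders({ linkparent: '00-aaaa' }).mode;  // 'link'
spanStrategyFromHeaders({}).mode;                         // 'root'
```

In OTEL terms, the 'link' branch corresponds to calling `startLinkedSpanFromKafkaEntry`, which creates a root span carrying a Link rather than a parent reference.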
Force-pushed 970a811 to 849d6b0
```
    links,
}, context.active());

return { ctx: trace.setSpan(context.active(), span), span };
```
startLinkedSpanFromKafkaEntry is documented as creating a NEW root span, but it passes context.active() as the parent context to tracer.startSpan(). If an ambient span exists in the active context (e.g. if this function is ever called inside an otelContext.with() block), the span becomes a child rather than a root — violating the function's contract.
Use ROOT_CONTEXT instead to guarantee a true root span. You'll need to add ROOT_CONTEXT to the destructured import from @opentelemetry/api at the top of the file.
— Claude Code
```
requestHook: (span, request) => {
    const rawHost = (request.getHeader?.('host') || '').toString();
    const host = rawHost.toLowerCase().split(':')[0];
    if (host && !TRUSTED_HOSTS.has(host)) {
```
When request.getHeader('host') returns undefined (e.g. HTTP/2 uses :authority instead of Host), host becomes an empty string, and the `if (host && ...)` guard is falsy — meaning traceparent/tracestate headers are NOT stripped. Consider defaulting to strip when the host is unknown: `if (!host || !TRUSTED_HOSTS.has(host))`.
— Claude Code
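A minimal sketch of the strip-by-default check the review recommends (`trustedHosts` stands in for the allowlist built by buildTrustedHosts(); this is an illustration, not the PR's code):

```javascript
// Decide whether to strip traceparent/tracestate on an outbound request.
// Refuse-by-default: a missing or unparseable Host header (e.g. HTTP/2
// sending :authority instead of Host) also triggers stripping, instead of
// leaking trace headers to an unknown destination.
function shouldStripTraceHeaders(rawHost, trustedHosts) {
    const host = (rawHost || '').toString().toLowerCase().split(':')[0];
    return !host || !trustedHosts.has(host);
}

const trusted = new Set(['kafka.internal', 'mongo.internal']);
shouldStripTraceHeaders('kafka.internal:9092', trusted); // false: trusted, keep
shouldStripTraceHeaders('s3.amazonaws.com', trusted);    // true: untrusted, strip
shouldStripTraceHeaders(undefined, trusted);             // true: unknown, strip
```

This inverts the original guard: the only way to keep trace headers is a positive allowlist match.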
Review by Claude Code
Two issues surfaced after enabling OTEL on the cluster:

1. NotificationQueuePopulator dropped the oplog trace context. Replication already extracts value.traceContext via traceHeadersFromEntry() and passes it to publish(..., headers); the notification producer was missed. Result: notification-processor-azure consumer spans appeared as orphan roots in Jaeger instead of children of the causing S3 PUT trace. Fix mirrors ReplicationQueuePopulator.
2. /metrics on processor pods wasn't in healthPaths.js. Prometheus scrapes each pod at :8901/metrics (no /_/ prefix) every 15s; every scrape created a server span — ~60 noise spans / 15min × 14 pods. Added /metrics to the explicit Set.

Tests: 3 new cases on NotificationQueuePopulator covering traceparent propagation (with + without tracestate, and no-op when the oplog entry has no traceContext); healthPaths.spec.js extended for /metrics (exact, with query string) while keeping the existing /_/metrics and /_/metrics/* coverage. 71 unit tests pass.

Issue: BB-764
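A sketch of an isHealthPath helper covering the bare /metrics case described above (the exact path set and prefix handling are assumptions based on the commit message; the real helper lives in lib/tracing/healthPaths.js):

```javascript
// Health/metrics endpoints whose server spans are pure scrape noise.
const HEALTH_PATHS = new Set(['/metrics', '/_/metrics', '/_/healthcheck']);

function isHealthPath(url) {
    const path = (url || '').split('?')[0]; // ignore any query string
    if (HEALTH_PATHS.has(path)) return true;
    // prefixed metrics sub-paths, e.g. /_/metrics/crr
    return path.startsWith('/_/metrics/');
}

isHealthPath('/metrics');            // true: bare Prometheus scrape path
isHealthPath('/metrics?format=txt'); // true: query string ignored
isHealthPath('/_/metrics/crr');      // true: prefixed sub-path
isHealthPath('/api/buckets');        // false: real traffic, keep the span
```

Wired into HttpInstrumentation's ignoreIncomingRequestHook, this drops the per-scrape server spans before they are created.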
```
const span = tracer.startSpan(operationName, {
    kind: SpanKind.CONSUMER,
    links,
}, context.active());
```
The comment on line 97 says "NEW root span — do not pass an active parent" but context.active() is passed as the parent context. If this function is ever called while another span is active (e.g. from within an otelContext.with() block in a future refactor), the span becomes a child instead of a root span, silently breaking the fan-out trace-break semantics.
Use ROOT_CONTEXT to guarantee root-span behavior:

```suggestion
}, ROOT_CONTEXT);
```

And add ROOT_CONTEXT to the import on line 3: `const { context, propagation, trace, SpanKind, ROOT_CONTEXT } = require('@opentelemetry/api');`
— Claude Code
```
        span.end();
        done(err, completionArgs, finishProcessingTask);
    });
});
```
If _queueProcessor throws synchronously (e.g. TypeError from a null entry field), the span created at line 537 is never ended — it leaks. Wrap the _queueProcessor call in try/catch to ensure span.end() is always called. Same pattern applies to LifecycleConductor.processBuckets (line 393).
— Claude Code
Review by Claude Code
Not human-reviewed yet. Not asking for reviews at the moment!
Summary

Add OpenTelemetry tracing to backbeat so that async replication work can be traced back to the original S3 request in Jaeger. This connects the last missing link: arsenal already stamps `traceContext.traceparent` on MongoDB oplog entries, and this PR extracts it and propagates it through backbeat's Kafka pipeline.
Scope: oplog-populator + replication-data-processor only. Other extensions
(lifecycle, GC, notification) follow the same pattern and can be added
incrementally.
What it does

- **OTEL SDK bootstrap** (`lib/otel.js`): mirrors cloudserver's setup — gated by the `ENABLE_OTEL=true` env var, configurable via `OTEL_SERVICE_NAME`, `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`, and `OTEL_SAMPLING_RATIO`. Disables MongoDB/Express instrumentation (not used by backbeat), enables HTTP and AWS SDK auto-instrumentation for outbound S3 calls.
- **Kafka trace context helpers** (`lib/tracing/kafkaTraceContext.js`):
  - `traceHeadersFromEntry()` — extracts `traceparent`/`tracestate` from parsed ObjectMD oplog values
  - `contextFromKafkaHeaders()` — reconstructs OTEL context from node-rdkafka consumer headers
  - `startSpanFromKafkaEntry()` — creates a consumer span linked to the original S3 request trace
- **Producer-side propagation**: `BackbeatProducer.produce()` now passes `item.headers` as the 7th arg to node-rdkafka. `QueuePopulatorExtension.publish()` accepts an optional `headers` param. `ReplicationQueuePopulator` extracts trace context from oplog entries and passes it through.
- **Consumer-side spans**: `BackbeatConsumer._processTask()` creates an OTEL consumer span from Kafka headers and wraps all downstream processing in `context.with()`. Auto-instrumented HTTP calls automatically inject `traceparent` on outbound requests to source/destination cloudservers.
Design decisions

- Outbound propagation from `produce()` and HTTP calls: `@opentelemetry/instrumentation-http` + `instrumentation-aws-sdk` — zero manual work
- SDK setup: `lib/otel.js`, `ENABLE_OTEL` env var, OTLP/HTTP exporter — consistent across services
- Service name: `OTEL_SERVICE_NAME` env var, defaults to `"backbeat"`, override per pod (e.g. `backbeat-replication-data-processor`)
- When `ENABLE_OTEL` is unset, all trace calls are no-ops with negligible cost

End-to-end trace
Verification

- `ENABLE_OTEL=true`
- `OTEL_SERVICE_NAME=backbeat-oplog-populator` / `backbeat-replication-data-processor`
- `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://jaeger:4318/v1/traces`

Issue: BB-764