BB-764: Add OpenTelemetry tracing to backbeat replication pipeline#2733

Draft
delthas wants to merge 3 commits into development/9.3 from improvement/BB-764/otel-replication-tracing

@delthas
Contributor

@delthas delthas commented Apr 14, 2026

Not human-reviewed yet. Not asking for reviews at the moment!

Summary

Add OpenTelemetry tracing to backbeat so that async replication work can be
traced back to the original S3 request in Jaeger. This connects the last
missing link: arsenal already stamps traceContext.traceparent on MongoDB
oplog entries, and this PR extracts it and propagates it through backbeat's
Kafka pipeline.

Scope: oplog-populator + replication-data-processor only. Other extensions
(lifecycle, GC, notification) follow the same pattern and can be added
incrementally.

What it does

  1. OTEL SDK bootstrap (lib/otel.js): mirrors cloudserver's setup — gated
    by ENABLE_OTEL=true env var, configurable via OTEL_SERVICE_NAME,
    OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, and OTEL_SAMPLING_RATIO. Disables
    MongoDB/Express instrumentation (not used by backbeat), enables HTTP and
    AWS SDK auto-instrumentation for outbound S3 calls.

  2. Kafka trace context helpers (lib/tracing/kafkaTraceContext.js):

    • traceHeadersFromEntry() — extracts traceparent/tracestate from
      parsed ObjectMD oplog values
    • contextFromKafkaHeaders() — reconstructs OTEL context from node-rdkafka
      consumer headers
    • startSpanFromKafkaEntry() — creates a consumer span linked to the
      original S3 request trace
  3. Producer-side propagation: BackbeatProducer.produce() now passes
    item.headers as the 7th arg to node-rdkafka. QueuePopulatorExtension.publish()
    accepts an optional headers param. ReplicationQueuePopulator extracts
    trace context from oplog entries and passes it through.

  4. Consumer-side spans: BackbeatConsumer._processTask() creates an OTEL
    consumer span from Kafka headers and wraps all downstream processing in
    context.with(). Auto-instrumented HTTP calls automatically inject
    traceparent on outbound requests to source/destination cloudservers.
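
The helper names in item 2 come from this PR, but their bodies are not shown in the description. A minimal sketch of the header plumbing could look like the following. It assumes that the parsed oplog value exposes `traceContext.traceparent` and `traceContext.tracestate` as described in the summary, and that node-rdkafka takes producer headers as an array of single-key objects and delivers Buffer values on the consumer side. `headersToCarrier` is a hypothetical name, not something from the PR.

```javascript
// W3C traceparent: version-traceid-spanid-flags, all lowercase hex.
const TRACEPARENT_RE = /^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$/;

/**
 * Build node-rdkafka producer headers from a parsed oplog value.
 * Returns undefined when the entry carries no usable trace context,
 * so callers can tell "no headers" apart from an empty list.
 */
function traceHeadersFromEntry(value) {
    const traceContext = value && value.traceContext;
    if (!traceContext || !TRACEPARENT_RE.test(traceContext.traceparent)) {
        return undefined;
    }
    const headers = [{ traceparent: traceContext.traceparent }];
    if (typeof traceContext.tracestate === 'string' && traceContext.tracestate) {
        headers.push({ tracestate: traceContext.tracestate });
    }
    return headers;
}

/**
 * Flatten consumer-side headers (array of { key: Buffer|string }) into a
 * plain object that a W3C propagator could read as a carrier.
 */
function headersToCarrier(kafkaHeaders) {
    const carrier = {};
    for (const header of kafkaHeaders || []) {
        for (const [key, val] of Object.entries(header)) {
            carrier[key.toLowerCase()] = Buffer.isBuffer(val) ? val.toString('utf8') : String(val);
        }
    }
    return carrier;
}
```

On the producer side the result of `traceHeadersFromEntry()` would be handed to `produce()` as its seventh argument; on the consumer side the carrier is what a propagator's extract step would read.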

Design decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Kafka trace format | Message headers | OTEL standard for messaging; node-rdkafka supports it as 7th arg to produce() |
| Outbound HTTP | Auto-instrumentation | @opentelemetry/instrumentation-http + instrumentation-aws-sdk; zero manual work |
| SDK pattern | Mirror cloudserver | Same lib/otel.js, ENABLE_OTEL env var, OTLP/HTTP exporter; consistent across services |
| Service naming | OTEL_SERVICE_NAME env var | Defaults to "backbeat", override per pod (e.g. backbeat-replication-data-processor) |
| Scope | Replication only | Most interesting trace path; other extensions follow same pattern in follow-up PRs |
| Zero overhead when disabled | No-op OTEL API | When ENABLE_OTEL is unset, all trace calls are no-ops with negligible cost |

End-to-end trace

S3 PutObject (trace_id=abc123)
  ├─ cloudserver span
  │    └─ mongodb span → oplog carries traceContext.traceparent
  │
  └─ [async, seconds later]
     backbeat-replication-data-processor span (backbeat-replication.process)
       ├─ linked to abc123 via traceparent from Kafka header
       ├─ S3 GET span → source cloudserver (auto-instrumented)
       └─ S3 PUT span → destination cloudserver (auto-instrumented)

Verification

  1. Deploy oplog-populator and replication-data-processor with:
    • ENABLE_OTEL=true
    • OTEL_SERVICE_NAME=backbeat-oplog-populator / backbeat-replication-data-processor
    • OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://jaeger:4318/v1/traces
  2. Put an object into a CRR-enabled bucket
  3. In Jaeger, the PUT trace should show a child span from backbeat linked to the same trace ID
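
For reference, the environment variables from step 1 could be read along these lines. This is a sketch, not the contents of lib/otel.js: `readOtelConfig` is a hypothetical helper, and the fallback sampling ratio of 1 (sample everything) is an assumption, since the default is not stated here.

```javascript
/**
 * Read tracing settings from an environment object (defaults to process.env).
 * The fallback ratio of 1 is an assumption made for this sketch.
 */
function readOtelConfig(env = process.env) {
    const parsed = Number.parseFloat(env.OTEL_SAMPLING_RATIO);
    // Clamp to [0, 1]; fall back to 1 when unset or not a number.
    const samplingRatio = Number.isFinite(parsed) ? Math.min(1, Math.max(0, parsed)) : 1;
    return {
        // Only the literal string 'true' enables tracing.
        enabled: env.ENABLE_OTEL === 'true',
        serviceName: env.OTEL_SERVICE_NAME || 'backbeat',
        tracesEndpoint: env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT,
        samplingRatio,
    };
}
```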

Issue: BB-764

@bert-e
Contributor

bert-e commented Apr 14, 2026

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Contributor

bert-e commented Apr 14, 2026

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

@codecov

codecov bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 68.53933% with 56 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.62%. Comparing base (79e1ace) to head (d4c1cea).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| lib/otel.js | 50.98% | 25 Missing ⚠️ |
| lib/tracing/kafkaTraceContext.js | 69.09% | 17 Missing ⚠️ |
| lib/BackbeatConsumer.js | 81.25% | 3 Missing ⚠️ |
| ...cation/destination/KafkaNotificationDestination.js | 66.66% | 2 Missing ⚠️ |
| bin/queuePopulator.js | 0.00% | 1 Missing ⚠️ |
| extensions/gc/service.js | 0.00% | 1 Missing ⚠️ |
| extensions/lifecycle/bucketProcessor/task.js | 0.00% | 1 Missing ⚠️ |
| extensions/lifecycle/conductor/service.js | 0.00% | 1 Missing ⚠️ |
| extensions/lifecycle/objectProcessor/task.js | 0.00% | 1 Missing ⚠️ |
| extensions/notification/queueProcessor/task.js | 0.00% | 1 Missing ⚠️ |

... and 3 more

❌ Your patch check has failed because the patch coverage (68.53%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

| Files with missing lines | Coverage Δ |
| --- | --- |
| ...tensions/lifecycle/conductor/LifecycleConductor.js | 84.23% <100.00%> (+0.52%) ⬆️ |
| extensions/lifecycle/tasks/LifecycleTask.js | 91.65% <100.00%> (+0.10%) ⬆️ |
| ...ensions/notification/NotificationQueuePopulator.js | 98.21% <100.00%> (+0.03%) ⬆️ |
| extensions/replication/ReplicationAPI.js | 87.50% <100.00%> (+0.83%) ⬆️ |
| ...xtensions/replication/ReplicationQueuePopulator.js | 93.10% <100.00%> (+2.03%) ⬆️ |
| lib/BackbeatProducer.js | 89.28% <ø> (ø) |
| lib/tracing/healthPaths.js | 100.00% <100.00%> (ø) |
| bin/queuePopulator.js | 0.00% <0.00%> (ø) |
| extensions/gc/service.js | 0.00% <0.00%> (ø) |
| extensions/lifecycle/bucketProcessor/task.js | 0.00% <0.00%> (ø) |

... and 10 more

... and 3 files with indirect coverage changes

| Components | Coverage Δ |
| --- | --- |
| Bucket Notification | 80.20% <66.66%> (-0.18%) ⬇️ |
| Core Library | 80.79% <65.15%> (+0.23%) ⬆️ |
| Ingestion | 70.53% <ø> (-0.62%) ⬇️ |
| Lifecycle | 79.06% <86.20%> (+0.04%) ⬆️ |
| Oplog Populator | 85.83% <ø> (ø) |
| Replication | 59.68% <71.42%> (+0.06%) ⬆️ |
| Bucket Scanner | 85.76% <ø> (ø) |

@@                 Coverage Diff                 @@
##           development/9.3    #2733      +/-   ##
===================================================
+ Coverage            74.50%   74.62%   +0.11%     
===================================================
  Files                  200      203       +3     
  Lines                13610    13783     +173     
===================================================
+ Hits                 10140    10285     +145     
- Misses                3460     3488      +28     
  Partials                10       10              

| Flag | Coverage Δ |
| --- | --- |
| api:retry | 9.02% <0.00%> (-0.12%) ⬇️ |
| api:routes | 8.84% <0.00%> (-0.12%) ⬇️ |
| bucket-scanner | 85.76% <ø> (ø) |
| ft_test:queuepopulator | 10.49% <3.37%> (+1.44%) ⬆️ |
| ingestion | 12.37% <3.37%> (-0.17%) ⬇️ |
| lib | 7.65% <10.11%> (+0.03%) ⬆️ |
| lifecycle | 18.85% <29.21%> (+0.01%) ⬆️ |
| notification | 1.01% <0.00%> (-0.02%) ⬇️ |
| oplogPopulator | 0.14% <0.00%> (-0.01%) ⬇️ |
| replication | 18.38% <10.67%> (-0.11%) ⬇️ |
| unit | 51.18% <55.05%> (+0.06%) ⬆️ |


const tracer = trace.getTracer('backbeat');
const span = tracer.startSpan(operationName, {
kind: 1, // SpanKind.CONSUMER
}, parentCtx);

kind: 1 is SpanKind.SERVER, not SpanKind.CONSUMER. The OTEL enum values are: INTERNAL=0, SERVER=1, CLIENT=2, PRODUCER=3, CONSUMER=4. This will cause consumer spans to show up as SERVER spans in Jaeger. Import and use the constant to avoid magic numbers. — Claude Code
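
To make the mix-up concrete, here is a small sketch that mirrors the enum values quoted in the comment above and looks a raw number back up by name. The local `SpanKind` object is a stand-in for illustration only; real code would import `SpanKind` from `@opentelemetry/api` rather than redefine it, and `spanKindName` is a hypothetical helper.

```javascript
// Local mirror of the SpanKind values listed above, for illustration only.
const SpanKind = Object.freeze({ INTERNAL: 0, SERVER: 1, CLIENT: 2, PRODUCER: 3, CONSUMER: 4 });

// Reverse lookup: what a raw number actually means.
function spanKindName(kind) {
    const match = Object.entries(SpanKind).find(([, value]) => value === kind);
    return match ? match[0] : 'UNKNOWN';
}

console.log(spanKindName(1)); // prints "SERVER", the value used in the PR
console.log(spanKindName(SpanKind.CONSUMER)); // prints "CONSUMER"
```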

Comment thread lib/otel.js Outdated
'service.namespace': process.env.OTEL_SERVICE_NAMESPACE || 'scality',
}),
traceExporter,
sampler: new TraceIdRatioBasedSampler(samplingRatio),

TraceIdRatioBasedSampler ignores the parent's sampling decision. If cloudserver samples a trace (sampled flag=1 in traceparent) but backbeat's ratio-based sampler decides to drop it, the trace will have a gap — the S3 side shows a span but backbeat's consumer span is missing. Wrap with ParentBasedSampler so that incoming sampled traces are always recorded: sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(samplingRatio) }). — Claude Code
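
The decision logic being asked for can be sketched without the SDK. The function below is a simplified model, not the implementation of `ParentBasedSampler` or `TraceIdRatioBasedSampler`: it follows the upstream sampled flag when a valid traceparent is present and only falls back to a ratio check for new roots. The ratio check on the first 32 bits of the trace ID is a stand-in chosen for this sketch, and `shouldSample` is a hypothetical name.

```javascript
/**
 * Decide whether to record a span.
 * - With a valid upstream traceparent, follow its sampled flag
 *   (bit 0 of the trace-flags byte).
 * - Without one (a new root), fall back to a ratio check on the trace ID.
 */
function shouldSample({ traceparent, traceId, ratio }) {
    const parts = typeof traceparent === 'string' ? traceparent.split('-') : [];
    if (parts.length === 4 && /^[0-9a-f]{2}$/.test(parts[3])) {
        return (Number.parseInt(parts[3], 16) & 0x01) === 0x01;
    }
    // Root span: map the first 32 bits of the trace ID onto [0, 1).
    const bucket = Number.parseInt(traceId.slice(0, 8), 16) / 0x100000000;
    return bucket < ratio;
}
```

With this shape, a trace that cloudserver marked as sampled is kept even when the local ratio is 0, which is the gap the comment describes.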

Comment thread lib/BackbeatConsumer.js
otelContext.with(ctx, () => {
this._queueProcessor(entry, (err, completionArgs) => {
if (err) span.recordException(err);
span.end();

recordException only adds an event to the span — it does not set the span status. Without calling span.setStatus({ code: SpanStatusCode.ERROR }), failed spans will still appear as OK in Jaeger. Add span.setStatus({ code: 2 }) (SpanStatusCode.ERROR) alongside recordException. — Claude Code

@claude

claude bot commented Apr 14, 2026

  • Wrong SpanKind value: kind: 1 is SpanKind.SERVER, not SpanKind.CONSUMER (which is 4). Consumer spans will appear as SERVER in Jaeger.
    • Use SpanKind.CONSUMER (4) or import the constant from @opentelemetry/api.
  • Sampler does not respect parent context: TraceIdRatioBasedSampler alone will drop traces that were sampled upstream by cloudserver, creating gaps in end-to-end traces.
    • Wrap with ParentBasedSampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(ratio) }).
  • Missing span status on error: recordException adds an event but does not mark the span as failed. Errors will appear as OK in Jaeger.
    • Add span.setStatus({ code: SpanStatusCode.ERROR }) alongside recordException.

Review by Claude Code

@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 9d08f7b to 2f7afb0 Compare April 14, 2026 10:44
Comment thread package.json Outdated
"@opentelemetry/instrumentation-http": "^0.55.0",
"@opentelemetry/resources": "^1.30.1",
"@opentelemetry/sdk-node": "^0.55.0",
"@opentelemetry/sdk-trace-base": "^2.0.1",

@opentelemetry/sdk-trace-base is pinned at ^2.0.1 (resolves to 2.6.1), but @opentelemetry/sdk-node@0.55.0 depends on sdk-trace-base@1.28.0. The TraceIdRatioBasedSampler imported in lib/otel.js comes from v2, while the NodeSDK internals use v1. This cross-major-version mismatch can cause silent sampling failures at runtime. Pin to ^1.28.0 to match the version used by sdk-node.

— Claude Code
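
One way to catch this kind of skew before runtime is to group resolved versions per package and flag any package that resolves to more than one major version. The sketch below works on a plain list of `name@version` strings; how that list is extracted from yarn.lock is left out, and `findMajorConflicts` is a hypothetical helper, not part of the PR.

```javascript
/**
 * Given resolved specs such as '@scope/pkg@1.2.3', return the packages
 * that appear with more than one major version.
 */
function findMajorConflicts(resolved) {
    const majorsByName = new Map();
    for (const spec of resolved) {
        // Split on the last '@' so scoped names like @scope/pkg stay intact.
        const at = spec.lastIndexOf('@');
        const name = spec.slice(0, at);
        const major = spec.slice(at + 1).split('.')[0];
        if (!majorsByName.has(name)) {
            majorsByName.set(name, new Set());
        }
        majorsByName.get(name).add(major);
    }
    return [...majorsByName]
        .filter(([, majors]) => majors.size > 1)
        .map(([name, majors]) => ({ name, majors: [...majors].sort() }));
}
```

Note that this only detects cross-major duplicates. For 0.x packages, where each minor can be breaking, the comparison would need to look at the minor component as well.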

Comment thread lib/BackbeatConsumer.js
otelContext.with(ctx, () => {
this._queueProcessor(entry, (err, completionArgs) => {
if (err) span.recordException(err);
span.end();

recordException adds an event to the span but does not mark the span as failed. Without setStatus, error spans will appear as successful in Jaeger/Grafana Tempo. Add span.setStatus({ code: 2, message: err.message }) alongside recordException.

Suggested change
span.end();
if (err) {
span.recordException(err);
span.setStatus({ code: 2, message: err.message });
}

— Claude Code

Comment thread lib/tracing/kafkaTraceContext.js Outdated
const parentCtx = contextFromKafkaHeaders(kafkaEntry.headers);
const tracer = trace.getTracer('backbeat');
const span = tracer.startSpan(operationName, {
kind: 1, // SpanKind.CONSUMER

Magic number 1 for SpanKind.CONSUMER. Since @opentelemetry/api is already imported in this file, use the named constant for readability.

Suggested change
kind: 1, // SpanKind.CONSUMER
const span = tracer.startSpan(operationName, {
kind: 4, // SpanKind.CONSUMER (from @opentelemetry/api)

Actually, SpanKind should be imported from @opentelemetry/api: const { SpanKind } = require('@opentelemetry/api') then use kind: SpanKind.CONSUMER.

— Claude Code

@claude

claude bot commented Apr 14, 2026

  • @opentelemetry/sdk-trace-base version mismatch (package.json:58): ^2.0.1 resolves to v2.6.1, but @opentelemetry/sdk-node@0.55.0 uses sdk-trace-base@1.28.0 internally. The TraceIdRatioBasedSampler from v2 is passed to a NodeSDK expecting v1, risking silent sampling failures.
    • Pin @opentelemetry/sdk-trace-base to ^1.28.0 to match the version used by sdk-node.
  • Missing span.setStatus on error (lib/BackbeatConsumer.js:527): recordException adds an event but does not mark the span as failed. Error spans will appear as successful in Jaeger/Grafana Tempo.
    • Add span.setStatus({ code: 2, message: err.message }) alongside recordException.
  • Magic number for SpanKind (lib/tracing/kafkaTraceContext.js:48): kind: 1 is used instead of the named constant from @opentelemetry/api.
    • Import SpanKind from @opentelemetry/api and use SpanKind.CONSUMER.

Review by Claude Code

@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 2f7afb0 to d562a0a Compare April 14, 2026 10:47
Comment thread lib/BackbeatConsumer.js Outdated

otelContext.with(ctx, () => {
this._queueProcessor(entry, (err, completionArgs) => {
if (err) span.recordException(err);

recordException adds an exception event but does not set the span status to ERROR. Without span.setStatus({ code: SpanStatusCode.ERROR }), error traces won't appear as failures in Jaeger/tracing UIs.

```suggestion
if (err) {
    const { SpanStatusCode } = require('@opentelemetry/api');
    span.recordException(err);
    span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
}
```

Comment thread package.json Outdated
"@opentelemetry/instrumentation-http": "^0.55.0",
"@opentelemetry/resources": "^1.30.1",
"@opentelemetry/sdk-node": "^0.55.0",
"@opentelemetry/sdk-trace-base": "^2.0.1",

@opentelemetry/sdk-trace-base is pinned at ^2.0.1 (resolves to 2.6.1), but @opentelemetry/sdk-node@0.55.0 depends on sdk-trace-base@1.28.0. This installs two major versions side-by-side. The TraceIdRatioBasedSampler imported from v2 is passed into NodeSDK which internally uses v1 — if the Sampler interface diverges between majors, this will break silently. Pin to ^1.28.0 to match what sdk-node expects, or upgrade sdk-node to a version that depends on sdk-trace-base v2.

— Claude Code

Comment thread lib/tracing/kafkaTraceContext.js Outdated
const parentCtx = contextFromKafkaHeaders(kafkaEntry.headers);
const tracer = trace.getTracer('backbeat');
const span = tracer.startSpan(operationName, {
kind: 1, // SpanKind.CONSUMER

Use SpanKind.CONSUMER from @opentelemetry/api instead of the magic number 1. The numeric value is an internal detail that could change across versions.

Suggested change
const span = tracer.startSpan(operationName, {
    kind: trace.SpanKind ? trace.SpanKind.CONSUMER : 1, // SpanKind.CONSUMER
}, parentCtx);

Actually, the cleaner fix is to import SpanKind at the top of the file:
const { context, propagation, trace, SpanKind } = require('@opentelemetry/api');
then use kind: SpanKind.CONSUMER.

— Claude Code

@claude

claude bot commented Apr 14, 2026

  • Missing span error status (lib/BackbeatConsumer.js:526): span.recordException(err) adds an event but does not mark the span as failed. Must also call span.setStatus({ code: SpanStatusCode.ERROR }) or error traces won't show as failures in Jaeger.
    • Import SpanStatusCode from @opentelemetry/api and call span.setStatus() alongside recordException.
  • OTEL dependency version conflict (package.json:58): @opentelemetry/sdk-trace-base@^2.0.1 resolves to v2.6.1 but @opentelemetry/sdk-node@0.55.0 depends on sdk-trace-base@1.28.0. Two major versions are installed side-by-side; the TraceIdRatioBasedSampler from v2 is passed to NodeSDK which uses v1 internals.
    • Pin sdk-trace-base to ^1.28.0 to align with sdk-node, or upgrade all OTEL packages to a consistent release line.
  • Magic number for SpanKind (lib/tracing/kafkaTraceContext.js:48): kind: 1 should use the named constant SpanKind.CONSUMER from @opentelemetry/api.
    • Import SpanKind and use kind: SpanKind.CONSUMER.

Review by Claude Code

Comment thread package.json
"dependencies": {
"@opentelemetry/api": "^1.9.0",
"@opentelemetry/auto-instrumentations-node": "^0.50.2",
"@opentelemetry/context-async-hooks": "^1.28.0",

@opentelemetry/auto-instrumentations-node@0.50.2 is ~5 minor versions behind @opentelemetry/sdk-node@0.55.0. This installs duplicate versions of core OTEL packages (e.g. @opentelemetry/instrumentation 0.53.0 and 0.55.0, @opentelemetry/api-logs 0.53.0 and 0.55.0). Two copies of the instrumentation core can cause hooks registered by one version not to be visible to the other.

Bump auto-instrumentations-node to ^0.55.0 (or whichever minor matches sdk-node) so all OTEL packages resolve to a single version.

— Claude Code

Comment thread package.json
"@opentelemetry/api": "^1.9.0",
"@opentelemetry/auto-instrumentations-node": "^0.50.2",
"@opentelemetry/context-async-hooks": "^1.28.0",
"@opentelemetry/exporter-trace-otlp-http": "^0.55.0",

auto-instrumentations-node bundles instrumentations for ~35 libraries (cassandra, pg, mysql, graphql, grpc, etc.). Only HTTP, AWS SDK, and ioredis are enabled here. Using the three individual packages directly would cut ~1200 lines from yarn.lock and avoid pulling unused transitive deps into the image.

— Claude Code

Comment thread lib/BackbeatConsumer.js
const { ctx, span } = startSpanFromKafkaEntry(entry, `${topic}.process`);
span.setAttribute('messaging.kafka.topic', topic);
span.setAttribute('messaging.kafka.partition', partition);


Consider adding messaging.kafka.consumer_group — it helps distinguish spans from different consumer groups in Jaeger (e.g. replication vs lifecycle processors running in the same cluster).

```suggestion
span.setAttribute('messaging.kafka.topic', topic);
span.setAttribute('messaging.kafka.partition', partition);
span.setAttribute('messaging.kafka.consumer_group', this._groupId);
```

@claude

claude bot commented Apr 14, 2026

  • OTEL package version mismatch: auto-instrumentations-node@0.50.2 vs sdk-node@0.55.0 installs duplicate versions of core OTEL internals (instrumentation 0.53.0 + 0.55.0, api-logs 0.53.0 + 0.55.0). Align versions to avoid subtle instrumentation registration issues.
    • Bump auto-instrumentations-node to ^0.55.0 or match the sdk-node minor
  • auto-instrumentations-node bundles ~35 instrumentations, only 3 are enabled. Individual packages would significantly reduce dependency weight.
    • Replace with direct deps on instrumentation-http, instrumentation-aws-sdk, instrumentation-ioredis
  • Consumer spans missing consumer_group attribute, useful for filtering in Jaeger.
    • Add span.setAttribute('messaging.kafka.consumer_group', this._groupId)

Review by Claude Code

@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from d562a0a to 51a9f61 Compare April 14, 2026 11:00
Comment thread package.json
"homepage": "https://github.com/scality/backbeat#readme",
"dependencies": {
"@opentelemetry/api": "^1.9.0",
"@opentelemetry/auto-instrumentations-node": "^0.50.2",

OTEL package version skew: auto-instrumentations-node@^0.50.2 is from the 0.53 OTEL release line, while sdk-node, instrumentation-http, and exporter-trace-otlp-http are all ^0.55. The yarn.lock confirms duplicates — e.g. @opentelemetry/instrumentation installs as both 0.53.0 and 0.55.0, and instrumentation-http as both 0.53.0 and 0.55.0. Because the SDK and the auto-instrumentation plugins end up referencing different copies of the base @opentelemetry/instrumentation module, auto-instrumentation can silently fail at runtime.

Bump auto-instrumentations-node to a version compatible with the 0.55 ecosystem, and consider dropping instrumentation-http, instrumentation-aws-sdk, and context-async-hooks as direct deps — they are pulled transitively from auto-instrumentations-node and sdk-node.

— Claude Code

@claude

claude bot commented Apr 14, 2026

  • OTEL package version skew (package.json): auto-instrumentations-node@^0.50.2 belongs to the OTEL 0.53 release line but is mixed with 0.55 packages (sdk-node, instrumentation-http, exporter-trace-otlp-http). yarn.lock confirms duplicate copies of @opentelemetry/instrumentation (0.53 and 0.55), which can cause auto-instrumentation to silently break at runtime.
    - Bump auto-instrumentations-node to a version compatible with the 0.55 OTEL ecosystem.
    - Drop instrumentation-http, instrumentation-aws-sdk, and context-async-hooks as direct deps — they come transitively.

    Review by Claude Code

Add OTEL SDK setup and trace context propagation through backbeat's
Kafka-based replication pipeline, linking async replication work back
to the original S3 request trace in Jaeger.

- Add lib/otel.js mirroring cloudserver's SDK setup pattern
  (gated by ENABLE_OTEL env var, configurable sampling/exporter)
- Add lib/tracing/kafkaTraceContext.js with helpers to extract
  traceparent from oplog entries and Kafka headers
- Extend BackbeatProducer.produce() to pass message headers (7th arg)
- Extend QueuePopulatorExtension.publish() to accept headers
- Extract traceContext from oplog entries in ReplicationQueuePopulator
  and forward as Kafka message headers
- Wrap BackbeatConsumer._processTask() in OTEL consumer spans linked
  to the original trace via Kafka headers
- Bootstrap OTEL SDK in queuePopulator and replication processor
  entry points
- Add unit tests for kafkaTraceContext helpers

Issue: BB-764
@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 51a9f61 to b9d3528 Compare April 14, 2026 16:07
Comment thread lib/BackbeatConsumer.js
span.end();
done(err, completionArgs, finishProcessingTask);
});
});

If _queueProcessor throws synchronously, span.end() is never called and the span leaks. Wrap the call in try/catch to ensure cleanup:

Suggested change
otelContext.with(ctx, () => {
    try {
        this._queueProcessor(entry, (err, completionArgs) => {
            if (err) {
                span.recordException(err);
                span.setStatus({ code: SpanStatusCode.ERROR });
            }
            span.end();
            done(err, completionArgs, finishProcessingTask);
        });
    } catch (err) {
        span.recordException(err);
        span.setStatus({ code: SpanStatusCode.ERROR });
        span.end();
        throw err;
    }

— Claude Code
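
A variation on the suggestion above, sketched under assumptions: if the processor invokes its callback synchronously and `done()` then throws, the `catch` branch would end a span that was already ended. A once-guard avoids that. `processWithSpan` is a hypothetical helper that accepts any object with `recordException`, `setStatus` and `end` methods; the numeric status code stands in for `SpanStatusCode.ERROR`, which real code would import from `@opentelemetry/api`.

```javascript
const SPAN_STATUS_ERROR = 2; // numeric value of SpanStatusCode.ERROR

/**
 * Run a callback-style processor and end the span exactly once,
 * whether it completes via callback or throws synchronously.
 */
function processWithSpan(span, processor, entry, done) {
    let ended = false;
    const finish = err => {
        if (ended) {
            return;
        }
        ended = true;
        if (err) {
            span.recordException(err);
            span.setStatus({ code: SPAN_STATUS_ERROR, message: err.message });
        }
        span.end();
    };
    try {
        processor(entry, (err, completionArgs) => {
            finish(err);
            done(err, completionArgs);
        });
    } catch (err) {
        // Reached on a synchronous throw from the processor, or from done()
        // when the callback ran synchronously; finish() is a no-op in the
        // latter case because the span has already been ended.
        finish(err);
        throw err;
    }
}
```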

@claude

claude bot commented Apr 14, 2026

  • Span leak in BackbeatConsumer._processTask: if _queueProcessor throws synchronously, span.end() is never called. Wrap in try/catch to ensure the span is always ended.
    - Add try/catch around this._queueProcessor() inside otelContext.with(), record the exception on the span, end it, then re-throw.

    Review by Claude Code

Comment thread OTEL.md Outdated
**Status**: done on branch `improvement/BB-764/otel-replication-tracing`
(PR [scality/backbeat#2733](https://github.com/scality/backbeat/pull/2733)).

### Trust boundary (Part B)

This references a local Claude Code plan file path (~/.claude/plans/whimsical-pondering-swan.md) that won't exist on other developers' machines. Consider removing this line or replacing it with a link to the Jira ticket or another shared resource.

— Claude Code

@claude

claude bot commented Apr 15, 2026

Well-designed PR. The OTEL integration follows standard patterns (no-op API when disabled, ParentBasedSampler, trust boundary header stripping), and the link-vs-parent distinction for fan-out consumers is a smart choice to keep Jaeger traces manageable.

- OTEL.md references a local Claude Code plan file path (~/.claude/plans/whimsical-pondering-swan.md) that won't exist on other developers' machines — remove or replace with a shared reference.

Review by Claude Code

Part B — trust boundary:
- Add lib/tracing/healthPaths.js (isHealthPath helper)
- lib/otel.js wires HttpInstrumentation:
  - ignoreIncomingRequestHook drops health/OPTIONS spans
  - requestHook strips traceparent/tracestate on outbound calls to
    untrusted hosts, keeps the client span (prevents leaks to
    AWS/Azure/GCP/remote-Artesca replication destinations)
  - buildTrustedHosts(config) derives allowlist from replication source,
    Kafka brokers, MongoDB, server.host, Redis; refuse-by-default

Part D — extension to remaining pods:
- Bootstrap OTEL in 6 entry points: notification queue processor,
  lifecycle conductor/bucket/object processors, gc, replication-status
- Lifecycle conductor wraps processBuckets in a root
  lifecycle.conductor.scan span (kind=INTERNAL); bucket-task messages
  now carry traceparent so bucket-processor consumers become children
- LifecycleTask produces object-tasks and transition-tasks with
  link-headers (not traceparent) — fan-out break prevents 1M-span
  traces from breaking the Jaeger UI
- BackbeatConsumer auto-detects link-headers and uses
  startLinkedSpanFromKafkaEntry (new OTEL Link instead of parent-child)
- ReplicationAPI.sendDataMoverAction attaches link-headers when used
  by lifecycle transitions (only caller)
- KafkaNotificationDestination.send strips any trace headers before
  producing to the external customer Kafka destination
- New helpers in lib/tracing/kafkaTraceContext.js:
  linkContextFromKafkaHeaders, startLinkedSpanFromKafkaEntry,
  linkHeadersFromCurrentContext, traceHeadersFromCurrentContext
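
As an illustration of the header stripping described for KafkaNotificationDestination.send, a sketch could look like the following. It assumes node-rdkafka style headers (an array of single-key objects) and only covers the two W3C header names; the link-header names used by this PR are not shown here, so they would have to be added to the set. `stripTraceHeaders` is a hypothetical name.

```javascript
const TRACE_HEADER_KEYS = new Set(['traceparent', 'tracestate']);

/**
 * Return a copy of node-rdkafka style headers with W3C trace headers
 * removed. Returns undefined if nothing is left, and never mutates the input.
 */
function stripTraceHeaders(headers) {
    if (!Array.isArray(headers)) {
        return undefined;
    }
    const kept = headers
        .map(header => Object.fromEntries(
            Object.entries(header).filter(([key]) => !TRACE_HEADER_KEYS.has(key.toLowerCase()))))
        .filter(header => Object.keys(header).length > 0);
    return kept.length > 0 ? kept : undefined;
}
```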

Tests: 35 unit tests covering healthPaths, buildTrustedHosts
(including Config-hosts-subset + bootstrapList-exclusion), and the new
kafkaTraceContext helpers.

OTEL.md updated to reflect completion.

Issue: BB-764
@delthas delthas force-pushed the improvement/BB-764/otel-replication-tracing branch from 970a811 to 849d6b0 Compare April 15, 2026 15:22
links,
}, context.active());

return { ctx: trace.setSpan(context.active(), span), span };

startLinkedSpanFromKafkaEntry is documented as creating a NEW root span, but it passes context.active() as the parent context to tracer.startSpan(). If an ambient span exists in the active context (e.g. if this function is ever called inside an otelContext.with() block), the span becomes a child rather than a root — violating the function's contract.

Use ROOT_CONTEXT instead to guarantee a true root span. You'll need to add ROOT_CONTEXT to the destructured import from @opentelemetry/api at the top of the file.

— Claude Code

Comment thread lib/otel.js
requestHook: (span, request) => {
const rawHost = (request.getHeader?.('host') || '').toString();
const host = rawHost.toLowerCase().split(':')[0];
if (host && !TRUSTED_HOSTS.has(host)) {

When request.getHeader('host') returns undefined (e.g. HTTP/2 uses :authority instead of Host), host becomes an empty string, and the if (host && ...) guard is falsy — meaning traceparent/tracestate headers are NOT stripped. Consider defaulting to strip when the host is unknown:

if (!host || !TRUSTED_HOSTS.has(host))

— Claude Code
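
The deny-by-default guard can be pulled into a small pure function, which also makes it easy to unit-test. This is a sketch under assumptions: `hostnameFromHostHeader` and `shouldStripTraceHeaders` are hypothetical names, and the handling of bracketed IPv6 literals assumes the trusted set stores them without brackets.

```javascript
/** Extract a lowercase hostname from a Host header value, dropping any port. */
function hostnameFromHostHeader(rawHost) {
    const value = String(rawHost || '').trim().toLowerCase();
    if (value.startsWith('[')) {
        // Bracketed IPv6 literal, e.g. "[::1]:8000".
        const end = value.indexOf(']');
        return end === -1 ? '' : value.slice(1, end);
    }
    return value.split(':')[0];
}

/** Deny by default: strip unless the host is known and trusted. */
function shouldStripTraceHeaders(rawHost, trustedHosts) {
    const host = hostnameFromHostHeader(rawHost);
    return !host || !trustedHosts.has(host);
}
```

Inside `requestHook`, the result would decide whether to remove traceparent and tracestate from the outgoing request.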

@claude

claude bot commented Apr 15, 2026

  • startLinkedSpanFromKafkaEntry passes context.active() instead of ROOT_CONTEXT when creating what is documented as a new root span. If called within an existing span context, the span becomes a child rather than a root, breaking the fan-out trace isolation design.
    • Import ROOT_CONTEXT from @opentelemetry/api and use it as the third arg to tracer.startSpan() and in the returned trace.setSpan() call.
  • requestHook in lib/otel.js does not strip traceparent/tracestate when the host header is missing or empty, defaulting to allow rather than deny.
    • Invert the guard: if (!host || !TRUSTED_HOSTS.has(host)) to strip headers for unknown hosts.

Review by Claude Code

Two issues surfaced after enabling OTEL on the cluster:

1. NotificationQueuePopulator dropped the oplog trace context. Replication
   already extracts value.traceContext via traceHeadersFromEntry() and
   passes it to publish(..., headers); the notification producer was
   missed. Result: notification-processor-azure consumer spans appeared
   as orphan roots in Jaeger instead of children of the causing S3 PUT
   trace. Fix mirrors ReplicationQueuePopulator.

2. /metrics on processor pods wasn't in healthPaths.js. Prometheus scrapes
   each pod at :8901/metrics (no /_/ prefix) every 15s; every scrape
   created a server span — ~60 noise spans / 15min × 14 pods. Added
   /metrics to the explicit Set.
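
A sketch of what such a helper might look like, assuming an exact-match Set plus a prefix list. The `/metrics`, `/_/metrics` and `/_/metrics/*` entries come from this commit message; `/_/healthcheck` is an assumed example of a health route, and the real contents of healthPaths.js may differ.

```javascript
// '/metrics' and '/_/metrics' come from the commit message above;
// '/_/healthcheck' is an assumed example of a health route.
const EXACT_HEALTH_PATHS = new Set(['/metrics', '/_/metrics', '/_/healthcheck']);
const HEALTH_PREFIXES = ['/_/metrics/'];

function isHealthPath(url) {
    if (typeof url !== 'string') {
        return false;
    }
    // Ignore the query string and fragment when matching.
    const path = url.split(/[?#]/)[0];
    return EXACT_HEALTH_PATHS.has(path) || HEALTH_PREFIXES.some(prefix => path.startsWith(prefix));
}
```

Matching exact paths rather than substrings keeps object keys that merely contain "metrics" from being dropped by the ignore hook.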

Tests: 3 new cases on NotificationQueuePopulator covering traceparent
propagation (with + without tracestate, and no-op when the oplog entry
has no traceContext); healthPaths.spec.js extended for /metrics
(exact, with query string) while keeping the existing /_/metrics and
/_/metrics/* coverage. 71 unit tests pass.

Issue: BB-764
const span = tracer.startSpan(operationName, {
kind: SpanKind.CONSUMER,
links,
}, context.active());

The comment on line 97 says "NEW root span — do not pass an active parent" but context.active() is passed as the parent context. If this function is ever called while another span is active (e.g. from within an otelContext.with() block in a future refactor), the span becomes a child instead of a root span, silently breaking the fan-out trace-break semantics.

Use ROOT_CONTEXT to guarantee root-span behavior:

Suggested change
}, ROOT_CONTEXT);

And add ROOT_CONTEXT to the import on line 3:
const { context, propagation, trace, SpanKind, ROOT_CONTEXT } = require('@opentelemetry/api');

— Claude Code

Comment thread lib/BackbeatConsumer.js
span.end();
done(err, completionArgs, finishProcessingTask);
});
});

If _queueProcessor throws synchronously (e.g. TypeError from a null entry field), the span created at line 537 is never ended — it leaks. Wrap the _queueProcessor call in try/catch to ensure span.end() is always called. Same pattern applies to LifecycleConductor.processBuckets (line 393).

— Claude Code

@claude

claude bot commented Apr 15, 2026

  • startLinkedSpanFromKafkaEntry uses context.active() as parent context despite the comment saying "NEW root span — do not pass an active parent." Use ROOT_CONTEXT to guarantee root-span semantics.
    • Import ROOT_CONTEXT from @opentelemetry/api and pass it as the 3rd arg to tracer.startSpan() in lib/tracing/kafkaTraceContext.js:102
  • Span leak if _queueProcessor (or _processBucketsInternal) throws synchronously — the span is never ended.
    • Wrap the callback-based call in try/catch in BackbeatConsumer._processTask and LifecycleConductor.processBuckets to ensure span.end() on synchronous throws

Review by Claude Code


2 participants