monitor lifecycle conductor #2723
Conversation
Hello benzekrimaha, my role is to assist you with the merge of this pull request.

Available options

Available commands

Status report is not available.
force-pushed 380069a → 25ea9d5
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:
... and 6 files with indirect coverage changes
@@ Coverage Diff @@
## development/9.3 #2723 +/- ##
===================================================
- Coverage 74.48% 74.46% -0.02%
===================================================
Files 200 200
Lines 13603 13664 +61
===================================================
+ Hits 10132 10175 +43
- Misses 3461 3479 +18
Partials 10 10
Flags with carried forward coverage won't be shown.
force-pushed 8316f88 → 408c96c
force-pushed 408c96c → e1c5b13
force-pushed e1c5b13 → aefb677
const log = this.logger.newRequestLogger();
- const start = new Date();
+ const start = Date.now();
this._scanId = uuid();
Hmm, we're storing the scan ID as a "global" field variable, but it sounds like it is really relevant/used only inside this function (through indirect calls). Could we drop the global field and instead pass it through to whatever uses it? Maybe in _createBucketTaskMessages?
force-pushed 725c3df → 11a94ea
force-pushed a2128cf → a464b39
const scanId = uuid();
this._currentScanId = scanId;
log.addDefaultFields({ conductorScanId: scanId });
LifecycleMetrics.onProcessBuckets(log);
onProcessBuckets was previously called at scan completion (after the success log), so s3_lifecycle_latest_batch_start_time was really a "latest batch completion" timestamp. Moving it to scan start is semantically correct, but the existing LifecycleLateScan alert in monitoring/lifecycle/alerts.yaml still depends on this metric with the description "Last lifecycle scan was performed more than … ago." With the new timing, a conductor that starts scans but never finishes them would keep refreshing this metric, delaying alert firing. Consider adding a companion alert on s3_lifecycle_latest_batch_end_time to catch scans that start but never complete, or updating the alert description.
— Claude Code
force-pushed 3d8286a → ca6f753
LGTM - well-structured monitoring enhancement. The conductorScanId propagation is consistent across all lifecycle task types, the stale metric cleanup in LifecycleMetrics avoids unbounded prom-client memory growth, the semantic change to onProcessBuckets (called at scan start rather than end) is correctly reflected in the alert wording, and the new onConductorScanComplete metrics cover the completion/duration use case. Test coverage is thorough.
LGTM |
LGTM — clean, well-tested implementation. The scan-id propagation is thorough.
force-pushed bc5ef91 → 1a88160
LGTM — clean, well-tested addition of conductor scan monitoring. The scan-id propagation through conductor → bucket tasks → bucket processor is consistent.
force-pushed 1a88160 → 6967a0b
force-pushed 2ea683f → f4fab73
'current/noncurrent/orphan splits in v2) increments this counter once ' +
'per slice. Multiple conductor_scan_id label values appearing at the ' +
'same time indicate overlapping scans.',
labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID],
The conductor_scan_id label creates a new time series per scan (UUID). While the 24h removeStaleBucketProcessorScanMetrics cleanup bounds prom-client process memory, every unique scan ID still produces a distinct series in the Prometheus TSDB until it becomes stale (5 min after the last scrape). With a typical lifecycle interval of ~6 min, that is ~240 distinct label values/day/pod. This is manageable but worth documenting: if the scan interval is ever shortened (e.g. 1 min), cardinality rises proportionally. Consider adding a note in the metric help string about the expected cardinality bounds and the cleanup mechanism.
— Claude Code
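The 24h cleanup bookkeeping described above can be sketched as follows. This is a hedged illustration, not Backbeat's actual `removeStaleBucketProcessorScanMetrics`: the `ScanMetricJanitor` class name and the `removeSeries` callback (which would wrap prom-client's `counter.remove()`) are assumptions; the 24h window comes from the review text.

```javascript
// Sketch of per-scan label cleanup: each conductor scan ID creates one
// time series; series with no activity for 24h are dropped so prom-client
// memory stays bounded. `removeSeries` stands in for counter.remove().
class ScanMetricJanitor {
    constructor(removeSeries, maxAgeMs = 24 * 3600 * 1000) {
        this.removeSeries = removeSeries;
        this.maxAgeMs = maxAgeMs;
        this.lastSeen = new Map(); // scanId -> last activity timestamp (ms)
    }

    // Record activity for a scan ID (called whenever its counter increments).
    touch(scanId, now = Date.now()) {
        this.lastSeen.set(scanId, now);
    }

    // Drop every scan ID whose last activity is older than maxAgeMs.
    sweep(now = Date.now()) {
        for (const [scanId, ts] of this.lastSeen) {
            if (now - ts > this.maxAgeMs) {
                this.removeSeries(scanId);
                this.lastSeen.delete(scanId);
            }
        }
    }
}
```

At a ~6 min scan interval this caps the live set at roughly 24 × 60 / 6 = 240 scan IDs per pod, matching the cardinality estimate in the review.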
LGTM — clean, well-tested observability improvement for the lifecycle conductor pipeline.
}

- listBucketdBuckets(queue, initMarker, log, cb) {
+ listBucketdBuckets(queue, initMarker, scanId, log, cb) {
Minor inconsistency: processBuckets was converted to Date.now() but the rate calculations in listBucketdBuckets (line 623, 636) and listMongodbBuckets (line 722, 792) still use new Date(). Both work, but since these methods are already touched by this PR (new scanId parameter), it would be cleaner to align them.
— Claude Code
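The alignment being suggested amounts to computing the listing rate from millisecond timestamps throughout. A small self-contained sketch (the function name and signature are illustrative, not the actual Backbeat code):

```javascript
// Illustrative rate helper using Date.now() millisecond timestamps.
// `new Date()` also works here because arithmetic coerces it via
// valueOf(), but mixing both styles in one file is what the review
// comment flags.
function listingRate(bucketCount, startMs, nowMs = Date.now()) {
    const elapsedSec = (nowMs - startMs) / 1000;
    // Guard against a zero-length interval to avoid division by zero.
    return elapsedSec > 0 ? bucketCount / elapsedSec : 0;
}
```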
force-pushed bbef626 → 49a799a
scanStarted = true;
this._currentScanId = scanId;
log.addDefaultFields({ conductorScanId: scanId });
LifecycleMetrics.onProcessBuckets(log);
Moving onProcessBuckets from scan completion to scan start changes the semantics of the LifecycleLateScan alert. Previously, the alert fired when no scan completed within the threshold. Now it fires when no scan was started — so a conductor scan that starts but gets stuck indefinitely will no longer trigger this alert.
Consider adding a companion alert on the new s3_lifecycle_latest_batch_end_time metric to detect scans that start but never complete (stuck-scan detection).
— Claude Code
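A rough shape for the companion alert, hedged: the two metric names come from the review comments above, but the alert name, the 1h threshold, and the assumption that both metrics are unix timestamps in seconds are placeholders to be adapted to the conventions in `monitoring/lifecycle/alerts.yaml`.

```yaml
# Sketch only — threshold, name, and timestamp units are assumptions.
# Fires when a scan has started (start > end) but has not completed
# within an hour, i.e. stuck-scan detection.
- alert: LifecycleScanStuck
  expr: >
    s3_lifecycle_latest_batch_start_time > s3_lifecycle_latest_batch_end_time
    and (time() - s3_lifecycle_latest_batch_start_time) > 3600
  for: 5m
  labels:
    severity: warning
  annotations:
    description: A lifecycle conductor scan started more than 1h ago
      and has not completed.
```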
LGTM
force-pushed 9da1c41 → 5561284
LGTM — well-structured monitoring addition. Scan ID propagation is consistent across conductor → bucket processor → task pipelines.
Issue: BB-740