monitor lifecycle conductor #2723
Conversation
Hello benzekrimaha, my role is to assist you with the merge of this pull request.

Available options

Available commands

Status report is not available.
force-pushed 380069a → 25ea9d5
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:
... and 6 files with indirect coverage changes
@@ Coverage Diff @@
## development/9.3 #2723 +/- ##
===================================================
- Coverage 74.48% 74.46% -0.02%
===================================================
Files 200 200
Lines 13603 13664 +61
===================================================
+ Hits 10132 10175 +43
- Misses 3461 3479 +18
Partials 10 10
Flags with carried forward coverage won't be shown.
force-pushed 8316f88 → 408c96c
force-pushed 408c96c → e1c5b13
force-pushed e1c5b13 → aefb677
const log = this.logger.newRequestLogger();
- const start = new Date();
+ const start = Date.now();
this._scanId = uuid();
Hmm, we're storing the scan ID as a "global" field variable, but it sounds like it is really relevant/used only inside this function (through indirect calls). Could we drop the global field and instead pass it through to whatever uses it? Maybe in _createBucketTaskMessages?
force-pushed 725c3df → 11a94ea
force-pushed a2128cf → a464b39
const scanId = uuid();
this._currentScanId = scanId;
log.addDefaultFields({ conductorScanId: scanId });
LifecycleMetrics.onProcessBuckets(log);
onProcessBuckets was previously called at scan completion (after the success log), so s3_lifecycle_latest_batch_start_time was really a "latest batch completion" timestamp. Moving it to scan start is semantically correct, but the existing LifecycleLateScan alert in monitoring/lifecycle/alerts.yaml still depends on this metric with the description "Last lifecycle scan was performed more than … ago." With the new timing, a conductor that starts scans but never finishes them would keep refreshing this metric, delaying alert firing. Consider adding a companion alert on s3_lifecycle_latest_batch_end_time to catch scans that start but never complete, or updating the alert description.
— Claude Code
force-pushed 3d8286a → ca6f753
LGTM - well-structured monitoring enhancement. The conductorScanId propagation is consistent across all lifecycle task types, the stale metric cleanup in LifecycleMetrics avoids unbounded prom-client memory growth, the semantic change to onProcessBuckets (called at scan start rather than end) is correctly reflected in the alert wording, and the new onConductorScanComplete metrics cover the completion/duration use case. Test coverage is thorough.
LGTM |
LGTM — clean, well-tested implementation. The scan-id propagation is thorough.
force-pushed bc5ef91 → 1a88160
LGTM — clean, well-tested addition of conductor scan monitoring. The scan-id propagation through conductor → bucket tasks → bucket processor is consistent.
force-pushed 1a88160 → 6967a0b
force-pushed 2ea683f → f4fab73
'current/noncurrent/orphan splits in v2) increments this counter once ' +
'per slice. Multiple conductor_scan_id label values appearing at the ' +
'same time indicate overlapping scans.',
labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID],
The conductor_scan_id label creates a new time series per scan (UUID). While the 24h removeStaleBucketProcessorScanMetrics cleanup bounds prom-client process memory, every unique scan ID still produces a distinct series in the Prometheus TSDB until it becomes stale (5 min after the last scrape). With a typical lifecycle interval of ~6 min, that is ~240 distinct label values/day/pod. This is manageable but worth documenting: if the scan interval is ever shortened (e.g. 1 min), cardinality rises proportionally. Consider adding a note in the metric help string about the expected cardinality bounds and the cleanup mechanism.
— Claude Code
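The 24h cleanup bookkeeping described above can be sketched as follows. This is a hedged illustration, not Backbeat's actual `removeStaleBucketProcessorScanMetrics`: the `ScanMetricJanitor` class name and the `removeSeries` callback (which would wrap prom-client's `counter.remove()`) are assumptions; the 24h window comes from the review text.

```javascript
// Sketch of per-scan label cleanup: each conductor scan ID creates one
// time series; series with no activity for 24h are dropped so prom-client
// memory stays bounded. `removeSeries` stands in for counter.remove().
class ScanMetricJanitor {
    constructor(removeSeries, maxAgeMs = 24 * 3600 * 1000) {
        this.removeSeries = removeSeries;
        this.maxAgeMs = maxAgeMs;
        this.lastSeen = new Map(); // scanId -> last activity timestamp (ms)
    }

    // Record activity for a scan ID (called whenever its counter increments).
    touch(scanId, now = Date.now()) {
        this.lastSeen.set(scanId, now);
    }

    // Drop every scan ID whose last activity is older than maxAgeMs.
    sweep(now = Date.now()) {
        for (const [scanId, ts] of this.lastSeen) {
            if (now - ts > this.maxAgeMs) {
                this.removeSeries(scanId);
                this.lastSeen.delete(scanId);
            }
        }
    }
}
```

At a ~6 min scan interval this caps the live set at roughly 24 × 60 / 6 = 240 scan IDs per pod, matching the cardinality estimate in the review.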
LGTM — clean, well-tested observability improvement for the lifecycle conductor pipeline.
}

- listBucketdBuckets(queue, initMarker, log, cb) {
+ listBucketdBuckets(queue, initMarker, scanId, log, cb) {
Minor inconsistency: processBuckets was converted to Date.now() but the rate calculations in listBucketdBuckets (line 623, 636) and listMongodbBuckets (line 722, 792) still use new Date(). Both work, but since these methods are already touched by this PR (new scanId parameter), it would be cleaner to align them.
— Claude Code
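The alignment being suggested amounts to computing the listing rate from millisecond timestamps throughout. A small self-contained sketch (the function name and signature are illustrative, not the actual Backbeat code):

```javascript
// Illustrative rate helper using Date.now() millisecond timestamps.
// `new Date()` also works here because arithmetic coerces it via
// valueOf(), but mixing both styles in one file is what the review
// comment flags.
function listingRate(bucketCount, startMs, nowMs = Date.now()) {
    const elapsedSec = (nowMs - startMs) / 1000;
    // Guard against a zero-length interval to avoid division by zero.
    return elapsedSec > 0 ? bucketCount / elapsedSec : 0;
}
```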
force-pushed bbef626 → 49a799a
scanStarted = true;
this._currentScanId = scanId;
log.addDefaultFields({ conductorScanId: scanId });
LifecycleMetrics.onProcessBuckets(log);
Moving onProcessBuckets from scan completion to scan start changes the semantics of the LifecycleLateScan alert. Previously, the alert fired when no scan completed within the threshold. Now it fires when no scan was started — so a conductor scan that starts but gets stuck indefinitely will no longer trigger this alert.
Consider adding a companion alert on the new s3_lifecycle_latest_batch_end_time metric to detect scans that start but never complete (stuck-scan detection).
— Claude Code
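A rough shape for the companion alert, hedged: the two metric names come from the review comments above, but the alert name, the 1h threshold, and the assumption that both metrics are unix timestamps in seconds are placeholders to be adapted to the conventions in `monitoring/lifecycle/alerts.yaml`.

```yaml
# Sketch only — threshold, name, and timestamp units are assumptions.
# Fires when a scan has started (start > end) but has not completed
# within an hour, i.e. stuck-scan detection.
- alert: LifecycleScanStuck
  expr: >
    s3_lifecycle_latest_batch_start_time > s3_lifecycle_latest_batch_end_time
    and (time() - s3_lifecycle_latest_batch_start_time) > 3600
  for: 5m
  labels:
    severity: warning
  annotations:
    description: A lifecycle conductor scan started more than 1h ago
      and has not completed.
```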
LGTM
force-pushed 9da1c41 → 5561284
LGTM — well-structured monitoring addition. Scan ID propagation is consistent across conductor → bucket processor → task pipelines.
Issue: BB-740