fix: handle DrainIngress in fake_data_generator to unblock graceful shutdown#2515
sjmsft wants to merge 8 commits into open-telemetry:main from
Conversation
Codecov Report
❌ Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## main #2515 +/- ##
==========================================
+ Coverage 88.34% 88.37% +0.02%
==========================================
Files 613 613
Lines 222675 222694 +19
==========================================
+ Hits 196731 196805 +74
+ Misses 25420 25365 -55
Partials 524 524
The sequencing here looks off. In graceful shutdown the runtime does:
This change makes fake_data_generator do
but that Shutdown is not part of the normal post-drain receiver path. For this receiver, once ingress is stopped there is no receiver-local work left to preserve, so it should exit directly on DrainIngress rather than report drained and then block waiting for another shutdown message.
The correct fix should be something like (not tested):

```rust
Ok(NodeControlMsg::DrainIngress { deadline, .. }) => {
    otel_info!("fake_data_generator.drain_ingress");
    effect_handler.notify_receiver_drained().await?;
    return Ok(TerminalState::new(deadline, [self.metrics.snapshot()]));
}
```
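To make the failure mode concrete, here is a minimal synchronous sketch (hypothetical types and std channels only; the real receiver is async) of why a missing `DrainIngress` arm loses the hand-off: the message falls into the catch-all, the drained notification is never sent, and the controller keeps waiting.

```rust
use std::sync::mpsc;

// Hypothetical stand-ins for the runtime's control messages.
#[derive(Debug)]
enum NodeControlMsg {
    DrainIngress { deadline_ms: u64 },
    Shutdown,
    Other,
}

// Returns Some(deadline) if the receiver exited via the drain path,
// None if it exited via Shutdown without acknowledging a drain.
fn control_loop(ctrl: mpsc::Receiver<NodeControlMsg>, drained: mpsc::Sender<()>) -> Option<u64> {
    for msg in ctrl {
        match msg {
            NodeControlMsg::DrainIngress { deadline_ms } => {
                // Acknowledge the drain so the controller can remove this
                // receiver from its pending set, then exit directly.
                let _ = drained.send(());
                return Some(deadline_ms);
            }
            NodeControlMsg::Shutdown => return None,
            // Without the arm above, DrainIngress would land here and the
            // controller would wait until its drain deadline expires.
            _ => {}
        }
    }
    None
}

fn main() {
    let (ctrl_tx, ctrl_rx) = mpsc::channel();
    let (drained_tx, drained_rx) = mpsc::channel();
    ctrl_tx.send(NodeControlMsg::Other).unwrap();
    ctrl_tx.send(NodeControlMsg::DrainIngress { deadline_ms: 500 }).unwrap();
    let exited = control_loop(ctrl_rx, drained_tx);
    assert_eq!(exited, Some(500));
    // The drained notification reached the controller side.
    assert!(drained_rx.try_recv().is_ok());
    println!("ok");
}
```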
The fix now looks correct. However, from the CI failures there appears to be a shutdown race during the rate-limit wait. One option could be to address this in two places:
```rust
if signals_per_second.is_some() {
    let remaining_time = wait_till - Instant::now();
    if remaining_time.as_secs_f64() > 0.0 {
        tokio::select! {
            biased;
            ctrl_msg = ctrl_msg_recv.recv() => {
                // handle DrainIngress / Shutdown during the rate-limit wait
                // using the same control-message handling as the main loop
            }
            _ = sleep(remaining_time) => {}
        }
    }
}
```

```rust
Ok(NodeControlMsg::DrainIngress { deadline, .. }) => {
    otel_info!("fake_data_generator.drain_ingress");
    let _ = effect_handler.notify_receiver_drained().await;
    return Ok(TerminalState::new(deadline, [self.metrics.snapshot()]));
}
```
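The race being fixed above is that a plain `sleep(remaining_time)` leaves the receiver deaf to control messages for the whole rate-limit wait. A simplified synchronous analogue (std-only, hypothetical names; the real code uses `tokio::select!` with `biased`) is to wait with `recv_timeout`, so the wait and the control channel are raced against each other:

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Ctrl {
    DrainIngress,
}

// Wait out the remaining rate-limit budget, but wake early if a control
// message arrives. This mirrors the biased select above: the control
// channel is observed for the whole wait, not only between batches.
fn wait_or_ctrl(ctrl: &mpsc::Receiver<Ctrl>, remaining: Duration) -> Option<Ctrl> {
    match ctrl.recv_timeout(remaining) {
        Ok(msg) => Some(msg),                     // control message won the race
        Err(RecvTimeoutError::Timeout) => None,   // full wait elapsed; keep generating
        Err(RecvTimeoutError::Disconnected) => None,
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send(Ctrl::DrainIngress).unwrap();
    // A pending DrainIngress interrupts even a long wait immediately.
    assert_eq!(wait_or_ctrl(&rx, Duration::from_secs(60)), Some(Ctrl::DrainIngress));
    // With nothing queued, the wait simply times out.
    assert_eq!(wait_or_ctrl(&rx, Duration::from_millis(10)), None);
    println!("ok");
}
```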
Change Summary
The "Ack nack redesign" PR (3dca283) introduced a two-phase DrainIngress/ReceiverDrained shutdown protocol but missed updating the fake_data_generator receiver. Without the DrainIngress handler, the message falls into the _ => {} catch-all, notify_receiver_drained() is never called, the pipeline controller never removes the receiver from its pending set, and after the deadline expires it emits DrainDeadlineReached. This was causing pipeline-perf-test-basic to fail consistently.
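The controller side of the two-phase protocol described above can be sketched as follows (a std-only, synchronous analogue with illustrative names, not the crate's real API): after broadcasting DrainIngress, the controller removes each receiver from its pending set as ReceiverDrained notifications arrive, and reports DrainDeadlineReached for whatever is still pending when the deadline expires.

```rust
use std::collections::HashSet;
use std::sync::mpsc;
use std::time::{Duration, Instant};

// Outcome of the drain phase (names are illustrative).
#[derive(Debug, PartialEq)]
enum DrainOutcome {
    AllDrained,
    DrainDeadlineReached(Vec<String>), // receivers still pending
}

// Phase 2 of shutdown: after DrainIngress has been sent to every receiver,
// wait for each ReceiverDrained notification until the deadline.
fn await_drained(
    mut pending: HashSet<String>,
    drained_rx: &mpsc::Receiver<String>,
    deadline: Instant,
) -> DrainOutcome {
    while !pending.is_empty() {
        let now = Instant::now();
        if now >= deadline {
            let mut left: Vec<_> = pending.into_iter().collect();
            left.sort();
            return DrainOutcome::DrainDeadlineReached(left);
        }
        match drained_rx.recv_timeout(deadline - now) {
            Ok(name) => { pending.remove(&name); }
            Err(_) => {
                let mut left: Vec<_> = pending.into_iter().collect();
                left.sort();
                return DrainOutcome::DrainDeadlineReached(left);
            }
        }
    }
    DrainOutcome::AllDrained
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let pending: HashSet<_> = ["fake_data_generator".to_string()].into_iter().collect();
    // A receiver that never notifies: the deadline fires.
    let out = await_drained(pending.clone(), &rx, Instant::now() + Duration::from_millis(20));
    assert_eq!(out, DrainOutcome::DrainDeadlineReached(vec!["fake_data_generator".to_string()]));
    // A receiver that acknowledges the drain: the controller completes cleanly.
    tx.send("fake_data_generator".to_string()).unwrap();
    let out = await_drained(pending, &rx, Instant::now() + Duration::from_secs(1));
    assert_eq!(out, DrainOutcome::AllDrained);
    println!("ok");
}
```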
What issue does this PR close?
The pipeline-perf-test-basic unit test is failing.
How are these changes tested?
The fake_data_generator and runtime_control_metrics tests were executed.
Are there any user-facing changes?
No, fake_data_generator is an internal test/load-generation receiver, not a user-facing component.