
fix: handle DrainIngress in fake_data_generator to unblock graceful shutdown#2515

Open
sjmsft wants to merge 8 commits into open-telemetry:main from sjmsft:bug_2511

Conversation

@sjmsft
Contributor

@sjmsft sjmsft commented Apr 2, 2026

Change Summary

The "Ack nack redesign" PR (3dca283) introduced a two-phase DrainIngress/ReceiverDrained shutdown protocol but missed updating the fake_data_generator receiver. Without a DrainIngress handler, the message fell into the _ => {} catch-all: notify_receiver_drained() was never called, the pipeline controller never removed the receiver from its pending set, and after the deadline expired it emitted DrainDeadlineReached. This caused pipeline-perf-test-basic to fail consistently.
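To make the failure mode concrete, here is a minimal self-contained sketch of why a _ => {} catch-all silently swallows the drain request. The enum and handler below are toy stand-ins that mirror the names in the discussion (NodeControlMsg, notify_receiver_drained), not the real otap-dataflow types:

```rust
// Toy model of a receiver control loop; names mirror the PR discussion
// but the types are hypothetical stand-ins, not the real otap-dataflow API.
#[allow(dead_code)]
#[derive(Debug)]
enum NodeControlMsg {
    DrainIngress,
    Shutdown,
    CollectTelemetry,
}

/// Returns true if the receiver acknowledged the drain
/// (i.e. the equivalent of notify_receiver_drained() ran).
fn handle(msg: NodeControlMsg, has_drain_arm: bool) -> bool {
    match msg {
        // The fix: an explicit arm that acknowledges the drain.
        NodeControlMsg::DrainIngress if has_drain_arm => true,
        NodeControlMsg::Shutdown => false,
        // Before the fix, DrainIngress fell through here, so the pipeline
        // controller never removed the receiver from its pending set and
        // eventually emitted DrainDeadlineReached.
        _ => false,
    }
}

fn main() {
    assert!(!handle(NodeControlMsg::DrainIngress, false)); // bug: drain silently ignored
    assert!(handle(NodeControlMsg::DrainIngress, true)); // fix: drained ack sent
    println!("ok");
}
```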

What issue does this PR close?

pipeline-perf-test-basic unit test is failing.

How are these changes tested?

fake_data_generator and runtime_control_metrics tests were executed.

Are there any user-facing changes?

No, fake_data_generator is an internal test/load-generation receiver, not a user-facing component.

@sjmsft sjmsft requested a review from a team as a code owner April 2, 2026 16:46
@github-actions github-actions bot added the rust Pull requests that update Rust code label Apr 2, 2026
@codecov

codecov bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 94.73684% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 88.37%. Comparing base (d8e64e0) to head (4862d79).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2515      +/-   ##
==========================================
+ Coverage   88.34%   88.37%   +0.02%     
==========================================
  Files         613      613              
  Lines      222675   222694      +19     
==========================================
+ Hits       196731   196805      +74     
+ Misses      25420    25365      -55     
  Partials      524      524              
Components            Coverage           Δ
otap-dataflow         90.27% <94.73%>    (+0.03%) ⬆️
query_abstraction     80.61% <ø>         (ø)
query_engine          90.74% <ø>         (ø)
syslog_cef_receivers  ∅ <ø>              (∅)
otel-arrow-go         52.45% <ø>         (ø)
quiver                91.92% <ø>         (ø)

@lalitb
Member

lalitb commented Apr 2, 2026

The sequencing here looks off. In graceful shutdown the runtime does:

DrainIngress -> ReceiverDrained -> downstream Shutdown.

This change makes fake_data_generator do

DrainIngress -> ReceiverDrained -> wait for Shutdown,

but that Shutdown is not part of the normal post-drain receiver path. For this receiver, once ingress is stopped there is no receiver-local work left to preserve, so it should exit directly on DrainIngress rather than report drained and then block waiting for another shutdown message.

Member

@lalitb lalitb left a comment

Please go through the comment here.

@lalitb
Member

lalitb commented Apr 2, 2026

The correct fix should be:

DrainIngress -> notify_receiver_drained() -> return TerminalState immediately

Something like (not tested):

Ok(NodeControlMsg::DrainIngress { deadline, .. }) => {
    otel_info!("fake_data_generator.drain_ingress");
    effect_handler.notify_receiver_drained().await?;
    return Ok(TerminalState::new(deadline, [self.metrics.snapshot()]));
}

@lalitb
Member

lalitb commented Apr 2, 2026

The fix now looks correct. However, the CI failures suggest a shutdown race in fake_data_generator that is easy to hit on slower runners. The test config uses signals_per_second = 1, so the receiver can sleep for close to 1 second between sends, while the shutdown deadline in test_telemetry_registries_cleanup is only 200ms. That means DrainIngress can arrive while the receiver is asleep, the runtime can move into forced shutdown before the receiver handles it, and then notify_receiver_drained().await? can fail with Channel is closed.

One option could be to address this in two places:

  • make the rate-limit sleep interruptible, since that looks like the root cause here.
if signals_per_second.is_some() {
    let remaining_time = wait_till - Instant::now();
    if remaining_time.as_secs_f64() > 0.0 {
        tokio::select! {
            biased;

            ctrl_msg = ctrl_msg_recv.recv() => {
                // handle DrainIngress / Shutdown during the rate-limit wait
                // using the same control-message handling as the main loop
            }

            _ = sleep(remaining_time) => {}
        }
    }
}
  • make notify_receiver_drained() best-effort on the terminal DrainIngress path, so a late control-plane teardown does not turn shutdown into an error.
Ok(NodeControlMsg::DrainIngress { deadline, .. }) => {
    otel_info!("fake_data_generator.drain_ingress");
    let _ = effect_handler.notify_receiver_drained().await;
    return Ok(TerminalState::new(deadline, [self.metrics.snapshot()]));
}

@sjmsft sjmsft requested a review from lalitb April 2, 2026 23:39
@lquerel
Contributor

lquerel commented Apr 4, 2026

@sjmsft

[In addition to the sequence described by @lalitb]

The exception is deadline-forced shutdown. If the drain deadline expires before the receiver reports drained, the runtime sends NodeControlMsg::Shutdown { deadline, reason } to any still-pending receivers.
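A receiver that has already acked the drain can therefore still see a late Shutdown when the deadline expires first. A minimal sketch of the two terminal paths, using toy stand-in types (TerminalState and NodeControlMsg here are hypothetical, not the real otap-dataflow definitions):

```rust
// Hypothetical stand-ins modeling the two terminal paths described above;
// not the real otap-dataflow types.
struct TerminalState {
    reason: String,
}

#[allow(dead_code)]
enum NodeControlMsg {
    DrainIngress,
    Shutdown { reason: String },
}

fn handle(msg: NodeControlMsg) -> TerminalState {
    match msg {
        // Normal path: ingress is drained, the receiver acks and exits.
        NodeControlMsg::DrainIngress => TerminalState {
            reason: "drained".into(),
        },
        // Forced path: the drain deadline expired before the receiver
        // reported drained, so the runtime escalates to Shutdown and the
        // receiver must exit immediately without a drained ack.
        NodeControlMsg::Shutdown { reason } => TerminalState { reason },
    }
}

fn main() {
    let forced = handle(NodeControlMsg::Shutdown {
        reason: "drain deadline reached".into(),
    });
    assert_eq!(forced.reason, "drain deadline reached");
    println!("ok");
}
```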


Labels

rust Pull requests that update Rust code

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

pipeline-perf-test-basic unit test is failing

4 participants