[WIP] Add node-local wakeups to ProcessorInbox #2470
lquerel wants to merge 18 commits into open-telemetry:main
Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main    #2470   +/-   ##
=======================================
+ Coverage   88.34%   88.38%   +0.03%
=======================================
  Files         613      615       +2
  Lines      222675   224019    +1344
=======================================
+ Hits       196731   198005    +1274
- Misses      25420    25490      +70
  Partials      524      524
```
Force-pushed from 651b659 to 44d64d4.
```rust
fn swap_entries(&mut self, left: usize, right: usize) {
    if left == right {
        return;
    }

    self.wakeups.swap(left, right);

    let left_slot = self.wakeups[left].slot;
    let right_slot = self.wakeups[right].slot;
    let _ = self
        .wakeup_indices
        .insert(left_slot, left)
        .expect("left slot index should exist");
    let _ = self
        .wakeup_indices
        .insert(right_slot, right)
        .expect("right slot index should exist");
}

fn sift_up(&mut self, mut index: usize) {
    while index > 0 {
        let parent = (index - 1) / 2;
        if !Self::wakeup_precedes(&self.wakeups[index], &self.wakeups[parent]) {
            break;
        }
        self.swap_entries(index, parent);
        index = parent;
    }
}

fn sift_down(&mut self, mut index: usize) {
    let len = self.wakeups.len();
    loop {
        let left = index * 2 + 1;
        if left >= len {
            break;
        }

        let right = left + 1;
        let mut smallest = left;
        if right < len && Self::wakeup_precedes(&self.wakeups[right], &self.wakeups[left]) {
            smallest = right;
        }

        if !Self::wakeup_precedes(&self.wakeups[smallest], &self.wakeups[index]) {
            break;
        }

        self.swap_entries(index, smallest);
        index = smallest;
    }
}

fn repair_heap_at(&mut self, index: usize) {
    if index > 0 {
        let parent = (index - 1) / 2;
        if Self::wakeup_precedes(&self.wakeups[index], &self.wakeups[parent]) {
            self.sift_up(index);
            return;
        }
    }
    self.sift_down(index);
}
```
Does Rust really not have a crate for this? Hmm (thinking like Golang's https://pkg.go.dev/container/heap). Don't get me wrong, I love heaps! I would put this heap-helper stuff in a new data-structure-only file.
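For context on why this is hand-rolled: the standard library's `BinaryHeap` has no decrease-key or "reposition an arbitrary element" operation, which a reschedulable wakeup heap needs. Below is a minimal, self-contained sketch of the same indexed-min-heap technique; all names (`Slot`, `Entry`, `WakeupHeap`) are illustrative stand-ins, not the PR's actual types:

```rust
use std::collections::HashMap;

// Illustrative slot key and heap entry; the real code keys by WakeupSlot
// and orders by an Instant-based due time.
type Slot = u64;

#[derive(Clone, Copy)]
struct Entry {
    slot: Slot,
    due: u64, // deadline in arbitrary ticks for this sketch
}

#[derive(Default)]
struct WakeupHeap {
    entries: Vec<Entry>,
    // slot -> current position in `entries`, kept in sync on every swap
    index_of: HashMap<Slot, usize>,
}

impl WakeupHeap {
    // Schedule or reschedule a slot in O(log n).
    fn set(&mut self, slot: Slot, due: u64) {
        if let Some(&i) = self.index_of.get(&slot) {
            self.entries[i].due = due;
            // The entry may need to move either direction; after sift_up the
            // subtree at `i` is valid, so the extra sift_down is a no-op.
            self.sift_up(i);
            self.sift_down(i);
        } else {
            self.entries.push(Entry { slot, due });
            let i = self.entries.len() - 1;
            self.index_of.insert(slot, i);
            self.sift_up(i);
        }
    }

    // Pop the earliest slot if its deadline has elapsed.
    fn pop_due(&mut self, now: u64) -> Option<Slot> {
        let first = *self.entries.first()?;
        if first.due > now {
            return None;
        }
        let last = self.entries.len() - 1;
        self.swap(0, last);
        self.entries.pop();
        self.index_of.remove(&first.slot);
        if !self.entries.is_empty() {
            self.sift_down(0);
        }
        Some(first.slot)
    }

    fn swap(&mut self, a: usize, b: usize) {
        self.entries.swap(a, b);
        self.index_of.insert(self.entries[a].slot, a);
        self.index_of.insert(self.entries[b].slot, b);
    }

    fn sift_up(&mut self, mut i: usize) {
        while i > 0 {
            let parent = (i - 1) / 2;
            if self.entries[i].due >= self.entries[parent].due {
                break;
            }
            self.swap(i, parent);
            i = parent;
        }
    }

    fn sift_down(&mut self, mut i: usize) {
        let len = self.entries.len();
        loop {
            let left = i * 2 + 1;
            if left >= len {
                break;
            }
            let mut smallest = left;
            let right = left + 1;
            if right < len && self.entries[right].due < self.entries[left].due {
                smallest = right;
            }
            if self.entries[smallest].due >= self.entries[i].due {
                break;
            }
            self.swap(i, smallest);
            i = smallest;
        }
    }
}
```

The `index_of` map is what `BinaryHeap` cannot provide: it lets `set` find and move an already-scheduled slot instead of pushing a duplicate.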
```rust
    }
}

/// Retry state for a bundle that could not acquire an engine wakeup slot yet.
```
I know this is still a WIP, so not leaving detailed comments yet. Just wanted to chime in early on the approach to hopefully reduce overwork.
My general concern in the durable_buffer changes specifically (not the general mechanism) is that we'd be doing something relatively complex with the 'many wakeup slots + OverflowRetry' approach at a time when we can least afford to do so (i.e. under heavy NACK pressure from a network outage).
Instead of one wakeup slot per bundle, could we simplify by maintaining a single WakeupSlot that tracks the earliest pending retry deadline? When a NACK arrives (or the wakeup fires and we need to reschedule), we compute the backoff time and call set_wakeup(slot, min(current_scheduled, new_retry_at)). When the wakeup fires, we scan Quiver for all bundles whose backoff has elapsed and resend them.

We still get exact timing for the next retry without per-bundle slot management. The state the durable_buffer_processor maintains reduces to one Instant and one wakeup slot, the OverflowRetry mechanism goes away in the durable buffer (no longer needed), and we don't need to coordinate between multiple scheduling methods.
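The single-slot scheme amounts to a tiny state machine: remember the earliest pending deadline, and only re-arm the engine wakeup when a new retry time beats it. A hedged sketch, not the PR's code; `RetryScheduler` is a made-up name and deadlines are plain `u64` ticks where real code would use `Instant` and the engine's set_wakeup API:

```rust
// Tracks the earliest pending retry deadline for one wakeup slot.
#[derive(Default)]
struct RetryScheduler {
    next_retry_at: Option<u64>,
}

impl RetryScheduler {
    /// Called on each NACK. Returns Some(deadline) when the caller must
    /// (re)arm the engine wakeup, i.e. when this retry is earlier than
    /// whatever is already scheduled.
    fn on_nack(&mut self, retry_at: u64) -> Option<u64> {
        match self.next_retry_at {
            // An earlier (or equal) wakeup is already armed: nothing to do.
            Some(current) if current <= retry_at => None,
            // First retry, or this one is earlier: take the min and re-arm.
            _ => {
                self.next_retry_at = Some(retry_at);
                Some(retry_at)
            }
        }
    }

    /// Called when the wakeup fires: clear state, then the processor scans
    /// storage for all bundles whose backoff has elapsed and reschedules
    /// the rest via on_nack.
    fn on_wakeup(&mut self) {
        self.next_retry_at = None;
    }
}
```

This keeps the per-processor state to a single optional deadline regardless of how many bundles are backed off, which is the memory argument made above.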
That sounds reasonable, I'll take a look at what I can do.
Please take a look at the updated code.
@AaronRM At the durable buffer retry state level, there is still a potential issue with unbounded heap memory usage. With the new wakeup API, I think the situation is better than before, but the heap footprint can still grow with the number of deferred bundles. During a prolonged transient outage with small bundles, the durable backlog can become large in terms of in-memory retry metadata. I did not try to fix this because it goes well beyond the integration of the wakeup API, but I think a long-term solution would be to move the

That said, I don't have your full design in mind, and this may not be the best approach. In any case, I'll leave it to you to take a look.
JakeDern left a comment:
Batch processor changes look reasonable to me. I assume we'll remove the DelayedData handling code before merge, since it's obsolete now.
```rust
/// Wakeups are keyed by [`WakeupSlot`]. Scheduling the same slot again
/// replaces the previous due time for that slot and assigns a new
/// scheduler revision for that slot.
///
/// Re-scheduling an existing slot gives it a new revision. Processors can use
/// the revision carried back in [`NodeControlMsg::Wakeup`] to distinguish a
/// current wakeup from a stale delivery for the same slot.
pub type WakeupRevision = u64;
```
I'm glad this is part of the engine! I was wondering if I'd have to do a similar thing at the processor level
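To illustrate the staleness check the doc comment describes: a processor can record the revision it gets back at scheduling time and drop any delivered wakeup whose revision no longer matches. A hypothetical sketch, assuming the u64 slot/revision aliases above; `WakeupTracker` is not part of the PR:

```rust
use std::collections::HashMap;

// Mirrors the engine's aliases for this sketch.
type WakeupSlot = u64;
type WakeupRevision = u64;

// Per-processor bookkeeping: latest scheduled revision for each slot.
#[derive(Default)]
struct WakeupTracker {
    current: HashMap<WakeupSlot, WakeupRevision>,
}

impl WakeupTracker {
    /// Record the revision assigned when (re)scheduling a slot.
    fn scheduled(&mut self, slot: WakeupSlot, rev: WakeupRevision) {
        self.current.insert(slot, rev);
    }

    /// A delivered wakeup is current only if its revision matches the most
    /// recent schedule for that slot; anything else is a stale delivery.
    fn is_current(&self, slot: WakeupSlot, rev: WakeupRevision) -> bool {
        self.current.get(&slot) == Some(&rev)
    }
}
```

On receiving a wakeup control message, the processor would check `is_current(slot, rev)` and simply ignore the message when it returns false.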
Thanks @lquerel. The changes with the single-slot redesign now look good to me from the durable_buffer perspective. 👍 To your concern re: unbounded heap growth, I filed #2552 to track that separately. I agree that it's not something your PR should address, although, as you note, your changes already improve the situation slightly.
Not ready for review. #2469 must be merged first to reduce the scope of this PR.
Change Summary
Add a node-local wakeup API for processors and integrate it into ProcessorInbox.

This PR introduces keyed, cancelable wakeups that let a processor schedule local wakeup work without sending dummy pdata back through the old delayed-data path. Due wakeups are surfaced through ProcessorInbox as control messages.

This is the next step in the redesign tracked in #2465. The goal is to split two different behaviors that were previously mixed together:

- node-local wakeup scheduling
- pdata resume

This PR handles the first one only. It migrates the wakeup-style usages in batch_processor and durable_buffer_processor, while intentionally leaving retry and the old global delayed-data path in place for the next PRs.

What issue does this PR close?

How are these changes tested?

- ProcessorInbox delivery of due wakeups
- cargo xtask check

Are there any user-facing changes?

No user-facing changes. This PR is an internal runtime change only.