Improve control delivery, shutdown robustness, and local deferred work #2465

@lquerel

Description

Pre-filing checklist

  • I searched existing issues and didn't find a duplicate

Component(s)

Rust OTAP dataflow (rust/otap-dataflow/)

Objective

This issue describes a multi-step plan to improve node control delivery, inbox behavior, node-local wakeups, and local delayed resume.

A series of PRs will incrementally move the engine in this direction, covering the following areas:

  • node control delivery
  • inbox arbitration of control and pdata
  • node-local wakeups and delayed resume
  • engine-owned shutdown orchestration

This series solves #2431 and prepares for #2432, #2433, #2424, and #2427.

Rationale

The current runtime mixes several concerns that have different performance and correctness requirements.

In particular:

  • forced shutdown should not depend on ordinary control backlog
  • node-local deferred work should not create unnecessary global contention
  • wakeup-style scheduling and delayed pdata resume are different behaviors and should not share the same overloaded mechanism
  • the runtime should remain robust under heavy Ack/Nack traffic in a thread-per-core, single-threaded async model

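The durable-buffer and batch processors currently use the DelayedData mechanism in a roundabout way to implement wakeups. The second and third points above (node-local deferred work, and the separation of wakeup-style scheduling from delayed pdata resume) are intended to address this gap.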
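To make the wakeup side of this concrete, here is a minimal sketch of a node-local, keyed wakeup table. It is not the engine's implementation: `WakeupKey`, `LocalWakeups`, and all method names are hypothetical, and the sketch assumes a single-threaded owner (one processor on one core), so it needs no locks or shared state.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};
use std::time::{Duration, Instant};

/// Hypothetical key identifying one wakeup slot inside a processor.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct WakeupKey(pub u64);

/// Node-local wakeup table, owned by exactly one processor (assumption).
#[derive(Default)]
pub struct LocalWakeups {
    /// Min-heap ordered by (deadline, sequence number, key).
    heap: BinaryHeap<Reverse<(Instant, u64, WakeupKey)>>,
    /// Sequence number of the live entry for each key; other heap entries are stale.
    live: HashMap<WakeupKey, u64>,
    next_seq: u64,
}

impl LocalWakeups {
    /// Schedules (or reschedules) the wakeup for `key` at `now + delay`.
    pub fn schedule_in(&mut self, key: WakeupKey, now: Instant, delay: Duration) {
        self.next_seq += 1;
        self.live.insert(key, self.next_seq);
        self.heap.push(Reverse((now + delay, self.next_seq, key)));
    }

    /// Cancels the wakeup for `key`; returns true if one was pending.
    pub fn cancel(&mut self, key: WakeupKey) -> bool {
        // The heap entry is left in place and skipped lazily later.
        self.live.remove(&key).is_some()
    }

    /// Returns the earliest live deadline, discarding stale heap entries.
    pub fn next_deadline(&mut self) -> Option<Instant> {
        while let Some(Reverse((at, seq, key))) = self.heap.peek().copied() {
            if self.live.get(&key) == Some(&seq) {
                return Some(at);
            }
            self.heap.pop();
        }
        None
    }

    /// Pops one wakeup whose deadline is at or before `now`, if any.
    pub fn pop_due(&mut self, now: Instant) -> Option<WakeupKey> {
        match self.next_deadline() {
            Some(at) if at <= now => {
                let Reverse((_, _, key)) = self.heap.pop()?;
                self.live.remove(&key);
                Some(key)
            }
            _ => None,
        }
    }
}
```

In this sketch a processor would call `schedule_in` to arm a flush timer, `cancel` when the batch is emitted early, and `pop_due` from its receive loop. No pdata travels through the table, which is the distinction from delayed pdata resume drawn above.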

Scope

Planned PR stack:

  1. Add standalone control-aware bounded channel for future engine integration #2466
    • introduce the new control-channel design as an isolated crate with benchmarks and documentation
  2. Rename engine message channels to inboxes #2469
    • rename ProcessorMessageChannel / ExporterMessageChannel to ProcessorInbox / ExporterInbox
    • clarify the role of these runtime abstractions before adding new responsibilities
  3. Add node-local wakeups to ProcessorInbox #2470
    • add keyed wakeup scheduling/cancellation for processors
    • migrate wakeup-style delayed-data uses in batch and durable-buffer
  4. [WIP] Add node-local delayed resume to ProcessorInbox #2471
    • add requeue_later(...) for processor-local delayed pdata resume
    • migrate retry off the global delayed-data path
  5. [WIP] Remove global delayed-data runtime plumbing #2472
    • delete runtime-managed delayed-data scheduling and related plumbing once migration is complete
  6. Integrate the control-channel design into the engine
    • wire the new control-channel into receivers and node inboxes
    • preserve engine-owned shutdown sequencing and liveness guarantees
  7. Observability and admin UI follow-up
    • add control-channel/inbox-focused metrics and admin UI support once the runtime design is stable
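As an illustration of what "inbox arbitration of control and pdata" could look like, the sketch below prefers control messages but caps how many are delivered in a row while pdata is waiting. The `Inbox` type, the `Msg` enum, and the `max_control_burst` parameter are assumptions made for this example only; the actual ProcessorInbox / ExporterInbox policy is defined by the PRs listed above.

```rust
use std::collections::VecDeque;

/// Message handed to a node: either a control message or pdata.
#[derive(Debug, PartialEq, Eq)]
pub enum Msg<C, P> {
    Control(C),
    PData(P),
}

/// Hypothetical single-threaded inbox holding two local queues.
pub struct Inbox<C, P> {
    control: VecDeque<C>,
    pdata: VecDeque<P>,
    /// Maximum consecutive control messages while pdata is waiting (assumption).
    max_control_burst: usize,
    /// Control messages delivered since the last pdata message.
    burst: usize,
}

impl<C, P> Inbox<C, P> {
    pub fn new(max_control_burst: usize) -> Self {
        Self {
            control: VecDeque::new(),
            pdata: VecDeque::new(),
            max_control_burst,
            burst: 0,
        }
    }

    pub fn push_control(&mut self, msg: C) {
        self.control.push_back(msg);
    }

    pub fn push_pdata(&mut self, msg: P) {
        self.pdata.push_back(msg);
    }

    /// Returns the next message: control first, unless the burst budget is
    /// spent and pdata is waiting, in which case pdata gets a turn.
    pub fn next_message(&mut self) -> Option<Msg<C, P>> {
        let pdata_turn = self.burst >= self.max_control_burst && !self.pdata.is_empty();
        if !pdata_turn {
            if let Some(msg) = self.control.pop_front() {
                self.burst += 1;
                return Some(Msg::Control(msg));
            }
        }
        // Either it is pdata's turn or no control message is queued.
        self.burst = 0;
        self.pdata.pop_front().map(Msg::PData)
    }
}
```

The burst cap is one possible way to keep pdata moving under heavy Ack/Nack traffic. A real inbox would also have to bound both queues and integrate with async wakers, which this sketch leaves out.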

Acceptance Criteria

  • The work is split into focused PRs with clear scope boundaries
  • Node-local wakeups and delayed resume no longer rely on the old global delayed-data path
  • Forced shutdown behavior is explicit and reliable under saturated control traffic
  • Observability is updated after the runtime changes land
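The forced-shutdown criterion can be pictured with a small sketch in which shutdown does not travel through the bounded control backlog at all, so a full queue cannot delay it. All names here (`ControlQueue`, `Completion`, `Received`) are hypothetical, and dropping the backlog on forced shutdown is an assumption of this sketch, not a decision recorded in this issue.

```rust
use std::collections::VecDeque;

/// Ordinary control traffic that competes for bounded capacity.
#[derive(Debug, PartialEq, Eq)]
pub enum Completion {
    Ack(u64),
    Nack(u64),
}

/// What the node observes when it polls the queue.
#[derive(Debug, PartialEq, Eq)]
pub enum Received {
    Completion(Completion),
    ForcedShutdown,
}

#[derive(Clone, Copy, PartialEq, Eq)]
enum Shutdown {
    NotRequested,
    Pending,
    Delivered,
}

/// Hypothetical bounded control queue with a dedicated shutdown slot.
pub struct ControlQueue {
    backlog: VecDeque<Completion>,
    capacity: usize,
    shutdown: Shutdown,
}

impl ControlQueue {
    pub fn new(capacity: usize) -> Self {
        Self { backlog: VecDeque::new(), capacity, shutdown: Shutdown::NotRequested }
    }

    /// Enqueues ordinary control; fails when full or after shutdown was requested.
    pub fn try_send(&mut self, msg: Completion) -> Result<(), Completion> {
        if self.shutdown != Shutdown::NotRequested || self.backlog.len() >= self.capacity {
            return Err(msg);
        }
        self.backlog.push_back(msg);
        Ok(())
    }

    /// Requests forced shutdown; always succeeds, regardless of backlog size.
    pub fn force_shutdown(&mut self) {
        if self.shutdown == Shutdown::NotRequested {
            self.shutdown = Shutdown::Pending;
            // Assumption: the pending backlog is dropped on forced shutdown.
            self.backlog.clear();
        }
    }

    /// Delivers the shutdown signal ahead of everything else, exactly once.
    pub fn recv(&mut self) -> Option<Received> {
        match self.shutdown {
            Shutdown::Pending => {
                self.shutdown = Shutdown::Delivered;
                Some(Received::ForcedShutdown)
            }
            Shutdown::Delivered => None,
            Shutdown::NotRequested => self.backlog.pop_front().map(Received::Completion),
        }
    }
}
```

Because `force_shutdown` writes to its own slot rather than pushing onto the backlog, it succeeds even when `try_send` is already rejecting Ack/Nack traffic for lack of capacity.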

Dependencies or Blockers

#2431
Solved by PR6 (step 6 of the planned stack), with PR1 (#2466) providing the standalone control-channel design that PR6 integrates into the engine.

Additional Context

There is also a separate channel/shutdown redesign proposal here: https://github.com/gouslu/otel-arrow/blob/gouslu/channel-redesign/rust/otap-dataflow/docs/channel-redesign.md

That proposal raises useful points around shutdown orchestration and hot-path simplification. After PR6, I will evaluate whether shutdown should evolve toward an engine-driven two-phase model that further simplifies ProcessorInbox / ExporterInbox.
