feat: Add Phase 1 process-wide memory limiter by lalitb · Pull Request #2542 · open-telemetry/otel-arrow

lalitb · 2026-04-05T07:36:00Z

Summary

This PR adds a process-wide memory limiter to the Rust collector.

The limiter samples process memory on a fixed interval, classifies pressure as
Normal, Soft, or Hard, and exposes that state through metrics and logs.
In enforce mode, receivers reject new ingress only at Hard. In
observe_only mode, the limiter remains telemetry-only.

This complements the collector's existing bounded-buffer backpressure. It does
not add per-group budgets, byte accounting, or strict process-memory caps.

The current implementation keeps sampling and pressure classification
process-wide, but propagates pressure transitions through the pipeline control
plane and lets each receiver maintain receiver-local admission state. This keeps
the enforcement hot path local while preserving the same Phase 1 behavior.

Basic Functionality

Sample process memory from a configured source on a fixed interval
Derive or use configured soft/hard memory limits
Classify memory pressure as Normal, Soft, or Hard
Keep Soft informational in Phase 1
Reject new ingress at receivers when pressure reaches Hard in enforce mode
Optionally fail /readyz under Hard pressure in enforce mode
Support observe_only mode for metrics and logs without enforcement
Emit process-level memory gauges for current sampled usage and effective limits
Emit pressure-state telemetry and receiver rejection counters
Propagate pressure transitions through receiver control messages and enforce from receiver-local admission state
Optionally attempt allocator-specific memory purge on Hard when supported by the build

How to Review

Start with docs/memory-limiter-phase1.md for scope, semantics, and operator-facing behavior.

Then review the core limiter and controller wiring:

Then review receiver enforcement paths:

codecov · 2026-04-05T07:38:56Z

Codecov Report

❌ Patch coverage is 83.84819% with 366 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.41%. Comparing base (14beb52) to head (d92407c).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2542      +/-   ##
==========================================
- Coverage   88.46%   88.41%   -0.05%     
==========================================
  Files         618      620       +2     
  Lines      226148   228394    +2246     
==========================================
+ Hits       200066   201943    +1877     
- Misses      25558    25927     +369     
  Partials      524      524

Components	Coverage Δ
otap-dataflow	`90.27% <83.84%> (-0.09%)`	⬇️
query_abstraction	`80.61% <ø> (ø)`
query_engine	`90.74% <ø> (ø)`
syslog_cef_receivers	`∅ <ø> (∅)`
otel-arrow-go	`52.45% <ø> (ø)`
quiver	`92.27% <ø> (ø)`

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

jmacd

Looks good to me.
I sort of imagine a configurable "stall" in the receivers when the system is in a soft-limit. Maybe 0 by default, but adding a small duration to delay inputs to allow memory to fall and would be complementary with per-tenant or per-pipeline memory admission limits.

lquerel · 2026-04-07T06:29:37Z

@lalitb In my opinion, this is moving in the right direction overall. However, my main concern is that the current enforcement path introduces process wide shared state that is consulted directly from ingress hot paths. That adds hidden cross thread coordination, which is not fully aligned with our thread-per-core, share-nothing, NUMA-aware direction.

I think a small architectural adjustment would make this fit much better: keep the global sampler and pressure classification, but propagate state transitions through the control plane and let each pinned receiver thread maintain its own local admission state. That would preserve the same high level behavior while keeping the fast path local and more predictable from a cache and NUMA perspective.

Longer term, I think we should move toward hierarchical memory budgeting with local fast-path admission: a process-wide budget, then per-NUMA or per-core budgets or leases underneath it, with the current process wide limiter acting more as a supervisory guardrail than the primary admission mechanism.

lquerel · 2026-04-07T06:32:10Z

The sampler could emit a MemoryPressureChanged event only on transitions or configuration changes. Something like.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MemoryPressureChanged {
    /// Monotonic update number assigned by the global sampler.
    pub generation: u64,

    /// Newly classified pressure level.
    pub level: MemoryPressureLevel,

    /// Receiver-facing retry hint to use while shedding ingress.
    pub retry_after_secs: u32,

    /// Most recent sampled process memory usage in bytes.
    pub usage_bytes: u64,
}

jmacd · 2026-04-07T15:09:30Z

I support @lquerel's proposal.

lalitb · 2026-04-07T17:20:44Z

@lquerel Thanks for the review. I agree with this direction. I will rework the enforcement path so the sampler and classification remain process-wide, but pressure transitions are propagated through the pipeline control plane and each receiver maintains receiver-local admission state.

The intended shape is:

controller-owned global sampler/classifier remains the supervisory source of truth
transitions are published on a controller-owned watch channel
each pipeline runtime subscribes and fans updates out to its receiver nodes via NodeControlMsg
each receiver updates a single receiver-local admission state object
ingress hot paths, including Tower/HTTP clones, consult only that receiver-local state, not the process-global limiter state

That keeps the Phase 1 behavior the same while removing direct process-wide state reads from ingress hot paths and fitting the pinned/share-nothing/NUMA-aware direction better.

lquerel

@lalitb Could you also update the README.md in the config crate and https://github.com/open-telemetry/otel-arrow/blob/main/rust/otap-dataflow/docs/configuration-model.md as well to describe this new policy

…mory-limiter

jmacd · 2026-04-07T22:35:32Z

@lalitb please update the design document and PR description with the changes involving NodeControlMessage and local state management.

…mory-limiter

lalitb · 2026-04-08T04:06:37Z

@lalitb please update the design document and PR description with the changes involving NodeControlMessage and local state management.

Thanks @jmacd, it's done now.

memory limiter impl

f4022d2

github-project-automation bot added this to OTel-Arrow Apr 5, 2026

github-actions bot added the rust Pull requests that update Rust code label Apr 5, 2026

Merge branch 'main' into memory-limiter

ae52e3e

lalitb added 7 commits April 5, 2026 12:57

Fix OTAP memory pressure handling

f8eafa8

Polish memory limiter overload handling

6d83855

docs(memory-limiter): fix syslog TCP wording and add constant comments

6c1411d

Require explicit memory limiter mode

f10e868

Merge branch 'main' into memory-limiter

37c1cfd

Merge branch 'main' into memory-limiter

a45968b

Merge branch 'main' into memory-limiter

df11f6d

lalitb marked this pull request as ready for review April 6, 2026 16:46

lalitb requested a review from a team as a code owner April 6, 2026 16:46

lalitb changed the title ~~[WIP] Add Phase 1 process-wide memory limiter~~ feat: Add Phase 1 process-wide memory limiter Apr 6, 2026

lalitb added 4 commits April 6, 2026 10:17

Merge branch 'main' into memory-limiter

931c07e

Merge branch 'main' into memory-limiter

7038e0c

Merge branch 'main' into memory-limiter

6487d48

Merge branch 'main' into memory-limiter

8146725

jmacd approved these changes Apr 7, 2026

View reviewed changes

lalitb added 2 commits April 7, 2026 11:01

Localize memory limiter enforcement

1a02061

Merge main into memory-limiter

c16447d

lquerel reviewed Apr 7, 2026

View reviewed changes

lalitb added 3 commits April 7, 2026 12:58

Merge branch 'main' into memory-limiter

55c40b6

config doc update

b2545b5

Merge branch 'memory-limiter' of github.com:lalitb/otel-arrow into me…

1dc8cdc

…mory-limiter

Merge branch 'main' into memory-limiter

8b9d875

jmacd approved these changes Apr 7, 2026

View reviewed changes

lalitb added 3 commits April 7, 2026 16:47

Merge branch 'main' into memory-limiter

97cbc80

fix

5d4436b

Merge branch 'memory-limiter' of github.com:lalitb/otel-arrow into me…

d92407c

…mory-limiter

jmacd added this pull request to the merge queue Apr 8, 2026

Merged via the queue into open-telemetry:main with commit 43638e2 Apr 8, 2026
113 of 119 checks passed

github-project-automation bot moved this to Done in OTel-Arrow Apr 8, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add Phase 1 process-wide memory limiter#2542

feat: Add Phase 1 process-wide memory limiter#2542
jmacd merged 22 commits intoopen-telemetry:mainfrom
lalitb:memory-limiter

lalitb commented Apr 5, 2026 •

edited

Loading

Uh oh!

codecov bot commented Apr 5, 2026 •

edited

Loading

Uh oh!

jmacd left a comment

Uh oh!

lquerel commented Apr 7, 2026

Uh oh!

lquerel commented Apr 7, 2026

Uh oh!

jmacd commented Apr 7, 2026

Uh oh!

lalitb commented Apr 7, 2026

Uh oh!

lquerel left a comment

Uh oh!

jmacd commented Apr 7, 2026

Uh oh!

lalitb commented Apr 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

lalitb commented Apr 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Basic Functionality

How to Review

Uh oh!

codecov bot commented Apr 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jmacd left a comment

Choose a reason for hiding this comment

Uh oh!

lquerel commented Apr 7, 2026

Uh oh!

lquerel commented Apr 7, 2026

Uh oh!

jmacd commented Apr 7, 2026

Uh oh!

lalitb commented Apr 7, 2026

Uh oh!

lquerel left a comment

Choose a reason for hiding this comment

Uh oh!

jmacd commented Apr 7, 2026

Uh oh!

lalitb commented Apr 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lalitb commented Apr 5, 2026 •

edited

Loading

codecov bot commented Apr 5, 2026 •

edited

Loading