feat: Add Phase 1 process-wide memory limiter#2542

Open
lalitb wants to merge 13 commits into open-telemetry:main from lalitb:memory-limiter

Conversation

Member

@lalitb lalitb commented Apr 5, 2026

Summary

This PR adds a process-wide memory limiter to the Rust collector.

The limiter samples process memory on a fixed interval, classifies pressure as
Normal, Soft, or Hard, and exposes that state through metrics and logs.
In enforce mode, receivers reject new ingress only at Hard. In
observe_only mode, the limiter remains telemetry-only.

This complements the collector's existing bounded-buffer backpressure. It does
not add per-group budgets, byte accounting, or strict process-memory caps.

Basic Functionality

  • Sample process memory from a configured source on a fixed interval
  • Derive or use configured soft/hard memory limits
  • Classify memory pressure as Normal, Soft, or Hard
  • Keep Soft informational in Phase 1
  • Reject new ingress at receivers when pressure reaches Hard in enforce mode
  • Optionally fail /readyz under Hard pressure in enforce mode
  • Support observe_only mode for metrics and logs without enforcement
  • Emit process-level memory gauges for current sampled usage and effective limits
  • Emit pressure-state telemetry and receiver rejection counters
  • Optionally attempt allocator-specific memory purge on Hard when supported by the build
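The sampling/classification steps above can be sketched as follows. This is a minimal illustration of the Phase 1 semantics described in the list, not the PR's actual API; the names `MemoryPressure` and `classify` are assumptions.

```rust
/// Illustrative sketch of Phase 1 pressure classification: sampled process
/// memory is compared against soft/hard limits. Names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemoryPressure {
    Normal,
    Soft, // informational only in Phase 1
    Hard, // receivers reject new ingress in enforce mode
}

/// Classify the most recent memory sample against the effective limits.
pub fn classify(usage_bytes: u64, soft_limit: u64, hard_limit: u64) -> MemoryPressure {
    if usage_bytes >= hard_limit {
        MemoryPressure::Hard
    } else if usage_bytes >= soft_limit {
        MemoryPressure::Soft
    } else {
        MemoryPressure::Normal
    }
}

fn main() {
    // With a 6 GiB soft and 8 GiB hard limit, 7 GiB of usage is Soft pressure.
    let gib: u64 = 1 << 30;
    assert_eq!(classify(7 * gib, 6 * gib, 8 * gib), MemoryPressure::Soft);
    println!("ok");
}
```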

How to Review

  • Start with docs/memory-limiter-phase1.md for scope, semantics, and operator-facing behavior.
  • Then review the core limiter and controller wiring:
    • crates/engine/src/memory_limiter.rs
    • crates/controller/src/lib.rs
    • crates/admin/src/health.rs
    • crates/config/src/policy.rs
    • crates/config/src/engine/validate.rs
  • Then review receiver enforcement paths:
    • crates/otap/src/memory_pressure_layer.rs
    • crates/otap/src/otlp_http.rs
    • crates/core-nodes/src/receivers/otlp_receiver/mod.rs
    • crates/core-nodes/src/receivers/otap_receiver/mod.rs
    • crates/otap/src/otap_grpc.rs
    • crates/core-nodes/src/receivers/syslog_cef_receiver/mod.rs

@github-actions github-actions bot added the rust Pull requests that update Rust code label Apr 5, 2026

codecov bot commented Apr 5, 2026

Codecov Report

❌ Patch coverage is 81.78844% with 334 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.31%. Comparing base (a1fa843) to head (8146725).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2542      +/-   ##
==========================================
- Coverage   88.36%   88.31%   -0.05%     
==========================================
  Files         614      616       +2     
  Lines      223275   225088    +1813     
==========================================
+ Hits       197293   198791    +1498     
- Misses      25458    25773     +315     
  Partials      524      524              
Components Coverage Δ
otap-dataflow 90.17% <81.78%> (-0.09%) ⬇️
query_abstraction 80.61% <ø> (ø)
query_engine 90.74% <ø> (ø)
syslog_cef_receivers ∅ <ø> (∅)
otel-arrow-go 52.45% <ø> (ø)
quiver 91.92% <ø> (ø)

@lalitb lalitb marked this pull request as ready for review April 6, 2026 16:46
@lalitb lalitb requested a review from a team as a code owner April 6, 2026 16:46
@lalitb lalitb changed the title from "[WIP] Add Phase 1 process-wide memory limiter" to "feat: Add Phase 1 process-wide memory limiter" Apr 6, 2026
Contributor

@jmacd jmacd left a comment


Looks good to me.
I sort of imagine a configurable "stall" in the receivers when the system is at the soft limit: perhaps 0 by default, but adding a small duration to delay inputs would allow memory to fall, and it would be complementary to per-tenant or per-pipeline memory admission limits.
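One way this configurable stall could look, as a sketch only: `IngressPolicy` and `soft_stall` are hypothetical names, and the PR does not implement this behavior.

```rust
use std::time::Duration;

/// Hypothetical receiver-side policy: a small, configurable delay applied
/// while the process is under Soft pressure. Zero (no stall) by default.
pub struct IngressPolicy {
    pub soft_stall: Duration,
}

impl Default for IngressPolicy {
    fn default() -> Self {
        Self { soft_stall: Duration::ZERO }
    }
}

impl IngressPolicy {
    /// Delay to apply before admitting a request. Soft pressure stalls
    /// (if configured); Hard would still reject outright in enforce mode.
    pub fn admission_delay(&self, under_soft_pressure: bool) -> Duration {
        if under_soft_pressure {
            self.soft_stall
        } else {
            Duration::ZERO
        }
    }
}

fn main() {
    let policy = IngressPolicy { soft_stall: Duration::from_millis(5) };
    assert_eq!(policy.admission_delay(true), Duration::from_millis(5));
    assert_eq!(IngressPolicy::default().admission_delay(true), Duration::ZERO);
    println!("ok");
}
```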

Contributor

lquerel commented Apr 7, 2026

@lalitb In my opinion, this is moving in the right direction overall. However, my main concern is that the current enforcement path introduces process-wide shared state that is consulted directly from ingress hot paths. That adds hidden cross-thread coordination, which is not fully aligned with our thread-per-core, share-nothing, NUMA-aware direction.

I think a small architectural adjustment would make this fit much better: keep the global sampler and pressure classification, but propagate state transitions through the control plane and let each pinned receiver thread maintain its own local admission state. That would preserve the same high level behavior while keeping the fast path local and more predictable from a cache and NUMA perspective.

Longer term, I think we should move toward hierarchical memory budgeting with local fast-path admission: a process-wide budget, then per-NUMA or per-core budgets or leases underneath it, with the current process-wide limiter acting more as a supervisory guardrail than the primary admission mechanism.
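The hierarchical-budgeting idea could look roughly like this: a process-wide budget hands out per-core leases, so the hot-path admission check stays local to each pinned thread. All names and numbers here are illustrative assumptions, not part of this PR.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide budget, touched only when a core refills its lease.
pub struct GlobalBudget {
    remaining: AtomicU64,
}

impl GlobalBudget {
    pub fn new(total_bytes: u64) -> Self {
        Self { remaining: AtomicU64::new(total_bytes) }
    }

    /// Try to carve a lease of `bytes` out of the global budget.
    fn try_lease(&self, bytes: u64) -> bool {
        self.remaining
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |r| r.checked_sub(bytes))
            .is_ok()
    }
}

/// Per-core lease; `admit` is purely local until the lease runs out.
pub struct CoreLease<'a> {
    global: &'a GlobalBudget,
    local_remaining: u64,
    lease_bytes: u64,
}

impl<'a> CoreLease<'a> {
    pub fn new(global: &'a GlobalBudget, lease_bytes: u64) -> Self {
        Self { global, local_remaining: 0, lease_bytes }
    }

    /// Admit `bytes` of ingress, refilling from the global budget if needed.
    pub fn admit(&mut self, bytes: u64) -> bool {
        if self.local_remaining < bytes && self.global.try_lease(self.lease_bytes) {
            self.local_remaining += self.lease_bytes;
        }
        if self.local_remaining >= bytes {
            self.local_remaining -= bytes;
            true
        } else {
            false
        }
    }
}

fn main() {
    let global = GlobalBudget::new(1024);
    let mut core = CoreLease::new(&global, 512);
    assert!(core.admit(100)); // first lease refill, then admit locally
    assert!(core.admit(500)); // second refill exhausts the global budget
    assert!(!core.admit(2048)); // exceeds what remains anywhere
    println!("ok");
}
```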


lquerel commented Apr 7, 2026

The sampler could emit a MemoryPressureChanged event only on transitions or configuration changes. Something like:

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MemoryPressureChanged {
    /// Monotonic update number assigned by the global sampler.
    pub generation: u64,

    /// Newly classified pressure level.
    pub level: MemoryPressureLevel,

    /// Receiver-facing retry hint to use while shedding ingress.
    pub retry_after_secs: u32,

    /// Most recent sampled process memory usage in bytes.
    pub usage_bytes: u64,
}
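A sketch of how a pinned receiver thread might consume that event, using `generation` to discard stale or reordered control-plane updates. The event shape mirrors the comment above; `MemoryPressureLevel` and the `ReceiverState` handler are assumptions for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemoryPressureLevel {
    Normal,
    Soft,
    Hard,
}

impl Default for MemoryPressureLevel {
    fn default() -> Self {
        MemoryPressureLevel::Normal
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MemoryPressureChanged {
    pub generation: u64,
    pub level: MemoryPressureLevel,
    pub retry_after_secs: u32,
    pub usage_bytes: u64,
}

/// Thread-local admission state owned by one pinned receiver thread.
#[derive(Default)]
pub struct ReceiverState {
    last_generation: u64,
    level: MemoryPressureLevel,
}

impl ReceiverState {
    /// Apply an update only if it is newer than the last one seen;
    /// returns whether the local state changed.
    pub fn apply(&mut self, ev: &MemoryPressureChanged) -> bool {
        if ev.generation <= self.last_generation {
            return false; // stale or duplicate control-plane message
        }
        self.last_generation = ev.generation;
        self.level = ev.level;
        true
    }

    /// Hot-path check: purely local, no cross-thread reads.
    pub fn admit(&self) -> bool {
        self.level != MemoryPressureLevel::Hard
    }
}

fn main() {
    let mut state = ReceiverState::default();
    assert!(state.admit());
    let ev = MemoryPressureChanged {
        generation: 1,
        level: MemoryPressureLevel::Hard,
        retry_after_secs: 5,
        usage_bytes: 8 << 30,
    };
    assert!(state.apply(&ev));
    assert!(!state.admit());
    println!("ok");
}
```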
