From 036a580d5e1d0d23c32d377476955f9d28c35d6d Mon Sep 17 00:00:00 2001 From: Srikalyan Swayampakula Date: Wed, 24 Dec 2025 09:30:14 -0800 Subject: [PATCH 01/10] KEP: Memory Manager Hugepages Availability Verification This KEP proposes enhancing the Memory Manager's Static policy to verify OS-reported free hugepages availability during pod admission. Problem: The Memory Manager only tracks hugepage allocations for Guaranteed QoS pods. Burstable/BestEffort pods can consume hugepages without being tracked, causing subsequent Guaranteed pods to be admitted but fail at runtime when hugepages are exhausted. Solution: - Add FreePages field to cadvisor's HugePagesInfo (PR google/cadvisor#3804) - Verify OS-reported free hugepages during Allocate() in Static policy - Reject pods when insufficient free hugepages are available Related: https://github.com/kubernetes/kubernetes/issues/134395 --- .../README.md | 475 ++++++++++++++++++ .../kep.yaml | 42 ++ 2 files changed, 517 insertions(+) create mode 100644 keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md create mode 100644 keps/sig-node/NNNN-memory-manager-hugepages-verification/kep.yaml diff --git a/keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md b/keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md new file mode 100644 index 000000000000..7d2bc32a4d0d --- /dev/null +++ b/keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md @@ -0,0 +1,475 @@ +# KEP-NNNN: Memory Manager Hugepages Availability Verification + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories](#user-stories) + - [Story 1: DPDK Application Admission Failure](#story-1-dpdk-application-admission-failure) + - [Story 2: Database Workload with Hugepages](#story-2-database-workload-with-hugepages) + - [Notes/Constraints/Caveats](#notesconstraintscaveats) 
+ - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Implementation Overview](#implementation-overview) + - [cadvisor Changes](#cadvisor-changes) + - [Memory Manager Changes](#memory-manager-changes) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+ +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +This KEP proposes enhancing the Memory Manager's Static policy to verify OS-reported +free hugepages availability during pod admission. Currently, the Memory Manager only +tracks hugepage allocations for Guaranteed QoS pods but doesn't verify actual +hugepage availability from the operating system. 
This can lead to pods being admitted +when hugepages aren't actually available, causing runtime failures. + +The enhancement adds verification by reading free hugepages from sysfs +(`/sys/devices/system/node/node/hugepages/hugepages-kB/free_hugepages`) +during pod admission, ensuring pods requesting hugepages are only admitted when +sufficient free hugepages exist. + +## Motivation + +The Memory Manager tracks hugepage allocations for Guaranteed QoS pods to provide +NUMA-aware memory and hugepage pinning. However, it operates on its internal +accounting without verifying the actual state of hugepages on the system. + +This creates a problem when: +1. Burstable or BestEffort pods consume hugepages (via hugetlbfs mounts or + `mmap` with `MAP_HUGETLB`) without being tracked by the Memory Manager +2. External processes or other system components consume hugepages +3. The Memory Manager's internal state becomes stale or inconsistent with reality + +In these scenarios, a Guaranteed pod requesting hugepages may be admitted based +on the Memory Manager's internal tracking, only to fail at runtime when the +container attempts to use the already-exhausted hugepages. + +### Goals + +- Verify OS-reported free hugepages during pod admission for the Static policy +- Reject pods requesting hugepages when insufficient free hugepages are available +- Provide clear error messages when admission fails due to insufficient hugepages +- Maintain backwards compatibility with existing Memory Manager behavior + +### Non-Goals + +- Track hugepage usage by Burstable or BestEffort pods in the Memory Manager +- Modify scheduler behavior or add hugepage awareness to the scheduler +- Provide hugepage reservation or preemption mechanisms +- Support platforms other than Linux + +## Proposal + +Enhance the Memory Manager's Static policy to verify actual hugepage availability +by querying sysfs during pod admission. This involves: + +1. 
**cadvisor enhancement**: Add a `FreePages` field to `HugePagesInfo` struct + that reports free hugepages per NUMA node, read from sysfs + +2. **Memory Manager enhancement**: During `Allocate()` in the Static policy, + verify that OS-reported free hugepages meet or exceed the requested amount + before admitting the pod + +### User Stories + +#### Story 1: DPDK Application Admission Failure + +As a cluster administrator running DPDK-based network functions, I deploy a +Burstable pod that mounts hugetlbfs and consumes 2GB of 1GB hugepages for packet +buffer pools. Later, I deploy a Guaranteed pod also requesting 2GB of 1GB hugepages. + +**Current behavior**: The Guaranteed pod is admitted (Memory Manager shows +hugepages as available) but fails at container startup when DPDK tries to allocate +hugepages that are already consumed. + +**Desired behavior**: The Guaranteed pod admission fails immediately with a clear +error indicating insufficient free hugepages, allowing the scheduler to try +another node or the administrator to take corrective action. + +#### Story 2: Database Workload with Hugepages + +As a database administrator, I run PostgreSQL with hugepages enabled for shared +buffers. If an external monitoring agent or debugging tool temporarily consumes +hugepages, subsequent Guaranteed pods requesting hugepages should not be admitted +until hugepages are freed. + +**Current behavior**: Pods are admitted based on Memory Manager tracking and fail +at runtime. + +**Desired behavior**: Pods are rejected at admission with informative errors. + +### Notes/Constraints/Caveats + +- **Race condition window**: A small window exists between verification and actual + container startup where hugepages could be consumed. This is inherent to any + admission-time check but significantly reduces the failure window compared to + no verification. + +- **sysfs dependency**: The feature depends on reading from sysfs. 
If sysfs is + unavailable or the free_hugepages file cannot be read, the feature gracefully + degrades to current behavior (no verification). + +- **Per-NUMA verification**: Verification is performed per-NUMA node, consistent + with the Memory Manager's NUMA-aware design. + +### Risks and Mitigations + +| Risk | Mitigation | +|------|------------| +| sysfs reads add latency to admission | Minimal impact: single file read per hugepage size per NUMA node | +| False rejections due to transient consumption | Acceptable: better to reject than admit and fail at runtime | +| sysfs unavailable in some environments | Graceful degradation: skip verification if sysfs unreadable | + +## Design Details + +### Implementation Overview + +The implementation consists of two parts: + +1. **cadvisor**: Add `FreePages *uint64` field to `HugePagesInfo` struct, populated + from sysfs. Uses pointer with `omitempty` to distinguish between "0 free" and + "data unavailable". + +2. **kubelet Memory Manager**: Add `verifyOSHugepagesAvailability()` function + called during `Allocate()` that compares requested hugepages against OS-reported + free hugepages from cadvisor's machine info. + +### cadvisor Changes + +```go +type HugePagesInfo struct { + // huge page size (in kB) + PageSize uint64 `json:"page_size"` + // number of huge pages + NumPages uint64 `json:"num_pages"` + // number of free huge pages (nil if unavailable) + FreePages *uint64 `json:"free_pages,omitempty"` +} +``` + +The `FreePages` field is populated by reading from: +``` +/sys/devices/system/node/node/hugepages/hugepages-kB/free_hugepages +``` + +### Memory Manager Changes + +During `Allocate()` in the Static policy: + +```go +func (p *staticPolicy) verifyOSHugepagesAvailability( + machineState state.NUMANodeMap, + pod *v1.Pod, + container *v1.Container, +) error { + // For each hugepage size requested by the container: + // 1. Get the OS-reported free hugepages from cadvisor machine info + // 2. 
Compare against the requested amount + // 3. Return error if insufficient +} +``` + +The verification: +- Only runs when the Static policy is enabled +- Only checks hugepage resources (not regular memory) +- Aggregates free hugepages across candidate NUMA nodes +- Returns admission error if insufficient free hugepages + +### Test Plan + +[x] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + +- Existing Memory Manager unit tests cover allocation logic +- cadvisor tests cover sysfs reading functionality + +##### Unit tests + +- `pkg/kubelet/cm/memorymanager`: Add tests for `verifyOSHugepagesAvailability()` + - Test successful verification when free hugepages >= requested + - Test rejection when free hugepages < requested + - Test graceful handling when FreePages is nil (sysfs unavailable) + - Test per-NUMA node verification + +##### Integration tests + +- Test Memory Manager with mocked cadvisor returning various FreePages values +- Test admission flow with hugepage verification enabled/disabled + +##### e2e tests + +- Test pod admission when hugepages are available +- Test pod rejection when hugepages are exhausted +- Test that rejected pods can be admitted after hugepages are freed + +### Graduation Criteria + +#### Alpha + +- Feature implemented behind `MemoryManagerHugepagesVerification` feature gate +- Unit tests for verification logic +- Documentation for feature gate and behavior + +#### Beta + +- E2e tests demonstrating correct behavior +- Metrics for verification failures +- Feedback incorporated from alpha users +- No significant bugs reported + +#### GA + +- Feature enabled by default +- Conformance tests if applicable +- Documentation updated for stable feature + +### Upgrade / Downgrade Strategy + +**Upgrade**: No special handling required. 
The feature is additive and controlled +by a feature gate. Existing pods are unaffected. + +**Downgrade**: Disabling the feature gate returns to previous behavior where +OS hugepage availability is not verified. No data migration needed. + +### Version Skew Strategy + +The feature is entirely within the kubelet and depends on cadvisor (vendored). +No control plane or cross-component version skew concerns. + +When kubelet is upgraded but cadvisor hasn't been updated to provide `FreePages`: +- The field will be `nil` +- Verification will be skipped (graceful degradation) +- Warning logged indicating verification unavailable + +## Production Readiness Review Questionnaire + +### Feature Enablement and Rollback + +###### How can this feature be enabled / disabled in a live cluster? + +- [x] Feature gate + - Feature gate name: `MemoryManagerHugepagesVerification` + - Components depending on the feature gate: kubelet + +###### Does enabling the feature change any default behavior? + +Yes. Pods requesting hugepages may be rejected at admission if the OS reports +insufficient free hugepages, even if the Memory Manager's internal tracking +shows availability. This is the intended behavior to prevent runtime failures. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + +Yes. Disabling the feature gate and restarting kubelet returns to previous +behavior. No persistent state is affected. + +###### What happens if we reenable the feature if it was previously rolled back? + +The feature resumes verification on new pod admissions. No special handling needed. + +###### Are there any tests for feature enablement/disablement? + +Unit tests will verify behavior with feature gate enabled and disabled. + +### Rollout, Upgrade and Rollback Planning + +###### How can a rollout or rollback fail? Can it impact already running workloads? + +The feature only affects pod admission, not running workloads. 
A rollout cannot +impact already running pods. Rollback simply stops verification on new admissions. + +###### What specific metrics should inform a rollback? + +- Unexpected increase in pod admission failures +- `memory_manager_hugepages_verification_failures_total` metric (proposed) + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + +TBD during alpha phase. + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + +No. + +### Monitoring Requirements + +###### How can an operator determine if the feature is in use by workloads? + +- Feature gate is enabled +- Pods request hugepages resources + +###### How can someone using this feature know that it is working for their instance? + +- [ ] Events + - Event Reason: `FailedHugepagesVerification` + - When: Pod admission rejected due to insufficient OS-reported free hugepages +- [ ] Other + - Kubelet logs will indicate verification being performed and results + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + +- Hugepage verification should add < 10ms to pod admission latency +- 99.9% of pods with sufficient free hugepages should be admitted successfully + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + +- [x] Metrics + - Metric name: `memory_manager_hugepages_verification_failures_total` + - Components exposing the metric: kubelet + - Metric name: `memory_manager_hugepages_verification_latency_seconds` + - Components exposing the metric: kubelet + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + +The proposed metrics should provide adequate observability. + +### Dependencies + +###### Does this feature depend on any specific services running in the cluster? 
+ +- cadvisor (bundled with kubelet) + - Usage: Provides machine info including hugepage free counts + - Impact of outage: Verification skipped, graceful degradation + - Impact of degraded performance: Slightly increased admission latency + +### Scalability + +###### Will enabling / using this feature result in any new API calls? + +No new API calls. The feature reads from local sysfs and cadvisor machine info. + +###### Will enabling / using this feature result in introducing new API types? + +No. + +###### Will enabling / using this feature result in any new calls to the cloud provider? + +No. + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + +No. + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + +Minimal impact on pod admission latency (< 10ms for sysfs reads). + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + +Negligible: periodic sysfs file reads during pod admission. + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + +No. The feature performs simple file reads. + +### Troubleshooting + +###### How does this feature react if the API server and/or etcd is unavailable? + +No impact. The feature operates entirely within kubelet using local sysfs. + +###### What are other known failure modes? + +- sysfs unavailable or unreadable + - Detection: Warning logs from kubelet, nil FreePages in machine info + - Mitigations: Feature gracefully degrades to previous behavior + - Diagnostics: Check kubelet logs for sysfs read warnings + - Testing: Unit tests cover this scenario + +###### What steps should be taken if SLOs are not being met to determine the problem? + +1. Check kubelet logs for verification-related messages +2. 
Verify sysfs is accessible and free_hugepages files exist +3. Compare Memory Manager state with actual sysfs values +4. Check for excessive pod admission rate causing contention + +## Implementation History + +- 2024-12-24: Initial KEP draft +- Related issue: https://github.com/kubernetes/kubernetes/issues/134395 +- cadvisor PR: https://github.com/google/cadvisor/pull/3804 + +## Drawbacks + +- Adds complexity to the admission path +- Small race window still exists between verification and container startup +- May reject pods that would have succeeded if hugepages were freed during startup + +## Alternatives + +### Alternative 1: Track all pod hugepage usage + +Extend Memory Manager to track hugepage usage by Burstable and BestEffort pods. + +**Rejected because**: +- Significant refactoring required +- Would not catch external (non-Kubernetes) hugepage consumers +- Changes the scope and purpose of Memory Manager + +### Alternative 2: Query sysfs directly in Memory Manager + +Read sysfs directly in Memory Manager without cadvisor changes. + +**Rejected because**: +- Duplicates sysfs reading logic already in cadvisor +- cadvisor already provides machine info abstraction +- Adding to cadvisor benefits other consumers of machine info + +### Alternative 3: Scheduler-level hugepage awareness + +Add hugepage availability awareness to the Kubernetes scheduler. 
+ +**Rejected because**: +- Much larger scope change +- Scheduler operates on reported capacity, not real-time availability +- Does not solve the admission-time verification problem diff --git a/keps/sig-node/NNNN-memory-manager-hugepages-verification/kep.yaml b/keps/sig-node/NNNN-memory-manager-hugepages-verification/kep.yaml new file mode 100644 index 000000000000..4ae698cc7144 --- /dev/null +++ b/keps/sig-node/NNNN-memory-manager-hugepages-verification/kep.yaml @@ -0,0 +1,42 @@ +title: Memory Manager Hugepages Availability Verification +kep-number: NNNN +authors: + - "@srikalyan" +owning-sig: sig-node +participating-sigs: [] +status: provisional +creation-date: 2024-12-24 +reviewers: + - TBD +approvers: + - TBD + +see-also: + - "/keps/sig-node/1769-memory-manager" + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.33" + +# The milestone at which this feature was, or is targeted to be, at each stage. 
+milestone: + alpha: "v1.33" + beta: "v1.34" + stable: "v1.35" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: MemoryManagerHugepagesVerification + components: + - kubelet +disable-supported: true + +# The following PRR answers are required at beta release +metrics: + - memory_manager_hugepages_verification_failures_total + - memory_manager_hugepages_verification_latency_seconds From 9cd22779d36c46b812dcbedb84787f8a9b550e86 Mon Sep 17 00:00:00 2001 From: Srikalyan Swayampakula Date: Fri, 26 Dec 2025 08:49:13 -0800 Subject: [PATCH 02/10] Fix TOC to pass verify-toc CI check --- .../README.md | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md b/keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md index 7d2bc32a4d0d..d36e29c06206 100644 --- a/keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md +++ b/keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md @@ -17,10 +17,10 @@ - [cadvisor Changes](#cadvisor-changes) - [Memory Manager Changes](#memory-manager-changes) - [Test Plan](#test-plan) - - [Prerequisite testing updates](#prerequisite-testing-updates) - - [Unit tests](#unit-tests) - - [Integration tests](#integration-tests) - - [e2e tests](#e2e-tests) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) - [Graduation Criteria](#graduation-criteria) - [Alpha](#alpha) - [Beta](#beta) @@ -37,6 +37,9 @@ - [Implementation History](#implementation-history) - [Drawbacks](#drawbacks) - [Alternatives](#alternatives) + - [Alternative 1: Track all pod hugepage usage](#alternative-1-track-all-pod-hugepage-usage) + - [Alternative 2: Query sysfs directly in Memory 
Manager](#alternative-2-query-sysfs-directly-in-memory-manager) + - [Alternative 3: Scheduler-level hugepage awareness](#alternative-3-scheduler-level-hugepage-awareness) ## Release Signoff Checklist From 9a89040b6b54046ff510b003aad8ee386147ea58 Mon Sep 17 00:00:00 2001 From: Srikalyan Swayampakula Date: Sat, 27 Dec 2025 09:16:53 -0800 Subject: [PATCH 03/10] Address reviewer feedback and close design gaps Key changes: - Update milestones to v1.36/v1.37/v1.38 - Clarify sysfs reading: add GetCurrentHugepagesInfo() for fresh reads (GetMachineInfo() is cached at startup, would be stale) - Add Integration with Topology Manager section with policy behavior table - Add Interaction with CPU Manager section - Address reserved hugepages (free_hugepages is correct metric) - Expand race condition discussion with failure handling details - Rewrite Story 2 as "Rapid Pod Churn" with clear timeline - Add "Static policy only" note (None policy not applicable) - Specify error message format with example - Add kubelet restart behavior note - Update Risks table with new mitigations - Fix unit test description (removed nil reference) - Update TOC with new sections - Link enhancement issue #5759 Related: https://github.com/kubernetes/enhancements/issues/5759 --- .../README.md | 350 ++++++++++++++---- .../kep.yaml | 11 +- 2 files changed, 289 insertions(+), 72 deletions(-) rename keps/sig-node/{NNNN-memory-manager-hugepages-verification => 5759-memory-manager-hugepages-verification}/README.md (50%) rename keps/sig-node/{NNNN-memory-manager-hugepages-verification => 5759-memory-manager-hugepages-verification}/kep.yaml (88%) diff --git a/keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md b/keps/sig-node/5759-memory-manager-hugepages-verification/README.md similarity index 50% rename from keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md rename to keps/sig-node/5759-memory-manager-hugepages-verification/README.md index d36e29c06206..d7ae7b123438 100644 --- 
a/keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md +++ b/keps/sig-node/5759-memory-manager-hugepages-verification/README.md @@ -1,21 +1,27 @@ -# KEP-NNNN: Memory Manager Hugepages Availability Verification +# KEP-5759: Memory Manager Hugepages Availability Verification - [Release Signoff Checklist](#release-signoff-checklist) - [Summary](#summary) - [Motivation](#motivation) + - [The Tracking Gap](#the-tracking-gap) + - [Real-World Example](#real-world-example) - [Goals](#goals) - [Non-Goals](#non-goals) - [Proposal](#proposal) + - [Current Admission Flow](#current-admission-flow) - [User Stories](#user-stories) - [Story 1: DPDK Application Admission Failure](#story-1-dpdk-application-admission-failure) - - [Story 2: Database Workload with Hugepages](#story-2-database-workload-with-hugepages) + - [Story 2: Rapid Pod Churn with Hugepages](#story-2-rapid-pod-churn-with-hugepages) - [Notes/Constraints/Caveats](#notesconstraintscaveats) - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) - [Implementation Overview](#implementation-overview) - [cadvisor Changes](#cadvisor-changes) - [Memory Manager Changes](#memory-manager-changes) + - [Integration with Topology Manager](#integration-with-topology-manager) + - [Interaction with CPU Manager](#interaction-with-cpu-manager) + - [Observability](#observability) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) - [Unit tests](#unit-tests) @@ -46,7 +52,8 @@ Items marked with (R) are required *prior to targeting to a milestone / release*. 
-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) + - Enhancement issue: https://github.com/kubernetes/enhancements/issues/5759 - [ ] (R) KEP approvers have approved the KEP status as `implementable` - [ ] (R) Design details are appropriately documented - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) @@ -81,19 +88,44 @@ sufficient free hugepages exist. ## Motivation -The Memory Manager tracks hugepage allocations for Guaranteed QoS pods to provide -NUMA-aware memory and hugepage pinning. However, it operates on its internal -accounting without verifying the actual state of hugepages on the system. +The Memory Manager's Static policy tracks hugepage allocations for Guaranteed QoS +pods to provide NUMA-aware memory and hugepage pinning. However, it operates on +its internal accounting without verifying the actual state of hugepages on the +system. -This creates a problem when: -1. Burstable or BestEffort pods consume hugepages (via hugetlbfs mounts or - `mmap` with `MAP_HUGETLB`) without being tracked by the Memory Manager -2. External processes or other system components consume hugepages -3. The Memory Manager's internal state becomes stale or inconsistent with reality +### The Tracking Gap -In these scenarios, a Guaranteed pod requesting hugepages may be admitted based -on the Memory Manager's internal tracking, only to fail at runtime when the -container attempts to use the already-exhausted hugepages. +The Kubernetes scheduler tracks hugepages at the **node level** - it knows total +hugepage capacity and allocated amounts per node. The Memory Manager's Static +policy tracks hugepages at the **per-NUMA level**, but only for Guaranteed QoS +pods that it manages for NUMA placement. 
+ +This creates a tracking gap: **Burstable pods can legitimately request hugepages +through standard Kubernetes resource requests** (e.g., `hugepages-2Mi: 1Gi`). +These requests are: +- Properly validated by the scheduler +- Correctly configured in cgroup limits +- Accounted for at the node level + +However, the Memory Manager does not track these Burstable pod allocations for +NUMA placement purposes. When a subsequent Guaranteed pod requests hugepages: +1. The scheduler approves it (node-level accounting shows availability) +2. The Memory Manager's internal state shows hugepages as available +3. But the OS has already allocated those hugepages to the Burstable pod +4. The Guaranteed pod fails at runtime when hugepages are exhausted + +### Real-World Example + +From [issue #134395](https://github.com/kubernetes/kubernetes/issues/134395), +on an m6id.32xlarge instance with 2 NUMA nodes: + +``` +Memory Manager internal state: 15.2 GB free hugepages +Actual OS state (sysfs): 3.2 GB free hugepages +``` + +The 12GB discrepancy was due to Burstable pods consuming hugepages that the +Memory Manager wasn't tracking. ### Goals @@ -121,55 +153,105 @@ by querying sysfs during pod admission. This involves: verify that OS-reported free hugepages meet or exceed the requested amount before admitting the pod +### Current Admission Flow + +Understanding where this enhancement fits in the existing admission flow: + +1. **Scheduler**: Checks node-level hugepage capacity and allocations. Ensures + the node has sufficient total hugepages for the pod's request. + +2. **Kubelet Admission**: When a pod is assigned to a node, kubelet performs + local admission checks including resource availability. + +3. 
**Memory Manager (Static policy)**: For Guaranteed QoS pods, the Memory + Manager's `Allocate()` function: + - Checks its internal state for available hugepages per NUMA node + - Selects NUMA nodes for the allocation + - Updates its internal tracking + - **Gap**: Does not verify actual OS-reported free hugepages + +4. **Container Runtime**: Creates the container with cgroup limits set. If + hugepages are not actually available, the container fails at startup. + +**This KEP addresses the gap in step 3** by adding OS-level verification before +updating internal tracking. + ### User Stories #### Story 1: DPDK Application Admission Failure As a cluster administrator running DPDK-based network functions, I deploy a -Burstable pod that mounts hugetlbfs and consumes 2GB of 1GB hugepages for packet -buffer pools. Later, I deploy a Guaranteed pod also requesting 2GB of 1GB hugepages. +Burstable pod that requests `hugepages-1Gi: 2Gi` for DPDK packet buffer pools. +Later, I deploy a Guaranteed pod also requesting `hugepages-1Gi: 2Gi`. **Current behavior**: The Guaranteed pod is admitted (Memory Manager shows hugepages as available) but fails at container startup when DPDK tries to allocate -hugepages that are already consumed. +hugepages that are already consumed by the Burstable pod. **Desired behavior**: The Guaranteed pod admission fails immediately with a clear error indicating insufficient free hugepages, allowing the scheduler to try another node or the administrator to take corrective action. -#### Story 2: Database Workload with Hugepages +#### Story 2: Rapid Pod Churn with Hugepages -As a database administrator, I run PostgreSQL with hugepages enabled for shared -buffers. If an external monitoring agent or debugging tool temporarily consumes -hugepages, subsequent Guaranteed pods requesting hugepages should not be admitted -until hugepages are freed. +As a platform engineer, I run batch jobs that use hugepages. 
Multiple jobs complete +and new jobs start in quick succession: -**Current behavior**: Pods are admitted based on Memory Manager tracking and fail -at runtime. +1. Node has 8GB of 2MB hugepages total +2. Burstable Job A (requests 4GB hugepages) completes, releasing hugepages +3. Guaranteed Job B (requests 6GB hugepages) is scheduled to this node +4. Before Job B's container starts, Burstable Job C (requests 4GB hugepages) starts +5. Job C's container allocates hugepages from the OS -**Desired behavior**: Pods are rejected at admission with informative errors. +**Current behavior**: The scheduler approved Job B based on node capacity (8GB). +Memory Manager's internal state (tracking only Guaranteed pods) shows 8GB available. +Job B is admitted, but when its container starts, only 4GB are actually free. +Job B fails at runtime. + +**Desired behavior**: Memory Manager reads sysfs during admission and sees only +4GB free. Job B is rejected with error: +`insufficient hugepages-2Mi on NUMA node(s) [0,1]: requested 6Gi, available 4Gi` + +Job B can be rescheduled to another node with sufficient hugepages. ### Notes/Constraints/Caveats -- **Race condition window**: A small window exists between verification and actual - container startup where hugepages could be consumed. This is inherent to any - admission-time check but significantly reduces the failure window compared to - no verification. +- **Race condition window**: A window exists between verification and actual + container startup where hugepages could be consumed by another process. This is + inherent to any admission-time check. + + **What happens if verification passes but container still fails?** + 1. Container startup fails with OOM or hugepage allocation error + 2. Kubelet emits `FailedCreatePodContainer` event with details + 3. Pod enters `CrashLoopBackOff` or `Error` state + 4. 
Scheduler may reschedule to another node (if applicable) + + **Why this is still valuable**: Without verification, the failure window spans + from pod scheduling to container startup (seconds to minutes). With verification, + the window is reduced to milliseconds between sysfs read and container start. + The vast majority of failures are prevented. -- **sysfs dependency**: The feature depends on reading from sysfs. If sysfs is - unavailable or the free_hugepages file cannot be read, the feature gracefully - degrades to current behavior (no verification). +- **Linux-only**: This feature is Linux-specific. The sysfs interface for hugepages + (`/sys/devices/system/node/node/hugepages/`) is a Linux kernel feature. + On Linux systems where hugepages are configured, this sysfs interface is always + available. - **Per-NUMA verification**: Verification is performed per-NUMA node, consistent - with the Memory Manager's NUMA-aware design. + with the Memory Manager's NUMA-aware design and Topology Manager coordination. + +- **Static policy only**: Verification only applies when Memory Manager's Static + policy is enabled. With the "None" policy, Memory Manager doesn't track hugepage + allocations at all, so there's no internal state to become stale. The scheduler's + node-level tracking is the only safeguard with the None policy. 
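
The per-NUMA sysfs counters referred to in these notes are page counts, while pod requests such as `hugepages-2Mi: 1Gi` are byte quantities, so any comparison needs a unit conversion first. The following is a minimal sketch of that arithmetic, not code from the KEP: `pagesNeeded` is a hypothetical helper, and rejecting (rather than rounding) a request that is not a whole number of pages is an assumption made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// pagesNeeded converts a hugepage request in bytes into a page count for the
// given page size in kB (for example 2048 for 2Mi pages). Hypothetical helper:
// requests that are not a whole multiple of the page size are rejected here
// rather than rounded, which is an assumption of this sketch.
func pagesNeeded(requestBytes, pageSizeKB uint64) (uint64, error) {
	if pageSizeKB == 0 {
		return 0, errors.New("page size must be non-zero")
	}
	pageBytes := pageSizeKB * 1024
	if requestBytes%pageBytes != 0 {
		return 0, fmt.Errorf("request of %d bytes is not a multiple of the %d kB page size",
			requestBytes, pageSizeKB)
	}
	return requestBytes / pageBytes, nil
}

func main() {
	// hugepages-2Mi: 1Gi, expressed as bytes and a 2048 kB page size.
	pages, err := pagesNeeded(1<<30, 2048)
	if err != nil {
		fmt.Println("invalid request:", err)
		return
	}
	fmt.Printf("1Gi of 2Mi hugepages = %d pages\n", pages)
}
```

The resulting page count is what would be compared against the free page count reported for each candidate NUMA node.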
### Risks and Mitigations | Risk | Mitigation | |------|------------| -| sysfs reads add latency to admission | Minimal impact: single file read per hugepage size per NUMA node | -| False rejections due to transient consumption | Acceptable: better to reject than admit and fail at runtime | -| sysfs unavailable in some environments | Graceful degradation: skip verification if sysfs unreadable | +| sysfs reads add latency to admission | Minimal impact: single file read per hugepage size per NUMA node; < 1ms typically | +| False rejections due to transient consumption | Acceptable: better to reject than admit and fail at runtime; pod can be rescheduled | +| Verification passes but container still fails (race) | Window is milliseconds vs seconds/minutes without verification; event emitted for debugging | +| Fresh sysfs reads on every Allocate() | Lightweight operation; only triggered for pods requesting hugepages | ## Design Details @@ -177,55 +259,159 @@ at runtime. The implementation consists of two parts: -1. **cadvisor**: Add `FreePages *uint64` field to `HugePagesInfo` struct, populated - from sysfs. Uses pointer with `omitempty` to distinguish between "0 free" and - "data unavailable". +1. **cadvisor**: Add `FreePages uint64` field to `HugePagesInfo` struct, populated + from sysfs. Also expose a method to read current free hugepages on-demand. 2. **kubelet Memory Manager**: Add `verifyOSHugepagesAvailability()` function - called during `Allocate()` that compares requested hugepages against OS-reported - free hugepages from cadvisor's machine info. + called during `Allocate()` that reads **fresh** hugepage availability from sysfs. + +**Important**: cadvisor's `GetMachineInfo()` is called once at startup and cached. +The `FreePages` field in cached machine info would be stale. Therefore, verification +must read sysfs directly during each `Allocate()` call, not rely on cached values. 
+We will add a `GetCurrentHugepagesInfo()` method to cadvisor's `Manager` interface +that performs a fresh sysfs read. ### cadvisor Changes +**Struct update**: ```go type HugePagesInfo struct { // huge page size (in kB) PageSize uint64 `json:"page_size"` // number of huge pages NumPages uint64 `json:"num_pages"` - // number of free huge pages (nil if unavailable) - FreePages *uint64 `json:"free_pages,omitempty"` + // number of free huge pages + FreePages uint64 `json:"free_pages"` } ``` +**New method on Manager interface**: +```go +// GetCurrentHugepagesInfo returns fresh hugepage info per NUMA node by reading sysfs. +// This is separate from GetMachineInfo() which returns cached startup data. +func (m *manager) GetCurrentHugepagesInfo() (map[int][]HugePagesInfo, error) +``` + The `FreePages` field is populated by reading from: ``` /sys/devices/system/node/node/hugepages/hugepages-kB/free_hugepages ``` +**Note on reserved hugepages**: Linux tracks `resv_hugepages` (reserved but not +yet faulted). For this implementation, we use `free_hugepages` directly because: +- Reserved pages are committed to specific processes +- A new pod cannot use reserved pages +- `free_hugepages` accurately reflects what's available for new allocations + +**Note**: Since sysfs is always available on Linux systems with hugepages configured, +we use a simple `uint64` rather than a pointer. A value of 0 means zero free +hugepages are available. + ### Memory Manager Changes During `Allocate()` in the Static policy: ```go func (p *staticPolicy) verifyOSHugepagesAvailability( - machineState state.NUMANodeMap, + candidateNUMANodes []int, // NUMA nodes selected by allocation algorithm pod *v1.Pod, container *v1.Container, ) error { - // For each hugepage size requested by the container: - // 1. Get the OS-reported free hugepages from cadvisor machine info - // 2. Compare against the requested amount - // 3. Return error if insufficient + // 1. 
Call cadvisor's GetCurrentHugepagesInfo() to get fresh sysfs data + // 2. For each hugepage size requested by the container: + // a. Sum free hugepages across candidateNUMANodes only + // b. Compare against the requested amount + // 3. Return error if insufficient, with detailed message } ``` The verification: -- Only runs when the Static policy is enabled +- Only runs when the Static policy is enabled and feature gate is on - Only checks hugepage resources (not regular memory) -- Aggregates free hugepages across candidate NUMA nodes +- **Respects NUMA node selection**: Only checks the specific NUMA nodes that the + Memory Manager's allocation algorithm has selected (see Topology Manager section) - Returns admission error if insufficient free hugepages +**Error message format**: +``` +insufficient hugepages-2Mi on NUMA node(s) [0]: requested 4Gi, available 2Gi +``` + +### Integration with Topology Manager + +The Memory Manager works with Topology Manager to coordinate NUMA-aware resource +allocation. The verification must respect Topology Manager's policy: + +| Topology Policy | Verification Behavior | +|-----------------|----------------------| +| `none` | Not applicable (Memory Manager Static policy requires topology-aware policies) | +| `best-effort` | Check aggregate across all candidate NUMA nodes | +| `restricted` | Check only NUMA nodes that satisfy topology constraints | +| `single-numa-node` | Check only the single selected NUMA node | + +**Critical**: Verification happens **after** the Memory Manager's allocation algorithm +selects candidate NUMA nodes based on topology constraints. We verify against those +specific nodes, not all nodes on the system. 
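
The rule of verifying only the selected NUMA nodes can be sketched as a small pure function. This is an illustration under stated assumptions, not the KEP's implementation: `verifyCandidateNodes` is a hypothetical name, `freeBytesByNode` stands in for whatever source supplies fresh per-NUMA data for one hugepage size, and the error text is only indicative.

```go
package main

import "fmt"

// verifyCandidateNodes sums free hugepage bytes over the candidate NUMA nodes
// only and returns an error when that total is below the request.
// freeBytesByNode maps NUMA node ID to free bytes for a single hugepage size
// (hypothetical input shape for this sketch).
func verifyCandidateNodes(freeBytesByNode map[int]uint64, candidates []int, requested uint64) error {
	var available uint64
	for _, node := range candidates {
		// Nodes missing from the map contribute zero free bytes.
		available += freeBytesByNode[node]
	}
	if available < requested {
		return fmt.Errorf("insufficient hugepages on NUMA node(s) %v: requested %d bytes, available %d bytes",
			candidates, requested, available)
	}
	return nil
}

func main() {
	const gib = uint64(1) << 30
	free := map[int]uint64{0: 2 * gib, 1: 3 * gib}

	// single-numa-node style: only NUMA0 was selected and it fits the request.
	fmt.Println(verifyCandidateNodes(free, []int{0}, 2*gib))
	// A 3GiB request does not fit on NUMA0 alone, even though NUMA1 has room.
	fmt.Println(verifyCandidateNodes(free, []int{0}, 3*gib))
	// best-effort style: the aggregate over both candidates is checked.
	fmt.Println(verifyCandidateNodes(free, []int{0, 1}, 5*gib))
}
```

Free memory on nodes outside the candidate set is deliberately ignored, which is what keeps the check consistent with the placement the allocation algorithm has already chosen.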
+ +Example with `single-numa-node` policy: +``` +Node topology: NUMA0 (2GB free), NUMA1 (3GB free) +Pod requests: 2GB hugepages +Allocation selects: NUMA0 (meets the request) +Verification checks: NUMA0 only → 2GB available ≥ 2GB requested ✓ +``` + +Example where aggregate would be misleading: +``` +Node topology: NUMA0 (1GB free), NUMA1 (1GB free) +Pod requests: 2GB hugepages with single-numa-node policy +Allocation fails: Neither NUMA node has 2GB alone +(Verification never reached - allocation algorithm rejects first) +``` + +### Interaction with CPU Manager + +When CPU Manager pins a pod to specific CPUs, those CPUs belong to specific NUMA +nodes. Topology Manager coordinates this to ensure Memory Manager allocates from +the same NUMA node(s). The verification inherits this coordination because it +checks only the candidate NUMA nodes selected by the allocation algorithm. + +### Observability + +This feature provides explicit signals for operators to monitor hugepage verification: + +#### Metrics + +| Metric | Type | Description | +|--------|------|-------------| +| `memory_manager_hugepages_verification_total` | Counter | Total verification checks performed. Labels: `result` (success/failure), `hugepage_size` | +| `memory_manager_hugepages_verification_failures_total` | Counter | Pods rejected due to insufficient OS-reported hugepages. 
Labels: `hugepage_size`, `numa_node` | +| `memory_manager_hugepages_verification_latency_seconds` | Histogram | Time spent performing verification (buckets: 1ms to 100ms) | + +#### Events + +When a pod is rejected due to insufficient hugepages, a Kubernetes event is generated: + +``` +Type: Warning +Reason: FailedHugepagesVerification +Message: insufficient hugepages-2Mi on NUMA node(s) [0]: requested 4Gi, available 2Gi +``` + +#### Kubelet Logs + +At `--v=4` or higher, kubelet logs verification details: +``` +I0127 10:15:32.123456 12345 policy_static.go:XXX] "Verifying OS hugepages availability" pod="default/dpdk-app" container="dpdk" +I0127 10:15:32.123789 12345 policy_static.go:XXX] "Hugepages verification passed" pod="default/dpdk-app" numaNodes=[0] size="hugepages-2Mi" requested=1073741824 available=2147483648 +``` + +#### Alerting Recommendations + +Operators should consider alerts for: +- `rate(memory_manager_hugepages_verification_failures_total[5m]) > 0`: Pods being rejected +- `histogram_quantile(0.99, rate(memory_manager_hugepages_verification_latency_seconds_bucket[5m])) > 0.05`: High verification latency + ### Test Plan [x] I/we understand the owners of the involved components may require updates to @@ -242,8 +428,10 @@ to implement this enhancement. - `pkg/kubelet/cm/memorymanager`: Add tests for `verifyOSHugepagesAvailability()` - Test successful verification when free hugepages >= requested - Test rejection when free hugepages < requested - - Test graceful handling when FreePages is nil (sysfs unavailable) - - Test per-NUMA node verification + - Test verification with zero free hugepages (FreePages = 0) + - Test per-NUMA node verification respects candidate node selection + - Test multiple hugepage sizes in same request + - Test with feature gate enabled/disabled ##### Integration tests @@ -262,6 +450,9 @@ to implement this enhancement. 
- Feature implemented behind `MemoryManagerHugepagesVerification` feature gate - Unit tests for verification logic +- E2e tests demonstrating: + - Pod admission succeeds when sufficient free hugepages exist + - Pod admission fails when insufficient free hugepages exist - Documentation for feature gate and behavior #### Beta @@ -285,15 +476,19 @@ by a feature gate. Existing pods are unaffected. **Downgrade**: Disabling the feature gate returns to previous behavior where OS hugepage availability is not verified. No data migration needed. +**Kubelet restart behavior**: After kubelet restarts, Memory Manager rebuilds its +internal state from checkpoint. Since verification reads fresh sysfs data on each +`Allocate()` call, there's no stale state concern. New pod admissions after restart +will correctly verify against current OS hugepage availability. + ### Version Skew Strategy The feature is entirely within the kubelet and depends on cadvisor (vendored). No control plane or cross-component version skew concerns. -When kubelet is upgraded but cadvisor hasn't been updated to provide `FreePages`: -- The field will be `nil` -- Verification will be skipped (graceful degradation) -- Warning logged indicating verification unavailable +Since cadvisor is vendored into kubelet, the kubelet and cadvisor versions are +always synchronized. The `FreePages` field will be available when the feature +gate is enabled. ## Production Readiness Review Questionnaire @@ -322,7 +517,11 @@ The feature resumes verification on new pod admissions. No special handling need ###### Are there any tests for feature enablement/disablement? -Unit tests will verify behavior with feature gate enabled and disabled. +Yes. 
Unit tests will verify: +- When feature gate is disabled: verification is skipped, pods are admitted + based on Memory Manager's internal tracking (existing behavior) +- When feature gate is enabled: verification is performed, pods are rejected + if OS-reported free hugepages are insufficient ### Rollout, Upgrade and Rollback Planning @@ -367,14 +566,25 @@ No. ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? - [x] Metrics + - Metric name: `memory_manager_hugepages_verification_total` + - Components exposing the metric: kubelet + - Description: Total number of hugepages verification checks performed + - Labels: `result` (success, failure), `hugepage_size` (e.g., 2Mi, 1Gi) - Metric name: `memory_manager_hugepages_verification_failures_total` - - Components exposing the metric: kubelet + - Components exposing the metric: kubelet + - Description: Total number of pods rejected due to insufficient OS-reported hugepages + - Labels: `hugepage_size`, `numa_node` - Metric name: `memory_manager_hugepages_verification_latency_seconds` - - Components exposing the metric: kubelet + - Components exposing the metric: kubelet + - Description: Histogram of time spent performing hugepages verification + - Buckets: 0.001, 0.005, 0.01, 0.025, 0.05, 0.1 seconds ###### Are there any missing metrics that would be useful to have to improve observability of this feature? -The proposed metrics should provide adequate observability. +Additional metrics that could be added in Beta: +- `memory_manager_hugepages_discrepancy_bytes`: Gauge showing difference between + Memory Manager's internal tracking and OS-reported free hugepages (useful for + detecting drift) ### Dependencies @@ -423,22 +633,28 @@ No impact. The feature operates entirely within kubelet using local sysfs. ###### What are other known failure modes? 
-- sysfs unavailable or unreadable - - Detection: Warning logs from kubelet, nil FreePages in machine info - - Mitigations: Feature gracefully degrades to previous behavior - - Diagnostics: Check kubelet logs for sysfs read warnings - - Testing: Unit tests cover this scenario +- Verification rejects pods that would have succeeded + - Detection: Increase in `memory_manager_hugepages_verification_failures_total` + with pods eventually succeeding on retry + - Mitigations: This indicates transient hugepage consumption; the feature is + working correctly by preventing admission during contention + - Diagnostics: Compare verification failure count with actual runtime failures + - Testing: E2e tests verify this scenario ###### What steps should be taken if SLOs are not being met to determine the problem? 1. Check kubelet logs for verification-related messages -2. Verify sysfs is accessible and free_hugepages files exist -3. Compare Memory Manager state with actual sysfs values +2. Review `memory_manager_hugepages_verification_latency_seconds` histogram + for unusually slow verification +3. Compare Memory Manager state with actual sysfs values using: + `cat /sys/devices/system/node/node*/hugepages/hugepages-*/free_hugepages` 4. 
Check for excessive pod admission rate causing contention ## Implementation History - 2024-12-24: Initial KEP draft +- 2024-12-27: KEP updated based on reviewer feedback +- Enhancement issue: https://github.com/kubernetes/enhancements/issues/5759 - Related issue: https://github.com/kubernetes/kubernetes/issues/134395 - cadvisor PR: https://github.com/google/cadvisor/pull/3804 diff --git a/keps/sig-node/NNNN-memory-manager-hugepages-verification/kep.yaml b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml similarity index 88% rename from keps/sig-node/NNNN-memory-manager-hugepages-verification/kep.yaml rename to keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml index 4ae698cc7144..9834b8d39f9e 100644 --- a/keps/sig-node/NNNN-memory-manager-hugepages-verification/kep.yaml +++ b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml @@ -1,5 +1,5 @@ title: Memory Manager Hugepages Availability Verification -kep-number: NNNN +kep-number: 5759 authors: - "@srikalyan" owning-sig: sig-node @@ -20,13 +20,13 @@ stage: alpha # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on. -latest-milestone: "v1.33" +latest-milestone: "v1.36" # The milestone at which this feature was, or is targeted to be, at each stage. 
milestone: - alpha: "v1.33" - beta: "v1.34" - stable: "v1.35" + alpha: "v1.36" + beta: "v1.37" + stable: "v1.38" # The following PRR answers are required at alpha release # List the feature gate name and the components for which it must be enabled @@ -38,5 +38,6 @@ disable-supported: true # The following PRR answers are required at beta release metrics: + - memory_manager_hugepages_verification_total - memory_manager_hugepages_verification_failures_total - memory_manager_hugepages_verification_latency_seconds From 8e6ae0902c2c2e3acdbcb20df3a78e12e574574f Mon Sep 17 00:00:00 2001 From: Srikalyan Swayampakula Date: Sat, 27 Dec 2025 12:11:56 -0800 Subject: [PATCH 04/10] Present implementation options without recommendation - Add two implementation approaches: Option A (direct sysfs) and Option B (cadvisor) - Present pros/cons for each option neutrally for KEP review - Remove cadvisor-specific sections, replace with options discussion - Add Observability section with metrics, events, logs, alerting - Update TOC to pass CI verification - Update KEP number to 5759 throughout The choice between implementation approaches is left to KEP reviewers based on maintainability preferences and timeline considerations. 
--- .../README.md | 143 +++++++++--------- 1 file changed, 73 insertions(+), 70 deletions(-) diff --git a/keps/sig-node/5759-memory-manager-hugepages-verification/README.md b/keps/sig-node/5759-memory-manager-hugepages-verification/README.md index d7ae7b123438..9eb9ccd4d4db 100644 --- a/keps/sig-node/5759-memory-manager-hugepages-verification/README.md +++ b/keps/sig-node/5759-memory-manager-hugepages-verification/README.md @@ -17,11 +17,18 @@ - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) - [Implementation Overview](#implementation-overview) - - [cadvisor Changes](#cadvisor-changes) + - [Implementation Approaches](#implementation-approaches) + - [Option A: Direct sysfs Reading in Memory Manager](#option-a-direct-sysfs-reading-in-memory-manager) + - [Option B: Add Fresh-Read Method to cadvisor](#option-b-add-fresh-read-method-to-cadvisor) + - [sysfs Interface](#sysfs-interface) - [Memory Manager Changes](#memory-manager-changes) - [Integration with Topology Manager](#integration-with-topology-manager) - [Interaction with CPU Manager](#interaction-with-cpu-manager) - [Observability](#observability) + - [Metrics](#metrics) + - [Events](#events) + - [Kubelet Logs](#kubelet-logs) + - [Alerting Recommendations](#alerting-recommendations) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) - [Unit tests](#unit-tests) @@ -44,8 +51,7 @@ - [Drawbacks](#drawbacks) - [Alternatives](#alternatives) - [Alternative 1: Track all pod hugepage usage](#alternative-1-track-all-pod-hugepage-usage) - - [Alternative 2: Query sysfs directly in Memory Manager](#alternative-2-query-sysfs-directly-in-memory-manager) - - [Alternative 3: Scheduler-level hugepage awareness](#alternative-3-scheduler-level-hugepage-awareness) + - [Alternative 2: Scheduler-level hugepage awareness](#alternative-2-scheduler-level-hugepage-awareness) ## Release Signoff Checklist @@ -144,14 +150,14 @@ Memory Manager wasn't tracking. 
## Proposal Enhance the Memory Manager's Static policy to verify actual hugepage availability -by querying sysfs during pod admission. This involves: +by querying sysfs during pod admission: -1. **cadvisor enhancement**: Add a `FreePages` field to `HugePagesInfo` struct - that reports free hugepages per NUMA node, read from sysfs +**Memory Manager enhancement**: During `Allocate()` in the Static policy, +verify that OS-reported free hugepages (read from sysfs) meet or exceed the +requested amount before admitting the pod. -2. **Memory Manager enhancement**: During `Allocate()` in the Static policy, - verify that OS-reported free hugepages meet or exceed the requested amount - before admitting the pod +See [Implementation Approaches](#implementation-approaches) for options on how +the sysfs reading is performed. ### Current Admission Flow @@ -257,56 +263,62 @@ Job B can be rescheduled to another node with sufficient hugepages. ### Implementation Overview -The implementation consists of two parts: +The core enhancement is adding a `verifyOSHugepagesAvailability()` function to +the Memory Manager's Static policy, called during `Allocate()`. This function +reads fresh hugepage availability and rejects pods when it is insufficient. -1. **cadvisor**: Add `FreePages uint64` field to `HugePagesInfo` struct, populated - from sysfs. Also expose a method to read current free hugepages on-demand. +### Implementation Approaches -2. **kubelet Memory Manager**: Add `verifyOSHugepagesAvailability()` function - called during `Allocate()` that reads **fresh** hugepage availability from sysfs. +There are two approaches for reading free hugepages: -**Important**: cadvisor's `GetMachineInfo()` is called once at startup and cached. -The `FreePages` field in cached machine info would be stale. Therefore, verification -must read sysfs directly during each `Allocate()` call, not rely on cached values. 
-We will add a `GetCurrentHugepagesInfo()` method to cadvisor's `Manager` interface -that performs a fresh sysfs read. +#### Option A: Direct sysfs Reading in Memory Manager -### cadvisor Changes +Read sysfs directly in the Memory Manager without cadvisor changes. -**Struct update**: -```go -type HugePagesInfo struct { - // huge page size (in kB) - PageSize uint64 `json:"page_size"` - // number of huge pages - NumPages uint64 `json:"num_pages"` - // number of free huge pages - FreePages uint64 `json:"free_pages"` -} -``` +**Pros:** +- No external dependencies on critical admission path +- Simple implementation (~10 lines of sysfs reading) +- Faster to implement and merge (single repo) +- Memory Manager already reads memory topology from sysfs (precedent) -**New method on Manager interface**: -```go -// GetCurrentHugepagesInfo returns fresh hugepage info per NUMA node by reading sysfs. -// This is separate from GetMachineInfo() which returns cached startup data. -func (m *manager) GetCurrentHugepagesInfo() (map[int][]HugePagesInfo, error) -``` +**Cons:** +- Duplicates sysfs reading logic (though trivial) +- Other cadvisor consumers don't benefit + +#### Option B: Add Fresh-Read Method to cadvisor + +Add `GetCurrentHugepagesInfo()` method to cadvisor that reads sysfs on-demand. + +**Note**: cadvisor's existing `GetMachineInfo()` is cached at startup, so simply +adding a `FreePages` field there would be stale. A new method for fresh reads +would be required. + +**Pros:** +- Single source of truth for hugepage info +- Benefits other cadvisor consumers +- Cleaner abstraction -The `FreePages` field is populated by reading from: +**Cons:** +- Cross-repo dependency (cadvisor PR must merge first) +- Adds API surface to cadvisor +- Longer timeline + +The choice between options should be made during KEP review based on +maintainability preferences and timeline considerations. 
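
To give a sense of how small the Option A read is, here is a minimal sketch of an uncached per-NUMA read. It assumes the standard Linux hugetlb layout under `/sys/devices/system/node`; the helper names `freeHugepagesPath`, `parseFreeHugepages`, and `readFreeHugepages` are hypothetical and are not existing kubelet or cadvisor functions.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// sysfsNodeRoot is the standard Linux location of per-NUMA-node attributes.
const sysfsNodeRoot = "/sys/devices/system/node"

// freeHugepagesPath builds the path of the free_hugepages counter for one
// NUMA node and one hugepage size in kB (for example 2048 for 2Mi pages).
func freeHugepagesPath(root string, numaNode int, pageSizeKB uint64) string {
	return filepath.Join(root,
		fmt.Sprintf("node%d", numaNode),
		"hugepages",
		fmt.Sprintf("hugepages-%dkB", pageSizeKB),
		"free_hugepages")
}

// parseFreeHugepages converts the file content, a decimal page count followed
// by a newline, into a number of pages.
func parseFreeHugepages(raw string) (uint64, error) {
	return strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
}

// readFreeHugepages performs a fresh read of the counter on every call.
func readFreeHugepages(root string, numaNode int, pageSizeKB uint64) (uint64, error) {
	data, err := os.ReadFile(freeHugepagesPath(root, numaNode, pageSizeKB))
	if err != nil {
		return 0, err
	}
	return parseFreeHugepages(string(data))
}

func main() {
	free, err := readFreeHugepages(sysfsNodeRoot, 0, 2048)
	if err != nil {
		// Expected on hosts without 2Mi hugepages configured on NUMA node 0.
		fmt.Println("could not read free hugepages:", err)
		return
	}
	fmt.Printf("NUMA node 0: %d free 2Mi hugepages\n", free)
}
```

Passing the sysfs root as a parameter is a design choice of this sketch: it lets unit tests point the reader at a temporary directory instead of the real `/sys`.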
+ +### sysfs Interface + +Regardless of approach, free hugepages are read from: ``` /sys/devices/system/node/node/hugepages/hugepages-kB/free_hugepages ``` **Note on reserved hugepages**: Linux tracks `resv_hugepages` (reserved but not -yet faulted). For this implementation, we use `free_hugepages` directly because: +yet faulted). We use `free_hugepages` directly because: - Reserved pages are committed to specific processes - A new pod cannot use reserved pages - `free_hugepages` accurately reflects what's available for new allocations -**Note**: Since sysfs is always available on Linux systems with hugepages configured, -we use a simple `uint64` rather than a pointer. A value of 0 means zero free -hugepages are available. - ### Memory Manager Changes During `Allocate()` in the Static policy: @@ -317,7 +329,7 @@ func (p *staticPolicy) verifyOSHugepagesAvailability( pod *v1.Pod, container *v1.Container, ) error { - // 1. Call cadvisor's GetCurrentHugepagesInfo() to get fresh sysfs data + // 1. Read free hugepages directly from sysfs for each NUMA node // 2. For each hugepage size requested by the container: // a. Sum free hugepages across candidateNUMANodes only // b. Compare against the requested amount @@ -421,7 +433,7 @@ to implement this enhancement. ##### Prerequisite testing updates - Existing Memory Manager unit tests cover allocation logic -- cadvisor tests cover sysfs reading functionality +- For Option B: cadvisor tests cover sysfs reading functionality ##### Unit tests @@ -435,7 +447,7 @@ to implement this enhancement. ##### Integration tests -- Test Memory Manager with mocked cadvisor returning various FreePages values +- Test Memory Manager with mocked hugepage availability (sysfs or cadvisor depending on chosen approach) - Test admission flow with hugepage verification enabled/disabled ##### e2e tests @@ -483,12 +495,11 @@ will correctly verify against current OS hugepage availability. 
### Version Skew Strategy -The feature is entirely within the kubelet and depends on cadvisor (vendored). -No control plane or cross-component version skew concerns. +The feature is entirely within the kubelet. No control plane or cross-component +version skew concerns. -Since cadvisor is vendored into kubelet, the kubelet and cadvisor versions are -always synchronized. The `FreePages` field will be available when the feature -gate is enabled. +- **Option A**: No version skew concerns (direct sysfs reading) +- **Option B**: Since cadvisor is vendored into kubelet, versions are synchronized ## Production Readiness Review Questionnaire @@ -547,8 +558,9 @@ No. ###### How can an operator determine if the feature is in use by workloads? -- Feature gate is enabled -- Pods request hugepages resources +- Feature gate `MemoryManagerHugepagesVerification` is enabled +- Metric `memory_manager_hugepages_verification_total` is incrementing (indicates verification checks are being performed) +- Pods with Guaranteed QoS requesting hugepages resources are being scheduled ###### How can someone using this feature know that it is working for their instance? @@ -590,16 +602,16 @@ Additional metrics that could be added in Beta: ###### Does this feature depend on any specific services running in the cluster? -- cadvisor (bundled with kubelet) - - Usage: Provides machine info including hugepage free counts - - Impact of outage: Verification skipped, graceful degradation - - Impact of degraded performance: Slightly increased admission latency +Depends on the implementation approach chosen (see [Implementation Approaches](#implementation-approaches)): + +- **Option A (Direct sysfs)**: No external dependencies. Reads directly from Linux sysfs. +- **Option B (cadvisor)**: Depends on cadvisor (bundled with kubelet) for fresh hugepage reads. ### Scalability ###### Will enabling / using this feature result in any new API calls? -No new API calls. 
The feature reads from local sysfs and cadvisor machine info. +No new API calls. The feature reads from local sysfs files. ###### Will enabling / using this feature result in introducing new API types? @@ -653,10 +665,10 @@ No impact. The feature operates entirely within kubelet using local sysfs. ## Implementation History - 2024-12-24: Initial KEP draft -- 2024-12-27: KEP updated based on reviewer feedback +- 2024-12-27: KEP updated based on reviewer feedback; added implementation options - Enhancement issue: https://github.com/kubernetes/enhancements/issues/5759 - Related issue: https://github.com/kubernetes/kubernetes/issues/134395 -- cadvisor PR: https://github.com/google/cadvisor/pull/3804 +- cadvisor PR (for Option B): https://github.com/google/cadvisor/pull/3804 (draft) ## Drawbacks @@ -675,16 +687,7 @@ Extend Memory Manager to track hugepage usage by Burstable and BestEffort pods. - Would not catch external (non-Kubernetes) hugepage consumers - Changes the scope and purpose of Memory Manager -### Alternative 2: Query sysfs directly in Memory Manager - -Read sysfs directly in Memory Manager without cadvisor changes. - -**Rejected because**: -- Duplicates sysfs reading logic already in cadvisor -- cadvisor already provides machine info abstraction -- Adding to cadvisor benefits other consumers of machine info - -### Alternative 3: Scheduler-level hugepage awareness +### Alternative 2: Scheduler-level hugepage awareness Add hugepage availability awareness to the Kubernetes scheduler. 
From 36099e341fa39240d1a444082d820aa3fe6a1ef2 Mon Sep 17 00:00:00 2001 From: Srikalyan Swayampakula Date: Thu, 22 Jan 2026 13:24:41 -0800 Subject: [PATCH 05/10] KEP-5759: Add reviewers and approvers to kep.yaml - Add ffromani, derekwaynecarr, mrunalp as reviewers - Add dchen1107 as approver (sig-node OWNERS) --- .../5759-memory-manager-hugepages-verification/kep.yaml | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml index 9834b8d39f9e..e8a539d758fd 100644 --- a/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml +++ b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml @@ -7,9 +7,11 @@ participating-sigs: [] status: provisional creation-date: 2024-12-24 reviewers: - - TBD + - "@ffromani" + - "@derekwaynecarr" + - "@mrunalp" approvers: - - TBD + - "@dchen1107" see-also: - "/keps/sig-node/1769-memory-manager" From f81f9228b11cd259750faf2ff7e146e4824f8611 Mon Sep 17 00:00:00 2001 From: Srikalyan Swayampakula Date: Sat, 31 Jan 2026 23:46:54 -0800 Subject: [PATCH 06/10] KEP-5759: Add PRR approval file and update approvers - Add haircommander (Peter Hunt) as KEP approver - Add PRR approval file for alpha stage with johnbelamaric as approver --- keps/prod-readiness/sig-node/5759.yaml | 6 ++++++ .../5759-memory-manager-hugepages-verification/kep.yaml | 1 + 2 files changed, 7 insertions(+) create mode 100644 keps/prod-readiness/sig-node/5759.yaml diff --git a/keps/prod-readiness/sig-node/5759.yaml b/keps/prod-readiness/sig-node/5759.yaml new file mode 100644 index 000000000000..ba028ff810b3 --- /dev/null +++ b/keps/prod-readiness/sig-node/5759.yaml @@ -0,0 +1,6 @@ +# The KEP must have an approver from the +# "prod-readiness-approvers" group +# of http://git.k8s.io/enhancements/OWNERS_ALIASES +kep-number: 5759 +alpha: + approver: "@johnbelamaric" diff --git 
a/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml index e8a539d758fd..0f349e3c39c1 100644 --- a/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml +++ b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml @@ -12,6 +12,7 @@ reviewers: - "@mrunalp" approvers: - "@dchen1107" + - "@haircommander" see-also: - "/keps/sig-node/1769-memory-manager" From 4753aad37e8e2a367d4698f6b438e0166f4fb6e3 Mon Sep 17 00:00:00 2001 From: Srikalyan Swayampakula Date: Wed, 11 Feb 2026 22:09:10 -0800 Subject: [PATCH 07/10] Apply suggestion from @wendy-ha18 Co-authored-by: Wendy Ha <139814343+wendy-ha18@users.noreply.github.com> --- .../5759-memory-manager-hugepages-verification/kep.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml index 0f349e3c39c1..1e7b2aa6dbff 100644 --- a/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml +++ b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml @@ -4,7 +4,7 @@ authors: - "@srikalyan" owning-sig: sig-node participating-sigs: [] -status: provisional +status: implementable creation-date: 2024-12-24 reviewers: - "@ffromani" From f80d3accef25ec42f7e2b9c9795f7f2fbca7833d Mon Sep 17 00:00:00 2001 From: Srikalyan Swayampakula Date: Wed, 11 Feb 2026 22:44:25 -0800 Subject: [PATCH 08/10] Address PRR and design review feedback - Move metrics from Beta to Alpha graduation criteria per ffromani's request to have observability available at alpha stage - Change "TBD during alpha phase" to "Will be done during alpha phase" per johnbelamaric's nit on the upgrade/rollback testing question - Add Alternative 3: Standalone NUMA-aware hugepages admission handler with pros/cons analysis per ffromani's suggestion - Expand Alternative 1 with NUMA tracking limitation: without 
cpuset.mems enforcement, NUMA node allocation is unknown until container runtime, making per-pod tracking infeasible at admission - Reframe race condition caveat to emphasize kubelet/workload contract breach rather than just startup failure timing - Relax milestone timeline: beta v1.38, stable v1.40 - Remove sysfs availability from risk table (sysfs is a kubelet precondition) - Recommend Option A (direct sysfs reading) with rationale - Remove feature gate as safety mechanism framing throughout - Remove hardcoded error message format (not a public API) - Remove specific log format and alerting recommendation sections - Simplify Events section to describe behavior without locking format - Move conformance tests from GA to Beta criteria - Update GA to "feature always enabled (feature gate removed)" - Reword Upgrade/Downgrade without feature gate dependency - Update rollback answer to reflect always-enabled at GA - Replace speculative discrepancy metric with alpha evaluation plan --- .../README.md | 135 ++++++++++-------- .../kep.yaml | 4 +- 2 files changed, 81 insertions(+), 58 deletions(-) diff --git a/keps/sig-node/5759-memory-manager-hugepages-verification/README.md b/keps/sig-node/5759-memory-manager-hugepages-verification/README.md index 9eb9ccd4d4db..f6c072bda6f6 100644 --- a/keps/sig-node/5759-memory-manager-hugepages-verification/README.md +++ b/keps/sig-node/5759-memory-manager-hugepages-verification/README.md @@ -27,8 +27,6 @@ - [Observability](#observability) - [Metrics](#metrics) - [Events](#events) - - [Kubelet Logs](#kubelet-logs) - - [Alerting Recommendations](#alerting-recommendations) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) - [Unit tests](#unit-tests) @@ -50,8 +48,9 @@ - [Implementation History](#implementation-history) - [Drawbacks](#drawbacks) - [Alternatives](#alternatives) - - [Alternative 1: Track all pod hugepage usage](#alternative-1-track-all-pod-hugepage-usage) + - [Alternative 1: Track all 
pod hugepage usage per NUMA node](#alternative-1-track-all-pod-hugepage-usage-per-numa-node) - [Alternative 2: Scheduler-level hugepage awareness](#alternative-2-scheduler-level-hugepage-awareness) + - [Alternative 3: Standalone NUMA-aware hugepages admission handler](#alternative-3-standalone-numa-aware-hugepages-admission-handler) ## Release Signoff Checklist @@ -232,10 +231,15 @@ Job B can be rescheduled to another node with sufficient hugepages. 3. Pod enters `CrashLoopBackOff` or `Error` state 4. Scheduler may reschedule to another node (if applicable) - **Why this is still valuable**: Without verification, the failure window spans - from pod scheduling to container startup (seconds to minutes). With verification, - the window is reduced to milliseconds between sysfs read and container start. - The vast majority of failures are prevented. + **Why this is still valuable**: Beyond startup failures and timing, the core + issue is that without verification the kubelet/workload contract is breached. + The implicit contract is that once a pod is admitted, the requested resources + are available. Without this fix, that contract is violated for hugepages when + the Memory Manager's internal state diverges from OS reality (as demonstrated + in [issue #134395](https://github.com/kubernetes/kubernetes/issues/134395)). + With verification, the failure window is reduced from seconds/minutes to + milliseconds between sysfs read and container start, and the vast majority + of contract violations are prevented. - **Linux-only**: This feature is Linux-specific. The sysfs interface for hugepages (`/sys/devices/system/node/node/hugepages/`) is a Linux kernel feature. @@ -257,7 +261,6 @@ Job B can be rescheduled to another node with sufficient hugepages. 
| sysfs reads add latency to admission | Minimal impact: single file read per hugepage size per NUMA node; < 1ms typically | | False rejections due to transient consumption | Acceptable: better to reject than admit and fail at runtime; pod can be rescheduled | | Verification passes but container still fails (race) | Window is milliseconds vs seconds/minutes without verification; event emitted for debugging | -| Fresh sysfs reads on every Allocate() | Lightweight operation; only triggered for pods requesting hugepages | ## Design Details @@ -303,8 +306,11 @@ would be required. - Adds API surface to cadvisor - Longer timeline -The choice between options should be made during KEP review based on -maintainability preferences and timeline considerations. +**Recommendation: Option A (Direct sysfs reading)**. The sysfs read is trivial +(single file read per NUMA node per hugepage size), the Memory Manager already +has precedent for reading memory topology from sysfs, and it avoids cross-repo +dependencies on the critical admission path. Option B adds API surface to cadvisor +for a very narrow use case that doesn't clearly fit cadvisor's caching model. 
### sysfs Interface @@ -338,16 +344,12 @@ func (p *staticPolicy) verifyOSHugepagesAvailability( ``` The verification: -- Only runs when the Static policy is enabled and feature gate is on +- Only runs when the Memory Manager's Static policy is enabled - Only checks hugepage resources (not regular memory) - **Respects NUMA node selection**: Only checks the specific NUMA nodes that the Memory Manager's allocation algorithm has selected (see Topology Manager section) -- Returns admission error if insufficient free hugepages - -**Error message format**: -``` -insufficient hugepages-2Mi on NUMA node(s) [0]: requested 4Gi, available 2Gi -``` +- Returns admission error if insufficient free hugepages, including the hugepage + size, NUMA node(s), requested amount, and available amount to aid debugging ### Integration with Topology Manager @@ -402,27 +404,11 @@ This feature provides explicit signals for operators to monitor hugepage verific #### Events -When a pod is rejected due to insufficient hugepages, a Kubernetes event is generated: - -``` -Type: Warning -Reason: FailedHugepagesVerification -Message: insufficient hugepages-2Mi on NUMA node(s) [0]: requested 4Gi, available 2Gi -``` - -#### Kubelet Logs - -At `--v=4` or higher, kubelet logs verification details: -``` -I0127 10:15:32.123456 12345 policy_static.go:XXX] "Verifying OS hugepages availability" pod="default/dpdk-app" container="dpdk" -I0127 10:15:32.123789 12345 policy_static.go:XXX] "Hugepages verification passed" pod="default/dpdk-app" numaNodes=[0] size="hugepages-2Mi" requested=1073741824 available=2147483648 -``` - -#### Alerting Recommendations - -Operators should consider alerts for: -- `rate(memory_manager_hugepages_verification_failures_total[5m]) > 0`: Pods being rejected -- `histogram_quantile(0.99, memory_manager_hugepages_verification_latency_seconds) > 0.05`: High verification latency +When a pod is rejected due to insufficient hugepages, a Kubernetes event is +generated with reason 
`FailedHugepagesVerification` containing details about +the hugepage size, NUMA node(s), and the discrepancy between requested and +available amounts. Operators can use `kubectl get events` to identify affected +pods and take corrective action. ### Test Plan @@ -465,28 +451,29 @@ to implement this enhancement. - E2e tests demonstrating: - Pod admission succeeds when sufficient free hugepages exist - Pod admission fails when insufficient free hugepages exist +- Metrics for verification checks and failures - Documentation for feature gate and behavior #### Beta - E2e tests demonstrating correct behavior -- Metrics for verification failures +- Conformance tests if applicable - Feedback incorporated from alpha users - No significant bugs reported #### GA -- Feature enabled by default -- Conformance tests if applicable +- Feature always enabled (feature gate removed) - Documentation updated for stable feature ### Upgrade / Downgrade Strategy -**Upgrade**: No special handling required. The feature is additive and controlled -by a feature gate. Existing pods are unaffected. +**Upgrade**: No special handling required. The feature is additive and only +affects new pod admissions. Existing running pods are unaffected. -**Downgrade**: Disabling the feature gate returns to previous behavior where -OS hugepage availability is not verified. No data migration needed. +**Downgrade**: Reverting to a kubelet version without this feature returns to +previous behavior where OS hugepage availability is not verified. No data +migration or persistent state cleanup is needed. **Kubelet restart behavior**: After kubelet restarts, Memory Manager rebuilds its internal state from checkpoint. Since verification reads fresh sysfs data on each @@ -519,8 +506,10 @@ shows availability. This is the intended behavior to prevent runtime failures. ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? -Yes. 
Disabling the feature gate and restarting kubelet returns to previous -behavior. No persistent state is affected. +During alpha/beta, the feature can be disabled via the feature gate and +restarting kubelet, which returns to previous behavior. No persistent state +is affected. At GA, the feature gate will be removed and verification will +be always-enabled, as it strictly improves correctness of pod admission. ###### What happens if we reenable the feature if it was previously rolled back? @@ -548,7 +537,7 @@ impact already running pods. Rollback simply stops verification on new admission ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? -TBD during alpha phase. +Will be done during alpha phase. ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? @@ -558,7 +547,6 @@ No. ###### How can an operator determine if the feature is in use by workloads? -- Feature gate `MemoryManagerHugepagesVerification` is enabled - Metric `memory_manager_hugepages_verification_total` is incrementing (indicates verification checks are being performed) - Pods with Guaranteed QoS requesting hugepages resources are being scheduled @@ -593,10 +581,9 @@ No. ###### Are there any missing metrics that would be useful to have to improve observability of this feature? -Additional metrics that could be added in Beta: -- `memory_manager_hugepages_discrepancy_bytes`: Gauge showing difference between - Memory Manager's internal tracking and OS-reported free hugepages (useful for - detecting drift) +To be evaluated during alpha based on operational experience. Candidates include +metrics that help operators identify the root cause of verification failures +(e.g., which workloads are consuming untracked hugepages). ### Dependencies @@ -678,15 +665,26 @@ No impact. The feature operates entirely within kubelet using local sysfs. 
## Alternatives -### Alternative 1: Track all pod hugepage usage +### Alternative 1: Track all pod hugepage usage per NUMA node -Extend Memory Manager to track hugepage usage by Burstable and BestEffort pods. +Extend the Memory Manager and admission logic to listen to every pod admission +and track which NUMA node hugepages are allocated from, regardless of QoS class. **Rejected because**: -- Significant refactoring required +- **Fundamental NUMA tracking limitation**: Without `cpuset.mems` enforcement + (which only applies to Guaranteed pods with the Static policy), there is no way + to know which NUMA node hugepages will be allocated from until the container + processes are actually running -- which is past the admission stage. The kernel + allocates hugepages based on the process's memory policy and NUMA node proximity + at fault time, not at cgroup configuration time. - Would not catch external (non-Kubernetes) hugepage consumers +- Significant refactoring of Memory Manager required - Changes the scope and purpose of Memory Manager +The proposed approach of checking actual free resources from sysfs before each +allocation attempt is the best compromise in the current architecture, as it +reflects ground truth regardless of which process or pod consumed the hugepages. + ### Alternative 2: Scheduler-level hugepage awareness Add hugepage availability awareness to the Kubernetes scheduler. @@ -695,3 +693,28 @@ Add hugepage availability awareness to the Kubernetes scheduler. - Much larger scope change - Scheduler operates on reported capacity, not real-time availability - Does not solve the admission-time verification problem + +### Alternative 3: Standalone NUMA-aware hugepages admission handler + +Instead of extending the Memory Manager, add a separate kubelet admission handler +that verifies OS-reported hugepage availability for all pods regardless of QoS class. 
+ +**Pros:** +- Covers all QoS classes (Guaranteed, Burstable, BestEffort), not just Guaranteed +- Cleaner separation of concerns: verification is decoupled from allocation/tracking +- Same failure model (kubelet admission error) without coupling to Memory Manager +- Could obtain NUMA affinity from existing topology hints without strong coupling + +**Cons:** +- Needs to independently resolve NUMA topology and candidate node selection, which + the Memory Manager already computes during `Allocate()` +- Additional admission handler adds coordination overhead with existing handlers +- For Guaranteed pods, the Memory Manager's allocation algorithm already selects + candidate NUMA nodes -- a standalone handler would duplicate or need to replicate + this selection logic to know which NUMA nodes to check +- Larger implementation scope for alpha + +**Decision**: Extend the Memory Manager for alpha since it already has the NUMA +topology context and candidate node selection computed at the point where +verification is needed. A standalone admission handler could be explored in future +iterations to extend coverage to non-Guaranteed pods. diff --git a/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml index 1e7b2aa6dbff..163dfc5f735b 100644 --- a/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml +++ b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml @@ -28,8 +28,8 @@ latest-milestone: "v1.36" # The milestone at which this feature was, or is targeted to be, at each stage. 
milestone: alpha: "v1.36" - beta: "v1.37" - stable: "v1.38" + beta: "v1.38" + stable: "v1.40" # The following PRR answers are required at alpha release # List the feature gate name and the components for which it must be enabled From a22c1b0beb87e17e5ba227d0cd14e775b537c1a0 Mon Sep 17 00:00:00 2001 From: Srikalyan Swayampakula Date: Mon, 2 Mar 2026 10:01:24 -0800 Subject: [PATCH 09/10] KEP-5759: Retarget alpha milestone to v1.37 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Shift all milestones by one release cycle: - alpha: v1.36 → v1.37 - beta: v1.38 → v1.39 - stable: v1.40 → v1.41 --- .../5759-memory-manager-hugepages-verification/kep.yaml | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml index 163dfc5f735b..e3265180db82 100644 --- a/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml +++ b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml @@ -23,13 +23,13 @@ stage: alpha # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on. -latest-milestone: "v1.36" +latest-milestone: "v1.37" # The milestone at which this feature was, or is targeted to be, at each stage. 
milestone: - alpha: "v1.36" - beta: "v1.38" - stable: "v1.40" + alpha: "v1.37" + beta: "v1.39" + stable: "v1.41" # The following PRR answers are required at alpha release # List the feature gate name and the components for which it must be enabled From dba799534b37b4a1248edefd59b4346c8134bac6 Mon Sep 17 00:00:00 2001 From: Srikalyan Swayampakula Date: Mon, 2 Mar 2026 10:27:35 -0800 Subject: [PATCH 10/10] KEP-5759: Clarify verification approach and trim implementation details - Clarify dual-source verification: min(internal_free, os_free) per NUMA node to handle both untracked Burstable pod consumption and not-yet-faulted Guaranteed pod allocations - Remove specific error message formats from KEP to avoid creating implicit API contracts - Add user-observable behavior note pointing to event reason and metrics as the stable interface for identifying verification failures --- .../README.md | 39 +++++++++++++------ 1 file changed, 27 insertions(+), 12 deletions(-) diff --git a/keps/sig-node/5759-memory-manager-hugepages-verification/README.md b/keps/sig-node/5759-memory-manager-hugepages-verification/README.md index f6c072bda6f6..dd57e9b2e5ec 100644 --- a/keps/sig-node/5759-memory-manager-hugepages-verification/README.md +++ b/keps/sig-node/5759-memory-manager-hugepages-verification/README.md @@ -214,10 +214,9 @@ Job B is admitted, but when its container starts, only 4GB are actually free. Job B fails at runtime. **Desired behavior**: Memory Manager reads sysfs during admission and sees only -4GB free. Job B is rejected with error: -`insufficient hugepages-2Mi on NUMA node(s) [0,1]: requested 6Gi, available 4Gi` - -Job B can be rescheduled to another node with sufficient hugepages. +4GB free. Job B is rejected at admission with an error indicating insufficient +free hugepages on the relevant NUMA node(s), allowing it to be rescheduled to +another node with sufficient hugepages. 
### Notes/Constraints/Caveats @@ -268,7 +267,18 @@ Job B can be rescheduled to another node with sufficient hugepages. The core enhancement is adding a `verifyOSHugepagesAvailability()` function to the Memory Manager's Static policy, called during `Allocate()`. This function -reads fresh hugepage availability and rejects pods when insufficient. +combines two sources to determine actual availability: + +1. **Memory Manager internal state**: Tracks hugepage allocations for Guaranteed + pods per NUMA node, including pages allocated but not yet faulted by processes. +2. **OS-reported free hugepages** (sysfs `free_hugepages`): Reflects actual kernel + state, catching consumption by Burstable pods and other untracked sources. + +The effective available hugepages is `min(internal_free, os_free)` per NUMA node: +- `internal_free` prevents double-counting pages committed to existing Guaranteed + pods that haven't been faulted yet (which sysfs still reports as "free") +- `os_free` catches hugepage consumption that the Memory Manager doesn't track + (e.g., Burstable pods) ### Implementation Approaches @@ -335,11 +345,12 @@ func (p *staticPolicy) verifyOSHugepagesAvailability( pod *v1.Pod, container *v1.Container, ) error { - // 1. Read free hugepages directly from sysfs for each NUMA node - // 2. For each hugepage size requested by the container: - // a. Sum free hugepages across candidateNUMANodes only - // b. Compare against the requested amount - // 3. Return error if insufficient, with detailed message + // For each hugepage size requested by the container: + // 1. Get Memory Manager's internal free count per candidate NUMA node + // 2. Read OS free hugepages from sysfs per candidate NUMA node + // 3. Effective available = min(internal_free, os_free) per NUMA node + // 4. Sum effective available across candidate NUMA nodes + // 5. 
Return error if sum < requested amount } ``` @@ -348,8 +359,12 @@ The verification: - Only checks hugepage resources (not regular memory) - **Respects NUMA node selection**: Only checks the specific NUMA nodes that the Memory Manager's allocation algorithm has selected (see Topology Manager section) -- Returns admission error if insufficient free hugepages, including the hugepage - size, NUMA node(s), requested amount, and available amount to aid debugging +- Returns an admission error if insufficient free hugepages are detected + +**User-observable behavior**: Operators can identify verification failures through +the `FailedHugepagesVerification` event reason and the verification metrics +described in the [Observability](#observability) section. The specific error +message format is an implementation detail and may change between releases. ### Integration with Topology Manager