From 036a580d5e1d0d23c32d377476955f9d28c35d6d Mon Sep 17 00:00:00 2001
From: Srikalyan Swayampakula <srikalyansswayam@gmail.com>
Date: Wed, 24 Dec 2025 09:30:14 -0800
Subject: [PATCH 01/10] KEP: Memory Manager Hugepages Availability Verification

This KEP proposes enhancing the Memory Manager's Static policy to
verify OS-reported free hugepages availability during pod admission.

Problem:
The Memory Manager only tracks hugepage allocations for Guaranteed QoS
pods. Burstable/BestEffort pods can consume hugepages without being
tracked, causing subsequent Guaranteed pods to be admitted but fail
at runtime when hugepages are exhausted.

Solution:
- Add FreePages field to cadvisor's HugePagesInfo (PR google/cadvisor#3804)
- Verify OS-reported free hugepages during Allocate() in Static policy
- Reject pods when insufficient free hugepages are available

Related: https://github.com/kubernetes/kubernetes/issues/134395
---
 .../README.md                                 | 475 ++++++++++++++++++
 .../kep.yaml                                  |  42 ++
 2 files changed, 517 insertions(+)
 create mode 100644 keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md
 create mode 100644 keps/sig-node/NNNN-memory-manager-hugepages-verification/kep.yaml

diff --git a/keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md b/keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md
new file mode 100644
index 000000000000..7d2bc32a4d0d
--- /dev/null
+++ b/keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md
@@ -0,0 +1,475 @@
+# KEP-NNNN: Memory Manager Hugepages Availability Verification
+
+<!-- toc -->
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [User Stories](#user-stories)
+    - [Story 1: DPDK Application Admission Failure](#story-1-dpdk-application-admission-failure)
+    - [Story 2: Database Workload with Hugepages](#story-2-database-workload-with-hugepages)
+  - [Notes/Constraints/Caveats](#notesconstraintscaveats)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [Implementation Overview](#implementation-overview)
+  - [cadvisor Changes](#cadvisor-changes)
+  - [Memory Manager Changes](#memory-manager-changes)
+  - [Test Plan](#test-plan)
+    - [Prerequisite testing updates](#prerequisite-testing-updates)
+    - [Unit tests](#unit-tests)
+    - [Integration tests](#integration-tests)
+    - [e2e tests](#e2e-tests)
+  - [Graduation Criteria](#graduation-criteria)
+    - [Alpha](#alpha)
+    - [Beta](#beta)
+    - [GA](#ga)
+  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+  - [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring Requirements](#monitoring-requirements)
+  - [Dependencies](#dependencies)
+  - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+<!-- /toc -->
+
+## Release Signoff Checklist
+
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [ ] (R) Design details are appropriately documented
+- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+  - [ ] e2e Tests for all Beta API Operations (endpoints)
+  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [ ] (R) Graduation criteria is in place
+  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [ ] (R) Production readiness review completed
+- [ ] (R) Production readiness review approved
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+This KEP proposes enhancing the Memory Manager's Static policy to verify OS-reported
+free hugepages availability during pod admission. Currently, the Memory Manager only
+tracks hugepage allocations for Guaranteed QoS pods but doesn't verify actual
+hugepage availability from the operating system. This can lead to pods being admitted
+when hugepages aren't actually available, causing runtime failures.
+
+The enhancement adds verification by reading free hugepages from sysfs
+(`/sys/devices/system/node/node<N>/hugepages/hugepages-<size>kB/free_hugepages`)
+during pod admission, ensuring pods requesting hugepages are only admitted when
+sufficient free hugepages exist.
+
+## Motivation
+
+The Memory Manager tracks hugepage allocations for Guaranteed QoS pods to provide
+NUMA-aware memory and hugepage pinning. However, it operates on its internal
+accounting without verifying the actual state of hugepages on the system.
+
+This creates a problem when:
+1. Burstable or BestEffort pods consume hugepages (via hugetlbfs mounts or
+   `mmap` with `MAP_HUGETLB`) without being tracked by the Memory Manager
+2. External processes or other system components consume hugepages
+3. The Memory Manager's internal state becomes stale or inconsistent with reality
+
+In these scenarios, a Guaranteed pod requesting hugepages may be admitted based
+on the Memory Manager's internal tracking, only to fail at runtime when the
+container attempts to use the already-exhausted hugepages.
+
+### Goals
+
+- Verify OS-reported free hugepages during pod admission for the Static policy
+- Reject pods requesting hugepages when insufficient free hugepages are available
+- Provide clear error messages when admission fails due to insufficient hugepages
+- Maintain backwards compatibility with existing Memory Manager behavior
+
+### Non-Goals
+
+- Track hugepage usage by Burstable or BestEffort pods in the Memory Manager
+- Modify scheduler behavior or add hugepage awareness to the scheduler
+- Provide hugepage reservation or preemption mechanisms
+- Support platforms other than Linux
+
+## Proposal
+
+Enhance the Memory Manager's Static policy to verify actual hugepage availability
+by querying sysfs during pod admission. This involves:
+
+1. **cadvisor enhancement**: Add a `FreePages` field to `HugePagesInfo` struct
+   that reports free hugepages per NUMA node, read from sysfs
+
+2. **Memory Manager enhancement**: During `Allocate()` in the Static policy,
+   verify that OS-reported free hugepages meet or exceed the requested amount
+   before admitting the pod
+
+### User Stories
+
+#### Story 1: DPDK Application Admission Failure
+
+As a cluster administrator running DPDK-based network functions, I deploy a
+Burstable pod that mounts hugetlbfs and consumes 2GB of 1GB hugepages for packet
+buffer pools. Later, I deploy a Guaranteed pod also requesting 2GB of 1GB hugepages.
+
+**Current behavior**: The Guaranteed pod is admitted (Memory Manager shows
+hugepages as available) but fails at container startup when DPDK tries to allocate
+hugepages that are already consumed.
+
+**Desired behavior**: The Guaranteed pod admission fails immediately with a clear
+error indicating insufficient free hugepages, allowing the scheduler to try
+another node or the administrator to take corrective action.
+
+#### Story 2: Database Workload with Hugepages
+
+As a database administrator, I run PostgreSQL with hugepages enabled for shared
+buffers. If an external monitoring agent or debugging tool temporarily consumes
+hugepages, subsequent Guaranteed pods requesting hugepages should not be admitted
+until hugepages are freed.
+
+**Current behavior**: Pods are admitted based on Memory Manager tracking and fail
+at runtime.
+
+**Desired behavior**: Pods are rejected at admission with informative errors.
+
+### Notes/Constraints/Caveats
+
+- **Race condition window**: A small window exists between verification and actual
+  container startup where hugepages could be consumed. This is inherent to any
+  admission-time check but significantly reduces the failure window compared to
+  no verification.
+
+- **sysfs dependency**: The feature depends on reading from sysfs. If sysfs is
+  unavailable or the free_hugepages file cannot be read, the feature gracefully
+  degrades to current behavior (no verification).
+
+- **Per-NUMA verification**: Verification is performed per-NUMA node, consistent
+  with the Memory Manager's NUMA-aware design.
+
+### Risks and Mitigations
+
+| Risk | Mitigation |
+|------|------------|
+| sysfs reads add latency to admission | Minimal impact: single file read per hugepage size per NUMA node |
+| False rejections due to transient consumption | Acceptable: better to reject than admit and fail at runtime |
+| sysfs unavailable in some environments | Graceful degradation: skip verification if sysfs unreadable |
+
+## Design Details
+
+### Implementation Overview
+
+The implementation consists of two parts:
+
+1. **cadvisor**: Add `FreePages *uint64` field to `HugePagesInfo` struct, populated
+   from sysfs. Uses pointer with `omitempty` to distinguish between "0 free" and
+   "data unavailable".
+
+2. **kubelet Memory Manager**: Add `verifyOSHugepagesAvailability()` function
+   called during `Allocate()` that compares requested hugepages against OS-reported
+   free hugepages from cadvisor's machine info.
+
+### cadvisor Changes
+
+```go
+type HugePagesInfo struct {
+    // huge page size (in kB)
+    PageSize uint64 `json:"page_size"`
+    // number of huge pages
+    NumPages uint64 `json:"num_pages"`
+    // number of free huge pages (nil if unavailable)
+    FreePages *uint64 `json:"free_pages,omitempty"`
+}
+```
+
+The `FreePages` field is populated by reading from:
+```
+/sys/devices/system/node/node<N>/hugepages/hugepages-<size>kB/free_hugepages
+```
+
+### Memory Manager Changes
+
+During `Allocate()` in the Static policy:
+
+```go
+func (p *staticPolicy) verifyOSHugepagesAvailability(
+    machineState state.NUMANodeMap,
+    pod *v1.Pod,
+    container *v1.Container,
+) error {
+    // For each hugepage size requested by the container:
+    // 1. Get the OS-reported free hugepages from cadvisor machine info
+    // 2. Compare against the requested amount
+    // 3. Return error if insufficient
+}
+```
+
+The verification:
+- Only runs when the Static policy is enabled
+- Only checks hugepage resources (not regular memory)
+- Aggregates free hugepages across candidate NUMA nodes
+- Returns admission error if insufficient free hugepages
+
+### Test Plan
+
+[x] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+##### Prerequisite testing updates
+
+- Existing Memory Manager unit tests cover allocation logic
+- cadvisor tests cover sysfs reading functionality
+
+##### Unit tests
+
+- `pkg/kubelet/cm/memorymanager`: Add tests for `verifyOSHugepagesAvailability()`
+  - Test successful verification when free hugepages >= requested
+  - Test rejection when free hugepages < requested
+  - Test graceful handling when FreePages is nil (sysfs unavailable)
+  - Test per-NUMA node verification
+
+##### Integration tests
+
+- Test Memory Manager with mocked cadvisor returning various FreePages values
+- Test admission flow with hugepage verification enabled/disabled
+
+##### e2e tests
+
+- Test pod admission when hugepages are available
+- Test pod rejection when hugepages are exhausted
+- Test that rejected pods can be admitted after hugepages are freed
+
+### Graduation Criteria
+
+#### Alpha
+
+- Feature implemented behind `MemoryManagerHugepagesVerification` feature gate
+- Unit tests for verification logic
+- Documentation for feature gate and behavior
+
+#### Beta
+
+- E2e tests demonstrating correct behavior
+- Metrics for verification failures
+- Feedback incorporated from alpha users
+- No significant bugs reported
+
+#### GA
+
+- Feature enabled by default
+- Conformance tests if applicable
+- Documentation updated for stable feature
+
+### Upgrade / Downgrade Strategy
+
+**Upgrade**: No special handling required. The feature is additive and controlled
+by a feature gate. Existing pods are unaffected.
+
+**Downgrade**: Disabling the feature gate returns to previous behavior where
+OS hugepage availability is not verified. No data migration needed.
+
+### Version Skew Strategy
+
+The feature is entirely within the kubelet and depends on cadvisor (vendored).
+No control plane or cross-component version skew concerns.
+
+When kubelet is upgraded but cadvisor hasn't been updated to provide `FreePages`:
+- The field will be `nil`
+- Verification will be skipped (graceful degradation)
+- Warning logged indicating verification unavailable
+
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+- [x] Feature gate
+  - Feature gate name: `MemoryManagerHugepagesVerification`
+  - Components depending on the feature gate: kubelet
+
+###### Does enabling the feature change any default behavior?
+
+Yes. Pods requesting hugepages may be rejected at admission if the OS reports
+insufficient free hugepages, even if the Memory Manager's internal tracking
+shows availability. This is the intended behavior to prevent runtime failures.
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+Yes. Disabling the feature gate and restarting kubelet returns to previous
+behavior. No persistent state is affected.
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+The feature resumes verification on new pod admissions. No special handling needed.
+
+###### Are there any tests for feature enablement/disablement?
+
+Unit tests will verify behavior with feature gate enabled and disabled.
+
+### Rollout, Upgrade and Rollback Planning
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+The feature only affects pod admission, not running workloads. A rollout cannot
+impact already running pods. Rollback simply stops verification on new admissions.
+
+###### What specific metrics should inform a rollback?
+
+- Unexpected increase in pod admission failures
+- `memory_manager_hugepages_verification_failures_total` metric (proposed)
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+TBD during alpha phase.
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+No.
+
+### Monitoring Requirements
+
+###### How can an operator determine if the feature is in use by workloads?
+
+- Feature gate is enabled
+- Pods request hugepages resources
+
+###### How can someone using this feature know that it is working for their instance?
+
+- [ ] Events
+  - Event Reason: `FailedHugepagesVerification`
+  - When: Pod admission rejected due to insufficient OS-reported free hugepages
+- [ ] Other
+  - Kubelet logs will indicate verification being performed and results
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+- Hugepage verification should add < 10ms to pod admission latency
+- 99.9% of pods with sufficient free hugepages should be admitted successfully
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+- [x] Metrics
+  - Metric name: `memory_manager_hugepages_verification_failures_total`
+  - Components exposing the metric: kubelet
+  - Metric name: `memory_manager_hugepages_verification_latency_seconds`
+  - Components exposing the metric: kubelet
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+The proposed metrics should provide adequate observability.
+
+### Dependencies
+
+###### Does this feature depend on any specific services running in the cluster?
+
+- cadvisor (bundled with kubelet)
+  - Usage: Provides machine info including hugepage free counts
+  - Impact of outage: Verification skipped, graceful degradation
+  - Impact of degraded performance: Slightly increased admission latency
+
+### Scalability
+
+###### Will enabling / using this feature result in any new API calls?
+
+No new API calls. The feature reads from local sysfs and cadvisor machine info.
+
+###### Will enabling / using this feature result in introducing new API types?
+
+No.
+
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+No.
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+No.
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+Minimal impact on pod admission latency (< 10ms for sysfs reads).
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+Negligible: a small number of sysfs file reads per pod admission.
+
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+No. The feature performs simple file reads.
+
+### Troubleshooting
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+No impact. The feature operates entirely within kubelet using local sysfs.
+
+###### What are other known failure modes?
+
+- sysfs unavailable or unreadable
+  - Detection: Warning logs from kubelet, nil FreePages in machine info
+  - Mitigations: Feature gracefully degrades to previous behavior
+  - Diagnostics: Check kubelet logs for sysfs read warnings
+  - Testing: Unit tests cover this scenario
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+1. Check kubelet logs for verification-related messages
+2. Verify sysfs is accessible and free_hugepages files exist
+3. Compare Memory Manager state with actual sysfs values
+4. Check for excessive pod admission rate causing contention
+
+## Implementation History
+
+- 2025-12-24: Initial KEP draft
+- Related issue: https://github.com/kubernetes/kubernetes/issues/134395
+- cadvisor PR: https://github.com/google/cadvisor/pull/3804
+
+## Drawbacks
+
+- Adds complexity to the admission path
+- Small race window still exists between verification and container startup
+- May reject pods that would have succeeded if hugepages were freed during startup
+
+## Alternatives
+
+### Alternative 1: Track all pod hugepage usage
+
+Extend Memory Manager to track hugepage usage by Burstable and BestEffort pods.
+
+**Rejected because**:
+- Significant refactoring required
+- Would not catch external (non-Kubernetes) hugepage consumers
+- Changes the scope and purpose of Memory Manager
+
+### Alternative 2: Query sysfs directly in Memory Manager
+
+Read sysfs directly in Memory Manager without cadvisor changes.
+
+**Rejected because**:
+- Duplicates sysfs reading logic already in cadvisor
+- cadvisor already provides machine info abstraction
+- Adding to cadvisor benefits other consumers of machine info
+
+### Alternative 3: Scheduler-level hugepage awareness
+
+Add hugepage availability awareness to the Kubernetes scheduler.
+
+**Rejected because**:
+- Much larger scope change
+- Scheduler operates on reported capacity, not real-time availability
+- Does not solve the admission-time verification problem
diff --git a/keps/sig-node/NNNN-memory-manager-hugepages-verification/kep.yaml b/keps/sig-node/NNNN-memory-manager-hugepages-verification/kep.yaml
new file mode 100644
index 000000000000..4ae698cc7144
--- /dev/null
+++ b/keps/sig-node/NNNN-memory-manager-hugepages-verification/kep.yaml
@@ -0,0 +1,42 @@
+title: Memory Manager Hugepages Availability Verification
+kep-number: NNNN
+authors:
+  - "@srikalyan"
+owning-sig: sig-node
+participating-sigs: []
+status: provisional
+creation-date: 2025-12-24
+reviewers:
+  - TBD
+approvers:
+  - TBD
+
+see-also:
+  - "/keps/sig-node/1769-memory-manager"
+
+# The target maturity stage in the current dev cycle for this KEP.
+stage: alpha
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v1.33"
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  alpha: "v1.33"
+  beta: "v1.34"
+  stable: "v1.35"
+
+# The following PRR answers are required at alpha release
+# List the feature gate name and the components for which it must be enabled
+feature-gates:
+  - name: MemoryManagerHugepagesVerification
+    components:
+      - kubelet
+disable-supported: true
+
+# The following PRR answers are required at beta release
+metrics:
+  - memory_manager_hugepages_verification_failures_total
+  - memory_manager_hugepages_verification_latency_seconds

From 9cd22779d36c46b812dcbedb84787f8a9b550e86 Mon Sep 17 00:00:00 2001
From: Srikalyan Swayampakula <srikalyansswayam@gmail.com>
Date: Fri, 26 Dec 2025 08:49:13 -0800
Subject: [PATCH 02/10] Fix TOC to pass verify-toc CI check

---
 .../README.md                                         | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md b/keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md
index 7d2bc32a4d0d..d36e29c06206 100644
--- a/keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md
+++ b/keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md
@@ -17,10 +17,10 @@
   - [cadvisor Changes](#cadvisor-changes)
   - [Memory Manager Changes](#memory-manager-changes)
   - [Test Plan](#test-plan)
-    - [Prerequisite testing updates](#prerequisite-testing-updates)
-    - [Unit tests](#unit-tests)
-    - [Integration tests](#integration-tests)
-    - [e2e tests](#e2e-tests)
+      - [Prerequisite testing updates](#prerequisite-testing-updates)
+      - [Unit tests](#unit-tests)
+      - [Integration tests](#integration-tests)
+      - [e2e tests](#e2e-tests)
   - [Graduation Criteria](#graduation-criteria)
     - [Alpha](#alpha)
     - [Beta](#beta)
@@ -37,6 +37,9 @@
 - [Implementation History](#implementation-history)
 - [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
+  - [Alternative 1: Track all pod hugepage usage](#alternative-1-track-all-pod-hugepage-usage)
+  - [Alternative 2: Query sysfs directly in Memory Manager](#alternative-2-query-sysfs-directly-in-memory-manager)
+  - [Alternative 3: Scheduler-level hugepage awareness](#alternative-3-scheduler-level-hugepage-awareness)
 <!-- /toc -->
 
 ## Release Signoff Checklist

From 9a89040b6b54046ff510b003aad8ee386147ea58 Mon Sep 17 00:00:00 2001
From: Srikalyan Swayampakula <srikalyansswayam@gmail.com>
Date: Sat, 27 Dec 2025 09:16:53 -0800
Subject: [PATCH 03/10] Address reviewer feedback and close design gaps

Key changes:
- Update milestones to v1.36/v1.37/v1.38
- Clarify sysfs reading: add GetCurrentHugepagesInfo() for fresh reads
  (GetMachineInfo() is cached at startup, would be stale)
- Add Integration with Topology Manager section with policy behavior table
- Add Interaction with CPU Manager section
- Address reserved hugepages (free_hugepages is correct metric)
- Expand race condition discussion with failure handling details
- Rewrite Story 2 as "Rapid Pod Churn" with clear timeline
- Add "Static policy only" note (None policy not applicable)
- Specify error message format with example
- Add kubelet restart behavior note
- Update Risks table with new mitigations
- Fix unit test description (removed nil reference)
- Update TOC with new sections
- Link enhancement issue #5759

Related: https://github.com/kubernetes/enhancements/issues/5759
---
 .../README.md                                 | 350 ++++++++++++++----
 .../kep.yaml                                  |  11 +-
 2 files changed, 289 insertions(+), 72 deletions(-)
 rename keps/sig-node/{NNNN-memory-manager-hugepages-verification => 5759-memory-manager-hugepages-verification}/README.md (50%)
 rename keps/sig-node/{NNNN-memory-manager-hugepages-verification => 5759-memory-manager-hugepages-verification}/kep.yaml (88%)

diff --git a/keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md b/keps/sig-node/5759-memory-manager-hugepages-verification/README.md
similarity index 50%
rename from keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md
rename to keps/sig-node/5759-memory-manager-hugepages-verification/README.md
index d36e29c06206..d7ae7b123438 100644
--- a/keps/sig-node/NNNN-memory-manager-hugepages-verification/README.md
+++ b/keps/sig-node/5759-memory-manager-hugepages-verification/README.md
@@ -1,21 +1,27 @@
-# KEP-NNNN: Memory Manager Hugepages Availability Verification
+# KEP-5759: Memory Manager Hugepages Availability Verification
 
 <!-- toc -->
 - [Release Signoff Checklist](#release-signoff-checklist)
 - [Summary](#summary)
 - [Motivation](#motivation)
+  - [The Tracking Gap](#the-tracking-gap)
+  - [Real-World Example](#real-world-example)
   - [Goals](#goals)
   - [Non-Goals](#non-goals)
 - [Proposal](#proposal)
+  - [Current Admission Flow](#current-admission-flow)
   - [User Stories](#user-stories)
     - [Story 1: DPDK Application Admission Failure](#story-1-dpdk-application-admission-failure)
-    - [Story 2: Database Workload with Hugepages](#story-2-database-workload-with-hugepages)
+    - [Story 2: Rapid Pod Churn with Hugepages](#story-2-rapid-pod-churn-with-hugepages)
   - [Notes/Constraints/Caveats](#notesconstraintscaveats)
   - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
   - [Implementation Overview](#implementation-overview)
   - [cadvisor Changes](#cadvisor-changes)
   - [Memory Manager Changes](#memory-manager-changes)
+  - [Integration with Topology Manager](#integration-with-topology-manager)
+  - [Interaction with CPU Manager](#interaction-with-cpu-manager)
+  - [Observability](#observability)
   - [Test Plan](#test-plan)
       - [Prerequisite testing updates](#prerequisite-testing-updates)
       - [Unit tests](#unit-tests)
@@ -46,7 +52,8 @@
 
 Items marked with (R) are required *prior to targeting to a milestone / release*.
 
-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+  - Enhancement issue: https://github.com/kubernetes/enhancements/issues/5759
 - [ ] (R) KEP approvers have approved the KEP status as `implementable`
 - [ ] (R) Design details are appropriately documented
 - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
@@ -81,19 +88,44 @@ sufficient free hugepages exist.
 
 ## Motivation
 
-The Memory Manager tracks hugepage allocations for Guaranteed QoS pods to provide
-NUMA-aware memory and hugepage pinning. However, it operates on its internal
-accounting without verifying the actual state of hugepages on the system.
+The Memory Manager's Static policy tracks hugepage allocations for Guaranteed QoS
+pods to provide NUMA-aware memory and hugepage pinning. However, it operates on
+its internal accounting without verifying the actual state of hugepages on the
+system.
 
-This creates a problem when:
-1. Burstable or BestEffort pods consume hugepages (via hugetlbfs mounts or
-   `mmap` with `MAP_HUGETLB`) without being tracked by the Memory Manager
-2. External processes or other system components consume hugepages
-3. The Memory Manager's internal state becomes stale or inconsistent with reality
+### The Tracking Gap
 
-In these scenarios, a Guaranteed pod requesting hugepages may be admitted based
-on the Memory Manager's internal tracking, only to fail at runtime when the
-container attempts to use the already-exhausted hugepages.
+The Kubernetes scheduler tracks hugepages at the **node level**: it knows total
+hugepage capacity and allocated amounts per node. The Memory Manager's Static
+policy tracks hugepages at the **per-NUMA level**, but only for Guaranteed QoS
+pods that it manages for NUMA placement.
+
+This creates a tracking gap: **Burstable pods can legitimately request hugepages
+through standard Kubernetes resource requests** (e.g., `hugepages-2Mi: 1Gi`).
+These requests are:
+- Properly validated by the scheduler
+- Correctly configured in cgroup limits
+- Accounted for at the node level
+
+However, the Memory Manager does not track these Burstable pod allocations for
+NUMA placement purposes. When a subsequent Guaranteed pod requests hugepages:
+1. The scheduler approves it (node-level accounting shows availability)
+2. The Memory Manager's internal state shows hugepages as available
+3. But the OS has already allocated those hugepages to the Burstable pod
+4. The Guaranteed pod fails at runtime when hugepages are exhausted
+
+### Real-World Example
+
+From [issue #134395](https://github.com/kubernetes/kubernetes/issues/134395),
+on an m6id.32xlarge instance with 2 NUMA nodes:
+
+```
+Memory Manager internal state: 15.2 GB free hugepages
+Actual OS state (sysfs):       3.2 GB free hugepages
+```
+
+The 12 GB discrepancy was due to Burstable pods consuming hugepages that the
+Memory Manager wasn't tracking.
 
 ### Goals
 
@@ -121,55 +153,105 @@ by querying sysfs during pod admission. This involves:
    verify that OS-reported free hugepages meet or exceed the requested amount
    before admitting the pod
 
+### Current Admission Flow
+
+Understanding where this enhancement fits in the existing admission flow:
+
+1. **Scheduler**: Checks node-level hugepage capacity and allocations. Ensures
+   the node has sufficient total hugepages for the pod's request.
+
+2. **Kubelet Admission**: When a pod is assigned to a node, kubelet performs
+   local admission checks including resource availability.
+
+3. **Memory Manager (Static policy)**: For Guaranteed QoS pods, the Memory
+   Manager's `Allocate()` function:
+   - Checks its internal state for available hugepages per NUMA node
+   - Selects NUMA nodes for the allocation
+   - Updates its internal tracking
+   - **Gap**: Does not verify actual OS-reported free hugepages
+
+4. **Container Runtime**: Creates the container with cgroup limits set. If
+   hugepages are not actually available, the container fails at startup.
+
+**This KEP addresses the gap in step 3** by adding OS-level verification before
+updating internal tracking.
+
 ### User Stories
 
 #### Story 1: DPDK Application Admission Failure
 
 As a cluster administrator running DPDK-based network functions, I deploy a
-Burstable pod that mounts hugetlbfs and consumes 2GB of 1GB hugepages for packet
-buffer pools. Later, I deploy a Guaranteed pod also requesting 2GB of 1GB hugepages.
+Burstable pod that requests `hugepages-1Gi: 2Gi` for DPDK packet buffer pools.
+Later, I deploy a Guaranteed pod also requesting `hugepages-1Gi: 2Gi`.
 
 **Current behavior**: The Guaranteed pod is admitted (Memory Manager shows
 hugepages as available) but fails at container startup when DPDK tries to allocate
-hugepages that are already consumed.
+hugepages that are already consumed by the Burstable pod.
 
 **Desired behavior**: The Guaranteed pod admission fails immediately with a clear
 error indicating insufficient free hugepages, allowing the scheduler to try
 another node or the administrator to take corrective action.
 
-#### Story 2: Database Workload with Hugepages
+#### Story 2: Rapid Pod Churn with Hugepages
 
-As a database administrator, I run PostgreSQL with hugepages enabled for shared
-buffers. If an external monitoring agent or debugging tool temporarily consumes
-hugepages, subsequent Guaranteed pods requesting hugepages should not be admitted
-until hugepages are freed.
+As a platform engineer, I run batch jobs that use hugepages. Multiple jobs complete
+and new jobs start in quick succession:
 
-**Current behavior**: Pods are admitted based on Memory Manager tracking and fail
-at runtime.
+1. Node has 8GB of 2MB hugepages total
+2. Burstable Job A (requests 4GB hugepages) completes, releasing hugepages
+3. Guaranteed Job B (requests 6GB hugepages) is scheduled to this node
+4. Before Job B's container starts, Burstable Job C (requests 4GB hugepages) starts
+5. Job C's container allocates hugepages from the OS
 
-**Desired behavior**: Pods are rejected at admission with informative errors.
+**Current behavior**: The scheduler approved Job B based on node capacity (8GB).
+Memory Manager's internal state (tracking only Guaranteed pods) shows 8GB available.
+Job B is admitted, but when its container starts, only 4GB are actually free.
+Job B fails at runtime.
+
+**Desired behavior**: Memory Manager reads sysfs during admission and sees only
+4GB free. Job B is rejected with error:
+`insufficient hugepages-2Mi on NUMA node(s) [0,1]: requested 6Gi, available 4Gi`
+
+Job B can be rescheduled to another node with sufficient hugepages.
 
 ### Notes/Constraints/Caveats
 
-- **Race condition window**: A small window exists between verification and actual
-  container startup where hugepages could be consumed. This is inherent to any
-  admission-time check but significantly reduces the failure window compared to
-  no verification.
+- **Race condition window**: A window exists between verification and actual
+  container startup where hugepages could be consumed by another process. This is
+  inherent to any admission-time check.
+
+  **What happens if verification passes but container still fails?**
+  1. Container startup fails with OOM or hugepage allocation error
+  2. Kubelet emits `FailedCreatePodContainer` event with details
+  3. Pod enters `CrashLoopBackOff` or `Error` state
+  4. A workload controller (e.g., Job or ReplicaSet) may create a replacement
+     pod, which the scheduler can then place on another node (if applicable)
+
+  **Why this is still valuable**: Without verification, the failure window spans
+  from pod scheduling to container startup (seconds to minutes). With verification,
+  the window is reduced to milliseconds between sysfs read and container start.
+  The vast majority of failures are prevented.
 
-- **sysfs dependency**: The feature depends on reading from sysfs. If sysfs is
-  unavailable or the free_hugepages file cannot be read, the feature gracefully
-  degrades to current behavior (no verification).
+- **Linux-only**: This feature is Linux-specific. The sysfs interface for hugepages
+  (`/sys/devices/system/node/node<N>/hugepages/`) is a Linux kernel feature.
+  On Linux systems where hugepages are configured, this sysfs interface is always
+  available.
 
 - **Per-NUMA verification**: Verification is performed per-NUMA node, consistent
-  with the Memory Manager's NUMA-aware design.
+  with the Memory Manager's NUMA-aware design and Topology Manager coordination.
+
+- **Static policy only**: Verification only applies when Memory Manager's Static
+  policy is enabled. With the "None" policy, Memory Manager doesn't track hugepage
+  allocations at all, so there's no internal state to become stale. The scheduler's
+  node-level tracking is the only safeguard with the None policy.
 
 ### Risks and Mitigations
 
 | Risk | Mitigation |
 |------|------------|
-| sysfs reads add latency to admission | Minimal impact: single file read per hugepage size per NUMA node |
-| False rejections due to transient consumption | Acceptable: better to reject than admit and fail at runtime |
-| sysfs unavailable in some environments | Graceful degradation: skip verification if sysfs unreadable |
+| sysfs reads add latency to admission | Minimal impact: single file read per hugepage size per NUMA node; < 1ms typically |
+| False rejections due to transient consumption | Acceptable: better to reject than admit and fail at runtime; pod can be rescheduled |
+| Verification passes but container still fails (race) | Window is milliseconds vs seconds/minutes without verification; event emitted for debugging |
+| Fresh sysfs reads on every Allocate() | Lightweight operation; only triggered for pods requesting hugepages |
 
 ## Design Details
 
@@ -177,55 +259,159 @@ at runtime.
 
 The implementation consists of two parts:
 
-1. **cadvisor**: Add `FreePages *uint64` field to `HugePagesInfo` struct, populated
-   from sysfs. Uses pointer with `omitempty` to distinguish between "0 free" and
-   "data unavailable".
+1. **cadvisor**: Add `FreePages uint64` field to `HugePagesInfo` struct, populated
+   from sysfs. Also expose a method to read current free hugepages on-demand.
 
 2. **kubelet Memory Manager**: Add `verifyOSHugepagesAvailability()` function
-   called during `Allocate()` that compares requested hugepages against OS-reported
-   free hugepages from cadvisor's machine info.
+   called during `Allocate()` that reads **fresh** hugepage availability from sysfs.
+
+**Important**: cadvisor's `GetMachineInfo()` is called once at startup and cached.
+The `FreePages` field in cached machine info would be stale. Therefore, verification
+must read sysfs directly during each `Allocate()` call, not rely on cached values.
+We will add a `GetCurrentHugepagesInfo()` method to cadvisor's `Manager` interface
+that performs a fresh sysfs read.
 
 ### cadvisor Changes
 
+**Struct update**:
 ```go
 type HugePagesInfo struct {
     // huge page size (in kB)
     PageSize uint64 `json:"page_size"`
     // number of huge pages
     NumPages uint64 `json:"num_pages"`
-    // number of free huge pages (nil if unavailable)
-    FreePages *uint64 `json:"free_pages,omitempty"`
+    // number of free huge pages
+    FreePages uint64 `json:"free_pages"`
 }
 ```
 
+**New method on Manager interface**:
+```go
+// GetCurrentHugepagesInfo returns fresh hugepage info per NUMA node by reading sysfs.
+// This is separate from GetMachineInfo() which returns cached startup data.
+func (m *manager) GetCurrentHugepagesInfo() (map[int][]HugePagesInfo, error)
+```
+
 The `FreePages` field is populated by reading from:
 ```
 /sys/devices/system/node/node<N>/hugepages/hugepages-<size>kB/free_hugepages
 ```
 
+**Note on reserved hugepages**: Linux tracks `resv_hugepages` (reserved but not
+yet faulted). For this implementation, we use `free_hugepages` directly because:
+- Reserved pages are committed to specific processes
+- A new pod cannot use reserved pages
+- `free_hugepages` accurately reflects what's available for new allocations
+
+**Note**: Since sysfs is always available on Linux systems with hugepages configured,
+we use a simple `uint64` rather than a pointer. A value of 0 means zero free
+hugepages are available.
+
 ### Memory Manager Changes
 
 During `Allocate()` in the Static policy:
 
 ```go
 func (p *staticPolicy) verifyOSHugepagesAvailability(
-    machineState state.NUMANodeMap,
+    candidateNUMANodes []int,  // NUMA nodes selected by allocation algorithm
     pod *v1.Pod,
     container *v1.Container,
 ) error {
-    // For each hugepage size requested by the container:
-    // 1. Get the OS-reported free hugepages from cadvisor machine info
-    // 2. Compare against the requested amount
-    // 3. Return error if insufficient
+    // 1. Call cadvisor's GetCurrentHugepagesInfo() to get fresh sysfs data
+    // 2. For each hugepage size requested by the container:
+    //    a. Sum free hugepages across candidateNUMANodes only
+    //    b. Compare against the requested amount
+    // 3. Return error if insufficient, with detailed message
 }
 ```
 
 The verification:
-- Only runs when the Static policy is enabled
+- Only runs when the Static policy is enabled and feature gate is on
 - Only checks hugepage resources (not regular memory)
-- Aggregates free hugepages across candidate NUMA nodes
+- **Respects NUMA node selection**: Only checks the specific NUMA nodes that the
+  Memory Manager's allocation algorithm has selected (see Topology Manager section)
 - Returns admission error if insufficient free hugepages
 
+**Error message format**:
+```
+insufficient hugepages-2Mi on NUMA node(s) [0]: requested 4Gi, available 2Gi
+```
+
+### Integration with Topology Manager
+
+The Memory Manager works with Topology Manager to coordinate NUMA-aware resource
+allocation. The verification must respect Topology Manager's policy:
+
+| Topology Policy | Verification Behavior |
+|-----------------|----------------------|
+| `none` | Not applicable (Memory Manager Static policy requires topology-aware policies) |
+| `best-effort` | Check aggregate across all candidate NUMA nodes |
+| `restricted` | Check only NUMA nodes that satisfy topology constraints |
+| `single-numa-node` | Check only the single selected NUMA node |
+
+**Critical**: Verification happens **after** the Memory Manager's allocation algorithm
+selects candidate NUMA nodes based on topology constraints. We verify against those
+specific nodes, not all nodes on the system.
+
+Example with `single-numa-node` policy:
+```
+Node topology: NUMA0 (2GB free), NUMA1 (3GB free)
+Pod requests: 2GB hugepages
+Allocation selects: NUMA0 (meets the request)
+Verification checks: NUMA0 only → 2GB available ≥ 2GB requested ✓
+```
+
+Example where aggregate would be misleading:
+```
+Node topology: NUMA0 (1GB free), NUMA1 (1GB free)
+Pod requests: 2GB hugepages with single-numa-node policy
+Allocation fails: Neither NUMA node has 2GB alone
+(Verification never reached - allocation algorithm rejects first)
+```
+
+### Interaction with CPU Manager
+
+When CPU Manager pins a pod to specific CPUs, those CPUs belong to specific NUMA
+nodes. Topology Manager coordinates this to ensure Memory Manager allocates from
+the same NUMA node(s). The verification inherits this coordination because it
+checks only the candidate NUMA nodes selected by the allocation algorithm.
+
+### Observability
+
+This feature provides explicit signals for operators to monitor hugepage verification:
+
+#### Metrics
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `memory_manager_hugepages_verification_total` | Counter | Total verification checks performed. Labels: `result` (success/failure), `hugepage_size` |
+| `memory_manager_hugepages_verification_failures_total` | Counter | Pods rejected due to insufficient OS-reported hugepages. Labels: `hugepage_size`, `numa_node` |
+| `memory_manager_hugepages_verification_latency_seconds` | Histogram | Time spent performing verification (buckets: 1ms to 100ms) |
+
+#### Events
+
+When a pod is rejected due to insufficient hugepages, a Kubernetes event is generated:
+
+```
+Type:    Warning
+Reason:  FailedHugepagesVerification
+Message: insufficient hugepages-2Mi on NUMA node(s) [0]: requested 4Gi, available 2Gi
+```
+
+#### Kubelet Logs
+
+At `--v=4` or higher, kubelet logs verification details:
+```
+I0127 10:15:32.123456 12345 policy_static.go:XXX] "Verifying OS hugepages availability" pod="default/dpdk-app" container="dpdk"
+I0127 10:15:32.123789 12345 policy_static.go:XXX] "Hugepages verification passed" pod="default/dpdk-app" numaNodes=[0] size="hugepages-2Mi" requested=1073741824 available=2147483648
+```
+
+#### Alerting Recommendations
+
+Operators should consider alerts for:
+- `rate(memory_manager_hugepages_verification_failures_total[5m]) > 0`: Pods being rejected
+- `histogram_quantile(0.99, sum by (le) (rate(memory_manager_hugepages_verification_latency_seconds_bucket[5m]))) > 0.05`: High verification latency
+
 ### Test Plan
 
 [x] I/we understand the owners of the involved components may require updates to
@@ -242,8 +428,10 @@ to implement this enhancement.
 - `pkg/kubelet/cm/memorymanager`: Add tests for `verifyOSHugepagesAvailability()`
   - Test successful verification when free hugepages >= requested
   - Test rejection when free hugepages < requested
-  - Test graceful handling when FreePages is nil (sysfs unavailable)
-  - Test per-NUMA node verification
+  - Test verification with zero free hugepages (FreePages = 0)
+  - Test per-NUMA node verification respects candidate node selection
+  - Test multiple hugepage sizes in same request
+  - Test with feature gate enabled/disabled
 
 ##### Integration tests
 
@@ -262,6 +450,9 @@ to implement this enhancement.
 
 - Feature implemented behind `MemoryManagerHugepagesVerification` feature gate
 - Unit tests for verification logic
+- E2e tests demonstrating:
+  - Pod admission succeeds when sufficient free hugepages exist
+  - Pod admission fails when insufficient free hugepages exist
 - Documentation for feature gate and behavior
 
 #### Beta
@@ -285,15 +476,19 @@ by a feature gate. Existing pods are unaffected.
 **Downgrade**: Disabling the feature gate returns to previous behavior where
 OS hugepage availability is not verified. No data migration needed.
 
+**Kubelet restart behavior**: After kubelet restarts, Memory Manager rebuilds its
+internal state from checkpoint. Since verification reads fresh sysfs data on each
+`Allocate()` call, there's no stale state concern. New pod admissions after restart
+will correctly verify against current OS hugepage availability.
+
 ### Version Skew Strategy
 
 The feature is entirely within the kubelet and depends on cadvisor (vendored).
 No control plane or cross-component version skew concerns.
 
-When kubelet is upgraded but cadvisor hasn't been updated to provide `FreePages`:
-- The field will be `nil`
-- Verification will be skipped (graceful degradation)
-- Warning logged indicating verification unavailable
+Since cadvisor is vendored into kubelet, the kubelet and cadvisor versions are
+always synchronized. The `FreePages` field will be available when the feature
+gate is enabled.
 
 ## Production Readiness Review Questionnaire
 
@@ -322,7 +517,11 @@ The feature resumes verification on new pod admissions. No special handling need
 
 ###### Are there any tests for feature enablement/disablement?
 
-Unit tests will verify behavior with feature gate enabled and disabled.
+Yes. Unit tests will verify:
+- When feature gate is disabled: verification is skipped, pods are admitted
+  based on Memory Manager's internal tracking (existing behavior)
+- When feature gate is enabled: verification is performed, pods are rejected
+  if OS-reported free hugepages are insufficient
 
 ### Rollout, Upgrade and Rollback Planning
 
@@ -367,14 +566,25 @@ No.
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
 - [x] Metrics
+  - Metric name: `memory_manager_hugepages_verification_total`
+    - Components exposing the metric: kubelet
+    - Description: Total number of hugepages verification checks performed
+    - Labels: `result` (success, failure), `hugepage_size` (e.g., 2Mi, 1Gi)
   - Metric name: `memory_manager_hugepages_verification_failures_total`
-  - Components exposing the metric: kubelet
+    - Components exposing the metric: kubelet
+    - Description: Total number of pods rejected due to insufficient OS-reported hugepages
+    - Labels: `hugepage_size`, `numa_node`
   - Metric name: `memory_manager_hugepages_verification_latency_seconds`
-  - Components exposing the metric: kubelet
+    - Components exposing the metric: kubelet
+    - Description: Histogram of time spent performing hugepages verification
+    - Buckets: 0.001, 0.005, 0.01, 0.025, 0.05, 0.1 seconds
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
-The proposed metrics should provide adequate observability.
+Additional metrics that could be added in Beta:
+- `memory_manager_hugepages_discrepancy_bytes`: Gauge showing difference between
+  Memory Manager's internal tracking and OS-reported free hugepages (useful for
+  detecting drift)
 
 ### Dependencies
 
@@ -423,22 +633,28 @@ No impact. The feature operates entirely within kubelet using local sysfs.
 
 ###### What are other known failure modes?
 
-- sysfs unavailable or unreadable
-  - Detection: Warning logs from kubelet, nil FreePages in machine info
-  - Mitigations: Feature gracefully degrades to previous behavior
-  - Diagnostics: Check kubelet logs for sysfs read warnings
-  - Testing: Unit tests cover this scenario
+- Verification rejects pods that would have succeeded
+  - Detection: Increase in `memory_manager_hugepages_verification_failures_total`
+    with pods eventually succeeding on retry
+  - Mitigations: This indicates transient hugepage consumption; the feature is
+    working correctly by preventing admission during contention
+  - Diagnostics: Compare verification failure count with actual runtime failures
+  - Testing: E2e tests verify this scenario
 
 ###### What steps should be taken if SLOs are not being met to determine the problem?
 
 1. Check kubelet logs for verification-related messages
-2. Verify sysfs is accessible and free_hugepages files exist
-3. Compare Memory Manager state with actual sysfs values
+2. Review `memory_manager_hugepages_verification_latency_seconds` histogram
+   for unusually slow verification
+3. Compare Memory Manager state with actual sysfs values using:
+   `cat /sys/devices/system/node/node*/hugepages/hugepages-*/free_hugepages`
 4. Check for excessive pod admission rate causing contention
 
 ## Implementation History
 
 - 2024-12-24: Initial KEP draft
+- 2024-12-27: KEP updated based on reviewer feedback
+- Enhancement issue: https://github.com/kubernetes/enhancements/issues/5759
 - Related issue: https://github.com/kubernetes/kubernetes/issues/134395
 - cadvisor PR: https://github.com/google/cadvisor/pull/3804
 
diff --git a/keps/sig-node/NNNN-memory-manager-hugepages-verification/kep.yaml b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml
similarity index 88%
rename from keps/sig-node/NNNN-memory-manager-hugepages-verification/kep.yaml
rename to keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml
index 4ae698cc7144..9834b8d39f9e 100644
--- a/keps/sig-node/NNNN-memory-manager-hugepages-verification/kep.yaml
+++ b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml
@@ -1,5 +1,5 @@
 title: Memory Manager Hugepages Availability Verification
-kep-number: NNNN
+kep-number: 5759
 authors:
   - "@srikalyan"
 owning-sig: sig-node
@@ -20,13 +20,13 @@ stage: alpha
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.33"
+latest-milestone: "v1.36"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
-  alpha: "v1.33"
-  beta: "v1.34"
-  stable: "v1.35"
+  alpha: "v1.36"
+  beta: "v1.37"
+  stable: "v1.38"
 
 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
@@ -38,5 +38,6 @@ disable-supported: true
 
 # The following PRR answers are required at beta release
 metrics:
+  - memory_manager_hugepages_verification_total
   - memory_manager_hugepages_verification_failures_total
   - memory_manager_hugepages_verification_latency_seconds

From 8e6ae0902c2c2e3acdbcb20df3a78e12e574574f Mon Sep 17 00:00:00 2001
From: Srikalyan Swayampakula <srikalyansswayam@gmail.com>
Date: Sat, 27 Dec 2025 12:11:56 -0800
Subject: [PATCH 04/10] Present implementation options without recommendation

- Add two implementation approaches: Option A (direct sysfs) and Option B (cadvisor)
- Present pros/cons for each option neutrally for KEP review
- Remove cadvisor-specific sections, replace with options discussion
- Add Observability section with metrics, events, logs, alerting
- Update TOC to pass CI verification
- Update KEP number to 5759 throughout

The choice between implementation approaches is left to KEP reviewers
based on maintainability preferences and timeline considerations.
---
 .../README.md                                 | 143 +++++++++---------
 1 file changed, 73 insertions(+), 70 deletions(-)

diff --git a/keps/sig-node/5759-memory-manager-hugepages-verification/README.md b/keps/sig-node/5759-memory-manager-hugepages-verification/README.md
index d7ae7b123438..9eb9ccd4d4db 100644
--- a/keps/sig-node/5759-memory-manager-hugepages-verification/README.md
+++ b/keps/sig-node/5759-memory-manager-hugepages-verification/README.md
@@ -17,11 +17,18 @@
   - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
   - [Implementation Overview](#implementation-overview)
-  - [cadvisor Changes](#cadvisor-changes)
+  - [Implementation Approaches](#implementation-approaches)
+    - [Option A: Direct sysfs Reading in Memory Manager](#option-a-direct-sysfs-reading-in-memory-manager)
+    - [Option B: Add Fresh-Read Method to cadvisor](#option-b-add-fresh-read-method-to-cadvisor)
+  - [sysfs Interface](#sysfs-interface)
   - [Memory Manager Changes](#memory-manager-changes)
   - [Integration with Topology Manager](#integration-with-topology-manager)
   - [Interaction with CPU Manager](#interaction-with-cpu-manager)
   - [Observability](#observability)
+    - [Metrics](#metrics)
+    - [Events](#events)
+    - [Kubelet Logs](#kubelet-logs)
+    - [Alerting Recommendations](#alerting-recommendations)
   - [Test Plan](#test-plan)
       - [Prerequisite testing updates](#prerequisite-testing-updates)
       - [Unit tests](#unit-tests)
@@ -44,8 +51,7 @@
 - [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
   - [Alternative 1: Track all pod hugepage usage](#alternative-1-track-all-pod-hugepage-usage)
-  - [Alternative 2: Query sysfs directly in Memory Manager](#alternative-2-query-sysfs-directly-in-memory-manager)
-  - [Alternative 3: Scheduler-level hugepage awareness](#alternative-3-scheduler-level-hugepage-awareness)
+  - [Alternative 2: Scheduler-level hugepage awareness](#alternative-2-scheduler-level-hugepage-awareness)
 <!-- /toc -->
 
 ## Release Signoff Checklist
@@ -144,14 +150,14 @@ Memory Manager wasn't tracking.
 ## Proposal
 
 Enhance the Memory Manager's Static policy to verify actual hugepage availability
-by querying sysfs during pod admission. This involves:
+by querying sysfs during pod admission:
 
-1. **cadvisor enhancement**: Add a `FreePages` field to `HugePagesInfo` struct
-   that reports free hugepages per NUMA node, read from sysfs
+**Memory Manager enhancement**: During `Allocate()` in the Static policy,
+verify that OS-reported free hugepages (read from sysfs) meet or exceed the
+requested amount before admitting the pod.
 
-2. **Memory Manager enhancement**: During `Allocate()` in the Static policy,
-   verify that OS-reported free hugepages meet or exceed the requested amount
-   before admitting the pod
+See [Implementation Approaches](#implementation-approaches) for options on how
+the sysfs reading is performed.
 
 ### Current Admission Flow
 
@@ -257,56 +263,62 @@ Job B can be rescheduled to another node with sufficient hugepages.
 
 ### Implementation Overview
 
-The implementation consists of two parts:
+The core enhancement is adding a `verifyOSHugepagesAvailability()` function to
+the Memory Manager's Static policy, called during `Allocate()`. This function
+reads fresh hugepage availability and rejects the pod when free hugepages are insufficient.
 
-1. **cadvisor**: Add `FreePages uint64` field to `HugePagesInfo` struct, populated
-   from sysfs. Also expose a method to read current free hugepages on-demand.
+### Implementation Approaches
 
-2. **kubelet Memory Manager**: Add `verifyOSHugepagesAvailability()` function
-   called during `Allocate()` that reads **fresh** hugepage availability from sysfs.
+There are two approaches for reading free hugepages:
 
-**Important**: cadvisor's `GetMachineInfo()` is called once at startup and cached.
-The `FreePages` field in cached machine info would be stale. Therefore, verification
-must read sysfs directly during each `Allocate()` call, not rely on cached values.
-We will add a `GetCurrentHugepagesInfo()` method to cadvisor's `Manager` interface
-that performs a fresh sysfs read.
+#### Option A: Direct sysfs Reading in Memory Manager
 
-### cadvisor Changes
+Read sysfs directly in the Memory Manager without cadvisor changes.
 
-**Struct update**:
-```go
-type HugePagesInfo struct {
-    // huge page size (in kB)
-    PageSize uint64 `json:"page_size"`
-    // number of huge pages
-    NumPages uint64 `json:"num_pages"`
-    // number of free huge pages
-    FreePages uint64 `json:"free_pages"`
-}
-```
+**Pros:**
+- No external dependencies on critical admission path
+- Simple implementation (~10 lines of sysfs reading)
+- Faster to implement and merge (single repo)
+- Memory Manager already reads memory topology from sysfs (precedent)
 
-**New method on Manager interface**:
-```go
-// GetCurrentHugepagesInfo returns fresh hugepage info per NUMA node by reading sysfs.
-// This is separate from GetMachineInfo() which returns cached startup data.
-func (m *manager) GetCurrentHugepagesInfo() (map[int][]HugePagesInfo, error)
-```
+**Cons:**
+- Duplicates sysfs reading logic (though trivial)
+- Other cadvisor consumers don't benefit
+
+#### Option B: Add Fresh-Read Method to cadvisor
+
+Add `GetCurrentHugepagesInfo()` method to cadvisor that reads sysfs on-demand.
+
+**Note**: cadvisor's existing `GetMachineInfo()` result is cached at startup, so
+a `FreePages` field added there would go stale. A new method that performs a
+fresh read on each call would be required.
+
+**Pros:**
+- Single source of truth for hugepage info
+- Benefits other cadvisor consumers
+- Cleaner abstraction
 
-The `FreePages` field is populated by reading from:
+**Cons:**
+- Cross-repo dependency (cadvisor PR must merge first)
+- Adds API surface to cadvisor
+- Longer timeline
+
+The choice between options should be made during KEP review based on
+maintainability preferences and timeline considerations.
+
+### sysfs Interface
+
+Regardless of approach, free hugepages are read from:
 ```
 /sys/devices/system/node/node<N>/hugepages/hugepages-<size>kB/free_hugepages
 ```
 
 **Note on reserved hugepages**: Linux tracks `resv_hugepages` (reserved but not
-yet faulted). For this implementation, we use `free_hugepages` directly because:
+yet faulted). We use `free_hugepages` directly because:
 - Reserved pages are committed to specific processes
 - A new pod cannot use reserved pages
 - `free_hugepages` accurately reflects what's available for new allocations
 
-**Note**: Since sysfs is always available on Linux systems with hugepages configured,
-we use a simple `uint64` rather than a pointer. A value of 0 means zero free
-hugepages are available.
-
 ### Memory Manager Changes
 
 During `Allocate()` in the Static policy:
@@ -317,7 +329,7 @@ func (p *staticPolicy) verifyOSHugepagesAvailability(
     pod *v1.Pod,
     container *v1.Container,
 ) error {
-    // 1. Call cadvisor's GetCurrentHugepagesInfo() to get fresh sysfs data
+    // 1. Read free hugepages directly from sysfs for each NUMA node
     // 2. For each hugepage size requested by the container:
     //    a. Sum free hugepages across candidateNUMANodes only
     //    b. Compare against the requested amount
@@ -421,7 +433,7 @@ to implement this enhancement.
 ##### Prerequisite testing updates
 
 - Existing Memory Manager unit tests cover allocation logic
-- cadvisor tests cover sysfs reading functionality
+- For Option B: cadvisor tests cover sysfs reading functionality
 
 ##### Unit tests
 
@@ -435,7 +447,7 @@ to implement this enhancement.
 
 ##### Integration tests
 
-- Test Memory Manager with mocked cadvisor returning various FreePages values
+- Test Memory Manager with mocked hugepage availability (sysfs or cadvisor depending on chosen approach)
 - Test admission flow with hugepage verification enabled/disabled
 
 ##### e2e tests
@@ -483,12 +495,11 @@ will correctly verify against current OS hugepage availability.
 
 ### Version Skew Strategy
 
-The feature is entirely within the kubelet and depends on cadvisor (vendored).
-No control plane or cross-component version skew concerns.
+The feature is entirely within the kubelet. No control plane or cross-component
+version skew concerns.
 
-Since cadvisor is vendored into kubelet, the kubelet and cadvisor versions are
-always synchronized. The `FreePages` field will be available when the feature
-gate is enabled.
+- **Option A**: No version skew concerns (direct sysfs reading)
+- **Option B**: Since cadvisor is vendored into kubelet, versions are synchronized
 
 ## Production Readiness Review Questionnaire
 
@@ -547,8 +558,9 @@ No.
 
 ###### How can an operator determine if the feature is in use by workloads?
 
-- Feature gate is enabled
-- Pods request hugepages resources
+- Feature gate `MemoryManagerHugepagesVerification` is enabled
+- Metric `memory_manager_hugepages_verification_total` is incrementing (indicates verification checks are being performed)
+- Pods with Guaranteed QoS requesting hugepages resources are being scheduled
 
 ###### How can someone using this feature know that it is working for their instance?
 
@@ -590,16 +602,16 @@ Additional metrics that could be added in Beta:
 
 ###### Does this feature depend on any specific services running in the cluster?
 
-- cadvisor (bundled with kubelet)
-  - Usage: Provides machine info including hugepage free counts
-  - Impact of outage: Verification skipped, graceful degradation
-  - Impact of degraded performance: Slightly increased admission latency
+Depends on the implementation approach chosen (see [Implementation Approaches](#implementation-approaches)):
+
+- **Option A (Direct sysfs)**: No external dependencies. Reads directly from Linux sysfs.
+- **Option B (cadvisor)**: Depends on cadvisor (bundled with kubelet) for fresh hugepage reads.
 
 ### Scalability
 
 ###### Will enabling / using this feature result in any new API calls?
 
-No new API calls. The feature reads from local sysfs and cadvisor machine info.
+No new API calls. The feature reads from local sysfs files.
 
 ###### Will enabling / using this feature result in introducing new API types?
 
@@ -653,10 +665,10 @@ No impact. The feature operates entirely within kubelet using local sysfs.
 ## Implementation History
 
 - 2024-12-24: Initial KEP draft
-- 2024-12-27: KEP updated based on reviewer feedback
+- 2024-12-27: KEP updated based on reviewer feedback; added implementation options
 - Enhancement issue: https://github.com/kubernetes/enhancements/issues/5759
 - Related issue: https://github.com/kubernetes/kubernetes/issues/134395
-- cadvisor PR: https://github.com/google/cadvisor/pull/3804
+- cadvisor PR (for Option B): https://github.com/google/cadvisor/pull/3804 (draft)
 
 ## Drawbacks
 
@@ -675,16 +687,7 @@ Extend Memory Manager to track hugepage usage by Burstable and BestEffort pods.
 - Would not catch external (non-Kubernetes) hugepage consumers
 - Changes the scope and purpose of Memory Manager
 
-### Alternative 2: Query sysfs directly in Memory Manager
-
-Read sysfs directly in Memory Manager without cadvisor changes.
-
-**Rejected because**:
-- Duplicates sysfs reading logic already in cadvisor
-- cadvisor already provides machine info abstraction
-- Adding to cadvisor benefits other consumers of machine info
-
-### Alternative 3: Scheduler-level hugepage awareness
+### Alternative 2: Scheduler-level hugepage awareness
 
 Add hugepage availability awareness to the Kubernetes scheduler.
 

From 36099e341fa39240d1a444082d820aa3fe6a1ef2 Mon Sep 17 00:00:00 2001
From: Srikalyan Swayampakula <srikalyansswayam@gmail.com>
Date: Thu, 22 Jan 2026 13:24:41 -0800
Subject: [PATCH 05/10] KEP-5759: Add reviewers and approvers to kep.yaml

- Add ffromani, derekwaynecarr, mrunalp as reviewers
- Add dchen1107 as approver (sig-node OWNERS)
---
 .../5759-memory-manager-hugepages-verification/kep.yaml     | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml
index 9834b8d39f9e..e8a539d758fd 100644
--- a/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml
+++ b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml
@@ -7,9 +7,11 @@ participating-sigs: []
 status: provisional
 creation-date: 2024-12-24
 reviewers:
-  - TBD
+  - "@ffromani"
+  - "@derekwaynecarr"
+  - "@mrunalp"
 approvers:
-  - TBD
+  - "@dchen1107"
 
 see-also:
   - "/keps/sig-node/1769-memory-manager"

From f81f9228b11cd259750faf2ff7e146e4824f8611 Mon Sep 17 00:00:00 2001
From: Srikalyan Swayampakula <srikalyansswayam@gmail.com>
Date: Sat, 31 Jan 2026 23:46:54 -0800
Subject: [PATCH 06/10] KEP-5759: Add PRR approval file and update approvers

- Add haircommander (Peter Hunt) as KEP approver
- Add PRR approval file for alpha stage with johnbelamaric as approver
---
 keps/prod-readiness/sig-node/5759.yaml                      | 6 ++++++
 .../5759-memory-manager-hugepages-verification/kep.yaml     | 1 +
 2 files changed, 7 insertions(+)
 create mode 100644 keps/prod-readiness/sig-node/5759.yaml

diff --git a/keps/prod-readiness/sig-node/5759.yaml b/keps/prod-readiness/sig-node/5759.yaml
new file mode 100644
index 000000000000..ba028ff810b3
--- /dev/null
+++ b/keps/prod-readiness/sig-node/5759.yaml
@@ -0,0 +1,6 @@
+# The KEP must have an approver from the
+# "prod-readiness-approvers" group
+# of http://git.k8s.io/enhancements/OWNERS_ALIASES
+kep-number: 5759
+alpha:
+  approver: "@johnbelamaric"
diff --git a/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml
index e8a539d758fd..0f349e3c39c1 100644
--- a/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml
+++ b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml
@@ -12,6 +12,7 @@ reviewers:
   - "@mrunalp"
 approvers:
   - "@dchen1107"
+  - "@haircommander"
 
 see-also:
   - "/keps/sig-node/1769-memory-manager"

From 4753aad37e8e2a367d4698f6b438e0166f4fb6e3 Mon Sep 17 00:00:00 2001
From: Srikalyan Swayampakula <srikalyansswayam@hotmail.com>
Date: Wed, 11 Feb 2026 22:09:10 -0800
Subject: [PATCH 07/10] Apply suggestion from @wendy-ha18

Co-authored-by: Wendy Ha <139814343+wendy-ha18@users.noreply.github.com>
---
 .../5759-memory-manager-hugepages-verification/kep.yaml         | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml
index 0f349e3c39c1..1e7b2aa6dbff 100644
--- a/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml
+++ b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml
@@ -4,7 +4,7 @@ authors:
   - "@srikalyan"
 owning-sig: sig-node
 participating-sigs: []
-status: provisional
+status: implementable
 creation-date: 2024-12-24
 reviewers:
   - "@ffromani"

From f80d3accef25ec42f7e2b9c9795f7f2fbca7833d Mon Sep 17 00:00:00 2001
From: Srikalyan Swayampakula <srikalyansswayam@gmail.com>
Date: Wed, 11 Feb 2026 22:44:25 -0800
Subject: [PATCH 08/10] Address PRR and design review feedback

- Move metrics from Beta to Alpha graduation criteria per ffromani's
  request to have observability available at alpha stage
- Change "TBD during alpha phase" to "Will be done during alpha phase"
  per johnbelamaric's nit on the upgrade/rollback testing question
- Add Alternative 3: Standalone NUMA-aware hugepages admission handler
  with pros/cons analysis per ffromani's suggestion
- Expand Alternative 1 with NUMA tracking limitation: without
  cpuset.mems enforcement, NUMA node allocation is unknown until
  container runtime, making per-pod tracking infeasible at admission
- Reframe race condition caveat to emphasize kubelet/workload contract
  breach rather than just startup failure timing
- Relax milestone timeline: beta v1.38, stable v1.40
- Remove sysfs availability from risk table (sysfs is a kubelet precondition)
- Recommend Option A (direct sysfs reading) with rationale
- Remove feature gate as safety mechanism framing throughout
- Remove hardcoded error message format (not a public API)
- Remove specific log format and alerting recommendation sections
- Simplify Events section to describe behavior without locking format
- Move conformance tests from GA to Beta criteria
- Update GA to "feature always enabled (feature gate removed)"
- Reword Upgrade/Downgrade without feature gate dependency
- Update rollback answer to reflect always-enabled at GA
- Replace speculative discrepancy metric with alpha evaluation plan
---
 .../README.md                                 | 135 ++++++++++--------
 .../kep.yaml                                  |   4 +-
 2 files changed, 81 insertions(+), 58 deletions(-)

diff --git a/keps/sig-node/5759-memory-manager-hugepages-verification/README.md b/keps/sig-node/5759-memory-manager-hugepages-verification/README.md
index 9eb9ccd4d4db..f6c072bda6f6 100644
--- a/keps/sig-node/5759-memory-manager-hugepages-verification/README.md
+++ b/keps/sig-node/5759-memory-manager-hugepages-verification/README.md
@@ -27,8 +27,6 @@
   - [Observability](#observability)
     - [Metrics](#metrics)
     - [Events](#events)
-    - [Kubelet Logs](#kubelet-logs)
-    - [Alerting Recommendations](#alerting-recommendations)
   - [Test Plan](#test-plan)
       - [Prerequisite testing updates](#prerequisite-testing-updates)
       - [Unit tests](#unit-tests)
@@ -50,8 +48,9 @@
 - [Implementation History](#implementation-history)
 - [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
-  - [Alternative 1: Track all pod hugepage usage](#alternative-1-track-all-pod-hugepage-usage)
+  - [Alternative 1: Track all pod hugepage usage per NUMA node](#alternative-1-track-all-pod-hugepage-usage-per-numa-node)
   - [Alternative 2: Scheduler-level hugepage awareness](#alternative-2-scheduler-level-hugepage-awareness)
+  - [Alternative 3: Standalone NUMA-aware hugepages admission handler](#alternative-3-standalone-numa-aware-hugepages-admission-handler)
 <!-- /toc -->
 
 ## Release Signoff Checklist
@@ -232,10 +231,15 @@ Job B can be rescheduled to another node with sufficient hugepages.
   3. Pod enters `CrashLoopBackOff` or `Error` state
   4. Scheduler may reschedule to another node (if applicable)
 
-  **Why this is still valuable**: Without verification, the failure window spans
-  from pod scheduling to container startup (seconds to minutes). With verification,
-  the window is reduced to milliseconds between sysfs read and container start.
-  The vast majority of failures are prevented.
+  **Why this is still valuable**: Beyond startup failures and timing, the core
+  issue is that without verification the kubelet/workload contract is breached.
+  The implicit contract is that once a pod is admitted, the requested resources
+  are available. Without this fix, that contract is violated for hugepages when
+  the Memory Manager's internal state diverges from OS reality (as demonstrated
+  in [issue #134395](https://github.com/kubernetes/kubernetes/issues/134395)).
+  With verification, the failure window shrinks from seconds or minutes to the
+  milliseconds between the sysfs read and container start, and the vast majority
+  of contract violations are prevented.
 
 - **Linux-only**: This feature is Linux-specific. The sysfs interface for hugepages
   (`/sys/devices/system/node/node<N>/hugepages/`) is a Linux kernel feature.
@@ -257,7 +261,6 @@ Job B can be rescheduled to another node with sufficient hugepages.
 | sysfs reads add latency to admission | Minimal impact: single file read per hugepage size per NUMA node; < 1ms typically |
 | False rejections due to transient consumption | Acceptable: better to reject than admit and fail at runtime; pod can be rescheduled |
 | Verification passes but container still fails (race) | Window is milliseconds vs seconds/minutes without verification; event emitted for debugging |
-| Fresh sysfs reads on every Allocate() | Lightweight operation; only triggered for pods requesting hugepages |
 
 ## Design Details
 
@@ -303,8 +306,11 @@ would be required.
 - Adds API surface to cadvisor
 - Longer timeline
 
-The choice between options should be made during KEP review based on
-maintainability preferences and timeline considerations.
+**Recommendation: Option A (Direct sysfs reading)**. The sysfs read is trivial
+(single file read per NUMA node per hugepage size), the Memory Manager already
+has precedent for reading memory topology from sysfs, and it avoids cross-repo
+dependencies on the critical admission path. Option B adds API surface to cadvisor
+for a very narrow use case that doesn't clearly fit cadvisor's caching model.
 
 ### sysfs Interface
 
@@ -338,16 +344,12 @@ func (p *staticPolicy) verifyOSHugepagesAvailability(
 ```
 
 The verification:
-- Only runs when the Static policy is enabled and feature gate is on
+- Only runs when the Memory Manager's Static policy is enabled
 - Only checks hugepage resources (not regular memory)
 - **Respects NUMA node selection**: Only checks the specific NUMA nodes that the
   Memory Manager's allocation algorithm has selected (see Topology Manager section)
-- Returns admission error if insufficient free hugepages
-
-**Error message format**:
-```
-insufficient hugepages-2Mi on NUMA node(s) [0]: requested 4Gi, available 2Gi
-```
+- Returns admission error if insufficient free hugepages, including the hugepage
+  size, NUMA node(s), requested amount, and available amount to aid debugging
 
 ### Integration with Topology Manager
 
@@ -402,27 +404,11 @@ This feature provides explicit signals for operators to monitor hugepage verific
 
 #### Events
 
-When a pod is rejected due to insufficient hugepages, a Kubernetes event is generated:
-
-```
-Type:    Warning
-Reason:  FailedHugepagesVerification
-Message: insufficient hugepages-2Mi on NUMA node(s) [0]: requested 4Gi, available 2Gi
-```
-
-#### Kubelet Logs
-
-At `--v=4` or higher, kubelet logs verification details:
-```
-I0127 10:15:32.123456 12345 policy_static.go:XXX] "Verifying OS hugepages availability" pod="default/dpdk-app" container="dpdk"
-I0127 10:15:32.123789 12345 policy_static.go:XXX] "Hugepages verification passed" pod="default/dpdk-app" numaNodes=[0] size="hugepages-2Mi" requested=1073741824 available=2147483648
-```
-
-#### Alerting Recommendations
-
-Operators should consider alerts for:
-- `rate(memory_manager_hugepages_verification_failures_total[5m]) > 0`: Pods being rejected
-- `histogram_quantile(0.99, memory_manager_hugepages_verification_latency_seconds) > 0.05`: High verification latency
+When a pod is rejected due to insufficient hugepages, a Kubernetes event is
+generated with reason `FailedHugepagesVerification` containing details about
+the hugepage size, NUMA node(s), and the discrepancy between requested and
+available amounts. Operators can use `kubectl get events` to identify affected
+pods and take corrective action.
 
 ### Test Plan
 
@@ -465,28 +451,29 @@ to implement this enhancement.
 - E2e tests demonstrating:
   - Pod admission succeeds when sufficient free hugepages exist
   - Pod admission fails when insufficient free hugepages exist
+- Metrics for verification checks and failures
 - Documentation for feature gate and behavior
 
 #### Beta
 
 - E2e tests demonstrating correct behavior
-- Metrics for verification failures
+- Conformance tests if applicable
 - Feedback incorporated from alpha users
 - No significant bugs reported
 
 #### GA
 
-- Feature enabled by default
-- Conformance tests if applicable
+- Feature always enabled (feature gate removed)
 - Documentation updated for stable feature
 
 ### Upgrade / Downgrade Strategy
 
-**Upgrade**: No special handling required. The feature is additive and controlled
-by a feature gate. Existing pods are unaffected.
+**Upgrade**: No special handling required. The feature is additive and only
+affects new pod admissions. Existing running pods are unaffected.
 
-**Downgrade**: Disabling the feature gate returns to previous behavior where
-OS hugepage availability is not verified. No data migration needed.
+**Downgrade**: Reverting to a kubelet version without this feature returns to
+previous behavior where OS hugepage availability is not verified. No data
+migration or persistent state cleanup is needed.
 
 **Kubelet restart behavior**: After kubelet restarts, Memory Manager rebuilds its
 internal state from checkpoint. Since verification reads fresh sysfs data on each
@@ -519,8 +506,10 @@ shows availability. This is the intended behavior to prevent runtime failures.
 
 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
 
-Yes. Disabling the feature gate and restarting kubelet returns to previous
-behavior. No persistent state is affected.
+During alpha/beta, the feature can be disabled by turning off the feature gate
+and restarting kubelet, which returns to previous behavior. No persistent state
+is affected. At GA, the feature gate will be removed and verification will
+be always enabled, as it strictly improves correctness of pod admission.
 
 ###### What happens if we reenable the feature if it was previously rolled back?
 
@@ -548,7 +537,7 @@ impact already running pods. Rollback simply stops verification on new admission
 
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
-TBD during alpha phase.
+Will be done during alpha phase.
 
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
@@ -558,7 +547,6 @@ No.
 
 ###### How can an operator determine if the feature is in use by workloads?
 
-- Feature gate `MemoryManagerHugepagesVerification` is enabled
 - Metric `memory_manager_hugepages_verification_total` is incrementing (indicates verification checks are being performed)
 - Pods with Guaranteed QoS requesting hugepages resources are being scheduled
 
@@ -593,10 +581,9 @@ No.
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
-Additional metrics that could be added in Beta:
-- `memory_manager_hugepages_discrepancy_bytes`: Gauge showing difference between
-  Memory Manager's internal tracking and OS-reported free hugepages (useful for
-  detecting drift)
+To be evaluated during alpha based on operational experience. Candidates include
+metrics that help operators identify the root cause of verification failures
+(e.g., which workloads are consuming untracked hugepages).
 
 ### Dependencies
 
@@ -678,15 +665,26 @@ No impact. The feature operates entirely within kubelet using local sysfs.
 
 ## Alternatives
 
-### Alternative 1: Track all pod hugepage usage
+### Alternative 1: Track all pod hugepage usage per NUMA node
 
-Extend Memory Manager to track hugepage usage by Burstable and BestEffort pods.
+Extend the Memory Manager and admission logic to listen to every pod admission
+and track which NUMA node hugepages are allocated from, regardless of QoS class.
 
 **Rejected because**:
-- Significant refactoring required
+- **Fundamental NUMA tracking limitation**: Without `cpuset.mems` enforcement
+  (which only applies to Guaranteed pods with the Static policy), there is no way
+  to know which NUMA node hugepages will be allocated from until the container
+  processes are actually running -- which is past the admission stage. The kernel
+  allocates hugepages based on the process's memory policy and NUMA node proximity
+  at fault time, not at cgroup configuration time.
 - Would not catch external (non-Kubernetes) hugepage consumers
+- Significant refactoring of Memory Manager required
 - Changes the scope and purpose of Memory Manager
 
+The proposed approach of checking actual free resources from sysfs before each
+allocation attempt is the best compromise in the current architecture, as it
+reflects ground truth regardless of which process or pod consumed the hugepages.
+
 ### Alternative 2: Scheduler-level hugepage awareness
 
 Add hugepage availability awareness to the Kubernetes scheduler.
@@ -695,3 +693,28 @@ Add hugepage availability awareness to the Kubernetes scheduler.
 - Much larger scope change
 - Scheduler operates on reported capacity, not real-time availability
 - Does not solve the admission-time verification problem
+
+### Alternative 3: Standalone NUMA-aware hugepages admission handler
+
+Instead of extending the Memory Manager, add a separate kubelet admission handler
+that verifies OS-reported hugepage availability for all pods regardless of QoS class.
+
+**Pros:**
+- Covers all QoS classes (Guaranteed, Burstable, BestEffort), not just Guaranteed
+- Cleaner separation of concerns: verification is decoupled from allocation/tracking
+- Same failure model (kubelet admission error) without coupling to Memory Manager
+- Could obtain NUMA affinity from existing topology hints without strong coupling
+
+**Cons:**
+- Needs to independently resolve NUMA topology and candidate node selection, which
+  the Memory Manager already computes during `Allocate()`
+- Additional admission handler adds coordination overhead with existing handlers
+- For Guaranteed pods, the Memory Manager's allocation algorithm already selects
+  candidate NUMA nodes -- a standalone handler would need to duplicate this
+  selection logic to know which NUMA nodes to check
+- Larger implementation scope for alpha
+
+**Decision**: Extend the Memory Manager for alpha since it already has the NUMA
+topology context and candidate node selection computed at the point where
+verification is needed. A standalone admission handler could be explored in future
+iterations to extend coverage to non-Guaranteed pods.
diff --git a/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml
index 1e7b2aa6dbff..163dfc5f735b 100644
--- a/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml
+++ b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml
@@ -28,8 +28,8 @@ latest-milestone: "v1.36"
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.36"
-  beta: "v1.37"
-  stable: "v1.38"
+  beta: "v1.38"
+  stable: "v1.40"
 
 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled

From a22c1b0beb87e17e5ba227d0cd14e775b537c1a0 Mon Sep 17 00:00:00 2001
From: Srikalyan Swayampakula <srikalyansswayam@gmail.com>
Date: Mon, 2 Mar 2026 10:01:24 -0800
Subject: [PATCH 09/10] KEP-5759: Retarget alpha milestone to v1.37
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Shift all milestones by one release cycle:
- alpha: v1.36 → v1.37
- beta: v1.38 → v1.39
- stable: v1.40 → v1.41
---
 .../5759-memory-manager-hugepages-verification/kep.yaml   | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml
index 163dfc5f735b..e3265180db82 100644
--- a/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml
+++ b/keps/sig-node/5759-memory-manager-hugepages-verification/kep.yaml
@@ -23,13 +23,13 @@ stage: alpha
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.36"
+latest-milestone: "v1.37"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
-  alpha: "v1.36"
-  beta: "v1.38"
-  stable: "v1.40"
+  alpha: "v1.37"
+  beta: "v1.39"
+  stable: "v1.41"
 
 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled

From dba799534b37b4a1248edefd59b4346c8134bac6 Mon Sep 17 00:00:00 2001
From: Srikalyan Swayampakula <srikalyansswayam@gmail.com>
Date: Mon, 2 Mar 2026 10:27:35 -0800
Subject: [PATCH 10/10] KEP-5759: Clarify verification approach and trim
 implementation details

- Clarify dual-source verification: min(internal_free, os_free) per NUMA
  node to handle both untracked Burstable pod consumption and not-yet-faulted
  Guaranteed pod allocations
- Remove specific error message formats from KEP to avoid creating
  implicit API contracts
- Add user-observable behavior note pointing to event reason and metrics
  as the stable interface for identifying verification failures
---
 .../README.md                                 | 39 +++++++++++++------
 1 file changed, 27 insertions(+), 12 deletions(-)

diff --git a/keps/sig-node/5759-memory-manager-hugepages-verification/README.md b/keps/sig-node/5759-memory-manager-hugepages-verification/README.md
index f6c072bda6f6..dd57e9b2e5ec 100644
--- a/keps/sig-node/5759-memory-manager-hugepages-verification/README.md
+++ b/keps/sig-node/5759-memory-manager-hugepages-verification/README.md
@@ -214,10 +214,9 @@ Job B is admitted, but when its container starts, only 4GB are actually free.
 Job B fails at runtime.
 
 **Desired behavior**: Memory Manager reads sysfs during admission and sees only
-4GB free. Job B is rejected with error:
-`insufficient hugepages-2Mi on NUMA node(s) [0,1]: requested 6Gi, available 4Gi`
-
-Job B can be rescheduled to another node with sufficient hugepages.
+4GB free. Job B is rejected at admission with an error indicating insufficient
+free hugepages on the relevant NUMA node(s), allowing it to be rescheduled to
+another node with sufficient hugepages.
 
 ### Notes/Constraints/Caveats
 
@@ -268,7 +267,18 @@ Job B can be rescheduled to another node with sufficient hugepages.
 
 The core enhancement is adding a `verifyOSHugepagesAvailability()` function to
 the Memory Manager's Static policy, called during `Allocate()`. This function
-reads fresh hugepage availability and rejects pods when insufficient.
+combines two sources to determine actual availability:
+
+1. **Memory Manager internal state**: Tracks hugepage allocations for Guaranteed
+   pods per NUMA node, including pages allocated but not yet faulted by processes.
+2. **OS-reported free hugepages** (sysfs `free_hugepages`): Reflects actual kernel
+   state, catching consumption by Burstable pods and other untracked sources.
+
+The effective number of available hugepages is `min(internal_free, os_free)` per NUMA node:
+- `internal_free` prevents double-counting pages committed to existing Guaranteed
+  pods that haven't been faulted yet (which sysfs still reports as "free")
+- `os_free` catches hugepage consumption that the Memory Manager doesn't track
+  (e.g., Burstable pods)
 
 ### Implementation Approaches
 
@@ -335,11 +345,12 @@ func (p *staticPolicy) verifyOSHugepagesAvailability(
     pod *v1.Pod,
     container *v1.Container,
 ) error {
-    // 1. Read free hugepages directly from sysfs for each NUMA node
-    // 2. For each hugepage size requested by the container:
-    //    a. Sum free hugepages across candidateNUMANodes only
-    //    b. Compare against the requested amount
-    // 3. Return error if insufficient, with detailed message
+    // For each hugepage size requested by the container:
+    // 1. Get Memory Manager's internal free count per candidate NUMA node
+    // 2. Read OS free hugepages from sysfs per candidate NUMA node
+    // 3. Effective available = min(internal_free, os_free) per NUMA node
+    // 4. Sum effective available across candidate NUMA nodes
+    // 5. Return error if sum < requested amount
 }
 ```
 
@@ -348,8 +359,12 @@ The verification:
 - Only checks hugepage resources (not regular memory)
 - **Respects NUMA node selection**: Only checks the specific NUMA nodes that the
   Memory Manager's allocation algorithm has selected (see Topology Manager section)
-- Returns admission error if insufficient free hugepages, including the hugepage
-  size, NUMA node(s), requested amount, and available amount to aid debugging
+- Returns an admission error if insufficient free hugepages are detected
+
+**User-observable behavior**: Operators can identify verification failures through
+the `FailedHugepagesVerification` event reason and the verification metrics
+described in the [Observability](#observability) section. The specific error
+message format is an implementation detail and may change between releases.
 
 ### Integration with Topology Manager