KEP-5759: Memory Manager Hugepages Availability Verification #5753

Open
srikalyan wants to merge 10 commits into kubernetes:master from srikalyan:kep-memory-manager-hugepages-verification

@srikalyan commented Dec 24, 2025

Summary

This KEP proposes enhancing the Memory Manager's Static policy to verify OS-reported free hugepages availability during pod admission.

Problem

The Memory Manager only tracks hugepage allocations for Guaranteed QoS pods. Burstable/BestEffort pods can consume hugepages (via hugetlbfs mounts or mmap with MAP_HUGETLB) without being tracked, causing subsequent Guaranteed pods to be admitted but fail at runtime when hugepages are exhausted.

Solution

  1. cadvisor: Add a FreePages field to HugePagesInfo, with a new GetCurrentHugepagesInfo() method for fresh sysfs reads (PR: "Add FreePages to HugePagesInfo for hugepage availability reporting", google/cadvisor#3804)
  2. Memory Manager: Verify OS-reported free hugepages during Allocate() in Static policy
  3. Admission: Reject pods when insufficient free hugepages are available
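
For illustration only (this is not code from the KEP or from cadvisor): a minimal Python sketch of what a fresh per-NUMA read of free hugepages might look like. It assumes the standard Linux sysfs layout `/sys/devices/system/node/node<N>/hugepages/hugepages-<size>kB/free_hugepages`. The function name `read_free_hugepages` and its `sysfs_root` parameter are hypothetical, and the kubelet itself is written in Go.

```python
import re
from pathlib import Path

# Directory name patterns used by the Linux per-node hugepage sysfs layout.
_NODE_RE = re.compile(r"^node(\d+)$")
_SIZE_RE = re.compile(r"^hugepages-(\d+)kB$")


def read_free_hugepages(sysfs_root="/sys/devices/system/node"):
    """Return {numa_node: {page_size_kb: free_pages}} from a fresh sysfs read.

    Hypothetical helper: nothing is cached, every call re-reads the files.
    """
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result
    for node_dir in sorted(root.iterdir()):
        node_match = _NODE_RE.match(node_dir.name)
        hp_dir = node_dir / "hugepages"
        if not node_match or not hp_dir.is_dir():
            continue  # skip entries such as "possible" or "online"
        sizes = {}
        for size_dir in hp_dir.iterdir():
            size_match = _SIZE_RE.match(size_dir.name)
            free_file = size_dir / "free_hugepages"
            if size_match and free_file.is_file():
                sizes[int(size_match.group(1))] = int(free_file.read_text().strip())
        result[int(node_match.group(1))] = sizes
    return result
```

Passing a different `sysfs_root` makes the helper easy to exercise against a temporary directory in tests.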

Related

KEP Metadata

  • SIG: sig-node
  • Stage: Alpha (target v1.36)
  • Feature Gate: MemoryManagerHugepagesVerification

/sig node
/kind kep

This KEP proposes enhancing the Memory Manager's Static policy to
verify OS-reported free hugepages availability during pod admission.

Problem:
The Memory Manager only tracks hugepage allocations for Guaranteed QoS
pods. Burstable/BestEffort pods can consume hugepages without being
tracked, causing subsequent Guaranteed pods to be admitted but fail
at runtime when hugepages are exhausted.

Solution:
- Add FreePages field to cadvisor's HugePagesInfo (PR google/cadvisor#3804)
- Verify OS-reported free hugepages during Allocate() in Static policy
- Reject pods when insufficient free hugepages are available

Related: kubernetes/kubernetes#134395
@k8s-ci-robot added the sig/node and kind/kep labels Dec 24, 2025
@k8s-ci-robot
Contributor

@srikalyan: The label(s) area/kubelet cannot be applied, because the repository doesn't have them.

In response to this:

Summary

This KEP proposes enhancing the Memory Manager's Static policy to verify OS-reported free hugepages availability during pod admission.

Problem

The Memory Manager only tracks hugepage allocations for Guaranteed QoS pods. Burstable/BestEffort pods can consume hugepages (via hugetlbfs mounts or mmap with MAP_HUGETLB) without being tracked, causing subsequent Guaranteed pods to be admitted but fail at runtime when hugepages are exhausted.

Solution

  1. cadvisor: Add FreePages field to HugePagesInfo (PR: "Add FreePages to HugePagesInfo for hugepage availability reporting", google/cadvisor#3804)
  2. Memory Manager: Verify OS-reported free hugepages during Allocate() in Static policy
  3. Admission: Reject pods when insufficient free hugepages are available

Related

KEP Metadata

  • SIG: sig-node
  • Stage: Alpha (target v1.33)
  • Feature Gate: MemoryManagerHugepagesVerification

/sig node
/kind kep
/area kubelet

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the cncf-cla: yes label Dec 24, 2025
@k8s-ci-robot
Contributor

Welcome @srikalyan!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @srikalyan. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

@k8s-ci-robot added the needs-ok-to-test and size/XL labels Dec 24, 2025
@srikalyan
Author

/remove-area kubelet

@k8s-ci-robot
Contributor

@srikalyan: Those labels are not set on the issue: area/kubelet

@ffromani
Contributor

/cc

@ffromani
Contributor

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Dec 26, 2025
Contributor

@ffromani ffromani left a comment

Thanks for your contribution! I'm in favor of improving the accounting and making the memory manager/kubelet more predictable. I think we can benefit from some clarifications before diving deeper into the details.

Key changes:
- Update milestones to v1.36/v1.37/v1.38
- Clarify sysfs reading: add GetCurrentHugepagesInfo() for fresh reads
  (GetMachineInfo() is cached at startup, would be stale)
- Add Integration with Topology Manager section with policy behavior table
- Add Interaction with CPU Manager section
- Address reserved hugepages (free_hugepages is correct metric)
- Expand race condition discussion with failure handling details
- Rewrite Story 2 as "Rapid Pod Churn" with clear timeline
- Add "Static policy only" note (None policy not applicable)
- Specify error message format with example
- Add kubelet restart behavior note
- Update Risks table with new mitigations
- Fix unit test description (removed nil reference)
- Update TOC with new sections
- Link enhancement issue kubernetes#5759

Related: kubernetes#5759
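
As a rough illustration of the staleness concern in the commit message above (a value cached at startup versus a fresh read), here is a small Python sketch. The class and method names are hypothetical and do not correspond to the cadvisor or kubelet APIs; `read_free_pages` stands in for whatever function actually reads sysfs.

```python
class HugepageInfoSource:
    """Hypothetical source that offers both a startup snapshot and fresh reads."""

    def __init__(self, read_free_pages):
        self._read_free_pages = read_free_pages
        # Taken once at construction, analogous to data cached at startup.
        self._startup_snapshot = read_free_pages()

    def cached_free_pages(self):
        # Never changes after startup, so it can go stale.
        return self._startup_snapshot

    def current_free_pages(self):
        # Re-invokes the reader on every call, so it reflects the current state.
        return self._read_free_pages()
```

If hugepages are consumed after startup, `cached_free_pages()` keeps returning the old count while `current_free_pages()` returns the new one, which is why an admission-time check would want the latter.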
@srikalyan force-pushed the kep-memory-manager-hugepages-verification branch from fed79ac to 9a89040 on December 27, 2025 17:41
@ffromani
Contributor

/retitle KEP-5759: Memory Manager Hugepages Availability Verification

@k8s-ci-robot changed the title from "KEP: Memory Manager Hugepages Availability Verification" to "KEP-5759: Memory Manager Hugepages Availability Verification" Dec 27, 2025
- Add two implementation approaches: Option A (direct sysfs) and Option B (cadvisor)
- Present pros/cons for each option neutrally for KEP review
- Remove cadvisor-specific sections, replace with options discussion
- Add Observability section with metrics, events, logs, alerting
- Update TOC to pass CI verification
- Update KEP number to 5759 throughout

The choice between implementation approaches is left to KEP reviewers
based on maintainability preferences and timeline considerations.
@srikalyan force-pushed the kep-memory-manager-hugepages-verification branch from c40cb0b to 8e6ae09 on December 27, 2025 20:15
@ffromani
Contributor

Thanks @srikalyan for leading this effort. I'm generally supportive of this memory manager enhancement and, pending further review and elaboration, I do see the benefit of the proposed approach of checking free hugepages. Because there's some time left before the 1.36 cycle begins, I'd like to explore other options to solve this problem before we commit to the proposed direction. I'll have another review iteration ASAP.

@srikalyan
Author

srikalyan commented Jan 6, 2026

@ffromani Happy New Year to you. Could I request another review?

Contributor

@ffromani ffromani left a comment

thanks for the updates. The next step is to bring this up on the larger sig-node and in the 1.36 SIG planning. I think this work would be well accepted by the SIG, but let's make sure.

Comment on lines +136 to +138
**Desired behavior**: The Guaranteed pod admission fails immediately with a clear
error indicating insufficient free hugepages, allowing the scheduler to try
another node or the administrator to take corrective action.
Contributor

The alternative I'm thinking about is to extend the memory manager and admission logic to listen to each and every pod admission to track where (= which NUMA node) the hugepages are allocated from. However, if the kubelet doesn't enforce a cpuset.mems restriction, there's no way to know where the hugepages will be taken from until the container processes are running, which is past the admission stage. Therefore, the proposed approach of checking the actual free resources before each allocation attempt seems to be the best compromise (if not the only possible approach) in the current architecture.
We should probably document this in the "discarded alternatives" section.

@srikalyan
Author

> thanks for the updates. The next step is to bring this up on the larger sig-node and in the 1.36 SIG planning. I think this work would be well accepted by the SIG, but let's make sure.

How do you recommend I approach this?

@k8s-ci-robot added the do-not-merge/invalid-commit-message label Jan 22, 2026
- Add ffromani, derekwaynecarr, mrunalp as reviewers
- Add dchen1107 as approver (sig-node OWNERS)
@srikalyan force-pushed the kep-memory-manager-hugepages-verification branch from 2be55a9 to 36099e3 on January 24, 2026 22:43
@k8s-ci-robot removed the do-not-merge/invalid-commit-message label Jan 24, 2026
@wendy-ha18
Member

> thanks for the updates. The next step is to bring this up on the larger sig-node and in the 1.36 SIG planning. I think this work would be well accepted by the SIG, but let's make sure.

> How do you recommend I approach this?

Hi @srikalyan, SIG Node meets weekly on Tuesdays at 10:00 PT (Pacific Time), so you can attend this week's meeting to discuss this further with the SIG Node tech leads and chairs. The Zoom link and details are available here: https://github.com/kubernetes/community/tree/master/sig-node.

This KEP already has the /lead-opted-in and /milestone v1.36 labels from SIG Node, so I think the first deadline to target is the Production Readiness Freeze: 4th February 2026 (AoE) / Thursday 5th February 2026, 12:00 UTC.

@srikalyan
Author

srikalyan commented Jan 26, 2026 via email

- Add haircommander (Peter Hunt) as KEP approver
- Add PRR approval file for alpha stage with johnbelamaric as approver
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: srikalyan
Once this PR has been reviewed and has the lgtm label, please assign dchen1107, wojtek-t for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@haircommander
Contributor

FWIW: I think the need is clear and the code is pretty narrowly scoped. I am +1 on this, but we may not have TL bandwidth to get it done now

Member

@johnbelamaric johnbelamaric left a comment

One small comment; otherwise PRR looks good.

@ffromani I guess this is something that will not be an issue if we use DRA for huge pages, assuming we say create each NUMA node as a device with a consumable capacity? We still need to fix this of course.

@ffromani
Contributor

> @ffromani I guess this is something that will not be an issue if we use DRA for huge pages, assuming we say create each NUMA node as a device with a consumable capacity?

Yes, I think this is correct. We'd need to agree on the right attributes to expose, which will be an interesting discussion on its own, but the DRA model should prevent this issue completely.

Elaborating a bit on the DRA side (unrelated to this PR), the initial proposal is to use the NUMA node ID as a proxy to identify the memory controller and to be able to bind it to a group of CPUs.

@johnbelamaric
Member

FYI, for PRR just awaiting SIG approval, I have one nit above but I consider it non-blocking. kep.yaml update does need to happen too though.

@srikalyan
Author

srikalyan commented Feb 11, 2026 via email

Co-authored-by: Wendy Ha <139814343+wendy-ha18@users.noreply.github.com>
@srikalyan
Author

Thank you all. Will address the feedback soon.

Thank you, everyone, for the feedback; I have addressed all of it. Let me know if you have any questions.

- Move metrics from Beta to Alpha graduation criteria per ffromani's
  request to have observability available at alpha stage
- Change "TBD during alpha phase" to "Will be done during alpha phase"
  per johnbelamaric's nit on the upgrade/rollback testing question
- Add Alternative 3: Standalone NUMA-aware hugepages admission handler
  with pros/cons analysis per ffromani's suggestion
- Expand Alternative 1 with NUMA tracking limitation: without
  cpuset.mems enforcement, NUMA node allocation is unknown until
  container runtime, making per-pod tracking infeasible at admission
- Reframe race condition caveat to emphasize kubelet/workload contract
  breach rather than just startup failure timing
- Relax milestone timeline: beta v1.38, stable v1.40
- Remove sysfs availability from risk table (sysfs is a kubelet precondition)
- Recommend Option A (direct sysfs reading) with rationale
- Remove feature gate as safety mechanism framing throughout
- Remove hardcoded error message format (not a public API)
- Remove specific log format and alerting recommendation sections
- Simplify Events section to describe behavior without locking format
- Move conformance tests from GA to Beta criteria
- Update GA to "feature always enabled (feature gate removed)"
- Reword Upgrade/Downgrade without feature gate dependency
- Update rollback answer to reflect always-enabled at GA
- Replace speculative discrepancy metric with alpha evaluation plan
@srikalyan force-pushed the kep-memory-manager-hugepages-verification branch from d0147fb to f80d3ac on February 17, 2026 06:56
Shift all milestones by one release cycle:
- alpha: v1.36 → v1.37
- beta: v1.38 → v1.39
- stable: v1.40 → v1.41
- Clarify dual-source verification: min(internal_free, os_free) per NUMA
  node to handle both untracked Burstable pod consumption and not-yet-faulted
  Guaranteed pod allocations
- Remove specific error message formats from KEP to avoid creating
  implicit API contracts
- Add user-observable behavior note pointing to event reason and metrics
  as the stable interface for identifying verification failures
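
The dual-source rule in the commit message above can be sketched as follows. This is a hedged Python illustration, not the kubelet's Go implementation: the function names and the dict-based inputs are hypothetical, and summing the effective free pages across the candidate NUMA nodes is an assumption about how a multi-node request might be evaluated. Taking the minimum covers both cases named above: the OS count is lower when untracked Burstable pods have consumed pages, and the internal count is lower when admitted Guaranteed pods have not yet faulted in their pages.

```python
def effective_free(internal_free, os_free):
    """Per-NUMA conservative free count: min(internal_free, os_free).

    Both arguments map NUMA node ID -> free hugepages of one page size.
    A node missing from os_free is treated as having no free pages.
    """
    return {
        node: min(internal_free[node], os_free.get(node, 0))
        for node in internal_free
    }


def can_admit(requested_pages, numa_nodes, internal_free, os_free):
    """Return True if the candidate NUMA nodes hold enough free hugepages.

    Assumption: a request spanning several nodes may draw from all of them.
    """
    free = effective_free(internal_free, os_free)
    return sum(free.get(node, 0) for node in numa_nodes) >= requested_pages
```

For example, with `internal_free = {0: 8, 1: 4}` and `os_free = {0: 3, 1: 10}`, node 0 is limited by the OS view (3 pages) and node 1 by the internal view (4 pages).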
@srikalyan
Author

@ffromani, I need your help to get this landed in the next release at least. Could you please take another look at the review?

Labels

  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • kind/kep (Categorizes KEP tracking issues and PRs modifying the KEP directory)
  • ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
  • sig/node (Categorizes an issue or PR as relevant to SIG Node.)
  • size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.)

8 participants