metis: add glibc floor qualification test target to Makefile#1036

Open
arvindbr8 wants to merge 8 commits into kubernetes:master from arvindbr8:presubmit-guard-rail
@arvindbr8
Contributor

@arvindbr8 arvindbr8 commented Apr 3, 2026

Enforce a qualification check for the metis CNI binary to ensure it remains compatible with the GKE fleet's glibc 2.35 floor (Ubuntu 22.04 / COS Milestone 117).

Context: Why glibc 2.35?

Because the Metis CNI is executed natively on the host OS by the Kubernetes Kubelet (rather than inside a container namespace), it is strictly bound by the host's C standard library.

Our oldest supported GKE node pools currently run Ubuntu 22.04 LTS and COS Milestone 117. Ubuntu 22.04 provides glibc 2.35 and COS Milestone 117 provides a newer release, which makes 2.35 the lowest common denominator across our fleet. If the CGO binary links against glibc symbols newer than 2.35, the dynamic loader will refuse to start it with a "version `GLIBC_X.Y' not found" error when it is scheduled on these nodes. See the Container-Optimized OS Release Notes and GKE Release Notes for the milestone baselines (Ubuntu 22.04 / COS Milestone 117).

Fleet floor verification

To confirm that glibc 2.35 is the correct floor, we provisioned an ephemeral GKE cluster (1.30.14-gke.2250000) with two node pools reflecting our oldest supported fleet OS images:

  1. COS_CONTAINERD (COS Milestone 117)
  2. UBUNTU_CONTAINERD (Ubuntu 22.04 LTS)

Using debug pods, I queried the host OS's C standard library on each pool. The results show that 2.35 is our hard floor, set by the Ubuntu nodes:

1. Ubuntu 22.04 Node Pool (UBUNTU_CONTAINERD):

$ kubectl debug node/gke-glibc-test-clust-ubuntu-verificat-36dc13b9-vw47 -it --image=ubuntu --profile=sysadmin
root@gke-glibc-test-clust-ubuntu-verificat-36dc13b9-vw47:/# chroot /host /usr/bin/ldd --version | head -n 1

ldd (Ubuntu GLIBC 2.35-0ubuntu3.13) 2.35  <-- The Fleet Floor

2. COS Node Pool (COS_CONTAINERD):

$ kubectl debug node/gke-glibc-test-clust-cos-verification-82afa5a0-sfv3 -it --image=ubuntu --profile=sysadmin
root@gke-glibc-test-clust-cos-verification-82afa5a0-sfv3:/# chroot /host /lib64/libc.so.6 | head -n 1

GNU C Library (Gentoo 2.37-r15 p12) stable release version 2.37.
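The floor is simply the lowest version reported across the node pools. A minimal sketch of that comparison, assuming GNU grep and GNU sort (`sort -V`); the function names `glibc_version_from_banner` and `fleet_floor` are hypothetical helpers, not part of this PR:

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical and not part of the PR.
# Assumes GNU grep (-o, -E) and GNU sort (-V) are available.

# Print the trailing "X.Y" version from a banner line, such as the first line
# of `ldd --version` or of running libc.so.6 directly.
glibc_version_from_banner() {
  printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+' | tail -n 1
}

# Print the lowest glibc version among all banners passed as arguments.
fleet_floor() {
  for banner in "$@"; do
    glibc_version_from_banner "$banner"
  done | sort -V | head -n 1
}

# The two banners captured from the debug pods above.
floor=$(fleet_floor \
  'ldd (Ubuntu GLIBC 2.35-0ubuntu3.13) 2.35' \
  'GNU C Library (Gentoo 2.37-r15 p12) stable release version 2.37.')
echo "fleet floor: glibc $floor"
```

Version sort is used instead of numeric sort so that a release such as 2.9 orders before 2.35.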

Changes

Component: metis/Makefile

  • [NEW] test-glibc-floor target: Builds image, extracts binary, and runs --help natively inside a vanilla ubuntu:22.04 container to guarantee runtime compatibility regardless of host OS.

Component: GitHub Actions

  • [NEW] .github/workflows/metis-glibc-floor-test.yml: Pre-submit guardrail that runs the extraction test on an OS representing the fleet floor (ubuntu-22.04).

Note

The GitHub Actions workflow file (metis-glibc-floor-test.yml) was removed from this PR. The test will instead run as a Prow presubmit job (to be submitted to kubernetes/test-infra); see kubernetes/test-infra#36769.


Verification Results

1. Symbol Analysis Proof

readelf -V analysis of the binary built on standard golang:1.25.8 (Bookworm) confirms the highest required version is GLIBC_2.34 (safe for 2.35):

Version needs section '.gnu.version_r' contains 1 entry:
  Name: GLIBC_2.34  Flags: none  Version: 2
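The same check can be scripted rather than read by eye. A minimal sketch, assuming GNU grep, sed, and `sort -V`; the helper name `max_glibc_needed` is hypothetical, and the sample text below is illustrative input shaped like `readelf -V` output, not a capture from the metis binary:

```shell
#!/bin/sh
# Sketch only: max_glibc_needed is a hypothetical helper, not part of the PR.

# Read text on stdin (for example the output of `readelf -V <binary>`) and
# print the highest GLIBC_x.y symbol version it mentions.
max_glibc_needed() {
  grep -oE 'GLIBC_[0-9]+(\.[0-9]+)+' | sed 's/^GLIBC_//' | sort -V | tail -n 1
}

# Illustrative input only; a real binary usually lists several versions.
sample='  Name: GLIBC_2.2.5  Flags: none  Version: 3
  Name: GLIBC_2.34  Flags: none  Version: 2'

needed=$(printf '%s\n' "$sample" | max_glibc_needed)
echo "highest required symbol version: GLIBC_$needed"
```

Against a real build this would be invoked as `readelf -V bin/metis-candidate | max_glibc_needed`, and the result compared with the 2.35 floor.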

2. Local Extraction Test

Running make test-glibc-floor succeeded without linkage errors on ubuntu:22.04:

Successfully copied 14.5MB to bin/metis-candidate
docker run --rm -v bin/metis-candidate:/metis ubuntu:22.04 /metis --help
Usage of /metis:
  -alsologtostderr
  ...
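For readers without the Makefile diff at hand, the steps the target is described as performing (build the image, extract the binary, run it in the floor image) might look roughly like the sketch below. The image tag `metis:candidate`, the in-image path `/metis`, and the function name are assumptions for illustration, not taken from the PR; the sketch also assumes the image defines a default command so that `docker create` succeeds, and that the binary's `--help` exits with status 0.

```shell
#!/bin/sh
# Sketch only: names marked "hypothetical" are not taken from the PR.
DOCKER="${DOCKER:-docker}"                 # set DOCKER=echo for a dry run
IMAGE="metis:candidate"                    # hypothetical image tag
BIN_IN_IMAGE="/metis"                      # hypothetical path inside the image
FLOOR_IMAGE="ubuntu:22.04"                 # fleet-floor image named in the PR
OUT="${OUT:-$PWD/bin/metis-candidate}"     # absolute path, needed for the bind mount

test_glibc_floor() {
  mkdir -p "$(dirname "$OUT")" &&
  # 1. Build the candidate image from the current directory.
  $DOCKER build -t "$IMAGE" . &&
  # 2. Create (but do not start) a container and copy the binary out of it.
  cid=$($DOCKER create "$IMAGE") &&
  $DOCKER cp "$cid:$BIN_IN_IMAGE" "$OUT" &&
  $DOCKER rm "$cid" >/dev/null &&
  # 3. Run the binary against the floor image's glibc; a loader error such as
  #    "version `GLIBC_X.Y' not found" makes this step fail.
  $DOCKER run --rm -v "$OUT:/metis:ro" "$FLOOR_IMAGE" /metis --help
}
```

Calling `test_glibc_floor` from the directory containing the Dockerfile runs the three steps in order and stops at the first failure; with `DOCKER=echo` it only prints the commands it would run.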

The GitHub Actions workflow will be run as part of this PR's presubmit checks (I think).

Adds a GitHub Actions workflow to qualify glibc floor compatibility on ubuntu-22.04 runners for the metis CNI.

Adds a new test-glibc-floor make target to run the verification locally inside a container, ensuring safety for the glibc 2.35 floor.
@k8s-ci-robot added the needs-triage label Apr 3, 2026
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If the repository maintainers determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-ok-to-test label Apr 3, 2026
@k8s-ci-robot
Contributor

Hi @arvindbr8. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

@k8s-ci-robot added the size/M and cncf-cla: yes labels Apr 3, 2026
@arvindbr8
Contributor Author

PTAL: @YifeiZhuang @gnossen

@YifeiZhuang
Contributor

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Apr 3, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: arvindbr8, gnossen
Once this PR has been reviewed and has the lgtm label, please assign aojea for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@arvindbr8 changed the title from "metis: add glibc floor qualification test and makefile target" to "metis: add glibc floor qualification test target to Makefile" Apr 6, 2026
Contributor

@YifeiZhuang YifeiZhuang left a comment

Thanks for adding the sanity checks for the glibc version skew issue! I don't see why we can't use both GitHub Actions and Prow, but it is easier to maintain if we stay consistent with the Prow jobs for this repo: https://github.com/kubernetes/test-infra/tree/master/config/jobs/kubernetes/cloud-provider-gcp

#
# WARNING: Do not link this binary against newer GLIBC symbols. Doing so
# will cause immediate runtime panics when scheduled on baseline fleet nodes.
GLIBC_FLOOR_IMAGE := ubuntu:22.04
Contributor

Since we already use 22.04 in the Dockerfile, doesn't this become a redundant check, given that we are developing an e2e test that uses the metis image?

Separate but related: internally, in the GKE release process, we should probably pin the base image to a specific SHA, per the recommendation on approved base images.
https://g3doc.corp.google.com/cloud/kubernetes/g3doc/subgroup/security/ssci/guidance/container_base_image.md?cl=head#oss-distroless (Interesting, it also explains why not scratch/alpine.)

The currently approved image is debian12, which uses glibc 2.36, higher than the oldest version you found on GKE 1.30. That is fine, though, because the feature will be version-trait guarded to a future version, and nodes cannot be two minor versions lower.

Contributor Author

Great points, Ivy.

Regarding the redundancy: The test actually catches a different failure mode. Because we compile with CGO, the glibc version requirement is baked into the binary during the builder stage, not the runtime stage. If someone accidentally upgrades the builder image to a newer OS, the compilation still succeeds, but the resulting binary will fail to load when the Kubelet executes it natively on an older host. The extraction test runs the built binary against the floor OS's glibc to catch that specific builder regression.

Regarding the internal debian12 base image policy:
Because glibc is strictly backwards compatible, I think sticking with the 2.35 floor (compiling via Debian Bullseye) is the safest, most conservative approach for right now. A binary compiled against 2.31 will run flawlessly on both our oldest Ubuntu 22.04 nodes (2.35) and the newer approved debian12 nodes (2.36).
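The compatibility rule being relied on here is that a binary loads when the highest glibc version it requires is no newer than the host's. A simplified sketch of that rule, assuming GNU `sort -V`; the helper name `runs_on` is hypothetical, and real symbol versioning has more nuance than a single version comparison:

```shell
#!/bin/sh
# Sketch only: runs_on is a hypothetical helper and a simplified model.

# Succeeds when a binary requiring glibc $1 can load on a host providing glibc $2,
# i.e. when the required version is less than or equal to the host version.
runs_on() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Required versions: 2.31 (Bullseye build), 2.34 (the readelf result in this PR),
# and 2.36 (a hypothetical build that picks up newer symbols).
# Hosts: 2.35 (Ubuntu 22.04) and 2.36 (debian12).
for required in 2.31 2.34 2.36; do
  for host in 2.35 2.36; do
    if runs_on "$required" "$host"; then
      verdict="loads"
    else
      verdict="fails to load"
    fi
    echo "needs glibc $required, host has $host: $verdict"
  done
done
```

Only the 2.36 requirement on the 2.35 host is rejected, which is the builder regression the floor test is meant to catch.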

By holding the floor at 2.35 today, we maintain an absolute, fleet-wide safety net just in case of unexpected rollouts or backports. Since it's fully forwards-safe for newer nodes, we can easily upgrade the builder image and this CI test to debian12 next year once the GKE 1.30/Ubuntu 22.04 nodes are fully deprecated and physically out of the fleet.

Does that sound reasonable?

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

4 participants