
feat(hetzner): generate HCLOUD_CLUSTER_CONFIG for cluster-autoscaler addon#18137

Draft
bjornharrtell wants to merge 2 commits into kubernetes:master from
bjornharrtell:hetzner-cluster-autoscaler-config

Conversation

@bjornharrtell
Contributor

Summary

Partial implementation of #18136. This draft establishes the overall approach for generating HCLOUD_CLUSTER_CONFIG and wiring it through the cluster-autoscaler addon for Hetzner.

Builds on #18135 (HCLOUD_TOKEN and --nodes format fixes).

Requires kubernetes/autoscaler#9430 to be merged for the labels to be applied to autoscaler-created servers.

What this PR does

1. HetznerClusterAutoscalerConfig() template function

A new template function in template_functions.go that generates the base64-encoded JSON blob for HCLOUD_CLUSTER_CONFIG. For each autoscalable node instance group it produces a NodeConfig entry containing:

  • labels: The full set of Hetzner server labels computed by CloudTagsForInstanceGroup(), the same labels that kops stamps on servers it creates directly. With autoscaler PR kubernetes/autoscaler#9430 (feat(hetzner): add serverLabels field to nodeConfig for Hetzner server labels), these labels are applied to autoscaler-created servers at creation time, so kops cloud instance group reconciliation correctly counts them.
  • imagesForArch: The Hetzner image name from ig.Spec.Image.
  • cloudInit: Intentionally empty (see below).
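
As a rough illustration of the encoding step described above, the following sketch marshals a per-instance-group config to JSON and base64-encodes it. The struct types, the `nodeConfigs` key, the label key, and the image name are assumptions made for this example; the real types live in the autoscaler's hetzner_manager.go, and the exact JSON key for labels may differ (the autoscaler PR title mentions `serverLabels`).

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// imageList, nodeConfig and clusterConfig are hypothetical local mirrors of the
// shapes described in this PR, not the autoscaler's actual types.
type imageList struct {
	Arm64 string `json:"arm64,omitempty"`
	Amd64 string `json:"amd64,omitempty"`
}

type nodeConfig struct {
	CloudInit     string            `json:"cloudInit"` // left empty in this draft
	Labels        map[string]string `json:"labels"`
	ImagesForArch imageList         `json:"imagesForArch"`
}

type clusterConfig struct {
	NodeConfigs map[string]nodeConfig `json:"nodeConfigs"` // keyed by instance group name
}

// encodeClusterConfig returns the base64-encoded JSON form of cfg.
func encodeClusterConfig(cfg clusterConfig) (string, error) {
	raw, err := json.Marshal(cfg)
	if err != nil {
		return "", fmt.Errorf("marshal cluster config: %w", err)
	}
	return base64.StdEncoding.EncodeToString(raw), nil
}

// decodeClusterConfig reverses encodeClusterConfig; handy for inspecting the secret.
func decodeClusterConfig(encoded string) (clusterConfig, error) {
	var cfg clusterConfig
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return cfg, fmt.Errorf("decode base64: %w", err)
	}
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return cfg, fmt.Errorf("unmarshal cluster config: %w", err)
	}
	return cfg, nil
}

func main() {
	cfg := clusterConfig{NodeConfigs: map[string]nodeConfig{
		"nodes-hel1": {
			// Placeholder label and image; kops would supply the real values.
			Labels:        map[string]string{"example.invalid/instance-group": "nodes-hel1"},
			ImagesForArch: imageList{Amd64: "ubuntu-24.04"},
		},
	}}
	encoded, err := encodeClusterConfig(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println(encoded)
}
```

Decoding the value stored in the secret with `decodeClusterConfig` is a quick way to check that the labels and image name round-trip as expected.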

2. hcloud-autoscaler-config Secret

A new Secret resource added to the cluster-autoscaler addon template (Hetzner only), populated with the output of the template function above.

3. HCLOUD_CLUSTER_CONFIG env var

Added to the autoscaler container's env block, sourced from the new secret, alongside the existing HCLOUD_TOKEN and HCLOUD_NETWORK.

What is still missing — cloud-init generation

The NodeConfig.CloudInit field is left empty in this implementation. This means autoscaler-created nodes will have the correct Hetzner labels but will not bootstrap into the cluster. Completing the implementation requires generating the nodeup bootstrap shell script for each node IG.

This is blocked by the fact that addon-template rendering happens before task execution: the CA keypairs, nodeup binary asset URLs, and NodeupConfigHash are not yet available when TemplateFunctions methods are called.

Two paths to resolve this:

Option A — thread through TemplateFunctions (simpler, focused change):
Pass fi.KeystoreReader, NodeUpAssets map[architectures.Architecture]*assets.MirroredAsset, and NodeUpConfigBuilder into TemplateFunctions at construction time in apply_cluster.go. The NodeupConfigHash can be computed by running the config builder for each IG before tasks are scheduled.

Option B — dedicated post-build task (more correct architecture):
Introduce a HetznerClusterAutoscalerConfigSecret task in hetznertasks/ that declares dependencies on each node IG's BootstrapScript task. After those tasks run, it reads the rendered cloud-init via fi.ResourceAsString(), assembles the ClusterConfig JSON, and creates or updates the hcloud-autoscaler-config Secret via the Kubernetes API. The addon template would still reference the secret; the task guarantees it is populated before the autoscaler deployment starts.
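
Whichever option is chosen, the last step is the same: once a bootstrap script exists for each node IG, it has to be merged into the matching NodeConfig entry. The sketch below shows only that merge, with a hypothetical `nodeConfig` type and plain maps standing in for the real kops tasks and resources; it fails loudly if any IG has no rendered script, so that an incomplete config is never published.

```go
package main

import (
	"fmt"
	"sort"
)

// nodeConfig is a hypothetical stand-in for the autoscaler's NodeConfig.
type nodeConfig struct {
	CloudInit string
	Labels    map[string]string
}

// withCloudInit returns a new map in which every entry's CloudInit is filled
// from scripts (keyed by instance group name). Entries are shallow copies, so
// Labels maps are shared with the input. It returns an error naming every
// instance group that has no non-empty script.
func withCloudInit(configs map[string]nodeConfig, scripts map[string]string) (map[string]nodeConfig, error) {
	out := make(map[string]nodeConfig, len(configs))
	var missing []string
	for name, nc := range configs {
		script, ok := scripts[name]
		if !ok || script == "" {
			missing = append(missing, name)
			continue
		}
		nc.CloudInit = script
		out[name] = nc
	}
	if len(missing) > 0 {
		sort.Strings(missing) // deterministic error message
		return nil, fmt.Errorf("no bootstrap script rendered for instance groups: %v", missing)
	}
	return out, nil
}

func main() {
	configs := map[string]nodeConfig{
		"nodes-hel1": {Labels: map[string]string{"example.invalid/instance-group": "nodes-hel1"}},
	}
	// Placeholder script; in kops this would be the rendered nodeup bootstrap script.
	scripts := map[string]string{"nodes-hel1": "#!/bin/bash\necho bootstrap\n"}

	filled, err := withCloudInit(configs, scripts)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(filled["nodes-hel1"].CloudInit) > 0)
}
```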

Feedback on which option to pursue would be welcome before completing this draft.

Testing

Partially verified against a live kops 1.35 Hetzner cluster: kops update cluster renders the addon template without error and produces the hcloud-autoscaler-config secret with the expected JSON (correct labels, image name). Full end-to-end test (autoscaler creating nodes that join the cluster) is blocked on completing cloud-init generation.

Two fixes to make the kops-managed cluster-autoscaler addon work
correctly on Hetzner:

1. Pass HCLOUD_TOKEN and HCLOUD_NETWORK env vars to the autoscaler
   pod. The addon template only had an env block for AWS (AWS_REGION);
   without the Hetzner token the autoscaler cannot authenticate and
   fails immediately on startup. The vars are sourced from the existing
   'hcloud' secret in kube-system, which is already created by the
   CCM addon.

2. Fix the --nodes flag format. GetClusterAutoscalerNodeGroups() was
   producing the generic '<name>.<cluster>' suffix for all non-GCE
   providers, giving a 3-field format (min:max:name.cluster) that the
   Hetzner cloud provider does not recognise. Hetzner requires 5
   fields: min:max:instanceType:region:name. The region argument is
   the Hetzner location name, which equals the subnet name stored in
   ig.Spec.Subnets[0] (e.g. 'hel1').
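
The five-field format described in item 2 can be sketched as a small helper. The function name and the example instance type are hypothetical; only the field order min:max:instanceType:region:name is taken from the description above.

```go
package main

import "fmt"

// hetznerNodesFlag builds the value of one --nodes flag in the five-field
// form min:max:instanceType:region:name. The function name is made up for
// this sketch and is not the kops implementation.
func hetznerNodesFlag(minSize, maxSize int32, instanceType, location, name string) string {
	return fmt.Sprintf("%d:%d:%s:%s:%s", minSize, maxSize, instanceType, location, name)
}

func main() {
	// "cx22" is a placeholder instance type; "hel1" is the location from the text above.
	fmt.Println("--nodes=" + hetznerNodesFlag(1, 5, "cx22", "hel1", "nodes-hel1"))
	// prints "--nodes=1:5:cx22:hel1:nodes-hel1"
}
```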
…addon

Add a HetznerClusterAutoscalerConfig template function that builds the
HCLOUD_CLUSTER_CONFIG JSON blob expected by the Hetzner cluster-autoscaler
cloud provider (ClusterConfig struct in hetzner_manager.go).

The config encodes per-node-group entries (NodeConfig) containing the same
Hetzner server labels that kops applies to servers it provisions directly.
With autoscaler PR kubernetes/autoscaler#9430 in place, these labels are
stamped onto autoscaler-created servers at creation time, so kops cloud
instance group reconciliation correctly counts them.

A new hcloud-autoscaler-config Secret is added to the cluster-autoscaler
addon manifest (Hetzner only). HCLOUD_CLUSTER_CONFIG is wired into the
autoscaler deployment from this secret alongside the existing HCLOUD_TOKEN
and HCLOUD_NETWORK vars.

The NodeConfig.CloudInit field is intentionally left empty in this draft:
generating the nodeup bootstrap script requires CA keypairs and node-up
binary asset URLs that are not yet accessible at addon-template render time.
This means autoscaler-created nodes will have the correct labels but will
not bootstrap correctly until cloud-init generation is completed. The
follow-up requires either threading the keystore and NodeUpAssets through
TemplateFunctions or implementing a dedicated post-build task.
@k8s-ci-robot added the do-not-merge/work-in-progress label Mar 30, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johngmyers for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot requested a review from johngmyers March 30, 2026 16:35
@k8s-ci-robot added the cncf-cla: yes label Mar 30, 2026
@k8s-ci-robot requested a review from olemarkus March 30, 2026 16:35
@k8s-ci-robot added the area/addons, size/L, and needs-ok-to-test labels Mar 30, 2026
@k8s-ci-robot
Contributor

Hi @bjornharrtell. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@bjornharrtell
Contributor Author

This is in draft because it depends on both kubernetes/autoscaler#9430 and #18135.

I'm also not sure which of the two options is the better way to solve the bootstrap script issue.


Labels

  • area/addons
  • cncf-cla: yes (indicates the PR's author has signed the CNCF CLA)
  • do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress)
  • needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test)
  • size/L (denotes a PR that changes 100-499 lines, ignoring generated files)
