feat: universe#1128

Draft
raisedadead wants to merge 40 commits into main from
feat/k3s-universe

Conversation

@raisedadead
Member

No description provided.

Salvage useful infrastructure from feat/k3s-rancher onto a clean
branch aligned with the Universe platform direction.

Adds:
- SOPS+age config for application secrets encryption
- ops-mgmt cluster configs (security hardening, Rancher backup
  schedule)
- DO inventory fix (drop unused id attribute)
- Ansible config vault password comment
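The SOPS+age setup above typically hinges on a `.sops.yaml` at the repo root. A minimal sketch (the age recipient is a placeholder, not the team key, and the path regex is illustrative):

```yaml
# .sops.yaml (sketch)
creation_rules:
  - path_regex: secrets/.*\.enc$
    # Replace with the team's real age public key.
    age: age1placeholderpublickey0000000000000000000000000000000000
```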
…ucture

Universe Day 0 spike: provision the first galaxy (gxy-management) with
Cilium CNI.

Adds:
- Ansible role: cilium (Helm-based CNI install on any k3s cluster)
- play-k3s--galaxy.yml: semi-generic 6-play playbook for any Universe
  galaxy, composing roles with security hardening and etcd S3 backups
- k3s/gxy-management/: cluster configs (Cilium values, PSS, audit
  policy), app manifests (Windmill, ArgoCD, Zot)
- Pod CIDR 10.1.0.0/16, Service CIDR 10.11.0.0/16 (ADR-009)
- Traefik for Day 0 ingress, Cilium Gateway API evaluation later
- Internal services via NodePort on Tailscale IPs (ADR-009)
- local-path storage (ADR-008: no Longhorn, Ceph on bare metal later)

Naming convention: ops-gxy-* hybrid (ops- prefix for infra resources,
gxy- for logical galaxy naming).
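The CIDR and CNI choices above map onto the k3s server config roughly as follows (a sketch using documented k3s flags, not the exact file in this branch):

```yaml
# /etc/rancher/k3s/config.yaml (illustrative fragment)
cluster-cidr: "10.1.0.0/16"     # pod CIDR per ADR-009
service-cidr: "10.11.0.0/16"    # service CIDR per ADR-009
flannel-backend: none           # Cilium replaces the bundled CNI
disable-network-policy: true    # Cilium enforces network policy instead
```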

Merge ansible, k3s, and terraform justfiles into a single root justfile
with grouped recipes. All recipes run from repo root.

Groups: secrets, k3s, ansible, terraform. Uses uv run for ansible
commands since direnv does not activate inside just subprocesses.
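A grouped root justfile along these lines would match the description (a sketch; recipe names and bodies are illustrative, and the `[group]` attribute needs just >= 1.27):

```just
# Root justfile (sketch) — all recipes run from repo root.
[group('ansible')]
lint:
    uv run ansible-lint ansible/

[group('terraform')]
plan env:
    terraform -chdir=terraform/{{env}} plan
```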

Single secrets/ directory with ansible-vault encryption replaces
scattered .secrets.env files and SOPS+age.

- secrets/<name>/.env.sample: plaintext templates (git tracked)
- secrets/<name>/.env: ansible-vault encrypted (gitignored, shared
  via 1Password)
- Remove .sops.yaml (SOPS replaced by ansible-vault)
- Update .gitignore to whitelist secrets/ samples only
- Add secrets/.gitignore to exclude encrypted .env files
- Add secrets/README.md documenting the approach

Encrypted .env files exist for: global, do-legacy, do-universe,
ansible, appsmith, outline. New apps (windmill, argocd, zot) have
samples only until deployed.

Move from per-directory .envrc (ansible/) to root .envrc with
per-cluster overrides via direnv source_env hierarchy.

- Root .envrc: loads .env (global tokens) + adds ansible venv to PATH
- k3s/<cluster>/.envrc: inherits root, loads cluster-specific .env
  (DO_API_TOKEN + KUBECONFIG)
- just secret-bootstrap: decrypts global tokens to root .env
- just secret-bootstrap-cluster: decrypts team tokens to cluster .env
- Rename do-legacy → do-primary
- Update .gitignore: track .envrc files, whitelist secrets/ samples
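The hierarchy above can be sketched as two .envrc files using direnv stdlib functions (file contents are illustrative, not the exact ones in the branch):

```sh
# Root .envrc (sketch)
dotenv_if_exists .env        # global tokens from `just secret-bootstrap`
PATH_add ansible/.venv/bin   # ansible virtualenv on PATH

# k3s/<cluster>/.envrc (sketch)
source_env ../..             # inherit the root .envrc
dotenv_if_exists .env        # cluster-specific DO_API_TOKEN + KUBECONFIG
```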

Resolve all config discrepancies between feat/k3s-universe branch and
the Universe spike-plan/ADR decisions for the gxy-management cluster.

Changes:
- Create galaxy-specific traefik-config.yaml (LoadBalancer via ServiceLB)
  instead of modifying the shared config used by ops-backoffice-tools
- Re-enable ServiceLB in playbook, update traefik source to galaxy path
- Fix region alignment: nyc3 → fra1 for etcd S3 backups and Zot registry
- New FRA1 buckets: universe-backups (etcd) and universe-registry (Zot)
- Clean up PSS admission: remove stale cattle-*/longhorn/cert-manager,
  add cilium and windmill namespace exemptions
- Add Gateway API resources (Gateway + HTTPRoute) for Windmill, ArgoCD,
  and Zot matching the ops-backoffice-tools pattern
- Add TLS secret samples for Cloudflare origin certificates
- Update ArgoCD/Zot comments to Cloudflare Access model
- Add deployment runbook with pre/post ClickOps checklists
- Update k3s/README.md with correct specs, region, and architecture

Add kubeconform as a K8s manifest schema validator:

- justfile: k8s-validate recipe validates all manifests under k3s/
  and k8s/ against K8s 1.30.0 schemas + datreeio/CRDs-catalog for
  CRDs (Gateway API, Traefik, Longhorn, HelmChartConfig)
- CI: k8s--validate.yml workflow runs just k8s-validate on push/PR
  to main, installs kubeconform v0.7.0 + just via curl

Non-manifest YAML (values.yaml, kustomization.yaml, kubeconfig,
PSS admission, audit policy, samples) excluded via filename patterns.

Add ignore patterns for .json files and dashboards/ directory —
Grafana dashboard outputs, package.json, and tsconfig.json are not
K8s manifests.
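The validate recipe could look roughly like this (a sketch; the CRDs-catalog URL follows kubeconform's documented schema-location placeholder syntax, with `{{` doubled for justfile escaping, and the exclusion pattern is abbreviated):

```just
k8s-validate:
    kubeconform -strict -summary \
      -kubernetes-version 1.30.0 \
      -schema-location default \
      -schema-location 'https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/{{{{.Group}}}}/{{{{.ResourceKind}}}}_{{{{.ResourceAPIVersion}}}}.json' \
      -ignore-filename-pattern '.*values\.yaml' \
      k3s/ k8s/
```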

@github-actions

github-actions bot commented Apr 2, 2026

Run Details - tfws-ops-test

Terraform Cloud Plan Output

Plan: 6 to add, 0 to change, 0 to destroy.

Details : https://app.terraform.io/app/freecodecamp/workspaces/tfws-ops-test/runs/run-vDqUdSPVSWb3xiy8

Warning

Please note that the plan output provided may not accurately reflect the impact on the Terraform project you are currently working on in this Pull Request. The CI checks are merely a sanity test to verify that the versions in the lock file are valid and functional.

Confirm the actual Terraform plan by running the corresponding project on your machine or on TFC.

Move all encrypted secrets and samples to a private infra-secrets
sibling repo. Replace ansible-vault with sops+age encryption.

- Add use_sops function to root .envrc for transparent decryption
- Update cluster .envrc files to load team-specific DO tokens
- Replace justfile ansible-vault recipes with sops equivalents
- Update deploy recipe to use sops -d
- Add galaxy-play recipe for k3s playbook with sops decrypt
- Remove secrets/ directory (migrated to private repo)
- Update ansible.cfg and galaxy playbook references
- Add tailscale-install and tailscale-up convenience recipes
- Update README to reference just recipes instead of raw ansible commands
- Remove stale ansible-vault references from deployment runbook
- Add --create-namespace to Helm install commands
- Add SPIKE-STATUS.md with full research, decisions, progress, and next steps
- Add kubeconfig-sync recipe to decrypt kubeconfig from infra-secrets
- Update deploy recipe to handle TLS certs alongside app secrets
- Add DO_API_TOKEN guard to galaxy-play recipe
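The use_sops helper for transparent decryption might look like this (a sketch; it assumes sops and an age key are available, uses direnv stdlib functions, and the secrets path is a placeholder, not the exact one in the branch):

```sh
# Root .envrc (sketch): export decrypted dotenv pairs into the direnv env.
use_sops() {
  local f="$1"
  if [ -f "$f" ]; then
    # sops exec-env runs a command with the decrypted variables exported;
    # direnv_load captures that environment diff for the current shell.
    direnv_load sops exec-env "$f" "direnv dump"
  else
    log_error "missing encrypted env file: $f"
  fi
}
use_sops ../infra-secrets/global/.env.enc
```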

Justfile: 18 → 12 recipes (parametric, no special-case orchestration)
- Add generic `play` recipe replacing tailscale-install/tailscale-up
- Add `helm-upgrade` recipe with convention-based chart discovery
- Add parametric `tf` recipe replacing 5 separate terraform recipes
- Fix `secret-view` format auto-detection (was hardcoded dotenv)
- Parameterize `k8s-validate` K8s version (was hardcoded 1.30.0)
- Remove `galaxy-play` — playbook reads env vars via direnv now

Playbook: galaxy reads DO_SPACES_* from env instead of vault file
- Replace vars_files with lookup('env', ...) in play-k3s--galaxy.yml
- Add DO_SPACES_ACCESS_KEY/SECRET_KEY to do-universe/.env.enc
- Delete ansible/vault-k3s.yaml.enc from infra-secrets

Docs: replace raw commands with just recipes in all READMEs
- Add repo files for helm-upgrade chart discovery convention
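The parametric recipes could be sketched as follows (argument handling mirrors the commit message; recipe bodies and the playbook path convention are illustrative):

```just
# Generic playbook runner: `just play k3s--reset gxy_mgmt_k3s -v`
play playbook group *args:
    cd ansible && uv run ansible-playbook plays/play-{{playbook}}.yml \
        --limit {{group}} {{args}}

# Parametric terraform wrapper: `just tf plan`, `just tf fmt`, ...
tf cmd *args:
    terraform -chdir=terraform {{cmd}} {{args}}
```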

Critical:
- Fix cilium_cluster_id assertion (int has no length, use string filter)
- Quote etcd snapshot cron schedule to survive systemd word-splitting
- Add TLS secretGenerator to argocd/windmill/zot kustomization.yaml

Warning:
- .envrc: replace silent 2>/dev/null with log_error on sops failure
- justfile deploy: set trap incrementally after each sops decrypt
- justfile kubeconfig-sync: umask 077 before writing kubeconfig
- Add Helm install task to cilium role (was missing on remote hosts)
- CI workflow: pass explicit k8s version to kubeconform
- Parameterize galaxy_name in playbook (was hardcoded 6 times)
- Remove redundant tf-fmt recipe (just tf fmt works)
- Fix ops-mgmt README: remove raw commands and stale vault refs
- Fix ssh_import_id: add gh: prefix to raisedadead in cloud-init

Suggestion:
- Remove redundant .envrc patterns from .gitignore
- Remove orphaned comment and blank lines from .gitignore
- Trim gxy-management .gitignore to non-redundant patterns only
- Fix repo.txt → repo in gxy-management README
- Remove dead cilium namespace from PSS exemptions
- Fix stale destination path comments in security configs
- Remove hardcoded postgresPassword and databaseUrl from public values.yaml
- Update helm-upgrade recipe to overlay secret values from infra-secrets
- Secret values file pattern: <app>.values.yaml.enc (sops-encrypted YAML)
- Decrypted to temp file at install time, deleted after helm upgrade
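The cilium_cluster_id fix above can be illustrated with an assert task (a sketch; the 1-255 range follows Cilium's cluster-mesh documentation, and variable names mirror the commit message):

```yaml
- name: Validate cilium_cluster_id
  ansible.builtin.assert:
    that:
      # An int has no length; cast to string before length checks.
      - (cilium_cluster_id | string) | length > 0
      - cilium_cluster_id | int >= 1
      - cilium_cluster_id | int <= 255
    fail_msg: "cilium_cluster_id must be an integer between 1 and 255"
```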

- Mark completed items (Tailscale, TLS certs, code review, justfile overhaul)
- Replace flat task list with phased plan (A→E)
- Document secrets → Helm flow (public values + secret overlay)
- Deploy sequentially: cluster → Windmill → ArgoCD → Zot
- Each phase has verify step before proceeding

- Add kubelet kernel parameters to galaxy playbook Play 2 pre_tasks
  (vm.overcommit_memory, vm.panic_on_oom, kernel.panic, kernel.panic_on_oops)
  Required by --protect-kernel-defaults per k3s CIS hardening guide
- Remove galaxy_name from play-level vars (pass via -e, fail-safe assert)
- Add *args passthrough to play recipe for extra ansible-playbook flags
- Add per-run logging via tee to ansible/.ansible/logs/
- Create inventory/group_vars/gxy_mgmt_k3s.yml with all galaxy-specific config
  (CIDRs, k3s version, Cilium ID, etcd S3 bucket, Gateway API version)
- Strip all hardcoded values from play-k3s--galaxy.yml — now a generic orchestrator
- Add comprehensive assert block validating all required group_vars before execution
- Fix service.env clobbering: replace copy with lineinfile to preserve K3S_TOKEN
- Restore VPC IP range validation that was dropped during earlier refactors
- Add cron quoting comment to prevent future regressions

To add a new galaxy: create a group_vars file. No playbook editing needed.
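A group_vars file in that shape might look like this (an illustrative fragment; keys mirror the commit message, the k3s version and cluster ID are placeholders, and the bucket name is the FRA1 bucket from an earlier commit):

```yaml
# ansible/inventory/group_vars/gxy_mgmt_k3s.yml (sketch)
galaxy_name: gxy-management
k3s_version: v1.30.0+k3s1        # placeholder version
cluster_cidr: 10.1.0.0/16        # pod CIDR (ADR-009)
service_cidr: 10.11.0.0/16       # service CIDR (ADR-009)
cilium_cluster_id: 1             # placeholder ID
etcd_s3_bucket: universe-backups # FRA1 etcd backup bucket
```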

Nodes stay NotReady without a CNI. With --flannel-backend=none, the wait
must happen AFTER Cilium install, not before. Moved wait + status display
from Play 4 to new Play 6 (after Cilium in Play 5).

Play order: validate → prereqs → k3s server → traefik + CRDs → cilium → verify → kubeconfig

galaxy_name from group_vars isn't resolved when Ansible parses play names,
causing "UNKNOWN" in output. variable_host is passed via -e and available
at parse time.

kubernetes.core.helm defaults to localhost:8080 without KUBECONFIG.
k3s places the kubeconfig at /etc/rancher/k3s/k3s.yaml. Added
environment block to all helm and kubectl tasks in the cilium role.
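The fix looks roughly like this on each affected task (a sketch; module options follow the kubernetes.core.helm docs, the chart ref and values path are illustrative):

```yaml
- name: Install Cilium via Helm
  kubernetes.core.helm:
    name: cilium
    chart_ref: cilium/cilium
    release_namespace: kube-system
    values_files:
      - /etc/rancher/k3s/cilium-values.yaml
  environment:
    # Without this the module talks to localhost:8080 and fails.
    KUBECONFIG: /etc/rancher/k3s/k3s.yaml
```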
…rity

Critical:
- Remove YAML comment inside >- folded scalar that leaked into k3s ExecStart
  (etcd snapshot schedule and retention were silently disabled)
- Add kubernetes.core to requirements.yml (Cilium role dependency)

Warning:
- Set hubble.tls.auto.method=cronJob to prevent cert regen on every re-run
- Pin Helm version (v3.17.3) in cilium role install task
- Add no_log to kubeconfig slurp and copy tasks (admin creds in output)
- Fix Gateway CRD changed_when to only match 'created' (not 'configured')
- Move cilium values to /etc/rancher/k3s/ instead of /tmp

Requires re-run to fix etcd snapshot configuration on live nodes.

direnv now sets KUBECONFIG automatically when you cd into the cluster dir.
Uses expand_path to resolve the absolute path to .kubeconfig.yaml.
…RD chart

Traefik's bundled traefik-crd Helm chart includes Gateway API CRDs.
Manual kubectl apply creates CRDs without Helm ownership labels, causing
traefik-crd install to CrashLoopBackOff with "invalid ownership metadata".

Removed the manual install task and gateway_api_version variable.
Existing CRDs must be deleted manually for Traefik to adopt them:
  kubectl delete crds -l gateway.networking.k8s.io/bundle-version

Wraps the k3s-uninstall.sh script for all nodes in an inventory group.
Removes: k3s, Cilium, etcd data, Helm, service env, audit logs, kubeconfig.
Preserves: Tailscale, cloud-init hardening, DO infrastructure, CIS sysctls.

Usage: just play k3s--reset gxy_mgmt_k3s

Static cluster config (CNI, CIDRs, hardening, etcd S3) now lives in
group_vars as server_config_yaml — written to /etc/rancher/k3s/config.yaml
by the k3s-ansible role. Structured YAML, no folded scalar bugs.

extra_server_args retains only per-node flags (node-ip, advertise-address,
tls-san) that vary by host.

Aligns with k3s hardening guide documented format. Added audit log rotation
flags (maxage=30, maxbackup=10, maxsize=100) per CIS recommendations.
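A server_config_yaml fragment in that format might look like this (a sketch; flag names follow the k3s hardening guide and the k3s-ansible role's variable, while bucket and schedule values are placeholders):

```yaml
# group_vars (sketch) — rendered to /etc/rancher/k3s/config.yaml by the role
server_config_yaml: |
  flannel-backend: none
  disable-network-policy: true
  secrets-encryption: true
  etcd-s3: true
  etcd-s3-bucket: universe-backups
  etcd-snapshot-schedule-cron: "0 */6 * * *"
  kube-apiserver-arg:
    - audit-log-maxage=30
    - audit-log-maxbackup=10
    - audit-log-maxsize=100
```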

- Cilium role: override cluster.name/id via set_values from group_vars
- Cilium role: pin Helm install script to release tag, not main branch
- Reset playbook: clean up Cilium BPF state in /sys/fs/bpf/cilium
- Galaxy playbook: move Traefik config to Play 2 (before k3s starts)
- Galaxy playbook: collapse from 7 plays to 6
- Group vars: add tls-san exclusion comment
- Galaxy playbook: remove undefined vars from debug output
- Reset playbook: use delegate_to localhost instead of connection: local
  (connection: local with remote hosts still uses remote Python interpreter)
- Galaxy playbook: rename kubeconfig context/cluster/user from 'default'
  to galaxy_name so OMP and kubectl context show the actual cluster name

Galaxy playbook (5 plays, down from 7):
- Use server_config_yaml for all static config (k3s hardening guide format)
- Use extra_service_envs for S3 creds (role-native mechanism)
- Set user_kubectl: false (Play 5 handles kubeconfig correctly)
- Document required DO firewall ports in header
- Remove Cilium cluster.name/id from values.yaml (set_values is source of truth)

Reset playbook: simplified, delegate_to for local cleanup

Group vars: remove cluster_context (unused with user_kubectl: false)

Park deployment tasks in SPIKE-STATUS.md pending clean redeploy.

Operational findings, failure analysis, and deployment plan are now
maintained in Universe/spike/field-notes.md (the canonical source).
This file accumulated cruft from multiple failed deployment attempts
and was redundant with the field notes.

k3s-uninstall.sh does not clean Cilium's iptables chains (CILIUM_INPUT,
CILIUM_PRE_mangle, etc). These stale rules block inter-node traffic on
redeploy, causing etcd peer timeouts.
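An extra reset-playbook task for this cleanup could be sketched as follows (illustrative; the CILIUM_ chain prefix follows Cilium's documented naming, and in practice jump rules into these chains must also be removed for -X to succeed):

```yaml
- name: Flush and delete stale Cilium iptables chains
  ansible.builtin.shell: |
    for chain in $(iptables -t {{ item }} -S | awk '/^-N CILIUM_/ {print $2}'); do
      iptables -t {{ item }} -F "$chain"
      iptables -t {{ item }} -X "$chain" || true
    done
  loop: [filter, nat, mangle, raw]
  changed_when: true
```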
…lean

gather_facts needed for ansible_user_id in kubeconfig path cleanup.
ansible_user is a connection var, not a fact — use ansible_user_id.

- Add bpf.masquerade: true to Cilium values — moves masquerade to eBPF,
  fixes etcd peer communication failure caused by k3s iptables save/restore
  conflicting with Cilium chains (k3s#7736)
- Remove installNoConntrackIptablesRules (incompatible with VXLAN tunnel mode)
- Increase Helm install timeout from 5m to 10m (first install pulls images)
- Add retries to DaemonSet/operator rollout and status verification tasks
  (k3s API is transiently unavailable after Cilium changes network stack)
- Update all README references from k3s--galaxy to k3s--bootstrap

- Set kubeProxyReplacement: false in Cilium values — kube-proxy replacement
  breaks etcd on k3s HA embedded etcd (see field-notes Failure 7).
  Cilium still provides CNI + network policies + Hubble without it.
- Re-enable kube-proxy in k3s config (disable-kube-proxy: false)
- Fix kubeconfig write: use copy + replace instead of chained Jinja2
  regex_replace in folded scalar (was writing 127.0.0.1 instead of
  Tailscale IP)
- Reset playbook: clean /etc/rancher and /var/lib/rancher entirely

Windmill Helm chart does not consume a windmill-secrets Opaque secret.
Database credentials come from the secret values overlay via Helm.
Admin password is set via Windmill UI on first boot.
Keep only the TLS secretGenerator (referenced by Gateway).

Cilium auto-detected tailscale0 (MTU 1280) alongside eth0/eth1
(MTU 1500), setting all pod veths to 1280. This broke cross-node
pod-to-pod HTTP (packets exceeded path MTU and were dropped).

Pin devices to [eth0, eth1] and MTU to 1500 to exclude tailscale0.

Disable metrics-server — pods cannot reach node VPC IPs directly
(connection refused, all ports). Services via kube-proxy DNAT and
pod-to-pod via VXLAN work fine. Root cause under investigation
(Cilium BPF handling of pod-to-host traffic on multi-NIC nodes).

Also inline >- folded scalars in Cilium role tasks.
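The pinning change amounts to two Helm values (keys follow the Cilium chart's values reference; this is a fragment, not the whole values file):

```yaml
# cilium-values.yaml (fragment)
devices: [eth0, eth1]  # exclude tailscale0 from device auto-detection
MTU: 1500              # pin pod veth MTU to the eth0/eth1 MTU
```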

Pods cannot reach node VPC IPs directly on Cilium multi-NIC nodes
(open issue, see field-notes Failure 8b). metrics-server needs
kubelet access on nodeIP:10250.

Workaround: patch metrics-server deployment in Play 5 to use
hostNetwork with --secure-port=4443 (avoids kubelet port conflict).
Verified: kubectl top nodes returns data for all 3 nodes.
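As a standalone command, the workaround could look roughly like this (a sketch; the JSON patch is abbreviated, assumes --secure-port is not already set in args, and omits the matching containerPort change):

```sh
kubectl -n kube-system patch deployment metrics-server --type=json -p '[
  {"op": "add", "path": "/spec/template/spec/hostNetwork", "value": true},
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-",
   "value": "--secure-port=4443"}
]'
```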