Salvage useful infrastructure from feat/k3s-rancher onto a clean branch aligned with the Universe platform direction. Adds:
- SOPS+age config for application secrets encryption
- ops-mgmt cluster configs (security hardening, Rancher backup schedule)
- DO inventory fix (drop unused id attribute)
- Ansible config vault password comment
…ucture Universe Day 0 spike: provision the first galaxy (gxy-management) with Cilium CNI. Adds:
- Ansible role: cilium (Helm-based CNI install on any k3s cluster)
- play-k3s--galaxy.yml: semi-generic 6-play playbook for any Universe galaxy, composing roles with security hardening and etcd S3 backups
- k3s/gxy-management/: cluster configs (Cilium values, PSS, audit policy), app manifests (Windmill, ArgoCD, Zot)
- Pod CIDR 10.1.0.0/16, Service CIDR 10.11.0.0/16 (ADR-009)
- Traefik for Day 0 ingress, Cilium Gateway API evaluation later
- Internal services via NodePort on Tailscale IPs (ADR-009)
- local-path storage (ADR-008: no Longhorn, Ceph on bare metal later)

Naming convention: ops-gxy-* hybrid (ops- prefix for infra resources, gxy- for logical galaxy naming).
Merge ansible, k3s, and terraform justfiles into a single root justfile with grouped recipes. All recipes run from repo root. Groups: secrets, k3s, ansible, terraform. Uses uv run for ansible commands since direnv does not activate inside just subprocesses.
Single secrets/ directory with ansible-vault encryption replaces scattered .secrets.env files and SOPS+age.
- secrets/<name>/.env.sample: plaintext templates (git tracked)
- secrets/<name>/.env: ansible-vault encrypted (gitignored, shared via 1Password)
- Remove .sops.yaml (SOPS replaced by ansible-vault)
- Update .gitignore to whitelist secrets/ samples only
- Add secrets/.gitignore to exclude encrypted .env files
- Add secrets/README.md documenting the approach

Encrypted .env files exist for: global, do-legacy, do-universe, ansible, appsmith, outline. New apps (windmill, argocd, zot) have samples only until deployed.
Move from per-directory .envrc (ansible/) to a root .envrc with per-cluster overrides via the direnv source_env hierarchy.
- Root .envrc: loads .env (global tokens) + adds ansible venv to PATH
- k3s/<cluster>/.envrc: inherits root, loads cluster-specific .env (DO_API_TOKEN + KUBECONFIG)
- just secret-bootstrap: decrypts global tokens to root .env
- just secret-bootstrap-cluster: decrypts team tokens to cluster .env
- Rename do-legacy → do-primary
- Update .gitignore: track .envrc files, whitelist secrets/ samples
Resolve all config discrepancies between the feat/k3s-universe branch and the Universe spike-plan/ADR decisions for the gxy-management cluster. Changes:
- Create galaxy-specific traefik-config.yaml (LoadBalancer via ServiceLB) instead of modifying the shared config used by ops-backoffice-tools
- Re-enable ServiceLB in playbook, update traefik source to galaxy path
- Fix region alignment: nyc3 → fra1 for etcd S3 backups and Zot registry
- New FRA1 buckets: universe-backups (etcd) and universe-registry (Zot)
- Clean up PSS admission: remove stale cattle-*/longhorn/cert-manager, add cilium and windmill namespace exemptions
- Add Gateway API resources (Gateway + HTTPRoute) for Windmill, ArgoCD, and Zot matching the ops-backoffice-tools pattern
- Add TLS secret samples for Cloudflare origin certificates
- Update ArgoCD/Zot comments to Cloudflare Access model
- Add deployment runbook with pre/post ClickOps checklists
- Update k3s/README.md with correct specs, region, and architecture
Add kubeconform as a K8s manifest schema validator:
- justfile: k8s-validate recipe validates all manifests under k3s/ and k8s/ against K8s 1.30.0 schemas + datreeio/CRDs-catalog for CRDs (Gateway API, Traefik, Longhorn, HelmChartConfig)
- CI: k8s--validate.yml workflow runs just k8s-validate on push/PR to main, installs kubeconform v0.7.0 + just via curl

Non-manifest YAML (values.yaml, kustomization.yaml, kubeconfig, PSS admission, audit policy, samples) is excluded via filename patterns.
Add ignore patterns for .json files and dashboards/ directory — Grafana dashboard outputs, package.json, and tsconfig.json are not K8s manifests.
Run details: tfws-ops-test Terraform Cloud plan output: https://app.terraform.io/app/freecodecamp/workspaces/tfws-ops-test/runs/run-vDqUdSPVSWb3xiy8
Move all encrypted secrets and samples to a private infra-secrets sibling repo. Replace ansible-vault with sops+age encryption.
- Add use_sops function to root .envrc for transparent decryption
- Update cluster .envrc files to load team-specific DO tokens
- Replace justfile ansible-vault recipes with sops equivalents
- Update deploy recipe to use sops -d
- Add galaxy-play recipe for k3s playbook with sops decrypt
- Remove secrets/ directory (migrated to private repo)
- Update ansible.cfg and galaxy playbook references
- Add tailscale-install and tailscale-up convenience recipes
- Update README to reference just recipes instead of raw ansible commands
- Remove stale ansible-vault references from deployment runbook
- Add --create-namespace to Helm install commands
- Add SPIKE-STATUS.md with full research, decisions, progress, and next steps
- Add kubeconfig-sync recipe to decrypt kubeconfig from infra-secrets
- Update deploy recipe to handle TLS certs alongside app secrets
- Add DO_API_TOKEN guard to galaxy-play recipe
Justfile: 18 → 12 recipes (parametric, no special-case orchestration)
- Add generic `play` recipe replacing tailscale-install/tailscale-up
- Add `helm-upgrade` recipe with convention-based chart discovery
- Add parametric `tf` recipe replacing 5 separate terraform recipes
- Fix `secret-view` format auto-detection (was hardcoded dotenv)
- Parameterize `k8s-validate` K8s version (was hardcoded 1.30.0)
- Remove `galaxy-play` — playbook reads env vars via direnv now
Playbook: galaxy reads DO_SPACES_* from env instead of vault file
- Replace vars_files with lookup('env', ...) in play-k3s--galaxy.yml
- Add DO_SPACES_ACCESS_KEY/SECRET_KEY to do-universe/.env.enc
- Delete ansible/vault-k3s.yaml.enc from infra-secrets
Docs: replace raw commands with just recipes in all READMEs
- Add repo files for helm-upgrade chart discovery convention
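The vars_files → env swap described above could look like this (a hypothetical sketch; the play structure and the guard are illustrative, only the DO_SPACES_ACCESS_KEY/SECRET_KEY variable names come from the commit message):

```yaml
- name: Configure k3s servers
  hosts: "{{ variable_host }}"
  vars:
    do_spaces_access_key: "{{ lookup('env', 'DO_SPACES_ACCESS_KEY') }}"
    do_spaces_secret_key: "{{ lookup('env', 'DO_SPACES_SECRET_KEY') }}"
  pre_tasks:
    # Fail fast when direnv has not exported the S3 credentials.
    - name: Assert S3 credentials are present in the environment
      ansible.builtin.assert:
        that:
          - do_spaces_access_key | length > 0
          - do_spaces_secret_key | length > 0
        fail_msg: "Export DO_SPACES_ACCESS_KEY/DO_SPACES_SECRET_KEY (loaded from do-universe/.env.enc)"
```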
Critical:
- Fix cilium_cluster_id assertion (int has no length, use string filter)
- Quote etcd snapshot cron schedule to survive systemd word-splitting
- Add TLS secretGenerator to argocd/windmill/zot kustomization.yaml

Warning:
- .envrc: replace silent 2>/dev/null with log_error on sops failure
- justfile deploy: set trap incrementally after each sops decrypt
- justfile kubeconfig-sync: umask 077 before writing kubeconfig
- Add Helm install task to cilium role (was missing on remote hosts)
- CI workflow: pass explicit k8s version to kubeconform
- Parameterize galaxy_name in playbook (was hardcoded 6 times)
- Remove redundant tf-fmt recipe (just tf fmt works)
- Fix ops-mgmt README: remove raw commands and stale vault refs
- Fix ssh_import_id: add gh: prefix to raisedadead in cloud-init

Suggestion:
- Remove redundant .envrc patterns from .gitignore
- Remove orphaned comment and blank lines from .gitignore
- Trim gxy-management .gitignore to non-redundant patterns only
- Fix repo.txt → repo in gxy-management README
- Remove dead cilium namespace from PSS exemptions
- Fix stale destination path comments in security configs
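The cilium_cluster_id fix, in sketch form: Jinja2's length filter works on strings and sequences, so an integer must be cast first. The 0-255 bound is illustrative of Cilium's cluster-id range, not taken from the diff.

```yaml
- name: Validate cilium_cluster_id
  ansible.builtin.assert:
    that:
      - cilium_cluster_id is defined
      # an int has no length; cast to string before measuring
      - cilium_cluster_id | string | length > 0
      - cilium_cluster_id | int >= 0 and cilium_cluster_id | int <= 255
```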
- Remove hardcoded postgresPassword and databaseUrl from public values.yaml
- Update helm-upgrade recipe to overlay secret values from infra-secrets
- Secret values file pattern: <app>.values.yaml.enc (sops-encrypted YAML)
- Decrypted to a temp file at install time, deleted after helm upgrade
- Mark completed items (Tailscale, TLS certs, code review, justfile overhaul)
- Replace flat task list with phased plan (A→E)
- Document secrets → Helm flow (public values + secret overlay)
- Deploy sequentially: cluster → Windmill → ArgoCD → Zot
- Each phase has a verify step before proceeding
- Add kubelet kernel parameters to galaxy playbook Play 2 pre_tasks (vm.overcommit_memory, vm.panic_on_oom, kernel.panic, kernel.panic_on_oops), required by --protect-kernel-defaults per the k3s CIS hardening guide
- Remove galaxy_name from play-level vars (pass via -e, fail-safe assert)
- Add *args passthrough to play recipe for extra ansible-playbook flags
- Add per-run logging via tee to ansible/.ansible/logs/
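The Play 2 pre_task might look like this (a sketch; the module and sysctl file path are conventional choices, and the values match what the k3s CIS hardening guide recommends for --protect-kernel-defaults):

```yaml
- name: Set kernel parameters required by --protect-kernel-defaults
  ansible.posix.sysctl:
    name: "{{ item.name }}"
    value: "{{ item.value }}"
    sysctl_file: /etc/sysctl.d/90-kubelet.conf
    reload: true
  loop:
    - { name: vm.overcommit_memory, value: "1" }
    - { name: vm.panic_on_oom, value: "0" }
    - { name: kernel.panic, value: "10" }
    - { name: kernel.panic_on_oops, value: "1" }
  become: true
```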
- Create inventory/group_vars/gxy_mgmt_k3s.yml with all galaxy-specific config (CIDRs, k3s version, Cilium ID, etcd S3 bucket, Gateway API version)
- Strip all hardcoded values from play-k3s--galaxy.yml; now a generic orchestrator
- Add comprehensive assert block validating all required group_vars before execution
- Fix service.env clobbering: replace copy with lineinfile to preserve K3S_TOKEN
- Restore VPC IP range validation that was dropped during earlier refactors
- Add cron quoting comment to prevent future regressions

To add a new galaxy: create a group_vars file. No playbook editing needed.
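A hypothetical shape for inventory/group_vars/gxy_mgmt_k3s.yml (the CIDRs and bucket echo earlier commits; the k3s version and Cilium ID are placeholders, not taken from the diff):

```yaml
galaxy_name: gxy-management
k3s_version: v1.30.x+k3s1         # placeholder
cluster_cidr: 10.1.0.0/16         # pod CIDR (ADR-009)
service_cidr: 10.11.0.0/16        # service CIDR (ADR-009)
cilium_cluster_id: 1              # placeholder
etcd_s3_bucket: universe-backups  # FRA1 bucket from an earlier commit
gateway_api_version: v1.1.0       # placeholder
```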
Nodes stay NotReady without a CNI. With --flannel-backend=none, the wait must happen AFTER the Cilium install, not before. Moved the wait + status display from Play 4 to a new Play 6 (after Cilium in Play 5).

Play order: validate → prereqs → k3s server → traefik + CRDs → cilium → verify → kubeconfig
galaxy_name from group_vars isn't resolved when Ansible parses play names, causing "UNKNOWN" in output. variable_host is passed via -e and available at parse time.
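A sketch of the parse-time constraint (the play shown is illustrative):

```yaml
# galaxy_name comes from group_vars, which are bound per-host after
# parsing, so it cannot appear in a play name. variable_host arrives
# via -e and is already defined when the playbook is parsed.
- name: "Provision k3s for {{ variable_host }}"
  hosts: "{{ variable_host }}"
```

Invoked as something like `ansible-playbook play-k3s--galaxy.yml -e variable_host=gxy_mgmt_k3s` (the exact command is an assumption based on the recipes above).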
kubernetes.core.helm defaults to localhost:8080 without KUBECONFIG. k3s places the kubeconfig at /etc/rancher/k3s/k3s.yaml. Added environment block to all helm and kubectl tasks in the cilium role.
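The fix amounts to an environment block on each helm/kubectl task in the role, for example (task details are illustrative):

```yaml
- name: Install Cilium via Helm
  kubernetes.core.helm:
    name: cilium
    chart_ref: cilium/cilium
    release_namespace: kube-system
  environment:
    # Without this, kubernetes.core falls back to localhost:8080
    KUBECONFIG: /etc/rancher/k3s/k3s.yaml
```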
…rity
Critical:
- Remove YAML comment inside >- folded scalar that leaked into k3s ExecStart (etcd snapshot schedule and retention were silently disabled)
- Add kubernetes.core to requirements.yml (Cilium role dependency)

Warning:
- Set hubble.tls.auto.method=cronJob to prevent cert regen on every re-run
- Pin Helm version (v3.17.3) in cilium role install task
- Add no_log to kubeconfig slurp and copy tasks (admin creds in output)
- Fix Gateway CRD changed_when to only match 'created' (not 'configured')
- Move cilium values to /etc/rancher/k3s/ instead of /tmp

Requires a re-run to fix etcd snapshot configuration on live nodes.
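The folded-scalar pitfall, in miniature (flag values are illustrative): inside a >- block a "#" line is plain content, not a comment, so its text is folded verbatim into the resulting string and everything after it stops being parsed as real flags.

```yaml
extra_server_args: >-
  --secure-port=6444
  # etcd snapshots every 6 hours, keep 14  <- NOT a comment: this
  # text lands inside ExecStart and corrupts the flags that follow
  --etcd-snapshot-schedule-cron='0 */6 * * *'
  --etcd-snapshot-retention=14
```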
direnv now sets KUBECONFIG automatically when you cd into the cluster dir. Uses expand_path to resolve the absolute path to .kubeconfig.yaml.
…RD chart
Traefik's bundled traefik-crd Helm chart includes Gateway API CRDs. Manual kubectl apply creates CRDs without Helm ownership labels, causing the traefik-crd install to CrashLoopBackOff with "invalid ownership metadata". Removed the manual install task and the gateway_api_version variable.

Existing CRDs must be deleted manually for Traefik to adopt them:
kubectl delete crds -l gateway.networking.k8s.io/bundle-version
Wraps the k3s-uninstall.sh script for all nodes in an inventory group. Removes: k3s, Cilium, etcd data, Helm, service env, audit logs, kubeconfig. Preserves: Tailscale, cloud-init hardening, DO infrastructure, CIS sysctls. Usage: just play k3s--reset gxy_mgmt_k3s
Static cluster config (CNI, CIDRs, hardening, etcd S3) now lives in group_vars as server_config_yaml, written to /etc/rancher/k3s/config.yaml by the k3s-ansible role. Structured YAML, no folded-scalar bugs. extra_server_args retains only per-node flags (node-ip, advertise-address, tls-san) that vary by host. Aligns with the format documented in the k3s hardening guide. Added audit log rotation flags (maxage=30, maxbackup=10, maxsize=100) per CIS recommendations.
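A hypothetical sketch of the group_vars entry (keys follow the k3s config-file format; the CIDRs echo earlier commits, the rest is illustrative):

```yaml
server_config_yaml: |
  flannel-backend: none          # Cilium provides the CNI
  disable-network-policy: true
  cluster-cidr: 10.1.0.0/16      # ADR-009
  service-cidr: 10.11.0.0/16     # ADR-009
  kube-apiserver-arg:
    - audit-log-maxage=30        # rotation flags added in this commit
    - audit-log-maxbackup=10
    - audit-log-maxsize=100
```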
- Cilium role: override cluster.name/id via set_values from group_vars
- Cilium role: pin Helm install script to release tag, not main branch
- Reset playbook: clean up Cilium BPF state in /sys/fs/bpf/cilium
- Galaxy playbook: move Traefik config to Play 2 (before k3s starts)
- Galaxy playbook: collapse from 7 plays to 6
- Group vars: add tls-san exclusion comment
- Galaxy playbook: remove undefined vars from debug output
- Reset playbook: use delegate_to localhost instead of connection: local (connection: local with remote hosts still uses the remote Python interpreter)
- Galaxy playbook: rename kubeconfig context/cluster/user from 'default' to galaxy_name so OMP and kubectl context show the actual cluster name
Galaxy playbook (5 plays, down from 7):
- Use server_config_yaml for all static config (k3s hardening guide format)
- Use extra_service_envs for S3 creds (role-native mechanism)
- Set user_kubectl: false (Play 5 handles kubeconfig correctly)
- Document required DO firewall ports in header
- Remove Cilium cluster.name/id from values.yaml (set_values is the source of truth)

Reset playbook: simplified, delegate_to for local cleanup.
Group vars: remove cluster_context (unused with user_kubectl: false).
Park deployment tasks in SPIKE-STATUS.md pending a clean redeploy.
Operational findings, failure analysis, and deployment plan are now maintained in Universe/spike/field-notes.md (the canonical source). This file accumulated cruft from multiple failed deployment attempts and was redundant with the field notes.
k3s-uninstall.sh does not clean up Cilium's iptables chains (CILIUM_INPUT, CILIUM_PRE_mangle, etc.). These stale rules block inter-node traffic on redeploy, causing etcd peer timeouts.
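A reset-play task along these lines can strip the leftover rules, using the save/filter/restore trick to drop every CILIUM_* chain and jump rule in one pass (a sketch, not taken from the actual playbook):

```yaml
- name: Remove stale Cilium iptables chains and rules
  ansible.builtin.shell: |
    # grep -v drops both the chain definitions (:CILIUM_* lines)
    # and every rule that references a CILIUM chain
    iptables-save | grep -viE 'cilium' | iptables-restore
    ip6tables-save | grep -viE 'cilium' | ip6tables-restore
  become: true
  changed_when: true
```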
…lean gather_facts needed for ansible_user_id in kubeconfig path cleanup. ansible_user is a connection var, not a fact — use ansible_user_id.
- Add bpf.masquerade: true to Cilium values; moves masquerade to eBPF, fixing the etcd peer communication failure caused by k3s iptables save/restore conflicting with Cilium chains (k3s#7736)
- Remove installNoConntrackIptablesRules (incompatible with VXLAN tunnel mode)
- Increase Helm install timeout from 5m to 10m (first install pulls images)
- Add retries to DaemonSet/operator rollout and status verification tasks (the k3s API is transiently unavailable after Cilium changes the network stack)
- Update all README references from k3s--galaxy to k3s--bootstrap
- Set kubeProxyReplacement: false in Cilium values; kube-proxy replacement breaks etcd on k3s HA embedded etcd (see field-notes Failure 7). Cilium still provides CNI + network policies + Hubble without it.
- Re-enable kube-proxy in k3s config (disable-kube-proxy: false)
- Fix kubeconfig write: use copy + replace instead of chained Jinja2 regex_replace in a folded scalar (was writing 127.0.0.1 instead of the Tailscale IP)
- Reset playbook: clean /etc/rancher and /var/lib/rancher entirely
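The copy + replace pair might look like this (a sketch; it assumes a prior slurp of /etc/rancher/k3s/k3s.yaml registered as kubeconfig_raw and a tailscale_ip variable, neither of which is named in the commit):

```yaml
- name: Write kubeconfig locally
  ansible.builtin.copy:
    content: "{{ kubeconfig_raw.content | b64decode }}"
    dest: "{{ playbook_dir }}/.kubeconfig.yaml"
    mode: "0600"
  delegate_to: localhost

# A separate replace step avoids nesting regex_replace inside a
# folded scalar, which is what produced the 127.0.0.1 bug.
- name: Point the kubeconfig at the Tailscale IP
  ansible.builtin.replace:
    path: "{{ playbook_dir }}/.kubeconfig.yaml"
    regexp: 'https://127\.0\.0\.1:6443'
    replace: "https://{{ tailscale_ip }}:6443"
  delegate_to: localhost
```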
Windmill Helm chart does not consume a windmill-secrets Opaque secret. Database credentials come from the secret values overlay via Helm. Admin password is set via Windmill UI on first boot. Keep only the TLS secretGenerator (referenced by Gateway).
Cilium auto-detected tailscale0 (MTU 1280) alongside eth0/eth1 (MTU 1500), setting all pod veths to 1280. This broke cross-node pod-to-pod HTTP (packets exceeded the path MTU and were dropped). Pin devices to [eth0, eth1] and MTU to 1500 to exclude tailscale0.

Disable metrics-server: pods cannot reach node VPC IPs directly (connection refused, all ports). Services via kube-proxy DNAT and pod-to-pod via VXLAN work fine. Root cause under investigation (Cilium BPF handling of pod-to-host traffic on multi-NIC nodes).

Also inline >- folded scalars in Cilium role tasks.
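In the Cilium Helm values the pin might look like this (illustrative; `devices` and `MTU` are top-level chart keys, and the list form of `devices` is assumed):

```yaml
# Restrict the datapath to the physical NICs so tailscale0's
# 1280-byte MTU is never auto-detected as the pod MTU.
devices:
  - eth0
  - eth1
MTU: 1500
```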
Pods cannot reach node VPC IPs directly on Cilium multi-NIC nodes (open issue, see field-notes Failure 8b). metrics-server needs kubelet access on nodeIP:10250. Workaround: patch metrics-server deployment in Play 5 to use hostNetwork with --secure-port=4443 (avoids kubelet port conflict). Verified: kubectl top nodes returns data for all 3 nodes.
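The Play 5 workaround could be expressed as a JSON-patch task like the following (a sketch; the real playbook may apply it differently, e.g. by shelling out to kubectl patch):

```yaml
- name: Run metrics-server on the host network
  kubernetes.core.k8s_json_patch:
    kind: Deployment
    namespace: kube-system
    name: metrics-server
    patch:
      - op: add
        path: /spec/template/spec/hostNetwork
        value: true
      # 10250 is taken by the kubelet once on the host network,
      # so serve on 4443 instead
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --secure-port=4443
  environment:
    KUBECONFIG: /etc/rancher/k3s/k3s.yaml
```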