A single Go binary that consumes the per-pod ContentionStream gRPC API
(see Mitigation-interface.md), evaluates a CEL
policy each tick, and fires actions from four mitigation tiers:

| Tier | Surface | Timescale | Actuator |
|---|---|---|---|
| isolate | cgroup v2 cpu.max on co-located aggressors | ~100 ms | pkg/actuators/isolate |
| harvest | cgroup v2 cpu.max on co-located best-effort pods | ~100 ms | pkg/actuators/harvest |
| vertical | pods/resize subresource (cpu requests/limits) | ~1 s | pkg/actuators/vertical |
| horizontal | apps/v1.Deployment/scale subresource | ~10 s+ | pkg/actuators/horizontal |

The binary runs as a privileged DaemonSet -- one instance per node. Each
instance subscribes only to victim pods on its own node (field selector
spec.nodeName=$NODE_NAME), so node-local mitigations are race-free
without leader election. Horizontal scale is coordinated K8s-natively via
an idempotent /scale patch + a mitigation/horizontal-last-scaled-at
cooldown annotation on the Deployment.
See plan-v2-centralized.md for the full design.
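
The cooldown gate on the Deployment can be sketched as a pure check against the `mitigation/horizontal-last-scaled-at` annotation. This is a minimal sketch, not the controller's code: it assumes the annotation holds an RFC 3339 timestamp and treats a missing or unparsable value as "never scaled". The names `cooldownElapsed` and `mustParse` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Annotation key taken from the text above.
const lastScaledAtKey = "mitigation/horizontal-last-scaled-at"

// mustParse is a small helper for the example; it panics on bad input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// cooldownElapsed reports whether a new scale is allowed at time now.
// Assumption: the annotation value is RFC 3339; a missing or unparsable
// value is treated as "never scaled", so the gate opens.
func cooldownElapsed(annotations map[string]string, now time.Time, cooldownSec int) bool {
	raw, ok := annotations[lastScaledAtKey]
	if !ok {
		return true
	}
	last, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return true
	}
	return now.Sub(last) >= time.Duration(cooldownSec)*time.Second
}

func main() {
	ann := map[string]string{lastScaledAtKey: "2025-01-01T12:00:10Z"}
	now := mustParse("2025-01-01T12:00:30Z") // 20 s after the last scale

	fmt.Println(cooldownElapsed(ann, now, 30)) // false: still cooling down
	fmt.Println(cooldownElapsed(ann, now, 10)) // true: cooldown has passed
}
```

Because every node's instance reads the same annotation from the API server, a check of this shape gives each of them the same answer without leader election.
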
    victim pod (this node)    mitigation-controller (this node, DaemonSet)
    :7900 ──gRPC stream──▶ scoreclient ──▶ features (rolling window per pod)
                                               ↓
                                       policy (CEL rules)
                                               ↓
                               ┌──────────┬──────────┬──────────┐
                               ▼          ▼          ▼          ▼
                            isolate    harvest   vertical  horizontal
                           (cpu.max)  (cpu.max)  (resize)    (scale)
The simulation's three simple control laws (horizontal bang-bang, isolating
saturated ramp, harvesting AIMD) are ported to pkg/controllers and validated
against simulation/simulation.py. They are not
yet driving the per-tick loop — see plan.md for the wiring status.
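
For orientation, two of the control laws named above can be sketched in a few lines. This is an illustrative sketch only: the step sizes, thresholds, and function names (`aimdStep`, `bangBang`) are assumptions, and pkg/controllers together with simulation/simulation.py remain the reference for the real parameters.

```go
package main

import "fmt"

// aimdStep returns the next number of harvested cores h.
// Assumed behaviour: while the victim is calm, lend `add` more cores per
// step (capped at hMax); on contention, back off multiplicatively by `mult`.
func aimdStep(h float64, contended bool, add, mult, hMax float64) float64 {
	if contended {
		return h * mult
	}
	if h+add > hMax {
		return hMax
	}
	return h + add
}

// bangBang returns the next replica count n for a two-threshold law.
// Assumed behaviour: add one replica above hi, remove one below lo,
// hold inside the [lo, hi] band, and stay within [nMin, nMax].
func bangBang(n int, score, lo, hi float64, nMin, nMax int) int {
	switch {
	case score > hi && n < nMax:
		return n + 1
	case score < lo && n > nMin:
		return n - 1
	default:
		return n
	}
}

func main() {
	h := 0.0
	for _, contended := range []bool{false, false, false, true, false} {
		h = aimdStep(h, contended, 0.5, 0.5, 4)
		fmt.Printf("%.2f ", h) // 0.50 1.00 1.50 0.75 1.25
	}
	fmt.Println()
	fmt.Println(bangBang(2, 0.7, 0.2, 0.5, 1, 5)) // 3
}
```
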
proto/contention.proto vendored wire contract (3 spatial-horizon fields added)
gen/go/contentionpb/ generated (gitignored) -- run `make proto`
pkg/targets/ multi-victim config loader
pkg/scoreclient/ gRPC subscriber w/ reconnect + multi-pod fan-in
pkg/podwatch/ client-go informer (+ NewLocalNodeWatcher for the DaemonSet)
pkg/features/ rolling window + spatial/temporal feature computation
pkg/policy/ CEL env, YAML rule loader, fsnotify hot-reload, engine
pkg/cgroup/ cgroup v2 path resolution + cpu.max read/write
pkg/actuators/ shared interface + annotation key constants
pkg/actuators/isolate/ throttles aggressor pods' cpu.max (fraction or absolute cap)
pkg/actuators/harvest/ raises best-effort pods' cpu.max to lend victim idle cores
pkg/actuators/vertical/ patches pods/resize for the victim pod
pkg/actuators/horizontal/ patches deployments/scale for the victim Deployment
pkg/controllers/ the 3 simple control laws ported from simulation.py (cap / n / h)
pkg/aggregator/ pluggable Max / Mean / P90 (callable from rules)
pkg/thresholder/ HI/LO + cooldown state machine (also exposed to CEL via `band`)
cmd/mitigation-controller/ the only binary
deploy/controller/ DaemonSet, RBAC, ConfigMap (targets + policy)
deploy/victim-sample/ sample search + profile Deployments
Requires Go 1.23 and protoc. On Debian/Ubuntu:
sudo apt install protobuf-compiler
make deps # installs protoc-gen-go + protoc-gen-go-grpc
make proto # generates gen/go/contentionpb/*.pb.go
go mod tidy
make build # equivalent to `go build ./...`
make test # runs all unit tests

Build the container image:

make docker-controller

The Dockerfile runs `make proto` inside the build stage, so `docker build`
works from a fresh clone.
Three layers: Go unit tests, the offline control-law parity oracle, and an in-cluster smoke test.
make test # go test ./... (all packages)
go test ./pkg/controllers/... # the 3 ported control laws + streaming parity
go test ./pkg/cgroup/... # cpu.max parse / path resolution
go test ./pkg/policy/... # CEL compile + cooldown engine

`make test` needs the generated proto stubs (run `make proto` once after a fresh
clone) because several packages import gen/go/contentionpb.
pkg/controllers has no proto dependency, so it can be tested standalone even before
`make proto`.
simulation/simulation.py is the reference
implementation the Go controllers in pkg/controllers are validated against —
the test expectations in pkg/controllers/controllers_test.go were
cross-checked against it. Run it to regenerate the sweep/figure PNGs or to
re-derive expected values:
cd simulation
pip install numpy scipy matplotlib # one-time
python simulation.py # synthetic signals -> *.png
python simulation.py --data run_data_iter1_ready.json # against a real Gordion trace

It writes sweep_horizontal.png, sweep_isolating.png, sweep_harvesting.png,
and ctrl_reference_run.png, plus a numeric summary to stdout.
After deploying, confirm the pipeline end to end:
# controller is up, one pod per node, and loaded the policy:
kubectl -n mitigation-system rollout status ds/mitigation-controller
kubectl -n mitigation-system logs -l app.kubernetes.io/name=mitigation-controller --tail=20 | grep "policy reloaded"
# drive contention on a victim, then watch actions fire:
kubectl -n mitigation-system logs -l app.kubernetes.io/name=mitigation-controller -f | grep '"msg":"action"'
# verify a cgroup write landed on an aggressor (isolate) / best-effort pod (harvest):
kubectl -n hotelres get pod <aggressor> -o jsonpath='{.metadata.annotations.mitigation/cpu-max-original}'
kubectl -n hotelres get pod <be-pod> -o jsonpath='{.metadata.annotations.mitigation/harvest-cpu-max-original}'

You can also exercise the score API alone, without the controller; see "Smoke test the score API directly".

Note: this repo's CI/dev machine may not have a Go toolchain installed; if
`go` is missing, the parity oracle (Python) still runs and is the primary way control-law changes are validated before pushing.
Three rules ship in deploy/controller/configmap.yaml,
matching plan-v2-centralized.md Section 5 verbatim:
rules:
  - name: sharp_rising_spike
    when: "k_temporal > 0.3 || k_spatial > 0.3"
    fire:
      - kind: isolate
        params: { throttle_fraction: 0.5, aggressor_selector: "tier=batch" }
      - kind: vertical
        params: { scale_factor: 1.5 }
    cooldown: "30s"
    priority: 100
  - name: sustained_high_p50
    when: "p50_now > 0.5 && persistence_h >= 3 && duration_above_hi_ms >= 2000"
    fire:
      - kind: horizontal
        params: { delta: 1 }
    cooldown: "60s"
    priority: 50
  - name: clean_state
    when: "p50_now < 0.2 && k_temporal < 0.0 && tail_now < 0.5"
    fire:
      - kind: restore
        params: { tier: all }
    cooldown: "60s"
    priority: 10

`restore` is a meta-action: it fans out to every actuator's Restore(),
which reads the mitigation/* annotations on the corresponding object and
reverses the most recent action.
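
The fan-out can be pictured as a loop over a shared interface. The sketch below is an assumption about shape only: `Restorer`, `fakeActuator`, and `restoreAll` are hypothetical names, and continuing past a failed tier (rather than stopping at the first error) is a design choice made for the example, not something the source confirms.

```go
package main

import (
	"errors"
	"fmt"
)

// Restorer is a hypothetical stand-in for the shared actuator interface.
type Restorer interface {
	Name() string
	Restore() error
}

// fakeActuator records whether Restore ran and optionally fails.
type fakeActuator struct {
	name string
	err  error
	done bool
}

func (f *fakeActuator) Name() string { return f.name }

func (f *fakeActuator) Restore() error {
	f.done = true
	return f.err
}

// newFake builds an actuator for the example; fail makes Restore return an error.
func newFake(name string, fail bool) *fakeActuator {
	f := &fakeActuator{name: name}
	if fail {
		f.err = errors.New("restore failed")
	}
	return f
}

// restoreAll calls Restore on every actuator. It keeps going after a
// failure and returns all failures joined, or nil if every tier succeeded.
func restoreAll(acts []Restorer) error {
	var errs []error
	for _, a := range acts {
		if err := a.Restore(); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", a.Name(), err))
		}
	}
	return errors.Join(errs...)
}

func main() {
	acts := []Restorer{
		newFake("isolate", false),
		newFake("harvest", true),
		newFake("vertical", false),
	}
	fmt.Println(restoreAll(acts))
}
```
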
All feature fields are top-level identifiers (no wrapper object). Match the
field names in features.FeatureVector:

| Identifier | Type | Meaning |
|---|---|---|
| target | string | victim service name |
| pod | string | victim pod name |
| p50_now, tail_now | double | latest p50_trend_pred / tail_trend_label |
| p50_h, tail_h | list(double) | multi-horizon arrays (empty under a single-horizon predictor) |
| horizon_ms | list(int) | parallel array of horizon offsets |
| k_spatial | double | least-squares slope of p50_h vs horizon_ms |
| accel_spatial | double | mean second-difference of p50_h |
| p50_max_horizon_ms | int | argmax horizon |
| persistence_h | int | count of p50_h entries >= HI_THRESHOLD |
| k_temporal | double | least-squares slope of p50 over the rolling window (per second) |
| accel_temporal | double | mean second-difference over the window |
| variance | double | sample variance over the window |
| duration_above_hi_ms | int | length of the most recent contiguous run above HI_THRESHOLD |
| window_size | int | samples currently in the rolling window |
| has_spatial | bool | true iff the latest event populated p50_horizons |
| model_version | string | latest event's model_version |
| source_kind | string | latest event's source_kind ("onnx" / "formula" / ...) |

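
As a rough illustration of how three of these features could be computed, the sketch below implements a least-squares slope, a mean second-difference, and a trailing-run duration. The function names are hypothetical, and two details are assumptions: degenerate inputs return 0, and "most recent contiguous run" is read as the run that ends at the latest sample, with "above" taken as >= the threshold.

```go
package main

import "fmt"

// slope returns the ordinary least-squares slope of y against x.
// Assumption: fewer than 2 points, mismatched lengths, or constant x yield 0.
func slope(x, y []float64) float64 {
	n := len(x)
	if n < 2 || n != len(y) {
		return 0
	}
	var sx, sy float64
	for i := range x {
		sx += x[i]
		sy += y[i]
	}
	mx, my := sx/float64(n), sy/float64(n)
	var num, den float64
	for i := range x {
		dx := x[i] - mx
		num += dx * (y[i] - my)
		den += dx * dx
	}
	if den == 0 {
		return 0
	}
	return num / den
}

// meanSecondDiff averages y[i+1] - 2*y[i] + y[i-1] over interior points.
func meanSecondDiff(y []float64) float64 {
	if len(y) < 3 {
		return 0
	}
	var sum float64
	for i := 1; i < len(y)-1; i++ {
		sum += y[i+1] - 2*y[i] + y[i-1]
	}
	return sum / float64(len(y)-2)
}

// durationAboveHiMs counts trailing samples >= hi and converts to ms.
func durationAboveHiMs(window []float64, hi float64, tickMs int) int {
	run := 0
	for i := len(window) - 1; i >= 0 && window[i] >= hi; i-- {
		run++
	}
	return run * tickMs
}

func main() {
	fmt.Println(slope([]float64{0, 1, 2}, []float64{1, 3, 5}))                             // 2
	fmt.Println(meanSecondDiff([]float64{0, 1, 4, 9}))                                      // 2
	fmt.Println(durationAboveHiMs([]float64{0.2, 0.6, 0.7, 0.3, 0.8, 0.9}, 0.5, 100)) // 200
}
```
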
Two helper functions are registered:

- `band(score, lo, hi) string` -> "up" / "down" / "stable"
- `count_at_least(list, threshold) int` -> count of list entries >= threshold

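
In plain Go, the two helpers might look like the sketch below. The `count_at_least` semantics follow directly from the description above; the `band` mapping is an assumption (score at or above `hi` is "up", at or below `lo` is "down", otherwise "stable"), and the real helper is backed by the stateful thresholder in pkg/thresholder, which this stateless version does not model.

```go
package main

import "fmt"

// band classifies a score against a LO/HI pair.
// Assumption: >= hi is "up", <= lo is "down", anything between is "stable".
func band(score, lo, hi float64) string {
	switch {
	case score >= hi:
		return "up"
	case score <= lo:
		return "down"
	default:
		return "stable"
	}
}

// countAtLeast returns how many entries of xs are >= threshold.
func countAtLeast(xs []float64, threshold float64) int {
	n := 0
	for _, x := range xs {
		if x >= threshold {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(band(0.6, 0.2, 0.5))                          // up
	fmt.Println(band(0.3, 0.2, 0.5))                          // stable
	fmt.Println(countAtLeast([]float64{0.4, 0.5, 0.9}, 0.5)) // 2
}
```
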

| kind | params |
|---|---|
| isolate | aggressor_selector (req), and either throttle_fraction (default 0.5, one-shot) or absolute-cap mode: cap_cores / cpu_max_quota_us (+ period_us, min_quota_us). Optional aggressor_namespace. |
| harvest | be_selector (req), harvest_cores (req, cores to lend on top of baseline). Optional be_namespace, period_us, max_quota_us. |
| vertical | scale_factor (multiplicative) or target_cpu (absolute, e.g. "750m"). Clamped to MIN_CPU/MAX_CPU. |
| horizontal | exactly one of delta (additive) or ensure_min (idempotent floor); optional min_replicas/max_replicas. |
| restore | meta-kind; fans out to every actuator's Restore(). |

The absolute-cap mode on isolate and the harvest kind are the actuation
surfaces the simulation's isolating (cap) and harvesting (h) controllers
drive; see plan.md.
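
To make the `cpu.max` surface concrete, here is a minimal sketch of applying `throttle_fraction` to a cgroup v2 `cpu.max` value, whose format is `<quota|max> <period>` in microseconds. The function names are hypothetical, and two behaviours are assumptions rather than documented controller behaviour: the new quota is the old quota multiplied by the fraction, and an unlimited (`max`) pod is treated as owning `nodeCores` full CPUs before throttling.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUMax parses "<quota|max> <period>"; quota -1 means unlimited.
func parseCPUMax(s string) (quota, period int64, err error) {
	f := strings.Fields(s)
	if len(f) != 2 {
		return 0, 0, fmt.Errorf("cpu.max: want 2 fields, got %d", len(f))
	}
	period, err = strconv.ParseInt(f[1], 10, 64)
	if err != nil || period <= 0 {
		return 0, 0, fmt.Errorf("cpu.max: bad period %q", f[1])
	}
	if f[0] == "max" {
		return -1, period, nil
	}
	quota, err = strconv.ParseInt(f[0], 10, 64)
	if err != nil || quota <= 0 {
		return 0, 0, fmt.Errorf("cpu.max: bad quota %q", f[0])
	}
	return quota, period, nil
}

// throttled returns the cpu.max line after scaling the quota by fraction,
// floored at minQuotaUs. Assumption: "max" is read as nodeCores * period.
func throttled(current string, fraction float64, nodeCores int, minQuotaUs int64) (string, error) {
	if fraction <= 0 || fraction > 1 {
		return "", fmt.Errorf("fraction %v outside (0, 1]", fraction)
	}
	quota, period, err := parseCPUMax(current)
	if err != nil {
		return "", err
	}
	if quota < 0 {
		quota = int64(nodeCores) * period
	}
	next := int64(float64(quota) * fraction)
	if next < minQuotaUs {
		next = minQuotaUs
	}
	return fmt.Sprintf("%d %d", next, period), nil
}

func main() {
	fmt.Println(throttled("200000 100000", 0.5, 4, 1000)) // 100000 100000 <nil>
	fmt.Println(throttled("max 100000", 0.5, 4, 1000))    // 200000 100000 <nil>
}
```
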

- Edit `data.policy.yaml` in the ConfigMap.
- Apply: `kubectl apply -f deploy/controller/configmap.yaml`.
- The kubelet remounts the volume; `fsnotify` in `pkg/policy/loader.go` triggers `engine.Reload` within ~1s. Look for `policy reloaded` in the controller logs.

A typo in a CEL expression is rejected by engine.Reload and the previous
rules stay live -- the controller never goes silent on a bad rule.
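
The "previous rules stay live" behaviour is the usual compile-then-swap pattern. The sketch below shows that pattern under stated assumptions: `compile` is a stand-in validator (non-empty expression with balanced parentheses) rather than real CEL compilation, and the `engine`, `rule`, and `ruleSet` types are hypothetical, not the ones in pkg/policy.

```go
package main

import (
	"fmt"
	"strings"
	"sync/atomic"
)

type rule struct{ Name, When string }

type ruleSet struct{ rules []rule }

// compile is a stand-in for CEL compilation: it only rejects empty
// expressions and unbalanced parentheses.
func compile(rules []rule) (*ruleSet, error) {
	for _, r := range rules {
		if strings.TrimSpace(r.When) == "" ||
			strings.Count(r.When, "(") != strings.Count(r.When, ")") {
			return nil, fmt.Errorf("rule %q: invalid expression %q", r.Name, r.When)
		}
	}
	return &ruleSet{rules: rules}, nil
}

// engine holds the live rule set behind an atomic pointer so readers
// never observe a half-loaded policy.
type engine struct{ live atomic.Pointer[ruleSet] }

// Reload swaps in the new rules only if all of them compile; on error
// the previously stored rule set is left untouched.
func (e *engine) Reload(rules []rule) error {
	rs, err := compile(rules)
	if err != nil {
		return err
	}
	e.live.Store(rs)
	return nil
}

// ruleCount reports how many rules are currently live.
func (e *engine) ruleCount() int {
	if rs := e.live.Load(); rs != nil {
		return len(rs.rules)
	}
	return 0
}

func main() {
	var e engine
	fmt.Println(e.Reload([]rule{{"clean_state", "p50_now < 0.2"}}), e.ruleCount())
	fmt.Println(e.Reload([]rule{{"typo", "(p50_now > 0.5"}}), e.ruleCount())
}
```
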

| Env var | Default | Meaning |
|---|---|---|
| TICK_MS | 100 | per-pod policy evaluation cadence |
| STALE_MS | 1500 | a snapshot older than this is treated as missing |
| WINDOW_SIZE | 20 | rolling-window samples (~2 s at 100 ms cadence) |
| HI_THRESHOLD | 0.5 | what counts as "elevated" for PersistenceH / DurationAboveHiMs |
| MIN_CPU / MAX_CPU | 200m / 4 | vertical resize clamp |
| HORIZONTAL_COOLDOWN_SEC | 30 | cross-node Deployment scale gate |
| TARGETS_CONFIG | /etc/mitigation/targets.yaml | mounted from the ConfigMap |
| POLICY_CONFIG | /etc/mitigation/policy.yaml | same |
| NODE_NAME | (none) | required; injected via fieldRef: spec.nodeName |

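
A loader for a few of these variables could look like the sketch below. It is an illustration under assumptions, not the binary's actual startup code: the `config` struct, `lookupFunc`, `intOr`, and `loadConfig` are hypothetical names, only a subset of the variables is covered, and an empty value is treated the same as an unset one.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// config covers a subset of the table above.
type config struct {
	TickMs, StaleMs, WindowSize int
	HiThreshold                 float64
	NodeName                    string
}

// lookupFunc matches os.LookupEnv so tests can inject a fake environment.
type lookupFunc func(string) (string, bool)

// intOr reads an integer variable, falling back to def when unset or empty.
func intOr(lookup lookupFunc, key string, def int) (int, error) {
	raw, ok := lookup(key)
	if !ok || raw == "" {
		return def, nil
	}
	v, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("%s=%q: %w", key, raw, err)
	}
	return v, nil
}

// loadConfig applies the documented defaults and requires NODE_NAME.
func loadConfig(lookup lookupFunc) (config, error) {
	var c config
	var err error
	if c.TickMs, err = intOr(lookup, "TICK_MS", 100); err != nil {
		return c, err
	}
	if c.StaleMs, err = intOr(lookup, "STALE_MS", 1500); err != nil {
		return c, err
	}
	if c.WindowSize, err = intOr(lookup, "WINDOW_SIZE", 20); err != nil {
		return c, err
	}
	c.HiThreshold = 0.5
	if raw, ok := lookup("HI_THRESHOLD"); ok && raw != "" {
		if c.HiThreshold, err = strconv.ParseFloat(raw, 64); err != nil {
			return c, fmt.Errorf("HI_THRESHOLD=%q: %w", raw, err)
		}
	}
	c.NodeName, _ = lookup("NODE_NAME")
	if c.NodeName == "" {
		return c, fmt.Errorf("NODE_NAME is required")
	}
	return c, nil
}

func main() {
	cfg, err := loadConfig(os.LookupEnv)
	fmt.Println(cfg, err)
}
```
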
Prerequisite: K8s >= 1.35 (in-place pod resize GA -- see
https://kubernetes.io/blog/2025/12/19/kubernetes-v1-35-in-place-pod-resize-ga/),
cgroup v2 on every node, and the
pod-security.kubernetes.io/enforce=privileged namespace label is honoured
(see deploy/controller/namespace.yaml).
kubectl apply -f deploy/victim-sample/namespace.yaml
kubectl apply -f deploy/victim-sample/search.yaml
kubectl apply -f deploy/victim-sample/profile.yaml

Replace the placeholder image: REGISTRY/...:tag lines with your real
images. The fields that matter for mitigations to work: a named score port
7900, resources.requests == resources.limits, and a resizePolicy entry for
cpu with restartPolicy: NotRequired.
Automated (recommended): build-push-deploy.sh
does build → push → manifest rewrite → apply → rollout in one shot:
./build-push-deploy.sh --node=node-3 # build, push to docclabgroup, deploy pinned to node-3
./build-push-deploy.sh --tag=v2 --node=node-3 # custom tag
./build-push-deploy.sh --no-build --node=node-3 # redeploy the current pushed image
./build-push-deploy.sh --help # all options (registry, pull-policy, no-push, ...)

It renders a temp copy of deploy/controller/daemonset.yaml with the
image/pull-policy set (and pins the DaemonSet to --node via nodeSelector),
applies all four manifests, and waits for rollout. The tracked manifest is
left untouched, so git pull never conflicts. The manual steps below are the
same thing unpacked.
Turn it off + revert: mitigation-off.sh stops the
controller and undoes anything it changed — scales Deployments back to their
baseline replicas and recreates any pods whose cpu.max was modified
(isolate/harvest) so they restart clean:
./mitigation-off.sh --dry-run # preview what would be reverted
./mitigation-off.sh # stop + revert
./mitigation-off.sh --purge # also delete the namespace + cluster RBAC

First make the image reachable by every node (the DaemonSet runs one pod
per node). make docker-controller builds
simple-mitigation/mitigation-controller:dev into the local image store of the
node you built on; the others need it too. Check your runtime with
kubectl get nodes -o wide (CONTAINER-RUNTIME column):
# Save once on the build node:
docker save simple-mitigation/mitigation-controller:dev -o /tmp/mc.tar
# --- containerd runtime: import into the k8s.io namespace on each node ---
sudo ctr -n k8s.io images import /tmp/mc.tar
sudo ctr -n k8s.io images ls | grep mitigation
# --- docker runtime: load on each node (e.g. fan out over SSH) ---
for n in node-1 node-2 node-3 node-4; do
scp /tmp/mc.tar "$n:/tmp/mc.tar" && ssh "$n" 'docker load -i /tmp/mc.tar'
done

(For a real registry instead, set the image: in daemonset.yaml to
<registry>/simple-mitigation/mitigation-controller:<tag> and docker push;
then no per-node loading is needed.)

Version note: this design targets K8s >= 1.35 (in-place
pods/resize GA). On older clusters the controller still runs and isolate/harvest/horizontal work, but the vertical actuator needs pods/resize (alpha 1.27, beta 1.33, GA 1.35) and will error there; drop the vertical fire from the policy on pre-1.33 clusters.

Then apply, in order:
kubectl apply -f deploy/controller/namespace.yaml
kubectl apply -f deploy/controller/rbac.yaml
kubectl apply -f deploy/controller/configmap.yaml
kubectl apply -f deploy/controller/daemonset.yaml
kubectl -n mitigation-system rollout status ds/mitigation-controller
kubectl -n mitigation-system logs -l app.kubernetes.io/name=mitigation-controller --tail=30

Adding a victim service later is a single ConfigMap edit:

kubectl -n mitigation-system edit cm mitigation-controller-config
# Policy/targets reload via fsnotify within ~1s; no rollout needed.

Every action stamps annotations on its target before the actual write so
Reconcile() at startup can find and complete an interrupted apply:

| Target | Annotation keys |
|---|---|
| Aggressor Pod | mitigation/cpu-max-original, mitigation/cpu-max-set-by-node, mitigation/cpu-max-set-at |
| Best-effort Pod | mitigation/harvest-cpu-max-original, mitigation/harvest-set-by-node, mitigation/harvest-set-at |
| Victim Pod | mitigation/cpu-limit-baseline |
| Victim Deployment | mitigation/horizontal-last-scaled-at, mitigation/horizontal-baseline-replicas |
No extra storage backend (etcd, Redis, the controller's own CRD) is needed; the API server is the source of truth.
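
The value of stamping before writing shows up when a crash lands between the two steps. The sketch below uses in-memory stand-ins (a hypothetical `pod` struct instead of the API server and cgroup filesystem) and does not model Reconcile() itself; it only illustrates that the original `cpu.max` stays recoverable whether or not the write happened. Keeping the first baseline on repeated applies is an assumption of the example.

```go
package main

import "fmt"

// Annotation key taken from the table above.
const origKey = "mitigation/cpu-max-original"

// pod is an in-memory stand-in: annotations live on the API server in
// reality, and cpuMax lives in the pod's cgroup.
type pod struct {
	annotations map[string]string
	cpuMax      string
}

// apply stamps the original value first, then writes the new one.
// crashBeforeWrite simulates the process dying between the two steps.
func apply(p *pod, newMax string, crashBeforeWrite bool) {
	if _, ok := p.annotations[origKey]; !ok {
		// Assumption: keep the first baseline, never an already-throttled value.
		p.annotations[origKey] = p.cpuMax
	}
	if crashBeforeWrite {
		return
	}
	p.cpuMax = newMax
}

// restore puts back the stamped original and clears the annotation.
// It is safe to call whether or not the write ever landed.
func restore(p *pod) {
	orig, ok := p.annotations[origKey]
	if !ok {
		return
	}
	p.cpuMax = orig
	delete(p.annotations, origKey)
}

func main() {
	p := &pod{annotations: map[string]string{}, cpuMax: "max 100000"}
	apply(p, "50000 100000", false)
	apply(p, "25000 100000", false) // second throttle keeps the first baseline
	fmt.Println(p.cpuMax, "| original:", p.annotations[origKey])
	restore(p)
	fmt.Println(p.cpuMax, "| annotations:", len(p.annotations))
}
```
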
Matches the path used during development; no controllers needed.
# terminal 1
kubectl -n hotelres port-forward pod/search-<id> 7900:7900
# terminal 2
grpcurl -plaintext -d '{}' localhost:7900 \
gordion.contention.ContentionStream/Subscribe

You should see a stream of ScoreEvent JSON objects at roughly 10 Hz, now
including p50_horizons / tail_horizons / horizon_ms once the
predictor side ships the matching change.
Logs are JSON, written with log/slog to stderr. Every action emits a single line with
rule, kind, pod, node, applied, reason, before, after, and
err on failure. There is no Prometheus exporter yet; that is deliberately out of scope.
The module path is github.com/coding-workspace/simple-mitigation-1. To
change it (e.g. to your real GitHub org):
OLD=github.com/coding-workspace/simple-mitigation-1
NEW=github.com/your-org/your-repo
grep -rl "$OLD" . --include="*.go" --include="*.proto" --include="Makefile" \
| xargs sed -i "s|$OLD|$NEW|g"
go mod edit -module "$NEW"
make proto && go mod tidy