172 changes: 156 additions & 16 deletions framework/dev/k8s/README.md
@@ -1,9 +1,33 @@
# Local k8s Launch-Path Harness

This dev-only harness builds local Flower runtime images, configures a k3d
cluster, starts SuperLink and SuperExec, seeds one ServerApp run through the
Control API, and verifies that the Kubernetes executor creates a TaskExecutor
Pod that reaches `Succeeded`.
cluster, starts SuperLink and SuperExec, seeds deterministic ServerApp runs
through the Control API, and verifies that the Kubernetes executor creates
TaskExecutor Pods that reach `Succeeded`.

It has two main modes:

- the default one-task launch-path proof; and
- the `--capacity-cleanup-proof` mode, which sets the active Pod budget to `1`,
  seeds two tasks, observes SuperExec waiting for capacity, and verifies that
  the completed TaskExecutor Pod and its Secret are cleaned up before the broad
  namespace cleanup.

## Prerequisites

Run commands from the repository root. The wrapper expects these tools on
`PATH`:

- `docker`;
- `k3d`;
- `kubectl`;
- `uv`;
- `python`.

The Docker daemon must already be running. If `--skip-build` is used, the local
runtime images selected by the wrapper must already exist and be importable into
k3d.
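
The tool list above can be checked before starting a run. The following is a minimal sketch, not part of the harness: `missing_tools` is a hypothetical helper, and the `which` parameter exists only so the lookup can be replaced in tests. It uses the standard-library `shutil.which` and does not check whether the Docker daemon is running.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Tool names copied from the prerequisites list above.
REQUIRED_TOOLS = ("docker", "k3d", "kubectl", "uv", "python")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be resolved on PATH, in input order."""
    return [tool for tool in tools if which(tool) is None]
```

Calling `missing_tools()` with no arguments checks the real `PATH`; an empty list means every listed tool was found.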

## Quick Runs

Run the full local smoke test from the repository root:

@@ -17,11 +41,47 @@ To reuse previously built images:
./framework/dev/k8s/test-real-launch-path.sh --skip-build
```

To run the budget-1/two-task capacity and cleanup proof:

```bash
output_dir=/private/tmp/f7d-v2-capacity-cleanup-proof-$(date +%Y%m%d-%H%M%S)
./framework/dev/k8s/test-real-launch-path.sh \
--capacity-cleanup-proof \
--output-dir "${output_dir}"
```

To verify the saved capacity evidence manually after the wrapper finishes:

```bash
python framework/dev/k8s/verify_evidence.py "${output_dir}" \
--expected-result local-k8s-capacity-cleanup-proof
```

`/private/tmp` is only an example local scratch location. For handoff or review,
choose a durable writable directory, or archive the completed evidence directory
after saving the verifier report.

The wrapper prints verifier output to stdout. To make the verifier report part
of an evidence bundle for review, rerun the verifier and save the output:

```bash
python framework/dev/k8s/verify_evidence.py "${output_dir}" \
--expected-result local-k8s-capacity-cleanup-proof \
> "${output_dir}/diagnostics/verifier-output.txt"
```

The wrapper deletes the test namespace by default. To inspect resources after a
run:

```bash
./framework/dev/k8s/test-real-launch-path.sh --skip-cleanup
output_dir=/private/tmp/f7d-v2-capacity-cleanup-proof-live-$(date +%Y%m%d-%H%M%S)
./framework/dev/k8s/test-real-launch-path.sh \
--capacity-cleanup-proof \
--skip-cleanup \
--output-dir "${output_dir}"
python framework/dev/k8s/verify_evidence.py "${output_dir}" \
--expected-result local-k8s-capacity-cleanup-proof \
--no-require-cleanup
```
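
The verifier invocations shown above differ only in two optional flags. As a sketch under that assumption, the argument list can be assembled in Python; `verifier_argv` is a hypothetical helper, and only the script path and the flags `--expected-result` and `--no-require-cleanup` are taken from the commands above.

```python
from typing import List, Optional

# Script path as used in the commands above, relative to the repository root.
VERIFIER = "framework/dev/k8s/verify_evidence.py"


def verifier_argv(
    output_dir: str,
    expected_result: Optional[str] = None,
    require_cleanup: bool = True,
) -> List[str]:
    """Build the verifier command line for an evidence directory."""
    argv = ["python", VERIFIER, output_dir]
    if expected_result is not None:
        argv += ["--expected-result", expected_result]
    if not require_cleanup:
        # Used for bundles produced by a run with --skip-cleanup.
        argv.append("--no-require-cleanup")
    return argv
```

The returned list can be passed to `subprocess.run(argv, check=True)` from the repository root.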

## Defaults
@@ -32,7 +92,12 @@ run:
| Namespace | `flower-local-k8s` |
| Seed Job | `flower-local-k8s-seed-run` |
| Executor ConfigMap | `flower-local-k8s-executor-config` |
| Result | `local-k8s-launch-path` |
| Default result | `local-k8s-launch-path` |
| Capacity-proof result | `local-k8s-capacity-cleanup-proof` |
| Default seeded runs | `1` |
| Capacity-proof seeded runs | `2` |
| Capacity-proof active Pod budget | `1` |
| Capacity-proof probe hold | `5.0` seconds |
| ServerApp marker | `K8s launch probe ServerApp ran` |

## Output
@@ -61,12 +126,26 @@ Evidence is written under the selected output directory:
| `taskexecutor-secrets.redacted.json` | Redacted per-task Secret evidence with key names and byte lengths. |
| `final-state.json` | Pre-cleanup resource counts and object summaries for the run selectors. |
| `proof-checklist.json` | Reviewer-facing map from claims to artifact fields, with out-of-scope claims. |
| `harness.log` | Short harness result log. |
| `sanitized-config.yaml` | Sanitized copy of the selected harness profile. |
| `objects/capacity-blocked-pods.json` | Capacity-proof snapshot of the first active TaskExecutor Pod. |
| `objects/executor-config.yaml` | Rendered Kubernetes executor config, including capacity settings. |
| `objects/executor-config.json` | JSON form of the rendered Kubernetes executor config. |
| `objects/secrets-before-cleanup.redacted.json` | Capacity-proof redacted Secret snapshot before executor cleanup. |
| `objects/cleanup-pods.json` | Capacity-proof TaskExecutor Pod state after capacity opens. |
| `objects/secrets-after-cleanup.redacted.json` | Capacity-proof redacted Secret snapshot after executor cleanup. |
| `objects/real-launch.yaml` | Rendered SuperLink, executor config, and SuperExec objects. |
| `objects/seed-job.yaml` | Rendered seed ConfigMap and Job. |
| `objects/pods.json` | Observed TaskExecutor Pod list and phases. |
| `diagnostics/commands.txt` | Planned or executed host commands. |
| `diagnostics/failures.txt` | Failure messages when the harness records failures. |
| `diagnostics/image-preflight.json` | Docker image inspection and k3d import plan/results. |
| `diagnostics/image-preflight.txt` | Docker image inspection and k3d import command output. |
| `diagnostics/cleanup.json` | Namespace cleanup command plan/results. |
| `diagnostics/superexec-logs.txt` | Captured SuperExec logs used for claim and capacity-wait evidence. |
| `diagnostics/taskexecutor-logs.txt` | Captured TaskExecutor logs. |
| `diagnostics/cleanup.txt` | Cleanup defaults and the namespace delete command. |
| `diagnostics/verifier-output.txt` | Optional saved verifier report when rerun manually with shell redirection. |
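
Before reading individual artifacts, it can help to confirm that the expected files exist. The sketch below is illustrative only: `missing_artifacts` and `SAMPLE_ARTIFACTS` are hypothetical names, and the tuple is a subset of the relative paths in the table above, not the verifier's own list.

```python
from pathlib import Path
from typing import Iterable, List

# A subset of the relative paths listed in the table above.
SAMPLE_ARTIFACTS = (
    "objects/capacity-blocked-pods.json",
    "objects/executor-config.yaml",
    "objects/secrets-before-cleanup.redacted.json",
    "objects/cleanup-pods.json",
    "objects/secrets-after-cleanup.redacted.json",
    "diagnostics/superexec-logs.txt",
)


def missing_artifacts(
    output_dir: str, expected: Iterable[str] = SAMPLE_ARTIFACTS
) -> List[str]:
    """Return the expected relative paths that are not regular files."""
    root = Path(output_dir)
    return [rel for rel in expected if not (root / rel).is_file()]
```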

## How The Evidence Proves Correctness

@@ -78,7 +157,8 @@ map in machine-readable form.

Open `invocation.json` and check:

- `mode` is `local-k8s-launch-path`;
- `mode` is `local-k8s-launch-path` for the default proof or
`local-k8s-capacity-cleanup-proof` for the capacity cleanup proof;
- `dry_run` is `false`;
- `repo.branch` and `repo.sha` match the checkout under review;
- `equivalent_argv` shows the harness mode, output directory, namespace,
@@ -92,12 +172,16 @@ map in machine-readable form.
contain the executor config used to render TaskExecutor Pods, including the
namespace, image, resource pool, and harness-run label.

3. Confirm one deterministic ServerApp task was seeded through AppIo.
3. Confirm deterministic ServerApp tasks were seeded through AppIo.

Open `objects/seed-job.yaml` and check that the Job runs
`/opt/flower-local-k8s/seed_run.py` against the SuperLink Control API.
Then check `summary.json` and `task-lineage.json`: `seed_run_id` and
`seeded_run_id` should be present and should match.
Then check `summary.json` and `task-lineage.json`.

For the default proof, `seed_run_id` and `seeded_run_id` should be present
and should match. For the capacity cleanup proof, `summary.json` should list
two `seed_run_ids`, `task-lineage.json` should list the same
`seeded_run_ids`, and `seeded_task_count` should be `2`.

4. Confirm the Kubernetes executor created the TaskExecutor Pod.

@@ -133,8 +217,8 @@ map in machine-readable form.
Open `final-state.json`. It records the Pod, Secret, Job, Service, and
Namespace observation commands plus resource counts before namespace
deletion. This proves what remained at the end of the proof stage. It does
not claim executor-owned completed Pod or Secret cleanup; that is deliberately
out of scope for this slice.
not claim completed Pod or Secret cleanup for the default one-task proof;
that cleanup assertion belongs to the capacity cleanup proof.

8. Confirm the verifier accepted the bundle.

@@ -144,6 +228,31 @@ map in machine-readable form.
counts, a `Succeeded` phase, and a successful cleanup command when cleanup
was required.
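
The seed-ID comparison from step 3 can be sketched as a small check over the two parsed JSON documents. This is an assumption-heavy illustration, not the verifier's logic: it assumes the named fields are top-level keys, that the `seed_*` fields live in `summary.json`, and that the `seeded_*` fields and `seeded_task_count` live in `task-lineage.json`. `seed_ids_match` is a hypothetical helper.

```python
from typing import Any, Mapping


def seed_ids_match(summary: Mapping[str, Any], lineage: Mapping[str, Any]) -> bool:
    """Compare seeded run IDs between summary.json and task-lineage.json.

    Assumed layout (not confirmed by the README): all fields are top-level keys.
    """
    if "seed_run_ids" in summary:
        # Capacity cleanup proof: two IDs, same set in both files, count of 2.
        seed_ids = summary.get("seed_run_ids") or []
        seeded_ids = lineage.get("seeded_run_ids") or []
        return (
            len(seed_ids) == 2
            and sorted(seed_ids) == sorted(seeded_ids)
            and lineage.get("seeded_task_count") == 2
        )
    # Default proof: a single ID that is present and equal in both files.
    seed_id = summary.get("seed_run_id")
    return seed_id is not None and seed_id == lineage.get("seeded_run_id")
```

Both arguments would typically come from `json.loads` on the file contents. If the real files nest these fields, the lookups need to be adjusted.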

For `--capacity-cleanup-proof`, additionally confirm:

1. `objects/executor-config.yaml` sets `active-pod-budget: 1`.
2. `summary.json` lists two `seed_run_ids`, and `task-lineage.json` records at
   least two observed TaskExecutor task records.
3. `events.jsonl` has a passing `capacity.wait_observed` event. Also open
   `diagnostics/superexec-logs.txt` and confirm that it contains the SuperExec
   wait marker `waiting for kubernetes taskexecutor capacity`. Lines that
   mention only `active pods` and `budget` are useful context, but they are not
   sufficient evidence on their own.
4. `summary.json` has `cleanup_observed.observed: true`, removed Pod and Secret
   names for the completed task, and at least one remaining/new TaskExecutor Pod
   after cleanup. Removed and remaining Pod names should be disjoint.
5. `objects/cleanup-pods.json` and
   `objects/secrets-after-cleanup.redacted.json` show the post-wait selector
   state before broad namespace cleanup.
6. The capacity verifier report identifies the result as
   `local-k8s-capacity-cleanup-proof`, shows `Task lineage records: 2`, shows
   `Capacity wait observed: True`, includes non-empty removed Pod/Secret lines,
   and ends with `Verification: PASSED`.

   In this mode, `TaskExecutor Pods: 1` in the verifier report is expected
   after cleanup: it is the remaining/new TaskExecutor Pod. The two-task
   evidence comes from `Task lineage records: 2` and `task-lineage.json`.
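
Two of the capacity checks above are simple enough to sketch directly: the wait marker in the SuperExec log and the disjointness of removed and remaining Pod names. The helper names are hypothetical, the marker string is copied from the list above and matched case-sensitively, and how the Pod names are extracted from `summary.json` is left to the caller because the README does not give the key names.

```python
from typing import Iterable

# Marker string as quoted in the checklist above.
WAIT_MARKER = "waiting for kubernetes taskexecutor capacity"


def capacity_wait_logged(superexec_log: str) -> bool:
    """Return True if any log line contains the specific wait marker."""
    return any(WAIT_MARKER in line for line in superexec_log.splitlines())


def cleanup_names_consistent(
    removed_pods: Iterable[str], remaining_pods: Iterable[str]
) -> bool:
    """Require at least one removed Pod, one remaining Pod, and no overlap."""
    removed, remaining = set(removed_pods), set(remaining_pods)
    return bool(removed) and bool(remaining) and removed.isdisjoint(remaining)
```

A log that mentions only `active pods` and `budget` does not satisfy `capacity_wait_logged`, which matches the rule that those lines are context rather than proof.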

## What Is Tested

| Area | Tested | Notes |
@@ -156,34 +265,65 @@ map in machine-readable form.
| TaskExecutor Pod creation | Yes | Polls for a Pod matching the run selector before failing. |
| TaskExecutor terminal phase | Yes | Waits for observed TaskExecutor Pods to reach `Succeeded`. |
| ServerApp execution marker | Yes | Verifies `K8s launch probe ServerApp ran` in TaskExecutor logs. |
| Capacity wait | Optional | `--capacity-cleanup-proof` seeds two runs with active Pod budget `1` and requires SuperExec wait evidence. |
| Sweeper cleanup | Optional | `--capacity-cleanup-proof` requires the first completed TaskExecutor Pod and Secret to be removed before namespace cleanup. |
| Wrapper cleanup | Yes | Default wrapper behavior deletes the namespace and verifies cleanup evidence. |

## Out Of Scope

| Area | Tested | Notes |
| --- | --- | --- |
| Capacity waiting | No | No capacity queue or resource-pool wait behavior is asserted. |
| Sweeper cleanup | No | No reconciler or orphan cleanup loop is validated. |
| Executor-owned Pod deletion | No | Namespace cleanup removes resources; executor deletion behavior is not proven. |
| Executor-owned Secret deletion | No | Secret RBAC is checked, but per-task Secret lifecycle is not asserted. |
| Cardinality proof | No | The capacity proof uses budget `1` and two tasks; budget `2`/three-task cardinality is a later slice. |
| AppIo result completion semantics | No | This slice observes launch and Pod success, not full result semantics. |
| ClientApp execution | No | The probe includes a minimal ClientApp file only because the FAB schema expects it. |
| TLS, CNI/NetworkPolicy, production RBAC | No | This is local/dev-only and uses insecure local AppIo. |
| Concurrency, retry, failure behavior | No | The harness starts one deterministic run. |
| Concurrency, retry, failure behavior | No | The default proof starts one deterministic run; the capacity proof starts two deterministic runs only to exercise budget waiting and cleanup. |

## Useful Commands

Inspect resources after `--skip-cleanup`:

```bash
kubectl --context k3d-flower-local-k8s get pods -n flower-local-k8s
kubectl --context k3d-flower-local-k8s get jobs,secrets -n flower-local-k8s
kubectl --context k3d-flower-local-k8s logs pod/flower-superlink -n flower-local-k8s
kubectl --context k3d-flower-local-k8s logs pod/flower-superexec -n flower-local-k8s
```

Verify an existing default launch-path bundle:

```bash
python framework/dev/k8s/verify_evidence.py "${output_dir}"
```

Verify an existing capacity cleanup bundle:

```bash
python framework/dev/k8s/verify_evidence.py "${output_dir}" \
--expected-result local-k8s-capacity-cleanup-proof
```

Verify a bundle from a run that used `--skip-cleanup`:

```bash
python framework/dev/k8s/verify_evidence.py "${output_dir}" \
--expected-result local-k8s-capacity-cleanup-proof \
--no-require-cleanup
```

Remove the namespace manually:

```bash
kubectl --context k3d-flower-local-k8s delete namespace flower-local-k8s \
--ignore-not-found=true --wait=true
```

If Docker was restarted and an existing local k3d cluster appears stale, recreate
only the local harness cluster and rerun:

```bash
k3d cluster delete flower-local-k8s
./framework/dev/k8s/test-real-launch-path.sh \
--capacity-cleanup-proof \
--output-dir "${output_dir}"
```
@@ -12,11 +12,20 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Probe ServerApp used by the local k8s launch-path harness."""

import time

import flwr as fl

app = fl.serverapp.ServerApp()
_PROBE_HOLD_SECONDS_CONFIG_KEY = "local-k8s.probe-hold-seconds"


@app.main()
def main(grid, context):
    """Run the probe ServerApp and optionally stay active for capacity tests."""
    print("K8s launch probe ServerApp ran")
    hold_seconds = context.run_config.get(_PROBE_HOLD_SECONDS_CONFIG_KEY, 0.0)
    if isinstance(hold_seconds, (float, int)) and hold_seconds > 0:
        time.sleep(float(hold_seconds))
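
The hold-seconds guard in the probe can be exercised without Flower or a cluster. The sketch below restates the same condition as a pure function; `resolve_hold_seconds` is a hypothetical helper that is not part of this change, and it takes a plain mapping in place of `context.run_config`.

```python
from typing import Any, Mapping

_PROBE_HOLD_SECONDS_CONFIG_KEY = "local-k8s.probe-hold-seconds"


def resolve_hold_seconds(run_config: Mapping[str, Any]) -> float:
    """Mirror the probe's guard: positive int/float values hold, all else is 0."""
    value = run_config.get(_PROBE_HOLD_SECONDS_CONFIG_KEY, 0.0)
    if isinstance(value, (float, int)) and value > 0:
        return float(value)
    return 0.0
```

One property worth knowing: `bool` is a subclass of `int` in Python, so a config value of `True` passes this guard and resolves to `1.0`. String values such as `"5"` are ignored.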
1 change: 1 addition & 0 deletions framework/dev/k8s/assets/probe_app/pyproject.toml
@@ -17,3 +17,4 @@ serverapp = "launch_probe.server_app:app"
clientapp = "launch_probe.client_app:app"

[tool.flwr.app.config]
local-k8s.probe-hold-seconds = 0.0
37 changes: 28 additions & 9 deletions framework/dev/k8s/assets/seed_run.py
@@ -12,41 +12,60 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Seed deterministic ServerApp runs for the local k8s harness."""

import argparse
import hashlib
from pathlib import Path

import grpc

from flwr.cli.build import build_fab_from_disk
from flwr.common.serde import fab_to_proto
from flwr.common.serde import fab_to_proto, scalar_to_proto
from flwr.proto.control_pb2 import StartRunRequest
from flwr.proto.control_pb2_grpc import ControlStub
from flwr.supercore.constant import NOOP_FEDERATION
from flwr.supercore.fab import Fab

_PROBE_APP_DIR = Path("/opt/flower-local-k8s/probe_app")
_PROBE_HOLD_SECONDS_CONFIG_KEY = "local-k8s.probe-hold-seconds"


def main() -> None:
    """Create one or more deterministic ServerApp runs through Control API."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--control-api-address", required=True)
    parser.add_argument("--run-count", type=int, default=1)
    parser.add_argument("--probe-hold-seconds", type=float, default=0.0)
    args = parser.parse_args()
    if args.run_count < 1:
        raise ValueError("--run-count must be at least 1")

    fab_bytes = build_fab_from_disk(_PROBE_APP_DIR)
    fab_hash = hashlib.sha256(fab_bytes).hexdigest()
    channel = grpc.insecure_channel(args.control_api_address)
    grpc.channel_ready_future(channel).result(timeout=60)
    stub = ControlStub(channel)
    response = stub.StartRun(
        StartRunRequest(
            fab=fab_to_proto(Fab(fab_hash, fab_bytes, {})),
            federation=NOOP_FEDERATION,
    override_config = {}
    if args.probe_hold_seconds > 0:
        override_config[_PROBE_HOLD_SECONDS_CONFIG_KEY] = scalar_to_proto(
            args.probe_hold_seconds
        )
    run_ids = []
    for _ in range(args.run_count):
        response = stub.StartRun(
            StartRunRequest(
                fab=fab_to_proto(Fab(fab_hash, fab_bytes, {})),
                override_config=override_config,
                federation=NOOP_FEDERATION,
            )
        )
    )
    if not response.HasField("run_id"):
        raise RuntimeError("Control API did not return a run_id")
    print(f"K8s launch seed created run_id={response.run_id}")
        if not response.HasField("run_id"):
            raise RuntimeError("Control API did not return a run_id")
        run_ids.append(response.run_id)
        print(f"K8s launch seed created run_id={response.run_id}")
    joined_run_ids = ",".join(str(run_id) for run_id in run_ids)
    print(f"K8s launch seed created run_ids={joined_run_ids}")


if __name__ == "__main__":
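
The seed script's final line joins all run IDs with commas, which makes it easy to recover them from captured Job logs. The following is a sketch of one way a consumer might parse that line. `parse_run_ids` is a hypothetical helper, the diff does not show how the harness actually reads this output, and the pattern assumes run IDs print as decimal integers.

```python
import re
from typing import List

# Matches the summary line printed once at the end of the seed script.
# Assumption: run IDs are rendered as decimal integers.
_RUN_IDS_LINE = re.compile(r"K8s launch seed created run_ids=(\d+(?:,\d+)*)\s*$")


def parse_run_ids(seed_output: str) -> List[int]:
    """Return the run IDs from the last matching summary line, or []."""
    for line in reversed(seed_output.splitlines()):
        match = _RUN_IDS_LINE.search(line)
        if match:
            return [int(part) for part in match.group(1).split(",")]
    return []
```

The per-run `run_id=` lines are deliberately not matched, so output from a run that failed before the summary line yields an empty list.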