Skip to content

ugglr/mac-shakedown

Repository files navigation

Shakedown

Verify a new Mac before your return window closes.

A verification harness for new Macs, Apple Silicon and Intel. Built for cases where you can't easily return the unit: bought abroad, narrow return window, expensive config, or you just don't want to discover a defect three months from now.

A single command runs the automated phases (system inventory, battery health, CPU variance benchmark, sustained thermal load) end-to-end. A separate runbook walks you through the manual phases (display dead-pixel test, hinge / keyboard / speaker / port inspection, Apple Diagnostics).

Why "shakedown"? Borrowed from engineering: a shakedown run is the first-run stress test of new machinery before it's commissioned. Same idea, applied to your new Mac.

Why this exists

Some Mac generations ship with batch-level defects that don't show up in a quick boot test. The 2026 M5 Max line, for example, had units showing up to 41.5% multi-core performance variance between identical benchmark runs. You can only see that by running repeated load tests on a thermally-saturated chassis. A 30-second smoke test on store Wi-Fi will not catch it.

Shakedown is the procedure for catching those.

Status (v0.1): the methodology has not yet been validated against a confirmed-defective unit. Thresholds are derived from public reports and engineering reasoning. Expect them to tighten as crowd-sourced submissions land. Treat current results as advisory, not authoritative; if a verdict is borderline, rerun before deciding.

Phase 4 uses parallel SHA-256, which is hardware-accelerated on Apple Silicon and Coffee Lake+ Intel. The variance methodology transfers cleanly to any sustained workload (the SHA choice is for zero-install portability), but the test doesn't probe integer pipelines or memory bandwidth as deeply as Cinebench / Geekbench would. A non-accelerated workload pass now ships as opt-in Phase 4b (./run --noaccel, BLAKE2b), and memory bandwidth has its own phase (Phase 12).

Quick start

  1. Install Xcode Command Line Tools (provides git, 5–10 min one-time; a GUI dialog pops up):

    xcode-select -p >/dev/null 2>&1 || xcode-select --install
  2. Clone and run.

    git clone https://github.com/ugglr/mac-shakedown ~/mac-shakedown && cd ~/mac-shakedown
    ./run

On a terminal with no flags, ./run opens a short guided picker (verify a new Mac vs quick check, plus strict and the AI load), builds the command, and runs it, so you don't have to remember flags. Pass any flag (or SHAKEDOWN_YES=1, or pipe it) and the picker is skipped, the run proceeds directly.

Either way, ./run auto-selects the matching preset from your hardware (chip, memory, model, screen size), so it bins against the right thresholds and baseline with nothing to type. Pass --target mbp-16-m5-max-64 (see targets/) to assert an expected SKU, or to override the match; if no preset matches, it auto-detects chassis class from system_profiler and skips the inventory asserts.

The orchestrator runs the automated phases (preflight → inventory → battery → race benchmark → SSD test → memory bandwidth → CPU variance → thermal load) end-to-end, asks for sudo once upfront (Phase 5 and the SSD page-cache drop need it), and writes a SCHEMA-compliant report to Reports/local/ plus a sanitized PR-able copy to Reports/submissions/. Opt-in flags add heavier passes: --noaccel (a non-accelerated BLAKE2b variance pass), --gpu (a Metal GPU compute pass), and --llama (clones and builds llama.cpp for a combined CPU+GPU+memory AI load). --store bundles the thorough profile for verifying a new unit. Runtime ~20 min on Intel, ~27 min on Air, ~47 min on MacBook Pro.

Verifying a new purchase, especially in a store you can't easily return to? Follow the Store Day Checklist, a top-to-bottom run-this-then-that guide for verifying your machine in one sitting at the store. The Benchmark Reference behind it has the sourced per-generation expected scores and live-lookup links, and Verification/scripts/compare-reports.sh reference.json yours.json diffs your unit against a known-good sibling.

./run --no-sudo skips the 10-min thermal phase (the only phase that needs sudo) for a half-runtime no-password variance-only pass.

Keep the charger plugged in for all phases.

For the manual phases (display test, hinge / keyboard / speaker / port inspection, Apple Diagnostics, optional 30-min idle drain), follow Verification/Runbook.md phases 6–9 by hand.

What a run looks like

See examples/sample-report-illustrative/ for an annotated example PASS report on a 16" M5 Max: Markdown render and the underlying JSON. (Illustrative, not a real run, replaced when crowd-sourced submissions land.)

What gets checked

Phase What How
0 Pre-flight & quiet-system check uptime, top
1 Hardware identity vs. target system_profiler + sysctlinventory.sh
2 Battery health (cycles, capacity, condition) ioreg AppleSmartBatterybattery.sh
3 Sensor & port inventory (uses Phase 1 output)
10 Race benchmark (cold) xz -9 -T<P> of a 200 MB random blob → race-bench.sh
11 SSD sequential read/write (cold) 2 GB incompressible random data, page cache dropped between → ssd-test.sh
12 Memory bandwidth (cold) multi-threaded STREAM triad (vendored C, clang at runtime) with a memmove fallback → memory-bandwidth.sh
4 CPU performance variance 5 s burst + chassis-class-aware warmup (300 s on Pro, 60 s on Air) + 5 × 60 s timed iterations, parallel SHA-256 → cpu-variance.sh
4b Non-accelerated CPU variance (opt-in --noaccel) same methodology with BLAKE2b (no crypto instruction, hits the integer pipelines) → cpu-variance.sh
5 Sustained thermal load (chassis-class-aware thresholds) 10 min continuous + powermetrics sampling → thermal-load.sh
13 GPU compute variance (opt-in --gpu) sustained Metal FMA load, compiled with swiftc at runtime → gpu-variance.sh
14 Combined CPU+GPU+memory (opt-in --llama) clones + builds llama.cpp, runs llama-bench (AI inference load) → llama-bench.sh
6 Display visual inspection fullscreen color cycle in Safari → display-test.sh
7 Manual physical inspection hinge, keyboard, speakers, ports, Touch ID, etc. (runbook checklist)
8 Apple Diagnostics reboot + Cmd-D
9 Optional idle drain 30 min sleep, measure %/30 min

Phases 10 (race), 11 (SSD), and 12 (memory bandwidth) run before the heavy phases so the chassis is cold. They produce calibration numbers (verdict "info"); pass/fail thresholds land in schema v0.3 once the submission corpus is non-empty (Phase 11 carries a conservative warn-only floor in the meantime). Unlike Phase 4 (SHA-256, hardware-accelerated), Phase 10 (LZMA) is fair across chassis families since it doesn't hit any crypto coprocessor.

Specifying your target

Two ways:

  1. No target (the usual way). Auto-selects the matching preset from the detected hardware (chip, memory, model family, and screen size), so the right thresholds, asserts, and baseline apply with nothing to type:

    ./run

    If exactly one preset matches it is used (and printed); if none match, it falls back to auto-detecting chassis class, including the 14"/16" sub-class from the built-in display, and skips the SKU asserts. targets/ ships presets across the M1-M5 generations (Air, 14"/16" Pro/Max) plus Intel; see targets/README.md.

  2. Pin a preset. Pass --target to assert an expected SKU (hard-fails on a chip / RAM mismatch, useful when verifying you got the SKU you paid for) or to override the auto-match:

    ./run --target mbp-16-m5-max-64

(Don't see your SKU? targets/README.md has the schema. Open a PR adding a preset.)

Repo layout

mac-shakedown/
├── README.md
├── CONTRIBUTING.md
├── CHANGELOG.md
├── SECURITY.md
├── LICENSE
├── run                             # convenience entry point, execs the orchestrator
├── Shakedown Brain.md              # Obsidian map-of-content (optional, for vault users)
├── .github/                        # issue + PR templates, CI lint workflow
├── Verification/                   # generation-agnostic test machinery
│   ├── Runbook.md                  # the procedure
│   ├── Pass-Fail Criteria.md       # thresholds (parameterized by chassis class)
│   ├── Benchmark Reference.md      # install/run/expected-score + in-store protocol
│   ├── Store Day Checklist.md      # the one-command in-store flow
│   ├── Production QA.md            # gap analysis: toward factory-grade QA
│   ├── per-core-pinning.md         # why macOS has no per-core affinity (methodology note)
│   └── scripts/
│       ├── run-shakedown.sh        # the orchestrator (`./run` execs this)
│       ├── inventory.sh            # system_profiler + sysctl → JSON
│       ├── battery.sh              # ioreg battery health → JSON
│       ├── race-bench.sh           # cold xz race → JSON
│       ├── ssd-test.sh             # sequential SSD read/write → JSON (sudo for purge)
│       ├── memory-bandwidth.sh     # STREAM triad (stream-triad.c) or memmove fallback → JSON
│       ├── stream-triad.c          # vendored STREAM-style triad, compiled at runtime
│       ├── cpu-variance.sh         # burst + warmup + 5×60s timed iters → JSON (WORKLOAD=blake2b for 4b)
│       ├── thermal-load.sh         # 10-min sustained + powermetrics → JSON (sudo)
│       ├── gpu-variance.sh         # opt-in Metal compute variance, swiftc at runtime → JSON
│       ├── llama-bench.sh          # opt-in llama.cpp combined load (clone+build) → JSON
│       ├── compare-reports.sh      # diff two reports (unit vs known-good sibling)
│       ├── make-baseline.sh        # build baselines/<preset>.json from known-good reports
│       ├── repeatability.sh        # MSA: is our gage error small vs the tolerance?
│       └── display-test.sh         # fullscreen color cycle (HTML)
├── baselines/                      # calibrated golden limits per SKU (mostly empty; built from known-good runs)
├── targets/                        # preset SKU configs
│   ├── README.md
│   ├── mbp-16-m5-max-64.json
│   ├── mbp-14-m5-pro-24.json
│   ├── macbook-air-m5-16.json
│   └── mbp-16-intel-2019.json
├── examples/                       # generation-specific calibrations + sample reports
│   ├── m5-2026/                    # the 2026 M5 generation defect landscape
│   │   ├── README.md
│   │   ├── M5 Quality Issues.md
│   │   ├── Issues/
│   │   └── Sources.md
│   └── sample-report-illustrative/ # what a real run looks like (illustrative, not from a real unit)
└── Reports/                        # output artifacts
    ├── SCHEMA.md                   # JSON report schema (v1.0)
    ├── local/                      # full output, gitignored
    └── submissions/                # sanitized, PR-able copies

Adding a new generation

When a new chip line ships, copy examples/m5-2026/ to examples/<generation>-<year>/, document the issues there, and add a target preset pointing to it. See CONTRIBUTING.md.

Running individual scripts (advanced)

./run orchestrates the script-driven phases. If you want to rerun a single phase, say to confirm a borderline variance warn without redoing the whole pass, call the scripts directly:

export CHASSIS_CLASS=active-cooled-pro-16  # or -14 | fanless | desktop | intel-laptop | intel-desktop

./Verification/scripts/inventory.sh    > Reports/inventory.json
./Verification/scripts/battery.sh      > Reports/battery.json
./Verification/scripts/cpu-variance.sh > Reports/variance.json
sudo CHASSIS_CLASS="$CHASSIS_CLASS" ./Verification/scripts/thermal-load.sh > Reports/thermal.json
./Verification/scripts/display-test.sh

The inline sudo CHASSIS_CLASS=... form preserves the env var across the privilege boundary regardless of sudoers env_keep config. Read the values against Pass-Fail Criteria.

Submit a calibration report

The v0.1 thresholds need real-world data to calibrate. If you ran the harness (PASS, WARN, or FAIL), please consider submitting your report. Known-good machines from someone you trust are the most valuable submissions, since that's what the methodology currently lacks.

./run already writes the sanitized copy to Reports/submissions/<filename>.json. To submit it:

  1. Skim the JSON for any leftover PII (free-form notes, store name, etc.).
  2. git checkout -b submit-<your-date> && git add Reports/submissions/<filename>.json && git commit -m "submission: <chassis> <chip> <verdict>"
  3. Open a PR. CI runs the submission audit and rejects _raw_* leakage, plaintext serials, malformed schema, or off-pattern filenames.

The manual phases (display, physical inspection, Apple Diagnostics, idle drain) land as skipped placeholders. If you ran any of them, hand-edit Reports/local/<filename>.json to fill in the results, re-sanitize, and overwrite the submission copy before the PR.

Roadmap

Wave 7 shipped most of the original roadmap: the non-accelerated variance pass (Phase 4b, BLAKE2b), GPU compute variance (Phase 13, Metal), memory bandwidth (Phase 12), a conservative SSD floor, the active-cooled-pro-14 / active-cooled-pro-16 thermal sub-classes, and M1-M4 plus Intel-era generation calibrations.

Wave 8 added the in-store toolkit: the --store thorough profile, compare-reports.sh (diff your unit against a known-good sibling), a sourced Benchmark Reference, a vendored STREAM triad in Phase 12 (real copy / scale / add / triad), and an opt-in --llama phase that builds llama.cpp for a combined CPU+GPU+memory load.

Since then: a one-command verdict banner on ./run, a single-sitting Store Day Checklist, and the first leg of production-grade QA, calibrated golden baselines (make-baseline.sh builds baselines/<preset>.json from known-good runs; ./run then bins against it for a calibrated verdict). Verification/Production QA.md is the honest gap analysis to a factory-grade screen.

Still open:

  • Hosted aggregator. Eventually, submission via API to a public site so reports aren't reviewed by hand. Until then, the PR-submission flow above is the aggregator. Slower, but no infra, and PR review catches PII before merge.
  • A corpus to calibrate against. The mechanism now exists: ./run bins against a golden baseline when baselines/<preset>.json is present (make-baseline.sh builds one from known-good reports), turning the verdict calibrated. What is still missing is the population of known-good units to characterize. The Measurement System Analysis harness now exists (repeatability.sh reports whether our gage error is small relative to the tolerance it decides against), and --strict hard-gates the preconditions we can see before a run (on AC, quiet system) so a scored run is comparable. Coverage of the remaining defect holes and SPC are next. Verification/Production QA.md lays out the full path to a production-grade screen.
  • Per-core pinning. macOS lacks public CPU affinity APIs, so we can't pin workers to specific cores; a defective single core gets averaged across N P-cores. Reporting worker_imbalance_pct_per_iter is the partial mitigation. The investigation is written up in Verification/per-core-pinning.md: the short version is that THREAD_AFFINITY_POLICY is ignored on Apple Silicon and QoS hints only steer between the P and E clusters, so there's no per-core pinning to be had today.

Origin

Built originally to vet a 16" M5 Max purchase abroad, where returning a defective unit isn't practical. The research that informs the M5 Max thresholds is in examples/m5-2026/, kept as a worked example of what a generation calibration looks like.

License

MIT. See LICENSE.

About

Verify a new Mac before the return window closes. An agentic QA harness catching batch-level defects (perf variance, thermal throttling, hardware health) the store demo misses. Apple Silicon + Intel.

Topics

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors