Verify a new Mac before your return window closes.
A verification harness for new Macs, Apple Silicon and Intel. Built for cases where you can't easily return the unit: bought abroad, narrow return window, expensive config, or you just don't want to discover a defect three months from now.
A single command runs the automated phases (system inventory, battery health, CPU variance benchmark, sustained thermal load) end-to-end. A separate runbook walks you through the manual phases (display dead-pixel test, hinge / keyboard / speaker / port inspection, Apple Diagnostics).
Why "shakedown"? Borrowed from engineering: a shakedown run is the first-run stress test of new machinery before it's commissioned. Same idea, applied to your new Mac.
Some Mac generations ship with batch-level defects that don't show up in a quick boot test. The 2026 M5 Max line, for example, had units showing up to 41.5% multi-core performance variance between identical benchmark runs. You can only see that by running repeated load tests on a thermally-saturated chassis. A 30-second smoke test on store Wi-Fi will not catch it.
Shakedown is the procedure for catching those.
Status (v0.1): the methodology has not yet been validated against a confirmed-defective unit. Thresholds are derived from public reports and engineering reasoning. Expect them to tighten as crowd-sourced submissions land. Treat current results as advisory, not authoritative; if a verdict is borderline, rerun before deciding.
Phase 4 uses parallel SHA-256, which is hardware-accelerated on Apple Silicon and Coffee Lake+ Intel. The variance methodology transfers cleanly to any sustained workload (the SHA choice is for zero-install portability), but the test doesn't probe integer pipelines or memory bandwidth as deeply as Cinebench / Geekbench would. A non-accelerated workload pass now ships as opt-in Phase 4b (
./run --noaccel, BLAKE2b), and memory bandwidth has its own phase (Phase 12).
-
Install Xcode Command Line Tools (provides
git, 5–10 min one-time; a GUI dialog pops up):xcode-select -p >/dev/null 2>&1 || xcode-select --install
-
Clone and run.
git clone https://github.com/ugglr/mac-shakedown ~/mac-shakedown && cd ~/mac-shakedown ./run
On a terminal with no flags, ./run opens a short guided picker (verify a new Mac vs quick check, plus strict and the AI load), builds the command, and runs it, so you don't have to remember flags. Pass any flag (or SHAKEDOWN_YES=1, or pipe it) and the picker is skipped, the run proceeds directly.
Either way, ./run auto-selects the matching preset from your hardware (chip, memory, model, screen size), so it bins against the right thresholds and baseline with nothing to type. Pass --target mbp-16-m5-max-64 (see targets/) to assert an expected SKU, or to override the match; if no preset matches, it auto-detects chassis class from system_profiler and skips the inventory asserts.
The orchestrator runs the automated phases (preflight → inventory → battery → race benchmark → SSD test → memory bandwidth → CPU variance → thermal load) end-to-end, asks for sudo once upfront (Phase 5 and the SSD page-cache drop need it), and writes a SCHEMA-compliant report to Reports/local/ plus a sanitized PR-able copy to Reports/submissions/. Opt-in flags add heavier passes: --noaccel (a non-accelerated BLAKE2b variance pass), --gpu (a Metal GPU compute pass), and --llama (clones and builds llama.cpp for a combined CPU+GPU+memory AI load). --store bundles the thorough profile for verifying a new unit. Runtime ~20 min on Intel, ~27 min on Air, ~47 min on MacBook Pro.
Verifying a new purchase, especially in a store you can't easily return to? Follow the Store Day Checklist, a top-to-bottom run-this-then-that guide for verifying your machine in one sitting at the store. The Benchmark Reference behind it has the sourced per-generation expected scores and live-lookup links, and
Verification/scripts/compare-reports.sh reference.json yours.jsondiffs your unit against a known-good sibling.
./run --no-sudo skips the 10-min thermal phase (the only phase that needs sudo) for a half-runtime no-password variance-only pass.
Keep the charger plugged in for all phases.
For the manual phases (display test, hinge / keyboard / speaker / port inspection, Apple Diagnostics, optional 30-min idle drain), follow Verification/Runbook.md phases 6–9 by hand.
See examples/sample-report-illustrative/ for an annotated example PASS report on a 16" M5 Max: Markdown render and the underlying JSON. (Illustrative, not a real run, replaced when crowd-sourced submissions land.)
| Phase | What | How |
|---|---|---|
| 0 | Pre-flight & quiet-system check | uptime, top |
| 1 | Hardware identity vs. target | system_profiler + sysctl → inventory.sh |
| 2 | Battery health (cycles, capacity, condition) | ioreg AppleSmartBattery → battery.sh |
| 3 | Sensor & port inventory | (uses Phase 1 output) |
| 10 | Race benchmark (cold) | xz -9 -T<P> of a 200 MB random blob → race-bench.sh |
| 11 | SSD sequential read/write (cold) | 2 GB incompressible random data, page cache dropped between → ssd-test.sh |
| 12 | Memory bandwidth (cold) | multi-threaded STREAM triad (vendored C, clang at runtime) with a memmove fallback → memory-bandwidth.sh |
| 4 | CPU performance variance | 5 s burst + chassis-class-aware warmup (300 s on Pro, 60 s on Air) + 5 × 60 s timed iterations, parallel SHA-256 → cpu-variance.sh |
| 4b | Non-accelerated CPU variance (opt-in --noaccel) |
same methodology with BLAKE2b (no crypto instruction, hits the integer pipelines) → cpu-variance.sh |
| 5 | Sustained thermal load (chassis-class-aware thresholds) | 10 min continuous + powermetrics sampling → thermal-load.sh |
| 13 | GPU compute variance (opt-in --gpu) |
sustained Metal FMA load, compiled with swiftc at runtime → gpu-variance.sh |
| 14 | Combined CPU+GPU+memory (opt-in --llama) |
clones + builds llama.cpp, runs llama-bench (AI inference load) → llama-bench.sh |
| 6 | Display visual inspection | fullscreen color cycle in Safari → display-test.sh |
| 7 | Manual physical inspection | hinge, keyboard, speakers, ports, Touch ID, etc. (runbook checklist) |
| 8 | Apple Diagnostics | reboot + Cmd-D |
| 9 | Optional idle drain | 30 min sleep, measure %/30 min |
Phases 10 (race), 11 (SSD), and 12 (memory bandwidth) run before the heavy phases so the chassis is cold. They produce calibration numbers (verdict "info"); pass/fail thresholds land in schema v0.3 once the submission corpus is non-empty (Phase 11 carries a conservative warn-only floor in the meantime). Unlike Phase 4 (SHA-256, hardware-accelerated), Phase 10 (LZMA) is fair across chassis families since it doesn't hit any crypto coprocessor.
Two ways:
-
No target (the usual way). Auto-selects the matching preset from the detected hardware (chip, memory, model family, and screen size), so the right thresholds, asserts, and baseline apply with nothing to type:
./run
If exactly one preset matches it is used (and printed); if none match, it falls back to auto-detecting chassis class, including the 14"/16" sub-class from the built-in display, and skips the SKU asserts.
targets/ships presets across the M1-M5 generations (Air, 14"/16" Pro/Max) plus Intel; seetargets/README.md. -
Pin a preset. Pass
--targetto assert an expected SKU (hard-fails on a chip / RAM mismatch, useful when verifying you got the SKU you paid for) or to override the auto-match:./run --target mbp-16-m5-max-64
(Don't see your SKU? targets/README.md has the schema. Open a PR adding a preset.)
mac-shakedown/
├── README.md
├── CONTRIBUTING.md
├── CHANGELOG.md
├── SECURITY.md
├── LICENSE
├── run # convenience entry point, execs the orchestrator
├── Shakedown Brain.md # Obsidian map-of-content (optional, for vault users)
├── .github/ # issue + PR templates, CI lint workflow
├── Verification/ # generation-agnostic test machinery
│ ├── Runbook.md # the procedure
│ ├── Pass-Fail Criteria.md # thresholds (parameterized by chassis class)
│ ├── Benchmark Reference.md # install/run/expected-score + in-store protocol
│ ├── Store Day Checklist.md # the one-command in-store flow
│ ├── Production QA.md # gap analysis: toward factory-grade QA
│ ├── per-core-pinning.md # why macOS has no per-core affinity (methodology note)
│ └── scripts/
│ ├── run-shakedown.sh # the orchestrator (`./run` execs this)
│ ├── inventory.sh # system_profiler + sysctl → JSON
│ ├── battery.sh # ioreg battery health → JSON
│ ├── race-bench.sh # cold xz race → JSON
│ ├── ssd-test.sh # sequential SSD read/write → JSON (sudo for purge)
│ ├── memory-bandwidth.sh # STREAM triad (stream-triad.c) or memmove fallback → JSON
│ ├── stream-triad.c # vendored STREAM-style triad, compiled at runtime
│ ├── cpu-variance.sh # burst + warmup + 5×60s timed iters → JSON (WORKLOAD=blake2b for 4b)
│ ├── thermal-load.sh # 10-min sustained + powermetrics → JSON (sudo)
│ ├── gpu-variance.sh # opt-in Metal compute variance, swiftc at runtime → JSON
│ ├── llama-bench.sh # opt-in llama.cpp combined load (clone+build) → JSON
│ ├── compare-reports.sh # diff two reports (unit vs known-good sibling)
│ ├── make-baseline.sh # build baselines/<preset>.json from known-good reports
│ ├── repeatability.sh # MSA: is our gage error small vs the tolerance?
│ └── display-test.sh # fullscreen color cycle (HTML)
├── baselines/ # calibrated golden limits per SKU (mostly empty; built from known-good runs)
├── targets/ # preset SKU configs
│ ├── README.md
│ ├── mbp-16-m5-max-64.json
│ ├── mbp-14-m5-pro-24.json
│ ├── macbook-air-m5-16.json
│ └── mbp-16-intel-2019.json
├── examples/ # generation-specific calibrations + sample reports
│ ├── m5-2026/ # the 2026 M5 generation defect landscape
│ │ ├── README.md
│ │ ├── M5 Quality Issues.md
│ │ ├── Issues/
│ │ └── Sources.md
│ └── sample-report-illustrative/ # what a real run looks like (illustrative, not from a real unit)
└── Reports/ # output artifacts
├── SCHEMA.md # JSON report schema (v1.0)
├── local/ # full output, gitignored
└── submissions/ # sanitized, PR-able copies
When a new chip line ships, copy examples/m5-2026/ to examples/<generation>-<year>/, document the issues there, and add a target preset pointing to it. See CONTRIBUTING.md.
./run orchestrates the script-driven phases. If you want to rerun a single phase, say to confirm a borderline variance warn without redoing the whole pass, call the scripts directly:
export CHASSIS_CLASS=active-cooled-pro-16 # or -14 | fanless | desktop | intel-laptop | intel-desktop
./Verification/scripts/inventory.sh > Reports/inventory.json
./Verification/scripts/battery.sh > Reports/battery.json
./Verification/scripts/cpu-variance.sh > Reports/variance.json
sudo CHASSIS_CLASS="$CHASSIS_CLASS" ./Verification/scripts/thermal-load.sh > Reports/thermal.json
./Verification/scripts/display-test.shThe inline sudo CHASSIS_CLASS=... form preserves the env var across the privilege boundary regardless of sudoers env_keep config. Read the values against Pass-Fail Criteria.
The v0.1 thresholds need real-world data to calibrate. If you ran the harness (PASS, WARN, or FAIL), please consider submitting your report. Known-good machines from someone you trust are the most valuable submissions, since that's what the methodology currently lacks.
./run already writes the sanitized copy to Reports/submissions/<filename>.json. To submit it:
- Skim the JSON for any leftover PII (free-form notes, store name, etc.).
git checkout -b submit-<your-date> && git add Reports/submissions/<filename>.json && git commit -m "submission: <chassis> <chip> <verdict>"- Open a PR. CI runs the submission audit and rejects
_raw_*leakage, plaintext serials, malformed schema, or off-pattern filenames.
The manual phases (display, physical inspection, Apple Diagnostics, idle drain) land as skipped placeholders. If you ran any of them, hand-edit Reports/local/<filename>.json to fill in the results, re-sanitize, and overwrite the submission copy before the PR.
Wave 7 shipped most of the original roadmap: the non-accelerated variance pass (Phase 4b, BLAKE2b), GPU compute variance (Phase 13, Metal), memory bandwidth (Phase 12), a conservative SSD floor, the active-cooled-pro-14 / active-cooled-pro-16 thermal sub-classes, and M1-M4 plus Intel-era generation calibrations.
Wave 8 added the in-store toolkit: the --store thorough profile, compare-reports.sh (diff your unit against a known-good sibling), a sourced Benchmark Reference, a vendored STREAM triad in Phase 12 (real copy / scale / add / triad), and an opt-in --llama phase that builds llama.cpp for a combined CPU+GPU+memory load.
Since then: a one-command verdict banner on ./run, a single-sitting Store Day Checklist, and the first leg of production-grade QA, calibrated golden baselines (make-baseline.sh builds baselines/<preset>.json from known-good runs; ./run then bins against it for a calibrated verdict). Verification/Production QA.md is the honest gap analysis to a factory-grade screen.
Still open:
- Hosted aggregator. Eventually, submission via API to a public site so reports aren't reviewed by hand. Until then, the PR-submission flow above is the aggregator. Slower, but no infra, and PR review catches PII before merge.
- A corpus to calibrate against. The mechanism now exists:
./runbins against a golden baseline whenbaselines/<preset>.jsonis present (make-baseline.shbuilds one from known-good reports), turning the verdict calibrated. What is still missing is the population of known-good units to characterize. The Measurement System Analysis harness now exists (repeatability.shreports whether our gage error is small relative to the tolerance it decides against), and--stricthard-gates the preconditions we can see before a run (on AC, quiet system) so a scored run is comparable. Coverage of the remaining defect holes and SPC are next.Verification/Production QA.mdlays out the full path to a production-grade screen. - Per-core pinning. macOS lacks public CPU affinity APIs, so we can't pin workers to specific cores; a defective single core gets averaged across N P-cores. Reporting
worker_imbalance_pct_per_iteris the partial mitigation. The investigation is written up inVerification/per-core-pinning.md: the short version is thatTHREAD_AFFINITY_POLICYis ignored on Apple Silicon and QoS hints only steer between the P and E clusters, so there's no per-core pinning to be had today.
Built originally to vet a 16" M5 Max purchase abroad, where returning a defective unit isn't practical. The research that informs the M5 Max thresholds is in examples/m5-2026/, kept as a worked example of what a generation calibration looks like.
MIT. See LICENSE.