perf(quality-gates): Variance reduction - outlier filtering, warmup exclusion #41961
…WSER_LOADS` constants
…ropagate `trimmedCount`
Builds ready [775bb15]
⚡ Performance Benchmarks (Total: 🟢 7 pass · 🟡 8 warn · 🔴 0 fail)
Web vitals collection can fail silently inside try-catch, making `allWebVitalsRuns` sparse. Slicing by `warmupSize = WARMUP_RUNS * pageLoads` assumes every warmup page load produced an entry; filtering by `wv.iteration >= warmupSize` is correct regardless of collection gaps. Reported by Cursor Bugbot (review #4150432255).
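The difference between slicing and filtering can be illustrated with a minimal sketch. It assumes `iteration` is a zero-based page-load index that keeps counting across browser sessions; the `WebVitalsRun` shape, the `lcp` field, and both helper names are hypothetical and only serve the illustration.

```typescript
// Hypothetical entry shape: the commit message only confirms an `iteration` field.
interface WebVitalsRun {
  iteration: number; // assumed: zero-based page-load index across all sessions
  lcp: number; // illustrative metric value in milliseconds
}

const WARMUP_RUNS = 1;

// Positional exclusion: correct only if every warm-up page load produced an entry.
function excludeWarmupBySlice(runs: WebVitalsRun[], pageLoads: number): WebVitalsRun[] {
  return runs.slice(WARMUP_RUNS * pageLoads);
}

// Index-based exclusion: unaffected by missing entries.
function excludeWarmupByIteration(runs: WebVitalsRun[], pageLoads: number): WebVitalsRun[] {
  const warmupSize = WARMUP_RUNS * pageLoads;
  return runs.filter((wv) => wv.iteration >= warmupSize);
}

// Two sessions of three page loads each; collection failed for iteration 1,
// so the warm-up session contributed only two entries.
const sparseRuns: WebVitalsRun[] = [
  { iteration: 0, lcp: 900 },
  { iteration: 2, lcp: 850 },
  { iteration: 3, lcp: 400 },
  { iteration: 4, lcp: 410 },
  { iteration: 5, lcp: 405 },
];

// Slicing drops three entries by position and loses warm sample 3.
const sliced = excludeWarmupBySlice(sparseRuns, 3).map((r) => r.iteration);
// Filtering keeps every entry from the warm session.
const filtered = excludeWarmupByIteration(sparseRuns, 3).map((r) => r.iteration);

console.log(sliced, filtered);
```

With the gap at iteration 1, the slice removes one warm sample too many, while the filter keeps iterations 3, 4, and 5.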
Add missing `@param options.*` declarations and remove markdown bold syntax that triggered the `jsdoc/check-indentation` rule.
…onsistency
Page-load benchmarks use IQR-only trimming, so `outliers === trimmedCount` (no z-score pass). Setting both fields ensures Sentry telemetry is consistent across page-load and iteration-based benchmark paths. Dashboards can still derive the z-score count as `outliers - trimmedCount`; for page loads that difference is always zero, which is correct.
…reference
`outliers` and `trimmedCount` pointed at the same object. For page loads (IQR-only) the values are equal, but aliasing the reference means a mutation to one silently corrupts the other. Spread produces an independent copy with the correct values.
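The aliasing hazard described in this commit can be reproduced in a few lines. This is a sketch under assumptions: the per-metric `MetricCounts` shape and the variable names are hypothetical, not the PR's actual types.

```typescript
// Hypothetical per-metric count map, flat on purpose so a shallow copy suffices.
interface MetricCounts {
  load: number;
  uiStartup: number;
}

// Aliased: both fields hold the same object reference.
const sharedCounts: MetricCounts = { load: 2, uiStartup: 1 };
const aliased = { trimmedCount: sharedCounts, outliers: sharedCounts };
aliased.outliers.load += 3; // also changes aliased.trimmedCount.load

// Independent: spread creates a shallow copy with equal initial values.
const baseCounts: MetricCounts = { load: 2, uiStartup: 1 };
const independent = { trimmedCount: baseCounts, outliers: { ...baseCounts } };
independent.outliers.load += 3; // independent.trimmedCount.load is unchanged

console.log(aliased.trimmedCount.load, independent.trimmedCount.load);
```

Spread copies only one level deep, which is enough here because the assumed map holds plain numbers.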
…`runPageLoadBenchmark`

Description
The variance audit found 5 metrics with CV 30–50% and 2 borderline at 25–30%. Raw per-run samples contain outliers from cold-start JIT compilation, GC pauses, and CI machine variance. These inflate stdDev and produce false positives across all three verdict layers: not just Layer 3 (Mann-Whitney U), but also Layer 1 (absolute thresholds) and Layer 2 (historical baselines), which are in use today.
This PR adds sample preparation that benefits every layer downstream:
IQR-based outlier trimming. `trimOutliers()` in `statistics.ts` removes values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] before stats computation. At n=15 independent sessions, this removes 0–3 extreme values. `trimmedCount` is exposed per-metric in `TimerStatistics`, `BenchmarkResults`, and as a Sentry tag for observability.
Warm-up run exclusion. The first `WARMUP_RUNS` (default: 1) browser-load sessions are discarded before computing stats. The first cold-start iteration is a known outlier source (JIT compilation, cache priming). Applied before IQR trimming so the trim operates on warm samples only.
PowerUser iteration rebalance. `startupPowerUserHome` changes from 10 browser loads × 10 page loads to 15 × 7 (105 total samples, same CI time). For outlier detection and Mann-Whitney U, independent session count matters more than within-session page reloads. Trading 3 page loads per session for 5 additional sessions increases independent n by 50% at negligible CI time cost. After warm-up exclusion: 14 effective sessions × 6 page loads = 84 effective samples.
Minimum sample gate constant. `MIN_SAMPLES_FOR_VERDICT = 5` is defined in `constants.ts` and wired into the Mann-Whitney U verdict logic in #41520.
Expected CV impact
startupPowerUser.uiStartup
startupPowerUser.load
startupPowerUser.loadScripts
openAccountMenuToAccountListLoaded
solanaAssetDetails.assetClickToPriceChart
Metrics with `tailRatio > 1.3` benefit most from trimming (the top three above).
Changelog
CHANGELOG entry: No user-facing changes. Benchmark infrastructure only (outlier trimming, warm-up exclusion, PowerUser iteration rebalance, trimmedCount Sentry tag).
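For readers unfamiliar with the IQR rule named in the Description (drop values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]), here is a minimal sketch. It is not the PR's `trimOutliers()`: the function name `trimOutliersSketch`, the linear-interpolation quartile method, and the small-sample guard are all assumptions made for illustration.

```typescript
// Quantile by linear interpolation between closest ranks (an assumed method;
// the PR does not state which quartile definition it uses).
function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Hypothetical stand-in for the trimming step: keeps values inside the IQR fences.
function trimOutliersSketch(values: number[]): { kept: number[]; trimmedCount: number } {
  // Assumed guard: quartiles are not meaningful for very small samples.
  if (values.length < 4) {
    return { kept: [...values], trimmedCount: 0 };
  }
  const sorted = [...values].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  const lower = q1 - 1.5 * iqr;
  const upper = q3 + 1.5 * iqr;
  // Filter the original array so sample order is preserved.
  const kept = values.filter((v) => v >= lower && v <= upper);
  return { kept, trimmedCount: values.length - kept.length };
}

// Illustrative timings in milliseconds with one slow run.
const timings = [410, 395, 402, 398, 1250, 405, 400, 399];
const trimResult = trimOutliersSketch(timings);
console.log(trimResult.kept, trimResult.trimmedCount);
```

In this made-up sample the fences work out to 387.5 and 417.5, so only the 1250 ms run is removed and `trimmedCount` is 1.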
Related issues
Fixes: MetaMask/MetaMask-planning#7185
Depends on: #40729 (Layer 1+2 comparison CLI)
Related: #41520 (Mann-Whitney U: trimming improves statistical power; wires `MIN_SAMPLES_FOR_VERDICT` into verdict logic)
Manual testing steps
`yarn jest test/e2e/benchmarks/utils/outlier-trimming.test.ts`
`yarn jest test/e2e/benchmarks/utils/`
Screenshots/Recordings
N/A — CI tooling change with no visual output.
Note
Medium Risk
Changes benchmark aggregation and reporting (warm-up run exclusion, IQR outlier trimming, and new derived metrics), which can materially shift quality-gate/Sentry signals even though it’s CI-only.
Overview
Reduces benchmark variance by discarding warm-up browser sessions (`WARMUP_RUNS`) and applying IQR-based outlier trimming before computing page-load statistics, with new per-metric `trimmedCount`/`outliers` included in `BenchmarkResults`.
Rebalances the `startupPowerUserHome` preset to 15 browser loads × 7 page loads (and defaults power-user runs to the higher session count), and extends Sentry reporting to include derived reliability metrics (`cv`, `dataQuality`, `tailRatio`) plus the new trimming/outlier counts.
Reviewed by Cursor Bugbot for commit f1fc8fe.
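To make the `cv` metric concrete, the sketch below computes a coefficient of variation (standard deviation divided by mean) before and after a single slow run is removed. The use of the sample (n − 1) standard deviation and all function names are assumptions; the PR does not show how its `cv` is computed.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Sample standard deviation (n - 1 denominator); an assumed choice.
function sampleStdDev(xs: number[]): number {
  const m = mean(xs);
  const sumSquares = xs.reduce((sum, x) => sum + (x - m) ** 2, 0);
  return Math.sqrt(sumSquares / (xs.length - 1));
}

// Coefficient of variation as a fraction (0.3 corresponds to 30%).
function coefficientOfVariation(xs: number[]): number {
  return sampleStdDev(xs) / mean(xs);
}

// Illustrative timings in milliseconds: the raw set contains one slow run.
const rawTimings = [410, 395, 402, 398, 1250, 405, 400, 399];
const warmTimings = rawTimings.filter((t) => t !== 1250);

const cvRaw = coefficientOfVariation(rawTimings);
const cvTrimmed = coefficientOfVariation(warmTimings);
console.log(cvRaw.toFixed(3), cvTrimmed.toFixed(3));
```

On this made-up data a single extreme value pushes the CV well above the 30% range mentioned in the Description, and removing it brings the CV down to a few percent.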