perf: inline get_loose_bvar_range field accessor #13387

Kha wants to merge 10 commits into leanprover:master

Conversation
This PR turns `replace_rec_fn` into a class template parameterized on the functor type and exposes it from `kernel/replace_fn.h`, alongside templated `replace<F>(...)` overloads that monomorphize on the functor at the call site. Hot callers like `instantiate`, `instantiate_rev`, and the C-exported `lean_expr_instantiate*_core` variants in `kernel/instantiate.cpp` all pass their substitution closure as a stateless lambda; previously each call into the closure went through `std::function`'s indirect dispatch on every recursion step, blocking inlining of the loose-bvar-range early return. With this change the closure body inlines into `replace_rec_fn::apply` and the `std::function` indirection is gone.

The legacy `replace(expr, std::function<...>, bool)` overload is preserved as a thin out-of-line wrapper that instantiates the template once with `std::function` as the functor type, so external callers continue to work unchanged.

On `leanchecker --fresh Init.Data.List.Lemmas` this shaves `17.10 G -> 16.50 G` instructions (~3.5%) and `1.74s -> 1.69s` wall-clock (~3%). All existing callers benefit automatically since they were already passing lambdas directly; no caller-side changes were required.
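The shape of the change can be sketched roughly as follows, using `int` in place of `expr` and illustrative names (this is a toy model of the pattern, not the kernel code): the template entry point monomorphizes on the functor, while a `std::function`-typed overload keeps legacy callers source-compatible.

```cpp
#include <cassert>
#include <functional>
#include <optional>

// Toy model: replace_rec_fn is a class template over the functor type F,
// so the functor body can inline into apply().
template <typename F>
struct replace_rec_fn {
    F const & m_f;
    explicit replace_rec_fn(F const & f) : m_f(f) {}
    int apply(int e, unsigned offset) {
        if (std::optional<int> r = m_f(e, offset))
            return *r;                    // direct, inlinable call into F
        return e;
    }
};

// Templated entry point monomorphizing on the functor at the call site.
template <typename F>
int replace(int e, F const & f) {
    return replace_rec_fn<F>(f).apply(e, 0);
}

// Legacy overload: a thin wrapper instantiating the template once with
// std::function, so external callers keep working (and keep paying one
// indirect dispatch per step, as before).
using std_fn = std::function<std::optional<int>(int, unsigned)>;
int replace(int e, std_fn const & f) {
    return replace_rec_fn<std_fn>(f).apply(e, 0);
}
```

A lambda passed at a call site binds to the template overload (an exact match beats the conversion to `std::function`), which is what makes the inlining kick in without caller-side changes.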
This PR skips inserting an entry into `replace_rec_fn`'s cache when the result equals the input, i.e. when the user functor decided that the subterm is unchanged. The dominant kernel callers of `replace` are the `instantiate*` family in `kernel/instantiate.cpp`, and their substitution closure short-circuits to `some_expr(m)` whenever `offset >= get_loose_bvar_range(m)`. That predicate is true for the vast majority of subterms in any non-trivial expression, so the cache was being filled almost exclusively with `(node, offset) -> node` self-mappings.

Each such entry costs an insert and pollutes the cache for actually-useful entries, while a re-visit can recompute the same answer by calling the functor again at O(1) cost (the loose-bvar range is memoized on the expr node).

On `leanchecker --fresh Init.Data.List.Lemmas` this shaves `17.10 G -> 16.55 G` instructions (~3.2%) and `1.74s -> 1.61s` wall-clock (~7.5%).
This PR replaces the `r.raw() != e.raw()` skip-when-result-is-input check in `replace_rec_fn::save_result` (introduced earlier in this stack) with an opt-in pre-cache `skip` predicate that callers can supply alongside the rewriter functor. The skip predicate is invoked *before* the cache lookup; if it returns true, the input is returned unchanged with no cache touch.

The earlier identity-skip approach blew up catastrophically on workloads like `Init.WFExtrinsicFix` (~3-4x instruction count), because it dropped caching for the *recursion-identity* case as well as the cheap `m_f`-early-return case. In the recursion-identity case the cache had been remembering "we already descended into this shared subterm and proved it's unchanged" — without that memo, every revisit redid the full recursive descent and the work compounded exponentially. The skip predicate fixes this by separating the two cases: only the cheap "no work needed for this subtree" guard goes into `skip`, and the ordinary cache logic handles every other path unchanged. For `instantiate*`/`lift_loose_bvars`/`lower_loose_bvars`, the predicate is the existing memoized loose-bvar-range check that was previously the first line of `m_f`. It's strictly cheaper than the cache lookup it replaces.

Measurements vs the `xkn` baseline on `leanchecker --fresh`:

| | baseline | this PR |
|-----------------------|----------|---------|
| Init.WFExtrinsicFix | 9.32 G | 9.07 G |
| Init.Data.List.Lemmas | 17.10 G | 16.61 G |

Both workloads improve modestly, with no regression on either.
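The separation of the two cases can be sketched with a toy traversal over shared nodes (illustrative names and types, not the kernel's): the skip predicate runs before any cache access, while identity results produced by actual recursion still land in the cache, so each shared subterm is descended only once.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

struct node {
    unsigned loose_bvar_range;            // stands in for the memoized field
    std::vector<node *> children;
};

struct replace_visitor {
    std::map<std::pair<node *, unsigned>, node *> m_cache;
    unsigned m_descents = 0;              // full recursive descents performed
    node * apply(node * e, unsigned offset) {
        if (offset >= e->loose_bvar_range)
            return e;                     // skip predicate: no cache touch
        auto it = m_cache.find({e, offset});
        if (it != m_cache.end())
            return it->second;            // includes recursion-identity memos
        ++m_descents;
        for (node * c : e->children)
            apply(c, offset);             // descend (rewrites elided)
        m_cache.insert({{e, offset}, e}); // identity memo prevents re-descent
        return e;
    }
};
```

In this model a diamond-shared subterm is descended once and every revisit hits the cache; drop the identity memo (as the earlier PR did) and the descents compound with the amount of sharing.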
This PR replaces the `std::unordered_map`-based cache in `replace_rec_fn` with a small-buffer cache that stores its first 16 entries inline in uninitialized stack storage and only allocates a real hash map for the rare large traversal.

Instrumentation across a full `leanchecker --fresh Init.Data.List.Lemmas` run shows that 87% of `replace_rec_fn` instances hold at most 15 entries and only 0.21% exceed 128, with a mean cache size of just 9 entries spread across ~950k instances. At that scale a hash map is the wrong data structure: its per-instance bucket-array allocation and per-entry node allocation dwarf the cost of a linear scan over a handful of entries. The new structure pays nothing for entries that are never inserted, does no allocation at all on the common path, and falls back to the original `unordered_map` once the inline buffer fills. Combined with the existing `is_likely_unshared` filter, lookups on the common path are just a tight scan over a stack-resident array.

On `leanchecker --fresh Init.Data.List.Lemmas` this shaves `17.10 G -> 16.18 G` instructions (~5.4%) and `1.74s -> 1.62s` wall-clock (~6.7%) compared to the previous baseline. It supersedes the prior `try_emplace` and `reserve(128)` micro-optimizations on the same cache, both of which are no longer needed since the hash map is no longer on the hot path.
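A simplified sketch of the small-buffer shape (a single pointer key rather than the real `(node, offset)` pair, and a zero-initialized array rather than truly uninitialized storage; names are illustrative): the first 16 entries live inline and are found by linear scan, and a heap-allocated hash map exists only for the rare traversal that overflows the buffer.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <unordered_map>
#include <utility>

struct small_cache {
    static constexpr unsigned SMALL = 16;
    std::array<std::pair<void const *, void const *>, SMALL> m_inline{};
    unsigned m_size = 0;
    std::unique_ptr<std::unordered_map<void const *, void const *>> m_big;

    void const * find(void const * k) const {
        for (unsigned i = 0; i < m_size; i++)   // tight stack-resident scan
            if (m_inline[i].first == k)
                return m_inline[i].second;
        if (m_big) {                            // rare overflow path
            auto it = m_big->find(k);
            if (it != m_big->end())
                return it->second;
        }
        return nullptr;
    }

    void insert(void const * k, void const * v) {
        if (m_size < SMALL) {
            m_inline[m_size++] = {k, v};        // no allocation on this path
        } else {
            if (!m_big)
                m_big = std::make_unique<
                    std::unordered_map<void const *, void const *>>();
            (*m_big)[k] = v;
        }
    }
};
```

Entries that fit in the inline buffer never touch the allocator; only the 17th and later inserts pay for the hash map, matching the "87% hold at most 15 entries" profile above.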
This PR makes `get_loose_bvar_range` an inline accessor that reads the
memoized loose-bvar-range field directly from the expression's data
header (`(get_data(e) >> 44)`), matching the existing pattern for
`hash` / `has_fvar` / `has_expr_mvar` / `has_univ_mvar`. It was the
only one of these field accessors still going out-of-line through a
Lean-implemented thunk:
```cpp
extern "C" unsigned lean_expr_loose_bvar_range(object * e);

unsigned get_loose_bvar_range(expr const & e) {
    return lean_expr_loose_bvar_range(e.to_obj_arg());
}
```
Every call did a `lean_inc` on the borrowed `expr` (via `to_obj_arg`),
a function call into Lean, the Lean function body extracting the same
shifted field, and a matching `lean_dec` on parameter consumption —
all to retrieve a value that's already inline in the object header.
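The header read itself is a single shift; as a toy illustration (assuming, as described above, the loose-bvar range sits in the packed 64-bit data word at bit 44):

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the inlined accessor: the field lives in the top bits of the
// packed data word, so reading it is one shift, no call, no refcount traffic.
inline unsigned toy_loose_bvar_range(uint64_t data) {
    return static_cast<unsigned>(data >> 44);
}
```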
`get_loose_bvar_range` is hot: it gates the skip predicates in every
`replace_rec_fn` traversal, gates `has_loose_bvars`, and is checked at
the top of every `instantiate*` / `lift_loose_bvars` /
`lower_loose_bvars` entry point. Inlining it shaves significant time
across kernel-heavy workloads.
The previous out-of-line implementation is preserved as
`loose_bvar_range_core` for the `lean_assert` cross-check, matching
the convention used by `has_fvar_core`, `has_expr_mvar_core`, etc.
Measurements on `leanchecker --fresh` against the previous tip:
| | before | after | delta |
|--------------------------|---------|---------|------------------|
| Init.Data.List.Lemmas | 13.92 G | 13.51 G | -405 M (-2.9%) |
| Init.WFExtrinsicFix | 7.79 G | 7.59 G | -200 M (-2.6%) |
| big_do (elab benchmark) | 24.33 G | 24.37 G | noise |
!bench

Benchmark results for 11c62ea against 41ab492 are in. Significant changes detected! @Kha

Large changes (64✅): too many entries to display here; view the full report on radar instead.
Medium changes (91✅): too many entries to display here; view the full report on radar instead.
Small changes (1424✅, 3🟥): too many entries to display here; view the full report on radar instead.

Reference manual CI status:

Mathlib CI status (docs):

!bench mathlib

Benchmark results for leanprover-community/mathlib4-nightly-testing@8aad75a against leanprover-community/mathlib4-nightly-testing@bea414c are in. There are significant results. @Kha

Large changes (3✅) and 1 hidden.
Medium changes (145✅): too many entries to display here; view the full report on radar instead.
Small changes (1002✅, 19🟥): too many entries to display here; view the full report on radar instead.