Skip to content

perf: combined kernel optimizations#13381

Closed
Kha wants to merge 9 commits intoleanprover:masterfrom
Kha:push-wmorskvrkxrm
Closed

perf: combined kernel optimizations#13381
Kha wants to merge 9 commits intoleanprover:masterfrom
Kha:push-wmorskvrkxrm

Conversation

@Kha
Copy link
Copy Markdown
Member

@Kha Kha commented Apr 12, 2026

No description provided.

Kha added 4 commits April 12, 2026 15:31
This PR turns `replace_rec_fn` into a class template parameterized on the
functor type and exposes it from `kernel/replace_fn.h`, alongside templated
`replace<F>(...)` overloads that monomorphize on the functor at the call
site. Hot callers like `instantiate`, `instantiate_rev`, and the C-exported
`lean_expr_instantiate*_core` variants in `kernel/instantiate.cpp` all pass
their substitution closure as a stateless lambda; previously each call into
the closure went through `std::function`'s indirect dispatch on every
recursion step, blocking inlining of the loose-bvar-range early return.
With this change the closure body inlines into `replace_rec_fn::apply` and
the `std::function` indirection is gone.

The legacy `replace(expr, std::function<...>, bool)` overload is preserved
as a thin out-of-line wrapper that instantiates the template once with
`std::function` as the functor type, so external callers continue to work
unchanged.

On `leanchecker --fresh Init.Data.List.Lemmas` this shaves
`17.10 G -> 16.50 G` instructions (~3.5%) and `1.74s -> 1.69s` wall-clock
(~3%). All existing callers benefit automatically since they were already
passing lambdas directly; no caller-side changes were required.
This PR skips inserting an entry into `replace_rec_fn`'s cache when the
result equals the input, i.e. when the user functor decided that the
subterm is unchanged. The dominant kernel callers of `replace` are the
`instantiate*` family in `kernel/instantiate.cpp`, and their substitution
closure short-circuits to `some_expr(m)` whenever
`offset >= get_loose_bvar_range(m)`. That predicate is true for the vast
majority of subterms in any non-trivial expression, so the cache was being
filled almost exclusively with `(node, offset) -> node` self-mappings.
Each such entry costs an insert and pollutes the cache for actually-useful
entries, while a re-visit can recompute the same answer by calling the
functor again at O(1) cost (the loose-bvar range is memoized on the expr
node).

On `leanchecker --fresh Init.Data.List.Lemmas` this shaves
`17.10 G -> 16.55 G` instructions (~3.2%) and `1.74s -> 1.61s` wall-clock
(~7.5%).
@Kha
Copy link
Copy Markdown
Member Author

Kha commented Apr 12, 2026

!bench

@leanprover-radar
Copy link
Copy Markdown

leanprover-radar commented Apr 12, 2026

Benchmark results for 6ff9c0b against 82bb27f are in. Significant changes detected! @Kha

  • 🟥 build//instructions: +8.5G (+0.07%)

Large changes (50✅, 18🟥)

Too many entries to display here. View the full report on radar instead.

Medium changes (67✅, 8🟥)

Too many entries to display here. View the full report on radar instead.

Small changes (1370✅, 38🟥)

Too many entries to display here. View the full report on radar instead.

This PR replaces the `r.raw() != e.raw()` skip-when-result-is-input check
in `replace_rec_fn::save_result` (introduced earlier in this stack) with
an opt-in pre-cache `skip` predicate that callers can supply alongside
the rewriter functor. The skip predicate is invoked *before* the cache
lookup; if it returns true, the input is returned unchanged with no
cache touch.

The earlier identity-skip approach blew up catastrophically on workloads
like `Init.WFExtrinsicFix` (~3-4x instruction count), because it dropped
caching for the *recursion-identity* case as well as the cheap
`m_f`-early-return case. In the recursion-identity case the cache had
been remembering "we already descended into this shared subterm and
proved it's unchanged" — without that memo, every revisit redid the
full recursive descent and the work compounded exponentially.

The skip predicate fixes this by separating the two cases: only the
cheap "no work needed for this subtree" guard goes into `skip`, and the
ordinary cache logic handles every other path unchanged. For
`instantiate*`/`lift_loose_bvars`/`lower_loose_bvars`, the predicate is
the existing memoized loose-bvar-range check that was previously the
first line of `m_f`. It's strictly cheaper than the cache lookup it
replaces.

Measurements vs the `xkn` baseline on `leanchecker --fresh`:

|                       | baseline | this PR |
|-----------------------|----------|---------|
| Init.WFExtrinsicFix   | 9.32 G   | 9.07 G  |
| Init.Data.List.Lemmas | 17.10 G  | 16.61 G |

Both workloads improve modestly, with no regression on either.
@github-actions github-actions bot added the toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN label Apr 12, 2026
@mathlib-lean-pr-testing
Copy link
Copy Markdown

mathlib-lean-pr-testing bot commented Apr 12, 2026

Mathlib CI status (docs):

  • ❗ Mathlib CI can not be attempted yet, as the nightly-testing-2026-04-09 tag does not exist there yet. We will retry when you push more commits. If you rebase your branch onto nightly-with-mathlib, Mathlib CI should run now. You can force Mathlib CI using the force-mathlib-ci label. (2026-04-12 17:56:15)
  • ❗ Batteries/Mathlib CI will not be attempted unless your PR branches off the nightly-with-mathlib branch. Try git rebase d53b46a0f3a7d0822550f4179ca3dab446a3c448 --onto d53b46a0f3a7d0822550f4179ca3dab446a3c448. You can force Mathlib CI using the force-mathlib-ci label. (2026-04-12 21:44:48)

@leanprover-bot
Copy link
Copy Markdown
Collaborator

leanprover-bot commented Apr 12, 2026

Reference manual CI status:

  • ❗ Reference manual CI can not be attempted yet, as the nightly-testing-2026-04-09 tag does not exist there yet. We will retry when you push more commits. If you rebase your branch onto nightly-with-manual, reference manual CI should run now. You can force reference manual CI using the force-manual-ci label. (2026-04-12 17:56:16)
  • ❗ Reference manual CI will not be attempted unless your PR branches off the nightly-with-manual branch. Try git rebase d53b46a0f3a7d0822550f4179ca3dab446a3c448 --onto 861bc19e0c1c45b5766cc5c9ef0488568368b236. You can force reference manual CI using the force-manual-ci label. (2026-04-12 21:44:50)
  • ❗ Reference manual CI can not be attempted yet, as the nightly-testing-2026-04-11 tag does not exist there yet. We will retry when you push more commits. If you rebase your branch onto nightly-with-manual, reference manual CI should run now. You can force reference manual CI using the force-manual-ci label. (2026-04-13 08:32:40)

@Kha Kha force-pushed the push-wmorskvrkxrm branch from 6ff9c0b to 6ddadf5 Compare April 12, 2026 18:22
@Kha
Copy link
Copy Markdown
Member Author

Kha commented Apr 12, 2026

!bench

@leanprover-radar
Copy link
Copy Markdown

leanprover-radar commented Apr 12, 2026

Benchmark results for 6ddadf5 against 82bb27f are in. Significant changes detected! @Kha

  • build//instructions: -309.0G (-2.56%)

Large changes (51✅, 1🟥)

Too many entries to display here. View the full report on radar instead.

Medium changes (84✅)

Too many entries to display here. View the full report on radar instead.

Small changes (1352✅, 2🟥)

Too many entries to display here. View the full report on radar instead.

Kha added 3 commits April 12, 2026 20:24
This PR replaces the `std::unordered_map`-based cache in `replace_rec_fn`
with a small-buffer cache that stores its first 16 entries inline in
uninitialized stack storage and only allocates a real hash map for the
rare large traversal. Instrumentation across a full
`leanchecker --fresh Init.Data.List.Lemmas` run shows that 87% of
`replace_rec_fn` instances hold at most 15 entries and only 0.21%
exceed 128, with a mean cache size of just 9 entries spread across
~950k instances. At that scale a hash map is the wrong data structure:
its per-instance bucket-array allocation and per-entry node allocation
dwarf the cost of a linear scan over a handful of entries.

The new structure pays nothing for entries that are never inserted, no
allocation at all on the common path, and falls back to the original
`unordered_map` once the inline buffer fills. Combined with the
existing `is_likely_unshared` filter, lookups on the common path are
just a tight scan over a stack-resident array.

On `leanchecker --fresh Init.Data.List.Lemmas` this shaves
`17.10 G -> 16.18 G` instructions (~5.4%) and `1.74s -> 1.62s`
wall-clock (~6.7%) compared to the previous baseline. It supersedes
the prior `try_emplace` and `reserve(128)` micro-optimizations on the
same cache, both of which are no longer needed since the hash map is
no longer on the hot path.
@Kha Kha force-pushed the push-wmorskvrkxrm branch from 6ddadf5 to 17e07a3 Compare April 12, 2026 20:59
@Kha
Copy link
Copy Markdown
Member Author

Kha commented Apr 12, 2026

!bench

@leanprover-radar
Copy link
Copy Markdown

leanprover-radar commented Apr 12, 2026

Benchmark results for 17e07a3 against d53b46a are in. Significant changes detected! @Kha

  • build//instructions: -304.3G (-2.54%)

Large changes (50✅)

Too many entries to display here. View the full report on radar instead.

Medium changes (78✅)

Too many entries to display here. View the full report on radar instead.

Small changes (1342✅, 2🟥)

Too many entries to display here. View the full report on radar instead.

@Kha
Copy link
Copy Markdown
Member Author

Kha commented Apr 12, 2026

!bench mathlib

@leanprover-radar
Copy link
Copy Markdown

leanprover-radar commented Apr 12, 2026

Benchmark results for leanprover-community/mathlib4-nightly-testing@8082e83 against leanprover-community/mathlib4-nightly-testing@bea414c are in. Significant changes detected! @Kha

  • build//instructions: -4.1T (-2.41%)

Large changes (2✅)

  • build//instructions: -4.1T (-2.41%)
  • build/profile/type checking//wall-clock: -3m 53s (-12.77%)

and 1 hidden

Medium changes (92✅)

Too many entries to display here. View the full report on radar instead.

Small changes (826✅, 3🟥)

Too many entries to display here. View the full report on radar instead.

mathlib-nightly-testing bot pushed a commit to leanprover-community/batteries that referenced this pull request Apr 13, 2026
@github-actions github-actions bot added the mathlib4-nightly-available A branch for this PR exists at leanprover-community/mathlib4-nightly-testing:lean-pr-testing-NNNN label Apr 13, 2026
mathlib-nightly-testing bot pushed a commit to leanprover-community/mathlib4-nightly-testing that referenced this pull request Apr 13, 2026
@mathlib-lean-pr-testing mathlib-lean-pr-testing bot added the builds-mathlib CI has verified that Mathlib builds against this PR label Apr 13, 2026
@mathlib-lean-pr-testing
Copy link
Copy Markdown

Mathlib CI status (docs):

@Kha
Copy link
Copy Markdown
Member Author

Kha commented Apr 13, 2026

!bench

@leanprover-radar
Copy link
Copy Markdown

leanprover-radar commented Apr 13, 2026

Benchmark results for b3beaf8 against 41ab492 are in. Significant changes detected! @Kha

  • build//instructions: -306.0G (-2.55%)

Large changes (52✅)

Too many entries to display here. View the full report on radar instead.

Medium changes (75✅)

Too many entries to display here. View the full report on radar instead.

Small changes (1331✅, 1🟥)

Too many entries to display here. View the full report on radar instead.

@Kha Kha closed this Apr 14, 2026
@Kha Kha deleted the push-wmorskvrkxrm branch April 14, 2026 12:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

builds-mathlib CI has verified that Mathlib builds against this PR mathlib4-nightly-available A branch for this PR exists at leanprover-community/mathlib4-nightly-testing:lean-pr-testing-NNNN toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants