Rewrite LocalOutlierFactor as a windowed, on-demand detector by MaxHalford · Pull Request #1929 · online-ml/river

MaxHalford · 2026-06-26T13:35:10Z

What

Rewrites anomaly.LocalOutlierFactor from the incremental-LOF bookkeeping (nine threaded dicts + a stack of free functions) into a small class that delegates storage and neighbor search to a river.neighbors engine.

`LocalOutlierFactor`

Engine-based & bounded. Samples live in a fixed-size sliding window managed by a BaseNN engine: LazySearch (exact) by default, or SWINN for approximate search, mirroring KNNClassifier/KNNRegressor. learn_one is now O(1) and memory is bounded by the window size; previously learn_one was super-linear and memory grew without bound.
On-demand, non-mutating scoring. score_one computes the LOF against the current window without modifying the model. A point is never its own neighbor, so:
- an unseen point reproduces scikit-learn's LocalOutlierFactor(novelty=True), and
- a stored point reproduces the in-sample negative_outlier_factor_,
both matched to ~1e-15. Per-score memoization of neighborhoods keeps it ~6× faster than the naive recompute.
narwhals mini-batching. learn_many accepts any narwhals-supported eager dataframe (pandas, polars, pyarrow, …) instead of pandas only. Completes the anomaly/lof.py item of #1919 (Migrate all mini-batch (_many) methods to narwhals for dataframe-agnostic support).
Slimmer, more useful docstring; keeps both the Breunig (2000) and Pokrajac (2007) references.
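
To make the windowed, on-demand scoring described above concrete, here is a minimal pure-Python sketch. It is not the code from this PR: the class name `WindowedLOFSketch`, the brute-force search standing in for a river.neighbors engine, the `1e-10` guard in the density, the fallback score of 1.0 for a nearly empty window, and the exact-equality test used to recognise a stored point are all assumptions made for illustration.

```python
import math
from collections import deque


class WindowedLOFSketch:
    """Illustrative only: hypothetical class, not river's LocalOutlierFactor."""

    def __init__(self, n_neighbors=2, window_size=100):
        self.k = n_neighbors
        self.window = deque(maxlen=window_size)  # oldest sample drops out automatically

    def learn_one(self, x):
        self.window.append(tuple(x))  # O(1) append, no LOF bookkeeping

    @staticmethod
    def _knn(points, x, k, skip):
        # Brute-force stand-in for a neighbor engine: sorted (distance, index)
        # pairs that never include the point stored at index `skip`.
        pairs = sorted((math.dist(x, p), i) for i, p in enumerate(points) if i != skip)
        return pairs[:k]

    def score_one(self, x):
        x = tuple(x)
        points = list(self.window)  # snapshot; the model itself is never modified
        if len(points) <= self.k:
            return 1.0  # assumption: too few samples, report "inlier"
        # Assumption: a stored point is recognised by exact equality; one copy is skipped
        # so that the point is never its own neighbor.
        skip = points.index(x) if x in points else None

        cache = {}  # per-call memoization of neighborhoods

        def knn_of(i):
            if i not in cache:
                cache[i] = self._knn(points, points[i], self.k, skip=i)
            return cache[i]

        def lrd(neighbors):
            # reach-dist(p, o) = max(k-distance(o), d(p, o)), where k-distance(o)
            # is the distance from o to its k-th neighbor.
            reach = [max(knn_of(j)[-1][0], d) for d, j in neighbors]
            return 1.0 / (sum(reach) / len(reach) + 1e-10)

        query_nn = self._knn(points, x, self.k, skip=skip)
        neighbor_lrd = [lrd(knn_of(j)) for _, j in query_nn]
        return (sum(neighbor_lrd) / len(neighbor_lrd)) / lrd(query_nn)


model = WindowedLOFSketch(n_neighbors=2, window_size=100)
for v in (0.0, 1.0, 2.0, 3.0):
    model.learn_one((v,))
inlier = model.score_one((1.5,))    # sits inside the cluster
outlier = model.score_one((10.0,))  # far from every stored point
```

In this sketch a score below 1 means the query is at least as dense as its neighbors, and a score well above 1 flags an outlier. The `cache` dictionary is where the memoization pays off: neighbors of the query often share neighbors of their own, so each stored point's neighborhood is searched at most once per call.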

Neighbors bug fix (prerequisite)

While validating against scikit-learn I found a pre-existing bug in the Rust Euclidean fast path of neighbors.LazySearch: the search heap was keyed on the negated distance, so it returned the farthest k candidates instead of the nearest. This silently affected KNNClassifier, KNNRegressor, and the new LOF whenever they ran over a LazySearch engine with the default Euclidean distance. Fixed, with a regression test (river/neighbors/test_lazy.py). No test previously covered the fast path's correctness, which is how the bug slipped through.
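
The effect of a wrong heap key can be reproduced in a few lines. The sketch below is Python, not the Rust fast path, and `k_nearest` is a hypothetical helper. Note that the signs are mirrored here: Python's `heapq` is a min-heap, so a bounded heap that must evict the farthest kept candidate needs the negated distance as its key, whereas on a max-heap the same mistake comes from negating. Which heap type the Rust code uses is not stated in this description.

```python
import heapq


def k_nearest(distances, k, buggy=False):
    """Return the indices of the k kept candidates using a bounded heap of size k."""
    heap = []
    for idx, d in enumerate(distances):
        # The heap must pop the *farthest* kept candidate. With a min-heap the
        # correct key is -d; keying on d instead evicts the nearest candidate.
        key = d if buggy else -d
        if len(heap) < k:
            heapq.heappush(heap, (key, idx))
        else:
            heapq.heappushpop(heap, (key, idx))  # push, then drop the smallest key
    return sorted(idx for _, idx in heap)


distances = [5.0, 1.0, 4.0, 2.0, 3.0]
nearest = k_nearest(distances, k=2)               # keeps distances 1.0 and 2.0
farthest = k_nearest(distances, k=2, buggy=True)  # keeps distances 5.0 and 4.0
```

A regression test in this spirit only needs a handful of points whose nearest neighbors are known by construction, which is cheap enough to run on every build.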

Estimator-check changes

anomaly.check_roc_auc now scores each sample before learning it (prequential evaluation) instead of after. Scoring a just-learned point leaks the label and inflates the ROC AUC; all affected detectors (HST, LODA, OneClassSVM, LOF) still pass comfortably.
KNNRegressor skips check_shuffle_features_no_impact: its distance-weighted average is sensitive to float summation order under feature reordering (the corrected search exposed this; KNNClassifier's argmax absorbs it). Same precedent as the forest models.
LocalOutlierFactor joins the automated estimator checks (it was previously ignored). _unit_test_params runs it at n_neighbors=20, which is scikit-learn's own default; k=10 is genuinely weak on the small, duplicate-heavy CreditCard check sample, for scikit-learn's LOF as well.
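
The score-then-learn ordering used by the revised check can be sketched as follows. This is not the body of anomaly.check_roc_auc: `RunningMeanDetector`, `roc_auc`, and `prequential_auc` are hypothetical stand-ins, and the stream is made-up toy data with scalar inputs rather than river's feature dicts.

```python
def roc_auc(labels, scores):
    """Chance that a random anomaly outscores a random normal point (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


class RunningMeanDetector:
    """Toy detector (not a river class): score is the distance to the running mean."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def learn_one(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n

    def score_one(self, x):
        return abs(x - self.mean)


def prequential_auc(detector, stream):
    labels, scores = [], []
    for x, y in stream:
        scores.append(detector.score_one(x))  # score first: the model has not seen x
        detector.learn_one(x)                 # only then update the model
        labels.append(y)
    return roc_auc(labels, scores)


# Toy stream of (value, is_anomaly) pairs, invented for this example.
stream = [(1.0, 0), (1.1, 0), (0.9, 0), (9.0, 1), (1.0, 0), (8.5, 1), (1.05, 0)]
auc = prequential_auc(RunningMeanDetector(), stream)
```

Swapping the two lines inside the loop gives the previous ordering, in which every score is computed by a model that has already absorbed the point being scored.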

Benchmarks

learn_one: O(1) vs. old super-linear (~1 s for 600 inserts and climbing).
score_one: ~12 ms at window 1000, k=20 (down from ~73 ms before memoization).
Accuracy: prequential ROC AUC on CreditCard (5k, k=20) = 0.795, vs. scikit-learn static LOF = 0.728.

Behavior changes

Scores reflect the most recent window_size samples rather than the entire history.
Scoring an already-seen point returns its LOF instead of 0.0.
The distance_func parameter is replaced by the engine's distance function.

Part of #1919.

🤖 Generated with Claude Code

…tector

Replace the incremental-LOF bookkeeping (nine threaded dicts + free functions) with a small class that delegates storage and neighbor search to a `river.neighbors` engine (LazySearch by default, SWINN for approximate search).

- `learn_one` is now O(1): it appends to a bounded sliding window, so memory no longer grows with the stream (was super-linear before).
- `score_one` computes the LOF against the current window on demand and no longer mutates the model. A point is never its own neighbor, so an unseen point reproduces scikit-learn's `LocalOutlierFactor(novelty=True)` and a stored point reproduces the in-sample `negative_outlier_factor_` (matched to ~1e-15). Per-score memoization keeps it ~6x faster.
- `learn_many` accepts any narwhals-supported eager dataframe (pandas, polars, pyarrow, ...) instead of pandas only. Addresses the anomaly/lof.py item of #1919.

Along the way, fix a pre-existing bug in the Rust Euclidean fast path of `neighbors.LazySearch`: its search heap was keyed on the negated distance, so it returned the *farthest* k candidates instead of the nearest. This affected KNNClassifier/KNNRegressor/LOF on a LazySearch engine with the default distance. Add a regression test (test_lazy.py).

Switch the shared `check_roc_auc` anomaly check to score-then-learn so it no longer leaks the label by scoring an already-learned point. KNNRegressor now skips check_shuffle_features_no_impact (its weighted average is sensitive to float summation order under feature reordering, like the forest models); LOF joins the automated estimator checks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

MaxHalford · 2026-06-26T14:02:56Z

@smastelini I noticed LOF was reimplementing nearest neighbours logic. With this refactoring, we can plug in our NN backends, including SWINN :)

MaxHalford requested a review from smastelini as a code owner June 26, 2026 13:35

MaxHalford mentioned this pull request Jun 26, 2026

Migrate all mini-batch (_many) methods to narwhals for dataframe-agnostic support #1919

Open

17 tasks

Merge branch 'main' into feat/lof-windowed-rewrite

582b541

MaxHalford wants to merge 2 commits into main from feat/lof-windowed-rewrite
