Rewrite LocalOutlierFactor as a windowed, on-demand detector #1929

Open
MaxHalford wants to merge 2 commits into main from feat/lof-windowed-rewrite
@MaxHalford

Member

What

Rewrites anomaly.LocalOutlierFactor from the incremental-LOF bookkeeping (nine threaded dicts + a stack of free functions) into a small class that delegates storage and neighbor search to a river.neighbors engine.

LocalOutlierFactor

  • Engine-based & bounded. Samples live in a fixed-size sliding window managed by a BaseNN engine — LazySearch (exact) by default, or SWINN for approximate search, mirroring KNNClassifier/KNNRegressor. learn_one is now O(1) and memory is bounded by the window size (it was super-linear and unbounded before).

  • On-demand, non-mutating scoring. score_one computes the LOF against the current window without modifying the model. A point is never its own neighbor, so:

    • an unseen point reproduces scikit-learn's LocalOutlierFactor(novelty=True), and
    • a stored point reproduces the in-sample negative_outlier_factor_,

    both matched to within ~1e-15. Per-score memoization of neighborhoods makes scoring ~6× faster than naive recomputation.

  • narwhals mini-batching. learn_many accepts any narwhals-supported eager dataframe (pandas, polars, pyarrow, …) instead of pandas only. Completes the anomaly/lof.py item of #1919 (Migrate all mini-batch (_many) methods to narwhals for dataframe-agnostic support).

  • Slimmer, more useful docstring; keeps both the Breunig (2000) and Pokrajac (2007) references.
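
For readers who want the shape of the idea without opening the diff, here is a minimal sketch of a windowed, on-demand LOF. It is not the code in this PR: the class name WindowedLOF is hypothetical, a brute-force scan stands in for the river.neighbors engine, ties at the k-th distance are ignored, and the 1e-10 epsilon and the "return 1.0 when the window is too small" rule are assumptions of the sketch.

```python
import math
from collections import deque


class WindowedLOF:
    """Toy windowed LOF. Hypothetical names; brute force stands in for an engine."""

    def __init__(self, n_neighbors=2, window_size=100):
        self.k = n_neighbors
        self.window = deque(maxlen=window_size)  # the oldest sample is evicted when full

    def learn_one(self, x):
        # Appending is O(1); nothing else is updated at learning time.
        self.window.append(tuple(x))

    def score_one(self, x):
        x = tuple(x)
        points = list(self.window)  # snapshot; the model itself is not modified
        if len(points) <= self.k:
            return 1.0  # sketch-only convention: too little data to judge
        # If x is already stored, skip its own slot so it is never its own neighbor.
        own = points.index(x) if x in points else None
        cache = {}  # per-call memo of neighborhoods, keyed by window index

        def knn(i, p):
            # k nearest stored points to p as sorted (distance, index) pairs, skipping slot i.
            if i not in cache:
                dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
                cache[i] = sorted(dists)[: self.k]
            return cache[i]

        def lrd(i, p):
            # Local reachability density: inverse of the mean reachability distance,
            # where reach(p, o) = max(d(p, o), k-distance(o)).
            reach = [max(d, knn(j, points[j])[-1][0]) for d, j in knn(i, p)]
            return 1.0 / (sum(reach) / len(reach) + 1e-10)  # epsilon guards duplicates

        neighbors = knn(own, x)
        mean_neighbor_lrd = sum(lrd(j, points[j]) for _, j in neighbors) / len(neighbors)
        return mean_neighbor_lrd / lrd(own, x)
```

Scores near 1 indicate a point about as dense as its neighbors; scores well above 1 indicate an outlier. The cache is rebuilt on every call, which is what keeps score_one free of side effects.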

Neighbors bug fix (prerequisite)

While validating against scikit-learn I found a pre-existing bug in the Rust Euclidean fast path of neighbors.LazySearch: the search heap was keyed on the negated distance, so it returned the farthest k candidates instead of the nearest. This silently affected KNNClassifier, KNNRegressor, and the new LOF whenever they ran over a LazySearch engine with the default Euclidean distance. Fixed, with a regression test (river/neighbors/test_lazy.py) — there was previously no test covering the fast path's correctness, which is how it slipped through.
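
To illustrate how a single sign decides between nearest and farthest in a bounded heap, here is a small Python sketch. It is not the Rust code from the fix; the helper k_bounded is hypothetical. Note that Python's heapq is a min-heap, so the sign convention is the mirror image of a max-heap such as Rust's standard BinaryHeap: here the negated key is the correct one.

```python
import heapq


def k_bounded(distances, k, negate):
    """Keep k candidates in a size-bounded heap; return their indices, sorted."""
    heap = []
    for i, d in enumerate(distances):
        key = -d if negate else d
        heapq.heappush(heap, (key, i))
        if len(heap) > k:
            heapq.heappop(heap)  # always evicts the entry with the smallest key
    return sorted(i for _, i in heap)


distances = [5.0, 1.0, 4.0, 2.0, 3.0]
nearest = k_bounded(distances, k=2, negate=True)    # evicts the largest distances
farthest = k_bounded(distances, k=2, negate=False)  # evicts the smallest distances
```

Both calls return k valid candidates without raising, which is why this kind of error goes unnoticed unless a test compares the result against a brute-force sort.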

Estimator-check changes

  • anomaly.check_roc_auc now scores before learning each sample (prequential), instead of after. Scoring a just-learned point leaks the label and inflates the score; all affected detectors (HST, LODA, OneClassSVM, LOF) still pass comfortably.
  • KNNRegressor skips check_shuffle_features_no_impact: its distance-weighted average is sensitive to float summation order under feature reordering (the corrected search exposed this; KNNClassifier's argmax absorbs it). Same precedent as the forest models.
  • LocalOutlierFactor joins the automated estimator checks (it was previously ignored); _unit_test_params runs it at n_neighbors=20 (scikit-learn's own default — k=10 is genuinely weak on the small, duplicate-heavy CreditCard check sample, as it is for scikit-learn there).
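
The score-then-learn order in the first bullet can be sketched as follows. This is an illustration only, not the body of anomaly.check_roc_auc: RunningMeanDetector, roc_auc, and prequential_auc are hypothetical helpers, and the toy detector simply scores a value by its distance from the mean of what it has learned so far.

```python
class RunningMeanDetector:
    """Hypothetical toy detector: score is the distance from the running mean."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def learn_one(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n

    def score_one(self, x):
        return abs(x - self.mean) if self.n else 0.0


def roc_auc(scores, labels):
    # Probability that a random anomaly outscores a random normal point (ties count half).
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def prequential_auc(model, stream):
    scores, labels = [], []
    for x, y in stream:
        scores.append(model.score_one(x))  # score first: the model has not seen x yet
        model.learn_one(x)                 # only then update the model
        labels.append(y)
    return roc_auc(scores, labels)


stream = [(0.0, False), (0.1, False), (5.0, True), (0.2, False), (-0.1, False), (6.0, True)]
auc = prequential_auc(RunningMeanDetector(), stream)
```

Swapping the two lines inside the loop would score each sample against a model that has already absorbed it, which is the ordering the check moved away from.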

Benchmarks

  • learn_one: O(1) vs. old super-linear (~1 s for 600 inserts and climbing).
  • score_one: ~12 ms at window 1000, k=20 (down from ~73 ms before memoization).
  • Accuracy: prequential ROC AUC on CreditCard (5k, k=20) = 0.795, vs. scikit-learn static LOF = 0.728.

Behavior changes

  • Scores reflect the most recent window_size samples rather than the entire history.
  • Scoring an already-seen point returns its LOF instead of 0.0.
  • The distance_func parameter is replaced by the engine's distance function.

Part of #1919.

🤖 Generated with Claude Code

…tector

Replace the incremental-LOF bookkeeping (nine threaded dicts + free
functions) with a small class that delegates storage and neighbor search
to a `river.neighbors` engine (LazySearch by default, SWINN for
approximate search).

- `learn_one` is now O(1): it appends to a bounded sliding window, so
  memory no longer grows with the stream (was super-linear before).
- `score_one` computes the LOF against the current window on demand and
  no longer mutates the model. A point is never its own neighbor, so an
  unseen point reproduces scikit-learn's `LocalOutlierFactor(novelty=True)`
  and a stored point reproduces the in-sample `negative_outlier_factor_`
  (matched to ~1e-15). Per-score memoization keeps it ~6x faster.
- `learn_many` accepts any narwhals-supported eager dataframe (pandas,
  polars, pyarrow, ...) instead of pandas only. Addresses the
  anomaly/lof.py item of #1919.

Along the way, fix a pre-existing bug in the Rust Euclidean fast path of
`neighbors.LazySearch`: its search heap was keyed on the negated distance,
so it returned the *farthest* k candidates instead of the nearest. This
affected KNNClassifier/KNNRegressor/LOF on a LazySearch engine with the
default distance. Add a regression test (test_lazy.py).

Switch the shared `check_roc_auc` anomaly check to score-then-learn so it
no longer leaks the label by scoring an already-learned point. KNNRegressor
now skips check_shuffle_features_no_impact (its weighted average is
sensitive to float summation order under feature reordering, like the
forest models); LOF joins the automated estimator checks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@MaxHalford

Member Author

@smastelini I noticed LOF was reimplementing nearest neighbours logic. With this refactoring, we can plug in our NN backends, including SWINN :)
