Rewrite LocalOutlierFactor as a windowed, on-demand detector#1929
Open
MaxHalford wants to merge 2 commits into
Open
Rewrite LocalOutlierFactor as a windowed, on-demand detector#1929MaxHalford wants to merge 2 commits into
MaxHalford wants to merge 2 commits into
Conversation
…tector Replace the incremental-LOF bookkeeping (nine threaded dicts + free functions) with a small class that delegates storage and neighbor search to a `river.neighbors` engine (LazySearch by default, SWINN for approximate search). - `learn_one` is now O(1): it appends to a bounded sliding window, so memory no longer grows with the stream (was super-linear before). - `score_one` computes the LOF against the current window on demand and no longer mutates the model. A point is never its own neighbor, so an unseen point reproduces scikit-learn's `LocalOutlierFactor(novelty=True)` and a stored point reproduces the in-sample `negative_outlier_factor_` (matched to ~1e-15). Per-score memoization keeps it ~6x faster. - `learn_many` accepts any narwhals-supported eager dataframe (pandas, polars, pyarrow, ...) instead of pandas only. Addresses the anomaly/lof.py item of #1919. Along the way, fix a pre-existing bug in the Rust Euclidean fast path of `neighbors.LazySearch`: its search heap was keyed on the negated distance, so it returned the *farthest* k candidates instead of the nearest. This affected KNNClassifier/KNNRegressor/LOF on a LazySearch engine with the default distance. Add a regression test (test_lazy.py). Switch the shared `check_roc_auc` anomaly check to score-then-learn so it no longer leaks the label by scoring an already-learned point. KNNRegressor now skips check_shuffle_features_no_impact (its weighted average is sensitive to float summation order under feature reordering, like the forest models); LOF joins the automated estimator checks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
17 tasks
Member
Author
|
@smastelini I noticed LOF was reimplementing nearest neighbours logic. With this refactoring, we can plug in our NN backends, including SWINN :) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Rewrites
anomaly.LocalOutlierFactorfrom the incremental-LOF bookkeeping (nine threaded dicts + a stack of free functions) into a small class that delegates storage and neighbor search to ariver.neighborsengine.LocalOutlierFactorEngine-based & bounded. Samples live in a fixed-size sliding window managed by a
BaseNNengine —LazySearch(exact) by default, orSWINNfor approximate search, mirroringKNNClassifier/KNNRegressor.learn_oneis now O(1) and memory is bounded by the window size (it was super-linear and unbounded before).On-demand, non-mutating scoring.
score_onecomputes the LOF against the current window without modifying the model. A point is never its own neighbor, so:LocalOutlierFactor(novelty=True), andnegative_outlier_factor_,both matched to ~1e-15. Per-score memoization of neighborhoods keeps it ~6× faster than the naive recompute.
narwhals mini-batching.
learn_manyaccepts any narwhals-supported eager dataframe (pandas, polars, pyarrow, …) instead of pandas only. Completes theanomaly/lof.pyitem of Migrate all mini-batch (_many) methods to narwhals for dataframe-agnostic support #1919.Slimmer, more useful docstring; keeps both the Breunig (2000) and Pokrajac (2007) references.
Neighbors bug fix (prerequisite)
While validating against scikit-learn I found a pre-existing bug in the Rust Euclidean fast path of
neighbors.LazySearch: the search heap was keyed on the negated distance, so it returned the farthest k candidates instead of the nearest. This silently affectedKNNClassifier,KNNRegressor, and the new LOF whenever they ran over aLazySearchengine with the default Euclidean distance. Fixed, with a regression test (river/neighbors/test_lazy.py) — there was previously no test covering the fast path's correctness, which is how it slipped through.Estimator-check changes
anomaly.check_roc_aucnow scores before learning each sample (prequential), instead of after. Scoring a just-learned point leaks the label and inflates the score; all affected detectors (HST, LODA, OneClassSVM, LOF) still pass comfortably.KNNRegressorskipscheck_shuffle_features_no_impact: its distance-weighted average is sensitive to float summation order under feature reordering (the corrected search exposed this;KNNClassifier's argmax absorbs it). Same precedent as the forest models.LocalOutlierFactorjoins the automated estimator checks (it was previously ignored);_unit_test_paramsruns it atn_neighbors=20(scikit-learn's own default —k=10is genuinely weak on the small, duplicate-heavy CreditCard check sample, as it is for scikit-learn there).Benchmarks
Behavior changes
window_sizesamples rather than the entire history.0.0.distance_funcparameter is replaced by the engine's distance function.Part of #1919.
🤖 Generated with Claude Code