Add decomposition methods OnlineSVD, OnlinePCA, OnlineDMD/wC + Hankelizer #1509
MarekWadinger wants to merge 101 commits
…eig + FIX: exponential w in learn many + MINOR: robustness
… ADD: score attribute
Standardization of input shapes
|
Hello @MaxHalford and @hoanganhngo610, 👋 I believe the methods are ready for benchmarking. The results are published in this notebook. The plot combines two checks: performance with respect to the number of features, and the delay imposed by converting a pd.DataFrame (dict) to the np.array used in the core. The mean absolute number of processed samples per second is provided here (for a number of features in range(3, 20), as it remains fairly stable):
The results in the notebook indicate that using pd.DataFrame slows down OnlinePCA, which is the fastest decomposition implementation, by up to 14%. However, I believe your concerns are likely related to the fact that the core of the decomposition methods works with np.arrays, correct? What are your thoughts on the performance and adequacy of the evaluation? Thanks for your time 🙏 |
|
Is this still active? |
|
Hey @s-bessing. I would love to have this reviewed and published. I'm actively working with OnlineDMD. @hoanganhngo610 @MaxHalford, are you available to revive the discussion on this? I'm ready to fix the checks if you could also provide some feedback on my latest comment. :) Thx |
|
@MarekWadinger, thanks for the reply. I am currently working on an online topic model. For this, I came across the river package and like the approach. Currently, I use a static reducer (UMAP), but I am not entirely satisfied with it since it is static. |
|
Hey @s-bessing. I used to really like UMAP in my fault detection projects. I believe OnlineDMD could be a match, but there are some bottlenecks. It works much better on reasonably noisy data, as high autocorrelation may break the underlying SVD computation when the signal is piece-wise constant (this happens to me, for instance, in OnlineDMDc, where the control signal is noiseless and does not change for a while). But if there are certain periodic components and dominant patterns in your data, I think you should give it a try. :) |
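One way to read the remark about piece-wise constant signals, assuming the trouble comes from rank deficiency of the snapshot matrix (an assumption, not something stated in the thread): when every snapshot is identical, the data matrix has rank 1, and a little noise restores full rank. A minimal NumPy sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 50

# Piece-wise constant signal: every snapshot (column) is identical.
constant = np.tile(np.array([[1.0], [2.0], [3.0]]), (1, n_samples))

# The same signal with a small amount of measurement noise (illustrative scale).
noisy = constant + 0.01 * rng.normal(size=constant.shape)

# Numerical rank of each 3 x 50 snapshot matrix.
rank_constant = np.linalg.matrix_rank(constant)
rank_noisy = np.linalg.matrix_rank(noisy)

print(rank_constant, rank_noisy)
```

This only illustrates the conditioning issue on a batch matrix; it does not reproduce the online update used in the PR.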
|
Hi @MarekWadinger and @s-bessing, Thanks a lot for your contribution! |
|
Hey @kulbachcedric, I would love to bring this back to life. It got stuck in the review process. I can dust it off and would love to get feedback on the PR once I'm done. :) |
…iness
- Fix failing doctests: wrap np.isclose/np.allclose in bool() for
NumPy 2.x compatibility, skip non-deterministic OnlinePCA outputs
- Replace deprecated np.row_stack with np.vstack throughout
- Replace assert statements with ValueError for parameter validation
- Add _unit_test_skips to OnlineDMD so check_estimator passes
- Remove debug counters (_n_cached, _n_computed) marked for removal
- Remove warning spam in OnlineSVDZhang.update (60k+ warnings in tests)
- Add type hint for OnlineSVD.solver parameter
- Clean up all TODO comments across the module
- Add release notes for decomposition module and Hankelizer
…argetRegressor for correct type hints
… with numerical precision notes
|
Funny, the failing test converges to different optima on my machine and on the Ubuntu runner where the code-quality checks are executed. I simplified the test so that it only checks whether the eigenvalues are finite. |
|
@kulbachcedric ready for rereview ;) |
# Conflicts:
#   docs/releases/unreleased.md
#   river/compose/pipeline.py
#   river/utils/rolling.py
- preprocessing.Hankelizer: convert docstring from Google style
  (Args:/Examples:/Todo:) to the NumPy convention mandated by
  CONTRIBUTING.md. Unblocks river/test_docs.py::test_print_docstring,
  which now parses every public docstring.
- test_odmdwc.py: access .A on the wrapped OnlineDMDwC instance through
  Rolling.obj instead of relying on __getattr__ delegation, which mypy
  cannot resolve statically. Behavior unchanged.
- Make pandas optional in odmd.py: the top-level `import pandas as pd`
  was failing the new "Tests without pandas" CI job introduced upstream.
  Pandas is now imported under `TYPE_CHECKING`, and runtime `isinstance`
  checks go through a `_is_dataframe` TypeGuard backed by
  `utils.pandas.PANDAS_INSTALLED` / `utils.pandas.import_pandas()`.
- Rename three N802-violating methods exposed by upstream's pep8-naming
  ruleset: `A_allclose` -> `a_allclose`, `_update_A_P` -> `_update_a_p`,
  `_reconstruct_AB` -> `_reconstruct_ab`. All callsites are within
  odmd.py; no external API impact.
The previous fix only covered odmd.py, but odmd.py imports osvd.py at line 30, so the eager `import pandas as pd` in osvd.py was still crashing CI's "Tests without pandas" job during decomposition module collection.

Same treatment as odmd.py: pandas moved under TYPE_CHECKING, an `_is_dataframe` TypeGuard handles the runtime isinstance check, and the `pd.DataFrame(...)` constructions in `transform_many` go through `utils.pandas.import_pandas()` so calling that mini-batch method without pandas surfaces the project's standard "install river[pandas]" error.

Added `>>> import numpy as np` / `>>> import pandas as pd` to the OnlineSVD and OnlineSVDZhang docstring examples, which relied on the module-level imports that are now lazy.

Verified end-to-end against the CI recipe in a fresh venv: `uv sync --all-extras --group dev` then `uv pip uninstall pandas` then `pytest` -> 4391 passed, 0 failed. Full suite with pandas: 4654 passed.
`sp.sparse.linalg.svds` uses ARPACK, whose initial random vector is not controlled by `np.random.seed`, so consecutive calls return singular vectors with arbitrary (and uncorrelated) signs. The previous assertion compared raw vector differences, which failed nondeterministically whenever any column happened to land on opposite signs across the three SVD calls. Align column signs to `u_orig` before computing distances. Verified deterministic across 8 isolated runs and 4 full-suite runs.
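The sign alignment described above can be sketched as follows. The `align_signs` helper and the data are hypothetical, and `np.linalg.svd` with manually flipped columns stands in for the ARPACK-backed solver so that the example is deterministic.

```python
import numpy as np


def align_signs(u: np.ndarray, u_ref: np.ndarray) -> np.ndarray:
    """Flip columns of u so each has a non-negative inner product
    with the corresponding column of u_ref."""
    signs = np.sign(np.sum(u * u_ref, axis=0))
    signs[signs == 0] = 1.0  # leave exactly orthogonal columns untouched
    return u * signs


rng = np.random.default_rng(0)
x = rng.normal(size=(20, 5))
u_orig, _, _ = np.linalg.svd(x, full_matrices=False)

# Simulate a solver returning the same singular vectors with arbitrary signs.
u_flipped = u_orig * np.array([1.0, -1.0, 1.0, -1.0, -1.0])

# Comparing raw vectors reports a large gap that is purely a sign artifact.
raw_gap = np.linalg.norm(u_flipped - u_orig)

# After alignment the two bases coincide up to floating-point error.
aligned_gap = np.linalg.norm(align_signs(u_flipped, u_orig) - u_orig)
```

Singular vectors are only defined up to sign, so a comparison should either align signs as here or compare sign-invariant quantities such as subspace projections.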
|
Hello @MarekWadinger. I'm open to merging your contributions, but not like this. What I'd like to do is open one PR for each contribution, and take a thorough look at each method. These methods could be useful, so we need to benchmark them, assess their relevance, and provide usage documentation. Merging this whole PR as is would not provide benefits for users, I believe. |

Hello @MaxHalford, @hoanganhngo610, and everyone 👋,
In #1366, @MaxHalford showed interest in implementation of OnlinePCA and OnlineSVD methods in river.
Given my current project involvement with online decomposition methods, I believe the community could benefit from having access to these methods and their maintenance over time. Additionally, I am particularly interested in DMD, which combines the advantages of PCA and FFT. Hence, I propose the introduction of three new methods as part of the new decomposition module:
- decomposition.OnlineSVD, implemented based on Brand, M. (2006) (proposed by @MaxHalford in the issue), with some considerations on re-orthogonalization. Since re-orthogonalization is required quite often, compromising computation speed, it could be interesting to align with Zhang, Y. (2022) (I made some effort to implement it, but I have yet to explore its validity and the possibility of implementing revert in a similar vein).
- decomposition.OnlinePCA, implemented based on Eftekhari, A. (2019) (proposed by @MaxHalford in the issue), as it is currently state-of-the-art with all the proofs and guarantees. I would be happy to validate together whether all considerations are handled in the proposed OnlinePCA.
- decomposition.OnlineDMD, implemented based on Zhang, H. (2019). It can operate as a MiniBatchTransformer, a MiniBatchRegressor (sort of), and works with Rolling, so I would need some help figuring out how we'd like to classify it (maybe a new base class Decomposer).

Additionally, I propose preprocessing.Hankelizer, which could be beneficial for various regressors and is particularly useful for enhancing the feature space by introducing time-delayed embedding.

I've tried to include all necessary tests. However, I need to investigate why re-orthogonalization in OnlineSVD yields significantly different values when tested on various operating systems (locally, all tests pass).
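For readers unfamiliar with time-delayed embedding, the idea behind a Hankelizer can be sketched as below. `HankelSketch`, its `w` parameter, and the `_lag` key suffix are hypothetical names chosen for illustration; this is not the API of the proposed preprocessing.Hankelizer.

```python
from collections import deque


class HankelSketch:
    """Hypothetical time-delay embedder over a sliding window of w samples."""

    def __init__(self, w: int = 2):
        if w < 1:
            raise ValueError("w must be >= 1")
        self.w = w
        self._window = deque(maxlen=w)  # oldest samples drop out automatically

    def learn_one(self, x: dict) -> None:
        self._window.append(dict(x))

    def transform_one(self) -> dict:
        """Flatten the window into one dict, newest sample first (lag 0)."""
        out = {}
        for lag, past in enumerate(reversed(self._window)):
            for key, value in past.items():
                out[f"{key}_lag{lag}"] = value
        return out


hankel = HankelSketch(w=2)
for sample in ({"x": 1}, {"x": 2}, {"x": 3}):
    hankel.learn_one(sample)

embedded = hankel.transform_one()
print(embedded)  # {'x_lag0': 3, 'x_lag1': 2}
```

Each transformed sample carries the current and previous values side by side, which is what lets a downstream model see short-term dynamics.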
Looking forward to your comments and revisions. 😌