5 changes: 5 additions & 0 deletions docs/releases/unreleased.md
@@ -15,6 +15,9 @@

 ## covariance

+- Added `EwaCovariance`, `LedoitWolfCovariance`, `OASCovariance`, and `ShrunkCovariance`: online covariance estimators for non-stationary streams (exponentially weighted, recency-biased) and high-dimensional / few-sample regimes (shrinkage towards a well-conditioned target). They are dict-native like `EmpiricalCovariance` and support mini-batches via `update_many` on any [narwhals](https://github.com/narwhals-dev/narwhals)-supported eager backend.
+- Added `EwaPrecision`, an exponentially weighted precision (inverse covariance) matrix maintained online via a forgetting-factor Sherman-Morrison update. The recency-weighted counterpart of `EmpiricalPrecision`, useful for tracking Mahalanobis distances and Gaussian likelihoods on non-stationary streams.
+- `EmpiricalCovariance.update_many` and `EmpiricalPrecision.update_many` now accept any [narwhals](https://github.com/narwhals-dev/narwhals)-supported eager dataframe (pandas, polars, pyarrow, ...) instead of pandas only. Outputs are unchanged for the pandas path.
 - Added weighted sample support to `EmpiricalCovariance.update` and `EmpiricalCovariance.revert` by accepting an optional `w` parameter and propagating it to the underlying `stats.Cov` and `stats.Var` statistics.
 - Sped up `EmpiricalCovariance.update`/`revert` (~40% faster at 30 features) by caching the sorted feature list and pair iteration in the hot path. No semantic change.
 - Restructured `EmpiricalPrecision` around NumPy-backed dense state, removing the per-update dict ↔ numpy marshalling. ~7× faster on 2000 × 20 sample streams.
@@ -24,6 +27,7 @@

 - Added `datasets.CriteoAds`, a 100,000-row sample of the Criteo Display Advertising Challenge (binary click prediction with 13 integer and 26 high-cardinality categorical features). A natural fit for one-hot models such as `linear_model.AdPredictor`.
 - Added `datasets.Shuttle`, the UCI Statlog (Shuttle) dataset cast as a binary anomaly-detection task following the ODDS benchmark (49,097 observations, 9 numerical features, ~7% anomalies). Ships bundled with River.
+- Added `datasets.SP500Stocks`, daily returns (1,257 trading days, 2013-2018) for ten large-cap S&P 500 stocks across diverse sectors. A natural fit for the online covariance estimators in `river.covariance`.

 ## facto

@@ -91,6 +95,7 @@

 ## stats

+- Added `stats.EWCov`, an exponentially weighted covariance between two variables (the bivariate counterpart of `stats.EWVar`).
 - Added `stats.ChiSquared`, a streaming Chi-squared statistic between two categorical variables. Wrap it with `utils.Rolling` for a rolling version.

 ## stream
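
Reviewer note: the `stats.EWCov` entry describes an exponentially weighted covariance between two variables. The sketch below shows one common recursive formulation of that idea in plain Python. It is not River's implementation: the class name `EWCovSketch`, the `fading_factor` parameter, and the choice to let the first sample only initialise the means are all assumptions made for illustration.

```python
class EWCovSketch:
    """Hypothetical exponentially weighted covariance between two variables."""

    def __init__(self, fading_factor: float = 0.5):
        if not 0 < fading_factor <= 1:
            raise ValueError("fading_factor must be in (0, 1]")
        self.alpha = fading_factor
        self.mean_x = None
        self.mean_y = None
        self.cov = 0.0

    def update(self, x: float, y: float) -> None:
        if self.mean_x is None:
            # Assumption: the first sample only initialises the running means.
            self.mean_x, self.mean_y = x, y
            return
        # Deviations from the means *before* they are moved.
        dx = x - self.mean_x
        dy = y - self.mean_y
        # Exponentially weighted means.
        self.mean_x += self.alpha * dx
        self.mean_y += self.alpha * dy
        # Decay the old covariance and blend in the new cross-deviation.
        self.cov = (1 - self.alpha) * (self.cov + self.alpha * dx * dy)

    def get(self) -> float:
        return self.cov


xs = [1.0, 2.0, 4.0, 3.0, 5.0]
cov_xy = EWCovSketch(fading_factor=0.3)  # covariance of x with y = 2x
var_x = EWCovSketch(fading_factor=0.3)   # covariance of x with itself
for x in xs:
    cov_xy.update(x, 2 * x)
    var_x.update(x, x)
```

Feeding the same variable twice recovers an exponentially weighted variance, which is why the release note calls the statistic the bivariate counterpart of `stats.EWVar`. Here `cov_xy.get()` is twice `var_x.get()` because `y` is an exact multiple of `x`.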
31 changes: 29 additions & 2 deletions river/covariance/__init__.py
@@ -1,7 +1,34 @@
-"""Online estimation of covariance and precision matrices."""
+"""Online estimation of covariance and precision matrices.
+
+A covariance matrix summarises how a set of variables move together. It is the engine behind
+portfolio risk, anomaly detection (via the Mahalanobis distance), Gaussian models, and many
+dimensionality-reduction methods. This module estimates it (and its inverse, the precision
+matrix) incrementally from a stream, without storing the data. See each estimator's docstring for
+what it does and when to reach for it.
+
+The estimators are dict-native: `update(x)` takes a mapping and the `matrix` is a dict of pairwise
+values. Most also expose an `update_many` method for mini-batches of any narwhals-compatible
+dataframe.
+
+"""

 from __future__ import annotations

 from .emp import EmpiricalCovariance, EmpiricalPrecision
+from .ewa import (
+    EwaCovariance,
+    EwaPrecision,
+    LedoitWolfCovariance,
+    OASCovariance,
+    ShrunkCovariance,
+)

-__all__ = ["EmpiricalCovariance", "EmpiricalPrecision"]
+__all__ = [
+    "EmpiricalCovariance",
+    "EmpiricalPrecision",
+    "EwaCovariance",
+    "EwaPrecision",
+    "LedoitWolfCovariance",
+    "OASCovariance",
+    "ShrunkCovariance",
+]
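
Reviewer note: the module docstring describes the matrix as "a dict of pairwise values", and the release notes describe shrinkage "towards a well-conditioned target". The sketch below combines the two ideas in plain Python, using the textbook convex blend of a sample covariance with a scaled identity. It is an illustration under stated assumptions, not the code in `river/covariance/ewa.py`: the function name `shrink`, the fixed `shrinkage` coefficient, and the convention of storing only `(i, j)` keys with `i <= j` are all hypothetical.

```python
def shrink(matrix: dict, shrinkage: float) -> dict:
    """Blend a dict-native covariance matrix with a scaled-identity target.

    Assumption: `matrix` maps (i, j) pairs with i <= j to covariance values,
    so every feature appears on the diagonal as (i, i).
    """
    if not 0 <= shrinkage <= 1:
        raise ValueError("shrinkage must be in [0, 1]")
    names = sorted({i for i, _ in matrix})
    # Target is mu * I, where mu is the average variance (trace / p).
    mu = sum(matrix[i, i] for i in names) / len(names)
    return {
        (i, j): (1 - shrinkage) * value + (shrinkage * mu if i == j else 0.0)
        for (i, j), value in matrix.items()
    }


sample_cov = {("a", "a"): 4.0, ("a", "b"): 1.5, ("b", "b"): 1.0}
shrunk = shrink(sample_cov, shrinkage=0.2)
```

The blend leaves the trace unchanged, scales every off-diagonal entry by `1 - shrinkage`, and pulls the variances towards their average, which is what makes the result better conditioned than the raw sample estimate. Data-driven estimators in the Ledoit-Wolf and OAS family differ mainly in how they choose the coefficient rather than in the blend itself.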
33 changes: 23 additions & 10 deletions river/covariance/emp.py
@@ -9,7 +9,7 @@
 from river import stats, utils

 if typing.TYPE_CHECKING:
-    import pandas as pd
+    from narwhals.stable.v2.typing import IntoDataFrame


 class SymmetricMatrix(abc.ABC):
@@ -33,6 +33,8 @@ def __getitem__(self, key):

     def __repr__(self):
         names = sorted({i for i, _ in self.matrix})
+        if not names:
+            return f"{type(self).__name__} (empty)"

         headers = [""] + list(map(str, names))
         columns = [headers[1:]]
@@ -177,25 +179,30 @@ def revert(self, x: dict, w: float = 1.0):
         for i in keys:
             cov_dict[i, i].revert(x[i], w)

-    def update_many(self, X: pd.DataFrame):
+    def update_many(self, X: IntoDataFrame):
         """Update with a dataframe of samples.

+        Any [narwhals](https://github.com/narwhals-dev/narwhals)-compatible eager dataframe
+        (pandas, polars, pyarrow, ...) is accepted.
+
         Parameters
         ----------
         X
             A dataframe of samples.

         """

-        X_arr = X.values
+        frame = utils.dataframe.into_frame(X)
+        columns = list(frame.columns)
+        X_arr = utils.dataframe.to_numpy(frame)
         mean_arr = X_arr.mean(axis=0)
         cov_arr = np.cov(X_arr.T, ddof=self.ddof)

-        n = len(X)
-        mean = dict(zip(X.columns, mean_arr))
+        n = len(frame)
+        mean = dict(zip(columns, mean_arr))
         cov = {
             (i, j): cov_arr[r, c]
-            for (r, i), (c, j) in itertools.combinations_with_replacement(enumerate(X.columns), r=2)
+            for (r, i), (c, j) in itertools.combinations_with_replacement(enumerate(columns), r=2)
         }

         self._update_from_state(n=n, mean=mean, cov=cov)
@@ -215,6 +222,7 @@ def _update_from_state(self, n: int, mean: dict, cov: float | dict):
         Raises
         ----------
         KeyError: If an element in `mean` or `cov` is missing.
+
         """
         for i, j in itertools.combinations(sorted(mean.keys()), r=2):
             try:
@@ -264,6 +272,7 @@ def _from_state(cls, n: int, mean: dict, cov: float | dict, *, ddof=1):
         Returns
         ----------
         cls: A new instance of the class with updated covariance matrix.
+
         """
         new = cls(ddof=ddof)
         new._update_from_state(n=n, mean=mean, cov=cov)
@@ -405,25 +414,29 @@ def update(self, x):
         self._w_arr[ids] = w
         self._inv_cov_mat[ix] = 0.5 * (block + block.T)

-    def update_many(self, X: pd.DataFrame):
+    def update_many(self, X: IntoDataFrame):
         """Update with a dataframe of samples.

+        Any [narwhals](https://github.com/narwhals-dev/narwhals)-compatible eager dataframe
+        (pandas, polars, pyarrow, ...) is accepted.
+
         Parameters
         ----------
         X
             A dataframe of samples.

         """
-        ids = self._ensure_features(X.columns)
-        X_arr = np.asarray(X.values, dtype=np.float64)
+        frame = utils.dataframe.into_frame(X)
+        ids = self._ensure_features(frame.columns)
+        X_arr = utils.dataframe.to_numpy(frame)

         loc = self._loc_arr[ids].copy()
         w = self._w_arr[ids].copy()
         ix = np.ix_(ids, ids)
         inv_cov = np.asfortranarray(self._inv_cov_mat[ix]) / np.maximum(w, 1)

         # update formulas
-        n_batch = len(X)
+        n_batch = len(frame)
         diff = X_arr - loc
         loc = (w * loc + n_batch * X_arr.mean(axis=0)) / (w + n_batch)
         w += n_batch
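
Reviewer note: both `update_many` methods in this file summarise a mini-batch (its size, mean, and covariance) and fold that summary into the running state. The sketch below shows the standard pairwise merge of two such summaries for a single pair of variables in plain Python. It illustrates the general technique only. The names `CovState`, `batch_state`, and `merge` are hypothetical, and nothing here is claimed to match what `_update_from_state` does internally.

```python
from dataclasses import dataclass


@dataclass
class CovState:
    """Hypothetical running summary for one pair of variables."""

    n: int = 0
    mean_x: float = 0.0
    mean_y: float = 0.0
    comoment: float = 0.0  # sum of (x - mean_x) * (y - mean_y)

    def cov(self, ddof: int = 1) -> float:
        return self.comoment / (self.n - ddof)


def batch_state(xs: list, ys: list) -> CovState:
    """Summarise one mini-batch directly from its samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    comoment = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return CovState(n, mean_x, mean_y, comoment)


def merge(a: CovState, b: CovState) -> CovState:
    """Combine two summaries without revisiting the raw samples."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    dx = b.mean_x - a.mean_x
    dy = b.mean_y - a.mean_y
    return CovState(
        n=n,
        mean_x=a.mean_x + dx * b.n / n,
        mean_y=a.mean_y + dy * b.n / n,
        # The correction term accounts for the shift between the two means.
        comoment=a.comoment + b.comoment + dx * dy * a.n * b.n / n,
    )


xs = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]
ys = [2.0, 1.0, 5.0, 3.0, 8.0, 13.0]

running = CovState()
for start in range(0, len(xs), 2):  # three mini-batches of two samples
    running = merge(running, batch_state(xs[start:start + 2], ys[start:start + 2]))

full = batch_state(xs, ys)  # single pass over all samples, for comparison
```

Merging the three mini-batches gives the same count, means, and covariance as the single pass, up to floating-point rounding. The mean update in `merge` has the same weighted-average form as the `loc` update in the `EmpiricalPrecision.update_many` hunk.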