feat(preprocessing,compose): dataframe-agnostic StandardScaler & compose mini-batching via narwhals by MaxHalford · Pull Request #1932 · online-ml/river

MaxHalford · 2026-06-26T14:24:58Z

Motivation

Part of #1919 (migrate all _many methods to narwhals). This makes preprocessing.StandardScaler and the compose composition primitives dataframe-agnostic, so a whole pipeline can be mini-batched on any narwhals-supported eager backend (pandas, polars, pyarrow, nullable/arrow-backed pandas, ...). The numpy compute cores are untouched and the input backend (including the pandas index) is rebuilt on output.

Changes

preprocessing/scale.py — StandardScaler

learn_many wraps the input via into_frame and drops to a float64 numpy matrix with to_numpy; the windowed branch iterates rows via iter_rows.
transform_many keeps a verbatim classic-pandas fast path (in-place np.divide(out=), no-copy frame, float dtype preservation) and adds an agnostic float64 path for every other backend. Pandas output is byte-for-byte unchanged.
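
To make the two transform_many paths above concrete, here is a minimal NumPy-only sketch. It is not river's code: `scale_fast` and `scale_agnostic` are hypothetical helpers, and leaving zero-variance columns centered but undivided is an assumption made for the illustration.

```python
import numpy as np


def scale_fast(X, means, stds):
    """Hypothetical fast path: standardize X in place, keeping its float dtype."""
    np.subtract(X, means, out=X)
    # Only divide where the standard deviation is positive (assumed behaviour);
    # other columns keep their centered values.
    np.divide(X, stds, out=X, where=stds > 0)
    return X


def scale_agnostic(X, means, stds):
    """Hypothetical agnostic path: always compute on a fresh float64 matrix."""
    out = np.asarray(X, dtype=np.float64) - means
    np.divide(out, stds, out=out, where=stds > 0)
    return out


means = np.array([2.0, 10.0])
stds = np.array([1.0, 0.0])  # second column is constant
X32 = np.array([[1.0, 10.0], [3.0, 10.0]], dtype=np.float32)

fast = scale_fast(X32, means, stds)  # same buffer, still float32
agnostic = scale_agnostic(
    np.array([[1.0, 10.0], [3.0, 10.0]], dtype=np.float32), means, stds
)  # new float64 array
```

The in-place variant avoids allocating a second matrix and keeps a float32 input as float32, which is the property the fast path is described as preserving; the agnostic variant trades that for a single predictable output dtype.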

compose

Pipeline: type hints only — the orchestration was already backend-agnostic.
TransformerUnion: pd.concat(axis=1) → nw.concat(how="horizontal").
Select: X.loc[...].copy() → narwhals select (still pure).
TransformerProduct: keeps the pandas Sparse[uint8] fast path, adds an agnostic elementwise-product path for other backends.
FuncTransformer: unchanged (already passes the native frame through).

Tests

New cross-backend tests via the frame_backend fixture, with pandas as the oracle:

value/stat parity for learn_many & transform_many (incl. with_std and window_size variants)
mixed-dtype real dataset (TrumpApproval), chunked learning
emerging / disappearing / reordered features between mini-batches
native-backend round-trip, float32 preservation on the pandas fast path

All scale + compose tests pass, 481 estimator-checks pass, mypy clean.
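
As an illustration of what the emerging / disappearing / reordered feature tests exercise, here is a dependency-free sketch that keys running statistics by feature name and merges each mini-batch with the standard pairwise mean/variance update. The `learn_many` function and the `(n, mean, m2)` state layout are assumptions for this example, not river's implementation, and each batch column is assumed to be non-empty.

```python
def batch_stats(values):
    """Count, mean and sum of squared deviations of one batch column."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return n, mean, m2


def learn_many(state, batch):
    """Merge a mini-batch into per-feature running stats.

    state: dict mapping feature name -> (n, mean, m2)
    batch: dict mapping feature name -> list of floats
    Features are matched by name, so column order and missing or new
    columns do not affect the result.
    """
    for name, values in batch.items():
        n_b, mean_b, m2_b = batch_stats(values)
        n_a, mean_a, m2_a = state.get(name, (0, 0.0, 0.0))
        n = n_a + n_b
        delta = mean_b - mean_a
        state[name] = (
            n,
            mean_a + delta * n_b / n,
            m2_a + m2_b + delta**2 * n_a * n_b / n,
        )
    return state


state = {}
learn_many(state, {"a": [1.0, 3.0], "b": [10.0, 20.0]})
# "a" disappears, "c" emerges, and "b" arrives in a different position.
learn_many(state, {"c": [5.0, 7.0], "b": [30.0]})
```

Dividing `m2` by `n` gives the population variance for a feature; a feature absent from a batch simply keeps its previous state.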

Performance

Pandas (dominant case) has zero regression: transform_many's pandas path is the old code verbatim, and learn_many matches the old baseline at realistic sizes (10.96 vs 11.12 ms at 50k×100). The only added cost is the constant ~60µs narwhals.from_native boundary, identical to the other narwhalified methods.

🤖 Generated with Claude Code

…ing dataframe-agnostic via narwhals

Route StandardScaler's and the composition primitives' mini-batch methods through the same narwhals boundary as the GLM/OneHotEncoder paths, so a whole pipeline can be mini-batched on any narwhals-supported eager backend (pandas, polars, pyarrow, nullable/arrow-backed pandas, ...). The numpy compute cores are untouched and the input backend (including the pandas index) is rebuilt on output.

preprocessing/scale.py (StandardScaler):
- learn_many wraps via into_frame and drops to a float64 numpy matrix; the windowed branch iterates rows backend-agnostically.
- transform_many keeps a verbatim classic-pandas fast path (in-place divide, no-copy frame, float-dtype preservation) and adds an agnostic float64 path for every other backend. Pandas output is byte-for-byte unchanged.

compose:
- Pipeline: type hints only; the orchestration was already backend-agnostic.
- TransformerUnion: pd.concat(axis=1) -> nw.concat(how="horizontal").
- Select: X.loc[...].copy() -> narwhals select (still pure).
- TransformerProduct: keeps the pandas Sparse[uint8] fast path, adds an agnostic elementwise-product path for other backends.

Adds cross-backend tests (mixed-dtype TrumpApproval, chunked learning, emerging/disappearing/reordered features, native-backend round-trip) using the frame_backend fixture. Pandas remains the oracle.

Refs #1919.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

MaxHalford · 2026-06-26T14:28:03Z

@FBruzzesi does it look ok to you?

…ls StandardScaler

`StandardScaler.transform_many` no longer imports pandas unconditionally: it only needs pandas on the classic-pandas fast path. The old test passed `object()`, which now fails at the narwhals boundary and isn't a valid `IntoDataFrameT`. Split it into two: a pandas input still raises ImportError when pandas is missing, while a polars input goes through the agnostic path and works without pandas.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

FBruzzesi · 2026-06-26T14:35:21Z

Hey Max, I was about to push the changes for StandardScaler, along with some general improvements for creating a DataFrame from a 2D numpy array without having to slice it. The preprocessing module looks great!

If you are not in a rush, I can do a more in-depth review later.

MaxHalford · 2026-06-26T14:47:09Z

Sure, take your time! And feel free to push on top of this PR 👍

I'm done for today hehe

FBruzzesi

Went through it a bit more thoroughly: nothing to add for the compose module, most ops are straightforward. For the other comment on the StandardScaler, I am proposing two changes in #1935.

…y, simplify `StandardScaler` (#1935)
* Allow to create dataframe directly from 2d numpy array
* fixup square array issue
* simplify StandardScaler

MaxHalford requested a review from smastelini as a code owner June 26, 2026 14:24

MaxHalford mentioned this pull request Jun 26, 2026

Migrate all mini-batch (_many) methods to narwhals for dataframe-agnostic support #1919 (Open, 17 tasks)

FBruzzesi reviewed Jun 26, 2026

perf(utils, preprocessing): Build native frames directly from 2D nump…

bc7104b

…y, simplify `StandardScaler` (#1935) * Allow to create dataframe directly from 2d numpy array * fixup square array issue * simplify StandardScaler

MaxHalford wants to merge 3 commits into main from feat/standardscaler-compose-narwhals
