feat(preprocessing,compose): dataframe-agnostic StandardScaler & compose mini-batching via narwhals #1932

Open
MaxHalford wants to merge 3 commits into main from feat/standardscaler-compose-narwhals

@MaxHalford

Member

Motivation

Part of #1919 (migrate all _many methods to narwhals). This makes preprocessing.StandardScaler and the compose composition primitives dataframe-agnostic, so a whole pipeline can be mini-batched on any narwhals-supported eager backend (pandas, polars, pyarrow, nullable/arrow-backed pandas, ...). The numpy compute cores are untouched and the input backend (including the pandas index) is rebuilt on output.

Changes

preprocessing/scale.py: StandardScaler

  • learn_many wraps the input via into_frame and drops to a float64 numpy matrix with to_numpy; the windowed branch iterates rows via iter_rows.
  • transform_many keeps a verbatim classic-pandas fast path (in-place np.divide(out=), no-copy frame, float dtype preservation) and adds an agnostic float64 path for every other backend. Pandas output is byte-for-byte unchanged.

compose

  • Pipeline: type hints only — the orchestration was already backend-agnostic.
  • TransformerUnion: pd.concat(axis=1) → nw.concat(how="horizontal").
  • Select: X.loc[...].copy() → narwhals select (still pure).
  • TransformerProduct: keeps the pandas Sparse[uint8] fast path, adds an agnostic elementwise-product path for other backends.
  • FuncTransformer: unchanged (already passes the native frame through).

Tests

New cross-backend tests via the frame_backend fixture, with pandas as the oracle:

  • value/stat parity for learn_many & transform_many (incl. with_std and window_size variants)
  • mixed-dtype real dataset (TrumpApproval), chunked learning
  • emerging / disappearing / reordered features between mini-batches
  • native-backend round-trip, float32 preservation on the pandas fast path

All scale + compose tests pass, 481 estimator-checks pass, mypy clean.

Performance

Pandas (dominant case) has zero regression: transform_many's pandas path is the old code verbatim, and learn_many matches the old baseline at realistic sizes (10.96 vs 11.12 ms at 50k×100). The only added cost is the constant ~60µs narwhals.from_native boundary, identical to the other narwhalified methods.

🤖 Generated with Claude Code

…ing dataframe-agnostic via narwhals

Route StandardScaler's and the composition primitives' mini-batch methods through
the same narwhals boundary as the GLM/OneHotEncoder paths, so a whole pipeline can
be mini-batched on any narwhals-supported eager backend (pandas, polars, pyarrow,
nullable/arrow-backed pandas, ...). The numpy compute cores are untouched and the
input backend (including the pandas index) is rebuilt on output.

preprocessing/scale.py — StandardScaler:
- learn_many wraps via into_frame and drops to a float64 numpy matrix; the windowed
  branch iterates rows backend-agnostically.
- transform_many keeps a verbatim classic-pandas fast path (in-place divide, no-copy
  frame, float-dtype preservation) and adds an agnostic float64 path for every other
  backend. Pandas output is byte-for-byte unchanged.

compose:
- Pipeline: type hints only — the orchestration was already backend-agnostic.
- TransformerUnion: pd.concat(axis=1) -> nw.concat(how="horizontal").
- Select: X.loc[...].copy() -> narwhals select (still pure).
- TransformerProduct: keeps the pandas Sparse[uint8] fast path, adds an agnostic
  elementwise-product path for other backends.

Adds cross-backend tests (mixed-dtype TrumpApproval, chunked learning, emerging/
disappearing/reordered features, native-backend round-trip) using the frame_backend
fixture. Pandas remains the oracle.

Refs #1919.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@MaxHalford

Member Author

@FBruzzesi does it look ok to you?

…ls StandardScaler

`StandardScaler.transform_many` no longer imports pandas unconditionally: it only
needs pandas on the classic-pandas fast path. The old test passed `object()` (which
now fails at the narwhals boundary, and isn't a valid `IntoDataFrameT`). Split it
into two: a pandas input still raises ImportError when pandas is missing, while a
polars input goes through the agnostic path and works without pandas.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@FBruzzesi

Contributor

Hey Max, I was about to push the changes for StandardScaler, along with some general improvements for creating a DataFrame from a 2D numpy array without having to slice it. The preprocessing module seems great!

If you are not in a rush, I can do a more in-depth review later.

@MaxHalford

Member Author

Sure, take your time! And feel free to push on top of this PR 👍

I'm done for today hehe

@FBruzzesi FBruzzesi left a comment

Contributor

Went through it a bit more thoroughly: nothing to add for the compose module, most ops are straightforward. For the other comment on the StandardScaler, I am proposing two changes in #1935.

…y, simplify `StandardScaler` (#1935)

* Allow to create dataframe directly from 2d numpy array

* fixup square array issue

* simplify StandardScaler
2 participants