feat(preprocessing,compose): dataframe-agnostic StandardScaler & compose mini-batching via narwhals #1932
Open
MaxHalford wants to merge 3 commits into
…ing dataframe-agnostic via narwhals

Route StandardScaler's and the composition primitives' mini-batch methods through the same narwhals boundary as the GLM/OneHotEncoder paths, so a whole pipeline can be mini-batched on any narwhals-supported eager backend (pandas, polars, pyarrow, nullable/arrow-backed pandas, ...). The numpy compute cores are untouched and the input backend (including the pandas index) is rebuilt on output.

preprocessing/scale.py (StandardScaler):
- learn_many wraps via into_frame and drops to a float64 numpy matrix; the windowed branch iterates rows backend-agnostically.
- transform_many keeps a verbatim classic-pandas fast path (in-place divide, no-copy frame, float-dtype preservation) and adds an agnostic float64 path for every other backend. Pandas output is byte-for-byte unchanged.

compose:
- Pipeline: type hints only; the orchestration was already backend-agnostic.
- TransformerUnion: pd.concat(axis=1) -> nw.concat(how="horizontal").
- Select: X.loc[...].copy() -> narwhals select (still pure).
- TransformerProduct: keeps the pandas Sparse[uint8] fast path, adds an agnostic elementwise-product path for other backends.

Adds cross-backend tests (mixed-dtype TrumpApproval, chunked learning, emerging/disappearing/reordered features, native-backend round-trip) using the frame_backend fixture. Pandas remains the oracle.

Refs #1919.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
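The two `transform_many` paths described in this commit can be sketched with plain numpy. This is a minimal illustration under stated assumptions, not the code in `preprocessing/scale.py`: the function names `scale_in_place` and `scale_to_float64` are hypothetical, and centering by the mean before dividing is assumed from how standard scaling usually works.

```python
import numpy as np


def scale_in_place(values: np.ndarray, means: np.ndarray, stds: np.ndarray) -> np.ndarray:
    """Hypothetical fast path: reuse the input buffer and keep its float dtype."""
    # `out=values` writes the result back into the same array, so no copy is made
    # and a float32 input stays float32.
    np.subtract(values, means, out=values)
    np.divide(values, stds, out=values)
    return values


def scale_to_float64(values, means: np.ndarray, stds: np.ndarray) -> np.ndarray:
    """Hypothetical agnostic path: always compute on a float64 matrix."""
    matrix = np.asarray(values, dtype=np.float64)
    # The arithmetic below allocates a new array; the input is left untouched.
    return (matrix - means) / stds


# Illustrative statistics; in a real scaler these would come from learn_many.
means = np.array([1.0, 10.0])
stds = np.array([2.0, 5.0])

batch32 = np.array([[3.0, 20.0], [5.0, 0.0]], dtype=np.float32)
same_buffer = scale_in_place(batch32, means, stds)

batch_int = [[3, 20], [5, 0]]
fresh = scale_to_float64(batch_int, means, stds)
```

The in-place variant trades purity for speed and dtype preservation, which is presumably why the commit restricts it to the classic-pandas case and uses a float64 path everywhere else.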
17 tasks
Member
Author
@FBruzzesi does it look ok to you?
…ls StandardScaler

`StandardScaler.transform_many` no longer imports pandas unconditionally: it only needs pandas on the classic-pandas fast path. The old test passed `object()` (which now fails at the narwhals boundary, and isn't a valid `IntoDataFrameT`). Split it into two: a pandas input still raises ImportError when pandas is missing, while a polars input goes through the agnostic path and works without pandas.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
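One way to avoid an unconditional pandas import is to decide on the fast path without importing pandas at all. The sketch below is an assumption about how such a check could look, not what the commit does: `is_pandas_frame` and `pick_path` are hypothetical helpers, and the check ignores the distinction between classic and nullable/arrow-backed pandas dtypes.

```python
import sys


def is_pandas_frame(frame) -> bool:
    """Return True only if pandas is already imported and `frame` is a pandas DataFrame.

    A pandas DataFrame cannot exist unless pandas was imported somewhere,
    so looking in sys.modules is enough and never triggers an import.
    """
    pd = sys.modules.get("pandas")
    return pd is not None and isinstance(frame, pd.DataFrame)


def pick_path(frame) -> str:
    """Hypothetical dispatcher: name the branch a scaler would take for `frame`."""
    return "pandas-fast-path" if is_pandas_frame(frame) else "agnostic-path"
```

With this shape, a non-pandas input such as a polars frame never touches pandas, which matches the behaviour the second test in the commit checks for.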
Contributor
Hey Max, I was about to push the changes for If you are not in a rush I can make a more in-depth review later
Member
Author
Sure, take your time! And feel free to push on top of this PR 👍 I'm done for today hehe
…y, simplify `StandardScaler` (#1935)

* Allow to create dataframe directly from 2d numpy array
* fixup square array issue
* simplify StandardScaler
Motivation
Part of #1919 (migrate all `_many` methods to narwhals). This makes `preprocessing.StandardScaler` and the `compose` composition primitives dataframe-agnostic, so a whole pipeline can be mini-batched on any narwhals-supported eager backend (pandas, polars, pyarrow, nullable/arrow-backed pandas, ...). The numpy compute cores are untouched and the input backend (including the pandas index) is rebuilt on output.

Changes

`preprocessing/scale.py`: `StandardScaler`
- `learn_many` wraps the input via `into_frame` and drops to a float64 numpy matrix with `to_numpy`; the windowed branch iterates rows via `iter_rows`.
- `transform_many` keeps a verbatim classic-pandas fast path (in-place `np.divide(out=)`, no-copy frame, float dtype preservation) and adds an agnostic float64 path for every other backend. Pandas output is byte-for-byte unchanged.

`compose`
- `Pipeline`: type hints only; the orchestration was already backend-agnostic.
- `TransformerUnion`: `pd.concat(axis=1)` → `nw.concat(how="horizontal")`.
- `Select`: `X.loc[...].copy()` → narwhals `select` (still pure).
- `TransformerProduct`: keeps the pandas `Sparse[uint8]` fast path, adds an agnostic elementwise-product path for other backends.
- `FuncTransformer`: unchanged (already passes the native frame through).

Tests

New cross-backend tests via the `frame_backend` fixture, with pandas as the oracle:
- `learn_many` & `transform_many` (incl. `with_std` and `window_size` variants)

All scale + compose tests pass, 481 estimator-checks pass, mypy clean.
Performance
Pandas (dominant case) has zero regression: `transform_many`'s pandas path is the old code verbatim, and `learn_many` matches the old baseline at realistic sizes (10.96 vs 11.12 ms at 50k×100). The only added cost is the constant ~60 µs `narwhals.from_native` boundary, identical to the other narwhalified methods.

🤖 Generated with Claude Code