PERF: short-circuit array_equivalent and equals comparisons #65192

Merged: rhshadrach merged 4 commits into pandas-dev:main from jbrockmendel:perf-32339 on May 9, 2026

@jbrockmendel
Member

Summary

  • Add short-circuiting Cython functions for array equality checks in pandas/_libs/lib.pyx:
    • array_equivalent_float: single-pass NaN-aware float comparison, replacing the 4-temporary-array expression ((left == right) | (isnan(left) & isnan(right))).all()
    • array_equivalent_bytes: memcmp-based comparison for int/bool/datetime arrays, replacing np.array_equal
    • has_nans/all_nans: short-circuiting replacements for np.isnan(arr).any()/.all()
  • Wire these up in array_equivalent, _array_equivalent_float, _array_equivalent_datetimelike, and the equals methods on DatetimeLikeIndex, MultiIndex, MaskedArray, and Categorical
  • For non-contiguous inputs, fall back to the original numpy expressions with no regression
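The single-pass idea behind array_equivalent_float can be sketched in pure Python, assuming plain sequences of floats rather than NumPy buffers. The name `array_equivalent_float_sketch` is hypothetical, and the Cython code in pandas/_libs/lib.pyx may differ in detail; this only illustrates the semantics of the replaced expression with an early exit.

```python
import math
from typing import Sequence


def array_equivalent_float_sketch(left: Sequence[float], right: Sequence[float]) -> bool:
    """Hypothetical illustration: NaN-aware equality with early exit.

    Mirrors the semantics of
    ((left == right) | (isnan(left) & isnan(right))).all()
    without building any temporary arrays.
    """
    if len(left) != len(right):
        return False
    for a, b in zip(left, right):
        # Elements match if they compare equal, or if both are NaN.
        if a != b and not (math.isnan(a) and math.isnan(b)):
            return False  # stop at the first mismatch
    return True


nan = float("nan")
print(array_equivalent_float_sketch([1.0, nan, -0.0], [1.0, nan, 0.0]))  # True
print(array_equivalent_float_sketch([1.0, nan], [1.0, 2.0]))  # False
```

Comparing by value rather than by bytes keeps -0.0 and +0.0 equal and lets NaNs in matching positions count as equal.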

array_equivalent with dtype_equal=True, 10^6 elements:

| dtype | equal | early mismatch |
|---|---|---|
| float64 | 2.0x | 800x |
| int64 | 1.1x | 380x |
| datetime64 | 1.1x | 235x |
| bool | 1.8x | 60x |

DataFrame.equals on 1000x1000:

| dtype | equal | early mismatch |
|---|---|---|
| float64 | 298 us | 4 us |
| int64 | 165 us | 4 us |
closes #32339

Test plan

  • All pre-existing equals tests pass (frame, series, index, dtypes, arrays, internals)
  • Tested correctness for all dtypes: float32/64, complex64/128, int, bool, datetime, timedelta
  • Tested edge cases: empty arrays, NaN-heavy arrays, different shapes, non-contiguous, F-contiguous, strided slices
  • Verified no regression for adversarial inputs (non-contiguous falls back to numpy, F-contiguous 2D uses transpose trick)
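The contiguity fallback in the test plan can be illustrated with the standard library alone. This is a sketch under stated assumptions: `buffers_equal_sketch` is a hypothetical name, it uses `memoryview` in place of NumPy arrays, and it is only meaningful for integer-like formats where equal values share one bit pattern.

```python
from array import array


def buffers_equal_sketch(left: memoryview, right: memoryview) -> bool:
    """Hypothetical dispatch: raw-byte fast path only for C-contiguous input."""
    if left.format != right.format or left.shape != right.shape:
        return False
    if left.c_contiguous and right.c_contiguous:
        # Fast path: compare the underlying bytes directly (memcmp analogue).
        return left.cast("B") == right.cast("B")
    # Fallback: element-wise comparison, which also handles strided views.
    return left.tolist() == right.tolist()


data = memoryview(array("q", range(10)))
same = memoryview(array("q", range(10)))
evens = memoryview(array("q", [0, 2, 4, 6, 8]))

print(buffers_equal_sketch(data, same))        # True, contiguous fast path
print(data[::2].c_contiguous)                  # False, strided view
print(buffers_equal_sketch(data[::2], evens))  # True, via the fallback
```

A strided slice such as `data[::2]` has gaps between its elements in memory, so a flat byte comparison would read the skipped values too; the fallback avoids that.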

🤖 Generated with Claude Code

@jbrockmendel jbrockmendel marked this pull request as draft April 12, 2026 19:35
@jbrockmendel jbrockmendel added the Performance (Memory or execution speed performance) label Apr 12, 2026
jbrockmendel and others added 3 commits April 16, 2026 13:48
Add short-circuiting Cython functions for array equality checks:

- array_equivalent_float: single-pass NaN-aware float comparison,
  replacing the 4-temporary-array expression
  ((left == right) | (isnan(left) & isnan(right))).all()
- array_equivalent_bytes: memcmp-based comparison for int/bool/datetime
  arrays, replacing np.array_equal
- has_nans/all_nans: short-circuiting replacements for
  np.isnan(arr).any()/all()

For non-contiguous inputs, falls back to the original numpy expressions
with no regression.

closes pandas-dev#32339

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jbrockmendel jbrockmendel marked this pull request as ready for review April 19, 2026 02:20
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Member

@rhshadrach rhshadrach left a comment

lgtm, just a question

Comment thread pandas/_libs/lib.pyx
Comment on lines +586 to +588
C-contiguous inputs. Not safe for dtypes where distinct bit patterns can
represent the same value (e.g. floats with -0.0/+0.0 or NaN) or for arrays
that contain object pointers.
Member

Is it worth adding asserts for the common unsafe ones (float/complex/object I think)?

Member Author

I don't think so; the caller is responsible for gating
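For readers following this thread, the hazard named in the docstring can be reproduced with the standard library. The sketch below shows values that are equal but have different bytes, and the reverse; the gate at the end is a hypothetical illustration of caller-side gating using NumPy-style dtype kind codes, not the check pandas actually performs.

```python
import struct


def same_bytes(a: float, b: float) -> bool:
    """Compare the IEEE 754 binary64 encodings of two floats."""
    return struct.pack("<d", a) == struct.pack("<d", b)


# +0.0 and -0.0 are equal by value but differ in the sign bit.
print(0.0 == -0.0, same_bytes(0.0, -0.0))  # True False

# A NaN never equals itself by value, yet its bytes match themselves.
nan = float("nan")
print(nan == nan, same_bytes(nan, nan))  # False True

# Hypothetical caller-side gate (an assumption for illustration): only take
# the bytes path for kinds named in the PR summary (int, bool, datetime).
BYTES_SAFE_KINDS = frozenset("ibM")


def can_use_bytes_path(kind: str) -> bool:
    return kind in BYTES_SAFE_KINDS


print(can_use_bytes_path("i"), can_use_bytes_path("f"))  # True False
```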

@rhshadrach rhshadrach added this to the 3.1 milestone May 9, 2026
@rhshadrach rhshadrach merged commit 400aed1 into pandas-dev:main May 9, 2026
45 checks passed
@rhshadrach
Member

Thanks @jbrockmendel

@jbrockmendel jbrockmendel deleted the perf-32339 branch May 9, 2026 15:49
Sharl0tteIsTaken added a commit to Sharl0tteIsTaken/pandas that referenced this pull request May 10, 2026
…-comparison

* upstream/main: (78 commits)
  PERF: short-circuit array_equivalent and equals comparisons (pandas-dev#65192)
  DOC: clarify GroupBy.resample parameters (pandas-devGH-54295) (pandas-dev#65545)
  TYP: type-check pandas.core.sample (pandas-dev#65533)
  DOC: drop dead redirects, generate ExtensionDtype member pages (pandas-dev#65527)
  CLN: drop stale pre-commit exclude for deleted code_style.rst (pandas-dev#65531)
  TYP: enable mypy checking for pandas.core.window.online (pandas-dev#65534)
  TYP: enable mypy on core.internals.api (pandas-dev#65535)
  TYP: enable mypy checks on pandas.core.dtypes.generic (pandas-dev#65537)
  DOC: document Rolling.apply Series index when on= is specified (pandas-dev#65539)
  TYP: enable mypy disallow_untyped_defs for pandas.core.roperator (pandas-dev#65540)
  TYP: enable mypy checking for pandas.core.sorting (pandas-dev#65541)
  DOC: clarify result_type has no effect for ufuncs in DataFrame.apply (pandas-dev#65542)
  TYP: enable mypy on pandas.compat.numpy.function (pandas-dev#65543)
  DOC: clarify read_fwf filepath_or_buffer description (pandas-devGH-55790) (pandas-dev#65544)
  DOC: fix relativedelta link in DateOffset See Also (pandas-dev#65546)
  TYP: annotate pandas/io/excel/_pyxlsb and drop from mypy overrides (pandas-dev#65547)
  TYP: annotate pandas/_config/config.py (pandas-dev#65549)
  PERF: skip to_datetime cache=True overhead for no-help input shapes (GH#65380) (pandas-dev#65409)
  BUG:to_datetime with origin (pandas-dev#63915)
  BUG: Series.transform now raises SpecificationError for duplicate function names (GH#54929) (pandas-dev#65156)
  ...

Development

Successfully merging this pull request may close these issues.

PERF: short-circuit (left == right).all() comparisons
