fix(impala/pyspark): use regexp_replace to strip#11989
fix(impala/pyspark): use regexp_replace to strip#11989
regexp_replace to strip#11989Conversation
trim doesn't remove _f_strim does not remove f's
07b2fb4 to
b78c203
Compare
trim does not remove f's0ae77dc to
b5ad8f6
Compare
regexp_replace to strip
There was a problem hiding this comment.
Pull request overview
Updates Impala and PySpark string lstrip/rstrip/strip compilation to use regexp_replace for correct whitespace handling (notably form feed), and adjusts tests/snapshots accordingly.
Changes:
- Implement
LStrip/RStrip/Stripin the PySpark and Impala SQL compilers viaregexp_replacepatterns using\\s. - Extend backend string method tests with a form-feed case and remove PySpark/Databricks xfails for
lstrip/rstrip. - Refresh Impala SQL snapshots to match the new
REGEXP_REPLACEoutput; includes a few small Polars backend cleanups.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| ibis/backends/tests/test_string.py | Adds a \f test value and updates expectations; removes PySpark/Databricks lstrip/rstrip xfails. |
| ibis/backends/sql/compilers/pyspark.py | Implements visit_LStrip/visit_RStrip/visit_Strip using regexp_replace. |
| ibis/backends/sql/compilers/impala.py | Switches strip variants to regexp_replace to avoid incorrect trimming behavior. |
| ibis/backends/polars/init.py | Minor cleanup (typo fix, exception type, type annotation tweaks). |
| ibis/backends/impala/tests/snapshots/test_string_builtins/test_string_builtins/lstrip/out.sql | Updates snapshot to REGEXP_REPLACE(..., '^\\s+', ''). |
| ibis/backends/impala/tests/snapshots/test_string_builtins/test_string_builtins/rstrip/out.sql | Updates snapshot to REGEXP_REPLACE(..., '\\s+$', ''). |
| ibis/backends/impala/tests/snapshots/test_string_builtins/test_string_builtins/strip/out.sql | Updates snapshot to `REGEXP_REPLACE(..., '^\s+ |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
I agree that the expected behavior is that To be clear, my understanding is (please correct if I'm wrong): On main, the current behavior is:
After this change:
|
|
@NickCrews Thanks for the quick review, and sorry for the delay in responding! I was running some tests with Spark 3.5 and Spark 4.1 to answer your questions last night, but I ended up coming to the conclusion that this solution isn't correct.
Actually, Spark SQL LTRIM (and RTRIM) does accept characters to trim, even on 3.5. The problem is the use of the I'll confirm this across 3.5 (via local Spark Connect) and 4.1 (on Databricks).
(Haven't revisited Impala yet.)
Hoping to get this revised over the weekend! |
Description of changes
Use
regexp_replaceinstead oftrimmethods to implementstripmethods, due to issues handling the form feed character.This is almost certainly slower. I alternatively considered dropping
\ffrom the set of characters to trim; however, data can include form feeds (e.g. from legacy systems), so correctness seems preferable.As an added bonus, this fixes two xfails on the PySpark side.
Issues closed