Parquet IO: also use zoneinfo timezones by default even when pyarrow uses pytz#65134
Parquet IO: also use zoneinfo timezones by default even when pyarrow uses pytz#65134jorisvandenbossche wants to merge 9 commits intopandas-dev:mainfrom
Conversation
337f918 to
b22fbc8
Compare
| if any( | ||
| isinstance(dtype, pd.DatetimeTZDtype) | ||
| for dtype in df._mgr.get_unique_dtypes() | ||
| ): | ||
| col_indices = df._select_dtypes_indices(pd.DatetimeTZDtype) | ||
| for i in col_indices: |
There was a problem hiding this comment.
Also here, my feeling is that we should have existing helpers that make this easier to do (i.e. to avoid to iterate over every single column's dtype).
But I couldn't directly find anything, so I added this _select_dtypes_indices equivalent of select_dtypes but just giving you the indices instead of the materialized subset dataframe.
The any check with a call to mgr.get_unique_dtypes is maybe less necessary (because _select_dtypes_indices also already works per block), or could be moved inside _select_dtypes_indices
There was a problem hiding this comment.
We have a handful of places in e.g. DataFrame.select_dtypes that does blk_dtypes = [blk.dtype for blk in self._mgr.blocks]. Definitely makes sense to have a helper for this. I'd be OK with the helper returning the usually-but-not-always-unique list, fine either way.
| offset = tz.utcoffset(None) | ||
| if offset is not None: | ||
| return dt.timezone(offset) | ||
| except Exception: |
There was a problem hiding this comment.
I was repeating the same pattern from below which I wrote first for zones, but I suppose here there should never be an error (a pytz FixedOffset should always have an offset, which is returned from utcoffset() regardless of the value being passed). Will update
There was a problem hiding this comment.
Looking back: timezones.is_fixed_offset has some logic to detect if a timezone if a fixed offset, and so t does not only return true for FixedOffset, but also for some zones that have no transitions, like "Etc/GMT+1".
And I am not 100% sure that all those cases where timezones.is_fixed_offset returns true will work exactly the same. I mostly want to ensure this never raises an error (because that would introduce a new regression)
That said, such "fixed" zones should probably not be converted to a fixed offset with datetime.timezone, but to a zoneinfo object when possible. So will switch the order here and first try to convert to zoneinfo
|
Couple of comments, generally looks good |
We generally switched to
zoneinfotimezones by default in pandas 3.0 (#34916), however because of pyarrow still returning pytz if installed, essentiallyread_parquet(and other IO methods using pyarrow) still defaults topytztimezones.(unless you have an environment without pytz, but e.g. for people upgrading pandas in an existing env, you will always have pytz)
I think it would be nice to have a consistent behaviour of
read_parquetregardless of the availability ofpytz, and have it follow the general default in pandas.I also have a PR on the pyarrow side to stop defaulting to pytz timezones (apache/arrow#49694), but awaiting that change, we could "normalize" the timezone that pyarrow returned to give a consistent behaviour for our users (also regardless of the pyarrow version they would be using in the future).
(still have to clean-up and add tests)
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.