Parquet IO: also use zoneinfo timezones by default even when pyarrow uses pytz#65134

Open
jorisvandenbossche wants to merge 9 commits into pandas-dev:main from jorisvandenbossche:pyarrow-pytz-to-zoneinfo

Conversation

@jorisvandenbossche
Member

We generally switched to zoneinfo timezones by default in pandas 3.0 (#34916). However, because pyarrow still returns pytz timezones when pytz is installed, read_parquet (and other IO methods using pyarrow) essentially still defaults to pytz timezones.
(The exception is an environment without pytz, but e.g. people upgrading pandas in an existing env will always have pytz.)

I think it would be nice to have a consistent behaviour of read_parquet regardless of the availability of pytz, and have it follow the general default in pandas.
I also have a PR on the pyarrow side to stop defaulting to pytz timezones (apache/arrow#49694), but awaiting that change, we could "normalize" the timezone that pyarrow returned to give a consistent behaviour for our users (also regardless of the pyarrow version they would be using in the future).
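To make the "normalize" step concrete, a first ingredient is telling whether a returned timezone object comes from pytz at all. The following is only a minimal sketch and not the code in this PR: the helper name `is_pytz_timezone` is hypothetical, the module-name check is a heuristic, and `FakePytzZone` is a stand-in so the snippet runs without pytz installed.

```python
import datetime as dt


def is_pytz_timezone(tz: dt.tzinfo) -> bool:
    """Heuristic (assumption): does this tzinfo object come from the pytz package?"""
    # Looking at the defining module avoids importing pytz, which may be absent.
    return type(tz).__module__.split(".")[0] == "pytz"


# Stand-in for a pytz zone, used only so the example runs without pytz.
class FakePytzZone(dt.tzinfo):
    zone = "Europe/Brussels"


FakePytzZone.__module__ = "pytz.tzinfo"

print(is_pytz_timezone(FakePytzZone()))   # True
print(is_pytz_timezone(dt.timezone.utc))  # False
```

A check like this would leave zoneinfo and datetime.timezone objects untouched, so the result would not depend on which flavour pyarrow happened to return.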

(I still have to clean up and add tests.)


@jorisvandenbossche added this to the 3.0.3 milestone Apr 9, 2026
@jorisvandenbossche added the Timezones (Timezone data dtype), IO Parquet (parquet, feather), and Arrow (pyarrow functionality) labels Apr 9, 2026
@jorisvandenbossche force-pushed the pyarrow-pytz-to-zoneinfo branch from 337f918 to b22fbc8 April 9, 2026 14:10
Comment on lines +250 to +255

    if any(
        isinstance(dtype, pd.DatetimeTZDtype)
        for dtype in df._mgr.get_unique_dtypes()
    ):
        col_indices = df._select_dtypes_indices(pd.DatetimeTZDtype)
        for i in col_indices:

Member Author


Also here, my feeling is that we should have existing helpers that make this easier to do (i.e. to avoid iterating over every single column's dtype).
But I couldn't directly find anything, so I added _select_dtypes_indices, an equivalent of select_dtypes that just gives you the indices instead of the materialized subset dataframe.

The any check with a call to mgr.get_unique_dtypes is maybe less necessary (because _select_dtypes_indices also already works per block), or could be moved inside _select_dtypes_indices
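The idea of working per block rather than per column can be illustrated with a toy model. This is a sketch under stated assumptions, not pandas internals: `Block` and `select_dtype_indices` are hypothetical names, and dtypes are plain strings here.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Toy stand-in for a block: one dtype shared by several column positions."""
    dtype: str
    column_positions: list[int]


def select_dtype_indices(blocks: list[Block], wanted: str) -> list[int]:
    # Check each block's dtype once, then collect the positions of all
    # columns stored in the matching blocks.
    indices: list[int] = []
    for blk in blocks:
        if blk.dtype == wanted:
            indices.extend(blk.column_positions)
    return sorted(indices)


blocks = [
    Block("int64", [0, 3]),
    Block("datetime64[ns, UTC]", [1]),
    Block("float64", [2, 4]),
    Block("datetime64[ns, UTC]", [5]),
]
print(select_dtype_indices(blocks, "datetime64[ns, UTC]"))  # [1, 5]
```

With many columns consolidated into a few blocks, the number of dtype checks scales with the number of blocks, and an empty result already signals that nothing needs converting.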

Member


We have a handful of places, e.g. in DataFrame.select_dtypes, that do blk_dtypes = [blk.dtype for blk in self._mgr.blocks]. Definitely makes sense to have a helper for this. I'd be OK with the helper returning the usually-but-not-always-unique list, fine either way.

        offset = tz.utcoffset(None)
        if offset is not None:
            return dt.timezone(offset)
    except Exception:

Member


what can go wrong here?

Member Author


I was repeating the same pattern from below which I wrote first for zones, but I suppose here there should never be an error (a pytz FixedOffset should always have an offset, which is returned from utcoffset() regardless of the value being passed). Will update

Member Author


Looking back: timezones.is_fixed_offset has some logic to detect if a timezone is a fixed offset, and so it does not only return true for FixedOffset, but also for some zones that have no transitions, like "Etc/GMT+1".
And I am not 100% sure that all those cases where timezones.is_fixed_offset returns true will work exactly the same. I mostly want to ensure this never raises an error (because that would introduce a new regression)

That said, such "fixed" zones should probably not be converted to a fixed offset with datetime.timezone, but to a zoneinfo object when possible. So will switch the order here and first try to convert to zoneinfo
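The ordering described here (named zone first, fixed offset second, never raise) could look roughly like the sketch below. It is not the PR's implementation: `normalize_tz` is a hypothetical name, it assumes pytz-style objects expose their IANA name as a `.zone` attribute (None for pure fixed offsets), and the two `Fake*` classes are stand-ins so the snippet runs without pytz.

```python
import datetime as dt
from zoneinfo import ZoneInfo


def normalize_tz(tz: dt.tzinfo) -> dt.tzinfo:
    """Best-effort conversion to a zoneinfo or datetime.timezone object."""
    # Step 1: prefer a named zone, so that e.g. "Etc/GMT+1" stays a zone.
    name = getattr(tz, "zone", None)  # assumption: pytz-style attribute
    if isinstance(name, str):
        try:
            return ZoneInfo(name)
        except Exception:
            # Unknown key or no tz database available: fall through.
            pass
    # Step 2: fall back to a fixed offset, if the object reports one
    # without needing a concrete datetime.
    try:
        offset = tz.utcoffset(None)
        if offset is not None:
            return dt.timezone(offset)
    except Exception:
        pass
    # Step 3: nothing worked; return the original rather than raising.
    return tz


class FakeFixedOffset(dt.tzinfo):
    zone = None

    def utcoffset(self, when):
        return dt.timedelta(hours=1)


class FakeUnknownZone(dt.tzinfo):
    zone = "Not/A_Real_Zone"

    def utcoffset(self, when):
        return None


print(normalize_tz(FakeFixedOffset()))  # UTC+01:00
unknown = FakeUnknownZone()
print(normalize_tz(unknown) is unknown)  # True
```

Returning the input unchanged as the last step is what keeps the conversion from introducing a new error path: the worst case is the behaviour users already have today.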

@jbrockmendel
Member

Couple of comments, generally looks good

