## Summary

Installing only the `parquet` extra is insufficient to read or write Parquet files. The parser raises `ModuleNotFoundError: No module named 'pandas'` the first time `read_rows()` is called.
## Reproduction

Fully self-contained: the `parquet` extra already installs `pyarrow`, so we use it to write the test file. No other data or deps required.

```bash
python -m venv /tmp/repro && source /tmp/repro/bin/activate
pip install 'frictionless[parquet]==5.19.0'

# 1. Write a tiny parquet file using pyarrow (which the extra does provide):
python - <<'PY'
import pyarrow as pa, pyarrow.parquet as pq
pq.write_table(pa.table({"id": [1, 2], "name": ["alice", "bob"]}), "/tmp/repro.parquet")
PY

# 2. Try to read it via frictionless; this triggers the bug:
cd /tmp && python -c "from frictionless import Resource; print(Resource('/tmp/repro.parquet').read_rows())"
```
Output (confirmed on a clean Python 3.12 venv):

```text
Traceback (most recent call last):
  ...
  File ".../frictionless/formats/parquet/parser.py", line 42, in read_cell_stream_create
    df = table.to_pandas(categories=control.categories or None)
         ^^^^^^^^^^^^^^^
  File "pyarrow/pandas-shim.pxi", line 50, in pyarrow.lib._PandasAPIShim._import_pandas
ModuleNotFoundError: No module named 'pandas'
frictionless.exception.FrictionlessException: [source-error] The data source has not supported or has inconsistent contents: No module named 'pandas'
```
## Root cause

`frictionless/formats/parquet/parser.py` uses pandas unconditionally:

- `table.to_pandas(...)` (line 42) on every read
- `TableResource(data=df, format="pandas")` (line 43) on every read
- `platform.pandas.io.common.get_handle(...)` (lines 32–35) on remote reads
- `source.to_pandas()` (line 55) on every write

But `pyproject.toml` declares `parquet = ["pyarrow>=14.0"]`; pandas is not pulled in. The import is lazy (`platform.pandas` via `@extras(name="pandas")`, and PyArrow calling it at runtime inside `to_pandas()`), which defers the error to first use instead of raising at import time. CI doesn't catch this because the hatch default env installs all extras together (e.g., `frictionless[...,pandas,parquet,...]`), masking the packaging gap.
The bug has been latent since PR #1260 (Oct 2022), which introduced the remote-read pandas dependency and the pandas-dataframe conversion.
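The deferred-failure behavior described above can be illustrated with a minimal lazy-import shim (a sketch of the general pattern, with hypothetical names; this is not frictionless's actual implementation):

```python
import importlib


class LazyModule:
    """Defer importing a module until its first attribute access, so a
    missing dependency surfaces at use time rather than at import time."""

    def __init__(self, name: str):
        self._name = name
        self._module = None

    def __getattr__(self, attr: str):
        if self._module is None:
            # ModuleNotFoundError fires here, on first use, not earlier
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


# Building the shim always succeeds, even for a module that isn't installed;
# only touching an attribute (e.g. shim.DataFrame) raises.
pandas_shim = LazyModule("definitely_not_installed_pandas")
```

This is why `import frictionless` and even constructing a `Resource` succeed; the error only appears once the parser actually calls into pandas.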
## Workaround

Install with both extras explicitly: `pip install 'frictionless[parquet,pandas]'`.
## Proposed fix

**Option A (quick fix):** Add `pandas>=1.0` to the `parquet` extra in `pyproject.toml` so it matches the actual runtime surface of `ParquetParser`. One-line change; it makes the `parquet` and `pandas` extras strictly redundant, which honestly reflects today's runtime coupling.
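For Option A, the change would look roughly like this in `pyproject.toml` (a sketch; the exact pandas lower bound is a judgment call for the maintainers):

```toml
[project.optional-dependencies]
# parquet reads/writes currently round-trip through pandas, so declare it:
parquet = ["pyarrow>=14.0", "pandas>=1.0"]
```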
**Option B (architectural fix):** Keep the `parquet` extra lightweight and avoid pulling in pandas entirely by rewriting `ParquetParser` to read natively from PyArrow, e.g., iterating via `table.to_batches()` or `table.to_pylist()` instead of delegating to `TableResource(format="pandas")`. Larger change; it decouples the two extras for good.
PR coming — opens with Option A as the minimal, low-risk fix; Option B is left as a follow-up for maintainers to weigh.
## Environment

- frictionless 5.19.0
- Python 3.12
- Linux (WSL2), but not OS-specific; reproduces anywhere pip installs the `parquet` extra without pandas.