frictionless[parquet] raises ModuleNotFoundError: pandas on first parquet read #1773

@dsmedia

Summary

Installing only the parquet extra is insufficient to read or write Parquet files. The parser hits ModuleNotFoundError: No module named 'pandas' the first time read_rows() is called.

Reproduction

Fully self-contained — the parquet extra already installs pyarrow, so we use it to write the test file. No other data or deps required.

python -m venv /tmp/repro && source /tmp/repro/bin/activate
pip install 'frictionless[parquet]==5.19.0'

# 1. Write a tiny parquet file using pyarrow (which the extra does provide):
python - <<'PY'
import pyarrow as pa, pyarrow.parquet as pq
pq.write_table(pa.table({"id": [1, 2], "name": ["alice", "bob"]}), "/tmp/repro.parquet")
PY

# 2. Try to read it via frictionless — triggers the bug:
cd /tmp && python -c "from frictionless import Resource; print(Resource('/tmp/repro.parquet').read_rows())"

Output (confirmed in a clean Python 3.12 venv):

Traceback (most recent call last):
  ...
  File ".../frictionless/formats/parquet/parser.py", line 42, in read_cell_stream_create
    df = table.to_pandas(categories=control.categories or None)
         ^^^^^^^^^^^^^^^
  File "pyarrow/pandas-shim.pxi", line 50, in pyarrow.lib._PandasAPIShim._import_pandas
ModuleNotFoundError: No module named 'pandas'

frictionless.exception.FrictionlessException: [source-error] The data source has not supported or has inconsistent contents: No module named 'pandas'

Root cause

frictionless/formats/parquet/parser.py uses pandas unconditionally:

  • table.to_pandas(...) (line 42) on every read
  • TableResource(data=df, format="pandas") (line 43) on every read
  • platform.pandas.io.common.get_handle(...) (lines 32–35) on remote reads
  • source.to_pandas() (line 55) on every write

But pyproject.toml declares parquet = ["pyarrow>=14.0"]; pandas is not pulled in. The import is lazy (frictionless resolves platform.pandas via @extras(name="pandas"), and PyArrow only imports pandas at runtime inside to_pandas()), which defers the error to first use instead of import time. CI doesn't catch this because the hatch default env installs all extras together (e.g., frictionless[...,pandas,parquet,...]), masking the packaging gap.
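The deferral can be illustrated with a minimal, stdlib-only sketch of the lazy-module pattern. The LazyModule class below is illustrative only, not frictionless's actual @extras implementation:

```python
import importlib


class LazyModule:
    """Illustrative lazy proxy: the real import (and any
    ModuleNotFoundError) happens on first attribute access,
    not when the proxy is created."""

    def __init__(self, name):
        self._name = name

    def __getattr__(self, attr):
        # Import is attempted here, so a missing package only
        # fails when something is actually used from it.
        module = importlib.import_module(self._name)
        return getattr(module, attr)


# Creating the proxy for a missing package succeeds silently...
pandas_like = LazyModule("definitely_not_installed_abc123")

# ...and the error only surfaces on first use, mirroring how the
# parquet parser fails inside the first read_rows() call:
try:
    pandas_like.DataFrame
except ModuleNotFoundError as exc:
    print("deferred:", exc)
```

This is why `pip install 'frictionless[parquet]'` followed by `import frictionless` succeeds without complaint, and the packaging gap only shows up when a Parquet file is actually read.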

The bug has been latent since PR #1260 (Oct 2022), which introduced the remote-read pandas dependency and the pandas-dataframe conversion.

Workaround

Install with both extras explicitly: pip install 'frictionless[parquet,pandas]'.

Proposed fix

Option A (quick fix): Add pandas>=1.0 to the parquet extra in pyproject.toml so it matches the actual runtime surface of ParquetParser. A one-line change; it makes the parquet extra subsume the pandas extra, which honestly reflects today's runtime coupling.
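For reference, the Option A change would look roughly like this in pyproject.toml (only the current parquet line is taken from the source; the surrounding table layout is assumed):

```toml
[project.optional-dependencies]
# before: parquet = ["pyarrow>=14.0"]
parquet = ["pyarrow>=14.0", "pandas>=1.0"]
```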

Option B (architectural fix): Keep the parquet extra lightweight and avoid pulling in pandas entirely by rewriting ParquetParser to read natively from PyArrow — e.g., iterating via table.to_batches() or table.to_pylist() instead of delegating to TableResource(format="pandas"). Larger change; decouples the two extras for good.

A PR is coming; it implements Option A as the minimal, low-risk fix, leaving Option B as a follow-up for maintainers to weigh.

Environment

  • frictionless 5.19.0
  • Python 3.12
  • Linux (WSL2), but not OS-specific — reproduces anywhere pip installs the parquet extra without pandas.
