
feat: add tests to validate dataset files against datapackage.json #782

Merged
dsmedia merged 6 commits into vega:main from dsmedia:add-validation
Apr 30, 2026

Conversation

@dsmedia
Member

@dsmedia dsmedia commented Apr 19, 2026

Adds tests/test_datapackage.py and tests/conftest.py — a two-tier pytest validator for verifying data files against datapackage.json. Advances the work introduced in #631 to adopt the Data Package v2 standard; full v2 conformance requires follow-up changes (see Scope and known gaps below). Frictionless tooling for datapackage is still nascent, so this PR incorporates some customizations to ensure proper coverage. CI integration is deferred to a follow-up PR (per review thread).

The validator runs in two tiers, each parametrized over every resource in the descriptor.

  • Fast tier (default) — pure-Python, sub-second across all 70+ resources. Verifies file existence, declared bytes, and git-blob SHA-1 against on-disk content. Stdlib only. We do this ourselves because of current limitations in frictionless-py (5.19.0): its byte-count check returns None for tabular JSON / arrow / parquet resources, and its hash check only supports md5 and sha256 (our descriptor uses sha1).

  • Slow tier (pytest --run-slow) — frictionless schema and row validation per resource. Multi-minute on flights-3m at full read; pass --limit-rows N to cap row reads for quick iteration. Byte-count and hash-count are skipped via Checklist(skip_errors=...) (the fast tier covers them more completely), and serial execution is load-bearing because frictionless's parallel path silently ignores Checklist.skip_errors. Known-expected slow-tier failures fall into two categories and are declared in _data/validate_datapackage.toml: pedagogical inconsistencies (e.g. movies carries intentional schema quirks documented as teaching material) and infrastructure gaps (e.g. flights_200k_arrow has no upstream arrow parser). Each entry is marked xfail(strict=True) at parametrize time. Removing an entry re-enables strict checking; if the upstream issue resolves, the run flips XFAIL → XPASS and fails, prompting allowlist removal. Frictionless tooling is evolving (as noted in feat: improve Data Package metadata compliance with CKAN licenses and field schemas #755) and over time fewer customizations may be required.

This bump also switches frictionless's parquet backend from fastparquet to pyarrow, which exposes a known gap (frictionlessdata/frictionless-py#1773, fix proposed in frictionlessdata/frictionless-py#1774): the parquet extra doesn't pull pandas, but the parquet parser calls pyarrow.Table.to_pandas(). pyproject.toml adds pandas>=2.2.3 as a top-level dep so CI's npm run build step (which calls build_datapackage.py) doesn't fail with No module named 'pandas'. On main this worked only because >=5.18.0 pulled fastparquet, which in turn pulled pandas transitively; the bump severed that chain. This workaround becomes redundant once #1774 releases.

Scope and known gaps

This PR validates consistency between data files and their descriptor entries: existence, byte size, git-blob SHA-1, and frictionless schema/row checks. It does not validate datapackage.json itself against the Data Package v2 JSON Schema.

Known gaps for follow-up work:

  • Resource paths use repo-local shorthand. The descriptor uses bare filenames under data/ rather than paths relative to the descriptor location. This is tracked in Help needed: fixing non-standard resource.path in vega-datasets altair#3946. Until that migration lands, the slow validation tier uses basepath=str(DATA).

  • The descriptor is not fully v2-conformant today, and does not declare $schema. A JSON Schema check against the published v2 profile fails because 12 of 73 resources declare type: "json" or type: "file", while v2 restricts Resource.type strictly to "table" (see https://datapackage.org/standard/data-resource/#type). Upstream work to formalize JSON Data Resource support — promoting it from recipe to standard — is tracked in frictionlessdata/datapackage#937, and would directly address 9 of the 12 non-table cases here once it lands.
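The 12-of-73 tally is straightforward to reproduce by counting declared resource types. A hypothetical sketch with an inline demo descriptor standing in for the real datapackage.json:

```python
import json
from collections import Counter


def tally_resource_types(descriptor_text: str) -> Counter:
    """Count declared Resource.type values; v2 permits only "table"."""
    pkg = json.loads(descriptor_text)
    return Counter(res.get("type", "(undeclared)") for res in pkg["resources"])


# Demo descriptor (illustrative, not the repo's actual resources).
demo = json.dumps({"resources": [
    {"name": "cars", "type": "table"},
    {"name": "movies", "type": "json"},
    {"name": "7zip", "type": "file"},
]})
```

Running this against the real descriptor surfaces exactly which resources would fail a strict v2 profile check.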

dsmedia added 3 commits April 19, 2026 01:11
Adds scripts/validate_datapackage.py for verifying data files against
the descriptor. Optional, local, not run in CI — the existing workflow
(taplo + ruff + npm build) is unchanged.

Runs two phases in sequence. Phase 1 is pure-Python: recomputes each
file's size and git-blob SHA-1 and compares against datapackage.json.
We do this ourselves because frictionless-py can't cover either case
reliably — its byte-count check returns None for tabular JSON / arrow /
parquet resources, and its hash check only supports md5 and sha256
(our descriptor uses sha1). Phase 2 hands each resource to
frictionless-py for schema and row validation, with byte-count and
hash-count skipped (phase 1 covered them) and serial execution because
frictionless's parallel path silently ignores Checklist.skip_errors.

Known-expected phase-2 failures — movies.json's intentional schema
mismatch (documented pedagogy) and flights-200k.arrow (no arrow parser
upstream) — are declared in _data/validate_datapackage.toml and marked
with a yellow warning in the output without tripping the exit code.
Removing an entry from that file re-enables strict checking and
surfaces any regression in a PR.

Bumps frictionless>=5.18.0 to >=5.18.1 to pick up the parallel-validate
fix in frictionlessdata/frictionless-py#1722; v5.18.0 threw a runtime
error on --parallel for non-FK packages. The script's PEP 723 deps
also include the pandas extra as a workaround for
frictionlessdata/frictionless-py#1773 / #1774 (the parquet extra alone
doesn't pull pandas, which the parquet parser needs); that extra will
be redundant once #1774 releases.
The lockfile incorrectly recorded taplo with `source = { registry = ".wheels" }`
and only the aarch64 wheel — leaked from a local WSL2 workaround when the
frictionless bump triggered a re-lock. CI on x86_64 couldn't install it.

Restore taplo's PyPI source with all platform wheels so CI passes.
frictionless[parquet]>=5.18.1 switched from fastparquet to pyarrow for
parquet support. pyarrow.Table.to_pandas() needs pandas at runtime but
pyarrow doesn't declare it as a transitive dep, so build_datapackage.py
failed in CI with "No module named 'pandas'" when inferring parquet
schemas. On main this worked only because frictionless[parquet]>=5.18.0
pulled fastparquet, which in turn pulled pandas.

Add pandas as an explicit top-level dep so the default install tree
satisfies frictionless's parquet backend.
@dsmedia dsmedia requested review from domoritz and mattijn April 19, 2026 13:43
Member

@domoritz domoritz left a comment


Wouldn't it be cleaner to write these checks as unit tests with pytest? The logic for rendering the failures looks very similar to what I would expect from a unit testing framework.

@domoritz observed on PR vega#782 that the script reimplements pytest's
reporting layer (custom rich panels, manual progress bars, custom
expected-failures filtering). vega-datasets is a metadata + data hub —
every line of tracked code should pay rent — so move the validator into
pytest where pytest already provides reporting, parametrization, xfail
strict, and standard CLI ergonomics.

Two tiers map onto pytest's fast/slow split:

* Default — file existence, declared bytes, git-blob SHA-1. Stdlib only,
  sub-second over 73 resources. Covers what frictionless-py doesn't
  today (byte-count returns None for tabular JSON / arrow / parquet;
  hash-count supports only md5 / sha256; descriptor uses sha1).

* Slow (`pytest --run-slow`) — frictionless schema and row validation
  per resource. Default is full read (matching the script); pass
  `--limit-rows N` to cap during iteration. flights-3m.parquet is
  ~minutes at full read.

The expected-failures allowlist (movies — intentional pedagogy;
flights_200k_arrow — no upstream parser) stays in
_data/validate_datapackage.toml. xfail marks are emitted at parametrize
time via pytest.param(resource, marks=[xfail(strict=True)]) — not via
brittle ID-string matching in pytest_collection_modifyitems. Strict
mode flips XFAIL to FAIL the moment an upstream issue resolves,
prompting allowlist removal.

Tests assert descriptor contracts, they don't skip on missing fields.
Every resource has path / bytes / sha1: hash today; if a future
descriptor regression drops one, the test fails loudly rather than
silently SKIPping.

parallel=False is preserved with a code comment explaining the
load-bearing rationale: frictionless's parallel path silently ignores
Checklist.skip_errors, which would re-surface byte-count and
hash-count errors phase 1 already covers.

Net: -330 lines from deleting scripts/validate_datapackage.py
(replaced by 213 lines under tests/), pytest>=9 promoted from
transitive in uv.lock to declared dev-group dep. The rich PEP 723
inline dep on the script goes away; rich itself stays in uv.lock
transitively via frictionless -> typer -> rich.
@dsmedia dsmedia changed the title feat: add script to validate dataset files against datapackage.json feat: add tests to validate dataset files against datapackage.json Apr 28, 2026
@dsmedia
Member Author

dsmedia commented Apr 28, 2026

Wouldn't it be cleaner to write these checks as unit tests with pytest? The logic for rendering the failures looks very similar to what I would expect from a unit testing framework.

Yep, makes sense -- converted.

@dsmedia dsmedia requested a review from domoritz April 28, 2026 10:39
Member

@domoritz domoritz left a comment


Should we run this in GitHub actions?

@dsmedia
Member Author

dsmedia commented Apr 30, 2026

Should we run this in GitHub actions?

Yes, definitely. Will do as a follow-up PR.

dsmedia added 2 commits April 30, 2026 08:12
Replace the misattributed reference to frictionlessdata/frictionless-py#1435
with a pointer to vega#758, the in-repo issue tracking
the actual cause: descriptor resource paths are bare filenames under
data/ rather than relative to the descriptor location.

#1435 is about remote descriptors with local data (CLI --basepath
silently ignored on remote URLs); our descriptor is local and fixing
#1435 upstream would not remove this workaround. vega#758 is the correct
upstream tracker and chains to vega/altair#3946 for the cross-repo
coordination required before the descriptor paths can be migrated.
`pytest.param` is a function (factory), not a type, so `-> pytest.param`
is an invalid type annotation (flagged by ty as `invalid-type-form`).
Pytest 9.0.3 doesn't expose `ParameterSet` at the public top level, so
fall back to `-> Any` with an inline comment naming the actual type.
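The shape of that fix, as a sketch (the helper name is illustrative):

```python
from typing import Any

import pytest


# `-> pytest.param` is invalid as an annotation: pytest.param is a factory
# function, not a type, and pytest 9 does not expose its ParameterSet return
# type at the public top level.
def make_param(resource: dict) -> Any:  # actually a pytest ParameterSet
    return pytest.param(resource, id=resource["name"])
```

Annotating `-> Any` with an inline comment naming the real type keeps ty/mypy quiet without reaching into `_pytest` internals.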
@dsmedia dsmedia merged commit ac11b79 into vega:main Apr 30, 2026
3 checks passed
@dsmedia dsmedia deleted the add-validation branch April 30, 2026 12:23