
feat: add tests to validate dataset files against datapackage.json #782

Merged
dsmedia merged 6 commits into vega:main from dsmedia:add-validation
Apr 30, 2026

Conversation

@dsmedia
Member

@dsmedia dsmedia commented Apr 19, 2026

Adds tests/test_datapackage.py and tests/conftest.py — a two-tier pytest validator for verifying data files against datapackage.json. Advances the work introduced in #631 to adopt the Data Package v2 standard; full v2 conformance requires follow-up changes (see Scope and known gaps below). Frictionless tooling for datapackage is still nascent, so this PR incorporates some customizations to ensure proper coverage. CI integration is deferred to a follow-up PR (per review thread).

The validator runs in two tiers, each parametrized over every resource in the descriptor.

  • Fast tier (default) — pure-Python, sub-second across all 70+ resources. Verifies file existence, declared bytes, and git-blob SHA-1 against on-disk content. Stdlib only. We do this ourselves because of current limitations in frictionless-py (5.19.0): its byte-count check returns None for tabular JSON / arrow / parquet resources, and its hash check only supports md5 and sha256 (our descriptor uses sha1).

  • Slow tier (pytest --run-slow) — frictionless schema and row validation per resource. Multi-minute on flights-3m at full read; pass --limit-rows N to cap row reads for quick iteration. Byte-count and hash-count are skipped via Checklist(skip_errors=...) (the fast tier covers them more completely), and serial execution is load-bearing because frictionless's parallel path silently ignores Checklist.skip_errors. Known-expected slow-tier failures fall into two categories and are declared in _data/validate_datapackage.toml: pedagogical inconsistencies (e.g. movies carries intentional schema quirks documented as teaching material) and infrastructure gaps (e.g. flights_200k_arrow has no upstream arrow parser). Each entry is marked xfail(strict=True) at parametrize time. Removing an entry re-enables strict checking; if the upstream issue resolves, the run flips XFAIL → XPASS and fails, prompting allowlist removal. Frictionless tooling is evolving (as noted in feat: improve Data Package metadata compliance with CKAN licenses and field schemas #755) and over time fewer customizations may be required.

This bump also switches frictionless's parquet backend from fastparquet to pyarrow, which exposes a known gap (frictionlessdata/frictionless-py#1773, fix proposed in frictionlessdata/frictionless-py#1774): the parquet extra doesn't pull pandas, but the parquet parser calls pyarrow.Table.to_pandas(). pyproject.toml adds pandas>=2.2.3 as a top-level dep so CI's npm run build step (which calls build_datapackage.py) doesn't fail with No module named 'pandas'. On main this worked only because >=5.18.0 pulled fastparquet, which in turn pulled pandas transitively; the bump severed that chain. This workaround becomes redundant once #1774 releases.

Scope and known gaps

This PR validates consistency between data files and their descriptor entries: existence, byte size, git-blob SHA-1, and frictionless schema/row checks. It does not validate datapackage.json itself against the Data Package v2 JSON Schema.

Known gaps for follow-up work:

  • Resource paths use repo-local shorthand. The descriptor uses bare filenames under data/ rather than paths relative to the descriptor location. This is tracked in Help needed: fixing non-standard resource.path in vega-datasets altair#3946. Until that migration lands, the slow validation tier uses basepath=str(DATA).

  • The descriptor is not fully v2-conformant today, and does not declare $schema. A JSON Schema check against the published v2 profile fails because 12 of 73 resources declare type: "json" or type: "file", while v2 restricts Resource.type strictly to "table" (see https://datapackage.org/standard/data-resource/#type). Upstream work to formalize JSON Data Resource support — promoting it from recipe to standard — is tracked in frictionlessdata/datapackage#937, and would directly address 9 of the 12 non-table cases here once it lands.
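The 12-of-73 tally is straightforward to reproduce by counting declared resource types. A hypothetical sketch with an inline demo descriptor standing in for the real datapackage.json:

```python
import json
from collections import Counter


def tally_resource_types(descriptor_text: str) -> Counter:
    """Count declared Resource.type values; v2 permits only "table"."""
    pkg = json.loads(descriptor_text)
    return Counter(res.get("type", "(undeclared)") for res in pkg["resources"])


# Demo descriptor (illustrative, not the repo's actual resources).
demo = json.dumps({"resources": [
    {"name": "cars", "type": "table"},
    {"name": "movies", "type": "json"},
    {"name": "7zip", "type": "file"},
]})
```

Running this against the real descriptor surfaces exactly which resources would fail a strict v2 profile check.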

dsmedia added 3 commits April 19, 2026 01:11
Adds scripts/validate_datapackage.py for verifying data files against
the descriptor. Optional, local, not run in CI — the existing workflow
(taplo + ruff + npm build) is unchanged.

Runs two phases in sequence. Phase 1 is pure-Python: recomputes each
file's size and git-blob SHA-1 and compares against datapackage.json.
We do this ourselves because frictionless-py can't cover either case
reliably — its byte-count check returns None for tabular JSON / arrow /
parquet resources, and its hash check only supports md5 and sha256
(our descriptor uses sha1). Phase 2 hands each resource to
frictionless-py for schema and row validation, with byte-count and
hash-count skipped (phase 1 covered them) and serial execution because
frictionless's parallel path silently ignores Checklist.skip_errors.

Known-expected phase-2 failures — movies.json's intentional schema
mismatch (documented pedagogy) and flights-200k.arrow (no arrow parser
upstream) — are declared in _data/validate_datapackage.toml and marked
with a yellow warning in the output without tripping the exit code.
Removing an entry from that file re-enables strict checking and
surfaces any regression in a PR.

Bumps frictionless>=5.18.0 to >=5.18.1 to pick up the parallel-validate
fix in frictionlessdata/frictionless-py#1722; v5.18.0 threw a runtime
error on --parallel for non-FK packages. The script's PEP 723 deps
also include the pandas extra as a workaround for
frictionlessdata/frictionless-py#1773 / #1774 (the parquet extra alone
doesn't pull pandas, which the parquet parser needs); that extra will
be redundant once #1774 releases.
The lockfile incorrectly recorded taplo with `source = { registry = ".wheels" }`
and only the aarch64 wheel — leaked from a local WSL2 workaround when the
frictionless bump triggered a re-lock. CI on x86_64 couldn't install it.

Restore taplo's PyPI source with all platform wheels so CI passes.
frictionless[parquet]>=5.18.1 switched from fastparquet to pyarrow for
parquet support. pyarrow.Table.to_pandas() needs pandas at runtime but
pyarrow doesn't declare it as a transitive dep, so build_datapackage.py
failed in CI with "No module named 'pandas'" when inferring parquet
schemas. On main this worked only because frictionless[parquet]>=5.18.0
pulled fastparquet, which in turn pulled pandas.

Add pandas as an explicit top-level dep so the default install tree
satisfies frictionless's parquet backend.
@dsmedia dsmedia requested review from domoritz and mattijn April 19, 2026 13:43
Member

@domoritz domoritz left a comment


Wouldn't it be cleaner to write these checks as unit tests with pytest? The logic for rendering the failures looks very similar to what I would expect from a unit testing framework.

@domoritz observed on PR vega#782 that the script reimplements pytest's
reporting layer (custom rich panels, manual progress bars, custom
expected-failures filtering). vega-datasets is a metadata + data hub —
every line of tracked code should pay rent — so move the validator into
pytest where pytest already provides reporting, parametrization, xfail
strict, and standard CLI ergonomics.

Two tiers map onto pytest's fast/slow split:

* Default — file existence, declared bytes, git-blob SHA-1. Stdlib only,
  sub-second over 73 resources. Covers what frictionless-py doesn't
  today (byte-count returns None for tabular JSON / arrow / parquet;
  hash-count supports only md5 / sha256; descriptor uses sha1).

* Slow (`pytest --run-slow`) — frictionless schema and row validation
  per resource. Default is full read (matching the script); pass
  `--limit-rows N` to cap during iteration. flights-3m.parquet is
  ~minutes at full read.

The expected-failures allowlist (movies — intentional pedagogy;
flights_200k_arrow — no upstream parser) stays in
_data/validate_datapackage.toml. xfail marks are emitted at parametrize
time via pytest.param(resource, marks=[xfail(strict=True)]) — not via
brittle ID-string matching in pytest_collection_modifyitems. Strict
mode flips XFAIL to FAIL the moment an upstream issue resolves,
prompting allowlist removal.

Tests assert descriptor contracts, they don't skip on missing fields.
Every resource has path / bytes / sha1: hash today; if a future
descriptor regression drops one, the test fails loudly rather than
silently SKIPping.

parallel=False is preserved with a code comment explaining the
load-bearing rationale: frictionless's parallel path silently ignores
Checklist.skip_errors, which would re-surface byte-count and
hash-count errors phase 1 already covers.

Net: -330 lines from deleting scripts/validate_datapackage.py
(replaced by 213 lines under tests/), pytest>=9 promoted from
transitive in uv.lock to declared dev-group dep. The rich PEP 723
inline dep on the script goes away; rich itself stays in uv.lock
transitively via frictionless -> typer -> rich.
@dsmedia dsmedia changed the title feat: add script to validate dataset files against datapackage.json feat: add tests to validate dataset files against datapackage.json Apr 28, 2026
@dsmedia
Member Author

dsmedia commented Apr 28, 2026

Wouldn't it be cleaner to write these checks as unit tests with pytest? The logic for rendering the failures looks very similar to what I would expect from a unit testing framework.

Yep, makes sense -- converted.

@dsmedia dsmedia requested a review from domoritz April 28, 2026 10:39
Member

@domoritz domoritz left a comment


Should we run this in GitHub actions?

@dsmedia
Member Author

dsmedia commented Apr 30, 2026

Should we run this in GitHub actions?

Yes, definitely. Will do as a follow-up PR.

dsmedia added 2 commits April 30, 2026 08:12
Replace the misattributed reference to frictionlessdata/frictionless-py#1435
with a pointer to vega#758, the in-repo issue tracking
the actual cause: descriptor resource paths are bare filenames under
data/ rather than relative to the descriptor location.

#1435 is about remote descriptors with local data (CLI --basepath
silently ignored on remote URLs); our descriptor is local and fixing
#1435 upstream would not remove this workaround. vega#758 is the correct
upstream tracker and chains to vega/altair#3946 for the cross-repo
coordination required before the descriptor paths can be migrated.
`pytest.param` is a function (factory), not a type, so `-> pytest.param`
is an invalid type annotation (flagged by ty as `invalid-type-form`).
Pytest 9.0.3 doesn't expose `ParameterSet` at the public top level, so
fall back to `-> Any` with an inline comment naming the actual type.
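The shape of that fix, as a sketch (the helper name is illustrative):

```python
from typing import Any

import pytest


# `-> pytest.param` is invalid as an annotation: pytest.param is a factory
# function, not a type, and pytest 9 does not expose its ParameterSet return
# type at the public top level.
def make_param(resource: dict) -> Any:  # actually a pytest ParameterSet
    return pytest.param(resource, id=resource["name"])
```

Annotating `-> Any` with an inline comment naming the real type keeps ty/mypy quiet without reaching into `_pytest` internals.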
@dsmedia dsmedia merged commit ac11b79 into vega:main Apr 30, 2026
3 checks passed
@dsmedia dsmedia deleted the add-validation branch April 30, 2026 12:23