feat: add tests to validate dataset files against datapackage.json #782
Merged
Conversation
Adds `scripts/validate_datapackage.py` for verifying data files against the descriptor. It is optional, local, and not run in CI; the existing workflow (taplo + ruff + npm build) is unchanged.

The script runs two phases in sequence.

Phase 1 is pure Python: it recomputes each file's size and git-blob SHA-1 and compares them against `datapackage.json`. We do this ourselves because frictionless-py can't cover either case reliably: its byte-count check returns `None` for tabular JSON / arrow / parquet resources, and its hash check only supports md5 and sha256 (our descriptor uses sha1).

Phase 2 hands each resource to frictionless-py for schema and row validation, with byte-count and hash-count checks skipped (phase 1 covered them) and serial execution, because frictionless's parallel path silently ignores `Checklist.skip_errors`.

Known-expected phase-2 failures, `movies.json`'s intentional schema mismatch (documented pedagogy) and `flights-200k.arrow` (no arrow parser upstream), are declared in `_data/validate_datapackage.toml` and marked with a yellow warning in the output without tripping the exit code. Removing an entry from that file re-enables strict checking and surfaces any regression in a PR.

Bumps `frictionless>=5.18.0` to `>=5.18.1` to pick up the parallel-validate fix in frictionlessdata/frictionless-py#1722; v5.18.0 threw a runtime error on `--parallel` for non-FK packages. The script's PEP 723 deps also include the `pandas` extra as a workaround for frictionlessdata/frictionless-py#1773 / #1774 (the `parquet` extra alone doesn't pull pandas, which the parquet parser needs); that extra will be redundant once #1774 releases.
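The git-blob digest that phase 1 recomputes needs only the stdlib; a minimal sketch (function name illustrative, not the script's actual helper):

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    # git hashes a blob as sha1(b"blob <decimal size>\0" + content),
    # so this matches `git hash-object` on the same bytes.
    return hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()


# The empty blob has the well-known id e69de29bb2d1d6434b8b29ae775ad8c2e48c5391.
empty_id = git_blob_sha1(b"")
```

Comparing this digest (plus `len(content)`) against the descriptor's `sha1:` hash and `bytes` fields is the whole of phase 1's per-file check.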
The lockfile incorrectly recorded taplo with `source = { registry = ".wheels" }`
and only the aarch64 wheel — leaked from a local WSL2 workaround when the
frictionless bump triggered a re-lock. CI on x86_64 couldn't install it.
Restore taplo's PyPI source with all platform wheels so CI passes.
frictionless[parquet]>=5.18.1 switched from fastparquet to pyarrow for parquet support. pyarrow.Table.to_pandas() needs pandas at runtime but pyarrow doesn't declare it as a transitive dep, so build_datapackage.py failed in CI with "No module named 'pandas'" when inferring parquet schemas. On main this worked only because frictionless[parquet]>=5.18.0 pulled fastparquet, which in turn pulled pandas. Add pandas as an explicit top-level dep so the default install tree satisfies frictionless's parquet backend.
domoritz
requested changes
Apr 27, 2026
Member
domoritz
left a comment
Wouldn't it be cleaner to write these checks as unit tests with pytest? The logic for rendering the failures looks very similar to what I would expect from a unit testing framework.
@domoritz observed on PR vega#782 that the script reimplements pytest's reporting layer (custom rich panels, manual progress bars, custom expected-failures filtering). vega-datasets is a metadata + data hub, so every line of tracked code should pay rent; move the validator into pytest, where pytest already provides reporting, parametrization, strict xfail, and standard CLI ergonomics.

Two tiers map onto pytest's fast/slow split:

* Default: file existence, declared bytes, git-blob SHA-1. Stdlib only, sub-second over 73 resources. Covers what frictionless-py doesn't today (byte-count returns `None` for tabular JSON / arrow / parquet; hash-count supports only md5 / sha256; the descriptor uses sha1).
* Slow (`pytest --run-slow`): frictionless schema and row validation per resource. Default is a full read (matching the script); pass `--limit-rows N` to cap during iteration. `flights-3m.parquet` takes minutes at full read.

The expected-failures allowlist (movies: intentional pedagogy; flights_200k_arrow: no upstream parser) stays in `_data/validate_datapackage.toml`. xfail marks are emitted at parametrize time via `pytest.param(resource, marks=[xfail(strict=True)])`, not via brittle ID-string matching in `pytest_collection_modifyitems`. Strict mode flips XFAIL to FAIL the moment an upstream issue resolves, prompting allowlist removal.

Tests assert descriptor contracts; they don't skip on missing fields. Every resource has a `path` / `bytes` / `sha1:` hash today; if a future descriptor regression drops one, the test fails loudly rather than silently SKIPping.

`parallel=False` is preserved with a code comment explaining the load-bearing rationale: frictionless's parallel path silently ignores `Checklist.skip_errors`, which would re-surface the byte-count and hash-count errors phase 1 already covers.

Net: -330 lines from deleting `scripts/validate_datapackage.py` (replaced by 213 lines under `tests/`); `pytest>=9` promoted from transitive in `uv.lock` to a declared dev-group dep.
The rich PEP 723 inline dep on the script goes away; rich itself stays in uv.lock transitively via frictionless -> typer -> rich.
Member
Author
Yep, makes sense -- converted.
domoritz
approved these changes
Apr 29, 2026
Member
domoritz
left a comment
Should we run this in GitHub actions?
Member
Author
Yes, definitely. Will do as a follow-up PR.
Replace the misattributed reference to frictionlessdata/frictionless-py#1435 with a pointer to vega#758, the in-repo issue tracking the actual cause: descriptor resource paths are bare filenames under data/ rather than relative to the descriptor location. #1435 is about remote descriptors with local data (CLI --basepath silently ignored on remote URLs); our descriptor is local and fixing #1435 upstream would not remove this workaround. vega#758 is the correct upstream tracker and chains to vega/altair#3946 for the cross-repo coordination required before the descriptor paths can be migrated.
`pytest.param` is a function (factory), not a type, so `-> pytest.param` is an invalid type annotation (flagged by ty as `invalid-type-form`). Pytest 9.0.3 doesn't expose `ParameterSet` at the public top level, so fall back to `-> Any` with an inline comment naming the actual type.
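The fix can be illustrated in a few lines (helper name hypothetical):

```python
from typing import Any

import pytest


# `-> pytest.param` is an invalid annotation: pytest.param is a factory
# function, not a type. The value it returns is a
# _pytest.mark.structures.ParameterSet, which pytest 9 does not re-export
# at the public top level, so the annotation falls back to Any.
def xfail_param(name: str, strict_xfail: bool) -> Any:
    marks = [pytest.mark.xfail(strict=True)] if strict_xfail else []
    return pytest.param(name, marks=marks)
```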
Adds `tests/test_datapackage.py` and `tests/conftest.py`: a two-tier pytest validator for verifying data files against `datapackage.json`. Advances the work introduced in #631 to adopt the Data Package v2 standard; full v2 conformance requires follow-up changes (see Scope and known gaps below). Frictionless tooling for datapackage is still nascent, so this PR incorporates some customizations to ensure proper coverage. CI integration is committed as a follow-up PR (per review thread).

The validator runs in two tiers, each parametrized over every resource in the descriptor.
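The per-resource parametrization can be sketched as follows (descriptor contents inlined for illustration, with made-up byte counts; the real tests load `datapackage.json` from the repo root):

```python
import pytest

# Inlined stand-in for datapackage.json. Assumption: each resource declares
# name / path / bytes; the real descriptor has 70+ entries.
DESCRIPTOR = {
    "resources": [
        {"name": "cars", "path": "cars.json", "bytes": 1234},
        {"name": "movies", "path": "movies.json", "bytes": 5678},
    ]
}


@pytest.mark.parametrize("resource", DESCRIPTOR["resources"], ids=lambda r: r["name"])
def test_descriptor_contract(resource):
    # Contract assertion: fail loudly if a descriptor regression drops a
    # field, rather than silently skipping the resource.
    assert resource["path"] and resource["bytes"] > 0
```

Using `ids=lambda r: r["name"]` gives each generated test a readable ID (`test_descriptor_contract[cars]`), so a single resource can be re-run with `pytest -k cars`.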
Fast tier (default): pure-Python, sub-second across all 70+ resources. Verifies file existence, declared `bytes`, and git-blob SHA-1 against on-disk content. Stdlib only. We do this ourselves because of current limitations in `frictionless-py` (5.19.0): its byte-count check returns `None` for tabular JSON / arrow / parquet resources, and its hash check only supports md5 and sha256 (our descriptor uses sha1).

Slow tier (`pytest --run-slow`): frictionless schema and row validation per resource. Multi-minute on `flights-3m` at full read; pass `--limit-rows N` to cap row reads for quick iteration. Byte-count and hash-count are skipped via `Checklist(skip_errors=...)` (the fast tier covers them more completely), and serial execution is load-bearing because frictionless's parallel path silently ignores `Checklist.skip_errors`.

Known-expected slow-tier failures fall into two categories and are declared in `_data/validate_datapackage.toml`: pedagogical inconsistencies (e.g. `movies` carries intentional schema quirks documented as teaching material) and infrastructure gaps (e.g. `flights_200k_arrow` has no upstream arrow parser). Each entry is marked `xfail(strict=True)` at parametrize time. Removing an entry re-enables strict checking; if the upstream issue resolves, the run flips XFAIL → XPASS and fails, prompting allowlist removal. Frictionless tooling is evolving (as noted in #755: improve Data Package metadata compliance with CKAN licenses and field schemas), and over time fewer customizations may be required.

This bump also switches frictionless's parquet backend from `fastparquet` to `pyarrow`, which exposes a known gap (frictionlessdata/frictionless-py#1773, fix proposed in frictionlessdata/frictionless-py#1774): the `parquet` extra doesn't pull `pandas`, but the parquet parser calls `pyarrow.Table.to_pandas()`. `pyproject.toml` adds `pandas>=2.2.3` as a top-level dep so CI's `npm run build` step (which calls `build_datapackage.py`) doesn't fail with `No module named 'pandas'`. On `main` this worked only because `>=5.18.0` pulled `fastparquet` → `pandas` transitively; the bump severed that chain. This workaround becomes redundant once #1774 releases.

**Scope and known gaps**

This PR validates consistency between data files and their descriptor entries: existence, byte size, git-blob SHA-1, and frictionless schema/row checks. It does not validate `datapackage.json` itself against the Data Package v2 JSON Schema.

Known gaps for follow-up work:

* Resource paths use repo-local shorthand. The descriptor uses bare filenames under `data/` rather than paths relative to the descriptor location. This is tracked in vega/altair#3946 (Help needed: fixing non-standard `resource.path` in vega-datasets). Until that migration lands, the slow validation tier uses `basepath=str(DATA)`.
* The descriptor is not fully v2-conformant today, and does not declare `$schema`. A JSON Schema check against the published v2 profile fails because 12 of 73 resources declare `type: "json"` or `type: "file"`, while v2 restricts `Resource.type` strictly to `"table"` (see https://datapackage.org/standard/data-resource/#type). Upstream work to formalize JSON Data Resource support (promoting it from recipe to standard) is tracked in frictionlessdata/datapackage#937, and would directly address 9 of the 12 non-table cases here once it lands.
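The non-table tally above can be sketched with a stdlib-only filter (resources inlined for illustration; a real check would load them from `datapackage.json`):

```python
# Stand-in resources; the real descriptor has 73, of which 12 are non-"table".
resources = [
    {"name": "cars", "type": "table"},
    {"name": "movies", "type": "json"},
    {"name": "world-110m", "type": "file"},
]

# Data Package v2 currently restricts Resource.type to "table", so anything
# declaring "json" or "file" fails profile validation.
non_table = [r["name"] for r in resources if r.get("type") != "table"]
# non_table == ["movies", "world-110m"]
```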