ci: run datapackage validation tests#784
Merged
Merged
Conversation
Wire the pytest suite from vega#782 into the existing Test workflow so descriptor/data drift fails CI rather than slipping through. Includes the slow tier (--run-slow) — the fast tier alone catches bytes/sha1 drift that npm run build would re-trip anyway, while the slow tier is the unique value-add catching schema-vs-data drift. Step is placed after npm run build so tests validate the freshly rebuilt datapackage.json (catches build_datapackage.py regressions, not just committed-state drift). Local timing: 290 passed, 2 xfailed in ~3m32s on WSL2 ARM (flights_3m is the long pole). The two xfails are the pre-existing allowlisted movies + flights_200k_arrow entries from _data/validate_datapackage.toml. Closes the follow-up commitment from vega#782 review (vega#782 (comment)).
Member
Author
|
@domoritz |
Member
|
Yeah, validating a sample seems fine. |
domoritz
approved these changes
May 10, 2026
Per @domoritz on vega#784: flights_3m is the only resource above ~200K rows, so capping at 250K keeps every other dataset fully validated while cutting the Test job from ~4m43s to ~40s. Renamed --run-slow → --runslow to match the dominant Python convention (pytest docs example, ~1.3K GitHub hits vs ~500 for --run-slow). Updated CONTRIBUTING.md to reflect that the slow tier now runs in CI and added per-test docstrings naming each fast-tier failure mode.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Wires the pytest validator from #782 into the existing
Testworkflow, so dataset-vs-descriptor drift fails CI on every PR rather than slipping through.What this validates
"Validating a datapackage" can mean a few different things. Here's what this PR does and doesn't cover:
What this PR covers — checking that every data file still matches what
datapackage.jsonsays about it.Two tiers:
datapackage.json."--runslow, ~32s in CI): every column matches its declared type, required cells aren't missing, declared limits aren't violated. Catches "a data refresh addedN/Atext into a column we said was numbers."What this PR doesn't cover:
datapackage.jsonitself follows the Data Package format (missing required fields, values that don't match the spec's allowed list). Tracked as a follow-up in Validate datapackage.json against the Data Package v2 JSON Schema #785, with the deferral reason and concrete unblock path documented there. Briefly: a strict v2 schema check fails today on 12 of our 73 resources (type: "json"or"file", both rejected by the v2 enum which only permits"table"); the cleanest path is upstream frictionlessdata/datapackage#937 formalizing JSON Data Resource support.check-datapackage(from the seedcase-project; mentioned by @joelostblom on Help needed: fixing non-standardresource.pathin vega-datasets altair#3946) is the natural tool for that follow-up.Why
--limit-rows 250000flights_3m.parquetis the only resource above ~200K rows; the next largest (flights_200k_*) sit right at the cap. Capping at 250K leaves every other resource fully validated and cuts CI from ~4m43s to ~40s. Confirmed with @domoritz it is OK here to validate a sample.Why
--runslow(was--run-slow)Matches the pytest docs example. Updated in
tests/conftest.py,pyproject.toml, thetests/test_datapackage.pydocstring, andCONTRIBUTING.mdvs initial commit.Why we drive frictionless-py from pytest instead of using its CLI directly
Our tests import
frictionless.Packageandfrictionless.Checklistand call.validate()per resource. The alternative would be shelling out tofrictionless validate datapackage.json. We don't, because of three frictionless gaps (as offrictionless5.19.0) and three workflow needs the CLI doesn't satisfy:byte-countreturnsNonefor tabular JSON / arrow / parquet — roughly half our resources would silently skip byte verification.hash-countonly supports md5 and sha256; our descriptor uses sha1 (git-blob compatible).Checklist.skip_errorsis silently ignored on the parallel code path, so we passparallel=False.xfailallowlist formovies(intentional pedagogical quirks) andflights_200k_arrow(no upstream parser). If upstream ever fixes them, the tests flip XFAIL → XPASS and prompt allowlist removal.pytestshould stay sub-second for the inner loop.path: "cars.json") instead of the v2-required descriptor-relative paths; CLI invocation would need the samebasepathworkaround anyway. The non-conformance is tracked in Fixresource.pathto be spec-compliant #758, and Help needed: fixing non-standardresource.pathin vega-datasets altair#3946 is the cross-repo coordination ask to make altair handle both path formats during migration.We use frictionless's validation engine (
Checklist,Package.validate); pytest provides the harness, parametrization, fast/slow split, xfail tracking, and CI integration.Test plan
uv run pytest tests/ -v --runslow --limit-rows 250000→ 290 passed, 2 xfailed in ~32suv run pytest tests/ -v(fast tier alone) → 219 passed, 73 skipped in <1s