ci: run datapackage validation tests by dsmedia · Pull Request #784 · vega/vega-datasets

dsmedia · 2026-05-09T01:50:34Z

Wires the pytest validator from #782 into the existing Test workflow, so dataset-vs-descriptor drift fails CI on every PR rather than slipping through.

What this validates

"Validating a datapackage" can mean a few different things. Here's what this PR does and doesn't cover:

What this PR covers — checking that every data file still matches what datapackage.json says about it.

Two tiers:

Fast (runs by default, under a second): every file exists, is the size we recorded, and has the content we recorded. Catches "someone edited a CSV and forgot to regenerate datapackage.json."
Slow (--runslow, ~32s in CI): every column matches its declared type, required cells aren't missing, declared limits aren't violated. Catches "a data refresh added N/A text into a column we said was numbers."
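
As a rough illustration of the fast tier, the sketch below checks one file for existence, recorded size, and recorded hash. It is not the code in tests/test_datapackage.py: `git_blob_sha1` and `check_file` are hypothetical names, and it assumes "git-blob compatible" means the SHA-1 is computed the way git hashes a blob (a `blob <size>\0` header followed by the content).

```python
import hashlib
from pathlib import Path


def git_blob_sha1(data: bytes) -> str:
    """SHA-1 as git computes it for a blob: header + raw content.

    Assumption: the descriptor's sha1 values follow this convention.
    """
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def check_file(path: Path, expected_bytes: int, expected_sha1: str) -> list[str]:
    """Return a list of human-readable problems (empty means the file matches)."""
    if not path.exists():
        return [f"{path.name}: file is missing"]

    data = path.read_bytes()
    problems = []
    if len(data) != expected_bytes:
        problems.append(
            f"{path.name}: size {len(data)} != recorded {expected_bytes}"
        )
    if git_blob_sha1(data) != expected_sha1:
        problems.append(f"{path.name}: content hash differs from recorded sha1")
    return problems
```

Because this only reads bytes and never parses rows, a check of this shape stays cheap enough to run on every bare `pytest` invocation.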

What this PR doesn't cover:

Whether datapackage.json itself follows the Data Package format (missing required fields, values outside the spec's allowed lists). That is tracked as a follow-up in #785 ("Validate datapackage.json against the Data Package v2 JSON Schema"), which documents the deferral reason and a concrete unblock path. Briefly: a strict v2 schema check fails today on 12 of our 73 resources (type: "json" or "file", both rejected by the v2 enum, which only permits "table"); the cleanest path is upstream frictionlessdata/datapackage#937 formalizing JSON Data Resource support. check-datapackage (from the seedcase-project; mentioned by @joelostblom in altair#3946, "Help needed: fixing non-standard resource.path in vega-datasets") is the natural tool for that follow-up.

Why `--limit-rows 250000`

flights_3m.parquet is the only resource above ~200K rows; the next largest (flights_200k_*) sit at about 200K rows, below the cap. Capping at 250K leaves every other resource fully validated and cuts the CI Test job from ~4m43s to ~40s. Confirmed with @domoritz that validating a sample is OK here.
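
To make the effect of a row cap concrete, here is a small standard-library sketch of a type check that stops after a fixed number of rows. It is not how frictionless implements row limiting; `count_non_numeric` and the `delay` column are hypothetical names used only for illustration.

```python
import csv
import io
from itertools import islice


def count_non_numeric(lines, column, limit_rows=None):
    """Count cells in `column` that do not parse as numbers.

    Only the first `limit_rows` data rows are inspected; None means no cap.
    """
    reader = csv.DictReader(lines)
    bad = 0
    for row in islice(reader, limit_rows):
        try:
            float(row[column])
        except (TypeError, ValueError):
            # "N/A", empty strings, and missing cells all land here.
            bad += 1
    return bad


sample = io.StringIO("delay\n1\nN/A\n3\nN/A\n")
print(count_non_numeric(sample, "delay", limit_rows=2))
```

A capped check sees only the head of a file, so any file shorter than the cap is still validated in full, while a longer file is validated as a sample.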

Why `--runslow` (was `--run-slow`)

Matches the pytest docs example. Relative to the initial commit, the flag is renamed in tests/conftest.py, pyproject.toml, the tests/test_datapackage.py docstring, and CONTRIBUTING.md.
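
For reference, the opt-in pattern from the pytest documentation looks roughly like this. It is a sketch of that documented pattern, not a copy of this repository's tests/conftest.py; the `slow` marker name is the one the pytest docs use and is assumed here.

```python
import pytest


def pytest_addoption(parser):
    # Registers the command-line flag; slow tests stay off unless it is passed.
    parser.addoption(
        "--runslow", action="store_true", default=False, help="run slow tests"
    )


def pytest_configure(config):
    # Declares the marker so `@pytest.mark.slow` does not trigger a warning.
    config.addinivalue_line("markers", "slow: mark test as slow to run")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        # Flag given on the command line: leave slow tests enabled.
        return
    skip_slow = pytest.mark.skip(reason="need --runslow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
```

With hooks of this shape, bare `pytest` reports slow tests as skipped rather than silently omitting them, which is consistent with the "73 skipped" count in the fast-tier run.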

Why we drive frictionless-py from pytest instead of using its CLI directly

Our tests import frictionless.Package and frictionless.Checklist and call .validate() per resource. The alternative would be shelling out to frictionless validate datapackage.json. We don't, because of three frictionless gaps (as of frictionless 5.19.0) and three workflow needs the CLI doesn't satisfy:

byte-count returns None for tabular JSON / arrow / parquet — roughly half our resources would silently skip byte verification.
hash-count only supports md5 and sha256; our descriptor uses sha1 (git-blob compatible).
Checklist.skip_errors is silently ignored on the parallel code path, so we pass parallel=False.
We need a per-resource strict xfail allowlist for movies (intentional pedagogical quirks) and flights_200k_arrow (no upstream parser). If upstream ever fixes them, the tests flip XFAIL → XPASS and prompt allowlist removal.
We want a fast/slow tier split — bare pytest should stay sub-second for the inner loop.
Our descriptor uses bare filenames (path: "cars.json") instead of the v2-required descriptor-relative paths; CLI invocation would need the same basepath workaround anyway. The non-conformance is tracked in #758 ("Fix resource.path to be spec-compliant"), and altair#3946 ("Help needed: fixing non-standard resource.path in vega-datasets") is the cross-repo coordination ask to make altair handle both path formats during migration.

We use frictionless's validation engine (Checklist, Package.validate); pytest provides the harness, parametrization, fast/slow split, xfail tracking, and CI integration.
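
The strict xfail allowlist described above can be sketched with pytest's parametrization API. This is an illustration, not the repository's code: `XFAIL_ALLOWLIST` is a hypothetical in-code stand-in for the entries kept in _data/validate_datapackage.toml, and `resource_params` is a hypothetical helper name.

```python
import pytest

# Hypothetical stand-in for the allowlist file; reasons are taken from the PR text.
XFAIL_ALLOWLIST = {
    "movies": "intentional pedagogical quirks",
    "flights_200k_arrow": "no upstream parser",
}


def resource_params(names):
    """Build one pytest parameter per resource, marking allowlisted ones xfail."""
    params = []
    for name in names:
        reason = XFAIL_ALLOWLIST.get(name)
        # strict=True turns an unexpected pass into a failure, which is the
        # signal to remove the entry from the allowlist.
        marks = [pytest.mark.xfail(reason=reason, strict=True)] if reason else []
        params.append(pytest.param(name, marks=marks, id=name))
    return params
```

A test would then consume these with `@pytest.mark.parametrize("name", resource_params(all_resource_names))`, where `all_resource_names` is again a placeholder for however the suite enumerates resources.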

Test plan

Local: uv run pytest tests/ -v --runslow --limit-rows 250000 → 290 passed, 2 xfailed in ~32s
Local: uv run pytest tests/ -v (fast tier alone) → 219 passed, 73 skipped in <1s
CI run matches local outcomes and runs in under 2 minutes

Wire the pytest suite from vega#782 into the existing Test workflow so descriptor/data drift fails CI rather than slipping through. Includes the slow tier (--run-slow) — the fast tier alone catches bytes/sha1 drift that npm run build would re-trip anyway, while the slow tier is the unique value-add catching schema-vs-data drift. Step is placed after npm run build so tests validate the freshly rebuilt datapackage.json (catches build_datapackage.py regressions, not just committed-state drift). Local timing: 290 passed, 2 xfailed in ~3m32s on WSL2 ARM (flights_3m is the long pole). The two xfails are the pre-existing allowlisted movies + flights_200k_arrow entries from _data/validate_datapackage.toml. Closes the follow-up commitment from vega#782 review (vega#782 (comment)).

dsmedia · 2026-05-09T02:10:20Z

@domoritz flights_3m is the slow one here. If we add --limit-rows 250000, it would cut the Test job from ~4m43s to ~40s, and since no other dataset has more than 250K rows, everything else stays fully validated. Which would you prefer?

domoritz · 2026-05-10T21:39:47Z

Yeah, validating a sample seems fine.

Per @domoritz on vega#784: flights_3m is the only resource above ~200K rows, so capping at 250K keeps every other dataset fully validated while cutting the Test job from ~4m43s to ~40s. Renamed --run-slow → --runslow to match the dominant Python convention (pytest docs example, ~1.3K GitHub hits vs ~500 for --run-slow). Updated CONTRIBUTING.md to reflect that the slow tier now runs in CI and added per-test docstrings naming each fast-tier failure mode.

dsmedia marked this pull request as ready for review May 9, 2026 02:10

domoritz approved these changes May 10, 2026

dsmedia mentioned this pull request May 10, 2026

Validate datapackage.json against the Data Package v2 JSON Schema #785


dsmedia merged commit fd88f62 into vega:main May 11, 2026
6 checks passed

dsmedia deleted the ci-validate-datapackage branch May 11, 2026 00:11

dsmedia merged 2 commits into vega:main from dsmedia:ci-validate-datapackage
