ci: run datapackage validation tests #784

Merged: dsmedia merged 2 commits into vega:main from dsmedia:ci-validate-datapackage on May 11, 2026

dsmedia commented May 9, 2026

Wires the pytest validator from #782 into the existing Test workflow, so dataset-vs-descriptor drift fails CI on every PR rather than slipping through.

What this validates

"Validating a datapackage" can mean a few different things. Here's what this PR does and doesn't cover:

What this PR covers — checking that every data file still matches what datapackage.json says about it.

Two tiers:

  • Fast (runs by default, under a second): every file exists, is the size we recorded, and has the content we recorded. Catches "someone edited a CSV and forgot to regenerate datapackage.json."
  • Slow (--runslow, ~32s in CI): every column matches its declared type, required cells aren't missing, declared limits aren't violated. Catches "a data refresh added N/A text into a column we said was numbers."
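The fast tier's three checks can be sketched roughly as follows. This is a minimal illustration, not the code in tests/test_datapackage.py: the helper names are hypothetical, and it assumes each resource records `path`, `bytes`, and a bare hex `hash` computed the way git hashes a blob.

```python
import hashlib
from pathlib import Path


def git_blob_sha1(data: bytes) -> str:
    """SHA-1 as git computes it for a blob: 'blob <size>' + NUL byte + content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def fast_tier_problems(data_dir: Path, resource: dict) -> list[str]:
    """Return human-readable problems; an empty list means the file matches.

    `resource` is assumed to carry `path`, `bytes`, and `hash` keys
    (an assumption about the descriptor layout, for illustration only).
    """
    path = data_dir / resource["path"]
    if not path.is_file():
        return [f"{resource['path']}: file is missing"]

    data = path.read_bytes()
    problems = []
    if len(data) != resource["bytes"]:
        problems.append(
            f"{resource['path']}: {len(data)} bytes on disk, "
            f"{resource['bytes']} recorded"
        )
    if git_blob_sha1(data) != resource["hash"]:
        problems.append(f"{resource['path']}: content hash differs from descriptor")
    return problems
```

Because all three checks only read bytes and never parse the file, they stay cheap regardless of format, which is consistent with the sub-second timing reported above.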

What this PR doesn't cover:

Why --limit-rows 250000

flights_3m.parquet is the only resource above ~200K rows; the next largest (flights_200k_*) sit at roughly 200K, under the cap. Capping at 250K therefore leaves every other resource fully validated and cuts the CI Test job from ~4m43s to ~40s. Confirmed with @domoritz that validating a sample is OK here.
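As a rough illustration of why the cap only affects the one oversized resource: a head-of-file row limit passes smaller inputs through untouched. The `capped` helper below is hypothetical; this description does not say whether the suite truncates rows itself or delegates the limit to the validator.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def capped(rows: Iterable[dict], limit_rows: Optional[int]) -> Iterator[dict]:
    """Yield at most `limit_rows` rows from the start; None means no cap."""
    if limit_rows is None:
        return iter(rows)
    return islice(rows, limit_rows)
```

Note that a cap like this checks the first N rows rather than a random sample, so problems that appear only late in a large file would go unseen.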

Why --runslow (was --run-slow)

Matches the pytest docs example. Relative to the initial commit, the flag is renamed in tests/conftest.py, pyproject.toml, the tests/test_datapackage.py docstring, and CONTRIBUTING.md.
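For readers unfamiliar with the pattern, the pytest documentation's `--runslow` recipe looks roughly like the sketch below. It is not a copy of this repository's tests/conftest.py: the `--limit-rows` option is added here as an assumption about how such a flag could be declared, and the `slow` marker is assumed to be registered elsewhere.

```python
import pytest


def pytest_addoption(parser):
    # Opt-in switch for the slow tier, following the pytest docs recipe.
    parser.addoption(
        "--runslow", action="store_true", default=False, help="run slow tests"
    )
    # Assumed declaration of the row cap; the real option may differ.
    parser.addoption(
        "--limit-rows", type=int, default=None, help="validate at most N rows"
    )


def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        return  # slow tests run normally
    skip_slow = pytest.mark.skip(reason="need --runslow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
```

With this in place, a bare `pytest` run reports slow tests as skipped rather than silently omitting them, which matches the "219 passed, 73 skipped" shape of the fast-tier result in the test plan.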

Why we drive frictionless-py from pytest instead of using its CLI directly

Our tests import frictionless.Package and frictionless.Checklist and call .validate() per resource. The alternative would be shelling out to frictionless validate datapackage.json. We don't, because of three frictionless gaps (as of frictionless 5.19.0) and three workflow needs the CLI doesn't satisfy:

  • byte-count returns None for tabular JSON / arrow / parquet — roughly half our resources would silently skip byte verification.
  • hash-count only supports md5 and sha256; our descriptor uses sha1 (git-blob compatible).
  • Checklist.skip_errors is silently ignored on the parallel code path, so we pass parallel=False.
  • We need a per-resource strict xfail allowlist for movies (intentional pedagogical quirks) and flights_200k_arrow (no upstream parser). If upstream ever fixes them, the tests flip XFAIL → XPASS and prompt allowlist removal.
  • We want a fast/slow tier split — bare pytest should stay sub-second for the inner loop.
  • Our descriptor uses bare filenames (path: "cars.json") instead of the v2-required descriptor-relative paths; CLI invocation would need the same basepath workaround anyway. The non-conformance is tracked in #758 ("Fix resource.path to be spec-compliant"), and altair#3946 ("Help needed: fixing non-standard resource.path in vega-datasets") is the cross-repo coordination ask to make altair handle both path formats during migration.

We use frictionless's validation engine (Checklist, Package.validate); pytest provides the harness, parametrization, fast/slow split, xfail tracking, and CI integration.
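The per-resource strict xfail allowlist mentioned above can be expressed with `pytest.param` marks. The sketch below is hypothetical: the dictionary is hard-coded for illustration (the commit message says the real allowlist lives in _data/validate_datapackage.toml), and `resource_params` is an invented helper name.

```python
import pytest

# Illustrative allowlist; names and reasons are taken from the PR description.
XFAIL_REASONS = {
    "movies": "intentional pedagogical quirks",
    "flights_200k_arrow": "no upstream parser",
}


def resource_params(names):
    """Build one pytest param per resource, marking allowlisted ones xfail."""
    params = []
    for name in names:
        reason = XFAIL_REASONS.get(name)
        marks = ()
        if reason is not None:
            # strict=True turns an unexpected pass into a failure, so a fixed
            # resource cannot linger on the allowlist unnoticed.
            marks = (pytest.mark.xfail(strict=True, reason=reason),)
        params.append(pytest.param(name, id=name, marks=marks))
    return params
```

A test would then use `@pytest.mark.parametrize("name", resource_params(all_names))`, where `all_names` is whatever list of resource names the suite reads from the descriptor.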

Test plan

  • Local: uv run pytest tests/ -v --runslow --limit-rows 250000 → 290 passed, 2 xfailed in ~32s
  • Local: uv run pytest tests/ -v (fast tier alone) → 219 passed, 73 skipped in <1s
  • CI run matches local outcomes
  • runs in under 2 minutes

Wire the pytest suite from vega#782 into the existing Test workflow so
descriptor/data drift fails CI rather than slipping through. Includes
the slow tier (--run-slow) — the fast tier alone catches bytes/sha1
drift that npm run build would re-trip anyway, while the slow tier is
the unique value-add catching schema-vs-data drift.

Step is placed after npm run build so tests validate the freshly
rebuilt datapackage.json (catches build_datapackage.py regressions,
not just committed-state drift).

Local timing: 290 passed, 2 xfailed in ~3m32s on WSL2 ARM (flights_3m
is the long pole). The two xfails are the pre-existing allowlisted
movies + flights_200k_arrow entries from _data/validate_datapackage.toml.

Closes the follow-up commitment from vega#782 review (vega#782 (comment)).

dsmedia commented May 9, 2026

@domoritz flights_3m is the slow one here. if we add --limit-rows 250000 it would cut the Test job from ~4m43s to ~40s, and since no other dataset has more than 250K rows, everything else stays fully validated. Which would you prefer?

dsmedia marked this pull request as ready for review May 9, 2026 02:10

domoritz commented

Yeah, validating a sample seems fine.

Per @domoritz on vega#784: flights_3m is the only resource above ~200K
rows, so capping at 250K keeps every other dataset fully validated
while cutting the Test job from ~4m43s to ~40s.

Renamed --run-slow → --runslow to match the dominant Python convention
(pytest docs example, ~1.3K GitHub hits vs ~500 for --run-slow).
Updated CONTRIBUTING.md to reflect that the slow tier now runs in CI
and added per-test docstrings naming each fast-tier failure mode.

dsmedia merged commit fd88f62 into vega:main May 11, 2026
6 checks passed
dsmedia deleted the ci-validate-datapackage branch May 11, 2026 00:11